The Aurora supercomputer topped the IO500 production list, breaking the 20 TiB/s barrier with only a subset of its storage and client nodes available for the run.
Some interesting facts about this feat:
| Metric | Value |
| --- | --- |
| Number of MPI tasks/processes | 63k |
| Number of compute endpoints | 24k |
| Number of storage nodes | 642 |
| Number of DAOS engines | 1284 |
| DAOS pool size | 160 PiB |
| Largest file size | 8.5 PiB |
| Total number of files | 177 billion |
| Number of files in a single directory | 33 billion |
| Data protection schemes | 16+1 erasure code, 2-way replication |
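
To put the two data protection schemes in perspective, here is a minimal sketch of the raw-capacity cost each one implies. It assumes the usual reading of "16+1" as 16 data cells plus 1 parity cell per stripe, ignores metadata and any other overhead, and makes no claim about which scheme was applied to which part of the run. The helper names are hypothetical and not part of any DAOS API.

```python
from fractions import Fraction


def usable_fraction_ec(data_cells: int, parity_cells: int) -> Fraction:
    """Fraction of raw capacity that holds user data under erasure coding."""
    return Fraction(data_cells, data_cells + parity_cells)


def usable_fraction_replication(replicas: int) -> Fraction:
    """Fraction of raw capacity that holds user data under n-way replication."""
    return Fraction(1, replicas)


# Figures taken from the table above.
STORAGE_NODES = 642
DAOS_ENGINES = 1284

ec_16p1 = usable_fraction_ec(16, 1)    # assumed meaning of "16+1"
rp_2 = usable_fraction_replication(2)  # 2-way replication
engines_per_node = DAOS_ENGINES // STORAGE_NODES

print(f"16+1 erasure code: {float(ec_16p1):.1%} of raw capacity is user data")
print(f"2-way replication: {float(rp_2):.1%} of raw capacity is user data")
print(f"DAOS engines per storage node: {engines_per_node}")
```

Under these assumptions, erasure coding keeps about 94% of raw capacity for user data while replication keeps 50%, and the table's engine and node counts work out to two engines per storage node.
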
Unlike the IO500 research list, the production list imposes strict requirements on the reproducibility and redundancy of the storage system, which must have no single point of failure.
The second entry on the production list is the DAOS deployment on the SuperMUC-NG Phase 2 system at LRZ in Germany, which demonstrated incredible efficiency: 59 score points per storage node, versus 50 for Aurora (a figure that should rise once Aurora eventually runs with more compute nodes). No other parallel file system on the production list comes anywhere close to this efficiency.
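
The efficiency metric used here is simply the IO500 score divided by the number of storage nodes. The sketch below works through that arithmetic using only the figures quoted in this post. Because the per-node values are rounded, the implied Aurora score is an approximation, not the official result, and the function names are hypothetical.

```python
def score_per_storage_node(io500_score: float, storage_nodes: int) -> float:
    """Efficiency metric: IO500 score points per storage node."""
    if storage_nodes <= 0:
        raise ValueError("storage_nodes must be positive")
    return io500_score / storage_nodes


def implied_score(points_per_node: float, storage_nodes: int) -> float:
    """Invert the metric to estimate a score from a rounded per-node figure."""
    return points_per_node * storage_nodes


# Figures quoted in this post (per-node values are rounded).
AURORA_STORAGE_NODES = 642
AURORA_POINTS_PER_NODE = 50
SUPERMUC_POINTS_PER_NODE = 59

# Approximate Aurora score implied by the rounded per-node figure.
aurora_implied = implied_score(AURORA_POINTS_PER_NODE, AURORA_STORAGE_NODES)

# How much higher the SuperMUC-NG per-node efficiency is, relative to Aurora.
relative_gain = SUPERMUC_POINTS_PER_NODE / AURORA_POINTS_PER_NODE - 1

print(f"Implied Aurora score: about {aurora_implied:,.0f}")
print(f"SuperMUC-NG per-node efficiency is {relative_gain:.0%} higher")
```

The storage node count for SuperMUC-NG Phase 2 is not given in this post, so its total score is deliberately left out of the calculation.
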
We can’t wait to see how far the Aurora DAOS deployment will go once running at full capacity!