(Image credit: AMD)

Today, the Oak Ridge Leadership Computing Facility (OLCF) announced that Crusher, a scaled-down version of the $600 million Frontier supercomputer that will be the United States’ first exascale machine, is now online and generating impressive results. Crusher’s 192 HPE Cray EX blades are crammed into 1.5 cabinets that occupy 1/100th of the floor space of the previous 4,352-square-foot Titan supercomputer, yet the new system delivers faster overall performance.

Crusher features the same architectural components as the 1.5-exaflop Frontier supercomputer, with each HPE Cray EX blade packing one 64-core AMD EPYC “Trento” 7A53 CPU and four AMD “Aldebaran” MI250X GPUs, but Frontier won’t be available to researchers until January 1, 2023.

However, researchers are already using Crusher to ready their scientific code for Frontier, and with impressive results. Highlights include a 15-fold speedup over the Nvidia- and IBM-powered Summit supercomputer with the Cholla astrophysics code, which has been rewritten for Frontier: a 3-fold gain is chalked up to hardware improvements, while a further 5-fold gain comes from software optimizations. Meanwhile, the NuCCOR nuclear physics code has seen an 8-fold speedup with the MI250X GPUs compared to the Nvidia V100 GPUs used in Summit. Additionally, the OLCF announced that the LSMS materials code, which crunches through large-scale simulations of up to 100,000 atoms, has also been successfully run on Crusher and will scale to run on the full Frontier system. The OLCF also touts an 80% improvement over unspecified previous systems with Transformer deep learning model workloads.
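The Cholla numbers only add up if the hardware and software gains multiply rather than add. The short Python sketch below checks that reading. It assumes the two factors are independent, which the OLCF figures imply but do not state outright, and the variable names are purely illustrative.

```python
# Speedup factors for Cholla as quoted in the article (relative to Summit).
hardware_speedup = 3.0   # attributed to the newer hardware
software_speedup = 5.0   # attributed to code optimizations

# Assumption: the two gains are independent, so they compose multiplicatively.
combined_speedup = hardware_speedup * software_speedup

print(f"Combined speedup: {combined_speedup:.0f}x")  # Combined speedup: 15x
```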

It isn’t surprising that Crusher’s new hardware outperforms the Titan Supercomputer — that old sprawling supercomputer came online in 2013 with 200 cabinets that housed 18,688 AMD Opteron 6274 16-core CPUs, 18,688 Nvidia Tesla K20X GPUs, and the Gemini interconnect, all of which consumed a total of 8.2 MW of power. The system was spread out over 4,352 square feet and delivered 17.6 petaFLOPS of sustained performance in Linpack and a theoretical peak of 27 petaFLOPS.

(Image credit: OLCF)

In contrast, Crusher only spans 1.5 cabinets, one with 128 nodes and the other with 64, for a total of 192 nodes that occupy 44 square feet of space. Each water-cooled node comes with a single 64-core custom Zen 3 chip, the “Trento” EPYC 7A53 processor that AMD hasn’t shared much detail about, though we do know it is an EPYC Milan derivative. The chip’s I/O die is rumored to employ Infinity Fabric 3.0 to enable a coherent memory interface with GPUs.

The Trento chip is paired with 512GB of DDR4 memory (205 GB/s) and four AMD MI250X accelerators, each of which comes armed with two ~790mm^2 Graphics Compute Dies (GCDs) that wield the CDNA2 architecture and communicate across a 200 GB/s bus. In effect, these four 550W GPUs serve as the equivalent of eight GPUs in each node.

Each Trento CPU is carved up into four NUMA domains. Each domain (and its affiliated two banks of L3 cache) connects to two GCDs (one GPU) with a coherent memory interface at 36+36 GB/s over the Infinity Fabric, yielding 288 GB/s of total CPU-to-GPU bandwidth spread among the eight GCDs in the node.

Meanwhile, each MI250X GPU houses an HPE Slingshot 200 Gbps (25 GB/s) Ethernet NIC (via a PCIe root complex) that connects to the HPE Slingshot network, for 100 GB/s of network bandwidth per node. All of this compute horsepower is connected to a 250 PB storage appliance that offers a peak of 2.5 TB/s of throughput and uses the IBM Spectrum Scale filesystem.
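The per-node bandwidth figures quoted above follow from simple arithmetic, sketched here in Python. Two assumptions are baked in: "36+36 GB/s" is read as 36 GB/s in each direction per CPU-to-GCD link, and the NIC rate is treated as 200 gigabits per second, which is what the 25 GB/s figure implies. The constant and function names are illustrative.

```python
# Per-node figures as quoted in the article.
GCDS_PER_NODE = 8              # four MI250X GPUs, two GCDs each
LINK_GB_PER_S_ONE_WAY = 36     # assumed: 36 GB/s each way per CPU-to-GCD link
NICS_PER_NODE = 4              # one Slingshot NIC per MI250X
NIC_GBITS_PER_S = 200          # assumed: gigabits, not gigabytes, per second


def gbits_to_gbytes(gbits: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbits / 8


# Aggregate CPU-to-GPU bandwidth in one direction across all eight GCDs.
cpu_gpu_gb_per_s = GCDS_PER_NODE * LINK_GB_PER_S_ONE_WAY

# Network bandwidth per NIC and per node.
nic_gb_per_s = gbits_to_gbytes(NIC_GBITS_PER_S)
node_network_gb_per_s = NICS_PER_NODE * nic_gb_per_s

print(cpu_gpu_gb_per_s)       # 288
print(nic_gb_per_s)           # 25.0
print(node_network_gb_per_s)  # 100.0
```

Under this reading, the 288 GB/s figure is a one-direction aggregate; counting both directions would double it.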

(Image credit: AMD)

The OLCF hasn’t yet released power consumption figures, or peak performance in Linpack, for the Crusher system. However, we know that each of the 768 MI250X GPUs delivers a peak of 53 TFLOPS of double-precision compute, meaning a theoretical peak of roughly 40 petaFLOPS (assuming linear scaling).
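That estimate can be reproduced in a few lines of Python. This is a back-of-the-envelope sketch, not an official figure: it counts GPU throughput only, ignores the Trento CPUs, and assumes perfect linear scaling across all nodes. The comparison uses Titan's 27 petaFLOPS theoretical peak quoted earlier.

```python
# Figures quoted in the article.
NODES = 192
GPUS_PER_NODE = 4
PEAK_TFLOPS_PER_GPU = 53.0   # double precision, per MI250X
TITAN_PEAK_PFLOPS = 27.0     # Titan's theoretical peak

total_gpus = NODES * GPUS_PER_NODE

# Assumption: linear scaling, GPU contribution only; 1 PFLOPS = 1,000 TFLOPS.
crusher_peak_pflops = total_gpus * PEAK_TFLOPS_PER_GPU / 1000

ratio_vs_titan = crusher_peak_pflops / TITAN_PEAK_PFLOPS

print(total_gpus)                      # 768
print(f"{crusher_peak_pflops:.1f}")    # 40.7
print(f"{ratio_vs_titan:.2f}")         # 1.51
```

On these assumptions, Crusher's theoretical peak is roughly 1.5 times Titan's, in about 1/100th of the floor space.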
