Accelerators tag team in Top500 supercomputer CPU smackdown
Titan cuts down Sequoia with Tesla K20X GPUs
SC12 This fall's Top500 supercomputer sites ranking continues the smackdown between massively parallel supercomputers based on CPUs and somewhat less-massive machines using a mix of CPUs and accelerators. In this round, the hybrid ceepie-geepies, which mix CPUs and GPU coprocessors, have the CPU-only boxes on the ropes – and the ceepie-xeephies, which mix CPUs and Intel's Xeon Phi x86 coprocessors, are hitting them with the metal chair.
As many had been expecting, the "Titan" ceepie-geepie that was fired up at the end of October at the Oak Ridge National Laboratory, one of the US Department of Energy's big supercomputer centers, is the top machine on the list. And, as it turns out, it is more powerful than many (including El Reg) had been anticipating, because the "Kepler" Tesla K20 GPU coprocessors from Nvidia pack a bit more number-crunching oomph than expected.
Technically speaking, the Titan supercomputer is a Cray XK7, which marries sixteen-core Opteron 6200 processors one-for-one with the top-end Tesla K20X coprocessor – see our separate story for more details on the K20 and K20X GPU coprocessors, which are launched today at the SC12 supercomputing conference in Salt Lake City along with the Top 500 rankings.
The Titan machine has Nvidia's hottest new GPU, the Tesla K20X, but it is not paired up with AMD's just-announced sixteen-core Opteron 6300 processors. The machine does not use that snazzy new "Aries" interconnect launched last week, either, which is at the heart of Cray's XC30 next-generation super. It doesn't matter much, however, because even with CPUs and an interconnect that are back one generation, Titan has 18,688 processors and 18,688 GPUs for a total of 560,640 cores, and has a peak theoretical performance of 27.1 petaflops.
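For anyone wondering how 18,688 sixteen-core Opterons add up to 560,640 cores, here is a back-of-the-envelope sketch in Python that works backwards from the figures quoted above. The function name `split_cores` is made up for illustration. The one inference it draws, that each GPU is being counted as 14 "cores", is arithmetic rather than anything stated in the list; it does happen to match the 14 SMX units in a K20X.

```python
# Figures quoted in the article for Titan.
NODES = 18_688           # one Opteron 6200 plus one Tesla K20X per node
CORES_PER_OPTERON = 16
TOTAL_CORES = 560_640    # total core count as reported

def split_cores(nodes, cores_per_cpu, total_cores):
    """Split a reported core total into CPU cores, GPU 'cores', and GPU units per node."""
    cpu_cores = nodes * cores_per_cpu
    gpu_cores = total_cores - cpu_cores
    per_gpu, remainder = divmod(gpu_cores, nodes)
    if remainder:
        raise ValueError("GPU core count does not divide evenly across nodes")
    return cpu_cores, gpu_cores, per_gpu

cpu_cores, gpu_cores, per_gpu = split_cores(NODES, CORES_PER_OPTERON, TOTAL_CORES)
print(f"CPU cores: {cpu_cores:,}")
print(f"GPU units counted as cores: {gpu_cores:,} ({per_gpu} per GPU)")
```

In other words, a little under half of Titan's headline core count appears to come from the GPU side of the house.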
All that Oak Ridge would say last week was that the Titan behemoth would have more than 20 petaflops of aggregate peak floating-point performance. On the Linpack Fortran benchmark test that is used to rank the Top500 supers, the Titan machine is still not particularly efficient, with a 64.88 per cent computational efficiency delivering 17.58 petaflops of aggregate performance. But that was enough to beat out the former number one machine, the "Sequoia" BlueGene/Q machine built by IBM for Lawrence Livermore National Laboratory, another DOE lab in the States.
The Titan supercomputer at Oak Ridge
The Sequoia machine is based on IBM's 18-core PowerPC A2 processor, and has exactly the same performance as it had on the June 2012 Top500 list, at just over 20 petaflops peak and 16.32 petaflops sustained on the Linpack test. The significant thing is that the BlueGene/Q system, which does not use coprocessors, runs at 81.1 per cent efficiency.
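Computational efficiency here is just sustained Linpack performance divided by theoretical peak. A minimal Python sketch of that sum follows, using the rounded figures quoted in this story. Sequoia's peak of 20.13 petaflops is an assumption standing in for "just over 20 petaflops", and the function name is illustrative only.

```python
def linpack_efficiency(rmax_pflops, rpeak_pflops):
    """Sustained Linpack performance as a percentage of theoretical peak."""
    if rpeak_pflops <= 0:
        raise ValueError("peak performance must be positive")
    return 100.0 * rmax_pflops / rpeak_pflops

# (sustained, peak) in petaflops. Titan's figures are quoted in the article.
# Sequoia's 20.13 peak is an ASSUMED value for "just over 20 petaflops".
machines = {
    "Titan": (17.58, 27.1),
    "Sequoia": (16.32, 20.13),
}

for name, (rmax, rpeak) in machines.items():
    print(f"{name}: {linpack_efficiency(rmax, rpeak):.1f} per cent")
```

Because the inputs are rounded, the results can differ from the list's own percentages in the second decimal place, but the gap of roughly sixteen points between the hybrid and the CPU-only machine is plain either way.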
The thing that Nvidia and Cray will be focusing on is thermal efficiency, which for exascale systems might be just as important as computational efficiency. The Sequoia machine and its BlueGene/Q siblings on the list have ruled the energy efficiency roost, delivering 2,069 megaflops per watt sustained on the Linpack test, on a system that burns 7.89 megawatts. However, Titan is the new energy efficiency leader as well as performance leader, since it burns 8.21 megawatts and delivers 2,143 megaflops per watt.
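The megaflops-per-watt figures can be sanity-checked the same way: sustained flops divided by power draw. The sketch below is Python again, with an illustrative function name, and uses the rounded petaflops and megawatt numbers quoted in this story rather than the list's unrounded ones.

```python
def megaflops_per_watt(sustained_pflops, power_megawatts):
    """Energy efficiency in megaflops per watt from petaflops and megawatts."""
    if power_megawatts <= 0:
        raise ValueError("power draw must be positive")
    flops = sustained_pflops * 1e15   # petaflops -> flops
    watts = power_megawatts * 1e6     # megawatts -> watts
    return flops / watts / 1e6        # flops per watt -> megaflops per watt

# (sustained petaflops, megawatts) as quoted in the article.
machines = {
    "Sequoia": (16.32, 7.89),
    "Titan": (17.58, 8.21),
}

for name, (pflops, megawatts) in machines.items():
    print(f"{name}: {megaflops_per_watt(pflops, megawatts):,.0f} megaflops per watt")
```

With rounded inputs, Titan lands a couple of megaflops per watt shy of the 2,143 on the list, which is what you would expect from rounding; Sequoia comes out within a whisker of its 2,069.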
In third place after Sequoia is the K machine built using Sparc64 processors by Fujitsu for the Japanese government. It has had exactly the same configuration for a year: 705,024 cores lashed together with the "Tofu" 5D torus interconnect, providing 10.51 petaflops sustained. The "Mira" baby BlueGene/Q box, which IBM installed at Argonne National Laboratory as a companion to Sequoia among DOE machines, shifts down by one spot to number four, and remains the same at 8.16 petaflops.
The JuQueen BlueGene/Q machine at Forschungszentrum Juelich (FZJ) in Germany moves up three spots; this machine has been expanded, and now delivers 4.14 petaflops of sustained performance on the Linpack benchmark. The SuperMUC machine at Leibniz Rechenzentrum (LRZ), a water-cooled iDataPlex cluster based on Intel's Xeon E5 processors, slips two spots with its 2.9 petaflops Linpack rating.
The "Stampede" machine at the University of Texas, built by Dell and Intel using a mix of Xeon E5 processors and the Xeon Phi multicore x86 coprocessors – which debut Monday after many years of development, and as the main competitor to Nvidia's Tesla GPU coprocessors – is the new big, bad box on the Top 500 list.
Stampede is based on quasi-custom C8220 hyperscale servers. It has a total of 204,900 cores, with 112,500 of them tucked inside the Xeon Phi coprocessors, and delivers 3.95 petaflops peak aggregate performance and 2.66 petaflops sustained on the Linpack test. That's a computational efficiency of 67.2 per cent, which, as with other coprocessor-based machines, is not all that great compared to massively parallel CPU machines.
There's no power rating on the Stampede machine in the Top500 list, but tutorial documentation available online says that a 10-petaflops variant of Stampede due in 2013 will burn around 6 megawatts in its over 100,000 Xeon cores and nearly 500,000 Xeon Phi cores, and deliver around 1,667 megaflops per watt peak and who knows what sustained on Linpack.
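The peak figure for the bigger Stampede is a simple unit conversion, sketched here in Python with a made-up function name. Note that this is peak, not sustained, so it is not directly comparable with the Linpack-based megaflops-per-watt numbers on the list.

```python
def peak_megaflops_per_watt(peak_pflops, power_megawatts):
    """Peak megaflops per watt; 1 petaflops per megawatt equals 1,000 megaflops per watt."""
    if power_megawatts <= 0:
        raise ValueError("power draw must be positive")
    return peak_pflops / power_megawatts * 1000.0

# Projected 2013 Stampede figures as quoted in the article: 10 petaflops, about 6 megawatts.
stampede_2013 = peak_megaflops_per_watt(10, 6)
print(f"{stampede_2013:,.0f} megaflops per watt (peak)")
```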
Clusters dominate the Top 500 supercomputers these days
Number seven on the list is the "Tianhe-1A" ceepie-geepie that topped the list two years ago and marked the rise of coprocessing in parallel supercomputing.
Tianhe-1A is one of several petaflops-class machines under construction in China, and is located at the National Supercomputer Center in Tianjin. This particular machine paired Intel's Xeon 5600 processors with Nvidia's Tesla M2050 fanless GPUs to deliver 4.7 petaflops of peak performance and 2.57 petaflops of sustained performance on the Linpack test. This machine burns just over 4 megawatts and delivers 635 megaflops per watt, which is simply not efficient enough if you don't want to get killed by your power bill.
The eighth-ranked box on the list is the "Fermi" BlueGene/Q machine at Cineca in Italy, which has 163,840 cores and delivers 1.73 petaflops sustained performance and 2,099 megaflops per watt.
IBM has upgraded a Power 775 cluster installed at the US Defense Advanced Research Projects Agency, which moves up to number ten on the list. The Power 775 cluster is a chunk of what was supposed to be the "Blue Waters" super at the National Center for Supercomputing Applications at the University of Illinois before IBM pulled the plug on the deal in August 2011. That left Cray able to sell NCSA a variant of the XK7 ceepie-geepie that will be installed before the end of the year and will have at least 10 petaflops of raw number-crunching power.
Anyway, the Power 775 machine at DARPA has 63,360 Power7 cores and a peak performance of 1.94 petaflops, with 1.52 petaflops sustained on the Linpack test – a 77.9 per cent computational efficiency. This machine has very sophisticated packaging and networking, but it burns a staggering 3.58 megawatts and delivers only 424 megaflops per watt.
El Reg estimated back in the summer of 2011 that, at list price, it would cost $1.5bn to build a 10 petaflops Power 775 cluster, which is very expensive. The additional cost of electricity would make it unfeasible, as well. IBM killed the project because it would cost too much money to manufacture – and honestly, it probably did itself and NCSA a favor. The Top500 list organizers say this DARPA machine is only partially assembled; it is unclear how big IBM will build it.
That said, IBM's Blue Waters project has lots of interesting technologies that can be brought to bear in future supers; that's why you do the research in the first place. The original BlueGene effort in 1999 was a gamble, and one that IBM conceded might not pay off. The question now is whether new IBM CEO Ginni Rometty wants to make the same investment in HPC that CEO Lou Gerstner – not exactly a man to throw money at things he didn't think would pan out – did just before he handed the reins to Sam Palmisano at the turn of the century.