Top 500 supers – The Dawning of the GPUs
Coiled for the 10 petaflops spring
With the International Supercomputing Conference underway this week, the Top 500 ranking of the world's most powerful supercomputers is out, and the twice-yearly list is just starting to be transformed by the advent of cheap flops embodied in graphics co-processing engines from Nvidia and Advanced Micro Devices.
While the 1.76 petaflops "Jaguar" Opteron cluster built by Cray for the Oak Ridge National Laboratory held onto its top spot on the list without any changes since last November, the "Nebulae" cluster made by Dawning for the National Supercomputing Center in Shenzhen in China is nipping at Jaguar's tail with a blade machine that marries Intel's Xeons to Nvidia's GPU co-processors.
As you will see from perusing the Top 500 list, Jaguar is a massively parallel XT5 cluster that currently uses six-core Opteron 8400 processors and Cray's SeaStar2+ 3D torus interconnect. It has 224,162 cores to deliver a peak theoretical performance of 2.33 petaflops, and it delivers 1.76 petaflops of sustained performance on the Linpack Fortran matrix math benchmark. Jaguar could be upgraded with the twelve-core XT6 Opteron blades and the new "Gemini" interconnect, which Cray debuted last week in the XE6 super, formerly code-named "Baker"; such an upgrade could easily double its performance.
Thus far, Oak Ridge has not divulged its plans, but it is monkeying around with x64 clusters and Nvidia's next-generation "Fermi" GPUs. It would be interesting to see what a next-generation "Cascades" super from Cray, using the "Aries" interconnect (a kicker to the just-announced Gemini), Intel Xeon processors (very likely "Sandy Bridge" Xeons with eight or more cores each), and Nvidia GPUs might do in terms of sustained performance. We'll have to wait a few years to see that, and it may be at Oak Ridge and it may not.
But for the moment, China's NSCS is enthusiastically adopting Dawning's TC3600 blade servers, equipped with Intel's six-core Xeon X5650 processors and Nvidia's Tesla C2050 GPUs. The exact configuration of the Nebulae machine at NSCS was not available at press time, but the TC3600 blade server is a 10U chassis that holds ten two-socket blades. The C2050s are PCI-Express GPU co-processors with 448 cores and 3 GB of their own GDDR5 memory, rated at 515 gigaflops doing double-precision floating point math and 1.03 teraflops doing single-precision. The Top 500 ranking for Nebulae does not provide a blade or GPU count, but the word on the street is that it has 4,700 nodes. What the Top 500 does say is that the machine has 120,640 cores in total for a peak theoretical performance of 2.98 petaflops and 1.27 petaflops sustained running the Linpack test. All of the nodes in the Dawning blade cluster are linked by quad data rate (40 Gb/sec) InfiniBand switches.
The first thing to notice about the Jaguar and Nebulae supers is the difference between peak and sustained performance. For the Cray Jaguar Opteron cluster, 75.5 per cent of the flops contained in the box end up doing real Linpack work, while on the Dawning Xeon-Tesla hybrid, only 42.6 per cent of the peak performance embodied in the CPUs and GPUs actually pushes Linpack math. So it would seem that the all-x64 machine has the edge, right? Wrong. Jaguar cost around $200m to build and burns around 7 megawatts of juice, while the Nebulae machine probably costs on the order of $50m (that's an El Reg estimate) and burns only 2.55 megawatts.
When you do the math, as far as Linpack is concerned, Jaguar takes just under 4 watts to deliver a gigaflops at a cost of $114 per gigaflops for the iron, while Nebulae consumes 2 watts per gigaflops at a cost of $39 per gigaflops for the system. And there is little doubt that the CUDA parallel computing environment is only going to get better over time, so more of the theoretical performance of the GPU will end up doing real work. (Nvidia is not there yet. There is still too much overhead on the CPUs as they get hammered fielding memory requests for GPUs on some workloads.)
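For those who want to check the arithmetic, here is a quick back-of-the-envelope sketch (the function name and rounding are ours; note that the quoted power and cost figures work out per gigaflops of sustained Linpack, and the $50m Nebulae price tag is, again, an El Reg estimate):

```python
def stats(sustained_tflops, peak_tflops, megawatts, cost_usd_millions):
    """Return (Linpack efficiency %, watts per gigaflops, dollars per gigaflops)."""
    gflops = sustained_tflops * 1_000  # sustained gigaflops
    efficiency = 100 * sustained_tflops / peak_tflops
    watts_per_gflops = megawatts * 1_000_000 / gflops
    dollars_per_gflops = cost_usd_millions * 1_000_000 / gflops
    return efficiency, watts_per_gflops, dollars_per_gflops

jaguar = stats(1760, 2330, 7.0, 200)    # ~ (75.5, 3.98, 113.6)
nebulae = stats(1270, 2980, 2.55, 50)   # ~ (42.6, 2.01, 39.4)
```

On both power and price per unit of delivered Linpack performance, the hybrid machine wins by roughly a factor of two to three, which is the whole argument for GPU co-processing in a nutshell.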
The power efficiency from using math co-processors is, of course, why Los Alamos National Laboratory had IBM build the "Roadrunner" hybrid Opteron-Cell massively parallel super, which marries blades using dual-core Opterons with blades using IBM's PowerXCell 8i co-processors to create what is now a one petaflops sustained super. (A year and a half ago, Roadrunner had a few more nodes, was rated at 1.1 petaflops, and it was the fastest super in the world, but that was before most of the machine was taken out of public view and began its classified nuclear simulations).
Number four on the Top 500 is another Cray Opteron cluster, called "Kraken", which sits at the Department of Energy's Oak Ridge lab but is owned by the University of Tennessee. Kraken is an XT5 parallel box with 98,928 cores, built from AMD's six-core Opteron 8400s; it weighs in at just over 1 petaflops of peak performance and is rated at 831.7 teraflops on the Linpack test. Number five on the list is the "Jugene" BlueGene/P cluster built by IBM for the Forschungszentrum Juelich in Germany, which has 294,912 PowerPC cores and is rated at 825.5 teraflops sustained, with a peak theoretical performance of just over a petaflops, too.
Two other petaflops-class machines are on the list. At number seven is the "Tianhe-1" supercomputer, built by the Chinese government, which entered the Top 500 last fall. The box has not changed at all in the past six months. It comprises Xeon server nodes using a mix of E5540 and E5450 processors, with each node configured with two of AMD's Radeon HD 4870 graphics cards to be used as co-processors. The machine has 71,680 cores, and it's rated at 563.1 teraflops of sustained performance and 1.2 petaflops of peak theoretical performance. Again, there's that wide gap between peak and sustained performance with CPU-GPU combos, a gap that has to close. Number six on the list this time around is the "Pleiades" Altix ICE cluster at NASA Ames, which has lower peak performance at 973.3 teraflops, but bests Tianhe-1 (short for "River in the Sky" in Chinese) on the Linpack test, with 772.7 sustained teraflops of performance.
Rounding out the top ten: at number eight is IBM's BlueGene/L super at Lawrence Livermore National Laboratory, which ruled the roost for a number of years, with its 478.2 teraflops of sustained performance, followed by the "Intrepid" BlueGene/P box at Argonne National Laboratory, rated at 458.6 teraflops. Number ten, the "Red Sky" 433.5 teraflops blade super at Sandia National Laboratories, was made by Oracle (well, really Sun Microsystems, back when it cared about supercomputing). The "Ranger" Sun blade super at the University of Texas, rated at nearly the same speed on Linpack (but with 62,976 cores and a slower interconnect), was pushed down to number eleven.
The politics of the Top 500
The Top 500 list of supercomputers is put together twice a year to pit the fastest 500 supercomputers in the world against each other regardless of processor, interconnect technology, operating system, or whatever. Erich Strohmaier and Horst Simon, computer scientists at Lawrence Berkeley National Laboratory, Jack Dongarra of the University of Tennessee, and Hans Meuer of the University of Mannheim compile the list, which is based on the Linpack Fortran benchmark test created by Dongarra and colleagues Jim Bunch, Cleve Moler, and Pete Stewart back in the 1970s to gauge the relative number-crunching performance of computers. The first official Top 500 list came out in 1993, and this June 2010 compilation is the 35th edition.
The Top 500 is not a particularly good gauge of what is going on in the totality of the HPC market, but it most definitely is a good indicator of technology shifts among the cutting-edge buyers of supercomputers that will in many cases trickle their way into the mainstream of HPC computing.
Supercomputing is often as much about politics as it is about actually doing simulations, and it is hard to miss how fast China is becoming a force in the petascale era. Which stands to reason, given the strength of the Chinese economy and its desire to excel in the sciences and bend science to industry, just like its peers in the more established economies have done for decades. There are now 24 Chinese systems on the Top 500 list, and the two mentioned above - Nebulae and Tianhe-1 - are not only Top 10 machines, they have enough Linpack performance to catapult China ahead of every country except the United States in terms of installed sustained performance. China has as many supers on the list as Germany, and it now ranks second in terms of aggregate computing power, with 9.2 per cent of the 32.4 petaflops of total math power accounted for on the list.
The United States is still the biggest investor in Top 500-class machines, with 282 of the 500 machines (56.4 per cent) and just under 18 petaflops of floating point oomph in those boxes (55.4 per cent of the total math power on the list). The United Kingdom has 38 machines with a total computing power of 1.7 petaflops, giving it a little more than half the share of flops as China has. France has 29 machines with a total of 1.76 petaflops (5.4 per cent of the total power), while Germany's 24 boxes have 2.25 petaflops (6.9 per cent).
Japan, once a high flyer in the HPC realm, has backed off from its formerly aggressive stance, mainly because it does not have the billions of dollars or the political will to sustain a supercomputing program that can compete with the US and now China. There are 18 Japanese supers on the June 2010 Top 500 list, which have a total of 1.25 petaflops of aggregate performance.
In terms of architecture, 74 of the machines on the June 2010 list are massively parallel boxes with some kind of sophisticated interconnect, while two are constellation configurations and 424 are more generic clusters using InfiniBand or Ethernet. There are 242 machines that use plain old Gigabit Ethernet, and only two using 10 Gigabit Ethernet. There are 205 machines that use one speed or another of InfiniBand, with the rest being a mix of custom and proprietary interconnects such as IBM's Federation, Cray's SeaStar, and SGI's NUMAlink. There's only one vector machine still on the list: the 122.4 teraflops parallel NEC SX-9 super known as the Earth Simulator, which was at the top of the list a decade ago (it is ranked number 37 today). All of the remaining machines use scalar processors, although more and more of them are being augmented with co-processors.
As is typical on the Top 500 list, old gear doesn't stay afloat for long. Of the 500 boxes on the list, 62 were installed in 2008, 229 in 2009, and 183 in 2010. There are only 26 machines older than that on the list. But the turnover on the list has slowed, thanks to the economic slowdown and despite plenty of stimulus money being shelled out by governments in the United States, Europe, and China.
To get on the list this time around, your machine had to demonstrate at least 52.8 teraflops of punch on the Linpack test, up from 47.7 teraflops only six months ago. The aggregate computing power on the list continues to swell, too, to 32.4 petaflops, up from 27.6 petaflops six months ago and 22.6 petaflops in the June 2009 list.
Intel's processors continue to dominate the list, with 408 machines using either Xeon (403 systems) or Itanium (five systems) processors. AMD's Opterons are used in 47 machines and IBM's various Power chips in 42 machines, with the remainder being two Sparc boxes and that one NEC Earth Simulator behemoth. Among the Intel-based supers, there are still 182 machines using the old "Harpertown" quad-core Xeon 5400s in their L, E, and X variants, as well as a bunch of older Xeon 5100 and 5300 processors in 30 other machines. None of these machines can last for very long on the list, given the energy efficiency of new servers based on six-core or twelve-core x64 processors.
The "Nehalem-EP" quad-core Xeon 5500 processors are in 184 machines, and there are already seven boxes using the new "Westmere-EP" six-core Xeon 5600s. There are also two boxes on the list using the high-end, eight-core "Nehalem-EX" Xeon 7500 processors, but they are relatively tiny. There are 31 boxes using quad-core Opterons, five using six-core Opterons, and five with the twelve-core Opterons. There are ten boxes using IBM's PowerPC chips in BlueGene machines and another 18 using Power6 or Power6+ chips. There's a smattering of Power5, Sparc, and Itanium in there, too.
By manufacturer, IBM is once again at the top of the list in terms of system count and aggregate flops installed. IBM has 198 machines on the list (39.6 per cent of the total) and the IBM label is associated with 10.9 petaflops of performance (33.6 per cent of the total). Hewlett-Packard, which hasn't had a Top 10 system in a long time, still sells lots of clusters of modest size, and has 185 machines on the list (37 per cent of machines) for a total of 6.62 petaflops (20.4 per cent of the flops pie).
Cray has 21 systems on the list, with a total of 4.8 petaflops (4.2 per cent of machines, but 14.8 per cent of capacity), with Silicon Graphics having 17 boxes (3.4 per cent of machines and 6.6 per cent of capacity). Sun (now Oracle) has a dozen machines on the list, but it is hard to imagine that Oracle will be interested in pursuing HPC for the sake of being on the Top 500 list. If Oracle has plans for HPC beyond data analytics, it sure hasn't communicated this to the IT community. ®