Top 500 supers: China rides GPUs to world domination
The People's Republic of Petaflops
SC10 If the June edition of the bi-annual ranking of the Top 500 supercomputers in the world represented the dawning of the GPU co-processor as a key component in high performance computing, then the November list is breakfast time. The super centers of the world are smacking their lips for some flop-jacks with OpenCL syrup and some x64 bacon on the side.
China has the most voracious appetite for GPU co-processors, and as expected two weeks ago when the Tianhe-1A super was booted up for the first time, this hybrid CPU-GPU machine installed at the National Supercomputer Center in Tianjin has taken the top spot on the Top 500 list with a comfortable margin. Tianhe-1A's final rating on the Linpack Fortran matrix math benchmark test is 4.7 petaflops of peak theoretical performance spread across its CPUs and GPUs (with about 70 per cent of that coming from the GPUs) and 2.56 petaflops of sustained performance on the Linpack test.
The Tianhe-1A machine comprises 7,168 servers, each equipped with two sockets using Intel's X5670 processors running at 2.93 GHz and one Nvidia Tesla M2050 fanless GPU co-processor. The resulting machine spans 112 racks, and it would make a hell of a box on which to play Crysis.
While roughly 45 per cent of the floating-point oomph in Tianhe-1A disappears into the void where all missed clock cycles go (it's also where missing socks from the dryer cavort), the GPU's flops are relatively inexpensive and the overall machine should offer excellent bang for the buck - provided workloads can scale across the ceepie-geepie, of course. The Tianhe-1A super uses a proprietary interconnect called Arch, which was developed by the Chinese government. The Arch switch links the server nodes together using optical-electric cables in a hybrid fat tree configuration and has a bi-directional bandwidth of 160 Gb/sec, a latency for a node hop of 1.57 microseconds, and an aggregate bandwidth of more than 61 Tb/sec.
The Tianhe-1A CPU-GPU hybrid super
This is not the first ceepie-geepie machine that the National Supercomputer Center has put together. A year ago, the Tianhe-1 machine broke onto the Top 500 list using Intel Xeon chips and Advanced Micro Devices Radeon HD 4870 GPUs (no Tesla GPUs, but actual graphics cards). This initial "Milky Way" box (that's what "Tianhe" translates to in English) had 71,680 cores, a peak theoretical performance of 1.2 petaflops, and a sustained performance of 563.1 teraflops. The efficiency of this cluster was 47 per cent, sustained over peak performance.
The "Jaguar" XT5 system at the US Department of Energy's Oak Ridge National Laboratory was knocked out of the top spot by Tianhe-1A, which is what happens when a cat stands still in the GPU era of HPC. The Jaguar machine has 224,162 Opteron cores spinning at 2.6 GHz and delivers 1.76 petaflops of performance on the Linpack test. This Cray machine links Opteron blade servers using its SeaStar2+ interconnect, which has been superseded by the new "Gemini" XE interconnect in the XE6 supers that started rolling out this summer.
If Oak Ridge had moved to twelve-core Opteron 6100 processors and the XE6 interconnect, it could have doubled the performance of Jaguar and held onto the Top 500 heavyweight title. One other thing to note: The Jaguar machine is 75.5 per cent efficient on the Linpack benchmark, a lot better than the Tianhe-1A ceepie-geepie.
The "Nebulae" ceepie-geepie built from six-core Intel Xeon 5650 processors and Nvidia M2050 GPUs that made its debut on the June 2010 Top 500 list got knocked down from number 2 to number 3 on the list. The Nebulae machine, which is a blade server design from Chinese server maker Dawning, is installed at the National Supercomputing Center in Shenzhen. It is rated at 1.27 sustained petaflops at 43 per cent efficiency against peak theoretical performance.
Number four on the list is also a ceepie-geepie: the upgraded Tsubame 2 machine at the Tokyo Institute of Technology. (That's shortened to TiTech rather than TIT, which would be where you'd expect a machine called Milky Way to be located. But we digress.) The Tsubame 2 machine is built from Hewlett-Packard's SL390s G7 cookie sheet servers, which made their debut in early October. TiTech announced the Tsubame 2 deal back in May, and this machine includes over 1,400 of these HP servers, each with three M2050 GPUs from Nvidia.
The Tsubame 2 machine has 73,278 cores, is rated at 2.29 peak petaflops, and delivered 1.19 petaflops of sustained performance on the Linpack test. That's a 52 per cent efficiency, about what the other ceepie-geepies are getting. By the way, the prior Tsubame 1 machine was based on x64 servers from Sun Microsystems, with floating point accelerators from Clearspeed in only some of the nodes. And one more thing: Tsubame 2 runs both Linux and Windows, and according to the Top 500 rankers, both operating systems offer nearly equivalent performance.
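For anyone who wants to check the efficiency sums at home, the arithmetic is nothing fancier than sustained Linpack performance divided by peak theoretical performance. Here is a minimal Python sketch that uses only the rounded petaflops figures quoted above. The function name and the dictionary layout are our own invention for illustration, not anything the Top 500 rankers publish.

```python
# Linpack efficiency = sustained performance / peak theoretical performance.
# Inputs are the rounded petaflops figures quoted in this article, so the
# results are only good to about a percentage point.

def linpack_efficiency(sustained_pf: float, peak_pf: float) -> float:
    """Return sustained performance as a percentage of peak."""
    if peak_pf <= 0:
        raise ValueError("peak performance must be positive")
    return 100.0 * sustained_pf / peak_pf

# name -> (sustained petaflops, peak petaflops)
machines = {
    "Tianhe-1A": (2.56, 4.7),
    "Tianhe-1": (0.5631, 1.2),
    "Tsubame 2": (1.19, 2.29),
}

efficiencies = {
    name: linpack_efficiency(sustained, peak)
    for name, (sustained, peak) in machines.items()
}

for name, eff in efficiencies.items():
    print(f"{name}: {eff:.1f} per cent of peak")
```

On these rounded inputs, all three hybrids land somewhere between the high 40s and the mid 50s, well shy of what the CPU-only Jaguar manages.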
In the Hopper
The fifth most-powerful super in the world based on the Linpack tests (at least the ones we know about) is a brand new box called Hopper. Installed at the US DOE's National Energy Research Scientific Computing center, Hopper is a Cray XE6 super using that new Gemini interconnect and twelve-core Opteron 6100 processors - no fancy schmancy GPU co-processors. (Well, at least not yet, anyway.) Hopper has 153,408 cores spinning at 2.1 GHz and delivers 1.05 petaflops of sustained performance with an efficiency of 82 per cent.
If it is not yet obvious, there is a bottleneck in getting parallel supercomputer nodes to talk through their networking stacks running on their x64 processors and out over the PCI-Express 2.0 bus. If Nvidia or AMD want to do something useful, embedding a baby x64 processor inside of a GPU co-processor along with a switchable 10 Gigabit Ethernet or 40 Gb/sec InfiniBand port would make a very interesting baby server node. Throw in cache coherence between the x64 and GPU processors and maybe getting to 50 petaflops won't seem like such a big deal.
The Bull Tera-100 super at the Commissariat a l'Energie Atomique in France is based on Intel's Xeon 7500 high-end processors and Bull's bullx supercomputer blades and ranks sixth in the world. The machine uses QDR InfiniBand to lash the nodes together, and is rated at 1.05 petaflops. This machine does not have GPUs in it from either AMD or Nvidia, and neither does number eight, the Kraken XT5 super from Cray that is owned by the University of Tennessee and which is operated by DOE's Oak Ridge National Laboratory. Kraken delivers 831.7 teraflops of sustained Linpack performance, unchanged from when it came onto the list a year ago.
Number seven on the list, the Roadrunner Opteron blade system at Los Alamos National Laboratory (another DOE site), does use accelerators, but they are IBM's now defunct Cell co-processors, which are based on IBM's Power cores and which have eight vector math units per chip. The Roadrunner machine demonstrated the viability of co-processors to push up into the petaflops. But Roadrunner is stalled at 1.04 petaflops, is probably not going to be upgraded, and is therefore uninteresting even if it will do lots of good work for the DOE. (If you consider designing nuclear weapons good work, of course.)
Number nine on the list is the BlueGene/P super, named Jugene, built by IBM for the Forschungszentrum Juelich in Germany, which debuted at number three at 825.5 teraflops on the June 2009 list and hasn't changed since then. Rounding out the top ten on the Top 500 list is the Cielo Cray XE6 at Los Alamos, a new box that is rated at 816.6 teraflops of sustained Linpack performance.
GPU is my co-pilot
On the November 2010 list, there are 28 HPC systems that use GPU accelerators, and the researchers who put together the Top 500 for the 36th time - Erich Strohmaier and Horst Simon, computer scientists at Lawrence Berkeley National Laboratory, Jack Dongarra of the University of Tennessee, and Hans Meuer of the University of Mannheim - consider IBM's Cell chip a GPU co-processor. On this list, there are sixteen machines that use Cell chips to goose their floating point oomph, with ten using Nvidia GPUs and two using AMD Radeon graphics cards.
The Linpack Fortran matrix benchmark was created by Dongarra and colleagues Jim Bunch, Cleve Moler, and Pete Stewart back in the 1970s to gauge the relative number-crunching performance of computers and is the touchstone for ranking supercomputers.
There are three questions that will be on the minds of people at the SC10 supercomputing conference in New Orleans this week. The first is: Can the efficiency of ceepie-geepie supers be improved? The second will be: Does it matter if it can't? And the third will be: At what point in our future will GPUs be standard components in parallel supers, just like parallel architectures now dominate supercomputing and have largely displaced vector and federated RISC machines?
To get onto the Top 500 list this time around, a machine had to come in at 31.1 teraflops, up from 24.7 teraflops only six months ago. This used to sound like a lot of math power. But these days, it really doesn't. A cluster with 120 of the current Nvidia Tesla GPUs, with only half of the flops coming through where the CUDA meets the Fortran compiler, will get you on the list. If the growth is linear, then on the June list next year you will need something like 40 teraflops, or about 150 of the current generation of GPUs. And with GPU performance on the upswing, getting a ceepie-geepie onto the Top 500 list might not require so many GPUs.
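The back-of-envelope math behind those GPU counts can be sketched in a few lines of Python. Two things here are assumptions of ours rather than figures from the Top 500 list: we take 515 gigaflops as the peak double-precision rating of a Tesla M2050 (Nvidia's quoted spec), and we read "linear growth" as the entry bar rising by the same number of teraflops over the next six months as it did over the last six. The sketch also ignores whatever flops the host CPUs chip in, so treat it as a rough bound and not a sizing tool.

```python
import math

# Assumption: Nvidia's quoted double-precision peak for a Tesla M2050.
# This figure is not stated in the article.
M2050_PEAK_GFLOPS = 515.0

# "Only half of the flops" make it through, per the article.
GPU_EFFICIENCY = 0.5

def gpus_needed(threshold_tflops: float) -> int:
    """GPUs required to sustain the given teraflops, ignoring host CPUs."""
    sustained_per_gpu_tflops = M2050_PEAK_GFLOPS * GPU_EFFICIENCY / 1000.0
    return math.ceil(threshold_tflops / sustained_per_gpu_tflops)

def linear_projection(previous: float, current: float) -> float:
    """Next value if the last absolute increase simply repeats."""
    return current + (current - previous)

previous_entry = 24.7   # teraflops, June 2010 entry point
current_entry = 31.1    # teraflops, November 2010 entry point
projected_entry = linear_projection(previous_entry, current_entry)

print(f"GPUs to make the list today: {gpus_needed(current_entry)}")
print(f"Projected entry point: {projected_entry:.1f} teraflops")
print(f"GPUs to make the projected list: {gpus_needed(projected_entry)}")
```

Under those assumptions the sketch lands within a GPU or two of the 120 quoted above for today's list, and a straight-line projection of the entry bar comes in a little under the 40 teraflops and 150 GPUs guessed at for June.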
As has been the case for many years, processors from Intel absolutely dominate the current Top 500 list, with 398 machines (79.6 per cent of the boxes on the list). Of these, 56 machines are using the Xeon 5600 processors, one is still based on 32-bit Xeons, one is based on Core desktop chips, five are based on Itanium processors, and three are based on the new high-end Xeon 7500s.
In the November 2010 rankings, there are 57 machines using AMD's Opteron processors, while there are 40 machines using one or another variant of IBM's Power processors. While the machine counts are low for these two families of chips, the core counts sure are not because of the monster systems that are based on Power and Opteron chips.
There are 1.41 million Power cores on the Top 500 list this time around, which is 21.5 per cent of the total 6.53 million cores inside of the 500 boxes and which represents 7.35 aggregate petaflops, or 11.2 per cent of the total 65.8 petaflops on the list. There are 1.54 million Opteron cores (23.5 per cent of cores) on the list, for 14.2 aggregate peak petaflops (21.6 per cent of total flops).
None of these core counts include the GPU core counts, which is something that the Top 500 people should reconsider, even though in all cases the flops are counted.
Across all processor architectures, there are 365 machines using quad-core processors and 19 already are using CPUs with six or more cores per socket. It is safe to say that the HPC market will eat whatever number of cores the chip makers can bake.
There are two Sparc-based supers on the current Top 500 list and the Earth Simulator super built by NEC for the Japanese government is still barely on the list (and will probably be knocked off on the next list in June 2011).
Xeon rides the wave
Having said all of that, the 391 machines using Intel's Xeon processors represent the belly of the Top 500 list. With a total of 3.5 million cores (53.5 per cent of the total core count on the list) and 43.2 petaflops of number-crunching oomph (65.8 per cent of total flops), the Xeon is the champion of the top-end HPC world. Of course, the Xeon CPUs are getting credit for flops that are being done by GPUs in many cases.
In terms of core count, there are 289 machines that have between 4,096 and 8,192 cores, and 96 machines that have from 8,192 to 16,384 cores. You need more than 1,000 cores to make the list, and there are only two boxes that have fewer than 2,048 cores and only 61 have between 2,048 and 4,096 cores. The system count drops off pretty fast above this core count, with 52 machines having more than 16,384 cores.
The Top 500 list is pretty evenly split between Ethernet, with 226 machines, and InfiniBand of various speeds, at 226 machines. The remaining machines are a smattering of Myrinet, Quadrics, Silicon Graphics NUMAlink, and Cray SeaStar and Gemini interconnects. There were seven machines on the list using 10 Gigabit Ethernet for lashing nodes in parallel supers together, and 29 used 40 Gb/sec (QDR) InfiniBand.
By operating system, Linux in its various incarnations dominates the list, with 450 out of 500 machines running it. Unix accounted for 20 machines, Windows five machines, and the remainder were running mixed operating systems. If Microsoft wanted to catch a new wave, it would work to get the best possible GPU runtime and programming tools to market. Just tweaking the MPI stack in Windows HPC Server 2008 R2 to get rough parity with Linux is not going to make a dent at the big supercomputer centers of the world. Then again, Microsoft is trying to move into the HPC arena from the technical workstation up, and it has other advantages that Linux platforms do not in this regard.
IBM has the most systems on the November 2010 Top 500 list, with 199 boxes (39.8 per cent of the total) and 17.1 petaflops (26 per cent of the total flops on the list) of aggregate peak performance on the Linpack test. Big Blue is followed up by Hewlett-Packard, with 158 machines and 11.7 petaflops, which works out to 31.6 per cent of machines and 17.8 per cent of total flops. Cray has only 29 machines on the current super ranking, which is 5.8 per cent of machines but 16.3 per cent of peak floating point power. Silicon Graphics has 22 machines on the list, which is 4.4 per cent of boxes and 4.5 per cent of aggregate flops. Dell has 20 boxes on the list and its hand in a few mixed boxes as well, and Oracle, Fujitsu, NEC, and Hitachi all have a handful of machines, too.
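The vendor shares above are simple ratios against the 500 systems and the 65.8 petaflops on the whole list, and the same numbers show why Cray punches above its machine count. The Python sketch below reworks them and adds an average teraflops-per-box figure, which is our own derived illustration and not something the Top 500 publishes. Cray's petaflops total is not quoted directly, so we back it out of its 16.3 per cent share; the helper names are hypothetical.

```python
TOTAL_SYSTEMS = 500
TOTAL_PETAFLOPS = 65.8  # list-wide total quoted earlier in the article

def shares(systems: int, petaflops: float) -> tuple[float, float]:
    """Return (per cent of systems, per cent of flops)."""
    return (
        100.0 * systems / TOTAL_SYSTEMS,
        100.0 * petaflops / TOTAL_PETAFLOPS,
    )

def teraflops_per_system(systems: int, petaflops: float) -> float:
    """Average teraflops per box for a vendor."""
    return 1000.0 * petaflops / systems

# vendor -> (systems, aggregate petaflops), as quoted in the article
vendors = {
    "IBM": (199, 17.1),
    "Hewlett-Packard": (158, 11.7),
}

# Cray's petaflops are derived from its quoted 16.3 per cent flops share.
cray_petaflops = 0.163 * TOTAL_PETAFLOPS
vendors["Cray"] = (29, cray_petaflops)

report = {
    name: (*shares(n, pf), teraflops_per_system(n, pf))
    for name, (n, pf) in vendors.items()
}

for name, (sys_pct, flop_pct, avg_tf) in report.items():
    print(f"{name}: {sys_pct:.1f}% of systems, "
          f"{flop_pct:.1f}% of flops, {avg_tf:.0f} teraflops per box")
```

On these inputs, the average Cray box works out to several times the heft of the average IBM or HP machine, which is how 29 systems end up with roughly a sixth of the flops.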
Supercomputing is inherently political (especially so given where the funding for the upper echelon of the Top 500 list comes from), and countries most certainly measure each other up in their HPC centers. The United States leads with machine count, at 275 machines with a combined 31.5 petaflops, and China has jumped well ahead of Japan to become the solid number two, with 42 machines and 12.8 petaflops in total across those machines. Japan has 26 machines that add up to 4.6 petaflops, and Germany's 26 machines have an aggregate of 3.5 petaflops. The United Kingdom is close behind with 24 machines, for a total of 2.2 petaflops, followed by Russia with 11 machines and 1.1 petaflops. ®