Monolithic supers nab power efficiency crown
But ceepee-geepees threaten green revolution
Old-school monolithic, massively parallel supercomputers based on Power and Sparc processors hold their own in flops-per-watt comparisons against the new hotness: hybrid machines that combine x64 processors and GPU coprocessors.
So say two researchers at Virginia Tech who, in the wake of the recent June rankings of the Top 500 most powerful supercomputers, have released the Green500 rankings, which rate supers not on pure number-crunching goodness, but on energy efficiency.
The Green500 and Top 500 lists have plenty of machines in common, but given that they are based on different metrics, there are lots of machines that make it onto one list that do not make it onto the other. For example, there are experimental machines in academia and research institutions that don't have enough performance to make the Top 500, and there are machines with enough aggregate oomph to make the Top 500, but guzzle electricity like a drunken sailor.
At the top of the Green500 list are two prototype BlueGene/Q machines that IBM is testing before it installs the 20 petaflops "Sequoia" supercomputer at the US Department of Energy's Lawrence Livermore National Laboratory next year. (Argonne National Laboratory has also recently inked a deal to get a 10 petaflops BlueGene/Q box.)
The number one machine, humming away at the TJ Watson Research Center in upstate New York, is based on a 16-core PowerPC processor called the PowerBCQ that's running at 1.6GHz and delivering 12.8 gigaflops of oomph per core. With 8,192 cores, this machine has a peak theoretical performance of 104.86 teraflops – it delivered 85.9 teraflops on a recent run of the Linpack Fortran benchmark test to earn its 109th position on the Top 500 rankings.
This BlueGene/Q prototype consumed 40.95 kilowatts, yielding 2,097.2 megaflops per watt. For whatever reason, the first prototype of the BlueGene/Q hardware, from last year – perhaps with a different software stack or different interconnect or both? – was also run with 8,192 of the same Power cores but was only able to hit 65.4 teraflops while consuming 38.8 kilowatts. While that efficiency was far lower, it was good enough for the first prototype to rank second on the Green500 list, at 1,684 megaflops per watt.
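For those who like to check the sums, the Green500 metric is simply sustained Linpack performance divided by power draw. Here is a minimal Python sketch of that arithmetic using the figures quoted above; the function name `mflops_per_watt` is made up for illustration, and because the inputs are the rounded numbers from this story, the results land a hair off the official list values.

```python
def mflops_per_watt(sustained_teraflops, kilowatts):
    """Energy efficiency in megaflops per watt.

    Hypothetical helper for illustration: takes sustained Linpack
    performance in teraflops and power draw in kilowatts.
    """
    if kilowatts <= 0:
        raise ValueError("power draw must be positive")
    megaflops = sustained_teraflops * 1_000_000  # 1 teraflops = 1e6 megaflops
    watts = kilowatts * 1_000                    # 1 kilowatt = 1e3 watts
    return megaflops / watts


# Rounded figures for the two BlueGene/Q prototypes, as quoted in the text.
prototypes = {
    "BlueGene/Q, Green500 number one": (85.9, 40.95),
    "BlueGene/Q, Green500 number two": (65.4, 38.8),
}

for name, (teraflops, kilowatts) in prototypes.items():
    print(f"{name}: {mflops_per_watt(teraflops, kilowatts):.1f} megaflops per watt")
```

The small gaps against the published 2,097.2 and 1,684 ratings are presumably down to the list using unrounded measurements.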
The Sequoia 20 petaflopper, which will have 1.6 million cores, should be impressively green if IBM can scale up whatever the trick was in its second BlueGene/Q prototype. If so, Sequoia will suck down 5.6 megawatts instead of the 7.4 megawatts it'd need if based on the first BlueGene/Q prototype.
Enter the hybrids
After the Power-based prototypes topping the Green500 list comes a machine based on neither server processors nor server-class GPU coprocessors. The University of Nagasaki built a hybrid cluster called DEGIMA with Intel Core i5 processors matched to ATI Radeon graphics controllers, with "server" nodes linked together with 40Gb/sec InfiniBand switches.
Like most CPU-GPU hybrids so far, DEGIMA is not terribly efficient, with just under 42.8 teraflops of oomph on the Linpack test against a peak theoretical performance of 111.2 teraflops. But its components (7,600 cores total) are relatively inexpensive and the machine only burned 240.3 kilowatts, yielding an efficiency of 1,375.9 megaflops per watt and giving it the number three ranking on the Green500 list.
The fourth most energy-efficient super that Virginia Tech's Wu-chun Feng and Kirk Cameron put on the Green500 list was another hybrid: the Tsubame 2.0 super built by Tokyo Institute of Technology, composed of HP ProLiant SL390s G7 tray servers and Nvidia Tesla M2070 GPU coprocessors.
Tsubame 2.0 has 73,278 x64 cores, and a little less than half of its peak theoretical performance of 2.29 petaflops goes up the chimney on the Linpack test. The machine consumes 1.24 megawatts, which works out to 958.4 megaflops per watt. An IBM hybrid super – the PLX Cluster at the Cineca/SCS research consortium in Italy, made up of 274 iDataPlex dx360 M3 servers using six-core Xeon 5600 processors and Nvidia M2070 coprocessors – was rated at 142.7 teraflops and consumed only 160 kilowatts, earning position five at 891.9 megaflops per watt.
Interestingly, the K super built by Fujitsu for the Japanese government using its eight-core Sparc64-VIIIfx processors makes a strong case for monolithic parallel machines: it ranked as the most powerful machine on the Top 500 super list, and number six on the Green500.
This machine, which will eventually scale above 10 petaflops of aggregate performance, currently has 548,352 cores – that's 17,136 blade servers – for a total of 8.77 peak petaflops. The interesting bit is that when this machine was put through the Linpack paces, it yielded 8.16 petaflops of sustained performance: a very impressive 93 per cent efficiency where the compiler meets the core.
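That 93 per cent figure is just sustained Linpack performance divided by peak theoretical performance. A quick Python sketch, assuming the rounded numbers quoted in this story and a made-up helper name (`linpack_efficiency`), shows how far the K super sits from the DEGIMA hybrid on that score.

```python
def linpack_efficiency(sustained, peak):
    """Fraction of peak theoretical performance delivered on Linpack.

    Hypothetical helper for illustration: both arguments must be in the
    same unit (teraflops here).
    """
    if peak <= 0:
        raise ValueError("peak performance must be positive")
    return sustained / peak


# (sustained teraflops, peak teraflops), rounded figures from the text.
machines = {
    "K super": (8160.0, 8770.0),
    "DEGIMA": (42.8, 111.2),
}

for name, (sustained, peak) in machines.items():
    share = linpack_efficiency(sustained, peak)
    print(f"{name}: {share:.1%} of peak delivered on Linpack")
```

Note that this ratio measures computational efficiency only; it says nothing about watts, which is why a machine can score poorly here and still rank well on the Green500.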
Many people might assume that the K super would not be particularly energy efficient, but it doesn't do too badly. As currently configured, the K super burns 9.9 megawatts, yielding 824.6 megaflops per watt of efficiency. It will be interesting to see if Fujitsu's hardware and software engineers can improve upon this as they tweak and tune the machine.
Exit the Cells
Other machines at the top of the Green500 list include a slew of special blade servers based on IBM's "Cell" PowerXCell 8i chips and a 3D torus interconnect. While these machines are rated at 773.4 megaflops per watt, IBM has just killed off the last of the QS22 blade servers using the Cell chips, so these machines are dead ends.
The number two machine on the Top 500 list is the Tianhe-1A super located at the National Supercomputing Center in Tianjin, China. It has a sustained performance of 2.57 petaflops and burns just over 4 megawatts of juice with its hybrid Intel Xeon-Nvidia GPU architecture, for 635.2 megaflops per watt. It slips in at number 13 on the Green500 list.
A number of Power7-based clusters are near the top of the Green500 list, with the most energy-efficient one – at number 14 on the list – being based on Hitachi SR16000 servers, which are rebadged IBM Power Systems machines.
This cluster, built by Hitachi for the Yukawa Institute for Theoretical Physics, is based on IBM's Power7 chips running at 3.3GHz. It has 3,456 cores and has a peak performance of 91.2 teraflops and a sustained Linpack performance of 73.4 teraflops (using QDR InfiniBand networking). This machine consumes 129.6 kilowatts, yielding a Green500 rating of 566 megaflops per watt.
At the TJ Watson labs – and at number 17 on the Green500 list – IBM has a Power 750 cluster with 2,560 cores running at a faster 3.86GHz but using 10 Gigabit Ethernet links; it runs at a 73.8 per cent efficiency (in terms of performance) on the Linpack test, and delivers 58.3 teraflops for its number 192 spot on the Top 500 ranking. But this machine only delivers 483.7 megaflops per watt, which is roughly half of what hybrid CPU-GPU clusters such as Tsubame 2.0 and PLX manage near the top of the list.
SGI's most energy-efficient machine on the Green500 list comes in at number 12, a hybrid Xeon-Tesla machine that it built with NSSOL for the National Institute for Environmental Studies, and that yielded 650.3 megaflops per watt.
The most energy-efficient Cray machine is the "Hopper" XE6 Opteron parallel super installed at the DOE's Lawrence Berkeley National Laboratory. This machine has 153,408 cores and delivers just over 1 petaflops of sustained performance on the Linpack test. However, it consumes 2.9 megawatts to do it, for 362.1 megaflops per watt – which gets Cray only to number 42 on the Green500 list.
The least energy-efficient machine to make the Green500 list is the "Thunderbird" Xeon-InfiniBand cluster built by Dell for Sandia National Laboratories back in 2006. This machine, which has 9,024 cores, delivers a mere 53 teraflops but consumes 2.48 megawatts, for a juice-squandering 21.4 megaflops per watt.
At the bottom of the Green500 list, there are also a whole bunch of modern Xeon and Opteron systems using Gigabit Ethernet links that don't do much better than the Thunderbird in terms of power efficiency. But power efficiency is not a top priority for everyone. Sometimes, the goal is just to get the numbers crunched in the cheapest manner possible. ®