Original URL: http://www.theregister.co.uk/2011/06/20/top_500_supers_june_2011/

Japan takes the Top 500 lead with K super

The mother of all Sparc systems

By Timothy Prickett Morgan

Posted in HPC, 20th June 2011 11:39 GMT

ISC'11 Japan needs a little good news these days, and it comes from the International Supercomputing Conference 2011 in Hamburg, Germany, where the K supercomputer, a massively parallel Sparc-based cluster built by Fujitsu, has taken the lead in the number-crunching race as gauged by the June 2011 edition of the Top 500 supercomputer list and its Linpack Fortran benchmark.

The K machine was developed under the Project Keisoku name and was sometimes called the Next Generation Supercomputer Project as it was under development by the Japanese government. The machine was slated to cost $1.2bn and was to have a hybrid architecture featuring the three system titans of Japan, with vector machines and interconnect designed by NEC and Hitachi and scalar machines created by Fujitsu, with a combined peak performance of 10 petaflops. In May 2009, citing the excessive costs of production at the same time the Great Recession hit, NEC and Hitachi pulled out of the project, leaving Fujitsu holding the K football and having to keep the project alive and transition it to a Sparc-only design as it took it into the political end zone for continued funding.

Fujitsu K Supercomputer

The K supercomputer at Riken

As El Reg detailed last fall, Fujitsu kept the 6D mesh torus interconnect, known as Tofu, that was created by the three vendors for the K machine and simply built the whole box using its four-socket, water-cooled blade servers instead of a mix of vector and scalar processors. Fujitsu created the eight-core "Venus" Sparc64-VIIIfx chip specifically for the machine, and neither Oracle nor Fujitsu has committed to using this processor in commercial Sparc Enterprise M servers. (And they are damned fools if they don't.) The processors run at 2GHz, and each blade has 512 gigaflops of double-precision floating point power, which is not a lot for a GPU but it is not too bad for a CPU. The K super is located at the Rikagaku Kenkyusho (Riken) research lab in Kobe, Japan. The current configuration of the K machine has 548,352 cores – that's 17,136 blade servers – for a total of 8.77 peak petaflops. And here's the cool bit: running the Fortran matrix test and using that Tofu interconnect, the machine was able to actually crank through 8.16 petaflops, yielding a "shut your mouth" 93 per cent efficiency.

Perhaps NEC and Hitachi were a little hasty?
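For readers who want to check the K numbers, here is a minimal Python sketch that rebuilds the blade count, peak rating, and Linpack efficiency from the figures quoted above. The constant and variable names are ours, chosen for illustration; only the input values come from the article.

```python
# Figures quoted in the article for the K super.
CORES = 548_352
CORES_PER_CHIP = 8        # eight-core Sparc64-VIIIfx
CHIPS_PER_BLADE = 4       # four-socket blades
CLOCK_GHZ = 2.0
GFLOPS_PER_BLADE = 512    # double precision
LINPACK_PFLOPS = 8.16     # sustained result on the benchmark

cores_per_blade = CORES_PER_CHIP * CHIPS_PER_BLADE
blades = CORES // cores_per_blade
peak_pflops = blades * GFLOPS_PER_BLADE / 1e6   # gigaflops -> petaflops
efficiency = LINPACK_PFLOPS / peak_pflops

# Implied floating point operations per core per clock cycle.
flops_per_cycle = GFLOPS_PER_BLADE / cores_per_blade / CLOCK_GHZ

print(f"{blades:,} blades, {peak_pflops:.2f} peak petaflops")
print(f"{efficiency:.1%} Linpack efficiency")
print(f"{flops_per_cycle:.0f} flops per core per cycle")
```

The blade count and peak rating fall straight out of the core count, and the 512 gigaflops per blade works out to eight double-precision operations per core per clock.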

Top500 Jun 2011 chip performance

One K super and Sparc is now a big part of the aggregate Top 500 oomph

The K super quite handily dispatched the former champ, which only held the title for half a year. That was the Tianhe-1A hybrid super at the National Supercomputing Center in Tianjin. The Tianhe-1A mixes six-core Intel Xeon processors, Nvidia Tesla GPUs, and a smattering of homegrown Sparc processors (yes, China makes its own Sparc chips) to hit 2.56 petaflops. The Tianhe-1A ceepie-geepie has 14,336 Xeon processors and 7,168 of Nvidia's Tesla M2050 fanless GPU coprocessors, and uses a homegrown tray server design and a proprietary interconnect called Arch; it has 186,368 cores across those CPUs and GPUs, the latter of which link into the CPUs through PCI-Express peripheral slots. While the Tianhe-1A machine has a peak theoretical performance of 4.7 petaflops, the resulting machine has only a 54.6 per cent efficiency running the Linpack test.

Riken says the K super consumed 9.89 megawatts of juice. The Tianhe-1A machine only draws 4 megawatts, so even if it is a lot less efficient, the performance per watt of these two machines is a lot closer than you might think. We'll save the thermal efficiency analysis for the Top 500 list for another day, but suffice it to say that there are 29 machines on the June 2011 list that burn more than 1 megawatt.
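The performance-per-watt comparison is simple enough to sketch in a few lines of Python. The helper name `mflops_per_watt` is hypothetical; the inputs are the sustained Linpack and power figures quoted above, with Tianhe-1A's draw taken as a flat 4 megawatts.

```python
def mflops_per_watt(pflops: float, megawatts: float) -> float:
    """Convert sustained petaflops and megawatts into megaflops per watt."""
    flops = pflops * 1e15
    watts = megawatts * 1e6
    return flops / watts / 1e6


k = mflops_per_watt(8.16, 9.89)        # K super
tianhe = mflops_per_watt(2.56, 4.0)    # Tianhe-1A

print(f"K:         {k:.0f} megaflops per watt")
print(f"Tianhe-1A: {tianhe:.0f} megaflops per watt")
print(f"Linpack ratio {8.16 / 2.56:.2f}x, per-watt ratio {k / tianhe:.2f}x")
```

On these inputs K comes out at roughly 825 megaflops per watt against 640 for Tianhe-1A, so a machine with more than three times the Linpack score is only about 1.3 times better per watt.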

Japan and China each have another system in the top ten portion of the list.

China holds the number four position with the "Nebulae" machine, another ceepie-geepie composed of Xeon processors and Nvidia Teslas, that is installed at the National Supercomputing Center in Shenzhen. The Nebulae machine has a total of 120,640 cores across its CPUs and GPUs, which are housed in a blade server chassis crafted by Chinese server maker Dawning. This machine has a peak performance of 2.98 petaflops, but only hits 1.27 petaflops where the Fortran compiler hits the silicon.

Japan's other system at the top portion of the June 2011 supers ranking is the Tsubame 2.0 super, which is built from Hewlett-Packard's ProLiant SL390s G7 tray servers, sporting Xeons and Nvidia Tesla coprocessors. NEC is prime contractor on this machine, even though the components come from American IT companies, which is politically necessary to have the machine installed at the Tokyo Institute of Technology. This hybrid super has a total of 73,278 cores and a peak performance of 2.29 petaflops, but like other ceepie-geepies, the efficiency is low running Linpack and only 1.19 petaflops actually gets done doing the benchmark.
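The peak and sustained ratings are quoted for Nebulae and Tsubame 2.0 but not their efficiency, so here is a small Python sketch that works it out alongside K for comparison. The `efficiency` function and the dictionary layout are illustrative choices of ours; the petaflops values are the ones given in the article.

```python
def efficiency(peak_pflops: float, linpack_pflops: float) -> float:
    """Fraction of theoretical peak actually delivered on Linpack."""
    return linpack_pflops / peak_pflops


# (peak petaflops, sustained Linpack petaflops) as quoted in the article
machines = {
    "K": (8.77, 8.16),
    "Nebulae": (2.98, 1.27),
    "Tsubame 2.0": (2.29, 1.19),
}

for name, (peak, linpack) in machines.items():
    print(f"{name}: {efficiency(peak, linpack):.1%}")
```

That puts Nebulae at about 42.6 per cent and Tsubame 2.0 at about 52 per cent, in the same neighborhood as Tianhe-1A and a long way behind K.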

The top machine in the United States is the "Jaguar" supercomputer installed at Oak Ridge National Laboratory, which was ranked first in the world on the November 2009 and June 2010 lists, but which was knocked into the number two spot by Tianhe-1A last November. Jaguar is a Cray XT5 system based on six-core Opteron processors from AMD and uses the SeaStar-2+ interconnect. This machine is a bit long in the tooth and should be updated to 16-core Opterons and the "Gemini" XE6 interconnect if the US Department of Energy wants to spend the dough. Jaguar has 224,162 Opteron cores and is rated at 1.76 petaflops on the Linpack test at an efficiency of about 75.5 per cent.

Two other Cray systems made the top ten portion of the list, and they were proper XE6 boxes. The new "Cielo" super at Los Alamos National Laboratory, based on eight-core Opteron 6100 processors and using the Gemini interconnect, enters the list at number six. With 142,272 cores, it comes in at 1.11 petaflops of sustained performance at an efficiency of 81.3 per cent, which is not too shabby. Number eight on the list is another XE6 machine, called "Hopper" and using twelve-core Opteron 6100s, at the DOE's Lawrence Berkeley National Laboratory; this machine has 153,408 cores and has a sustained performance of 1.05 petaflops (81.8 per cent efficiency).

Number seven on the list is the "Pleiades" Xeon cluster using InfiniBand interconnect at NASA Ames Research Center built by Silicon Graphics. This system has 111,104 cores and delivers 1.09 petaflops of number-crunching oomph (82.7 per cent efficiency).

Number nine on the list is the Tera-100 parallel cluster built by Bull for the Commissariat a l'Energie Atomique (CEA) in France. Tera-100 is based on Intel's Xeon 7500 high-end processors and Bull's bullx supercomputer blades; it uses QDR InfiniBand to lash the nodes together and is rated at 1.05 petaflops, unchanged from the November 2010 list when it entered the Top 500 rankings.

IBM's largest machine on the June 2011 list, the "Roadrunner" hybrid Opteron-Cell blade super, was a contender for the Top 500 roost a few years back, but it has sat still while other machines have blown by it. (Beep, beep!) The Roadrunner machine, which was the first machine to break the petaflops barrier and which fell behind Jaguar eighteen months ago, has a combined 122,400 cores across its Opteron and Cell processors and delivers 1.04 petaflops of performance on Linpack (at an efficiency of 75.7 per cent).

The age of petaflops is upon us

The June 2011 list is the first since the Top 500 rankings began in 1993 on which all of the top ten machines are rated at a petaflops or more. And it won't be too long before 10 petaflops will be the ante to get into the upper echelons of the list. Here's how the current projections look:

Top500 Jun 2011 projection

Exascale: easy to build, hard to power and cool

While CPU clusters are dominating the top part of the Top 500 list at this moment, don't draw the wrong conclusion from this. There are specific cases where larger numbers of scalar or vector processors with proprietary interconnects are going to be necessary for a particular set of code. But in many cases, the cheap and low-powered flops of GPUs or other kinds of coprocessors – Intel's x64-based parallel Knights processors, FPGAs, or other gadgets – are going to be the only way a lot of organizations are going to be able to afford to do their supercomputing. Thus far, there are three GPU-accelerated machines on the Top 10 (Tianhe-1A, Nebulae, and Tsubame 2.0) and a total of 19 machines using accelerators on the entire Top 500 list. Of these accelerated machines, a dozen use Nvidia GPUs, five use IBM Cells, and two use Advanced Micro Devices Radeon graphics cards.

It is early days for accelerated, hybrid supercomputing. But there is a general consensus that you can't just keep scaling up with x64, Power, or Sparc processors indefinitely without having to put in a few nuclear power plants alongside an exascale-class machine to juice it up. Optimistic vendors think we can get to exascale machines by 2018, maybe a little later, if we can solve some pretty hefty engineering problems. The problems always look insurmountable at the time, as they did breaking the gigaflops, teraflops, and petaflops barriers. This time, though, the thermodynamics issues are truly staggering.
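To put a rough number on the power problem, here is a back-of-the-envelope Python sketch that scales the K super's power draw up to an exaflops. It assumes power grows linearly with sustained performance at K's current efficiency, which is a deliberate simplification of ours and not a projection from the Top 500 organizers; the constant names are illustrative.

```python
# K super figures quoted earlier in the article.
K_LINPACK_PFLOPS = 8.16
K_POWER_MW = 9.89

# One exaflops expressed in petaflops.
EXAFLOPS_IN_PFLOPS = 1000.0

# Assumption: power scales linearly with sustained performance.
mw_per_pflops = K_POWER_MW / K_LINPACK_PFLOPS
exascale_mw = mw_per_pflops * EXAFLOPS_IN_PFLOPS

print(f"{mw_per_pflops:.2f} megawatts per sustained petaflops")
print(f"{exascale_mw:,.0f} megawatts for a sustained exaflops")
```

Under that assumption an exascale machine built from today's K technology would need a bit over 1,200 megawatts, which is power-station territory.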

If you add up all the number-crunching power of the machines on the Top 500 list, you get 58.88 petaflops, which is up 34.7 per cent from the November 2010 list and up 81.7 per cent from the 32.4 aggregate petaflops on the June 2010 list. To get onto the Top 100 part of the list this time around, you needed a machine with 88.92 teraflops. The smallest machines on the list (in terms of performance) are a pair of BladeCenter blade servers at an unnamed manufacturer in China using IBM's HS22 blades and quad-core Xeons rated at 40.2 teraflops.
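The growth figures can be cross-checked with a short Python sketch. The November 2010 aggregate is not stated above, so the code backs it out from the 34.7 per cent growth rate; treat that derived value as an estimate. The constant names are ours.

```python
# Aggregate Linpack performance quoted in the article, in petaflops.
JUNE_2011_PFLOPS = 58.88
JUNE_2010_PFLOPS = 32.4
GROWTH_SINCE_NOV_2010 = 0.347   # 34.7 per cent

# Implied November 2010 aggregate, derived rather than quoted.
nov_2010_pflops = JUNE_2011_PFLOPS / (1 + GROWTH_SINCE_NOV_2010)

# Year-on-year growth recomputed from the two quoted aggregates.
yearly_growth = JUNE_2011_PFLOPS / JUNE_2010_PFLOPS - 1

print(f"Implied November 2010 aggregate: {nov_2010_pflops:.1f} petaflops")
print(f"Growth since June 2010: {yearly_growth:.1%}")
```

The year-on-year figure reproduces the 81.7 per cent quoted, and the implied November 2010 aggregate lands at about 43.7 petaflops.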

IBM may not have a lot of machines near the top of the list – expect to see some 10 petafloppers from Big Blue either later this year or early next year – but in terms of total computing oomph on the list, Big Blue still has a big slice of the pie:

Top 500 June 2011 Vendor Performance

The Top 500 aggregate flops pie (by system capacity not count)

Cray, Fujitsu, and Hewlett-Packard all have similar and fairly large slices of the aggregate petaflops, too, with most of Fujitsu's in the K super, of course. IBM has 213 systems on the list (42.6 per cent of systems), compared to HP's 153 machines (30.6 per cent). IBM gained 13 machines over the November 2010 list, while HP lost five boxes. Interestingly, among commercial supercomputer buyers, IBM and HP have this market all but cornered. Of the 286 machines on the Top 500 list that are not at government-sponsored labs or educational institutions, IBM has 147 of them and HP has 133 of them, leaving only six other machines for other vendors to scratch each other to death for. On the entire list, Cray has 29 machines, Dell has 19 machines, and SGI has 16 machines.
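How cornered is the commercial segment? A quick Python sketch using the counts above makes the point. The variable names are illustrative; the three input counts are from the article.

```python
# Machines not at government-sponsored labs or educational institutions.
COMMERCIAL_SYSTEMS = 286
IBM_COMMERCIAL = 147
HP_COMMERCIAL = 133

others = COMMERCIAL_SYSTEMS - IBM_COMMERCIAL - HP_COMMERCIAL
ibm_share = IBM_COMMERCIAL / COMMERCIAL_SYSTEMS
hp_share = HP_COMMERCIAL / COMMERCIAL_SYSTEMS
combined_share = ibm_share + hp_share

print(f"IBM {ibm_share:.1%}, HP {hp_share:.1%}, combined {combined_share:.1%}")
print(f"Machines left for everyone else: {others}")
```

IBM and HP together account for roughly 97.9 per cent of the commercial machines on the list.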

Multicore processors are the norm for supercomputers these days, of course. Of the 500 machines ranked this time around, 212 machines use CPUs with six or more cores, up from only 95 machines six months ago. Quad-core processors are on the wane, down to 231 machines from 365 on the November 2010 list. There are 387 machines using Intel processors, down a bit from 398 six months ago. Intel's latest "Westmere" Xeon 5600 and 7500 series of processors are used in 178 systems, more than triple what was on the list last November. AMD saw some gains, too, with 65 machines on this list compared to 57 machines a year ago. (Thanks mostly to Cray, but also getting some help from HP and Dell.) There are 45 machines on the list using one or another Power processor, up from 40 six months ago.

Top500 Jun 2011 interconnect performance

Top 500 super interconnects share, by system capacity not count

In terms of interconnects, Gigabit Ethernet, which is perfectly fine for a lot of embarrassingly parallel applications, is used on 233 machines, up from 227 boxes on the November 2010 list. Industrial companies, who don't have the somewhat unlimited budgets of government labs, favor Gigabit Ethernet networks. Across all Top 500 installations, InfiniBand at its various speeds is used on 205 systems, down from 214 six months ago. But don't jump to the wrong conclusion. For some HPC workloads, low latency and high bandwidth do matter, and that is why if you look at the distribution of network protocols by the aggregate flops in the machines, InfiniBand networks are used on a total of 23 petaflops, about twice the 11.6 petaflops on all of those Gigabit Ethernet clusters.
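Dividing aggregate flops by system count shows why the machine count alone is misleading. This Python sketch computes the average Linpack rating per system for each interconnect from the figures above; the function name `average_tflops` is ours.

```python
def average_tflops(aggregate_pflops: float, systems: int) -> float:
    """Average sustained teraflops per system for one interconnect type."""
    return aggregate_pflops * 1000 / systems


infiniband = average_tflops(23.0, 205)   # InfiniBand, all speeds
gigabit = average_tflops(11.6, 233)      # Gigabit Ethernet

print(f"InfiniBand:       {infiniband:.1f} teraflops per system")
print(f"Gigabit Ethernet: {gigabit:.1f} teraflops per system")
print(f"Ratio: {infiniband / gigabit:.2f}x")
```

The average InfiniBand machine on the list comes out at about 112 teraflops against roughly 50 for the average Gigabit Ethernet cluster.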

By geography, the United States had 256 systems on the June 2011 Top 500 list, down from 274 six months ago. Europe had 125 systems on the list, with 30 in Germany, 27 in the United Kingdom, and 25 in France. Asia is on the move though, with 103 systems, up from 84 in November 2010. China is the big builder, of course, with 62 machines, up from 42. Japan has 26 machines on the list, the same as last time. ®