Top 500 supers - rise of the Linux quad-cores
Jaguar munches Roadrunner
SC09 Looking at the semi-annual Top 500 list of supercomputers, you would never know the world was battling recession. Supercomputer centers keep chucking out old tech and rolling in new tech at the same feverish pace.
The fall edition of the list was announced this morning at the SC09 supercomputing trade show in Portland, Oregon. The single biggest transition in the list is the move to quad-core - and in some notable cases, six-core - processors inside supercomputing systems. And most of the machines on the list now run Linux with x64 processors.
The key differences between machines are which network interconnects and topologies are deployed and which adjunct processing elements (if any) are used to boost the performance of the central processors in server nodes. For specific workloads, those differences are substantial enough to keep a whole bunch of HPC suppliers busy at a time when you might expect the market to be down to two suppliers.
According to the techies who put together the Top 500 rankings - Hans Meuer of the University of Mannheim, Germany; Erich Strohmaier and Horst Simon of Lawrence Berkeley National Laboratory; and Jack Dongarra of the University of Tennessee - there are 427 machines that have quad-core processors. Intel's "Nehalem-EP" Xeon 5500 processors have nearly tripled their presence on the Top 500 list, to 95 machines, up from 33 in the June 2009 rankings. There were none in the November 2008 rankings because the Xeon 5500s came to market a little later than expected, in March of this year. (El Reg detailed those two prior Top 500 lists here and there.)
Intel is the main chip supplier in 402 of the 500 machines that made the cut this time around, and 380 of those are using quad-core processors of one generation or another. Unlike AMD's current line of quad-core "Shanghai" and six-core "Istanbul" processors, which plug into existing Rev F system boards, Intel's Xeon 5500s require a box swap from the prior Xeon 5300 "Clovertown" and Xeon 5400 "Harpertown" series.
There are 42 machines on the list using AMD's processors and 52 using IBM's Power processors, both losing a little ground since the June 2009 ranking. Dual-core processors are still in use on 59 machines (all of the Power5 and Power6 supers on the list are using dual-core chips), and there are only four machines that employ single-core processors in their cluster nodes.
But AMD is still pretty pumped, and for a number of reasons. First of all, after trying to outrun IBM's hybrid "Roadrunner" Opteron-Cell massively parallel blade server at Los Alamos National Laboratory for the past several lists, the "Jaguar" all-Opteron XT5 box made by Cray and installed at Oak Ridge National Laboratory has pounced upon Roadrunner and pulled out some flesh and spit out some feathers.
The Jaguar system at Oak Ridge has been upgraded to the six-core Istanbuls in recent months. It now has 224,162 cores running at 2.6 GHz, and it has 1.76 petaflops of aggregate sustained number-crunching performance as gauged by the Linpack Fortran benchmark test. The Jaguar box doesn't have any fancy-schmancy co-processors, but Oak Ridge just announced in October that it has received funding to build a supercomputer that uses Nvidia's future Fermi graphics processing units and CUDA programming environment.
Oak Ridge did not specify how it would make use of the GPUs. But it is possible that they will be added to the Jaguar box in a hybrid architecture akin to the one embodied in the Roadrunner machine built by IBM for Los Alamos.
Roadrunner was reconfigured last month and began its classified computing work for the US Department of Energy. In that reconfiguration, the machine had a few nodes busted out, and it is now rated at just a hair over 1 petaflops. It uses a mix of 1.8 GHz dual-core Opterons and 3.2 GHz PowerXCell 8i co-processors (for a total of 122,400 cores), with InfiniBand linking the nodes and PCI-Express linking the Cell chips to the Opteron sockets. The machine is nonetheless still ranked number two on the Top 500 list, although it may not be for long.
Number three on the super ranking is the new "Kraken" XT5 system built by Cray for the University of Tennessee, which like the Jaguar machine is based on the six-core Istanbul chips running at 2.6 GHz. Kraken has 98,928 cores and is rated at 831.7 teraflops of floating point oomph.
The "Jugene" BlueGene/P parallel super installed at the Forschungszentrum Juelich in Germany comes in at number four on the Top 500 list, rated at 825.5 teraflops using 850 MHz PowerPC 450 cores. This machine came online for the June 2009 list and has not changed this year.
Perhaps the most interesting new machine in the stratosphere of the Top 500 list is the Tianhe-1 hybrid supercomputer installed at the National Supercomputer Center in Tianjin, China. Tianhe, which means "River in the Sky" in Chinese, will be used to do aircraft design and oil exploration and will be the anchor of a national supercomputing grid for the northeast region of the country.
Tianhe-1 is composed of Xeon server nodes using a mix of E5540 and E5450 processors, with each node configured with two of AMD's Radeon HD 4870 graphics cards to be used as co-processors. The machine has 71,680 cores and is rated at 563.1 teraflops of sustained performance against 1.2 petaflops of peak theoretical performance. That might be awful in terms of efficiency, but the machine is important because it puts China in the top five and it shows that you can build a powerful cluster from a mix of off-the-shelf CPUs and GPUs, even if it is inefficient.
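For the curious, the efficiency gibe is simply sustained Linpack performance divided by theoretical peak. Here is a minimal Python sketch using figures quoted in this article. The function name is made up for illustration, and the Jaguar peak is taken as a round 2.3 petaflops (the article says only "over 2.3"), so treat the Jaguar result as an upper bound rather than an official number.

```python
def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Return sustained (Rmax) over peak (Rpeak) performance as a fraction."""
    if rpeak_tflops <= 0:
        raise ValueError("peak performance must be positive")
    return rmax_tflops / rpeak_tflops


# (sustained, peak) in teraflops, as quoted in the article.
systems = {
    "Tianhe-1": (563.1, 1200.0),  # 563.1 TF sustained, 1.2 PF peak
    "Jaguar": (1760.0, 2300.0),   # 1.76 PF sustained; peak assumed to be exactly 2.3 PF
}

for name, (rmax, rpeak) in systems.items():
    print(f"{name}: {linpack_efficiency(rmax, rpeak):.1%} of peak")
```

On these numbers Tianhe-1 lands at roughly 47 per cent of peak, against something in the region of three quarters for the CPU-only Jaguar.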
Rounding out the top ten systems on the list are machines that used to be a lot closer to the top. The "Pleiades" Altix cluster built by Silicon Graphics for NASA Ames, rated at 544.3 teraflops and using quad-core Nehalem processors on its blades, is number six, followed by the 478.2 teraflops BlueGene/L massively parallel machine at Lawrence Livermore National Laboratory (and the top-ranked super on the November 2007 list) at number seven. Argonne National Laboratory's 458.6 teraflops BlueGene/P is number eight, and Sun Microsystems' "Ranger" Opteron-based blade cluster at the University of Texas, rated at 433.2 teraflops, is number nine.
While Sun has been quiet about most things server related since Oracle announced its $7.4bn deal to acquire the company back in April, Sandia National Laboratories has tapped Sun to build the "Red Sky" blade cluster. This machine uses Intel's quad-core Xeon 5570 processors and Sun's x6275 blades and InfiniBand switches and is rated at 423.9 teraflops.
The politics of petaflops
In terms of vendor share, the Top 500 is still dominated by server makers IBM and Hewlett-Packard. IBM has 185 systems on the list this time around, with a total of 2.14 million cores and 9.72 petaflops of aggregate performance, giving IBM's machines a 34.8 per cent share of the number crunching on the list.
HP might not have had a big box near the top of the list in a long time - HP's most powerful machine is the 132.8 teraflops Cluster Platform 3000 based on the BL460c blade servers installed at Tata & Sons in India, ranked 26 on the list - but the world's volume server supplier has 210 machines on the November 2009 Top 500 list, with more than 1 million cores and 6.64 petaflops of aggregate oomph across those boxes, a 22.8 per cent share of the combined performance of the Top 500 list.
Cray and Silicon Graphics, niche server players but HPC specialists, have 19 machines each on the November ranking. Cray's machines have 596,315 cores for a total of 4.4 petaflops of combined performance, while SGI has much skinnier machines - at least until it starts installing the much-anticipated "UltraViolet" shared memory systems based on Intel's future Xeon 7500 "Nehalem EX" eight-core processors. SGI's 19 machines, including a mix of Altix 4700 Itanium-based machines as well as the Altix ICE Xeon blade clusters, have a total of 198,304 cores for a not-too-shabby aggregate of 1.83 petaflops of performance.
Sun Microsystems, which has wanted to be a more serious player in HPC for the past decade (and, considering its server designs and switches, should be), has 11 machines on the current Top 500 list. The Sun boxes have 171,442 cores and 1.52 petaflops of aggregate performance. Dell has 16 machines on the list, with 616 teraflops and a mere 85,766 cores. (Dell, IBM, and Sun share boxes not included in that Dell total, and Dell has partnered with ACS for another machine.) Bull has five machines on the list for 481 teraflops and Appro International has six machines for 481 teraflops.
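To put a number on how fat or skinny each vendor's boxes are, one can divide the totals above by machine count and by core count. The sketch below is a back-of-the-envelope illustration using only figures quoted in this section. The helper name is hypothetical, IBM's core count is the rounded 2.14 million, and HP is left out because its core count is given only as "more than 1 million".

```python
def vendor_profile(systems: int, cores: int, tflops: float) -> tuple[float, float]:
    """Return (average cores per system, average gigaflops per core)."""
    cores_per_system = cores / systems
    gflops_per_core = tflops * 1000.0 / cores  # 1 teraflops = 1,000 gigaflops
    return cores_per_system, gflops_per_core


# vendor: (systems on the list, total cores, aggregate teraflops)
vendors = {
    "IBM": (185, 2_140_000, 9720.0),  # core count is rounded in the article
    "Cray": (19, 596_315, 4400.0),
    "SGI": (19, 198_304, 1830.0),
    "Sun": (11, 171_442, 1520.0),
    "Dell": (16, 85_766, 616.0),
}

for name, (systems, cores, tflops) in vendors.items():
    per_system, per_core = vendor_profile(systems, cores, tflops)
    print(f"{name:5s} {per_system:>9,.0f} cores/system  {per_core:5.2f} gigaflops/core")
```

By this rough measure the average Cray machine on the list has about three times as many cores as the average SGI machine, which is one way to read "skinnier". These are averages across very different systems, so they say little about any individual box.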
North America v the world
Japanese server makers NEC and Hitachi have three machines each on the list, and their commitment to the supercomputing space has been shaken by their financial performance in the past year. That has led the two companies to walk away from the Project Keisoku hybrid vector-scalar, 10 petaflops supercomputer project, which aimed to put the three indigenous Japanese server makers (including Fujitsu) to work on the $1.2bn project.
In July, Fujitsu was given the whole Project Keisoku contract, based on its future "Venus" eight-core Sparc64-VIIIfx processors. Fujitsu has five boxes on the list, with 33,688 total cores and 318.7 total teraflops.
Of the 500 machines on the list, 479 of them are made by vendors that are headquartered in North America, with 11 coming from vendors located in Europe (including the T-Platforms Xeon X5570 cluster installed at Moscow State University, which is rated at 350.1 teraflops). Another nine boxes are made in Asia and two are designated "global" in terms of their source because vendors involved in the creation of the supers hail from different countries.
The geography of where the machines end up (as opposed to where they come from) is interesting, particularly considering the politics of supercomputing. (There's a lot of chest puffing in HPC, and there always has been.) On the November list, 287 of the Top 500 supers are installed in the Americas, which is 57 per cent of the base of boxes and 60.5 per cent of the 27.95 petaflops of total performance on the list.
Europe is home to 152 machines and 26.6 per cent of the aggregate floating point capacity on the list. Within Europe, the United Kingdom is the leading country on the Top 500 list (as it tends to be) with 45 boxes, followed by Germany and France with 27 boxes each. Asia hosts 51 boxes, which is 10.2 per cent of the base, and 3.31 petaflops, which is 11.8 per cent of the oomph. China has 21 systems from the list, Japan has 16 systems, and India has three. Oceania has nine boxes and Africa has one.
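The regional percentages are plain ratios against the 500 machines and the 27.95 petaflops of aggregate performance on the list. A short sketch, using only the counts quoted above and a hypothetical helper name, shows how the shares fall out and confirms that the regional counts add up to 500.

```python
LIST_SIZE = 500
TOTAL_PFLOPS = 27.95  # aggregate performance of the list, as quoted above


def share(part: float, whole: float) -> float:
    """Return part as a fraction of whole."""
    return part / whole


# Installed systems by region, as quoted in the article.
installed = {"Americas": 287, "Europe": 152, "Asia": 51, "Oceania": 9, "Africa": 1}

# Sanity check: every machine on the list is accounted for.
assert sum(installed.values()) == LIST_SIZE

for region, count in installed.items():
    print(f"{region:9s} {count:3d} systems  {share(count, LIST_SIZE):6.1%} of the base")

# Asia's 3.31 petaflops as a share of the list's total performance.
print(f"Asia performance share: {share(3.31, TOTAL_PFLOPS):.1%}")
```

Only Asia's petaflops figure is given in absolute terms, so that is the only performance share the sketch recomputes.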
I see your petaflops - and I raise you 10
Petaflops had become boring on the June 2009 list, and all eyes in the HPC community are on how they can push up to 10 petaflops and beyond and push to get funding to build such monstrous machines. While there are only two machines on the list that have broken through the petaflops barrier, everybody knows they can do it. It is just a matter of doing what others have done, or mixing it up a little.
Getting to 10 petaflops is no more trivial now than breaking 1 teraflops was in 1996 or 1 petaflops was in 2008. It takes a lot of changes in technology to make such big leaps. The teraflops barrier was broken with massive parallelism and fast interconnects, and the petaflops barrier was initially broken by a hybrid architecture pairing x64 processors and co-processors to boost their math performance.
The fact that the current top-end Jaguar machine does not use GPU or FPGA co-processors to get to over 2.3 petaflops of peak performance does not mean 10 petaflops will be attained with CPUs alone. Some HPC codes work well with CPU-only setups, and others will do better with the combination of CPU-GPU architectures. What HPC vendors need to do is get GPUs into the server nodes and more tightly connected to the CPUs they serve.
If you draw the projections (as the techies behind the Top 500 list have done), then sometime in late 2011 or early 2012, the fastest machine in the Top 500 list should be able to hit 10 petaflops and the aggregate performance on the list will be well above 100 petaflops. By sometime in 2015, a supercomputer will have to be rated at 1 petaflops or so just to make it on the list, if projections stay linear as they have since 1993, when the Top 500 list started.
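Those projections imply a steady doubling time, since a straight trend line on a logarithmic chart is the same thing as exponential growth. The sketch below works out the doubling time implied by the figures in this article. The function names are hypothetical, and the intervals (about six years from the 20 teraflops entry level on the current list to 1 petaflops in 2015, and about two years from Jaguar's 1.76 petaflops to 10 petaflops in late 2011) are one reading of the dates given, not official Top 500 parameters.

```python
import math


def doubling_time_years(start: float, end: float, years: float) -> float:
    """Doubling time implied by growing from start to end over the given years."""
    if start <= 0 or end <= start or years <= 0:
        raise ValueError("need 0 < start < end and years > 0")
    return years * math.log(2) / math.log(end / start)


def project(value: float, years: float, doubling_time: float) -> float:
    """Extrapolate value forward, assuming a constant doubling time."""
    return value * 2 ** (years / doubling_time)


# Entry level: 20 teraflops now, 1,000 teraflops (1 petaflops) in roughly 6 years (assumed).
entry = doubling_time_years(20.0, 1000.0, 6.0)

# Top machine: 1.76 petaflops now, 10 petaflops in roughly 2 years (assumed).
top = doubling_time_years(1.76, 10.0, 2.0)

print(f"entry-level doubling time: {entry:.2f} years")
print(f"top-machine doubling time: {top:.2f} years")
print(f"entry level after 3 years: {project(20.0, 3.0, entry):.0f} teraflops")
```

Under those assumptions the bar for getting onto the list doubles a little more often than once every 13 months. Shifting either assumed interval by a few months moves the answer noticeably, so this is a feel for the trend rather than a forecast.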
On the current list, it takes 20 teraflops to rank at all, which shows how quickly Moore's Law and a lot of clever networking push HPC technology. Provided supercomputing centers can shift their codes to hybrid architectures, the price/performance of multicore x64 processors and their related GPUs is probably the horse to bet on. Exotic machines may have seen their heydays already. ®