IBM uncloaks 20 petaflops BlueGene/Q super
Lilliputian cores give Brobdingnagian oomph
SC10 Although everybody seems to be excited about GPU-goosed supercomputing these days, Big Blue is sticking to its Power-based, many-cored BlueGene and Blue Waters massively parallel supers, and revving them up to bust into the 20-petaflops zone.
The Blue Waters massively parallel Power7-based supercomputer, with its funky switching and interconnect and very dense packaging, was the big iron of last year's SC09 event in Portland, Oregon, which El Reg told you all about here. And we've covered the GPU additions to the iDataPlex bladish-rackish custom servers IBM builds, as well as the forthcoming GPU expansion blade for Big Blue's BladeCenter blade servers, which are due in December and which are also special-bid products.
But the BlueGene/Q super — made of fleets of embedded PowerPC processor cores — is still, in terms of aggregate number-crunching power, the biggest and baddest HPC box on the horizon from IBM for the next two years.
IBM lip-smackingly announced the sale of the "Sequoia" BlueGene/Q supercomputer to the US Department of Energy back in February 2009, just as the current BlueGene/P machines were ramping up production. But the company did not provide many details about the architecture, except that it would pack 1.6 million cores into a single system, would have 1.6PB of storage, a peak performance of 20 petaflops, and burn 6.6 megawatts of juice. The machine will be installed at Lawrence Livermore National Laboratory, which bought the first experimental BlueGene/L super.
This week IBM yanked a compute node and an I/O node out of the prototype portion of the future BlueGene/Q super that's installed at its Watson Research Center in New York and showcased them at the SC10 supercomputing show, the first outing of the BlueGene/Q system components.
To understand BlueGene/Q, you have to compare it to the prior BlueGene machines and their predecessors to see how far the design has come and why IBM still believes that the BlueGene approach — small cores, and lots of them — provides the best bang for the watt.
The original BlueGene/L machine was based on some early parallel-computing design work done in the early 1990s by IBM in conjunction with Columbia University, Brookhaven National Laboratory, and RIKEN (the big Japanese government-sponsored super lab) to make a massively parallel machine called QCDSP to do quantum chromodynamics calculations using digital signal processors.
A follow-on machine called QCDOC replaced the DSPs with embedded PowerPC processors, putting 64 compute nodes on a single board that interconnected with a proprietary backplane.
In December 1999, IBM ponied up $100m of its own dough to create the original BlueGene/L machine, aiming the box at massive protein-folding problems. Two years later, LLNL saw that such a machine could be used for nuclear weapons simulations and placed the first order for the prototype.
By the fall of 2004, a prototype of the BlueGene/L machine became the fastest supercomputer in the world, using eight BlueGene/L cabinets and 1,024 compute nodes for a sustained performance of 36 teraflops. That machine has been upgraded many times, and now has reached its full system configuration, which includes 65,536 compute nodes and 1,024 I/O nodes (both based on 32-bit PowerPC processors).
BlueGene/L held the top spot on the Top 500 ranking of supercomputers, which is based on the Linpack Fortran benchmark test, for four years. The machine is based on 32-bit PowerPC 440 cores that spin at 700MHz and are packed two to a die with a shared L2 and L3 cache. Each core has two floating-point units as well as memory controllers, on-chip Gigabit Ethernet interfaces, and the proprietary interconnect that implements a 3D torus interconnect (derived from the Columbia University machines) that runs the Message Passing Interface (MPI) clustering protocol to lash the nodes together like oxen pulling a cart.
The BlueGene/L machine at LLNL, which was first installed in 2005 and which has been upgraded a number of times, has 131,072 cores, 32TB of aggregate main memory, a peak performance of 367 teraflops, a sustained performance of 280.6 teraflops on the Linpack test, and burns around 1.2 megawatts. The machine is air-cooled.
IBM's current massively parallel box is the BlueGene/P, which puts four 850MHz PowerPC 450 cores on a chip along with the memory controllers, floating-point unit, and BlueGene interconnect, as well as a beefed-up 10 Gigabit Ethernet controller and the old Gigabit Ethernet port. Those PowerPC 450 cores are still 32-bit units, by the way.
Each BlueGene/P node can support 2GB of main memory (512MB for each core), and the 3D torus has 5.1GB/sec of bandwidth and somewhere between 160 nanoseconds and 1.3 microseconds of MPI point-to-point latency between its nearest peers in a single node. Compared with BlueGene/L, that's a factor of 2.4 more bandwidth and about 20 per cent lower latency.
The BlueGene/P collective network that brings the nodes together has 1.7GB/sec of bandwidth per port (2.4 times that of the BlueGene/L machine) and there are three ports per node that have a 2.5 microsecond latency talking to other nodes. In a worst-case scenario, where a node has to make 68 hops across 72 racks in the 3D torus to reach another node to get data, the latency is 5 microseconds, a big improvement over BlueGene/L, which took 7 microseconds to make the same hops.
An optical 10 Gigabit Ethernet network links the BlueGene/P nodes to the outside world and there is a Gigabit Ethernet network for controlling the system. The BlueGene/P system puts 1,024 compute nodes in a rack and from 8 to 64 I/O nodes (which plug into the same physical boards as the compute nodes) per rack. The machine delivers 13.9 teraflops per rack and can scale up to 256 racks, for 3.56 petaflops of peak (not sustained) number-crunching performance across more than 1 million cores.
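For readers who want to check the sums, here is a minimal Python sketch of the BlueGene/P peak arithmetic. The clock speed, cores per node, nodes per rack, and rack limit come from the figures above; the four floating-point operations per core per clock cycle is an assumption that the article does not state, and the function names are made up for illustration.

```python
# Back-of-the-envelope peak arithmetic for BlueGene/P.
CLOCK_HZ = 850e6        # 850MHz PowerPC 450 cores (from the article)
CORES_PER_NODE = 4      # four cores per chip (from the article)
NODES_PER_RACK = 1024   # compute nodes per rack (from the article)
MAX_RACKS = 256         # largest configuration (from the article)
FLOPS_PER_CYCLE = 4     # ASSUMPTION: not stated in the article


def rack_peak_teraflops() -> float:
    """Peak teraflops of one rack: clock x flops/cycle x cores x nodes."""
    flops = CLOCK_HZ * FLOPS_PER_CYCLE * CORES_PER_NODE * NODES_PER_RACK
    return flops / 1e12


def system_peak_petaflops(racks: int = MAX_RACKS) -> float:
    """Peak petaflops of a system built from the given number of racks."""
    return rack_peak_teraflops() * racks / 1000.0


def system_cores(racks: int = MAX_RACKS) -> int:
    """Total compute cores in a system of the given number of racks."""
    return racks * NODES_PER_RACK * CORES_PER_NODE


print(f"rack peak:   {rack_peak_teraflops():.2f} teraflops")
print(f"system peak: {system_peak_petaflops():.2f} petaflops")
print(f"cores:       {system_cores():,}")
```

Under that assumption the rack figure lands on roughly 13.93 teraflops, and the 256-rack total on roughly 3.57 petaflops with 1,048,576 cores. The article's 3.56 petaflops is what you get by multiplying the rounded 13.9 teraflops by 256.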
The BlueGene/P nodes, like their BlueGene/L predecessors, were air-cooled and put compute and I/O nodes on the same node boards. The BlueGene/P machines crammed twice as many cores onto a chip module (four cores instead of two) and twice as many compute nodes (32 instead of 16) onto a single compute drawer, basically quadrupling the cores and nearly quintupling floating-point performance.
The power drain on BlueGene/P also went up by a factor of 1.5, with a petaflops of peak oomph burning about 2.9 megawatts. But the performance per watt increased by 9 per cent, so it was a net gain on all fronts: performance and energy efficiency.
With the BlueGene/Q designs, IBM is doing a number of different things to boost the performance and energy efficiency of the massively parallel supers. First, the BlueGene/Q processors (called BGQ for short at IBM) bear some resemblance to IBM's Power7 chip used in its commercial servers, and an even stronger resemblance to the Power A2 "wire-speed" processors, which El Reg discussed in detail this year as they were announced.
Like these two commercial chips, the BlueGene/Q processor is a 64-bit chip with four threads per core. The BlueGene/Q processor module is a bit funky in that it has 17 cores on it, according to Brian Smith, a software engineer for the product who was demonstrating the compute and I/O modules at the SC10 expo. On that BGQ processor, one of the cores will run a Linux kernel and the other 16 are used for calculations, according to Smith.
The cores used in the BlueGene/Q prototype run at 1.6GHz, compared to the 2.3GHz speed on the sixteen-core Power A2 wire-speed processor. (The cores could be the same or very similar on both chips.) With the BlueGene/Q super, not only is the BGQ chip moving to 64 bits, but it also has four threads per core to increase its efficiency.
Oomph and gunk
Here's a photo of the BlueGene/Q compute node (pardon my photography, but the lighting conditions were awful on the show floor — and I am also not great with a lens):
IBM's BlueGene/Q 17-core compute node, blue gunk included
The chip in the middle of the compute node is the BGQ processor, which has the Power cores as well as memory controller and various interconnect features on it. The compute node is not fully populated with its DDR3 main memory, which is why some of it has blue gunk on it, which is covering the sockets where memory will be plugged in.
The interesting thing about the BlueGene/Q design is that it will be water cooled, with a spring-loaded aluminum jacket wrapping around the front and back of the compute node, which slides into its midplane socket on the compute drawer right between two copper pipes full of water.
When you press the BGQ compute node into its slot, there is a clip you push down, and that compresses the aluminum against the BGQ processor and memory chips on the node and against two adjacent, squared-off copper pipes filled with water. There is no special thermal contact material to keep the chips in contact with the aluminum or the aluminum in contact with the copper tubing. The spring provides 100 pounds of force and everything stays in contact so the heat can be drawn off the processor and memory and whisked away by water coursing through the pipes, thanks to thermodynamics.
The BlueGene/Q compute drawer
Smith said that the system design would allow BlueGene/Q to be cooled with water at 60 to 65 degrees Fahrenheit, which is fairly warm for a water-cooled system but increasingly normal as system makers realize they are overcooling both data centers and components because "that's the way we have always done things." There are no fans in the compute drawer, just two power supplies and pipes for water inlet and outflow.
The compute drawer has an interconnect that is fed by fiber optic links from each compute node (the orange wires in the photo) and this interconnect snaps into the midplane to link it to the other compute drawers and compute nodes in the BlueGene/Q cluster. The water comes in and cools the optical interconnection chips first, then swishes through the compute nodes.
The BlueGene/Q compute drawer has 32 compute modules (each a server node in the cluster), and each node will have 16GB of DDR3 main memory (1GB per core). A compute drawer has 512 cores, 2,048 threads, and 512GB of memory. A BlueGene/Q rack holds 32 of these compute drawers, which are half-depth, which means 16 in the front and 16 in the back. That's a stunning 1,024 server nodes in a rack (16,384 cores and the same number of gigabytes of memory). Across the 96 racks of the Sequoia machine, that adds up to 1.57 million cores dedicated to processing calculations, with another 98,304 cores for running the Linux kernel Big Blue uses for the BlueGene machines.
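The packaging math is easy to lose track of, so here is a short Python sketch that rolls the per-node figures up to drawer, rack, and full-system totals. Every number is taken from the article (including the 96 racks of the Sequoia order); the constant and function names are simply illustrative.

```python
# Roll-up of BlueGene/Q packaging figures quoted in the article.
NODES_PER_DRAWER = 32
DRAWERS_PER_RACK = 32
SEQUOIA_RACKS = 96
COMPUTE_CORES_PER_NODE = 16  # the 17th core runs the Linux kernel
KERNEL_CORES_PER_NODE = 1
THREADS_PER_CORE = 4
MEMORY_GB_PER_NODE = 16


def totals(nodes: int) -> dict:
    """Aggregate cores, threads, and memory for a given number of nodes."""
    compute_cores = nodes * COMPUTE_CORES_PER_NODE
    return {
        "nodes": nodes,
        "compute_cores": compute_cores,
        "kernel_cores": nodes * KERNEL_CORES_PER_NODE,
        # Threads are counted on the compute cores only, as in the article.
        "threads": compute_cores * THREADS_PER_CORE,
        "memory_gb": nodes * MEMORY_GB_PER_NODE,
    }


drawer = totals(NODES_PER_DRAWER)
rack = totals(NODES_PER_DRAWER * DRAWERS_PER_RACK)
sequoia = totals(NODES_PER_DRAWER * DRAWERS_PER_RACK * SEQUOIA_RACKS)

for name, level in (("drawer", drawer), ("rack", rack), ("sequoia", sequoia)):
    print(name, level)
```

The Sequoia line works out to 1,572,864 compute cores and 98,304 kernel cores, which is where the 1.57 million figure comes from.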
Another interesting fact: IBM is using a 5D mesh/torus interconnect to lash together the BlueGene/Q nodes, which quite possibly could mean it is moving backwards through time as well as across universes in the multiverse.
Actually, Smith said the way to think about the 5D interconnect was that you create a hypercube linkage between nodes, and then you link the vertices of the hypercubes together to make the 5D torus mesh. I know you had no problem at all visualizing that, but I'm not entirely sure that this is an accurate description of a 5D mesh/torus, so let's move on.
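For the curious, here is a minimal Python sketch of how neighbours work in a generic torus of any number of dimensions: each node sits on one ring per dimension and links to the node on either side of it on every ring. This is the textbook torus definition, not a description of IBM's actual wiring, and the 4 x 4 x 4 x 4 x 4 shape is purely hypothetical (picked only because it multiplies out to 1,024 nodes).

```python
from typing import List, Tuple

Coord = Tuple[int, ...]


def torus_neighbors(coord: Coord, dims: Coord) -> List[Coord]:
    """Return the distinct nodes one hop away from coord in a torus."""
    found = {}
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            hop = list(coord)
            hop[axis] = (coord[axis] + step) % size  # wrap around the ring
            if tuple(hop) != coord:
                found[tuple(hop)] = None  # dict keeps order, drops duplicates
    return list(found)


def max_hops(dims: Coord) -> int:
    """Worst-case hop count: halfway around the ring in every dimension."""
    return sum(size // 2 for size in dims)


HYPOTHETICAL_DIMS = (4, 4, 4, 4, 4)  # ASSUMPTION: not IBM's real layout
origin = (0, 0, 0, 0, 0)

print(len(torus_neighbors(origin, HYPOTHETICAL_DIMS)), "neighbours")
print(max_hops(HYPOTHETICAL_DIMS), "hops in the worst case")
```

In this toy layout every node has ten direct neighbours (two per dimension), against six in a 3D torus of the same ring size, which is the general appeal of adding dimensions: more links per node and fewer hops between far-flung nodes.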
With the BlueGene/Q design, IBM is breaking apart the I/O nodes from the compute nodes for two reasons. First, by breaking them up, they can scale independently of each other and users who need less I/O can add more compute to a given rack and therefore take up less space to get a given amount of work done. Also, the I/O processors, which are based on the same BGQ modules, are not so densely packed that you need to cool them with water.
The BlueGene/Q I/O node
The BlueGene/Q I/O drawer has eight nodes and eight slots for adding in 10 Gigabit Ethernet or InfiniBand PCI-Express peripheral cards (which you can see on the upper left).
The Sequoia super that Lawrence Livermore will be getting in 2012 — IBM said it'd be in late 2011 back when the deal was announced in February 2009, so there's been some apparent slippage — will consist of 96 racks and will be rated at 20.13 petaflops. Argonne National Laboratory said back in August that it wanted a BlueGene/Q box, too, and it will have 48 racks of compute drawers for a total of 10 petaflops of floating-point power.
For the November 2010 ranking of the Top 500 supercomputers, which was announced this week at SC10 in New Orleans, IBM slapped together a half-rack of BlueGene/Q iron (well, more literally aluminum and copper, as you saw), and that machine was able to hit 65.3 teraflops of performance on the Linpack test against a peak theoretical performance of 104.9 teraflops. That works out to a 62.3 per cent efficiency. That 1/192nd of the Sequoia BlueGene/Q machine ranked 114 on the Top 500 list, by the way.
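The efficiency number is just sustained Linpack divided by theoretical peak. A small Python sketch, using only the figures quoted above (the function and variable names are illustrative):

```python
from fractions import Fraction


def linpack_efficiency(sustained_tf: float, peak_tf: float) -> float:
    """Sustained Linpack as a percentage of theoretical peak."""
    if peak_tf <= 0:
        raise ValueError("peak performance must be positive")
    return 100.0 * sustained_tf / peak_tf


# Half-rack BlueGene/Q prototype figures from the article.
prototype_efficiency = linpack_efficiency(65.3, 104.9)

# Half a rack measured against the 96 racks of Sequoia.
share_of_sequoia = Fraction(1, 2) / 96

print(f"efficiency: {prototype_efficiency:.1f} per cent")
print(f"share of Sequoia: {share_of_sequoia}")
```

Fed the rounded teraflops figures, the ratio comes out a hair under the quoted 62.3 per cent, presumably because the published number was worked out from unrounded results. The half-rack is exactly 1/192nd of a 96-rack machine.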
El Reg was not able to find out if the BlueGene/Q interconnect was goosed in the machines in terms of bandwidth and latency, but presumably there has been lots of work here to balance the extra processor performance. A rack is now rated at somewhere around a peak 209.7 teraflops in the Q generation, compared to a 13.9 peak in the P generation. That's a huge leap in raw performance, and presumably one that requires faster interconnects to be more efficient.
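To put the generational jump in numbers, here is a rough Python sketch built from figures in the article: the per-rack peaks, the 16,384 compute cores per rack, the 1.6GHz prototype clock, and the 96 racks of Sequoia. The flops-per-cycle value it derives is an inference from those numbers, and it assumes the prototype clock speed is the one behind the rack rating, which the article does not confirm.

```python
# Per-rack peak comparison between the P and Q generations.
RACK_PEAK_Q_TF = 209.7   # BlueGene/Q peak teraflops per rack
RACK_PEAK_P_TF = 13.9    # BlueGene/P peak teraflops per rack
CORES_PER_RACK_Q = 16_384
CLOCK_HZ_Q = 1.6e9       # ASSUMPTION: prototype clock applies to the rating
SEQUOIA_RACKS = 96


def generation_speedup() -> float:
    """How many times more peak oomph a Q rack has than a P rack."""
    return RACK_PEAK_Q_TF / RACK_PEAK_P_TF


def implied_flops_per_cycle() -> float:
    """Peak flops per core per clock implied by the rack rating."""
    return RACK_PEAK_Q_TF * 1e12 / (CORES_PER_RACK_Q * CLOCK_HZ_Q)


def sequoia_peak_petaflops() -> float:
    """Peak petaflops for the full 96-rack Sequoia order."""
    return RACK_PEAK_Q_TF * SEQUOIA_RACKS / 1000.0


print(f"rack speedup, Q over P: {generation_speedup():.1f}x")
print(f"implied flops per core per cycle: {implied_flops_per_cycle():.2f}")
print(f"Sequoia peak: {sequoia_peak_petaflops():.2f} petaflops")
```

That is roughly a fifteenfold jump per rack, and the rack rating is consistent with each core retiring about eight floating-point operations per clock at 1.6GHz. Multiply by 96 racks and you get the 20.13 petaflops quoted for Sequoia.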
If IBM did not substantially change the interconnect, that might explain why the BlueGene/L at Lawrence Livermore (ranked number 12 on the list at 478.2 teraflops) has an 80.2 per cent efficiency comparing sustained Linpack versus peak theoretical performance, and the BlueGene/P at Argonne (ranked number 13 at 458.6 teraflops) has an efficiency of 82.3 per cent, while the BlueGene/Q prototype manages only 62.3 per cent.
The Jugene 825.5 teraflops BlueGene/P super at Forschungszentrum Juelich in Germany is also delivering an 82.3 per cent efficiency on the Linpack test. By comparison, BlueGene/Q is not terribly efficient. But it is also early days in the design. It is still, after all, a prototype, just like BlueGene/L was in 2005 and BlueGene/P was in 2007. ®