Feeds

IBM uncloaks 20 petaflops BlueGene/Q super

Lilliputian cores give Brobdingnagian oomph

  • alert
  • submit to reddit

High performance access to file storage

SC10 Although everybody seems to be excited about GPU-goosed supercomputing these days, Big Blue is sticking to its Power-based, many-cored BlueGene and Blue Waters massively parallel supers, and revving them up to bust into the 20-petaflops zone.

The Blue Waters massively parallel Power7-based supercomputer and its funky switching and interconnect, and very dense packaging were the big iron of last year's SC09 event in Portland, Oregon, which El Reg told you all about here. And we've covered the GPU additions to the iDataPlex bladish-rackish custom servers IBM builds, as well as the forthcoming GPU expansion blade for Big Blue's BladeCenter blade servers, which are due in December and which are also special-bid products.

But the BlueGene/Q super — made of fleets of embedded PowerPC processor cores — is still, in terms of aggregate number-crunching power, the biggest and baddest HPC box on the horizon from IBM for the next two years.

IBM lip-smackingly announced the sale of the "Sequoia" BlueGene/Q supercomputer to the US Department of Energy back in February 2009, just as the current BlueGene/P machines were ramping up production. But the company did not provide many details about the architecture, except that it would pack 1.6 million cores into a single cabinet, would have 1.6PB of storage, a peak performance of 20 petaflops, and burn 6.6 megawatts of juice. The machine will be installed at Lawrence Livermore National Laboratory, which bought the first experimental BlueGene/L super.

This week IBM yanked a compute node and an I/O out of the prototype portion of the future BlueGene/Q super that's installed at its Watson Research Center in New York and showcased them at the SC10 supercomputing show, the first outing of the BlueGene/Q system components.

To understand BlueGene/Q, you have to compare it to the prior BlueGene machines and their predecessors to see how far the design has come and why IBM still believes that the BlueGene approach — small cores, and lots of them — provides the best bang for the watt.

The original BlueGene/L machine was based on some early parallel-computing design work done in the early 1990s by IBM in conjunction with Columbia University, Brookhaven National Laboratory, and RIKEN (the big Japanese government-sponsored super lab) to make a massively parallel machine called QCDSP to do quantum chromodynamics calculations using digital signal processors.

A follow-on machine called QCDOC replaced the DSPs with embedded PowerPC processors, putting 64 compute nodes on a single board that interconnected with a proprietary backplane.

In December 1999, IBM ponied up $100m of its own dough to create the original BlueGene/L machine, aiming the box at massive protein-folding problems. Two years later, LLNL saw that such a machine could be used for nuclear weapons simulations and placed the first order for the prototype.

By the fall of 2004, a prototype of the BlueGene/L machine became the fastest supercomputer in the world, using eight BlueGene/L cabinets and 1,024 compute nodes for a sustained performance of 36 teraflops. That machine has been upgraded many times, and now has reached its full system configuration, which includes 65,536 compute nodes and 1,024 I/O nodes (both based on 32-bit PowerPC processors).

BlueGene/L held the top spot on the Top 500 ranking of supercomputers, which is based on the Linpack Fortran benchmark test, for four years. The machine is based on single-core 32-bit PowerPC 440 processors that spin at 700MHz and which are packed two cores to a die with a shared L2 and L3 cache. Each core has two floating-point units as well as memory controllers, on-chip Gigabit Ethernet interfaces, and the proprietary interconnect that implements a 3D torus interconnect (derived from the Columbia University machines) that runs the Message Passing Interface (MPI) clustering protocol to lash the nodes together like oxen pulling a cart.

The BlueGene/L machine at LLNL, which was first installed in 2005 and which has been upgraded a number of times, has 131,072 cores, 32TB of aggregate main memory, a peak performance of 367 teraflops, a sustained performance of 280.6 teraflops on the Linpack test, and burns around 1.2 megawatts. The machine is air-cooled.

IBM's currently selling massively parallel box is the BlueGene/P, which puts four 850MHz PowerPC 450 cores on a chip with the memory controllers, floating point unit, and BlueGene interconnect on the chips as well as a beefed-up 10 Gigabit Ethernet controller and the old Gigabit Ethernet port on the chip. Those PowerPC 450 cores are still 32-bit units, by the way.

Each BlueGene/P node can support 2GB of main memory (512MB for each core), and the 3D torus has 5.1GB/sec of bandwidth and somewhere between 160 nanoseconds and 1.3 microseconds of MPI point-to-point latency between its nearest peers in a single node — that's a factor of 2.4 more bandwidth and about 20 per cent lower latency.

The BlueGene/P collective network that brings the nodes together has 1.7GB/sec of bandwidth per port (2.4 times that of the BlueGene/L machine) and there are three ports per node that have a 2.5 microsecond latency talking to other nodes. In a worst-case scenario, where a node has to make 68 hops across 72 racks in the 3D torus to reach another node to get data, the latency is 5 microseconds, a big improvement over BlueGene/L, which took 7 microseconds to make the same hops.

An optical 10 Gigabit Ethernet network links the BlueGene/P nodes to the outside world and there is a Gigabit Ethernet network for controlling the system. The BlueGene/P system puts 1,024 compute nodes in a rack and from 8 to 64 I/O nodes (which plug into the same physical boards as the compute nodes) per rack. The machine delivers 13.9 teraflops per rack and can scale up to 256 racks, for a 3.56 petaflops of peak (not sustained) number-crunching performance across more than 1 million cores.

The BlueGene/P nodes, like their BlueGene/L predecessors, were air-cooled and put compute and I/O nodes on the same node boards. The BlueGene/P machines crammed twice as many cores onto a chip module (four cores instead of two) and twice as many compute nodes (32 instead of 16) onto a single compute drawer, basically quadrupling the cores and nearly quintupling floating-point performance.

The power drain on BlueGene/P also went up by a factor of 1.5, with a petaflops of peak oomph burning about 2.9 megawatts. But the performance per watt increased by 9 per cent, so it was a net gain on all fronts: performance and energy efficiency.

With the BlueGene/Q designs, IBM is doing a number of different things to boost the performance and energy efficiency of the massively parallel supers. First, the BlueGene Q processors — called BGQ for short at IBM — bear some resemblance to IBM's Power7 chip used in its commercial servers, and an even stronger resemblance to the Power A2 "wire-speed" processors, which El Reg discussed in detail this year as they were announced.

Like these two commercial chips, the BlueGene/Q processor is a 64-bit chip with four threads per core. The BlueGene/Q processor module is a bit funky in that it has 17 cores on it, according to Brian Smith, a software engineer for the product who was demonstrating the compute and I/O modules at the SC10 expo. On that BGQ processor, one of the cores will run a Linux kernel and the other 16 are used for calculations, according to Smith.

The cores used in the BlueGene/Q prototype run at 1.6GHz, compared to the 2.3GHz speed on the sixteen-core Power A2 wire-speed processor. (The cores could be the same or very similar on both chips.) With the BlueGene/Q super, not only is the BGQ chip moving to 64-bits, but it also has four threads per core to increase its efficiency.

High performance access to file storage

Next page: Oomph and gunk

More from The Register

next story
Seagate brings out 6TB HDD, did not need NO STEENKIN' SHINGLES
Or helium filling either, according to reports
European Court of Justice rips up Data Retention Directive
Rules 'interfering' measure to be 'invalid'
Dropbox defends fantastically badly timed Condoleezza Rice appointment
'Nothing is going to change with Dr. Rice's appointment,' file sharer promises
Cisco reps flog Whiptail's Invicta arrays against EMC and Pure
Storage reseller report reveals who's selling what
Bored with trading oil and gold? Why not flog some CLOUD servers?
Chicago Mercantile Exchange plans cloud spot exchange
Just what could be inside Dropbox's new 'Home For Life'?
Biz apps, messaging, photos, email, more storage – sorry, did you think there would be cake?
IT bods: How long does it take YOU to train up on new tech?
I'll leave my arrays to do the hard work, if you don't mind
prev story

Whitepapers

Mainstay ROI - Does application security pay?
In this whitepaper learn how you and your enterprise might benefit from better software security.
Five 3D headsets to be won!
We were so impressed by the Durovis Dive headset we’ve asked the company to give some away to Reg readers.
3 Big data security analytics techniques
Applying these Big Data security analytics techniques can help you make your business safer by detecting attacks early, before significant damage is done.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Mobile application security study
Download this report to see the alarming realities regarding the sheer number of applications vulnerable to attack, as well as the most common and easily addressable vulnerability errors.