Original URL: https://www.theregister.com/2009/11/27/ibm_power7_hpc_server/

IBM shows off Power7 HPC monster

Big Blue unveils big box: Crowd goes wild

By Timothy Prickett Morgan

Posted in Channel, 27th November 2009 13:02 GMT

SC09 IBM likes to go on and on about the transaction processing power and I/O bandwidth of its System z mainframes, but now there is a new and much bigger kid on the block.

Its name is the Power Systems IH supercomputing node, based on the company's forthcoming Power7 processors and a new homegrown switching system that blends optical and copper interconnects.

The Power7 IH node was on display at the SC09 supercomputer trade show last week in Portland, Oregon, and El Reg was on hand to get the scoop from the techies who designed the iron. This server node is the heart of the 1 petaflop "Blue Waters" supercomputer being installed at the University of Illinois. (That's sustained, not peak, performance.)

As we have previously reported, IBM lifted the veil a bit on its Power7 family of chips and chip packages at the Hot Chips conference in August. The Power7 chips are implemented in a 45 nanometer copper/SOI process and have 1.2 billion transistors with eight cores on a single die.

Each Power7 core has 12 execution units: two fixed point units, two load store units, four double-precision floating point units, one vector unit (for doing matrix maths), one decimal floating point unit (for doing money maths), one branch unit, and one condition register unit. Those floating point units, like those in all past generations of 64-bit Power processors that trace their heritage back to IBM's AS/400 designs from 1995 (not the crappy PowerPC chips that came out of the IBM-Motorola-Apple partnership), can do two floating point operations per clock cycle.

The Power7 core has 32 KB of L1 instruction cache and 32 KB of L1 data cache. Each core sports simultaneous multithreading that delivers four virtual threads per core, and has 256 KB of L2 cache tightly coupled to it. The chip also has 32 MB of embedded DRAM that acts as a shared L3 cache, with a 4 MB segment affiliated with each of the eight cores. There are two dual-channel DDR3 memory controllers implemented on the chip, which deliver 100 GB/sec of sustained memory bandwidth per chip.

Power7 and Switch Packages

The Power7 IH four-chip package and its companion switch with mixed optical/copper interconnect

At SC09, IBM gave out a lot more details on the four-chip multichip module (MCM) that it said it was cooking up for supercomputing customers - and by the way, the Power7 IH server node is a lot denser, in terms of flops per unit volume, than anyone had been led to believe. The picture at the left shows the Power7 IH node MCM chip package, with the four eight-core Power7 chips. Next to it is the IH node hub/switch, which is implemented in a similar chip package.

Both chip packages have the same pin count, at 5,336 pins in a 92 by 58 grid, according to Alan Benner, a senior technical staff member on the server and network architecture team within IBM's Systems and Technology Group; more on the hub/switch in a minute. The four-chip Power7 package has 512 GB/sec of aggregate memory bandwidth and 192 GB/sec of I/O bandwidth.

As we had been anticipating, IBM is indeed dropping the clock speeds of the Power7 chips, and it now confirms that for this Power7 IH MCM, clock speeds will range from 3.5 GHz to 4 GHz. The current Power6+ chips top out at 5 GHz, but they only have two cores per chip and fewer execution units. IBM says that this single package will deliver just north of one teraflops of number-crunching power using just the floating point units.
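A quick back-of-the-envelope sketch shows where that figure comes from, assuming each of the four double-precision FPUs retires a fused multiply-add (two flops) per cycle at the top 4 GHz bin:

```python
# Peak double-precision throughput for one Power7 IH MCM
# (assumes 2 flops per FPU per cycle, i.e. fused multiply-add)
fpus_per_core = 4
flops_per_fpu_per_cycle = 2
cores_per_chip = 8
chips_per_mcm = 4
clock_ghz = 4.0  # top of the quoted 3.5 GHz to 4 GHz range

gflops_per_chip = fpus_per_core * flops_per_fpu_per_cycle * cores_per_chip * clock_ghz
gflops_per_mcm = gflops_per_chip * chips_per_mcm

print(gflops_per_chip)  # 256.0 gigaflops per chip
print(gflops_per_mcm)   # 1024.0 gigaflops per MCM - "just north of one teraflops"
```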

At 800 watts, the package is not cool by any means, but the Power7 IH MCM delivers 1.28 gigaflops per watt at the package level. A Xeon 5500 chip from Intel can do four floating point operations per clock per core, or 16 per clock across the four cores on a die, which means the top-speed X5570 running at 2.93 GHz and rated at 95 watts can deliver 46.9 gigaflops, or only 493 megaflops per watt at the chip level.

Dropping down to the 80-watt E5540, which runs at 2.53 GHz, helps a bit, delivering 506 megaflops per watt, and stepping down to the 60-watt L5530 running at 2.4 GHz gives you 640 megaflops per watt. The Power7 module is precisely twice as good, but you can damned sure bet it will cost a lot more than twice as much.
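Here is the flops-per-watt arithmetic behind those comparisons in one place (peak flops per cycle times clock, divided by rated wattage); note that the E5540's 2.53 GHz clock is Intel's published spec, not a figure IBM quoted:

```python
# Peak gigaflops per watt for the chips compared above
def gflops_per_watt(flops_per_cycle, clock_ghz, watts):
    return flops_per_cycle * clock_ghz / watts

# Power7 IH MCM: 4 chips x 8 cores x 8 flops per cycle = 256 flops per cycle
print(round(gflops_per_watt(256, 4.00, 800), 3))  # 1.28
# Xeon 5500s: 4 cores x 4 flops per cycle = 16 flops per cycle per die
print(round(gflops_per_watt(16, 2.93, 95), 3))    # 0.493 (X5570)
print(round(gflops_per_watt(16, 2.53, 80), 3))    # 0.506 (E5540)
print(round(gflops_per_watt(16, 2.40, 60), 3))    # 0.64  (L5530)
```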

The Power7 IH node, as you can see in this picture, is not small. It is 39 inches wide by 6 feet deep, including space for cables. The IH node drawer is 2U high and has room for eight of these Power7 IH MCMs, for a total of 256 cores. There are two monster motherboards underpinning the processors, their memory, and the hub/switches and their interconnects. These mobos are manufactured by Japanese server maker Hitachi, and Benner said they are among the largest motherboards ever made.

Power7 IH HPC Server Node

The Power7 IH HPC server node

The IH nodes are completely water-cooled, with water blocks on the Power7 MCM packages, on the 8 GB DDR3 memory modules IBM specially designed for the box, and on the Power7 IH hub/switches, which have not yet been given a proper name.

The memory modules include buffers on the DIMMs, which IBM also designed, to help accelerate their performance. There are 16 DIMM slots per socket in the Power7 IH node, and IBM is using 8 GB DIMMs, yielding 4 GB per core.

A total of 1 TB of main memory sits on each drawer, and the fully loaded Blue Waters box will have 2 PB of main memory. IBM is being a bit cagey about the memory architecture, but the Power7 chips have some features to implement a kind of global address space (not cache-coherent shared memory as in SMP and NUMA servers). It will be interesting to see how this global address space is architected and how it performs.
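The memory math is easy enough to verify from the figures above:

```python
# Main memory per drawer, from the DIMM counts above
dimm_gb = 8
dimms_per_socket = 16
sockets_per_drawer = 8
cores_per_socket = 32  # 4 chips x 8 cores per MCM

gb_per_socket = dimm_gb * dimms_per_socket
gb_per_drawer = gb_per_socket * sockets_per_drawer

print(gb_per_socket)                     # 128 GB per MCM socket
print(gb_per_socket / cores_per_socket)  # 4.0 GB per core
print(gb_per_drawer)                     # 1024 GB, i.e. 1 TB per drawer
```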

In this picture, the power supplies are to the right. Moving leftward, there are banks of DDR3 memory, the eight Power7 IH MCM sockets, another bank of memory, and then the IH hub/switch modules with optical links going out to the left and right (the orange cables); these route out to the back and emerge as optical links to other server nodes in a cluster. The left side of the chassis (which is the back of the rack in the Blue Waters machine) is also where there are 16 PCI-Express 2.0 x16 slots and an extra x8 slot, just for the heck of it.

The water cooling links into the nodes through the front of the chassis, which is to the right in this picture. The chassis weighs a little more than 300 pounds fully loaded. A dozen of these drawers, plus up to 1 PB of local storage, can be put into a specially designed rack, which delivers 98.3 teraflops of number-crunching power.
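Those rack-level flops follow straight from the per-MCM arithmetic worked out earlier:

```python
# Peak flops per drawer and per rack, using ~1.024 teraflops per MCM
tflops_per_mcm = 1.024
mcms_per_drawer = 8
drawers_per_rack = 12

tflops_per_drawer = tflops_per_mcm * mcms_per_drawer
tflops_per_rack = tflops_per_drawer * drawers_per_rack

print(round(tflops_per_drawer, 3))  # 8.192 teraflops per drawer
print(round(tflops_per_rack, 1))    # 98.3 teraflops per rack
```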

Power7 IH Node Hub/Switch Network

The Power7 IH node hub/switch network

The hub/switch at the heart of the Power7 IH node, which also links those nodes to each other, is the secret sauce of this machine. Benner would not elaborate much on this network, but did confirm that it borrows ideas from the "Federation" SP hub/switch that IBM created for ASCI Purple and other supers running AIX, as well as from the InfiniBand switches and related InfiniBand technologies Big Blue has been using for years to link Power5 and Power6 processors to remote I/O drawers.

Benner did brag that the hub/switch technology in the IH node "was better than both" Federation and InfiniBand, and said that one of the key distinctions is that it presents a two-level topology to all of the nodes in the network of machines. Within a node, all of the processors are linked to each other electrically through the motherboard, with traffic controlled by the IH node hub/switch.

The optical interconnects mount onto the top of the hub/switch - the squares on top are actually made up of a grid of small optical transceivers, with each square delivering 10 Gb/sec of bandwidth, according to Benner. The hub/switch modules and the Power7 IH MCMs are put together at IBM's Bromont, Quebec, facility in Canada, which is also where the Sony PlayStation 3 and Microsoft Xbox 360 chip packages are manufactured. IBM's East Fishkill, New York, wafer baker is where the Power7 chips and the chips that make up the hub/switch are cooked up.

The way the Power7 IH node interconnect works is simple: most of the optical links that come out of the back end of the box are used to lash nodes together into a supernode, which is four drawers of capacity rated at 32 teraflops. The hub/switch interconnect shown at SC09 can currently scale to 512 supernodes, which works out to 16.4 petaflops. (IBM is going to have to overclock this puppy to 4.88 GHz to hit 20 petaflops, apparently.)
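Both the 16.4 petaflops figure and that tongue-in-cheek overclocking remark check out if you start from IBM's round 32 teraflops per supernode:

```python
# Supernode and full-system peak, from the figures above
tflops_per_supernode = 32.0  # four drawers at roughly 8 teraflops each
max_supernodes = 512

pflops_max = tflops_per_supernode * max_supernodes / 1000.0
print(pflops_max)  # 16.384 petaflops at the 4 GHz top bin

# Clock needed to reach 20 petaflops with the same core count,
# scaling linearly from a 4 GHz base
print(round(4.0 * 20.0 / pflops_max, 2))  # 4.88 GHz
```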

Benner said that the hub/switch module delivers 1,128 GB/sec - that's bytes, not bits - of aggregate bandwidth. That breaks down as 192 GB/sec into each Power7 MCM (what IBM calls a host connection), 336 GB/sec of connectivity to the seven other local nodes on the drawer, 240 GB/sec of bandwidth between the nodes in a four-drawer supernode, and 320 GB/sec dedicated to linking nodes to remote nodes, plus another 40 GB/sec of general purpose I/O bandwidth.
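Those component figures do indeed add up to the headline number:

```python
# Aggregate bandwidth through one hub/switch module, in GB/sec
links = {
    "host connection into the local Power7 MCM": 192,
    "links to the seven other local nodes on the drawer": 336,
    "links between nodes in a four-drawer supernode": 240,
    "links to remote nodes": 320,
    "general purpose I/O": 40,
}
print(sum(links.values()))  # 1128 GB/sec
```

Sure enough, the tally comes to 1,128 GB/sec. ®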