'Double Stuf' Power7+ sockets: Yummy, but so is overclocking
IBM needs to swing both ways in server land
Analysis IBM's first Power7+ processor systems are expected to launch on October 3, but El Reg has a modest proposal for Big Blue as it prepares its rolling rollout: take a page from the Oreo Cookies cookbook, and "Double Stuf" 'em up and down the line.
We've already given you the low-down on the Power7+ processors for IBM's Power Systems servers: as it has done with the Power5+ and Power6+ generations of machines, and as AMD has done with the Opteron 6100 and 6200 processors, the Power7+ machines will feature some parts that have two complete Power7+ processors, with their clock speeds geared down, stuffed into a single processor socket, thereby doubling up the cores, threads, and cache available inside the system.
When IBM talked about this at the recent Hot Chips 24 chippery fiesta, the company's top brass did not say which parts of the product line they would double down, but in the past IBM has done it with entry and midrange boxes where x86 iron was giving it the most intense competition.
With the Power7 generation, IBM went in the other direction, offering a variant of the Power 770 enterprise-class box called the Power 780 that allowed customers to turn half of its cores permanently off and Turbo Core the processors to a slightly higher clock speed, effectively doubling the cache per core. This was a good option for single-threaded jobs and database workloads with a heavy batch component, in which the higher clock speed matters more than parallelism.
If you are an AIX customer running DB2 or Oracle databases (which are priced on a per-core basis) or an IBM i customer (which is also priced per core and includes the DB2 for i relational database in the operating system license), you are going to need to do some math to try to figure out if a doubled-up Power7+ machine will be right for you.
Some variants of the forthcoming Power7+ systems are optimized for higher-speed single-chip modules (SCMs), which have one eight-core die in a socket, and others are optimized for lower-speed dual-chip modules (DCMs) that put two dies in a single package and double up the cores and caches at a slightly lower clock speed.
The DCMs, like prior doubled-up sockets in the Power5+ and Power6+ generations, run at a lower clock speed – or else the sockets would burst into flame – but the extra cores and threads allow for the overall doubled-up system to do more work on multithreaded jobs.
Conceptually, here is how Scott Taylor, one of the chip architects who worked on both the Power7 and Power7+ processors, illustrated the differences between the SCMs and DCMs at Hot Chips 24:
Single and 'Double Stuf' Power7+ sockets – pick carefully
Just to remind you how far IBM can push this, you will remember that with the Power7 chip, IBM also had a quad-chip module, or QCM, that put four eight-core Power7 chips on a single package, which was packaged up in the "Blue Waters" Power 775 machine that was going into the University of Illinois' National Center for Supercomputing Applications.
IBM canceled the Blue Waters contract with the University of Illinois last summer, presumably because it could not make money on the deal, and Cray subsequently won a $188m deal to supply a Blue Waters machine that will pack somewhere between 10 and 20 petaflops of floating point oomph using a hybrid CPU-GPU design.
I said at the time, mostly in jest because it would never happen, that IBM should resurrect this machine as a 1.5 million CPW server drawer for IBM i-based clouds. Considering that the stupidly fast Torrent hub/switch interconnect can lash together 2,048 Power 775 nodes, you could build a cloud in one data center with 2.1 billion – that's billion with a B – CPWs of aggregate computing oomph across 1,365 compute nodes with 349,440 cores and 342 storage nodes.
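The back-of-the-envelope sums behind that hypothetical cloud are easy to check. Here is a minimal Python sketch that uses only the figures quoted above; the variable names are ours, and because the 1.5 million CPW per drawer is a rounded number, the total it produces lands a shade under the 2.1 billion headline figure.

```python
# Figures quoted in the article for the hypothetical Power 775 cloud.
CPW_PER_NODE = 1_500_000   # rounded CPW rating per server drawer
COMPUTE_NODES = 1_365      # compute nodes in the cluster
TOTAL_CORES = 349_440      # cores across all compute nodes
CORES_PER_QCM = 4 * 8      # four eight-core Power7 chips per module

# Work back from the totals to the per-node configuration.
cores_per_node = TOTAL_CORES // COMPUTE_NODES
qcms_per_node = cores_per_node // CORES_PER_QCM

# Aggregate commercial throughput across all compute nodes.
aggregate_cpw = CPW_PER_NODE * COMPUTE_NODES

print(f"{cores_per_node} cores per node ({qcms_per_node} QCMs)")
print(f"{aggregate_cpw / 1e9:.2f} billion CPW in aggregate")
```

The core count divides evenly into the node count, which is a reasonable sign that the quoted totals hang together.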
CPW, by the way, is short for Commercial Processing Workload, an internal variant of the TPC-C online transaction processing test that IBM uses to gauge the relative performance of OS/400 and IBM i workloads.
At an entry P05 software tier for the cores, the hardware and the software for such a beast might cost $3bn. Running an AIX/DB2 combo on this hypothetical Blue Waters monster, the same iron would cost the same $3bn and deliver an aggregate of around 4.29 million rPerfs of performance, based on a different TPC-C variant that IBM uses to gauge the performance of commercial AIX workloads. This would be roughly equivalent to 1,526 of IBM's top-end 32-socket, 256-core Power 795 machines using the Power7 chips.
Now, with the Power7+, IBM could double up the core count for about the same system price by octo-stuffing the Power 775 processor modules. That's silliness, of course. IBM is not going to do that unless some very big supercomputer centers pay it to do so, and it is fair to say that it will never certify IBM i on a Power 775 node or any follow-on, and AIX plus DB2 is equally unlikely. But it is an interesting thought experiment that shows IBM could push clustered databases into the stratosphere if it truly wanted to.
What is clear is that the double-stuffed Power7+ sockets present customers with some interesting options and tough choices, and depending on pricing – which is a complete unknown at this point – they might be as economically interesting as they are technically appealing.
Die shot of the Power7+ chip from IBM
IBM has not given precise clock speeds for the Power7+ processors, but Taylor said at the Hot Chips conference that the shrink to 32 nanometer processes from the 45 nanometer processes used to make the Power7 processors would allow for the clock speeds to rise by 25 per cent with the Power7+ chips.
The Power7 chips run at between 3GHz and 4GHz at the moment, with the 3GHz chips used in entry Power 720 rack and tower systems and PS701 blades and the 4GHz chips running in the Power 795 beast, so we can expect Power7+ clock speeds to be in the range of 3.75GHz to 5GHz, yielding at least a 25 per cent boost in single-thread performance, and very likely more than that since the L3 cache has been expanded by a factor of 2.5 to 80MB across an eight-core chip.
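As a quick sanity check on that projection, here is a small Python sketch. It assumes, as the paragraph above does, that the 25 per cent uplift applies evenly across the whole clock range; the function and constant names are ours, not IBM's.

```python
POWER7_CLOCKS_GHZ = (3.0, 4.0)   # current Power7 range quoted above
SHRINK_UPLIFT = 0.25             # 25 per cent, per Taylor at Hot Chips
POWER7_PLUS_L3_MB = 80           # L3 across an eight-core Power7+ chip
L3_GROWTH = 2.5                  # expansion factor quoted above


def scale_clocks(clocks_ghz, uplift):
    """Apply a fractional clock uplift to a (low, high) GHz range."""
    low, high = clocks_ghz
    return low * (1 + uplift), high * (1 + uplift)


# Projected Power7+ SCM clock range, assuming a uniform uplift.
power7_plus_clocks = scale_clocks(POWER7_CLOCKS_GHZ, SHRINK_UPLIFT)

# L3 capacity of the Power7 chip implied by the 2.5x expansion.
implied_power7_l3_mb = POWER7_PLUS_L3_MB / L3_GROWTH

print(power7_plus_clocks, implied_power7_l3_mb)
```

In practice the uplift need not be uniform across bins, so treat the projected range as a ceiling on expectations rather than a spec sheet.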
Double-stuffed parts: twice the threads and cache
Taylor also said in passing that the DCM variants of the chips would run at approximately the same speed as the current Power7 chips – that's between 3GHz and 4GHz, yielding roughly the same potential clock cycles for apps but with 2.5 times the L3 cache and other tweaks (like larger main memory, presumably faster PCI-Express I/O, and on-chip accelerators for memory compression and other functions). So even at the same clock speeds, it seems likely that the DCM variants of the Power7+ chips will yield more performance than the current Power7s of equivalent clock speeds.
What I can tell you for sure about Power7 versus Power7+ SCM and DCM performance is this: they have considerably more performance. Here's a chart that Taylor tossed up in his presentation, showing the relative performance normalized across the same number of cores, presumably using top-end parts in the same thermal bands to make the comparison fair – he admitted that he was not exactly sure what the comparisons were:
Relative performance of single and double Power7 server sockets
If the bars on this chart are proportional to the performance increase, then a Power7+ SCM is delivering around 32 per cent better performance on ERP software and about 37 per cent better performance on raw integer work than the Power7 SCM it replaces. For work in which threads and L3 cache matter, the performance gain from the Power7 to the Power7+, core for core and SKU for SKU, is around 82 per cent for OLTP workloads and around 55 per cent for Java workloads.
As you can see from the chart, raw integer work and ERP software performance do not spike as much using the Power7+ DCMs as do database and Java work, where threads and L3 cache are king. A Power7+ DCM has more than twice the performance of a Power7 chip running at about the same clock speed, which just goes to show you how important cache really is.
But why stop there?
With the power-gating features on the Power7+ chips (which I detailed here), the Turbo Core mode clock speed for these Power7+ chips, whether an SCM or a DCM, could be considerably higher still.
In the prior Power7 machines, only a few of which have Turbo Core mode, the clock speed uplift of turning half the cores off within the die was only 5.6 to 6.3 per cent – not a lot of extra clock speed. But when you turn off half the cores, you keep on all of the L3 cache memory, and that doubles the caches on the remaining cores. This can have a dramatic effect on the performance of cache-sensitive workloads such as Java apps and the databases they smack.
Now that IBM has power gating on individual cores and caches on the Power7+ chip (which it did not have with the Power7s), it can now not only turn off elements not in use, but also boost the clock speeds a bit higher in Turbo Core mode – and do so on the fly. Imagine, then, a special key that might permanently block activating all of the cores on a machine equipped with either the SCM or DCM versions of the Power7+ chips.
Let's see what happens.
Maybe Turbo Core mode can boost clock speeds on the SCMs to between 4.1GHz and 5.5GHz, which is a 10 per cent bump in performance all by itself just from the clocks. And having 80MB of cache spread across those four cores in the modified Power7+ (and remember that Intel's new Xeon E5 chip has 20MB of L3 cache across eight cores running at a top speed of 2.9GHz) should also boost performance. On OLTP work, if 25 points of the 82 per cent gain from Power7 to Power7+ come through clock speeds, then boosting the L3 cache by a factor of 2.5 accounts for the other 57 per cent or so.
Cache misses on such a fast processor are a big deal, which is why IBM has a 48MB L3 cache across six cores on the new System zEnterprise EC12 mainframe engine, which clocks at 5.5GHz, plus another 384MB of L4 cache front-ending that L3 cache. This L4 cache sits on the SMP hub/switch at the heart of each system board in the System z mainframe.
Now ask yourself this: what would be the effect of boosting the L3 cache per core by another factor of two? By my simple math, it should be around 45 per cent more oomph, if that L3 cache can be kept fed from the chip interconnect and main memory.
So, a Turbo Core version of the Power7+ chip with all of its L3 caches on and four cores power gated and shut down might yield another 55 per cent more oomph per core. When you do the math on this hypothetical Power7+ processor, that gives you 40 per cent more OLTP throughput per socket with half as many cores as the Power7 chip it would replace.
In other words, that's 40 per cent more work done than an eight-core Power7 socket with half the IBM i or AIX/DB2 software bill because you are only running on four cores with the Power7+ chip.
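For anyone who wants to poke at the assumptions, here is one plausible reconstruction of that simple math as a Python sketch. It assumes the OLTP gain splits additively into a clock part and a cache part, and that the cache part scales linearly with L3 capacity per core; both are our reading of the reasoning above, not anything IBM has published, and the function and parameter names are hypothetical.

```python
def turbo_socket_estimate(base_gain=0.82, clock_share=0.25, l3_growth=2.5,
                          turbo_clock_gain=0.10, extra_l3_factor=2.0,
                          active_cores=4, power7_cores=8):
    """Estimate OLTP throughput of a half-cores Turbo Core Power7+ socket
    relative to a full eight-core Power7 socket.

    Returns (socket_ratio, turbo_gain_per_core, licensed_core_ratio).
    """
    # Portion of the Power7 -> Power7+ OLTP gain attributed to cache.
    cache_share = base_gain - clock_share
    # Assumption: cache benefit scales linearly with L3 per core.
    extra_cache_gain = cache_share / l3_growth * extra_l3_factor
    # Per-core gain from Turbo Core: faster clocks plus doubled L3 per core.
    turbo_gain = turbo_clock_gain + extra_cache_gain
    # Per-core throughput relative to one Power7 core.
    per_core = (1 + base_gain) * (1 + turbo_gain)
    socket_ratio = per_core * active_cores / power7_cores
    return socket_ratio, turbo_gain, active_cores / power7_cores


ratio, turbo_gain, license_ratio = turbo_socket_estimate()
print(f"Turbo Core gain per core: {turbo_gain:.0%}")
print(f"Socket throughput vs Power7: {ratio:.2f}x")
print(f"Per-core software bill vs Power7: {license_ratio:.0%}")
```

Change `clock_share` or `extra_l3_factor` and the answer moves quickly, which is the point: the estimate is only as good as the guess about how much of the gain comes from cache.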
Test that idea out for a bit. Mull it around. And then ask yourself why IBM doesn't offer such a machine, and call up your IBM sales rep and ask for one.
And don't forget to add flash to this hypothetical overclocked Power7+ box so the machine keeps the main memory and cache memories well fed. The same thing could be done with the DCMs as well, of course. You could double up the cores per socket, turn off half or even three-quarters of the cores, and run the cores that do work in Turbo Core mode all the time.
IBM could even sort through the bins and find Power7+ chips where many of the cores are duds but the L3 caches are fine and run at high speed, and make special database versions of the Power Systems machines. Imagine a DCM version with four cores activated, with the cores running at between 3.3GHz and 4.4GHz, with 160MB of L3 cache across those four cores.
How much work could such a socket do? It looks to me like it might do 25 per cent more work than an eight-core Power7 socket, at about half the IBM i or AIX/DB2 software bill.
This is a thought experiment, and one that IBM should play around with before it launches the Power7+ servers in a week. Even if such overclocked systems are not formally launched, we see no reason why IBM could not offer them on a special-bid basis to customers where they make more sense than Double Stuf sockets.
IBM needs to swing both ways. ®