Power7+ chips debut in fat IBM midrange systems
Near the top at first, trickling down to smaller boxes next year
IBM has taken the wraps off the first of its Power Systems machinery to make use of its cache-heavy Power7+ processors, and as El Reg anticipated from the hints in the announcement invitation put out two weeks ago, Big Blue is starting near the top of the line as it upgrades systems that run AIX, IBM i (formerly known as OS/400), and Linux.
As has been the case for the past several generations, the rollout for the Power7+ chips will be a gradual one. "The rest of the products will get the Power7+ next year, with the exception of the Power 795," Steve Sibley, director of worldwide product management for IBM's Power Systems division, tells El Reg. "Just like with the Power 595, we already built the fastest processor and I/O into that machine."
It's tough to argue with the guy in charge of the product line – but it's not impossible. Even if IBM can't crank up the clock speed of the 3.7GHz and 4GHz processors used in the high-end, 32-socket Power 795 machine, the ability to have processors with 2.5 times the L3 cache per core (at 10MB) and better sleep states and Turbo Core modes would no doubt be of use to more than a few Power 795 shops.
If enough customers ask for such a thing, you can bet IBM will sell 'em. Just because Big Blue didn't do it before doesn't mean it can't do it now.
The eight-core Power7+ processor was previewed at the end of August at the Hot Chips 24 chippery fest, and we gave you a peek into its expected performance in the wake of the tech presentation, along with some thoughts on the overclocking potential for the Power7+ chip.
As already divulged, the Power7+ chip will come in one variant that puts a single chip running at a higher clock speed into a socket, and another one that takes two Power7+ chips and crams them into a single socket to double-up the cores, threads, and L3 cache in a socket with what we assume will be a pretty substantial hit to clock speed. IBM calls a regular chip a single-chip module, or SCM, and the double-stuffer a dual-chip module, or DCM.
Die shot of the Power7+ chip from IBM
The Power 770+ and Power 780+ machines announced on Wednesday are based on SCMs, just like the current Power 770 and 780 machines, known as Power7' (Power7 prime) in the internal IBM lingo because these chips were announced in conjunction with a doubling of memory capacity and a shift to PCI-Express 2.0 peripheral slots in these machines in October 2011.
The Power7+ chip has a lot of new features to help accelerate specific functions inside of Power System boxes, including on-chip memory compression, encryption, and hashing algorithms, as well as a random-number generator that cannot be predicted because it is based on random electronic effects on the chip.
The Power7+ chip is implemented in a 32-nanometer process. Specifically, IBM's wafer bakery in East Fishkill, New York, uses a copper/silicon-on-insulator process with high-k metal gates to etch the Power7+ chips, which have 2.1 billion transistors on the die.
Like the Power7 chips before it and the System z mainframe processors, the z11 and z12, the Power7+ implements shared L3 cache using embedded DRAM (eDRAM) instead of the faster static RAM (SRAM). It takes fewer transistors to make a memory cell for eDRAM, so even if it is slower than SRAM, you can jam a lot more cache right next to the processors and thereby speed up the performance of the overall processor by more than you might expect.
The shrink from 45 to 32 nanometers allows Big Blue to put 80MB of L3 cache on the die, plus a slew of accelerators. IBM says that making the Power7+ chip using SRAM for the L3 could have pushed the transistor count up to 5.4 billion, and the resulting chip would also have been larger and therefore very likely to get lower yields on a new process.
In general, Sibley says that the Power7+ processors will deliver about 20 to 30 per cent more performance in the machines in which they will soon be shipping. But considering all of the accelerators, the expanded cache, and the memory compression for AIX (but not for Linux or IBM i) on the chips, customers would be wise to get some capacity planning help from IBM to figure out how their own applications might benefit as they jump from Power5, Power6, or Power7 chips to the new Power7+ chips in Power 770+ or Power 780+ systems. This is particularly true if you are using software-based encryption on any operating system and memory compression for AIX workloads.
For example, if you're using the AIX memory compression that debuted with AIX 7.1 running on Power7 chips, you could get as much as 2X the usable main memory, but by using the two on-chip accelerators that IBM put on the Power7+ chip to run the proprietary compression algorithm for AIX memory compression, you can get up to 2.25X usable main memory and not have the overhead of running the compression algorithms on the Power7+ cores. You get the double benefit of more addressable main memory (4TB can look like 9TB) as well as lower CPU core overhead, allowing the central processors to do more work.
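The arithmetic behind those figures is simple enough to sketch. The function name below is made up for illustration, and the expansion factors are IBM's best-case "up to" numbers as quoted above, not anything you should expect on every workload.

```python
# Best-case expansion factors quoted in the article; real workloads will vary.
SOFTWARE_FACTOR = 2.0       # compression run on the Power7 cores
ACCELERATED_FACTOR = 2.25   # compression run on the Power7+ on-chip accelerators


def effective_memory_tb(physical_tb: float, expansion_factor: float) -> float:
    """Return the memory AIX would see, in TB, at a given expansion factor."""
    return physical_tb * expansion_factor


if __name__ == "__main__":
    physical = 4  # TB, the maximum in a Power 770+ or Power 780+
    print("software compression:", effective_memory_tb(physical, SOFTWARE_FACTOR), "TB")
    print("on-chip accelerators:", effective_memory_tb(physical, ACCELERATED_FACTOR), "TB")
```

The same multiplication is what turns 4TB of physical memory into the 9TB that AIX can address.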
The accelerators, by the way, are in the uncore area of the Power7+ chips and shared by the cores.
Two new servers, and a memory bump in the big bad box
The new Power 770+ server is known by product number 9117-MMD in the IBM catalog, and like its Power 570 and Power 770 predecessors, it's a NUMA machine with one to four server nodes that can be linked with fiber optic cables into a single system through the IBM Power chipset.
Each server node comes in a 4U chassis and has two processor cards, each with two processor sockets, so you can scale the Power 770+ from four to sixteen sockets in a single system image – twice as many sockets as in the Power 770 it replaces.
But that doesn't mean you are going to get twice as many cores in a system.
IBM's Power 770+ server
For the Power 770+ machine, IBM is using Power7+ chips with either three or four cores activated, and not the six or eight cores it had on machines using Power7 processors. Yes, that tells you something about Power7+ chip yields using that 32nm process. And this is one of the reasons why the Power7+ chip is, by IBM's own three-year cadence between chip generations, somewhere in the neighborhood of a year or so late.
A chip that ships on time is the exception, not the rule.
So the Power 770+ machine has the same number of cores as the Power 770, which means it doesn't have as much of a performance boost (in terms of number of cores and threads) as you might have been expecting if you thought that IBM would fill the sockets with chips with six or eight cores.
When yields improve, IBM will no doubt put Power7+ chips with six and eight cores into the Power 770+ systems, and it might even go so far as to double-stuff the sockets for those who really want a lot of cores and threads, and where clock speed doesn't matter so much – but yields will have to improve first.
What you get is a Power 770+ machine with 48 cores running at 4.2GHz with 10MB of L3 cache per core – a 13.5 per cent clock speed boost over the Power7 variant and 2.5X the L3 cache per core – or a machine with 64 cores running at 3.8GHz – a 15.2 per cent clock crank with the same 2.5X L3 cache per core.
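For those keeping score at home, here is a rough sketch of how the core counts fall out of the node layout described above. The constant and function names are ours, not IBM's, and the L3 total simply assumes 10MB for each active core.

```python
# Power 770+ layout as described in the article.
MAX_NODES = 4
CARDS_PER_NODE = 2
SOCKETS_PER_CARD = 2
L3_MB_PER_CORE = 10


def total_sockets(nodes: int) -> int:
    """Sockets in a system image built from the given number of nodes."""
    return nodes * CARDS_PER_NODE * SOCKETS_PER_CARD


def total_cores(nodes: int, active_cores_per_chip: int) -> int:
    """Active cores, assuming one single-chip module per socket."""
    return total_sockets(nodes) * active_cores_per_chip


if __name__ == "__main__":
    for active in (3, 4):
        cores = total_cores(MAX_NODES, active)
        print(f"{active} active cores per chip: {cores} cores, "
              f"{cores * L3_MB_PER_CORE}MB of L3 in total")
```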
If you use IBM's Relative Performance (rPerf) benchmark, a variant of the TPC-C online transaction processing test, to gauge the relative performance of the Power 770+ machine, the old Power 770 with 64 cores running at 3.3GHz was rated at 606.8, while the 64-core version of the Power 770+ is rated at 729.3. That's a 20.2 per cent performance bump.
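If you want to check the percentages yourself, a minimal sketch follows. The helper name is hypothetical; the inputs are the clock speeds and rPerf ratings quoted above.

```python
def percent_gain(old: float, new: float) -> float:
    """Percentage increase going from old to new."""
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    # 64-core Power 770 (3.3GHz) versus 64-core Power 770+ (3.8GHz)
    clock = percent_gain(3.3, 3.8)
    rperf = percent_gain(606.8, 729.3)
    print(f"clock gain: {clock:.1f} per cent")
    print(f"rPerf gain: {rperf:.1f} per cent")
    # Whatever is left over after the clock bump comes from something other
    # than megahertz; the article points at the larger L3 cache as one candidate.
    print(f"gain beyond clock: {rperf - clock:.1f} points")
```

With core counts held at 64, the rPerf gain runs about five points ahead of the clock speed gain.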
The Power 770+ has the same 4TB of maximum memory as the updated Power 770' (prime, remember, not a typo) that came out last October, which was twice as much memory as the original Power 770 from 2010. And I happen to believe, by the way, that the Power7+ chips were supposed to go in those Power7' machines from October 2011, but IBM will never confirm that this was indeed the plan.
Each node in the Power 770+ system has six disk bays for a maximum of 1.8TB of storage inside the server skins. Each enclosure has six PCI-Express 2.0 slots for a total of 24 across the fully extended Power 770+ system, which is the same as on the original Power 770 and last year's Power7' box.
The Power Systems machines have a special variant of InfiniBand used for remote I/O drawers, called 12X, that hangs off the GX++ bus on the server, and Sibley says that this bus has better error management and increased bandwidth to host external peripheral drawers. With those 12X I/O drawers hanging off the GX++ bus, the Power 770+ can have a maximum of 184 PCI-Express 2.0 peripheral slots driving sixteen 12X I/O drawers, yielding a maximum of 3PB of internal storage using 900GB 2.5-inch SAS disks.
While IBM is using Power7+ chips in the Power 770+ with four or five of their cores and a proportional 40MB to 50MB of their cache decommissioned because those parts of the circuits have flaws, with the Power 780+ server, also announced today and known as the 9179-MHD in the IBM product catalog, customers are being given the option of either four-core or eight-core variants of the Power7+ chip on system boards that sport two sockets.
On the eight-core variant, the Power7+ chip runs at a slower clock speed than the October Power7' chip, but the four-core version of the Power7+ chip that IBM chose to put into the system runs at a faster clock speed than the Power7' chip it replaces.
The Power 780 that was announced back in 2010 was based on the same iron as the Power 770, with one important difference: the Power 780 had eight-core Power7 chips running at 3.86GHz with all eight cores working, but it also had a Turbo Core mode where customers could deactivate half the cores and let the Power7 cores run at a slightly sprightlier 4.14GHz while at the same time allocating all of the 32MB of L3 cache to the remaining four cores. With the Power 780' (that %$#@! prime again), IBM doubled up the sockets on each system board as the standard configuration while dropping the clock speed down to 3.44GHz across a machine with sixteen sockets using six-core – not eight-core – chips.
Turbo Core mode went the way of all flesh with the Power 780' from October 2011. And "Why?", you ask? Because Oracle counted all the cores on the die – even when they were not on – because customers could change a setting, reboot the Power 780, and in theory get away with paying half as much for their Oracle software licenses. That extra 7.3 per cent of clock speed and double the L3 cache per core was just not worth paying twice the price for Oracle software.
The proper response there would be to charge half as much for AIX and DB2, but what do I know about marketing against Larry Ellison?
The Power 780+ server
There is similarly no Turbo Core mode with the new Power 780+ machine. The system boards with the four-core Power7+ chips run at 4.42GHz, which is 6.8 per cent higher than a half-cored Power 780 from 2010 running in Turbo Core mode. The Power 780+ has 10MB of cache for each of those cores, compared to 8MB in Turbo Core mode for the original Power 780.
That original Power 780 was rated at 425.5 rPerfs on the AIX relative performance benchmark in Turbo Core mode with 32 cores running at 4.14GHz, and the new Power 780+ is rated at 817.1 rPerfs with 64 cores running at 4.42GHz. That's nearly double the performance for double the cores. It's likely that as yields improve, IBM will offer higher clock speeds for the new Power 780+ to improve that performance.
If you want to push throughput instead of single-thread performance, there's an eight-core Power7+ chip available for the Power 780+ system that runs at 3.7GHz and puts 128 cores in a single system image. This machine is rated at 1,380.2 rPerfs, which is a little more than twice the aggregate performance of the original Power 780 using eight-core Power7s.
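To see the trade-off between throughput and per-core oomph, you can divide the rPerf ratings quoted above by the core counts. This is a back-of-the-envelope sketch with names of our own choosing. rPerf is a system-level rating, so treat the per-core figures as a rough guide rather than a measurement of single-thread performance.

```python
# (label, rPerf rating, cores) as quoted in the article
CONFIGS = [
    ("Power 780, Turbo Core, 4.14GHz", 425.5, 32),
    ("Power 780+, four-core chips", 817.1, 64),
    ("Power 780+, eight-core chips, 3.7GHz", 1380.2, 128),
]


def rperf_per_core(rperf: float, cores: int) -> float:
    """Average rPerf contributed by each core in the system."""
    return rperf / cores


if __name__ == "__main__":
    for label, rperf, cores in CONFIGS:
        print(f"{label}: {rperf_per_core(rperf, cores):.1f} rPerf per core")
```

The eight-core configuration wins on aggregate rPerf, while the four-core configuration delivers more per core.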
The Power 780+ has the same peripheral expansion as the Power 770+ above, and both machines offer as many as 20 logical partitions – LPARs in IBMspeak – per core with the PowerVM 2.2.2 hypervisor, which is twice as many VMs as the prior several generations of Power Systems servers could do. A logical partition can scale down to as little as 5 per cent of CPU capacity, according to Sibley.
The speed at which a running LPAR can be teleported from one physical machine to another using PowerVM hypervisors has also been improved by a factor of three on single VM live migrations. (IBM calls this teleporting between physical machines Live Partition Mobility, or LPM.) A Power 770/780 or Power 770'/780' machine could handle eight concurrent LPMs, and the new Power 770+ and 780+ boxes can do sixteen concurrent LPMs.
Both the Power 770+ and Power 780+ machines will be generally available on October 19. AIX 6.1 and 7.1 with appropriate patches will be supported on the machines, as will an update of IBM i 7.1 on first shipment day. A patched version of the earlier IBM i 6.1 release will come out on November 9, and Red Hat Enterprise Linux 6.3 and SUSE Linux Enterprise Server 11 SP2 will also run on these two new boxes.
The venerable AIX 5.3 release will eventually be allowed to run on these machines, but Big Blue must be getting pretty tired of AIX 5.3 about right now. It killed off OS/400 V5R3 years ago, and will be mothballing i5/OS V5R4 next September.
Pricing for the Power 770+ and Power 780+ systems was not available at press time.
Memory boost for the Power 795
IBM is moving to higher density memory cards based on 4 gigabit DDR3 chips for its high-end Power Systems memory cards, and it is these cards that are used with the new Power 770+ and Power 780+ machines when they are fully loaded with their 4TB of memory.
IBM is also kicking out a new 256GB memory card based on the same 4 gigabit DDR3 chips that allows it to double up the main memory capacity on the top-end Power 795, which sports four-core Power7 chips running at 4.25GHz or eight-core Power7 chips running at 4GHz. That's 16TB of physical memory, and with the Power7+ on-chip memory compression algorithm, that 16TB of physical memory can be made to look like 36TB to the AIX operating system. (There's no technical reason why this memory compression can't be made available on IBM i or Linux, and it should be.)
The PowerVM hypervisor can now do 20 LPARs per core on the Power 795, double what was previously available, but the machine can only handle a maximum of 1,000 LPARs per physical machine – even though you could, at 20 LPARs per core, push an aggregate of 5,120 partitions across 256 cores.
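The gap between what the per-core ratio allows and what the system cap permits can be sketched as follows. The function names are illustrative; the 20-per-core and 1,000-per-machine limits are the ones quoted above.

```python
LPARS_PER_CORE = 20   # PowerVM ratio quoted in the article
SYSTEM_LPAR_CAP = 1000  # per-machine ceiling on the Power 795


def theoretical_lpars(cores: int) -> int:
    """Partitions the per-core ratio alone would allow."""
    return cores * LPARS_PER_CORE


def usable_lpars(cores: int) -> int:
    """Partitions actually available once the system cap is applied."""
    return min(theoretical_lpars(cores), SYSTEM_LPAR_CAP)


if __name__ == "__main__":
    cores = 256  # a fully loaded Power 795
    print("theoretical:", theoretical_lpars(cores))
    print("usable:", usable_lpars(cores))
```

By this arithmetic the cap starts to bite at 50 cores, well short of a full 256-core machine.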
All three machines also now have what IBM calls elastic capacity on demand, which is 15 processor-days of free capacity that come as part of the base price of the system, giving customers a little extra head room for a utility-style price. At $16 per day per core with 8GB of main memory allocated to it, this is a reasonable price.
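As a rough sketch of what a burst might cost, consider the following. It assumes that a processor-day means one core for one day and that the 15 free processor-days are drawn down before any billing starts. IBM did not spell out either detail here, so both are our assumptions, as is the function name.

```python
PRICE_PER_CORE_DAY = 16  # US dollars, with 8GB of memory per core
FREE_CORE_DAYS = 15      # bundled with the base price of the system


def elastic_cod_cost(cores: int, days: int, free_core_days: int = FREE_CORE_DAYS) -> int:
    """Dollar cost of a burst, assuming free core-days are consumed first."""
    core_days = cores * days
    billable = max(0, core_days - free_core_days)
    return billable * PRICE_PER_CORE_DAY


if __name__ == "__main__":
    # Eight extra cores for a three-day month-end run
    print("with free allowance left:", elastic_cod_cost(8, 3))
    print("with allowance used up:", elastic_cod_cost(8, 3, free_core_days=0))
```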
"It's pretty attractively priced," says Sibley, "offering the flexibility and price of a public cloud with the convenience of being in your own data center."
Clearly, IBM is betting that 15 processor-days of capacity will get customers hooked on using capacity on demand for end-of-week, end-of-month, and end-of-year runs, as well as whenever they have a spike in demand.
IBM is also tossing in a PowerCare consulting gig, worth somewhere in the neighborhood of $20,000, which consists of a two-week engagement with Global Services, for each Power 780+ and Power 795 deal.
One last thing: the new Power 770+ and Power 780+ servers do not have PCI-Express 3.0 peripheral slots, and Sibley says that PCI-Express 3.0 slots will not be available until the next generation of Power processors comes to market.
Assuming the rollout of Power7+ takes at least until next summer or so, that could mean mid-to-late 2014 before Power Systems get PCI-Express 3.0 peripherals. That is a long time to give Intel a lead, but IBM has lagged behind x86 servers when it comes to PCI slots for a long, long time. This isn't new – even if it is not as aggressive as IBM should be. ®