AMD CTO spills 'Steamroller' core specs
Shows off impending Opteron SeaMicro microserver
Hot Chips Mark Papermaster, CTO at Advanced Micro Devices, gave the keynote address at the Hot Chips 24 conference in Cupertino today. Papermaster's mission was ostensibly to talk about heterogeneous computing, a drum that the chip peddler has been banging on since last fall. But the real news was that Papermaster divulged some of the feeds and speeds of the "Steamroller" core modules for desktop and server processors while also showing off a variant of the SeaMicro microserver sporting an Opteron socket.
The "Piledriver" core modules only started shipping in the "Trinity" APUs, a hybrid CPU-GPU design last mentioned back in May, and they have yet to be incorporated inside Opteron server processors, which are still stuck at the "Bulldozer" core module level. As El Reg reported earlier this week, Suresh Gopalakrishnan, vice president and general manager of AMD's server business unit, is promising to stage the Piledriver Opteron machines - the "Delhi" Opteron 3300s for single-socket machines, the "Seoul" Opteron 4300s for two-socket and single-socket systems, and the "Abu Dhabi" Opteron 6300s for two-socket and four-socket servers - over 2012 and 2013, with at least one variant of Piledrivers expected sometime before the end of this year.
With the Bulldozer, Piledriver, and Steamroller designs, AMD is sticking with its fundamental philosophy of creating a two-core processor module that has a shared floating point unit and other shared components. The idea is to create a module tuned for virtualization and HPC workloads alike that shares some circuits, rather than just cookie-cuttering cores onto a die and thereby wasting real estate, sucking juice, and emitting more heat than it needs to. The Piledriver cores have a number of tweaks to make them run more efficiently, and according to Papermaster's revelations at Hot Chips, AMD has a few more tricks up its sleeve to make aggregations of dual-core modules do more work while generating the same or less heat.
AMD's Steamroller core won't be this blurry
Papermaster said that the Steamroller chip would sport improved integer and floating point pipelines and that a number of changes were being made to goose performance compared to the Bulldozer and Piledriver core modules. The company is not adding HyperThreading to its integer units in the core modules and is sticking with the idea that the chip module itself is multithreaded. That said, Bulldozer and Piledriver chips shared their fetch and decode units across the two halves of the module, and with Steamroller, AMD has realized that it really needs a pair of decoders on the module to boost throughput. Both decoders are fed by a single fetch unit; each one feeds the integer scheduler on its half of the module, and both feed the shared floating point scheduler. Giving each integer unit its own decoder is what lets AMD promise customers "no compromises two thread performance" through the core module.
The integer unit will also have larger L1 instruction caches, a more efficient dispatch unit, and a beefed up instruction pre-fetch unit compared to the Piledriver chips. The larger cache will reduce L1 instruction cache misses by around 30 per cent, said Papermaster, and mispredicted branches will be reduced by 26 per cent. The integer scheduler will sport better emergency schedule handling and there will be more registers available to "tune up integer execution bandwidth." The Steamroller design will handle data cache misses faster and will speed up store-to-load forwarding, too. Add all the changes up on the integer side of the core module and there will be a 30 per cent increase in operations per clock cycle for Steamroller over Piledriver. That's a process node or so right there without actually having to go through it. There's hope for AMD yet. It's a pity that these Steamroller changes can't be combined with a jump from 28 nanometer wafer baking processes down to 22 nanometer or 20 nanometer processes at the same time.
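To put those percentages in perspective, here is a back-of-the-envelope sketch in Python. It assumes the simplest possible model, in which throughput is just operations per clock multiplied by clock speed. The 3.0GHz baseline clock, the baseline miss count, and the function names are all made up for illustration and are not AMD figures.

```python
def relative_throughput(ops_per_clock_gain, clock_ratio=1.0):
    """Throughput relative to the baseline chip, assuming
    throughput = ops per clock x clock speed (a deliberately crude model)."""
    return (1.0 + ops_per_clock_gain) * clock_ratio


def equivalent_clock_ghz(base_clock_ghz, ops_per_clock_gain):
    """Clock speed the baseline chip would need to match the newer design
    if it could not improve ops per clock at all."""
    return base_clock_ghz * (1.0 + ops_per_clock_gain)


def misses_after_reduction(baseline_misses, reduction):
    """Cache misses (or branch mispredictions) left after a fractional cut."""
    return baseline_misses * (1.0 - reduction)


if __name__ == "__main__":
    gain = 0.30                # the ops-per-clock improvement quoted above
    hypothetical_clock = 3.0   # GHz, illustrative assumption only
    print(f"Relative throughput at the same clock: {relative_throughput(gain):.2f}x")
    print(f"Clock needed to match on the old design: "
          f"{equivalent_clock_ghz(hypothetical_clock, gain):.1f}GHz")
    print(f"L1 instruction misses left per 1,000: "
          f"{misses_after_reduction(1000, 0.30):.0f}")
```

Under this model, a chip with 30 per cent more operations per clock at an unchanged clock speed does the work of an older chip clocked 30 per cent faster, which is the sort of gain normally associated with a process shrink. Real workloads will not scale this cleanly.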
On the floating point portion of the Steamroller chip, AMD has freed up some silicon real estate on this third-generation Bulldozer core by removing one of the two MMX units, which Papermaster said did not have an appreciable effect on performance for workloads that needed SSE2 capability. (Bulldozer and Piledriver both have two such MMX units.) The FP unit still has two 128-bit fused multiply-add units, and AMD is still contending that it can get competitive 256-bit floating point performance by turning half the cores off and double-pumping the math unit.
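The arithmetic behind that 256-bit contention can be sketched in a few lines of Python. This is a rough peak-rate model, not a description of AMD's scheduler: it assumes each 128-bit unit can start one fused multiply-add per cycle, counts each fused multiply-add as two floating point operations, and, when both cores are busy, assumes they split the shared units evenly. The function names are hypothetical.

```python
def peak_flops_per_cycle(fma_units=2, unit_width_bits=128, element_bits=64):
    """Peak floating point operations per cycle for one core module.

    Assumes every unit issues one fused multiply-add per cycle, and that a
    fused multiply-add counts as two operations (one multiply, one add).
    """
    lanes_per_unit = unit_width_bits // element_bits
    return fma_units * lanes_per_unit * 2


def per_core_share(module_flops, active_cores):
    """Even split of the shared FP unit among active cores (an assumption)."""
    return module_flops / active_cores


if __name__ == "__main__":
    dp = peak_flops_per_cycle(element_bits=64)   # double precision
    sp = peak_flops_per_cycle(element_bits=32)   # single precision
    print("Module peak, double precision:", dp)
    print("Module peak, single precision:", sp)
    print("Per core with both cores active:", per_core_share(dp, 2))
    print("Per core with one core active:", per_core_share(dp, 1))
```

With one core in the module idle, the remaining core has both 128-bit units to itself, which adds up to 256 bits of fused multiply-add width per cycle. With both cores hammering the FP unit, each gets half of that in this model.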
The other interesting feature of the future Steamroller design is that it has a smart L2 cache that allows segments of the cache to be gated and turned off when they are not in use. (Intel has pulled a similar trick with its "Ivy Bridge" family of processors and plans to bring it to servers larger than those addressed by the single-socket Xeon E3-1200 v2 chips, which are based on Ivy Bridge cores.) AMD will dial this cache up and down as workloads dictate, instead of letting it just sit there burning up juice in an all-or-none fashion.
AMD knows that its foundry partners, GlobalFoundries and Taiwan Semiconductor Manufacturing Corp, can't take on Intel directly in the process game. So it has to be smarter about gating power on every part of the die to squeeze out efficiencies. And it has to be smarter about how it designs its chips, too. For instance, Papermaster said that AMD has to use more intelligent chip design tools to squeeze more out of the processes its partners can fab chips in. As an example, Papermaster showed part of the floating point unit from the Bulldozer core that was hand-drawn to lay out the circuits for the 32 nanometer process at GloFo. But AMD now has a high-density library that does a much better job, and this same portion of the FP unit run through these tools resulted in the FP unit taking up 30 per cent less area on the chip and burning 30 per cent less power.
"That's like a technology node's level of energy efficiency," said Papermaster.
The Steamroller chip design has been completed and Papermaster said it would be in silicon and inside of systems next year.
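Papermaster's comparison of the high-density library gains to a technology node can be sanity-checked, at least on the area side, with a little geometry. The Python sketch below assumes ideal scaling, in which die area shrinks with the square of the feature size. That is a textbook simplification rather than anything AMD or its foundries have claimed, and it says nothing about power, which does not follow such a simple rule. The function name is hypothetical.

```python
def ideal_area_ratio(old_nm, new_nm):
    """Area of a circuit on the new process relative to the old one,
    assuming ideal scaling with the square of the feature size."""
    return (new_nm / old_nm) ** 2


if __name__ == "__main__":
    for new_nm in (28, 22):
        ratio = ideal_area_ratio(32, new_nm)
        saving = (1.0 - ratio) * 100
        print(f"32nm to {new_nm}nm: about {saving:.0f} per cent less area")
```

By this yardstick, a move from 32 nanometers to 28 nanometers is worth roughly 23 per cent less area, and a move to 22 nanometers roughly 53 per cent, so a 30 per cent area saving from better design tools alone sits comfortably in process-shrink territory.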
SeaMicro servers ahoy!
Coming a bit sooner will be a version of SeaMicro's microserver motherboard that plugs into its SM10000-XE chassis launched in January. That was the version of the SeaMicro box that sported Intel's "Sandy Bridge" Xeon E3-1200 v1 processors.
The Opteron-based mobo for the SeaMicro microserver chassis
When AMD acquired SeaMicro for $334m a month later, Papermaster confirmed El Reg's contention that AMD was not trying to get into the server manufacturing business, but was interested in the "Freedom" 3D torus/mesh interconnect at the heart of the SM10000 machines as well as the load balancing and I/O virtualization ASICs that are also part of the system. In any event, AMD has promised that it would get an Opteron-based SeaMicro mobo into the system by the end of the year, and there it is. Exactly what processor AMD is putting in there is unclear, but a variant of the Opteron 3200 seems likely, or the Opteron 3300 if that is the first Piledriver-generation chip to come out the door from AMD.
The point of all this talk about "ambidextrous systems" and the SeaMicro interconnect is to prove that AMD is looking at using different kinds of processing elements on a die or within systems with peppy interconnects. "It is not a pure feeds and speeds race," Papermaster proclaimed at the end of his keynote. "It is how you bring the solution together. You have to have a systems view." ®