AMD CTO spills 'Steamroller' core specs
Shows off impending Opteron SeaMicro microserver
Hot Chips Mark Papermaster, CTO at Advanced Micro Devices, gave the keynote address at the Hot Chips 24 conference in Cupertino today. Papermaster's mission was ostensibly to talk about heterogeneous computing, a drum that the chip peddler has been banging on since last fall. But the real news was that Papermaster divulged some of the feeds and speeds of the "Steamroller" core modules for desktop and server processors while also showing off a variant of the SeaMicro microserver sporting an Opteron socket.
The "Piledriver" core modules only started shipping in the "Trinity" APUs, a hybrid CPU-GPU design launched back in May, and they have yet to be incorporated inside Opteron server processors, which are still stuck at the "Bulldozer" core module level. As El Reg reported earlier this week, Suresh Gopalakrishnan, vice president and general manager of AMD's server business unit, is promising to stage the Piledriver Opteron machines - the "Delhi" Opteron 3300s for single-socket machines, the "Seoul" Opteron 4300s for two-socket and single-socket systems, and the "Abu Dhabi" Opteron 6300s for two-socket and four-socket servers - over 2012 and 2013, with at least one Piledriver variant expected sometime before the end of this year.
With the Bulldozer, Piledriver, and Steamroller designs, AMD is sticking with its fundamental philosophy of creating a two-core processor module that has a shared floating point unit and other shared components. The idea is to create a module tuned for virtualization and HPC workloads alike that shares some circuits, rather than just cookie-cuttering cores onto a die and thereby wasting real estate, sucking juice, and emitting more heat than necessary. The Piledriver cores have a number of tweaks to make them run more efficiently, and according to Papermaster's revelations at Hot Chips, AMD has a few more tricks up its sleeve to make aggregations of dual-core modules do more work while generating the same or less heat.
AMD's Steamroller core won't be this blurry
Papermaster said that the Steamroller chip would sport improved integer and floating point pipelines and that a number of changes were being made to goose performance compared to the Bulldozer and Piledriver core modules. The company is not adding HyperThreading to its integer units in the core modules and is sticking with the idea that the chip module itself is multithreaded. That said, Bulldozer and Piledriver chips shared a single fetch unit and a single decoder across the two halves of the module, and with the Steamrollers, AMD has realized that it really needs a pair of decoders on the module to boost throughput. The two decoders are fed by a single fetch unit, and each feeds the integer scheduler on its half of the module as well as the shared floating point scheduler. With each integer unit getting its own decoder, AMD can promise customers "no compromises two thread performance" through the core module.
The integer unit will also have larger L1 instruction caches, a more efficient dispatch unit, and a beefed up instruction pre-fetch unit compared to the Piledriver chips. The larger cache will reduce L1 instruction cache misses by around 30 per cent, said Papermaster, and mispredicted branches will be reduced by 20 per cent. The integer scheduler will sport better emergency schedule handling and there will be more registers available to "tune up integer execution bandwidth." The Steamroller design will handle data cache misses faster and will speed up store-to-load forwarding, too. Add all the changes up on the integer side of the core module and there will be a 30 per cent increase in operations per clock cycle for Steamroller over Piledriver. That's a process node or so of improvement right there without actually having to go through a shrink. There's hope for AMD yet. It's a pity that these Steamroller changes can't be combined with a jump from 28 nanometer wafer baking processes down to 22 nanometer or 20 nanometer processes at the same time.
On the floating point portion of the Steamroller chip, AMD has freed up some silicon real estate on this third-generation Bulldozer core by removing one of the two MMX units, which Papermaster said did not have an appreciable effect on performance for workloads that needed SSE2 capability. (Bulldozer and Piledriver both have two such MMX units.) The FP unit still has two 128-bit fused multiply-add units, and AMD is still contending that it can get competitive 256-bit floating point performance by turning half the cores off and double-pumping the math unit.
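The 256-bit claim comes down to simple lane arithmetic, which can be sketched as follows. This is a back-of-envelope illustration only: the function names are hypothetical, the figures are derived purely from vector widths rather than from anything AMD published, and it assumes the usual convention that one fused multiply-add counts as two floating point operations per lane.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def peak_flops_per_cycle(fma_units: int, unit_bits: int, element_bits: int) -> int:
    """Theoretical peak flops per clock for a set of FMA units.

    Assumption: each FMA lane retires one multiply and one add per cycle,
    which is counted as two flops.
    """
    return fma_units * lanes(unit_bits, element_bits) * 2


# Two 128-bit FMA units in one module, double precision (64-bit elements).
two_by_128 = peak_flops_per_cycle(fma_units=2, unit_bits=128, element_bits=64)

# A hypothetical single 256-bit FMA unit, for comparison.
one_by_256 = peak_flops_per_cycle(fma_units=1, unit_bits=256, element_bits=64)

print(two_by_128, one_by_256)
```

On paper, two 128-bit units ganged together match one 256-bit unit per cycle, but only when a single thread in the module has the whole shared FP unit to itself.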
The other interesting feature of the future Steamroller design is that it has a smart L2 cache that allows for segments of the cache to be gated and turned off when they are not in use. (Intel has pulled a similar trick with its "Ivy Bridge" family of processors and plans to bring it to servers larger than those addressed by the single-socket Xeon E3-1200 v2 chips, which are based on Ivy Bridge cores.) AMD will dial this cache up and down as workloads dictate, instead of leaving it just sitting there burning up juice in an all-or-none fashion.
AMD knows that its foundry partners, GlobalFoundries and Taiwan Semiconductor Manufacturing Co, can't take on Intel directly in the process game. So it has to be smarter about gating power on every part of the die to squeeze out efficiencies. And it has to be smarter about how it designs its chips, too. For instance, Papermaster said that AMD has to use more intelligent chip design tools to get more out of the processes its partners can fab chips in. As an example, Papermaster showed part of the floating point unit from the Bulldozer core that was hand-drawn to lay out the circuits for the 32 nanometer process at GloFo. But AMD now has a high-density library that does a much better job, and this same portion of the FP unit run through these tools took up 30 per cent less area on the chip and burned 30 per cent less power.
"That's like a technology node's level of energy efficiency," said Papermaster.
The Steamroller chip design has been completed and Papermaster said it would be in silicon and inside of systems next year.
Re: What does "emergency schedule handling" mean
I think this is where the execution queue handler has to be either flushed or interrupted and higher priority instructions than the existing items in the queue executed instead. For example, when a high priority hardware interrupt is triggered it needs to use the processor resources with low latency rather than waiting around for a lower level process to complete.
It used to be "simple" when the persistent process chip state was just a few registers - add a higher number of registers, such as those in the non-core specialist functions, and execution queuing gets "interesting".
The problem is the consumer lines have been pushing sales as of late since the business market is skittish about the economy, and the simple fact is unless you take Windows 8, already shaping up to be one of the most hated Microsoft OSes since WinME, then you are SOL when it comes to anything based on the Bulldozer module design, and this is coming from someone who has built nothing but AMD for years and is typing this on a Thuban.
The problem is every version of Windows OTHER than Windows 8 has a scheduler that treats the BD module as TWO cores, and this is wrong. What it actually is is more like a single core with hardware assisted hyperthreading, thanks to only having half the FP. Why is this bad? Let's say you have two heavy loads. The SMART way to execute on a BD module would be one load on module 1/2 and the other on 3/4, that way each load would have its own FP unit. What you will get in any Windows OTHER than Windows 8 is that Windows will schedule BOTH loads on 1/2, which means they are fighting over a single FP unit and the whole thing just drags. This is why in many tests with Win 7 a Thuban or Deneb will beat a BD with the same amount of cores, sometimes even beating BD chips with MORE cores.
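To make the scheduling point concrete, here is a minimal sketch of the two placement policies described above. It is a toy model, not how any Windows scheduler is actually implemented: the function names are hypothetical, cores are numbered from 0, and it assumes the two cores of a module have adjacent indices (0/1 in the first module, 2/3 in the second).

```python
from __future__ import annotations

from collections import defaultdict

CORES_PER_MODULE = 2  # assumption: each Bulldozer-style module exposes two cores


def module_of(core: int) -> int:
    """Map a core index to its module, assuming sibling cores are adjacent."""
    return core // CORES_PER_MODULE


def naive_placement(n_loads: int, n_cores: int) -> list[int]:
    """Fill cores in index order, ignoring which module each core sits in."""
    if n_loads > n_cores:
        raise ValueError("more loads than cores")
    return list(range(n_loads))


def module_aware_placement(n_loads: int, n_cores: int) -> list[int]:
    """Spread loads across modules first, doubling up only when forced to."""
    if n_loads > n_cores:
        raise ValueError("more loads than cores")
    # Visit the first core of every module, then the second core of every module.
    order = sorted(range(n_cores), key=lambda c: (c % CORES_PER_MODULE, c))
    return order[:n_loads]


def loads_sharing_fpu(placement: list[int]) -> int:
    """Count loads that share a module's FP unit with at least one other load."""
    per_module: dict[int, int] = defaultdict(int)
    for core in placement:
        per_module[module_of(core)] += 1
    return sum(count for count in per_module.values() if count > 1)


# Two heavy loads on a two-module, four-core chip.
naive = naive_placement(2, 4)
aware = module_aware_placement(2, 4)
print(naive, loads_sharing_fpu(naive))
print(aware, loads_sharing_fpu(aware))
```

In this model the naive policy puts both loads in the first module, so they contend for one FP unit, while the module-aware policy gives each load a module of its own.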
Frankly this is a server chip that NEVER should have been released on the desktop. Desktop loads are more and more multimedia, which is FP city, which is exactly what BD is worst at. Instead what they should have done is kept Thuban and Llano for the desktop while working on a new design, pushed out a new Brazos quad for mobile, and kept the BD/PD design for servers where most loads are integer and not FP heavy.
As someone who has been building and selling AMD systems exclusively, as well as having my entire family on AMD, if they haven't come up with a better design by the time I run out of AM3+ chips I'll have no choice but to go Intel. My customers and I are skipping Win 8 and there simply is no selling point for BD on Win 7.
AMD seems to be refining the cores and processes to where they should have been when BD was released. If Piledriver and Steamroller provide the 15% projected performance goal then AMD is back on track and sales will grow significantly. Trinity was a great performance bump so AMD has learned a few things.