Original URL: http://www.theregister.co.uk/2010/08/24/amd_hot_chips/
AMD: 'Bobcat' smaller, faster than Intel's Atom
Netbooks. Not servers. For now
AMD says that its upcoming "Bobcat" core for netbooks and notebooks is smaller than a single-core Intel Atom chip – and faster. According to the company, this low-power architecture could eventually follow Atom into the server market, but at the moment, that territory is still reserved for the "Bulldozer."
The annual Hot Chips conference is going on this week at Stanford University, and it is a bit more subdued than usual because so many different processors have already been announced this year. Advanced Micro Devices, which has lost a bunch of market share to rival Intel, has been talking up its future "Bobcat" and "Bulldozer" cores for PCs and servers, respectively, and is providing a few more details on these cores at Hot Chips.
But don't get too excited. The Bulldozer chips have a vague 2011 ship date and AMD is not saying much more than it already has in the past several months. The Bulldozer cores, as El Reg told you two weeks ago, have a modular core design where L2 cache, instruction fetch and decode units, and floating point unit schedulers are shared across two cores, which have their own four-pipeline integer units and integer schedulers as well as a 128-bit floating point unit.
Depending on the chip, from one to four of these Bulldozer modules are plunked onto a single chip, with an L3 cache, a DDR3 memory controller, and a northbridge for linking the cores to external peripherals shared by all of the modules on the chip. The quasi-core at the heart of the dual-core Bulldozer module is a new core, not a reworked variant of the K8 core used in the prior several generations of Opteron processors.
In the case of the future "Valencia" six-core and eight-core processors, one of these eight-core Bulldozer chips is put into a single package for the C32 socket. For the high-end, two of these eight-core chips are put side-by-side in a single package, called "Interlagos," and linked by HyperTransport, yielding a 16-core processor package (and a twelve-core variant with two dud cores per chip, and probably some of the L3 cache turned off, too). This Interlagos chip plugs into the existing G34 socket.
That means server and workstation makers will be able to deliver systems with 33 per cent more cores and 50 per cent more performance than they can with the existing "Magny-Cours" Opteron 6100 processors. John Fruehe, director of product marketing for server/workstation products at AMD, blogged those performance numbers in early August, ahead of the Hot Chips conference, stealing his own Hot Chips thunder to cover for some weak financials.
Fruehe has also said in interviews with El Reg that Bulldozer's shared component approach results in a Bulldozer module with two quasi-cores that yields about 1.8 times the performance of a single current Magny-Cours core, rather than the 2 times that two fully separate cores would deliver. That's a 10 per cent performance hit, clock for clock, for every pair of cores, but much lower power consumption because of the shared nature of the Bulldozer modules.
That means expanding the core count from 12 to 16 is about 20 per cent of the extra oomph, and the other 30 per cent is coming from changes in the instruction stream and increased clock speeds. It is our guess that AMD will try to push clock speeds up to around 2.75 GHz on its top-bin Interlagos server chips.
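The back-of-the-envelope sums behind that split can be sketched as follows. The core counts, the 1.8x module figure, and the 50 per cent claim are the numbers quoted above; the variable names and the choice to show both an additive and a multiplicative reading of "the other 30 per cent" are our own assumptions, not anything AMD has published.

```python
# Figures quoted in the article; the decomposition is an illustrative assumption.
MAGNY_COURS_CORES = 12
INTERLAGOS_CORES = 16
MODULE_EFFICIENCY = 1.8 / 2.0   # a two-core module delivers ~1.8x, not 2x
TOTAL_GAIN = 1.50               # AMD's claimed 50 per cent uplift

# More cores, each pair taking the 10 per cent sharing hit.
core_count_gain = INTERLAGOS_CORES / MAGNY_COURS_CORES * MODULE_EFFICIENCY

# Two ways to read what is left over for instruction and clock changes.
additive_remainder = TOTAL_GAIN - core_count_gain        # percentage points
multiplicative_remainder = TOTAL_GAIN / core_count_gain  # compounding factor

print(f"gain from core count alone: {core_count_gain - 1:.0%}")
print(f"remainder, additive reading: {additive_remainder:.0%}")
print(f"remainder, multiplicative reading: {multiplicative_remainder - 1:.0%}")
```

Read additively, the leftover is the 30 points cited above; if the gains compound instead, instruction and clock improvements only need to contribute about 25 per cent on top of the core-count gain.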
AMD's Bulldozer Opteron modular core.
Ahead of Hot Chips, Fruehe would not confirm the clock speeds on any of the Bulldozer-based parts. But what he did say is that on the commercial workloads that AMD is targeting with the Opteron chips, 90 per cent of the workload is running through the integer unit and 10 per cent is hitting the floating point unit. By sharing the other components on the Bulldozer core, AMD can reduce the effective size of a core and by moving to a 32 nanometer, silicon-on-insulator/high-k metal gate wafer baking process from GlobalFoundries, it can shrink the whole shebang quite a bit from the current 45 nanometer processes. That should lead to higher clock speeds than are currently available on Opteron 6100s.
The Bulldozer cores will also get that performance boost from a new twist on the turbo functions used by Intel and IBM in their respective Core/Xeon and Power6/7 processors, which allow a core to crank up its clock speed when other cores on the chip are not being used. Fruehe would not say how this turbo function will work, but said it would be more elegant than what has been done to date and that it will work dynamically, boosting performance or cutting it back as conditions on the system dictate.
Fruehe reiterated AMD's disdain for HyperThreading and any kind of simultaneous multithreading, contending that its approach of sharing certain components and yet having two real integer and floating point units (instead of virtual ones) was better for a lot of workloads. "Having 16 threads running on 16 cores is better than trying to cram 16 threads onto eight cores," Fruehe says emphatically.
The logical layout of the Valencia Opteron processor.
When pressed about how far this modular approach with the Bulldozer cores can go, Fruehe was not giving away much information, but did confirm that scaling the modules beyond 16 cores is "doable." It had better be, and on 32 nanometer wafer baking processes, if AMD wants to keep on the Moore's Law curve and do a better job of keeping pace with rival Intel.
The first Bulldozers - presumably the high-end Interlagos parts - will sample at the end of this year to OEM partners in the server and workstation rackets, says Fruehe, and as the year goes on and AMD gets a better sense of how the 32 nanometer processes are working out at GlobalFoundries, it will provide some more precise launch dates for the Interlagos and Valencia Bulldozers. It seems likely that the pricier Interlagos parts that plug into the G34 sockets will come first, followed a quarter or so later by the Valencia parts that plug into the C32 sockets.
The "Zambezi" variant of the Bulldozer chip, aimed at the enthusiast desktop PC space as El Reg reported back in November 2009, is expected in 2011 as well. The Zambezi part is expected to come with four or eight cores and fit into an AM3 socket.
With the Bobcat cores for notebooks and netbooks, AMD is taking a K8 core and tweaking the heck out of it. Greg Hoepper, corporate vice president of design engineering who has managed the Bobcat design, says that the Bobcat core (which is a true, isolated core that does not share components) is "quite small" and that you could, in theory, put an "enormous number of these on a single die" if you wanted to. The early Bobcat implementations will put two cores on a die, with core counts going up from there.
While Hoepper is not giving out feeds and speeds on the chips, he did say this: "Bobcat is smaller than a single core Atom chip, and it has higher performance."
The Bobcat may be based heavily on the K8 core, but it doesn't cut any corners and has fully out-of-order execution of its instructions. The core sports a new set of logic for branch prediction. It supports the SSE1-3 SIMD instructions as well as the AMD-V virtualization extensions and the full AMD64 64-bit instruction set.
The Bobcat chip has 32 KB of L1 instruction cache that sits in front of the fetch and decode units. Below that are the integer, floating point, and address schedulers. The integer unit has two pipes, a load unit, and a store unit, and 32 KB of L1 data cache, while the floating point unit sits off to the side. Both the integer and floating point units share an on-chip L2 cache.
On the power conservation front, Hoepper says the Bobcat core has a micro-architecture that was designed to minimize data movement (and hence power consumption and inefficient processing) and has clock gating, power gating, and system low power states to conserve energy. The chip will be able to provide around 90 per cent of today's mainstream performance in notebooks and netbooks in less than half the die size, according to Hoepper. We will have to wait until the Bobcats are launched to quantify that statement, but that is what AMD's internal modeling and benchmark simulations for Bobcat-based chips are showing.
As AMD previously disclosed when reporting its second quarter results back in mid-July, the company is pushing out the "Llano" Fusion APU chip (which combines a CPU and a GPU in a single package) a few months into 2011. The Llano chip uses the same 32 nanometer process as the first Bulldozer chips do from GlobalFoundries. The word back in July is that the Llanos, which put a quad-core Phenom II-alike chip and a modified version of the Radeon HD5000 series GPU on a chip, will ship sometime in the first half of 2011. (We gave you all the feeds and speeds on Llano back in February.)
AMD's Bobcat core for netbook, notebook, and desktop PC microprocessors.
The Llanos slipped, but the "Ontario" Fusion APU, which is based on the Bobcat cores and which is aimed at high-end notebooks, is being pulled into this year. AMD is not saying how powerful the Ontario's GPU is, but we know it has two CPU cores and, as Hoepper confirmed, the GPU on the chip will be a variant of the "Evergreen" GPU family at the heart of the discrete graphics cards and GPU co-processors sold by AMD. The CPU and GPU components will link to each other and to main memory over a high-speed, on-chip bus. The Ontario chip is being manufactured by AMD's GPU foundry partner, Taiwan Semiconductor Manufacturing Corp, using its bulk 40 nanometer processes.
The Bobcat core design had a number of goals, according to Hoepper, and one of them is playing out right before our eyes. That was for the design to be portable between different wafer bakers. So if GlobalFoundries is slipping in a process, AMD's economic shoestrings are not tied together. It has another chip cooking at TSMC. The Bobcat design is also meant to push the power envelope down, and can hit below one watt per core of power consumption, according to Hoepper. "The Bobcat core integrates into SoC easier, too," says Hoepper, "and that allows us to do more things than we have been able to do in the past." Like really push into notebooks and netbooks hard and into other peripheral and embedded computing devices.
And perhaps, just perhaps, into low-power servers. Both Fruehe and Hoepper said there were no plans for modifying Bobcat chips to run in server platforms. "There is a lot of hype around ARM going into the server space," concedes Hoepper, "and Bobcat would work well here."
Fruehe says that AMD will be able to get six-core and eight-core Bulldozer chips in the 30 to 40 watt power range, which is pretty low for a server. "The question is this," says Fruehe. "Is there a need for a more discrete, less-threaded chip for servers?"
AMD is open to the idea, but don't hold your breath. Unless ARM, Tilera, and Intel Atom chips start getting some server sales. ®