AMD to double up cores with Jaguars
And maybe finally a Cat server variant
Hot Chips For those of us hoping that Advanced Micro Devices would get into the low-powered server racket with some earnestness, it looks like the forthcoming processors based on the "Jaguar" cores will fit the bill quite nicely.
The Jaguars are the kickers to the current "Bobcat" family of x86 processors used in ceepie-geepie hybrids bearing the Fusion APU brand, but in years gone by, AMD pooh-poohed the idea that a Cat family chip would ever wear server spots.
With the Jaguars, this could change, particularly with the significant increase in core count, cache, and main memory capacity that is being put into the design.
Jeff Rupley, chief architect of the Jaguar processors at AMD, didn't give any hints about where the Jaguar chips would be used - or where they would not be - in his presentation at the Hot Chips 24 conference in Cupertino on Monday, or what the target clock frequencies would be for the Jaguars. Rupley was only there to talk about the Jaguar architecture and deferred on such questions.
But what is clear from the specs is that if a Jaguar chip is suitable for cloud-optimized clients and other low-power devices such as tablets, then there is no good reason that a bunch of them could not be ganged up and crammed inside of a dense chassis of microservers to run Windows or Linux workloads that have only modest performance requirements and where server density is much more important. It could turn out that the performance per watt and performance per dollar per watt of a server-ized Jaguar chip beats a low-voltage Opteron 3300 or 4300.
No matter what, the Jaguar-based processors will have the benefit of moving to new 28 nanometer processes. That's plural, and it means AMD has created the Jaguar design so it can be dual-sourced from either GlobalFoundries or Taiwan Semiconductor Manufacturing Company, both of which make desktop and laptop processors for AMD these days. (TSMC also makes AMD's graphics processors, as it does graphics chips for Nvidia.) With the shrink from 40 nanometers to 28 nanometers, AMD is doing a number of things with the Jaguar chips.
First, it is doubling up the core count while making some substantial changes to the cache memory structure relative to that used in the Bobcat-based chips. There are also a number of tweaks to the instruction set to boost the performance per clock cycle (the same thing is expected with the "Piledriver" cores for Opteron server chips later this year) and support for AVX vector math.
The Jaguar design has four cores running along the bottom of the chip, with an L2 cache interface riding on top of the cores that links out to the northbridge of the chipset and to four banks of L2 cache memory with a total of 2MB of cache. That's 512KB of L2 cache per core, the same as in the Bobcats.
The Bobcat cores supported levels 1 through 4A of the SSE media processing instructions that are compatible with Intel chips, and the Jaguars add in support for SSE4.1 and SSE4.2 instructions. The Jaguar chips will also have a 40-bit physical memory address space, up from 36 bits with the Bobcats, which means they will be able to, in theory, address a lot more main memory.
The limit is 64GB at 36 bits, which is still a hell of a lot of memory for a laptop or tablet, and it rises to 1TB at 40 bits. That big physical memory increase could mean that AMD is indeed planning server variants of Jaguar Fusion APUs, which would be very interesting if the on-chip Radeon GPUs could be made to do some offloaded mathematical calculations.
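For anyone who wants to check the memory ceilings, the arithmetic is just a power of two per address bit. The sketch below assumes byte addressing and binary units (1GB = 2^30 bytes), and the function name is made up for illustration; it says nothing about how much memory any real Jaguar board will actually support.

```python
# Assumption: each physical address selects one byte, and GB/TB are binary units.
GIB = 2 ** 30
TIB = 2 ** 40


def addressable_bytes(address_bits: int) -> int:
    """Return the number of bytes reachable with an address of this width."""
    return 2 ** address_bits


bobcat_bytes = addressable_bytes(36)  # Bobcat: 36-bit physical addresses
jaguar_bytes = addressable_bytes(40)  # Jaguar: 40-bit physical addresses

print(f"36 bits: {bobcat_bytes // GIB}GB")          # 36 bits: 64GB
print(f"40 bits: {jaguar_bytes // TIB}TB")          # 40 bits: 1TB
print(f"increase: {jaguar_bytes // bobcat_bytes}x")  # increase: 16x
```

The four extra address bits are worth a factor of 16, which is where the jump from 64GB to 1TB comes from.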
The floating point unit in the Jaguar is being boosted as well, according to Rupley. In the Bobcat cores, the FP unit had a two-wide decoder with two execution pipelines that could handle 64-bit processing. With Jaguar, the FP unit gets 128-bit processing and a 128-bit wide data path. The FP unit will be able to do four single-precision multiplies and four single-precision adds at the same time; it will also be able to issue one double-precision multiply and two double-precision adds per clock.
If you double pump the FP unit, you can do one 256-bit AVX vector math instruction per clock. This 128-bit FP and 256-bit AVX processing is as good as what a Bulldozer, Piledriver, or Steamroller core in the Opteron server chips can do.
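The issue rates Rupley quoted for the FP unit translate into a rough peak figure per clock. The sketch below simply adds the multiplies and adds he cited, counting each as one floating point operation; the class and variable names are invented for illustration, and because AMD has not disclosed clock speeds, no per-second number is attempted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpIssueRate:
    """Multiplies and adds the FP unit can issue in one clock (from the talk)."""
    multiplies_per_clock: int
    adds_per_clock: int

    def flops_per_clock(self) -> int:
        # Assumption: every multiply or add counts as one floating point op.
        return self.multiplies_per_clock + self.adds_per_clock


SINGLE_PRECISION = FpIssueRate(multiplies_per_clock=4, adds_per_clock=4)
DOUBLE_PRECISION = FpIssueRate(multiplies_per_clock=1, adds_per_clock=2)
CORES_PER_CHIP = 4  # the four-core Jaguar design described in the article

for label, rate in (("single", SINGLE_PRECISION), ("double", DOUBLE_PRECISION)):
    per_core = rate.flops_per_clock()
    per_chip = per_core * CORES_PER_CHIP
    print(f"{label} precision: {per_core} per core, {per_chip} per chip, per clock")
```

On those numbers a core peaks at 8 single-precision or 3 double-precision operations per clock, so double precision is well under half the single-precision rate.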
The L1 instruction and data caches on each core will stay the same at 32KB each with the Jaguar design, but prefetchers and load/store units have lots of tweaks to make them hum along more efficiently. The integer execution unit is essentially the same, with its schedulers able to issue two instructions, one load, and one store per clock.
The core enhancements added about 4 per cent more performance with Jaguar over Bobcat in terms of instructions per clock cycle (IPC), and other tweaks add up to more than 15 per cent better IPC. That's not too shabby for a tweak to an existing architecture.
AMD Jaguar core floor plan
All of these components are spread out on the Jaguar chip in an "amoeba-like" floor plan that Rupley says "took a lot of blood, sweat, and tears" to come up with and that was created using tools developed by the ATI side of the house to build AMD's GPUs. "We had some initial floor plans that were really terrible," admits Rupley, as the CPU designers learned to use the GPU tools better.
The Bobcat core weighs in at 4.9 square millimeters in area using the 40 nanometer process at TSMC, and if Jaguar were implemented in the same process it would have about 10 per cent more area, according to Rupley. But lucky for AMD and its customers, Jaguar cores will be implemented in 28 nanometer processes and will only need 3.1 square millimeters of space. ®
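The area figures can be sanity-checked with a little arithmetic. The sketch below takes the 4.9 and 3.1 square millimeter figures and the 10 per cent growth from Rupley's talk; the "ideal shrink" line is purely an assumption for comparison, scaling area by the square of the feature-size ratio, which real processes rarely achieve.

```python
BOBCAT_AREA_40NM = 4.9   # square millimeters, per Rupley
JAGUAR_GROWTH = 0.10     # Jaguar would be about 10 per cent larger at 40nm
JAGUAR_AREA_28NM = 3.1   # square millimeters, per Rupley

# What Jaguar would occupy if it were built on the old 40nm process.
jaguar_area_40nm = BOBCAT_AREA_40NM * (1 + JAGUAR_GROWTH)

# Assumption: a perfect shrink scales area by (28/40)^2; shown only as a yardstick.
ideal_scale = (28 / 40) ** 2
ideal_area_28nm = jaguar_area_40nm * ideal_scale

# How much of the hypothetical 40nm area the real 28nm core keeps.
actual_scale = JAGUAR_AREA_28NM / jaguar_area_40nm

print(f"Jaguar at 40nm (hypothetical): {jaguar_area_40nm:.2f} mm^2")
print(f"Ideal shrink to 28nm:          {ideal_area_28nm:.2f} mm^2")
print(f"Quoted 28nm area:              {JAGUAR_AREA_28NM:.2f} mm^2")
print(f"Actual area scaling:           {actual_scale:.2f} (ideal {ideal_scale:.2f})")
```

On these figures the quoted 3.1 square millimeters is somewhat larger than a perfect shrink would give, but still well under the 4.9 square millimeters of a 40 nanometer Bobcat core.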