Original URL: https://www.theregister.co.uk/2011/11/14/amd_opteron_4200_6200_launch/

AMD rides Bulldozers into the x86 server chip war

Opteron 6200s AND 4200s enter field against Xeon E5s

By Timothy Prickett Morgan

Posted in The Channel, 14th November 2011 05:01 GMT

SC11 Advanced Micro Devices was expected to launch its "Interlagos" Opteron 6200 server processors about now in conjunction with the SC11 supercomputing conference in Seattle.

But what wasn't known was that AMD was going to kick out the entry eight-core "Valencia" Opteron 4200 processors now, too, rather than do a two-step launch.

AMD took a two-step approach with its prior generation of server chips, rolling the twelve-core "Magny-Cours" Opteron 6100s, the big guns, onto the field of the ongoing x86 server chip war in March 2010 for two-socket and four-socket servers and following up with the six-core "Lisbon" Opteron 4100s for machines with one or two sockets in June 2010. The Opteron 6100s got a deep bin sort and a speed boost in February this year and otherwise it has been all quiet on the Opteron front.

AMD has been giving the Opteron 4200s and 6200s air support before they entered the field, talking about the new design of the "Bulldozer" core and how it will make for better server chips that can meet a widening array of workload, performance, and thermal requirements.

AMD Bulldozer core block diagram

The Bulldozer core: share some things and reduce power draw

The Opteron server chips using the Bulldozer cores are implemented in GlobalFoundries' 32nm, 11-metal layer, high-k metal gate, silicon-on-insulator wafer-baking processes. The former AMD foundry, which was spun out three years ago, has had some trouble ramping up this 32 nanometer process, giving AMD headaches and also meaning it could not meet demand for the PC and server chips based on the Bulldozer cores.

As El Reg explained in detail earlier this year when AMD's techies divulged some secrets about the core design at the IEEE's International Solid-State Circuits Conference, the Bulldozer core module has some components shared across two cores, but also gives each core its own thread (with no simultaneous multithreading). AMD refers to this as having "two strong cores" in contrast to the HyperThreading virtual cores Intel puts in its Core and Xeon processors. Each core – which means an integer unit and a floating point unit – in the Bulldozer module has its own integer unit scheduler and L1 data caches, but the cores share fetch and decode units as well as a floating point scheduler and L2 cache memory.

Each integer unit in each Bulldozer core has four pipelines, capable of executing one instruction per cycle. A Bulldozer core module has two 128-bit floating point units, each of which can do two 64-bit double-precision operations or four 32-bit single-precision operations per clock. If one core is not using its floating point unit during a cycle, then the other core can take all 256 bits and do four double-precision or eight single-precision ops in a single clock cycle.
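The lane counts quoted above fall straight out of the datapath widths. Here is a minimal sketch of that arithmetic; the function and variable names are illustrative only and are not anything AMD publishes.

```python
# Lane arithmetic for the shared floating point hardware in one Bulldozer module.
# The widths come from the article; the names are hypothetical.

FP_UNIT_BITS = 128       # width of each floating point unit
UNITS_PER_MODULE = 2     # two units per two-core module

def ops_per_clock(available_bits: int, operand_bits: int) -> int:
    """Number of operands of a given width that fit in the available datapath."""
    return available_bits // operand_bits

# Each core working with a single 128-bit unit
split_double = ops_per_clock(FP_UNIT_BITS, 64)
split_single = ops_per_clock(FP_UNIT_BITS, 32)

# One core taking both units while the other core leaves its unit idle
fused_bits = FP_UNIT_BITS * UNITS_PER_MODULE
fused_double = ops_per_clock(fused_bits, 64)
fused_single = ops_per_clock(fused_bits, 32)

print(split_double, split_single, fused_double, fused_single)  # 2 4 4 8
```

This only counts operands per clock; it says nothing about how the scheduler decides when one core may borrow the other's unit.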

AMD was originally calling this feature an AVX mode, but is now on announcement day calling it Flex FP. Flex FP does support AVX operations. The floating point unit has new multiply-accumulate functions and also supports a bunch of new instructions, including SSE3, SSE4.1, and SSE4.2 SIMD extensions, on-chip AES encryption/decryption, and PCLMULQDQ, which is used to perform a carry-less multiplication of two 64-bit integers. AMD has also added new instructions called XOP and FMA4, which are tweaks to the 128-bit SSE5 SIMD instructions that make them more compatible with Intel's AVX implementation.

The Bulldozer module has 2MB of L2 cache memory and has a total of 213 million transistors; it has an area of 30.9 square millimeters and is designed to run at between 0.8 and 1.3 volts. Each core in the Bulldozer module has 16KB of data cache and there is 64KB of shared instruction cache per module. The module has 1MB of L2 cache per core (twice that of the prior Opteron 4100 and 6100 chips), and the four-module chip package has a third more L3 cache per chip, at 8MB.

The Bulldozer cores have a new memory controller that can support up to 384GB of memory per socket (up from a too-skinny 128GB with the prior controller) as well as DDR3 memory running at 1.6GHz. AMD says that the new controller can support load-reduced (LR-DIMM) main memory, which allows more memory chips to be packed onto a memory stick, and 1.25 volt (ultra-low-volt) memory will also be supported in addition to the 1.5 volt (regular) and 1.35 volt (low-volt) sticks. The new memory controller has "aggressive power down" and "partial power down" settings as well as memory power capping to keep systems within the thermal envelopes set by administrators.

Here's what the Bulldozer module looks like:

AMD Bulldozer core module zoom

To make an eight-core Valencia Opteron 4200, you put four of these Bulldozer modules on a single piece of silicon and wrap them up with a shared DDR3 main memory controller and 8MB of L3 cache, like this:

AMD Bulldozer Valencia Opteron chip

To make a 16-core Opteron 6200 processor, you put two of these in a single package, like this:

AMD Opteron 6200 die

AMD's double-stuffed Opteron 6200 processor

The one thing that the new Opteron processors do not have is support for PCI-Express 3.0 peripherals, either on the chip itself or in the chipset. The forthcoming "Sandy Bridge-EP" Xeon E5 will have on chip PCI-Express 3.0 controllers, as El Reg revealed back in May.

"If you ask our competitor, PCI-Express 3.0 is a big deal," says John Fruehe, director of product marketing for servers and workstations at AMD. "If you ask anyone else, it doesn't make a stinking difference. The important thing is that PCI-Express 3.0 forces a platform change that only benefits a few select applications. We'll be there when it is relevant. For us, it is more important to time it right than to be first to market."

That is precisely why AMD didn't rush to support DDR3 main memory with the Opterons, or goose the memory controllers with more capacity.

The Interlagos chip has a total of 2.4 billion transistors, which means the Valencia chip has 1.2 billion.
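For those keeping score, the way modules roll up into dies and dies into packages can be tallied in a few lines. This is a rough sketch using only the per-module and per-die figures given in this article; the cache totals are simple sums of the raw figures, the names are made up for illustration, and the six-core parts (which have fewer active cores) are not modelled.

```python
# Roll-up of cores, cache, and transistors from module to die to package.
# All inputs are figures quoted in the article; the names are hypothetical.

MODULES_PER_DIE = 4
CORES_PER_MODULE = 2
L2_MB_PER_MODULE = 2
L3_MB_PER_DIE = 8
INTERLAGOS_TRANSISTORS_MILLION = 2400   # two dies in one package

def package_totals(dies: int) -> dict:
    """Raw core and cache totals for a package built from `dies` dies."""
    modules = dies * MODULES_PER_DIE
    return {
        "cores": modules * CORES_PER_MODULE,
        "l2_mb": modules * L2_MB_PER_MODULE,
        "l3_mb": dies * L3_MB_PER_DIE,
    }

valencia = package_totals(dies=1)     # Opteron 4200
interlagos = package_totals(dies=2)   # Opteron 6200

# Splitting the Interlagos transistor count evenly across its two dies
transistors_per_die_million = INTERLAGOS_TRANSISTORS_MILLION // 2

print(valencia)
print(interlagos)
print(transistors_per_die_million)    # 1200
```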

The x86 lowdown

The Opteron 4200 chips have three x16 HyperTransport 3.0 links per socket running at 6.4GT/sec. They plug into the same C32 sockets as the Opteron 4100s, and the AMD chipsets used in conjunction with the chips can support one or two sockets in a single system image.

AMD has shifted back to the thermal design point (TDP) ratings preferred by Intel to categorize its Xeon processors, and the Opteron 4200s are available in 95 watt standard parts, 65 watt Highly Efficient (HE) parts, and 35 watt Extremely Efficient (EE) parts. The HE and EE parts run at lower voltages and lower clock speeds and therefore dissipate a lot less heat. The TDP ratings are in the same thermal bands as the 75 watt, 50 watt, and 32 watt ACP-rated Opteron 4100s. The chips come with six or eight cores, with the plain-vanilla chips running at between 2.7GHz and 3GHz. The Opteron 4200 HE parts spin at 2.5GHz with eight cores and 2.8GHz with six cores, and the eight-core EE part runs at 1.6GHz and hence doesn't run very hot.

But the new Turbo Core mode gives customers a potential 300MHz to 500MHz of extra performance if there is enough TDP headroom in the system, and if half the cores are put into a C6 sleep state then the remaining cores in the chip can deliver anywhere from 600MHz to 1GHz of performance over the rated clock speed of each chip. The burst-mode clock speed on the eight-core Opteron 4284 and six-core Opteron 4234 with half of their cores asleep is 3.7GHz. Here's how the Opteron 4200s line up against their predecessors:
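To get a feel for what those Turbo Core uplifts mean, here is a back-of-the-envelope sketch that applies the two quoted ranges to a base clock. The uplift ranges are lineup-wide bounds from the article, not per-part specifications, so the windows printed for the 2.5GHz eight-core HE clock are an illustration of the arithmetic rather than AMD's rated turbo speeds for that chip. The function name is hypothetical.

```python
# Turbo Core headroom, kept in integer MHz to avoid floating point rounding.
# Uplift ranges are the ones quoted in the article; they are bounds across
# the lineup, not a specification for any single part.

ALL_CORES_UPLIFT_MHZ = (300, 500)     # all cores awake, TDP headroom available
HALF_CORES_UPLIFT_MHZ = (600, 1000)   # half the cores parked in C6

def turbo_window_mhz(base_mhz: int, uplift_mhz: tuple) -> tuple:
    """Return the (low, high) clock window for a base clock and an uplift range."""
    low, high = uplift_mhz
    return (base_mhz + low, base_mhz + high)

base = 2500  # the eight-core Opteron 4200 HE clock cited in the article
all_cores_window = turbo_window_mhz(base, ALL_CORES_UPLIFT_MHZ)
half_cores_window = turbo_window_mhz(base, HALF_CORES_UPLIFT_MHZ)

print("all cores awake:", all_cores_window)      # (2800, 3000)
print("half cores in C6:", half_cores_window)    # (3100, 3500)
```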

AMD Opteron 4200 vs 4100

AMD's Opteron 4200 processors versus the Opteron 4100s

The laws of thermodynamics won't let AMD cram two Opteron 4200s in a single package and keep all the same clock speeds - the heat density is too high and you have to slow down the clocks a bit so the chip doesn't melt. The Opteron 6200 package still runs quite a bit hotter than the Opteron 4200, with TDPs ranging from a low of 85 watts to a high of 140 watts with the Special Edition (SE) part.

The Interlagos package, which plugs into the existing G34 socket, also has one extra x16 HT 3.0 link per socket, which allows for the machine to scale up as far as four sockets in a single system image. You can also create two-socket Opteron 6200 machines. The Opteron 6200s have four memory channels per socket and up to a dozen memory sticks for a total of 384GB per socket. That 1.5TB of memory for a four-socket box is a lot more competitive with Xeon alternatives than were the prior Opteron 6100s, which topped out at 512GB for a four-socket box.
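The memory sums are easy enough to check. The sketch below takes a dozen sticks and 384GB per socket as given and works out the rest; the 32GB stick size is inferred from those two numbers rather than stated by AMD, and the names are illustrative.

```python
# Memory capacity arithmetic for Opteron 6200 versus Opteron 6100 boxes.
# Per-socket capacities come from the article; the stick size is inferred.

STICKS_PER_SOCKET = 12
GB_PER_SOCKET_6200 = 384
GB_PER_SOCKET_6100 = 128

# Inferred, not quoted: the capacity each stick needs to reach 384GB per socket
implied_stick_gb = GB_PER_SOCKET_6200 // STICKS_PER_SOCKET

def system_memory_gb(sockets: int, gb_per_socket: int) -> int:
    """Maximum main memory for a machine with the given socket count."""
    return sockets * gb_per_socket

four_socket_6200_gb = system_memory_gb(4, GB_PER_SOCKET_6200)
four_socket_6100_gb = system_memory_gb(4, GB_PER_SOCKET_6100)

print(implied_stick_gb)               # 32
print(four_socket_6200_gb / 1024)     # 1.5 (TB)
print(four_socket_6100_gb)            # 512 (GB)
```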

Here's how the Opteron 6200s stack up against their predecessors:

AMD Opteron 6200 vs 6100

The Opteron 6200s versus their 6100 precursors

Fruehe says that with the Interlagos chips, AMD concentrated on boosting memory throughput and core counts, and pointed out that an Opteron socket has four memory channels running at 1.6GHz compared to three channels running at 1.33GHz for the Intel Xeons. "All of the workloads that love lots of cores, lots of floating point, or lots of memory, are going to run well on these processors."
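Fruehe's channel comparison can be turned into a rough peak-bandwidth figure. The sketch below assumes standard 64-bit (8 byte) DDR3 channels and reads the quoted clock speeds as millions of transfers per second; both are assumptions on our part rather than numbers from AMD, and the results are theoretical peaks, not measured throughput.

```python
# Theoretical peak memory bandwidth per socket.
# Assumptions (not from the article): each DDR3 channel is 64 bits wide,
# and "1.6GHz" / "1.33GHz" mean 1600 and 1333 million transfers per second.

BYTES_PER_TRANSFER = 8  # 64-bit channel

def peak_mb_per_s(channels: int, mega_transfers_per_s: int) -> int:
    """Peak bandwidth in MB/s: channels x transfers per second x bytes per transfer."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER

opteron_peak = peak_mb_per_s(channels=4, mega_transfers_per_s=1600)
xeon_peak = peak_mb_per_s(channels=3, mega_transfers_per_s=1333)

print(opteron_peak)   # 51200 MB/s, about 51.2 GB/s
print(xeon_peak)      # 31992 MB/s, about 32.0 GB/s
```

On these assumptions the Opteron socket has roughly 1.6 times the peak bandwidth on paper; what real workloads see depends on the memory controller and the DIMM configuration.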

The Opteron 6200 processors have been shipping quietly to server OEMs since late August and are available today in volume quantities. (Fruehe says that over a half million Bulldozer cores have already shipped prior to announcement day.) Server vendors are expected to start rolling out support for the chips immediately. The Opteron 4200s will begin shipping in December. ®