Intel puts x64 in a parallel universe
Taking the MIC out of Larrabee
We've all been wondering exactly what Intel would do with various multicore x64 processors that had been designed as co-processors to accelerate graphics and other applications with lots of number-crunching. The answer, as Intel explained at the International Super Computing conference in Hamburg, Germany this week, is simple: Replace lots of standard Xeon processors commonly used in massively parallel supercomputers with many-cored systems on a chip and drop the whole idea of doing discrete graphics cards to compete against Nvidia and Advanced Micro Devices.
At last fall's SC09 supercomputer trade show, Intel's chief technology officer, Justin Rattner, showed off  an experimental graphics co-processor code-named "Larrabee," which was able to hit one teraflops running the SGEMM single precision, dense matrix multiply benchmark test when Intel turned on all the cores and then overclocked it for a short spike. But don't get too excited.
Only a few weeks later, Intel condemned Larrabee to the purgatory of development status  because it was late coming to market. And last week, knowing that everyone would be asking questions about Larrabee's kickers at the ISC event in Hamburg, Intel's PR machine let it be known that it was bowing out  of the discrete graphics processor market, but hinted that the substantial work it had done would be repurposed to make co-processors for HPC accelerating workloads running on x64 processors.
Kirk Skaugen, general manager of Intel's Data Center Group, had to work over the Memorial Day holiday, announcing something Intel is calling the Many Integrated Core architecture, or MIC for short, which Intel is billing as the "industry's first general purpose many core architecture." And rather than having MIC chips be the sole processors in a system, what Intel is going to do is plunk these chips into HPC systems as co-processors, just like Nvidia is doing with Tesla GPUs and Advanced Micro Devices is doing with Stream GPUs.
Here's what the MIC block diagram looks like, and if you remember what the vague statements were made about Larrabee, this sure looks like the same product:
The Many Integrated Core architecture block diagram
Like Larrabee, the MIC chips that Intel plans to bring to market will marry an x64 core and vector processors to some cache and then plunk down lots and lots of these units on a chip with a fast interconnect that creates keeps the caches for each chip coherent (so they can share data quickly and function more or less like a baby parallel supercomputer). The Larrabee design used a superscalar x64 core (without the out-of-order execution of Xeons, so akin to the Atom chip in some respects) and a 512-bit vector math unit that could do 16 floating point operations per clock doing single precision math. Larrabee also had a very wide ring bus for linking all of the cores and the caches together. As you can see from the MIC block diagram, the same elements are in the MIC architecture.
Skaugen said in his keynote at ISC that the MIC chip family would be known as Knights, and he held up what is presumably the same co-processor that Intel showed off at SC09 last fall, which he identified as the Knights Ferry co-processor:
The Intel Knights Ferry HPC co-processor.
This card, which looks exactly like a PCI-Express graphics card, is intended to be used as a software development platform. It has 32 cores, Skaugen divulged, running at 1.2 GHz, and it puts four execution threads on each core for a total of 128 threads. The chip has 8MB of shared coherent cache — that could be a mix of L1 and L2 cache; Skaugen didn't say — and the package comes with 1GB or 2GB of GDDR5 graphics memory for the cores to use as they crunch their numbers.
Yes, it is a graphics card. But it is just a graphics card that no one is ever going to write a graphics driver for — unless some smart asses in the open source community do. Which, by the way, will be possible when the commercial variants of the MIC chips become available. Intel could not stop this, even though it is not promoting MIC chips for use as discrete graphics chips.
No GPU speak required
The important thing about MIC, as far as Intel is concerned, is that the same C, C++, and Fortran compilers and the same developer tools and libraries used by HPC customers who deploy parallel Xeon server clusters will work on the MIC co-processors. There will be different optimizations, of course. But you don't have to speak GPU to make these things work.
At ISC, Skaugen said that the Knights Ferry co-processor used an MIC chip code-named "Aubrey Isle," which you can see below:
You don't need a supercomputer to count the 32 cores on this die. What's weird about this Aubrey Isle chip is how there are seven groups of four cores, and then one group that seems to be scattered around the die near what seems to be interconnect electronics. If I had to guess — and I have to — the vector unit is on top of the die (you can see 16 splotches, each one capable of doing one floating point operation) and the four cores are below them (the squares with the dark edges).
The rest of each core seems to be L1 and L2 cache memory, and it is a fair guess that the coherent L2 cache is made up of a segment of L1 cache that is shared by all the cores. The two horizontal stripes would therefore implement the ring interconnect lashing the cores together. No word on how big this Aubrey Isle beast is or what process it was implemented in, but presumably it is made using Intel's current 32 nanometer processes and is too damned hot to be put into production.
Skaugen said at ISC that Intel will be ramping up production on the Knights Ferry development co-processor throughout 2010. It is the Knights Corner co-processors that will put what was once Larrabee into a proper device aimed at real HPC shops. Knights Corner appears to be the name of the entire device, not the chip, just as Aubrey Isle was the name of the chip used in the Knights Ferry co-processor. The chip inside the Knights Corner co-processor will be implemented in a 22 nanometer process and will have at least 50 of the x64 cores on them, plus an unspecified number of vector processors. It is fairly likely that Intel is designing 64 cores onto the chip, and then — yields being what they are on massive chips — cores with boogers in them will be deactivated and customers will get what they get.
Similarly, Nvidia's "Fermi" graphics co-processors were designed  with 512 cores, but when the machines came out earlier this year, the yields were such that Nvidia could only pump out chips with 448 working cores . The flops were more or less the same, probably because Nvidia cranked up the clocks, which it could do with 12.5 per cent of the cores being duds.
Intel was pretty vague about what kind of performance to expect from the Knights family of GPUs, but you can bet the marketing angle is that Intel does not have to offer the same single- or double-precision flops as AMD or Nvidia with its graphics co-processors because of the ease of programming that comes from using a co-processor based on the x64 instruction set. Intel was bragging at ISC that researchers at CERN were able to port a "complex C++ parallel benchmark" to the MIC software stack and experimental processor in "just a few days."
In its press release about the Knights family of co-processors, Intel said that the MIC architecture would accelerate "select highly parallel applications" but that the "vast majority of workloads will still run best on award-winning Intel Xeon processors." Particularly, a cynic would say, if Intel tries to charge $10,000 for one of these Knights. Then it is checkmate for the whole idea. In any event, Intel won't have the full MIC software development kit ready until sometime in the second half of 2010, and other chips, like the "Ivy Bridge " Xeons, are most likely at the front of the line for the 22 nanometer wafer baking next year.
By the way, Skaugen said at ISC that the next-generation "Sandy Bridge" Xeons, due by the end of the year , would have "significantly greater performance" than the current Westmere and Nehalem Xeons, with higher core counts and HyperThreading boosting performance. With the addition of new AVX vector math instructions, Sandy Bridge Xeons will be able to process twice the flops per clock as the current Xeons, in fact.
The word on the street is that Sandy Bridge Xeons will have 4, 6, or 8 cores and clock speeds of between 2.8 GHz and 3.4 GHz, not including Turbo Boost overclocking. With the AVX units, Sandy Bridge chips will do eight double-precision flops per clock per core, so call it 192 gigaflops with eight cores running at 3 GHz and assume we are talking a 130-watt power envelope.
By comparison, a Knights co-processor with 50 cores and running at maybe 1.5 GHz could have as much as 2 teraflops of single-precision floating point performance. It is unclear if it will be able to run double precision calculations with any speed, but such a chip supporting 1 teraflops of double precision oomph would be compelling to a lot of HPC shops with lots of x64 code. ®