Intel puts x64 in a parallel universe
Taking the MIC out of Larrabee
No GPU speak required
The important thing about MIC, as far as Intel is concerned, is that the same C, C++, and Fortran compilers and the same developer tools and libraries used by HPC customers who deploy parallel Xeon server clusters will work on the MIC co-processors. There will be different optimizations, of course. But you don't have to speak GPU to make these things work.
At ISC, Skaugen said that the Knights Ferry co-processor used an MIC chip code-named "Aubrey Isle," which you can see below:
You don't need a supercomputer to count the 32 cores on this die. What's weird about this Aubrey Isle chip is how there are seven groups of four cores, and then one group that seems to be scattered around the die near what seems to be interconnect electronics. If I had to guess — and I have to — the vector unit is on top of the die (you can see 16 splotches, each one capable of doing one floating point operation) and the four cores are below them (the squares with the dark edges).
The rest of each core seems to be L1 and L2 cache memory, and it is a fair guess that the coherent L2 cache is made up of a segment of L1 cache that is shared by all the cores. The two horizontal stripes would therefore implement the ring interconnect lashing the cores together. No word on how big this Aubrey Isle beast is or what process it was implemented in, but presumably it is made using Intel's current 32 nanometer processes and is too damned hot to be put into production.
Skaugen said at ISC that Intel will be ramping up production on the Knights Ferry development co-processor throughout 2010. It is the Knights Corner co-processors that will put what was once Larrabee into a proper device aimed at real HPC shops. Knights Corner appears to be the name of the entire device, not the chip, just as Aubrey Isle was the name of the chip used in the Knights Ferry co-processor. The chip inside the Knights Corner co-processor will be implemented in a 22 nanometer process and will have at least 50 of the x64 cores on it, plus an unspecified number of vector processors. It is fairly likely that Intel is designing 64 cores onto the chip, and then, yields being what they are on massive chips, cores with boogers in them will be deactivated and customers will get what they get.
Similarly, Nvidia's "Fermi" graphics co-processors were designed with 512 cores, but when the machines came out earlier this year, the yields were such that Nvidia could only pump out chips with 448 working cores. The flops were more or less the same, probably because Nvidia cranked up the clocks, which it could do with 12.5 per cent of the cores being duds.
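For the back-of-envelope crowd, here is a rough sketch of that Fermi math in Python. It assumes that peak flops scale linearly with core count times clock speed, which is a simplification, and the variable names are made up for illustration. Nvidia's actual clock speeds are not given here, so the clock bump it prints is only what that assumption implies, not a reported figure.

```python
# Core counts as cited in the article for Nvidia's "Fermi" chips.
designed_cores = 512
shipping_cores = 448

# Fraction of the designed cores that ended up switched off.
dud_fraction = (designed_cores - shipping_cores) / designed_cores

# Assumption: peak flops ~ cores * clock. To hold flops constant with
# fewer cores, the clock would have to rise by this factor.
clock_scale = designed_cores / shipping_cores

print(f"Dud cores: {dud_fraction:.1%}")                       # 12.5%
print(f"Clock bump to break even: {clock_scale - 1:.1%}")     # 14.3%
```

In other words, under that linear assumption, losing an eighth of the cores takes a clock increase of about a seventh to come out even.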
Intel was pretty vague about what kind of performance to expect from the Knights family of co-processors, but you can bet the marketing angle is that Intel does not have to offer the same single- or double-precision flops as AMD or Nvidia with their graphics co-processors because of the ease of programming that comes from using a co-processor based on the x64 instruction set. Intel was bragging at ISC that researchers at CERN were able to port a "complex C++ parallel benchmark" to the MIC software stack and experimental processor in "just a few days."
In its press release about the Knights family of co-processors, Intel said that the MIC architecture would accelerate "select highly parallel applications" but that the "vast majority of workloads will still run best on award-winning Intel Xeon processors." Particularly, a cynic would say, if Intel tries to charge $10,000 for one of these Knights. Then it is checkmate for the whole idea. In any event, Intel won't have the full MIC software development kit ready until sometime in the second half of 2010, and other chips, like the "Ivy Bridge" Xeons, are most likely at the front of the line for the 22 nanometer wafer baking next year.
By the way, Skaugen said at ISC that the next-generation "Sandy Bridge" Xeons, due by the end of the year, would have "significantly greater performance" than the current Westmere and Nehalem Xeons, with higher core counts and HyperThreading boosting performance. With the addition of new AVX vector math instructions, Sandy Bridge Xeons will in fact be able to process twice as many flops per clock as the current Xeons.
The word on the street is that Sandy Bridge Xeons will have 4, 6, or 8 cores and clock speeds of between 2.8 GHz and 3.4 GHz, not including Turbo Boost overclocking. With the AVX units, Sandy Bridge chips will do eight double-precision flops per clock per core, so call it 192 gigaflops with eight cores running at 3 GHz and assume we are talking a 130-watt power envelope.
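That 192 gigaflops figure is simple multiplication, and a minimal Python sketch makes the working explicit. The function name `peak_gigaflops` is invented for illustration, and the inputs are the rumored and assumed numbers above (eight cores, 3 GHz, eight double-precision flops per clock per core, 130 watts), not confirmed Intel specs. This is theoretical peak, not what real code will sustain.

```python
def peak_gigaflops(cores, ghz, flops_per_clock_per_core):
    """Theoretical peak: cores x clock (GHz) x flops per clock per core."""
    return cores * ghz * flops_per_clock_per_core

# Rumored Sandy Bridge Xeon figures from the paragraph above.
sandy_bridge = peak_gigaflops(cores=8, ghz=3.0, flops_per_clock_per_core=8)

# Assumed power envelope, also from the paragraph above.
watts = 130

print(f"{sandy_bridge:.0f} gigaflops peak")                  # 192 gigaflops peak
print(f"{sandy_bridge / watts:.2f} gigaflops per watt")      # 1.48 gigaflops per watt
```

Plug in a different core count or clock speed from the rumored ranges and the peak number moves in direct proportion.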
By comparison, a Knights co-processor with 50 cores and running at maybe 1.5 GHz could have as much as 2 teraflops of single-precision floating point performance. It is unclear if it will be able to run double precision calculations with any speed, but such a chip supporting 1 teraflops of double precision oomph would be compelling to a lot of HPC shops with lots of x64 code. ®