Original URL: https://www.theregister.com/2011/09/15/intel_rattner_mic_coprocessor/

Intel shows linear scaling with MIC coprocessor

Army of Pentiums march in lockstep

By Timothy Prickett Morgan

Posted in HPC, 15th September 2011 21:00 GMT

IDF 2011 The chip world is moving from multicore to many-core, says Intel chief technology officer Justin Rattner. In certain circles, such talk makes sense – a many-core chip includes many more cores than a multicore chip – and according to Rattner, the transition to a many-core world won't be as difficult as expected.

At the Intel Developer Forum in San Francisco, Rattner showed off a number of parallel-processing technologies the chip maker has in store to boost the performance of applications that like to chew on threads – while consuming ever less electricity to do that processing.

Intel CTO Justin Rattner

Rattner reminded attendees that it was only five years ago at IDF that he gave the keynote address introducing the Core 2 Duo dual-core processor, and now Intel is delivering four-core i7 chips for desktops and Xeon processors with six and ten cores, with more on the way with the "Sandy Bridge" generation. "It seems like multicore just took off," Rattner said. "Now, we are just at the beginning of the age of many-core processors."

In the aftermath of pulling the plug on the "Larrabee" GPU coprocessor, which was based on streamlined Pentium cores and a ring interconnect like the one that will be employed in future Xeon and Itanium processors, Intel has revamped the project to create coprocessors that will not do graphics but will absorb x64 calculations offloaded from the CPU, much as Nvidia and AMD are trying to do with their GPUs and their respective CUDA and OpenCL programming tools.

The difference between what Intel is doing and what Nvidia and AMD are doing is simple: Intel is sticking to the x64 architecture and tweaking its compilers so applications can automatically dispatch code to a Many Integrated Core (MIC) coprocessor, as the "Knights" family of coprocessors has been branded.

The "Knights Ferry" coprocessor is an experimental software-development platform based on the Larrabee design that Intel has let selected partners and customers play with since last year. Knights Ferry, which Intel talked up at the International Supercomputing Conference back in June, has 32 superscalar cores (without out-of-order execution) and a 512-bit vector math unit that can do 16 floating point operations per clock with single precision. Double precision, which is required for many workloads, was not an option, and that is probably the best reason why Intel has not been in a terrific hurry to commercialize Knights Ferry.

The follow-on, called "Knights Corner", was also trotted out at ISC, and Intel said back then that it would have more than 50 cores and be implemented in the 22 nanometer Tri-Gate process that will be used in the future "Ivy Bridge" Xeons and "Kittson" Itaniums.

Intel has never said how many cores the Knights Corner MIC chip will have or what clock speed they will run at – but it's going to be 64 cores running at 1.2GHz to 1.6GHz, based on various presentations that are making the rounds at IDF this week. At that core count, and assuming some of the cores end up being duds because of the ramp of the 22nm process, Intel should be able to get 2 teraflops single precision and 1 teraflops double precision per coprocessor.
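As a rough check on that arithmetic, here is a back-of-the-envelope sketch in Python. It assumes the 512-bit vector unit carries over from Knights Ferry and that each vector lane retires one fused multiply-add (two floating point operations) per clock; Intel has confirmed neither for Knights Corner. The function name and the figure of 52 working cores are purely illustrative.

```python
def peak_gflops(cores, clock_ghz, vector_bits=512, element_bits=32, flops_per_lane=2):
    """Theoretical peak in gigaflops: cores x lanes x flops per lane per clock x GHz.

    flops_per_lane=2 is an assumption (one fused multiply-add per lane per clock).
    """
    lanes = vector_bits // element_bits  # 16 lanes at 32 bits, 8 lanes at 64 bits
    return cores * lanes * flops_per_lane * clock_ghz


# 64 is the rumoured core count; 52 is a hypothetical count after dud cores are fused off.
for cores in (64, 52):
    single = peak_gflops(cores, 1.2)
    double = peak_gflops(cores, 1.2, element_bits=64)
    print(f"{cores} cores at 1.2GHz: {single / 1000:.2f} TF single, {double / 1000:.2f} TF double")
```

Under those assumptions, a full 64-core part at the bottom of the rumoured clock range would peak at a little under 2.5 teraflops single precision, while a part with around 52 working cores would land at roughly 2 teraflops single and 1 teraflops double. These are theoretical peaks, not measured results.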

In fact, Rattner himself showed off MIC's scalability during his keynote, running a bunch of different workloads on the Knights Corner coprocessor and showing how they scaled as anywhere from 1 to 64 of its cores were activated.

Scalability of various benchmarks on a MIC

As you can see, the scalability of these parallelized workloads is pretty good up through 32 cores, and doesn't do badly on a bunch of them all the way up to 64 cores.

"So far, people are getting great results and, most importantly for us, it is an easy transition from multicore to many core," Rattner bragged. The reason is that the memory model and instruction set of the MIC coprocessors are familiar to programmers who design code to squeeze the bits out of multicore Xeon server chips.

These results jibe with what the boffins at CERN have been seeing with the 32-core Knights Ferry experimental coprocessor. Andrzej Nowak, who runs CERN's Openlab, said that the Trackfitter application, which is used to simulate particle collisions on the Large Hadron Collider and compare them to actual results when real particles smack into each other at near-light speeds, comprises 2 million lines of code and runs across a network of servers that have an aggregate of 250,000 x64 cores.

Nowak demonstrated how one code snippet from Trackfitter had been ported to MIC and then ran it on one of the cores on the coprocessor. It meandered, doing its calculating and rendering at a snail's pace, about 1 per cent of the workload per second. Then all 32 cores kicked on and the run was restarted, and the simulation only took a matter of seconds to complete.

Scalability of CERN's Trackfitter application on MIC coprocessors

CERN, in other words, is seeing essentially linear scalability on the Knights coprocessor. "Developing for MIC becomes a breeze. We can port an application in a matter of days, not months," said Nowak. "We will take any amount of cores you can throw at us," he told Rattner.
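To put numbers on what linear scaling buys, the sketch below uses Amdahl's law, a standard textbook model and not something Nowak or Intel presented. The 100-second single-core runtime is inferred from the "about 1 per cent per second" pace of the demo, and the 5 per cent serial fraction is a made-up figure included only for contrast.

```python
def amdahl_speedup(cores, serial_fraction=0.0):
    """Speedup on `cores` cores when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Assumption: ~1 per cent of the workload per second on one core, so ~100 seconds in total.
single_core_seconds = 100.0

# 0.0 models perfectly linear scaling; 0.05 is a hypothetical serial fraction.
for serial in (0.0, 0.05):
    speedup = amdahl_speedup(32, serial)
    print(f"serial fraction {serial:.0%}: {speedup:.1f}x speedup, "
          f"{single_core_seconds / speedup:.1f} seconds on 32 cores")
```

With no serial work at all, 32 cores cut the run to about three seconds, which squares with the "matter of seconds" seen on stage. Leave even 5 per cent of the code serial and the speedup drops to roughly 12.5x, which is why near-linear results on real physics code are worth bragging about.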

The expectation is that Knights Corner will come out sometime in the second half of 2012, in the wake of the "Sandy Bridge-EP" Xeon E5 processors for two-socket servers. This is important because the Xeon E5 chips have PCI-Express 3.0 controllers built onto the processors and will be able to support up to 80GB/sec of I/O into each socket. These two-socket servers are the workhorses of the IT industry, and are the most popular components in parallel clusters. Having a fast pipe between the Xeon and the MIC is key to application scalability.
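For the curious, the 80GB/sec figure can be roughly reconstructed from the published PCI-Express signalling rates and encoding schemes. The lane count of 40 per socket is an assumption that does not come from this article, and the total counts traffic in both directions at once.

```python
def lane_gbytes_per_sec(gigatransfers, payload_bits, encoded_bits):
    """Usable GB/sec per lane, per direction, after line-encoding overhead."""
    return gigatransfers * payload_bits / encoded_bits / 8


gen2 = lane_gbytes_per_sec(5.0, 8, 10)     # PCI-Express 2.0: 5GT/sec with 8b/10b encoding
gen3 = lane_gbytes_per_sec(8.0, 128, 130)  # PCI-Express 3.0: 8GT/sec with 128b/130b encoding

lanes_per_socket = 40  # assumption: lane count per Xeon E5 socket, not stated in the article

print(f"x16 slot, each way: {16 * gen2:.1f}GB/sec (2.0) vs {16 * gen3:.1f}GB/sec (3.0)")
print(f"{lanes_per_socket} lanes of 3.0, both ways: {2 * lanes_per_socket * gen3:.1f}GB/sec")
```

That works out to a shade under 80GB/sec per socket before protocol overhead, and roughly double the per-slot bandwidth of PCI-Express 2.0.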

After the keynote, El Reg pointed out to Rattner that while PCI-Express 3.0 was certainly faster than PCI-Express 2.0, in the long run what probably made the most sense was to put a MIC chip into a motherboard socket of some type, or better still, the same socket that a Xeon processor uses.

"I think we'll get there eventually," Rattner agreed. "But we had to get to market with something now."

When asked about how the MIC coprocessors might impact sales of Xeon processors, given the better performance per watt and, presumably, lower cost per floating point operation, Rattner said that when it came to HPC workloads, the issue was more complex. Specifically, for the integer part of applications, that Pentium core on the MIC "wouldn't impress anyone" compared to the Xeons, which have very good integer performance. "But in the future, MIC will be based on Atom cores, and then the integer gap between Xeon and MIC will start to close." ®