Intel code guru: Many-core world requires radical rethink

Original URL: https://www.theregister.com/2011/06/20/intels_reinders_on_many_core_coding/

Data-shuffling outranks flops-optimizing

Posted in Devops, 20th June 2011 23:47 GMT

Intel is readying its many-core "Knights Corner" coprocessor for release next year, but programming the 50-core-plus beastie and its 100 or 200 core follow-ons will require a whole new approach to code wrangling.

"Something that might have been only a couple per cent overhead at most in a one- or two-core system may become dominant when you get up to one or two hundred cores," Intel parallelism guru James Reinders told The Register when we sat down with him last week at Intel Lab's Hillsboro, Oregon, headquarters.

From Reinders' point of view, programmers need to switch their thinking from optimizing for the best possible performance measured by pure flops to optimizing instead for efficient data movement.

This new thinking is bubbling to the surface due to Intel's "many integrated core" effort – aka MIC, pronounced "Mike". The 32-core Knights Ferry MIC grew out of the failed Larrabee GPU effort and was relegated to a development platform chip in mid-2009, but Knights Corner is planned for release as a product in 2012, and will have "more than 50 cores" according to Reinders.

MICs amplify the parallelism problems already inherent in multicore chips. For example, when we asked Reinders if two cores in, say, two diagonally opposing corners of a Knights processor would waste a lot of cycles if they needed to communicate with one another, he first asserted that the interprocessor network was "very high-speed", but admitted that "you're right, there's a lot of latency" between distant processors.

The Knights' cores all share a coherent cache. "They're completely coherent with each other," he said. "And they have a snooping protocol and a cache directory, if you will, basis."

But coherency or no coherency, "The key to getting performance on [a MIC] is to have some data, keep it local, and nobody else touches it," Reinders told us.

Keep your data in the 'hood

Every time you need to communicate among MIC cores, you'll pay. "Every once in a while you'll shuffle data between them, and as you expect, there's a cost," Reinders said, offing a homey analogy: "It's the same as if we're working on some document here and someone in another conference room [needs it] and we have to swap documents."

In the many-core world, he said, "there's a lot of really interesting work coming out when people realize 'Hey, we should optimize for minimizing data shuffling instead of computation'."

This focus on data movement is a "hot topic" among university researchers and others. "There's some really neat work out of [the University of California at] Berkeley," he told us, that takes a fresh look at matrix multiplication.

"Everybody would have told you, 'Oh, we've optimized for matrix multiply to death for 25 or 30 years'," he said. But the Berkeleyites thought different. "'What if we optimize for data movement, to reduce it, and the heck with the fact that that'll cause us to do extra computation'," they proposed. "And they've gotten some fantastic speed-ups," Reinders reports. "And everybody looks at this and thinks, 'Why didn't we think of that?'."

That new data-shuffling routine could be simply dumped into a many-core library and accessed by anyone, of course. The real challenge, according to Reinders, is applying that kind of thinking to all many-core programming.

Prioritizing in this way requires forgetting what you've learned and adopting a new mindset – an easier task for new programmers than for those who've been in the trenches for years. "I always laugh at my kids who don't remember phones ever having cords on them, or TVs with dials to turn channels. The rest of us try to shift our thinking and forget that we ever used punch cards," he said.

"Optimizing for flops made sense to us," he said, referring to us old farts who began in the IBM 5081 punch card days. "It's a little hard for those of us who got trained one way to switch, but I don't think it's more complicated."

Lessons from chess

There are other factors involved, as well. "Obviously reducing coherence is one of the hot topics," he said. "By keeping [the cache] fully coherent, are we really hurting ourselves? Can we loosen it?" That's all well and good, he said, but added: "Then, of course, the problem is, can you build a programming model that you can explain to anybody."

From Reinders' point of view, "You don't want it to look like you've reduced coherence, so that's the trick." That trick includes "alpha-beta pruning", which he explained as being a key to creating algorithms such as those used by chess-playing apps to ignore moves that make no sense, such as giving up your king, and biasing against those that rarely make sense, such as giving up your queen.

When asked where the logic for such cache-coherency bias would sit, Reinders admitted that there was no easy answer. "We talk about hardware-software co-design at Intel so much these days," he said. "It's so complicated trying to figure out what hints the software could give us that maybe the hardware could take advantage of. It's never-ending."

When we suggested that in his "flops to spare" world, a chunk of a MIC's 100 or 200 cores could be used solely for cache management, Reinders said that although that was doable, doing so wouldn't really address the core need to fundamentally rethink how computation should be done in a many-core world.

"We think 'Oh, there's something I'm computing. I'll compute it once and put it into a variable'. Well, if you put it in a variable, and then a hundred processors access that variable to get that data, you've got a bottleneck. But if all one hundred of them computed it, no bottleneck," he said.

"Boy, is that foreign to my brain," he confessed.

We suggested that our generation might have to wait for our kids to become programmers before this new way of thinking became the new standard. "I hope not," he chuckled, noting that although today's programmers may have to learn a new mindset, they do have one great advantage over the next generation of code monkeys: experience.

But despite the challenges involved in moving from a flops-first to a data-shuffling–first way of approaching code optimization, Reinders thinks that the benefits of the approaching many-core world will more than worth the discomfort that a new programming mindset may cause to coders. "With MIC," he told us, stumbling over his own words in his enthusiasm. "I can't be more excited... it's exciting... putting a cluster on a chip... it's just fantastic." ®