Intel code guru: Many-core world requires radical rethink

Data-shuffling outranks flops-optimizing

Maximizing your infrastructure through virtualization

Intel is readying its many-core "Knights Corner" coprocessor for release next year, but programming the 50-core-plus beastie and its 100 or 200 core follow-ons will require a whole new approach to code wrangling.

"Something that might have been only a couple per cent overhead at most in a one- or two-core system may become dominant when you get up to one or two hundred cores," Intel parallelism guru James Reinders told The Register when we sat down with him last week at Intel Lab's Hillsboro, Oregon, headquarters.

From Reinders' point of view, programmers need to switch their thinking from optimizing for the best possible performance measured by pure flops to optimizing instead for efficient data movement.

This new thinking is bubbling to the surface due to Intel's "many integrated core" effort – aka MIC, pronounced "Mike". The 32-core Knights Ferry MIC grew out of the failed Larrabee GPU effort and was relegated to a development platform chip in mid-2009, but Knights Corner is planned for release as a product in 2012, and will have "more than 50 cores" according to Reinders.

MICs amplify the parallelism problems already inherent in multicore chips. For example, when we asked Reinders if two cores in, say, two diagonally opposing corners of a Knights processor would waste a lot of cycles if they needed to communicate with one another, he first asserted that the interprocessor network was "very high-speed", but admitted that "you're right, there's a lot of latency" between distant processors.

The Knights' cores all share a coherent cache. "They're completely coherent with each other," he said. "And they have a snooping protocol and a cache directory, if you will, basis."

But coherency or no coherency, "The key to getting performance on [a MIC] is to have some data, keep it local, and nobody else touches it," Reinders told us.

Keep your data in the 'hood

Every time you need to communicate among MIC cores, you'll pay. "Every once in a while you'll shuffle data between them, and as you expect, there's a cost," Reinders said, offing a homey analogy: "It's the same as if we're working on some document here and someone in another conference room [needs it] and we have to swap documents."

In the many-core world, he said, "there's a lot of really interesting work coming out when people realize 'Hey, we should optimize for minimizing data shuffling instead of computation'."

This focus on data movement is a "hot topic" among university researchers and others. "There's some really neat work out of [the University of California at] Berkeley," he told us, that takes a fresh look at matrix multiplication.

"Everybody would have told you, 'Oh, we've optimized for matrix multiply to death for 25 or 30 years'," he said. But the Berkeleyites thought different. "'What if we optimize for data movement, to reduce it, and the heck with the fact that that'll cause us to do extra computation'," they proposed. "And they've gotten some fantastic speed-ups," Reinders reports. "And everybody looks at this and thinks, 'Why didn't we think of that?'."

That new data-shuffling routine could be simply dumped into a many-core library and accessed by anyone, of course. The real challenge, according to Reinders, is applying that kind of thinking to all many-core programming.

Prioritizing in this way requires forgetting what you've learned and adopting a new mindset – an easier task for new programmers than for those who've been in the trenches for years. "I always laugh at my kids who don't remember phones ever having cords on them, or TVs with dials to turn channels. The rest of us try to shift our thinking and forget that we ever used punch cards," he said.

"Optimizing for flops made sense to us," he said, referring to us old farts who began in the IBM 5081 punch card days. "It's a little hard for those of us who got trained one way to switch, but I don't think it's more complicated."

Lessons from chess

There are other factors involved, as well. "Obviously reducing coherence is one of the hot topics," he said. "By keeping [the cache] fully coherent, are we really hurting ourselves? Can we loosen it?" That's all well and good, he said, but added: "Then, of course, the problem is, can you build a programming model that you can explain to anybody."

From Reinders' point of view, "You don't want it to look like you've reduced coherence, so that's the trick." That trick includes "alpha-beta pruning", which he explained as being a key to creating algorithms such as those used by chess-playing apps to ignore moves that make no sense, such as giving up your king, and biasing against those that rarely make sense, such as giving up your queen.

When asked where the logic for such cache-coherency bias would sit, Reinders admitted that there was no easy answer. "We talk about hardware-software co-design at Intel so much these days," he said. "It's so complicated trying to figure out what hints the software could give us that maybe the hardware could take advantage of. It's never-ending."

When we suggested that in his "flops to spare" world, a chunk of a MIC's 100 or 200 cores could be used solely for cache management, Reinders said that although that was doable, doing so wouldn't really address the core need to fundamentally rethink how computation should be done in a many-core world.

"We think 'Oh, there's something I'm computing. I'll compute it once and put it into a variable'. Well, if you put it in a variable, and then a hundred processors access that variable to get that data, you've got a bottleneck. But if all one hundred of them computed it, no bottleneck," he said.

"Boy, is that foreign to my brain," he confessed.

We suggested that our generation might have to wait for our kids to become programmers before this new way of thinking became the new standard. "I hope not," he chuckled, noting that although today's programmers may have to learn a new mindset, they do have one great advantage over the next generation of code monkeys: experience.

But despite the challenges involved in moving from a flops-first to a data-shuffling–first way of approaching code optimization, Reinders thinks that the benefits of the approaching many-core world will more than worth the discomfort that a new programming mindset may cause to coders. "With MIC," he told us, stumbling over his own words in his enthusiasm. "I can't be more excited... it's exciting... putting a cluster on a chip... it's just fantastic." ®

The Power of One eBook: Top reasons to choose HP BladeSystem

More from The Register

next story
Sysadmin Day 2014: Quick, there's still time to get the beers in
He walked over the broken glass, killed the thugs... and er... reconnected the cables*
Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage
Still doing 'forensics' on the caching layer – Beeb digi wonk
SHOCK and AWS: The fall of Amazon's deflationary cloud
Just as Jeff Bezos did to books and CDs, Amazon's rivals are now doing to it
BlackBerry: Toss the server, mate... BES is in the CLOUD now
BlackBerry Enterprise Services takes aim at SMEs - but there's a catch
The triumph of VVOL: Everyone's jumping into bed with VMware
'Bandwagon'? Yes, we're on it and so what, say big dogs
Carbon tax repeal won't see data centre operators cut prices
Rackspace says electricity isn't a major cost, Equinix promises 'no levy'
Disaster Recovery upstart joins DR 'as a service' gang
Quorum joins the aaS crowd with DRaaS offering
prev story


Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Consolidation: The Foundation for IT Business Transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.
Application security programs and practises
Follow a few strategies and your organization can gain the full benefits of open source and the cloud without compromising the security of your applications.
How modern custom applications can spur business growth
Learn how to create, deploy and manage custom applications without consuming or expanding the need for scarce, expensive IT resources.
Securing Web Applications Made Simple and Scalable
Learn how automated security testing can provide a simple and scalable way to protect your web applications.