Intel code guru: Many-core world requires radical rethink
Data-shuffling outranks flops-optimizing
Intel is readying its many-core "Knights Corner" coprocessor for release next year, but programming the 50-core-plus beastie and its 100 or 200 core follow-ons will require a whole new approach to code wrangling.
"Something that might have been only a couple per cent overhead at most in a one- or two-core system may become dominant when you get up to one or two hundred cores," Intel parallelism guru James Reinders told The Register when we sat down with him last week at Intel Lab's Hillsboro, Oregon, headquarters.
From Reinders' point of view, programmers need to switch their thinking from optimizing for the best possible performance measured by pure flops to optimizing instead for efficient data movement.
This new thinking is bubbling to the surface due to Intel's "many integrated core" effort – aka MIC, pronounced "Mike". The 32-core Knights Ferry MIC grew out of the failed Larrabee GPU effort and was relegated to a development platform chip in mid-2009, but Knights Corner is planned for release as a product in 2012, and will have "more than 50 cores" according to Reinders.
MICs amplify the parallelism problems already inherent in multicore chips. For example, when we asked Reinders if two cores in, say, two diagonally opposing corners of a Knights processor would waste a lot of cycles if they needed to communicate with one another, he first asserted that the interprocessor network was "very high-speed", but admitted that "you're right, there's a lot of latency" between distant processors.
The Knights' cores all share a coherent cache. "They're completely coherent with each other," he said. "And they have a snooping protocol and a cache directory, if you will, basis."
But coherency or no coherency, "The key to getting performance on [a MIC] is to have some data, keep it local, and nobody else touches it," Reinders told us.
Keep your data in the 'hood
Every time you need to communicate among MIC cores, you'll pay. "Every once in a while you'll shuffle data between them, and as you expect, there's a cost," Reinders said, offing a homey analogy: "It's the same as if we're working on some document here and someone in another conference room [needs it] and we have to swap documents."
In the many-core world, he said, "there's a lot of really interesting work coming out when people realize 'Hey, we should optimize for minimizing data shuffling instead of computation'."
This focus on data movement is a "hot topic" among university researchers and others. "There's some really neat work out of [the University of California at] Berkeley," he told us, that takes a fresh look at matrix multiplication.
"Everybody would have told you, 'Oh, we've optimized for matrix multiply to death for 25 or 30 years'," he said. But the Berkeleyites thought different. "'What if we optimize for data movement, to reduce it, and the heck with the fact that that'll cause us to do extra computation'," they proposed. "And they've gotten some fantastic speed-ups," Reinders reports. "And everybody looks at this and thinks, 'Why didn't we think of that?'."
That new data-shuffling routine could be simply dumped into a many-core library and accessed by anyone, of course. The real challenge, according to Reinders, is applying that kind of thinking to all many-core programming.
Prioritizing in this way requires forgetting what you've learned and adopting a new mindset – an easier task for new programmers than for those who've been in the trenches for years. "I always laugh at my kids who don't remember phones ever having cords on them, or TVs with dials to turn channels. The rest of us try to shift our thinking and forget that we ever used punch cards," he said.
"Optimizing for flops made sense to us," he said, referring to us old farts who began in the IBM 5081 punch card days. "It's a little hard for those of us who got trained one way to switch, but I don't think it's more complicated."
Lessons from chess
There are other factors involved, as well. "Obviously reducing coherence is one of the hot topics," he said. "By keeping [the cache] fully coherent, are we really hurting ourselves? Can we loosen it?" That's all well and good, he said, but added: "Then, of course, the problem is, can you build a programming model that you can explain to anybody."
From Reinders' point of view, "You don't want it to look like you've reduced coherence, so that's the trick." That trick includes "alpha-beta pruning", which he explained as being a key to creating algorithms such as those used by chess-playing apps to ignore moves that make no sense, such as giving up your king, and biasing against those that rarely make sense, such as giving up your queen.
When asked where the logic for such cache-coherency bias would sit, Reinders admitted that there was no easy answer. "We talk about hardware-software co-design at Intel so much these days," he said. "It's so complicated trying to figure out what hints the software could give us that maybe the hardware could take advantage of. It's never-ending."
When we suggested that in his "flops to spare" world, a chunk of a MIC's 100 or 200 cores could be used solely for cache management, Reinders said that although that was doable, doing so wouldn't really address the core need to fundamentally rethink how computation should be done in a many-core world.
"We think 'Oh, there's something I'm computing. I'll compute it once and put it into a variable'. Well, if you put it in a variable, and then a hundred processors access that variable to get that data, you've got a bottleneck. But if all one hundred of them computed it, no bottleneck," he said.
"Boy, is that foreign to my brain," he confessed.
We suggested that our generation might have to wait for our kids to become programmers before this new way of thinking became the new standard. "I hope not," he chuckled, noting that although today's programmers may have to learn a new mindset, they do have one great advantage over the next generation of code monkeys: experience.
But despite the challenges involved in moving from a flops-first to a data-shuffling–first way of approaching code optimization, Reinders thinks that the benefits of the approaching many-core world will more than worth the discomfort that a new programming mindset may cause to coders. "With MIC," he told us, stumbling over his own words in his enthusiasm. "I can't be more excited... it's exciting... putting a cluster on a chip... it's just fantastic." ®
It'd be nice if everything was so convenient
This sort of coding (concern for data locality in massively-parallel architectures) has been quite normal in supercomputing for decades. And in the GPGPU programming segment for years. And it still is darn hard. The basic problem is that not all applications nicely slot into something where data will conveniently be located locally to where it is needed. Hence Berkeley can come up with a fastest-ever parallel algorithm for matrix multiplication - because they chose a problem (matrix mult) that *can* take full advantage of the way modern multi-core systems work. And it's *still* hard to code! Try doing the same with parallel algorithms that don't conveniently have data locality and you find the system runs hardly any faster than on a normal CPU because memory bottlenecks dominate. Sure, *sometimes* you can get it faster with tricks like "compute the same variable per core", but funnily enough Intel won't be telling us of all the failures too (which I suspect vastly outweigh the successes - nobody can publish failures because it might just be their lack of imagination in solving the problem and so gets rejected by reviewers).
Sorry, adding more cores onto a chip isn't some kind of revolution - it's an admission of defeat. Trying to tell us that it's our fault 'cause we need to get smarter in coding is raw marketing spin.
This is completely wasted on ~100% of commercial software
In that part of the software market, it's all about rapid application development, and sod the efficiency. They rely on Moore's Law to make sure that by the time their software hits customer systems, the computers are powerful enough to cope.
So MIC processors will be completely wasted on commercial boxes, which is where the majority of the systems will be sold.
Even if someone (extremely cleverly) produces an IDE that can generate parallel code to make good use of many-cores, much of the workload that is done is not suited to run in a parallel manner anyway.
Apologies in advance to those that do, but most new programmers nowadays are never taught about registers, how cache works, the actual instruction set that machines use, and I'm sure that there are a lot of people reading even on this site who do not really understand what a coherent cache actually is.
I work with people who are trying to make certain large computer models more parallel, and they are very aware that communication and memory bandwidth is the key. Code that is already parallel tops out at a much smaller number of cores than the current systems that they have available can provide. And the next generation system, which will have still more cores, may not actually run their code much faster than the current one.
But even these people, many who have dedicated their working lives to making large computational models work on top 500 supercomputers, don't really want to have to worry about this level. They rely on the compilers and runtimes to make sensible decisions about how variables are stored, arguments are passed, and inter-thread communication is handled.
And when these decisions are wrong, things get complex. We found recently that a particular vendor optimised matrix-multiplication stomped all over carefully written code by generating threads for all cores in the system, ignoring the fact that all the cores were already occupied running coded separate threads. Ended up with each lock-stepped thread generating many times more threads during the matmul than there were cores, completely trashing the cache, and causing multiple thread context switches. It actually slowed the code down compared to running the non-threaded version of the same routine.
It will be a whole new ball game even for these people who do understand it if they have to start thinking still more about localization of memory, and if they will have difficulty, the average commercial programmer writing in Java or C# won't have a clue!
"Intel parallelism guru "
He's a guru? Doesn't sound like it. Just one example : "Well, if you put it in a variable, and then a hundred processors access that variable to get that data, you've got a bottleneck." - and that's foreign to his brain. This point has been obvious for decades to anyone who has had to deal with as few as two processors in a system.
He's making it out to be a cool new field of thinking, but it is really just re-invention to quite a few old farts out there.