Google ops czar condemns multi-core extremists
Sea of 'wimpy' cores will sink you
Google is the modern-day poster child for parallel computing. It's famous for splintering enormous calculations into tiny pieces that can then be processed across an epic network of machines. But when it comes to spreading workloads across multi-core processors, the company has called for a certain amount of restraint.
With a paper (PDF) soon to be published in IEEE Micro, the IEEE magazine of chip and silicon design, Google Senior Vice President of Operations Urs Hölzle – one of the brains overseeing the web giant's famous back-end – warns against the use of multi-core processors that take parallelization too far. Chips that spread workloads across more energy-efficient but slower cores, he says, may not be preferable to chips with faster but power-hungry cores.
Hölzle sees this as the battle of the "wimpy" cores and the "brawny" cores.
"Slower but energy efficient 'wimpy' cores only win for general workloads if their single-core speed is reasonably close to that of mid-range 'brawny' cores," he says. The problem, he explains, is that wimpy cores run into Amdahl's law (PDF). In essence, Amdahl's law says that when you parallelize only part of a system, there is a limit to performance improvement.
"So why doesn’t everyone want wimpy-core systems?" Hölzle writes. "Because in many corners of the real world, they’re prohibited by law — Amdahl’s law. Even though many Internet services benefit from seemingly unbounded request- and data-level parallelism, such systems aren’t above the law. As the number of parallel threads increases, reducing serialization and communication overheads can become increasingly difficult. In a limit case, the amount of inherently serial work performed on behalf of a user request by slow single-threaded cores will dominate overall execution time."
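Amdahl's law can be sketched in a few lines: if some fraction of a request's work is inherently serial, adding cores runs into a hard ceiling. This is a minimal illustration with made-up numbers, not figures from Hölzle's paper.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's law: overall speedup on n_cores when a given
    fraction of the work is inherently serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even with just 10% serial work, 64 cores gets you nowhere near 64x:
print(round(amdahl_speedup(0.10, 64), 2))   # -> 8.77

# And the ceiling as cores go to infinity is 1/serial_fraction, i.e. 10x:
print(round(amdahl_speedup(0.10, 10**6), 2))
```

The second print shows the limit case Hölzle describes: past a point, the serial portion dominates and extra wimpy cores buy almost nothing.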
When considering "wimpy" cores, he continues, you can't forget the cost of software development. "Wimpy-core systems can require applications to be explicitly parallelized or otherwise optimized for acceptable performance. For example, suppose a Web service runs with a latency of one second per user request, half of it caused by serial CPU time. If we switch to wimpy-core servers, whose single-threaded performance is three times slower, the response time doubles to two seconds and developers might have to spend a substantial amount of effort to optimize the code to get back to the one- second latency."
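Hölzle's worked example is simple arithmetic. The sketch below assumes, as the quote implies, that only the serial CPU portion slows down while the rest of the request's latency stays put.

```python
serial = 0.5    # seconds of serial CPU time per request
other = 0.5     # seconds of everything else (assumed unaffected)
slowdown = 3.0  # wimpy cores are 3x slower single-threaded

latency_brawny = serial + other             # 1.0 s
latency_wimpy = serial * slowdown + other   # the serial half triples

print(latency_brawny, latency_wimpy)  # -> 1.0 2.0
```

The serial half goes from 0.5s to 1.5s, so total latency doubles to two seconds, exactly as in the quote, and developers are left optimizing their way back to the original one-second target.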
The other problem, he says, is that splitting a request across more parallel subtasks pushes up its response time, because the request can only complete when its slowest subtask does. This is why Google's distributed number-crunching platform, MapReduce, isn't suited to real-time calculations. "Often all parallel tasks must finish before a request is completed, and thus the overall response time becomes the maximum response time of any subtask, and more subtasks will push further into the long tail of subtask response times."
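The tail-latency effect is easy to simulate: when a request fans out into N subtasks and must wait for all of them, its latency is the maximum of the subtask latencies, so the mean grows with N even though each subtask's distribution is unchanged. The exponential latency distribution here is an illustrative assumption, not Google's data.

```python
import random

random.seed(0)

def request_latency(n_subtasks, mean_ms=10.0):
    """Latency of a fan-out request: the max over its subtasks,
    each drawn from an exponential with the given mean."""
    return max(random.expovariate(1.0 / mean_ms) for _ in range(n_subtasks))

# Average request latency climbs as the fan-out grows, even though
# every individual subtask still averages 10ms:
for n in (1, 10, 100):
    avg = sum(request_latency(n) for _ in range(10000)) / 10000
    print(n, round(avg, 1))
```

With this distribution the average roughly follows the harmonic series (about 10ms, 29ms, and 52ms), which is the "long tail of subtask response times" Hölzle is pointing at.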
The use of wimpy servers can raise non-CPU hardware costs, he continues, and lower utilization. "Consider the task of allocating a set of applications across a pool of servers as a bin-packing problem — each of the servers is a bin, and we try to fit as many applications as possible into each bin. Clearly that task is harder when the bins are small, because many applications might not completely fill a server and yet use too much of its CPU or RAM to allow a second application to coexist on the same server."
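The bin-packing intuition can be shown with a toy first-fit packer: the same set of application "sizes" wastes more capacity in small bins than in big ones. The sizes and capacities below are made-up illustrative numbers.

```python
def first_fit(sizes, bin_capacity):
    """Pack sizes into bins with the first-fit heuristic;
    return the remaining free space in each bin."""
    bins = []
    for s in sizes:
        for b in range(len(bins)):
            if bins[b] >= s:       # first bin with room wins
                bins[b] -= s
                break
        else:
            bins.append(bin_capacity - s)  # open a new bin
    return bins

apps = [6, 6, 5, 5, 4, 3]  # resource units per application

big = first_fit(apps, bin_capacity=16)   # brawny servers
small = first_fit(apps, bin_capacity=8)  # wimpy servers

def utilization(free, cap):
    return 1 - sum(free) / (len(free) * cap)

print(len(big), utilization(big, 16))     # 2 servers, ~91% utilized
print(len(small), utilization(small, 8))  # 5 servers, ~72% utilized
```

The small-server pool needs more boxes and still runs at lower utilization, because apps that nearly fill a wimpy server leave slivers of capacity nothing else can use.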
Most surprisingly, Hölzle says that extreme parallelization can be less efficient when used on a, well, global scale. "To avoid expensive global communication and global lock contention, local tasks can use heuristics that are based on their local progress only, and such heuristics are naturally more conservative. As a result, local subtasks might execute for longer than they would have if better hints about global progress were available. Naturally, when these computations are partitioned into smaller pieces, this overhead tends to increase."
All this leads the Google man to conclude that spreading calculations across a larger collection of wimpy cores doesn't always make sense. "Although we’re enthusiastic users of multicore systems, and believe that throughput-oriented designs generally beat peak-performance-oriented designs, smaller isn’t always better," he says. "Once a chip’s single-core performance lags by more than a factor of two or so behind the higher end of current-generation commodity processors, making a business case for switching to the wimpy system becomes increasingly difficult because application programmers will see it as a significant performance regression: their single-threaded request handlers are no longer fast enough to meet latency targets.
"So go forth and multiply your cores, but do it in moderation, or the sea of wimpy cores will stick to your programmers’ boots like clay." ®
The workload should match the architecture?
Does the Pope shit in the woods?
>>We can only assume that Google prefers Intel Xeons to AMD Opterons
Well, the difference isn't that great between Intel and AMD, so I'm not so sure about the assumption. My guess is that they are blending the concept of cores with threads and architectures such as the T2, which are multi-thread multi-core, so you get (effectively, as the OS sees it) up to 64 cores per CPU, but the clock speed drops. The servers that use it, such as the T2000, are fantastic for webservers and MySQL when there are lots of threaded processes, but not so good for big number crunchers or software optimised for <20 cores (such as Oracle data warehouse). This is the very reason that Sun (Oracle) have both types of architecture: choose the wrong platform for your workload (as management bean-counters who listen to sales reps do) and you'll leave the admins scratching their heads saying, "Why did you buy this?"
The other reason I think the Intel/AMD comparison is erroneous is the ability to place a hypervisor on a multi-core frame: you can end up with multiple machines in the same footprint for less power, something that's not mentioned in the article.
This sounds like a PR move against ARM with very biased and flawed Google logic...
This sounds like Google helping its CPU supplier with a PR move against ARM, a move that most likely also helps Google keep its CPU costs down.
Google : "Chips that spread workloads across more energy efficient but slower cores, he says, may not be preferable to chips with faster but power hungry cores."
In other words, ARM cores. Yet 2.5GHz ARM cores are very capable, not least when they come with up to 16 cores yet run on only a few watts of power, like the soon-to-be-released Cortex-A15 range of ARM processors. These are a serious threat to AMD and Intel.
Also, I don't totally buy into Amdahl's law, because it dates back to a time when mainframe manufacturers were trying to justify their mainframe costs (and their existence) against what they could foresee was the threat of small networks of small computers, a threat which did end up wiping them out. In other words, the mainframes' "power hungry cores" were killed off by the "more energy efficient but slower cores".
For example, that link to Amdahl's law says the paper was presented at the AFIPS Spring Joint Computer Conference in 1967, with IBM's name on it. It's an IBM-sponsored report, a PR move trying to justify their existence, and we are getting a replay of this kind of battle 40 years later, this time between very low-power CPUs like ARM and the power-hungry x86-based cores.
Not exactly news, this, is it?
You don't have to be a genius to work out that if you've got a limited amount of memory bandwidth and/or a limited amount of heatsinking per socket (or per system), most workloads would get more benefit out of a single higher performance processor than they would out of multiple lower performance processors with the same memory and thermal constraints. That's been a well understood fact ever since real OSes supported symmetric multiprocessing (1980s?), but since most Wintel folk generally don't understand real-computer real-OS concepts it's been convenient to overlook this while clock speeds have been going up.
But it's now several years since Intel hit the clock speed brick wall, and (perhaps understandably) they haven't really got clock speeds much higher since.
Instead, they have to kid the market that multicore buys the customer some benefit, and being as they're Intel, very few people are prepared to stand up and challenge Intel's ridiculous claims.
So, thanks to Google for bringing this up again.