Google reaches into own silicon brain to slash electricity bill

Skunkworks PEGASUS system tramples cloud's ugly power-sucking secret


Google has worked out how to save as much as 20 per cent of its data-center electricity bill by reaching deep into the guts of its infrastructure and fiddling with the feverish silicon brains of its chips.

In a paper entitled "Towards Energy Proportionality for Large-Scale Latency-Critical Workloads", to be presented next week at the ISCA 2014 computer architecture conference, researchers from Google and Stanford University discuss an experimental system named "PEGASUS" that may save Google vast sums of money by helping it cut its electricity consumption.

PEGASUS addresses one of the worst-kept secrets about cloud computing, which is that the computer chips in the gigantic data centers of Google, Amazon, and Microsoft are standing idle for significant amounts of time.

Though all these companies have developed sophisticated technologies to try to increase the utilization of their chips, they all fall short in one way or another.

This means that a substantial amount of the electricity going into their data centers is wasted as it powers compute chips that are either idle or in a state of very low utilization. From an operator's perspective, it's a sucking chest wound in the budget, and from an environmentalist's perspective it's a travesty.

Now Google and Stanford researchers have designed a system that cuts the power the data centers consume without compromising performance.


PEGASUS tunes the power consumption of the chips to fit the task

PEGASUS does this by dialing up and down the power consumption of the processors within Google's servers according to the request-latency requirements – dubbed iso-latency – of any given workload. (For silicon-heads among our readers, the power management tech PEGASUS uses is Running Average Power Limit, or RAPL, which lets you tweak CPU power consumption in increments of 0.125W. The system "sweeps the RAPL power limit at a given load to find the point of minimum cluster power that satisfies the [service-level objective] target.")
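For the silicon-heads, here is a rough sketch of what such a sweep could look like. It is not the researchers' code: the function names, the wattage range, and the toy latency model are all assumptions made for illustration, and a real system would take measurements from live servers rather than call a formula. It also assumes that a lower power cap always means lower power drawn, so the lowest cap that meets the target is the cheapest.

```python
RAPL_STEP_W = 0.125  # cap granularity quoted in the article


def sweep_power_limit(measure_latency_ms, slo_ms, min_w, max_w, step_w=RAPL_STEP_W):
    """Return the lowest power cap (watts) whose latency meets the SLO, or None.

    measure_latency_ms is a hypothetical callback that reports the latency
    observed when the processor is capped at the given wattage.
    """
    steps = int(round((max_w - min_w) / step_w))
    for i in range(steps + 1):
        cap_w = min_w + i * step_w
        if measure_latency_ms(cap_w) <= slo_ms:
            return cap_w
    return None  # no cap in the range is fast enough


def toy_latency_ms(cap_w):
    # Made-up model for illustration only: latency falls as the cap rises.
    return 400.0 / cap_w


best_cap_w = sweep_power_limit(toy_latency_ms, slo_ms=10.0, min_w=20.0, max_w=80.0)
print(best_cap_w)
```

With this toy model the sweep stops at the first cap where latency drops to the 10ms target, which is the "just hard enough" point the policy is after.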

Put another way, PEGASUS makes sure that a processor is working just hard enough to meet the demands of the application running on it, but no harder. "The baseline can be compared to driving a car with sudden stops and starts. iso-latency would then be driving the car at a slower speed to avoid accelerating hard and braking hard," the researchers write. "The second way of operating a car is much more fuel efficient than the first, which is akin to the results we have observed."

Existing power management techniques for large data centers advocate turning off individual servers or even individual cores, but the researchers said this was inefficient. "Even if spare memory storage is available, moving tens of gigabytes of state in and out of servers is expensive and time consuming, making it difficult to react to fast or small changes in load," they explain.


What saving vast amounts of money looks like

Shutting down individual computer cores, meanwhile, doesn't work due to the specific needs of Google's search tech. "A single user query to the front-end will lead to forwarding of the query to all leaf nodes. As a result, even a small request rate can create a non-trivial amount of work on each server. For instance, consider a cluster sized to handle a peak load of 10,000 queries per second (QPS) using 1000 servers," they explain. "Even at 10 per cent load, each of the 1000 nodes are seeing on average one query per millisecond. There is simply not enough idleness to invoke some of the more effective low power modes."
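The arithmetic behind that example is easy to check. The snippet below uses only the figures the researchers quote; the variable names, and the contrast case without fan-out, are ours.

```python
peak_qps = 10_000      # cluster sized for this many queries per second
servers = 1_000        # leaf nodes in the cluster
load_percent = 10      # the "10 per cent load" case

# Queries arriving at the front-end each second at this load.
cluster_qps = peak_qps * load_percent // 100

# Every query is forwarded to all leaf nodes, so each leaf sees the full rate.
per_leaf_qps = cluster_qps
mean_gap_ms = 1000 / per_leaf_qps

# For contrast: if each query went to just one server, a leaf would see far less.
per_leaf_qps_without_fanout = cluster_qps / servers

print(per_leaf_qps, mean_gap_ms, per_leaf_qps_without_fanout)
```

A thousand queries a second per leaf means one every millisecond on average, which is the gap the researchers say is too short for the deeper low-power modes.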

So, PEGASUS, which stands for Power and Energy Gains Automatically Saved from Underutilized Systems, has been created. The tech "is a dynamic, feedback-based controller that enforces the iso-latency policy." It tweaks the power to the chip according to the task it's running, making sure not to violate any service-level agreements on latency.
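The article does not spell out the controller's rules, so the following is only a minimal sketch of the general idea of a feedback loop on latency. The step size matches the 0.125W RAPL increment mentioned earlier; the slack threshold, the wattage limits, and the toy latency model are assumptions, not details from the paper.

```python
def next_power_cap(cap_w, latency_ms, target_ms, min_w, max_w,
                   step_w=0.125, slack=0.8):
    """One control step: raise the cap if latency is over target,
    lower it if there is comfortable headroom, otherwise hold."""
    if latency_ms > target_ms:
        cap_w += step_w          # too slow: give the chip more power
    elif latency_ms < slack * target_ms:
        cap_w -= step_w          # plenty of headroom: save power
    return max(min_w, min(max_w, cap_w))


def toy_latency_ms(cap_w):
    # Made-up model for illustration only: latency falls as the cap rises.
    return 400.0 / cap_w


cap_w = 80.0  # start at full power
for _ in range(1000):
    cap_w = next_power_cap(cap_w, toy_latency_ms(cap_w),
                           target_ms=10.0, min_w=20.0, max_w=80.0)
print(cap_w, toy_latency_ms(cap_w))
```

The slack band is there so the cap does not bounce up and down on every measurement; a production controller would presumably need to be far more careful about noisy latency readings and sudden load spikes.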

During tests on production Google workloads, the researchers found that PEGASUS saved as much as 30 per cent of power compared to a non-PEGASUS system during times of low demand, and 11 per cent of total energy over a 24-hour period. The team also evaluated it on a "full scale, production cluster for search at Google", aka the company's crown-jewel workload. Here, PEGASUS did less well, saving between 10 per cent and 20 per cent during low utilization periods. This is due to the way it applied one policy across thousands of servers without taking into account variations between chips.

A potential solution to this is to distribute the PEGASUS controller so that it lives on each node and applies latency policy from there.
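A toy comparison shows why a per-node controller could help. Everything here is hypothetical: two leaves with made-up latency curves, and the power cap used as a stand-in for power actually drawn. With a single cluster-wide setting, the cap has to be high enough for the slowest leaf; with per-leaf control, the faster leaf can run lower.

```python
def min_cap_for(latency_fn, target_ms, caps_w):
    """Lowest cap in caps_w (ascending) at which the leaf meets the target."""
    for cap_w in caps_w:
        if latency_fn(cap_w) <= target_ms:
            return cap_w
    raise ValueError("no cap in range meets the latency target")


# Candidate caps from 20W to 80W in 0.125W steps.
caps_w = [20.0 + 0.125 * i for i in range(481)]
target_ms = 10.0

# Made-up leaves: the second one needs more power for the same latency.
leaves = {
    "leaf_a": lambda cap_w: 300.0 / cap_w,
    "leaf_b": lambda cap_w: 500.0 / cap_w,
}

per_leaf_cap_w = {name: min_cap_for(fn, target_ms, caps_w)
                  for name, fn in leaves.items()}

# One setting for everyone must satisfy the neediest leaf.
uniform_cap_w = max(per_leaf_cap_w.values())
uniform_total_w = uniform_cap_w * len(leaves)
per_leaf_total_w = sum(per_leaf_cap_w.values())

print(per_leaf_cap_w, uniform_total_w, per_leaf_total_w)
```

The gap between the two totals is the kind of saving a cluster-wide policy leaves on the table when chips differ.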

"The solution to the hot leaf problem is fairly straightforward: implement a distributed controller on each server that keeps the leaf latency at a certain latency goal," the researchers write.

In the real world, as any grizzled veteran of distributed systems can tell you, implementing any kind of distributed controller scheme is inviting a world of confusion and pain into your data center – but Google isn't the real world, it's a gold-plated organization that can fund the necessary engineers to keep a distributed scheme like this working.

If PEGASUS were to be implemented in a distributed way, the researchers reckon it could save up to 35 per cent of power over the baseline – a huge saving for Google.

As is typical with Google, the paper gives no details of whether PEGASUS has been deployed across Google's infrastructure in production, but given these power savings and the substantial amount of work Google has invested in the scheme, we reckon it's likely. Google did not respond to questions.

"Overall, iso-latency provides a significant step forward towards the goal of energy proportionality for one of the challenging classes of large-scale, low-latency workloads," the researchers write.

The deployment of complex systems like PEGASUS alongside other advanced Google technologies such as OMEGA (cluster management), SPANNER (distributed DBMS), or CPI2 (thread-level performance monitoring) enables Google to make its data centers dramatically more efficient than those operated by smaller, less sophisticated companies. These technologies will, over time, help Google compete in public cloud with rivals such as Amazon and Microsoft, while serving more ads at a lower cost than before.

Ride on, PEGASUS, ride on. ®
