Google reaches into own silicon brain to slash electricity bill

Skunkworks PEGASUS system tramples cloud's ugly power-sucking secret

Google has worked out how to save as much as 20 per cent of its data-center electricity bill by reaching deep into the guts of its infrastructure and fiddling with the feverish silicon brains of its chips.

In a paper entitled "Towards Energy Proportionality for Large-Scale Latency-Critical Workloads", to be presented next week at the ISCA 2014 computer architecture conference, researchers from Google and Stanford University discuss an experimental system named "PEGASUS" that may save Google vast sums of money by helping it cut its electricity consumption.

PEGASUS addresses one of the worst-kept secrets about cloud computing, which is that the computer chips in the gigantic data centers of Google, Amazon, and Microsoft are standing idle for significant amounts of time.

Though all these companies have developed sophisticated technologies to try to increase the utilization of their chips, they all fall short in one way or another.

This means that a substantial amount of the electricity going into their data centers is wasted as it powers compute chips that are either idle or in a state of very low utilization. From an operator's perspective, it's a sucking chest wound in the budget, and from an environmentalist's perspective it's a travesty.

Now Google and Stanford researchers have designed a system that increases the efficiency of the power consumption of the data centers without compromising performance.


PEGASUS tunes the power consumption of the chips to fit the task

PEGASUS does this by dialing up and down the power consumption of the processors within Google's servers according to the desired request-latency requirements – dubbed iso-latency – of any given workload. (For silicon-heads among our readers, the power management tech PEGASUS uses is Running Average Power Limit, or RAPL, which lets you tweak CPU power consumption in increments of 0.125W. The system "sweeps the RAPL power limit at a given load to find the point of minimum cluster power that satisfies the [service-level objective] target.")
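For the silicon-heads, the sweep described above might look roughly like the sketch below. It is an illustration only, not the researchers' code: `measure_latency_ms` is a hypothetical probe standing in for a real latency measurement at a given power cap, the toy latency model is made up, and the sketch assumes that a lower cap always means lower power draw.

```python
from typing import Callable, Optional

RAPL_STEP_W = 0.125  # power-limit granularity quoted in the article


def sweep_power_limit(
    measure_latency_ms: Callable[[float], float],  # hypothetical probe, not a real API
    slo_ms: float,
    min_w: float,
    max_w: float,
) -> Optional[float]:
    """Return the lowest power cap (in watts) whose latency meets the SLO."""
    steps = int(round((max_w - min_w) / RAPL_STEP_W))
    for i in range(steps + 1):
        cap_w = min_w + i * RAPL_STEP_W
        # Sweep upwards from the lowest cap; the first cap that meets the
        # latency target is the cheapest one under our assumption.
        if measure_latency_ms(cap_w) <= slo_ms:
            return cap_w
    return None  # no cap in the range satisfies the target


def toy_latency_ms(cap_w: float) -> float:
    """Made-up model: latency falls as the chip is given more power."""
    return 1000.0 / cap_w


best_cap = sweep_power_limit(toy_latency_ms, slo_ms=25.0, min_w=20.0, max_w=80.0)
unreachable = sweep_power_limit(toy_latency_ms, slo_ms=1.0, min_w=20.0, max_w=80.0)
```

With this toy model the sweep settles on the smallest cap at which latency drops to the target, and returns `None` when the target cannot be met at any cap in the range.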

Put another way, PEGASUS makes sure that a processor is working just hard enough to meet the demands of the application running on it, but no harder. "The baseline can be compared to driving a car with sudden stops and starts. iso-latency would then be driving the car at a slower speed to avoid accelerating hard and braking hard," the researchers write. "The second way of operating a car is much more fuel efficient than the first, which is akin to the results we have observed."

Existing power management techniques for large data centers advocate turning off individual servers or even individual cores, but the researchers said this was inefficient. "Even if spare memory storage is available, moving tens of gigabytes of state in and out of servers is expensive and time consuming, making it difficult to react to fast or small changes in load," they explain.


What saving vast amounts of money looks like

Shutting down individual computer cores, meanwhile, doesn't work due to the specific needs of Google's search tech. "A single user query to the front-end will lead to forwarding of the query to all leaf nodes. As a result, even a small request rate can create a non-trivial amount of work on each server. For instance, consider a cluster sized to handle a peak load of 10,000 queries per second (QPS) using 1000 servers," they explain. "Even at 10 per cent load, each of the 1000 nodes are seeing on average one query per millisecond. There is simply not enough idleness to invoke some of the more effective low power modes."
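The arithmetic behind that quote is worth spelling out. The short sketch below uses the figures from the researchers' example; the function and key names are ours, and the "no fan-out" figure is included only as a contrast, assuming queries were instead spread evenly across servers.

```python
def per_leaf_rate(peak_qps: int, load_percent: int, servers: int) -> dict:
    """Work out how busy each leaf node is at a given fraction of peak load."""
    cluster_qps = peak_qps * load_percent // 100
    return {
        # Queries per second arriving at the front-end.
        "cluster_qps": cluster_qps,
        # With fan-out, every query reaches every leaf, so each leaf sees
        # the full cluster rate; dividing by 1000 converts it to per-millisecond.
        "fanout_queries_per_ms": cluster_qps / 1000,
        # Contrast case: if queries were split evenly with no fan-out.
        "no_fanout_qps_per_server": cluster_qps / servers,
    }


rates = per_leaf_rate(peak_qps=10_000, load_percent=10, servers=1000)
```

At 10 per cent load the cluster takes 1,000 queries per second, and because each one is forwarded to all leaves, every server still handles about one query per millisecond. Without fan-out, each server would see roughly one query per second, which would leave far more room for deep sleep states.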

So, PEGASUS, which stands for Power and Energy Gains Automatically Saved from Underutilized Systems, has been created. The tech "is a dynamic, feedback-based controller that enforces the iso-latency policy." It tweaks the power to the chip according to the task it's running, making sure not to violate any service-level agreements on latency.
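The article does not give PEGASUS's actual control rules, so the following is only a minimal sketch of what a feedback-based power controller could look like. The function name, the `slack` threshold, the one-watt step, and the toy latency model are all assumptions made for illustration.

```python
def next_power_cap(
    cap_w: float,
    latency_ms: float,
    slo_ms: float,
    min_w: float,
    max_w: float,
    slack: float = 0.85,   # assumed headroom threshold, not from the paper
    step_w: float = 1.0,   # assumed adjustment step, not from the paper
) -> float:
    """One feedback step: nudge the power cap based on observed latency."""
    if latency_ms > slo_ms:
        cap_w += step_w   # too slow: give the chip more power
    elif latency_ms < slack * slo_ms:
        cap_w -= step_w   # comfortable headroom: trim power
    # Otherwise latency is close to the target, so hold the cap steady.
    return max(min_w, min(max_w, cap_w))


def toy_latency_ms(cap_w: float) -> float:
    """Made-up model: latency falls as the chip is given more power."""
    return 1000.0 / cap_w


cap = 80.0  # start at full power
for _ in range(100):
    cap = next_power_cap(cap, toy_latency_ms(cap), slo_ms=25.0, min_w=20.0, max_w=80.0)
```

In this toy run the cap drifts down from full power and then holds once latency sits just under the target, which is the "just hard enough, but no harder" behaviour the researchers describe.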

During tests on production Google workloads, the researchers found that PEGASUS cut power consumption by as much as 30 per cent compared to a non-PEGASUS system during times of low demand, and delivered total energy savings of 11 per cent over a 24-hour period. The team also evaluated it on a "full scale, production cluster for search at Google", aka the company's crown jewel workload. Here, PEGASUS did marginally less well, saving between 10 per cent and 20 per cent during low utilization periods. This is due to the way it applied policy across the thousands of servers without taking into account variations between chips.

A potential solution to this is to distribute the PEGASUS controller so that it lives on each node and applies latency policy from there.

"The solution to the hot leaf problem is fairly straightforward: implement a distributed controller on each server that keeps the leaf latency at a certain latency goal," the researchers write.

In the real world, as any grizzled veteran of distributed systems can tell you, implementing any kind of distributed controller scheme is inviting a world of confusion and pain into your data center – but Google isn't the real world, it's a gold-plated organization that can fund the necessary engineers to keep a distributed scheme like this working.

If PEGASUS were to be implemented in a distributed way, the researchers reckon it could save up to 35 per cent of power over the baseline – a huge saving for Google.

As is typical with Google, the paper gives no details of whether PEGASUS has been deployed across Google's infrastructure in production, but given these power savings and the substantial amount of work Google has invested in the scheme, we reckon it's likely. Google did not respond to questions.

"Overall, iso-latency provides a significant step forward towards the goal of energy proportionality for one of the challenging classes of large-scale, low-latency workloads," the researchers write.

The deployment of complex systems like PEGASUS alongside other advanced Google technologies such as OMEGA (cluster management), SPANNER (distributed DBMS), or CPI2 (thread-level performance monitoring) enables Google to make its data centers dramatically more efficient than those operated by smaller, less sophisticated companies. These technologies will, over time, help Google compete in public cloud with rivals such as Amazon and Microsoft, while serving more ads at a lower cost than before.

Ride on, PEGASUS, ride on. ®
