Google reaches into own silicon brain to slash electricity bill

Skunkworks PEGASUS system tramples cloud's ugly power-sucking secret

Google has worked out how to save as much as 20 per cent of its data-center electricity bill by reaching deep into the guts of its infrastructure and fiddling with the feverish silicon brains of its chips.

In a paper entitled "Towards Energy Proportionality for Large-Scale Latency-Critical Workloads", to be presented next week at the ISCA 2014 computer architecture conference, researchers from Google and Stanford University discuss an experimental system named "PEGASUS" that may save Google vast sums of money by helping it cut its electricity consumption.

PEGASUS addresses one of the worst-kept secrets about cloud computing, which is that the computer chips in the gigantic data centers of Google, Amazon, and Microsoft are standing idle for significant amounts of time.

Though all these companies have developed sophisticated technologies to try to increase the utilization of their chips, they all fall short in one way or another.

This means that a substantial amount of the electricity going into their data centers is wasted as it powers compute chips that are either idle or in a state of very low utilization. From an operator's perspective, it's a sucking chest wound in the budget, and from an environmentalist's perspective it's a travesty.

Now Google and Stanford researchers have designed a system that makes data-center power consumption more efficient without compromising performance.


PEGASUS tunes the power consumption of the chips to fit the task

PEGASUS does this by dialing the power consumption of the processors within Google's servers up and down according to the request-latency requirements of any given workload, a policy dubbed iso-latency. (For silicon-heads among our readers, the power management tech PEGASUS uses is Running Average Power Limit, or RAPL, which lets you tweak CPU power consumption in increments of 0.125W. The system "sweeps the RAPL power limit at a given load to find the point of minimum cluster power that satisfies the [service-level objective] target.")
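For the silicon-heads, the sweep described in that quote can be sketched roughly as follows. This is a minimal illustration, not the researchers' code: `sweep_power_limit`, `toy_measure`, and the toy latency model are all hypothetical names and numbers, and only the 0.125W step comes from the article. A real system would apply each limit through the platform's RAPL interface and measure an actual cluster.

```python
from typing import Callable, Optional, Tuple

RAPL_STEP_W = 0.125  # power-limit increment quoted in the article


def sweep_power_limit(
    measure: Callable[[float], Tuple[float, float]],
    slo_ms: float,
    min_w: float,
    max_w: float,
    step_w: float = RAPL_STEP_W,
) -> Optional[Tuple[float, float]]:
    """Try every power limit from min_w to max_w at a fixed load.

    `measure(limit_w)` is a hypothetical callback that applies the limit
    and returns (latency_ms, cluster_power_w). Returns the
    (limit_w, cluster_power_w) pair with the lowest cluster power among
    limits that meet the latency target, or None if none does.
    """
    best: Optional[Tuple[float, float]] = None
    steps = int(round((max_w - min_w) / step_w))
    for i in range(steps + 1):
        limit_w = min_w + i * step_w
        latency_ms, power_w = measure(limit_w)
        meets_slo = latency_ms <= slo_ms
        if meets_slo and (best is None or power_w < best[1]):
            best = (limit_w, power_w)
    return best


def toy_measure(limit_w: float) -> Tuple[float, float]:
    """Made-up model: latency falls as the cap rises, power equals the cap."""
    return 1200.0 / limit_w, limit_w


if __name__ == "__main__":
    print(sweep_power_limit(toy_measure, slo_ms=30.0, min_w=20.0, max_w=80.0))
```

With the toy model, the sweep settles on the lowest cap that still meets the 30ms target. Real latency curves are noisier, so any such sweep would need repeated measurements at each setting.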

Put another way, PEGASUS makes sure that a processor is working just hard enough to meet the demands of the application running on it, but no harder. "The baseline can be compared to driving a car with sudden stops and starts. iso-latency would then be driving the car at a slower speed to avoid accelerating hard and braking hard," the researchers write. "The second way of operating a car is much more fuel efficient than the first, which is akin to the results we have observed."

Existing power management techniques for large data centers advocate turning off individual servers or even individual cores, but the researchers said this was inefficient. "Even if spare memory storage is available, moving tens of gigabytes of state in and out of servers is expensive and time consuming, making it difficult to react to fast or small changes in load," they explain.


What saving vast amounts of money looks like

Shutting down individual computer cores, meanwhile, doesn't work due to the specific needs of Google's search tech. "A single user query to the front-end will lead to forwarding of the query to all leaf nodes. As a result, even a small request rate can create a non-trivial amount of work on each server. For instance, consider a cluster sized to handle a peak load of 10,000 queries per second (QPS) using 1000 servers," they explain. "Even at 10 per cent load, each of the 1000 nodes are seeing on average one query per millisecond. There is simply not enough idleness to invoke some of the more effective low power modes."
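The arithmetic behind that quote is worth spelling out, because the fan-out is what kills the idle time. The sketch below reproduces the researchers' example numbers. The function names are ours, and the no-fan-out comparison is an added assumption for contrast, not something the paper is quoted as saying.

```python
def per_leaf_gap_ms_with_fanout(peak_qps: int, load_percent: int) -> float:
    """Average time between queries at each leaf when every query hits every leaf."""
    cluster_qps = peak_qps * load_percent / 100
    # With full fan-out, each leaf sees the whole cluster's query rate.
    return 1000.0 / cluster_qps


def per_leaf_gap_ms_without_fanout(peak_qps: int, load_percent: int, servers: int) -> float:
    """Hypothetical contrast: queries spread evenly, one server per query."""
    cluster_qps = peak_qps * load_percent / 100
    return 1000.0 / (cluster_qps / servers)


if __name__ == "__main__":
    # Figures from the researchers' example: 10,000 QPS peak, 1000 servers, 10 per cent load.
    print(per_leaf_gap_ms_with_fanout(10_000, 10))           # 1.0 ms between queries
    print(per_leaf_gap_ms_without_fanout(10_000, 10, 1000))  # 1000.0 ms between queries
```

At one query per millisecond, a core never gets a long enough breather to drop into a deep sleep state and wake up again in time.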

So, PEGASUS, which stands for Power and Energy Gains Automatically Saved from Underutilized Systems, has been created. The tech "is a dynamic, feedback-based controller that enforces the iso-latency policy." It tweaks the power to the chip according to the task it's running, making sure not to violate any service-level agreements on latency.
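To give a flavor of what a feedback-based controller of this kind might do on each control interval, here is a minimal sketch. It is an assumption-laden illustration rather than the paper's algorithm: the function name, the 10 per cent slack band, and the faster step-up are all hypothetical choices, and only the 0.125W granularity comes from the article.

```python
RAPL_STEP_W = 0.125  # power-limit increment quoted in the article


def next_power_limit(
    current_w: float,
    measured_ms: float,
    target_ms: float,
    min_w: float,
    max_w: float,
    step_w: float = RAPL_STEP_W,
    slack: float = 0.1,
) -> float:
    """Pick the power limit for the next control interval.

    Hypothetical policy: raise the cap quickly when latency exceeds the
    target, lower it one step when there is comfortable headroom, and
    otherwise hold steady. The result is clamped to [min_w, max_w].
    """
    if measured_ms > target_ms:
        # Over the latency target: back off the savings in a hurry.
        proposed = current_w + 4 * step_w
    elif measured_ms < target_ms * (1.0 - slack):
        # Well under target: shave a little more power.
        proposed = current_w - step_w
    else:
        # Inside the slack band: leave the cap alone to avoid oscillation.
        proposed = current_w
    return max(min_w, min(max_w, proposed))


if __name__ == "__main__":
    limit = 50.0
    for latency in (20.0, 22.0, 29.0, 35.0):  # made-up latency samples in ms
        limit = next_power_limit(limit, latency, target_ms=30.0, min_w=20.0, max_w=80.0)
        print(limit)
```

Raising the cap faster than it is lowered reflects the asymmetry in the stakes: a missed latency target is visible to users, while a few wasted watts are not.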

During tests on production Google workloads, the researchers found that PEGASUS saved as much as 30 per cent of power compared to a non-PEGASUS system during times of low demand, and cut total energy use by 11 per cent over a 24-hour period. The team also evaluated it on a "full scale, production cluster for search at Google", aka the company's crown-jewel workload. Here, PEGASUS did less well, saving between 10 per cent and 20 per cent during low-utilization periods. The shortfall is due to the way it applied policy across the thousands of servers without taking into account variations between chips.

A potential solution to this, which the paper calls the "hot leaf" problem, is to distribute the PEGASUS controller so that it lives on each node and applies latency policy from there.

"The solution to the hot leaf problem is fairly straightforward: implement a distributed controller on each server that keeps the leaf latency at a certain latency goal," the researchers write.
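A toy comparison shows why a per-server controller could claw back more power. The sketch assumes, purely for illustration, that each leaf needs a different minimum power cap to hit the latency goal and that a cluster-wide controller must set every leaf to the cap the neediest leaf requires. The function names and wattages are made up, not taken from the paper.

```python
from typing import Sequence


def uniform_total_w(required_w: Sequence[float]) -> float:
    """Total power when one cluster-wide cap must satisfy the neediest leaf."""
    return max(required_w) * len(required_w)


def per_leaf_total_w(required_w: Sequence[float]) -> float:
    """Total power when each leaf runs at the cap it individually needs."""
    return sum(required_w)


if __name__ == "__main__":
    # Hypothetical per-leaf minimum caps in watts; the last leaf is the "hot" one.
    leaves = [40.0, 42.0, 41.0, 60.0]
    print(uniform_total_w(leaves))
    print(per_leaf_total_w(leaves))
```

In this made-up case a single hot leaf drags the other three up to its cap, which is the waste a per-leaf controller would avoid.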

In the real world, as any grizzled veteran of distributed systems can tell you, implementing any kind of distributed controller scheme is inviting a world of confusion and pain into your data center – but Google isn't the real world, it's a gold-plated organization that can fund the necessary engineers to keep a distributed scheme like this working.

If PEGASUS were to be implemented in a distributed way, the researchers reckon it could save up to 35 per cent of power over the baseline, a huge saving for Google.

As is typical with Google, the paper gives no details of whether PEGASUS has been deployed across Google's infrastructure in production, but given these power savings and the substantial amount of work Google has invested in the scheme, we reckon it's likely. Google did not respond to questions.

"Overall, iso-latency provides a significant step forward towards the goal of energy proportionality for one of the challenging classes of large-scale, low-latency workloads," the researchers write.

The deployment of complex systems like PEGASUS alongside other advanced Google technologies such as OMEGA (cluster management), SPANNER (distributed DBMS), or CPI2 (thread-level performance monitoring) enables Google to make its data centers dramatically more efficient than those operated by smaller, less sophisticated companies. These technologies will, over time, help Google compete in public cloud with rivals such as Amazon and Microsoft, while serving more ads at a lower cost than before.

Ride on, PEGASUS, ride on. ®
