Google: 'We'll track EVERY task on EVERY data center server'

Chip-level performance tracking in thousand-server Googly clusters

Google has wired its worldwide fleet of servers up with monitoring technology that inspects every task running on every machine, and eventually hopes to use this data to selectively throttle or even kill processes that cause disruptions for other tasks running on the same CPU.

The search giant gave details on how it had developed the planet-spanning technology in a technical paper (PDF) due to be published next week – and its contents will be of major interest to anyone running massive Linux-based infrastructure clouds.

"Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other program's behavior," the researchers write.

"Our solution, CPI2, uses cycles-per-instruction (CPI) data obtained by hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior. It automatically learns normal and anomalous behaviors by aggregating data from multiple tasks in the same job."

In essence, CPI2 lets Google engineers trace poor performance to a single task running on a single processor within a cluster of thousands, then drill down and choose to throttle the task responsible, all with a CPU overhead of no more than 0.1 per cent. It requires no special hardware, and its only software dependency appears to be Linux.

CPI2 lets Google gather information on the expected CPU cycles-per-instruction (CPI) of any particular task, build standard resource profiles from this data, and then use these profiles to help the web giant identify tasks that are taking more cycles-per-instruction than usual to get executed ("victims") and the tasks that may be causing this disruption ("antagonists"). Software agents can then throttle the antagonists so the victims stop fretting and get back to work (Sounds a bit rough–Ed).

The "vast majority" of Google's machines run multiple tasks, the company wrote. Its jobs are either latency-sensitive or batch-based, and each is typically composed of multiple tasks: 96 per cent of the tasks running on Google servers are part of a job with at least 10 tasks, and 87 per cent are part of a job with 100 or more tasks.

But these tasks can interfere with each other by contending for shared resources such as processor caches and memory buses, causing the latency of a task within an application to skyrocket – something that ad-slinger Google wants to avoid at all costs.

To help it control latency spikes on each processor on a task-by-task basis, Google has rolled out CPI monitoring across all of its production servers. It derives its CPI data from processor hardware counters, calculating CPI as CPU_CLK_UNHALTED.REF divided by INSTRUCTIONS_RETIRED.
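As a rough illustration of that ratio, here is a minimal Python sketch. The function name and the counter values are made up for the example; only the two counter names and the division come from the paper as reported here.

```python
def cycles_per_instruction(ref_cycles: int, instructions_retired: int) -> float:
    """Return CPI for one measurement window.

    ref_cycles           -- delta of CPU_CLK_UNHALTED.REF over the window
    instructions_retired -- delta of INSTRUCTIONS_RETIRED over the window
    """
    if instructions_retired <= 0:
        # A task that retired nothing in the window has no meaningful CPI.
        raise ValueError("no instructions retired in this window")
    return ref_cycles / instructions_retired


# Hypothetical counter deltas for a single window.
print(cycles_per_instruction(18_000_000_000, 12_000_000_000))  # 1.5
```

A higher figure means the task burned more cycles to get the same amount of work done, which is the symptom CPI2 watches for.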

Google gathers this data for a 10-second period once every minute via a perf_event tool in counting mode, rather than sampling mode. Total CPU overhead of the system is less than 0.1 per cent, with no visible impact on latency.

CPI2 lets Google inspect every task on every chip

Don't worry 'victim', the agents will save you!

CPI is calculated to take account of variances in the type of CPU being run on, Google said, as typical clusters will run across a large range of platforms. CPI values are measured and analyzed locally by an agent that runs on every machine (pictured). This agent is always given the most current expected CPI distribution for the jobs it is running tasks for, so that it can spot aberrations without having to phone home.
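To make the idea concrete, the sketch below keeps one expected-CPI profile per job and CPU platform, then checks a fresh measurement against it locally. Everything named here is hypothetical, and the cut-off of two standard deviations above the mean is an assumption chosen for illustration, not a figure given in this article.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class CpiProfile:
    """Expected CPI for one (job, platform) pair. Hypothetical structure."""
    mean_cpi: float
    stddev_cpi: float


def build_profiles(samples):
    """Aggregate (job, platform, cpi) samples into per-pair profiles."""
    grouped = {}
    for job, platform, cpi in samples:
        grouped.setdefault((job, platform), []).append(cpi)
    return {
        key: CpiProfile(mean(values), pstdev(values))
        for key, values in grouped.items()
    }


def is_outlier(profile, cpi, num_stddevs=2.0):
    """Flag a CPI reading well above the expected value.

    The num_stddevs threshold is an assumed policy for this sketch.
    """
    return cpi > profile.mean_cpi + num_stddevs * profile.stddev_cpi


# Hypothetical samples: the same job behaves differently on two platforms.
samples = [
    ("websearch", "platform-a", 1.0),
    ("websearch", "platform-a", 1.2),
    ("websearch", "platform-a", 1.4),
    ("websearch", "platform-b", 2.0),
    ("websearch", "platform-b", 2.2),
]
profiles = build_profiles(samples)
print(is_outlier(profiles[("websearch", "platform-a")], 1.6))
print(is_outlier(profiles[("websearch", "platform-b")], 1.6))
```

Keying the profile on the platform as well as the job is what stops a slower chip from being mistaken for a noisy neighbour.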

If this agent spots a "victim" task that is being slowed, it looks once per second for the "antagonist" tasks disrupting it. Its algorithm works out whether there is a relationship between a suspect's rising CPU usage and spikes in the number of cycles the victim needs per instruction.
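One simple way to test for such a relationship is a plain Pearson correlation between each co-located task's CPU usage and the victim's CPI over the same time steps. That is a stand-in chosen for this sketch, not necessarily the scoring method Google uses, and the task names and numbers are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_suspects(victim_cpi, cpu_usage_by_task):
    """Order co-located tasks from most to least correlated with the victim."""
    scores = {
        task: pearson(usage, victim_cpi)
        for task, usage in cpu_usage_by_task.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical once-per-second samples on one machine.
victim_cpi = [1.0, 1.1, 2.0, 2.1, 1.0]
cpu_usage_by_task = {
    "batch-a": [0.1, 0.1, 0.9, 0.9, 0.1],  # busy exactly when the victim suffers
    "batch-b": [0.5, 0.5, 0.5, 0.5, 0.5],  # steady, so no relationship
    "batch-c": [0.9, 0.9, 0.1, 0.1, 0.9],  # busy when the victim is fine
}
for task, score in rank_suspects(victim_cpi, cpu_usage_by_task):
    print(task, round(score, 2))
```

The task at the top of the ranking is the likeliest antagonist; a score near zero or below suggests an innocent bystander.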

If it identifies an antagonist and finds that it is a batch job, then the system will "forcibly reduce the antagonist's CPU usage by applying CPU hard-capping".

It is more than likely that this agent is given tasks from a component of Google's mega-infrastructure-management system named Omega, especially when you consider that the architect of Omega (computer science luminary John Wilkes) also happens to be one of the co-authors of the CPI2 paper.

The CPI2 system can even automatically throttle disruptive tasks, but only if they have been marked as being eligible for what the search giant terms "hard-capping" – something that not many tasks are tagged as. Alternatively, system operators can interface with CPI2 and manually hard-cap suspects.
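Put together, the automatic path amounts to a small policy check: cap a suspect only if it is a batch task and has been tagged as eligible, otherwise leave it for a human. The sketch below is an illustrative reading of that rule; the class, field and action names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    HARD_CAP = "hard_cap"        # throttle automatically
    REPORT_ONLY = "report_only"  # leave the decision to an operator


@dataclass(frozen=True)
class Suspect:
    """A suspected antagonist. All fields are hypothetical."""
    name: str
    is_batch: bool
    capping_eligible: bool


def choose_action(suspect: Suspect) -> Action:
    """Cap only batch tasks that have been marked as eligible for hard-capping."""
    if suspect.is_batch and suspect.capping_eligible:
        return Action.HARD_CAP
    return Action.REPORT_ONLY


print(choose_action(Suspect("log-cruncher", is_batch=True, capping_eligible=True)))
print(choose_action(Suspect("web-frontend", is_batch=False, capping_eligible=True)))
```

Defaulting to a report rather than a cap keeps latency-sensitive and untagged tasks out of the automatic firing line.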

CPI data is logged and stored offline, along with profiles of antagonist tasks, so admins can query it via Google's major internal analysis tool, Dremel (of which Cloudera's Impala is a public implementation).

Dremel is used by Google engineers for performance forensics to let them identify particularly aggressive antagonists for their tasks. In the future, it could be possible to reschedule antagonists to different machines and put the most disruptive ones on their own subset of machines, then feed this configuration pattern to the scheduler to avoid problems.

One area for improvement is dealing with multiple antagonists, as this currently confuses the algorithm. Others are introducing a feedback-driven policy for capping tasks, and expanding the CPI2 method to deal with disk and I/O conflicts as well.

"Even before these enhancements are applied, we believe that CPI2 is a powerful, useful tool," the researchers wrote.

CPI2 looks to be a less consumptive way of getting workable info on app performance than other Google schemes. There's a parallel technique called "Google-Wide Profiling" that tracks both hardware and software performance, but is only in use on a fraction of Google machines due to concerns about performance.

From the mile-high perspective of Vulture West, CPI2 provides an illuminating look at the types of invention needed to not only manage massive IT estates, but make them work better. Next time you perform a search, or check your email, or look up an address via Google services and notice the page takes a bit longer to load than normal, you might spare a thought for the antagonist that is now being seen to by a cold, heartless CPI2 agent. ®
