Google: 'We'll track EVERY task on EVERY data center server'
Chip-level performance tracking in thousand-server Googly clusters
Google has wired its worldwide fleet of servers up with monitoring technology that inspects every task running on every machine, and eventually hopes to use this data to selectively throttle or even kill processes that cause disruptions for other tasks running on the same CPU.
The search giant gave details on how it had developed the planet-spanning technology in a technical paper (PDF) due to be published next week – and its contents will be of major interest to anyone running massive Linux-based infrastructure clouds.
"Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other programs' behavior," the researchers write.
"Our solution, CPI2, uses cycles-per-instruction (CPI) data obtained by hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior. It automatically learns normal and anomalous behaviors by aggregating data from multiple tasks in the same job."
In essence, CPI2 lets Google engineers trace poor performance to a single task running on a single processor within a cluster of thousands, drill down to it, and choose to throttle that task, all for a CPU overhead of no more than 0.1 per cent. It requires no special hardware, and its only software dependency appears to be Linux.
CPI2 lets Google gather information on the expected CPU cycles-per-instruction (CPI) of any particular task, build standard resource profiles from this data, and then use these profiles to help the web giant identify tasks that are taking more cycles-per-instruction than usual to get executed ("victims") and the tasks that may be causing this disruption ("antagonists"). Software agents can then throttle the antagonists so the victims stop fretting and get back to work (Sounds a bit rough–Ed).
The "vast majority" of Google's machines run multiple tasks, the company wrote. These jobs are either latency-sensitive or batch-based, and are themselves typically composed of multiple tasks. 96 per cent of the tasks running on Google servers are part of a job with at least 10 tasks, and 87 per cent of the tasks are part of a job with 100 or more tasks.
But these tasks can interfere with each other through foul-ups relating to processor caches and memory allocation problems, causing the latency of a task within an application to skyrocket – something that ad-slinger Google wants to avoid at all costs.
To help it control latency spikes on each processor on a task-by-task basis, Google has rolled out CPI monitoring across all of its production servers. It derives its CPI data from processor hardware counters: the value of CPU_CLK_UNHALTED.REF divided by the number of instructions retired.
Google gathers this data for a 10-second period once every minute via a perf_event tool in counting mode, rather than sampling mode. Total CPU overhead of the system is less than 0.1 per cent and leads to no visible latency impact.
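For readers who want the arithmetic spelled out, here is a minimal Python sketch of the CPI calculation over one measurement window. The `CounterSnapshot` type, its field names and the numbers are all hypothetical, made up for illustration; the paper's actual collection code is not shown in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CounterSnapshot:
    """Cumulative counter values for one task (hypothetical layout)."""
    cycles: int        # unhalted reference cycles, e.g. CPU_CLK_UNHALTED.REF
    instructions: int  # instructions retired


def cpi_over_window(start: CounterSnapshot, end: CounterSnapshot) -> Optional[float]:
    """Return cycles per instruction between two snapshots, or None if idle."""
    cycles = end.cycles - start.cycles
    instructions = end.instructions - start.instructions
    if instructions <= 0:
        return None  # no instructions retired, so CPI is undefined
    return cycles / instructions


# Illustrative numbers only, not taken from the paper.
start = CounterSnapshot(cycles=1_000_000, instructions=500_000)
end = CounterSnapshot(cycles=4_000_000, instructions=2_500_000)
cpi = cpi_over_window(start, end)  # 3,000,000 cycles / 2,000,000 instructions

# Counting for 10 seconds out of every 60 means the counters are
# read for only one sixth of the time.
duty_cycle = 10 / 60
```

Counting mode simply reads running totals at the start and end of the window, which is why the sketch needs nothing more than two subtractions and a division.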
Don't worry 'victim', the agents will save you!
CPI is calculated to take account of variances in the type of CPU being run on, Google said, as typical clusters will run across a large range of platforms. CPI values are measured and analyzed locally by an agent that runs on every machine (pictured). This agent is always given the most current expected CPI distribution for the jobs it is running tasks for, so that it can spot aberrations without having to phone home.
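A rough Python sketch of how such an agent might check a local sample against an expected CPI profile follows. Keying profiles by job and CPU type mirrors the description above, but the `CpiProfile` structure, the job and platform names, and the cutoff of two standard deviations above the mean are assumptions chosen for illustration rather than details confirmed here.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class CpiProfile:
    """Expected CPI for one job on one CPU type (hypothetical structure)."""
    mean: float
    stddev: float


def build_profiles(samples):
    """Aggregate (job, cpu_type, cpi) samples into one profile per (job, cpu_type)."""
    grouped = {}
    for job, cpu_type, cpi in samples:
        grouped.setdefault((job, cpu_type), []).append(cpi)
    return {
        key: CpiProfile(statistics.mean(vals), statistics.pstdev(vals))
        for key, vals in grouped.items()
    }


def is_outlier(profile, cpi, num_stddevs=2.0):
    """Flag a sample well above the expected CPI (the cutoff is an assumption)."""
    return cpi > profile.mean + num_stddevs * profile.stddev


# Made-up samples: the same job behaves differently on two CPU types.
samples = [
    ("websearch", "cpu-a", 1.0),
    ("websearch", "cpu-a", 1.2),
    ("websearch", "cpu-a", 1.1),
    ("websearch", "cpu-b", 2.0),
    ("websearch", "cpu-b", 2.2),
]
profiles = build_profiles(samples)
local = profiles[("websearch", "cpu-a")]
suspicious = is_outlier(local, 1.9)   # far above what cpu-a tasks normally see
```

Note that a CPI of 1.9 looks alarming on the hypothetical "cpu-a" but would be unremarkable on "cpu-b", which is the reason for keeping a separate profile per platform.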
If this agent spots a "victim" task that is being slowed, it looks once per second for the "antagonist" tasks disrupting it. To do this, it uses an algorithm that works out whether an antagonist's rising CPU use lines up with spikes in the victim's cycles per instruction.
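The article does not spell out the paper's scoring formula, so the sketch below swaps in a plain Pearson correlation as a stand-in: it ranks co-located tasks by how closely their CPU usage tracks the victim's CPI. The task names and sample values are invented, and this should not be read as Google's actual algorithm.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is flat."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_suspects(victim_cpi, suspect_cpu_usage):
    """Sort co-located tasks by how closely their CPU usage tracks victim CPI."""
    scores = {
        task: pearson(usage, victim_cpi)
        for task, usage in suspect_cpu_usage.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up per-second samples for one victim and three co-located tasks.
victim_cpi = [1.0, 1.1, 2.4, 2.6, 1.0, 2.5]
suspects = {
    "batch-encoder": [0.1, 0.2, 0.9, 1.0, 0.1, 0.9],  # rises with victim CPI
    "log-shipper": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],    # flat usage
    "cache-warmer": [0.8, 0.7, 0.1, 0.1, 0.9, 0.2],   # moves the other way
}
ranking = rank_suspects(victim_cpi, suspects)
top_suspect = ranking[0][0]
```

A task whose usage climbs whenever the victim's CPI spikes floats to the top of the list; one that is busy only when the victim is healthy sinks to the bottom.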
If it identifies an antagonist and finds that it is a batch job, then the system will "forcibly reduce the antagonist's CPU usage by applying CPU hard-capping".
It is more than likely that this agent is given tasks from a component of Google's mega-infrastructure-management system named Omega, especially when you consider that the architect of Omega (computer science luminary John Wilkes) also happens to be one of the co-authors of the CPI2 paper.
The CPI2 system can even automatically throttle disruptive tasks, but only if they have been marked as being eligible for what the search giant terms "hard-capping" – something that not many tasks are tagged as. Alternatively, system operators can interface with CPI2 and manually hard-cap suspects.
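The throttling rules described above boil down to a small decision function. The Python sketch below is one way to express them; the `Task` fields, the `Action` names and the `operator_override` flag are hypothetical labels for illustration, not Google's interfaces.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    HARD_CAP = "hard_cap"
    REPORT_ONLY = "report_only"


@dataclass(frozen=True)
class Task:
    """Minimal description of a suspected antagonist (hypothetical fields)."""
    name: str
    is_batch: bool
    hard_cap_eligible: bool


def choose_action(antagonist, operator_override=False):
    """Cap automatically only for eligible batch tasks; otherwise just report."""
    if operator_override:
        return Action.HARD_CAP  # an operator chose to cap the suspect by hand
    if antagonist.is_batch and antagonist.hard_cap_eligible:
        return Action.HARD_CAP
    return Action.REPORT_ONLY


encoder = Task("batch-encoder", is_batch=True, hard_cap_eligible=True)
frontend = Task("web-frontend", is_batch=False, hard_cap_eligible=False)

auto_encoder = choose_action(encoder)
auto_frontend = choose_action(frontend)
manual_frontend = choose_action(frontend, operator_override=True)
```

Keeping the automatic path this narrow means a latency-sensitive task is never throttled unless a human decides to do it.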
CPI data is logged and stored offline, along with profiles of antagonist tasks, so admins can query it via Google's major internal analysis tool, Dremel (of which Cloudera's Impala is a public implementation).
Dremel is used by Google engineers for performance forensics to let them identify particularly aggressive antagonists for their tasks. In the future, it could be possible to reschedule antagonists to different machines and put the most disruptive ones on their own subset of machines, then feed this configuration pattern to the scheduler to avoid problems.
One area for improvement is dealing with multiple antagonists, as this currently confuses the algorithm. Others are introducing a feedback-driven policy for capping tasks and expanding the CPI2 method to deal with disk and I/O conflicts as well.
"Even before these enhancements are applied, we believe that CPI2 is a powerful, useful tool," the researchers wrote.
CPI2 looks to be a less consumptive way of getting workable info on app performance than other Google schemes. There's a parallel technique called "Google-Wide Profiling" that tracks both hardware and software performance, but is only in use on a fraction of Google machines due to concerns about performance.
From the mile-high perspective of Vulture West, CPI2 provides an illuminating look at the types of invention needed to not only manage massive IT estates, but make them work better. Next time you perform a search, or check your email, or look up an address via Google services and notice the page takes a bit longer to load than normal, you might spare a thought for the antagonist that is now being seen to by a cold, heartless CPI2 agent. ®