Google has a canary problem: One clocked off and crocked its cloud
Again. So now Google's re-building dangerously centralized routing rigs
Google's explained why new cloudy virtual machines in its cloud engine couldn't connect to the world for a couple of hours in January: a canary didn't fall off its perch, so the company was unaware of a problem.
The incident wasn't major: for a tick over two hours on January 30th newly created Google Compute Engine instances, Cloud VPNs and network load balancers became unavailable. The affected servers had public IP addresses, but couldn't be reached from the outside world or send any traffic. Nor would load balancing health checks work.
Google's explanation of the incident tells us a little about its infrastructure. Here's the Root Cause it's offered us all:
All inbound networking for GCE instances, load balancers and VPN tunnels enter via shared layer 2 load balancers. These load balancers are configured with changes to IP addresses for these resources, then automatically tested in a canary deployment, before changes are globally propagated.
The issue was triggered by a large set of updates which were applied to a rarely used load balancing configuration. The application of updates to this configuration exposed an inefficient code path which resulted in the canary timing out. From this point all changes of public addressing were queued behind these changes that could not proceed past the testing phase.
Once Google's people figured out a confused canary was the problem, they “restarted the jobs responsible for programming changes to the network load balancers” and then processed the problematic changes in a batch. That sorted things out in short order.
For now, Google's “increasing the canary timeout so that updates exercising the inefficient code path merely slow network changes rather than completely stop them.”
This is not Google's first canary problem: in October 2016 one of its warning birds failed to chirp long or loud enough and bits of its cloud fell over.
Google's long-term fix is new tests to test more configurations and binning the inefficient code path. The company reckons it was already taking steps to address this sort of problem, by implementing more decentralized routing. That effort is “being accelerated as it will prevent issues with this layer having global impact.”
And of course Google says it's now building new metrics so it will be alerted to problems faster. Perhaps they need a DevOps UFO to flash news of problems? ®