Google has a canary problem: One clocked off and crocked its cloud

Again. So now Google's re-building dangerously centralized routing rigs

Fri 10 Feb 2017 // 01:35 UTC

Google's explained why new cloudy virtual machines in its cloud engine couldn't connect to the world for a couple of hours in January: a canary didn't fall off its perch, so the company was unaware of a problem.

The incident wasn't major: for a tick over two hours on January 30th newly created Google Compute Engine instances, Cloud VPNs and network load balancers became unavailable. The affected servers had public IP addresses, but couldn't be reached from the outside world or send any traffic. Nor would load balancing health checks work.

Google's explanation of the incident tells us a little about its infrastructure. Here's the Root Cause it's offered us all:

“All inbound networking for GCE instances, load balancers and VPN tunnels enter via shared layer 2 load balancers. These load balancers are configured with changes to IP addresses for these resources, then automatically tested in a canary deployment, before changes are globally propagated.
“The issue was triggered by a large set of updates which were applied to a rarely used load balancing configuration. The application of updates to this configuration exposed an inefficient code path which resulted in the canary timing out. From this point all changes of public addressing were queued behind these changes that could not proceed past the testing phase.”
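The failure Google describes is classic head-of-line blocking: one change that cannot clear the canary holds up everything queued behind it. The toy model below is only a sketch of that idea and not Google's pipeline; the function name, the change labels and every duration are invented for illustration.

```python
from collections import deque


def process_queue(changes, canary_timeout):
    """Apply changes strictly in order, gating each one on its canary test.

    `changes` is a list of (name, canary_seconds) pairs. All values are
    hypothetical; this only models the ordering behaviour described above.
    """
    queue = deque(changes)
    applied = []
    while queue:
        name, canary_seconds = queue[0]
        if canary_seconds > canary_timeout:
            # The canary timed out: this change cannot proceed past testing,
            # and because the queue is strictly ordered, nothing behind it can either.
            break
        applied.append(name)
        queue.popleft()
    stuck = [name for name, _ in queue]
    return applied, stuck


changes = [
    ("vm-a address", 5),
    ("rare-config bulk update", 900),  # stands in for the inefficient code path
    ("vm-b address", 5),
    ("vpn-c address", 5),
]

applied, stuck = process_queue(changes, canary_timeout=60)
print("applied:", applied)
print("stuck:  ", stuck)
```

In this model the two ordinary address changes at the back of the queue are perfectly healthy, yet they never reach the network because the slow update ahead of them never passes its test.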

Once Google's people figured out a confused canary was the problem, they “restarted the jobs responsible for programming changes to the network load balancers” and then processed the problematic changes in a batch. That sorted things out in short order.

For now, Google's “increasing the canary timeout so that updates exercising the inefficient code path merely slow network changes rather than completely stop them.”
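To see why a longer timeout turns a full stop into a mere slowdown, consider this rough sketch. It assumes changes are tested one after another and that a timed-out canary halts the queue; the durations and timeout values are made up and are not taken from Google's report.

```python
def drain_time(canary_seconds, canary_timeout):
    """Return total seconds to clear the queue, or None if it stalls.

    `canary_seconds` lists how long each queued change takes to pass its
    canary test (hypothetical numbers). A change that exceeds the timeout
    never passes, so the queue behind it never drains.
    """
    total = 0
    for seconds in canary_seconds:
        if seconds > canary_timeout:
            return None  # stalled: this change can never clear testing
        total += seconds
    return total


durations = [5, 900, 5, 5]  # one slow update among ordinary ones

short_timeout = drain_time(durations, canary_timeout=60)
long_timeout = drain_time(durations, canary_timeout=1200)
print(short_timeout)  # None: the queue never drains
print(long_timeout)   # 915: slow, but everything gets through
```

The longer timeout does nothing about the inefficient code path itself, which is why Google treats it as a stopgap.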

This is not Google's first canary problem: in October 2016 one of its warning birds failed to chirp long or loud enough and bits of its cloud fell over.

Google's long-term fix is to add tests that cover more configurations and to bin the inefficient code path. The company reckons it was already taking steps to address this sort of problem by implementing more decentralized routing. That effort is “being accelerated as it will prevent issues with this layer having global impact.”

And of course Google says it's now building new metrics so it will be alerted to problems faster. Perhaps they need a DevOps UFO to flash news of problems? ®
