Google's crash canaries' muted chirping led to load balancer brownout

45 per cent of load balancers threw 502 errors for an hour last Thursday

Wed 19 Oct 2016 // 00:38 UTC

Google has revealed that it broke its own cloud again, this time because of two failures: a software error and alerts that proved too hard to interpret.

The problem hit Google's cloudy load balancers on Thursday, October 13, causing them to produce HTTP 502 (Bad Gateway) responses. At first, two per cent reported the problem. But an hour and two minutes later, at 16:09 Pacific Time, 45 per cent were generating errors. Which made it rather hard to access virtual machines.

Google says its load balancers are “a global, geographically-distributed multi-tiered software stack which receives incoming HTTP(S) requests via many points in Google's global network, and dispatches them to appropriate Google Compute Engine instances” and that the problem started when “a configuration change was rolled out to one of these layers with widespread distribution beginning at 15:07.”

“This change triggered a software bug which decoupled second-tier load balancers from a number of first-tier load balancers. The affected first-tier load balancers therefore had no forwarding path for incoming requests and returned the HTTP 502 code to indicate this.”
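A toy model helps picture that failure mode. The sketch below is purely illustrative and is not Google's code: the class name `FirstTierBalancer`, the `second_tier` list and the backend names are all hypothetical. It only shows why a first-tier balancer that has lost every second-tier peer has nothing left to return but a 502.

```python
from dataclasses import dataclass, field
from http import HTTPStatus


@dataclass
class FirstTierBalancer:
    """Hypothetical first-tier balancer; all names are illustrative."""

    second_tier: list[str] = field(default_factory=list)
    _next: int = 0

    def handle(self, path: str) -> tuple[int, str]:
        # With no second-tier backends there is no forwarding path,
        # so the request is answered with 502 Bad Gateway.
        if not self.second_tier:
            return HTTPStatus.BAD_GATEWAY.value, "no forwarding path"
        # Otherwise pick the next backend in simple round-robin order.
        backend = self.second_tier[self._next % len(self.second_tier)]
        self._next += 1
        return HTTPStatus.OK.value, f"forwarded {path} to {backend}"


healthy = FirstTierBalancer(second_tier=["tier2-a", "tier2-b"])
decoupled = FirstTierBalancer()  # stands in for a balancer cut off by the bad config

healthy_result = healthy.handle("/index.html")
decoupled_result = decoupled.handle("/index.html")
```

Here `healthy_result` carries a 200 and `decoupled_result` a 502, which is the distinction customers saw from outside.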

Google says its networks incorporate protections “to prevent them from propagating incorrect or invalid configurations” but that these safeguards “were partially successful in this instance, limiting both the scope and the duration of the event, but not preventing it entirely.”

The Alphabet subsidiary's incident report says its first layer of protection is “a canary deployment, where the configuration is deployed at a single site and that site is verified to be functioning within normal bounds.”

But while “the canary step did generate a warning … it was not sufficiently precise to cause the on-call engineer to immediately halt the rollout. The new configuration subsequently rolled out in stages, but was halted part way through as further alerts indicated that it was not functioning correctly. By design, this progressive rollout limited the error rate experienced by customers.”

Google's fix for the problem is more articulate canaries.

“We typically have a period of soak time during a canary release before rolling out more widely,” the company explains. “In this case, the change was pushed too quickly for accurate detection of the class of failure uncovered by the configuration being rolled out. We will change our processes to be more conservative when rolling out configuration changes to critical systems.”
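For readers unfamiliar with the pattern, a canary-then-staged rollout with a soak period can be sketched as follows. This is a minimal illustration under stated assumptions, not Google's process: the function `staged_rollout`, the site names, the 1 per cent threshold and the number of soak checks are all invented for the example.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    error_rate: Callable[[str], float],
    soak_checks: int = 3,
    max_error_rate: float = 0.01,
) -> list[str]:
    """Deploy stage by stage and halt at the first stage that fails its soak.

    Returns the sites that received the configuration, either all of them
    or only those reached before the halt.
    """
    deployed: list[str] = []
    for stage in stages:
        deployed.extend(stage)
        # Soak: sample each newly deployed site several times before moving on.
        # A real system would wait between samples; that delay is omitted here.
        for _ in range(soak_checks):
            if any(error_rate(site) > max_error_rate for site in stage):
                return deployed  # halt: later stages never get the change
    return deployed


# Hypothetical sites and error rates, chosen only for illustration.
rates = {"canary": 0.02, "site-b": 0.0, "site-c": 0.0}
stages = [["canary"], ["site-b", "site-c"]]
reached = staged_rollout(stages, lambda site: rates[site])
```

With these made-up numbers the canary exceeds the threshold, so `reached` contains only the canary site. The point of a longer soak is simply more samples before the next stage begins, which is the more conservative behaviour Google says it will adopt.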

Perhaps some relief is in sight for Google cloud customers as the company last week announced a new squad of "Customer Reliability Engineering" people whose job will be to support cloud users. They'll need to be good because Google's list of self-inflicted cloudy wounds is long: the company has broken its cloud by doing two updates at once, by using bad manual processes and with buggy patches. ®
