Google's crash canaries' muted chirping led to load balancer brownout
45 per cent of load balancers threw 502 errors for an hour last Thursday
Google has revealed that it broke its own cloud again, this time because of two failures: a software error and alerts that proved too hard to interpret.
The problem hit Google's cloudy load balancers on Thursday, October 13, causing them to produce HTTP 502 (Bad Gateway) responses. At first, two per cent reported the problem. But an hour and two minutes later, at 16:09 Pacific Time, 45 per cent were generating errors. Which made it rather hard to access virtual machines.
Google says its load balancers are “a global, geographically-distributed multi-tiered software stack which receives incoming HTTP(S) requests via many points in Google's global network, and dispatches them to appropriate Google Compute Engine instances” and that the problem started when “a configuration change was rolled out to one of these layers with widespread distribution beginning at 15:07.”
“This change triggered a software bug which decoupled second-tier load balancers from a number of first-tier load balancers. The affected first-tier load balancers therefore had no forwarding path for incoming requests and returned the HTTP 502 code to indicate this.”
Google says its networks incorporate protections “to prevent them from propagating incorrect or invalid configurations” but that these safeguards “were partially successful in this instance, limiting both the scope and the duration of the event, but not preventing it entirely.”
The Alphabet subsidiary's incident says its first layer of protection is “a canary deployment, where the configuration is deployed at a single site and that site is verified to be functioning within normal bounds.”
But while “the canary step did generate a warning … it was not sufficiently precise to cause the on-call engineer to immediately halt the rollout. The new configuration subsequently rolled out in stages, but was halted part way through as further alerts indicated that it was not functioning correctly. By design, this progressive rollout limited the error rate experienced by customers.”
Google's fix for the problem is more articulate canaries.
“We typically have a period of soak time during a canary release before rolling out more widely,” the company explains. “In this case, the change was pushed too quickly for accurate detection of the class of failure uncovered by the configuration being rolled out. We will change our processes to be more conservative when rolling out configuration changes to critical systems.”
Perhaps some relief is in sight for Google cloud customers as the company last week announced a new squad of "Customer Reliability Engineering" people whose job will be to support cloud users. They'll need to be good because Google's list of self-inflicted cloudy wounds is long: the company has broken its cloud by doing two updates at once, by using bad manual processes and with buggy patches. ®