Google's robo-CTRL-ALT-DEL failed, hung networks and Compute Engine for 90 minutes

And then along came a good old flesh and blood human!

Chalk up another one for good old humans: Google's admitted that an automation failure was the root cause of a 93-minute outage of its Compute Engine in the us-central1 and europe-west3 zones of its cloud on January 18th, 2018.

Google's classified the outage as a "network programming failure" and said autoscaler didn't do its job as a result. The same failure meant new or newly migrated VMs couldn't talk to VMs in other zones.

The cloud contender has explained the outage by revealing that "Propagation of Google Compute Engine networking configuration for newly created and migrated VMs is handled by two components. The first is responsible for providing a complete list of VM's, networks, firewall rules, and scaling decisions."

"The second component provides a stream of updates for the components in a specific zone."

During the outage, the first component didn't send any data. That lack of information meant VMs in other zones couldn't figure out how to contact their colleagues. Autoscaler also relied on data from the first component, so it too ground to a halt.

So why did that first component fail? Google's described the problem as "A stuck process" that didn't send updates. In other words, a good old process hang.

Which wouldn't ordinarily be a problem except "Automatic failover was unable to force-stop the process, and required manual failover to restore normal operation."

Yup, you got it: automation failed and a human sorted things out.

Or as Google put it: "The engineering team was alerted when the propagation of network configuration information stalled. They manually failed over to the replacement task to restore normal operation of the data persistence layer."
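For readers who like their post-mortems executable, here's a minimal sketch of the failure mode described above: a watchdog spots a stalled task, tries to force-stop it, and has to call for a human when the force-stop doesn't take. Everything in it (the `Task` class, `check_and_fail_over`, the 30-second threshold) is hypothetical and for illustration only; Google hasn't published how its failover actually works.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a config-propagation task."""
    name: str
    last_update: float       # time (seconds) of the last update it sent
    stoppable: bool = True   # False models a process that ignores force-stop
    running: bool = True

    def force_stop(self) -> bool:
        """Try to stop the task; return True only if it really stopped."""
        if self.stoppable:
            self.running = False
        return not self.running


def check_and_fail_over(primary: Task, replacement: Task,
                        now: float, stall_after: float = 30.0) -> str:
    """Return a description of the action the watchdog took."""
    if now - primary.last_update <= stall_after:
        return "healthy"
    # The primary has stalled: it must stop before the replacement starts.
    if primary.force_stop():
        replacement.running = True
        return "failed over automatically"
    # Force-stop failed, so automation gives up and escalates.
    return "page a human: force-stop failed"


stuck = Task("primary", last_update=0.0, stoppable=False)
spare = Task("replacement", last_update=0.0, running=False)
print(check_and_fail_over(stuck, spare, now=100.0))
```

In this toy version the replacement never starts while the stuck primary is still running, which is roughly the spot Google's engineers found themselves in.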

Google's pledged that in future it will "stop VM migrations if the configuration data is stale" and modify "the data persistence layer to re-resolve their peers during long-running processes, to allow failover to replacement tasks."

Automatically, from the sound of things. So don't get too smug, humans. ®
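As for the promised "stop VM migrations if the configuration data is stale" check, a guard of that kind could look something like the sketch below. The function name, the timestamp input and the 60-second limit are all assumptions made for illustration, not anything Google has described.

```python
from typing import Optional


def may_migrate(config_timestamp: Optional[float], now: float,
                max_age: float = 60.0) -> bool:
    """Allow a migration only if the network config is recent enough.

    config_timestamp is when the last configuration update arrived,
    or None if no update has ever been received.
    """
    if config_timestamp is None:
        # No configuration at all counts as stale.
        return False
    return now - config_timestamp <= max_age
```

The point of refusing to migrate is that a VM left where it is keeps its working network configuration, while one moved on stale data may land somewhere its peers can't find it.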




Biting the hand that feeds IT © 1998–2018