SaaS

This article is more than 1 year old

Google's robo-CTRL-ALT-DEL failed, hung networks and Compute Engine for 90 minutes

And then along came a good old flesh and blood human!

Mon 19 Feb 2018 // 03:58 UTC

Chalk up another one for good old humans: Google's admitted that an automation failure was the root cause of a 93-minute outage of its Compute Engine in the us-central1 and europe-west3 zones of its cloud on January 18th, 2018.

Google's classified the outage as a "network programming failure" and said autoscaler didn't do its job as a result. That software's failure meant new or newly-migrated VMs couldn't talk to VMs in other zones.

The cloud contender has explained the outage by revealing that "Propagation of Google Compute Engine networking configuration for newly created and migrated VMs is handled by two components. The first is responsible for providing a complete list of VM's, networks, firewall rules, and scaling decisions."

"The second component provides a stream of updates for the components in a specific zone."

Google broke its own cloud, again, with dud DB config change

During the outage, the first component didn't send any data. That lack of information meant VMs in other zones couldn't figure out how to contact their colleagues. Autoscaler also needed information flow from the first resource, so it too ground to a halt.

So why did that first component fail? Google's described the problem as "A stuck process" that didn't send updates. In other words, a good old process hang.

Which wouldn't ordinarily be a problem except "Automatic failover was unable to force-stop the process, and required manual failover to restore normal operation."

Yup, you got it: automation failed and a human sorted things out.

Or as Google put it: "The engineering team was alerted when the propagation of network configuration information stalled. They manually failed over to the replacement task to restore normal operation of the data persistence layer."

Google's pledged that in future it will "stop VM migrations if the configuration data is stale" and "the data persistence layer to re-resolve their peers during long-running processes, to allow failover to replacement tasks."

Topics

Special Features

Vendor Voice

Resources

SaaS

Google's robo-CTRL-ALT-DEL failed, hung networks and Compute Engine for 90 minutes

And then along came a good old flesh and blood human!

Google broke its own cloud, again, with dud DB config change

More about

More about

More about

More about

More about

TIP US OFF

Other stories you might like

Cyberattack hits Omni Hotels systems, taking out bookings, payments, door locks

Datacenter outages are on the decline, but when they hit, they hit hard

US-EAST-1 region is not the cloudy crock it's made out to be, claims AWS EC2 boss

Industrial systems integrating digitalisation

Tech trade union confirms cyberattack behind IT, email outage

McDonald's ordering system suffers McFlurry of tech troubles

LinkedIn's turn to fall over: Outage hits thinkfluencer hub

World-plus-dog booted out of Facebook, Instagram, Threads

AT&T's apology for Thursday's outage should stretch to a cup of coffee

Americans wake to widespread AT&T cellular outages

X protests forced suspension of accounts on orders of India's government

Cyberattack downs pharmacies across America

About Us

Our Websites

Your Privacy