Google broke its own cloud AGAIN, with TWO software bugs

'VP of 24x7' apologises in person for latest TITSUP

A couple of days ago Google's cloud went offline, just about everywhere, for 18 minutes. Now the Alphabet subsidiary has explained why and issued a personal apology penned by “Veep for 24x7” Benjamin Treynor Sloss.

And yes, that is Sloss' real title.

Sloss says the problem started when “engineers removed an unused Google Compute Engine (GCE) IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network.” Google announces the IP blocks it is using to help route traffic into its cloud.

On this occasion, the propagation failed due to “a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management.”
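Google hasn't published the tooling involved, but the timing quirk it describes amounts to two configuration files briefly disagreeing about the removed block. A minimal Python sketch, with made-up file contents and a documentation prefix standing in for the real GCE range:

# Purely illustrative: two config files consulted by the network
# management system momentarily disagree about a removed IP block.

removed_block = "198.51.100.0/24"   # documentation prefix, not a real GCE block

config_file_a = {"gce_ip_blocks": []}               # removal already applied
config_file_b = {"gce_ip_blocks": [removed_block]}  # change not yet propagated

def consistent(a: dict, b: dict) -> bool:
    """Both sources of truth must agree before a change can propagate."""
    return set(a["gce_ip_blocks"]) == set(b["gce_ip_blocks"])

if not consistent(config_file_a, config_file_b):
    # This is the propagation failure that kicked off the fallback
    # path described next.
    print("propagation failed: configuration files disagree")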

When propagation fails, Google usually falls back to the configuration that was in place before the change. But on this occasion “a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.”
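Google hasn't shared the management software itself, so the following Python sketch of the intended fail-safe is only an assumption drawn from the description above, with hypothetical names throughout; the comment marks where the previously-unseen bug diverged from the intent.

class PropagationError(Exception):
    """Raised when a proposed network configuration fails to propagate."""

def apply_config_change(current_config: dict, proposed_config: dict, propagate) -> dict:
    """Push a proposed config; on failure, keep the last known good one."""
    try:
        propagate(proposed_config)
        return proposed_config
    except PropagationError:
        # Intended behaviour: retain and re-announce the previous,
        # known-good IP blocks. Per Google, the bug instead produced a
        # configuration with *all* GCE IP blocks removed and pushed that.
        propagate(current_config)
        return current_config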

Google says it has a “canary step” designed to catch messes like the one described above.

But the canary had a bug “and thus the push system concluded that the new configuration was valid and began its progressive rollout.”
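In outline, a canary step is a small-scale trial run before a progressive rollout. A hedged sketch, again with invented names rather than Google's actual systems:

def push_with_canary(new_config, canary_sites, all_sites, deploy, is_healthy):
    """Deploy to a few canary sites first; only roll out everywhere if they
    stay healthy, otherwise abort and leave the old config in place."""
    for site in canary_sites:
        deploy(site, new_config)
    if not all(is_healthy(site) for site in canary_sites):
        # This is the check that, per Google, itself contained a bug and
        # wrongly declared the IP-block-free configuration valid.
        raise RuntimeError("canary rejected config; aborting rollout")
    for site in all_sites:
        deploy(site, new_config)  # done progressively in practice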

Once the new configuration reached Google bit barns around the world, those that received the dud information stopped announcing their IP blocks, which made it rather hard to reach them. At first the Google cloud kept working, because traffic bound for an unreachable data centre was simply routed to another. But the dud IP configuration was also moving from bit barn to bit barn, pulling each one off the net in turn.
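The knock-on effect can be pictured with a toy model: each site announces the IP blocks it serves, and a site that announces nothing is simply unreachable. The site names and prefixes below are illustrative, not Google's real ranges.

# Toy model of IP block announcements; the prefixes are documentation
# ranges, not real GCE blocks.
announcements = {
    "site-a": ["192.0.2.0/24"],
    "site-b": ["203.0.113.0/24"],
}

def reachable():
    return [site for site, blocks in announcements.items() if blocks]

announcements["site-a"] = []   # dud config arrives: site-a stops announcing
print(reachable())             # ['site-b'] -- traffic can still fail over

announcements["site-b"] = []   # the same config reaches the next site
print(reachable())             # [] -- nobody is announcing the blocks any more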

The rest is 18 minutes of cloud outage history.

Google says it's found the bugs in its network configuration software responsible for this mess and killed 'em, says there are “14 distinct engineering changes planned spanning prevention, detection and mitigation”, and expects more will follow.

Sloss' apology follows:

We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect of such outages. This incident report is both longer and more detailed than usual precisely because we consider the April 11th event so important, and we want you to understand why it happened and what we are doing about it. It is our hope that, by being transparent and providing considerable detail, we both help you to build more reliable services, and we demonstrate our ongoing commitment to offering you a reliable Google Cloud platform.

They're fine words, but the fact remains that Google's cloud has been felled by a typo, bungled change management, lightning, failed automation and an imperfect patch. And all of those problems have happened since August 2015.

Incoming Google cloud boss Diane Greene has her work cut out for her. ®
