Google broke its own cloud AGAIN, with TWO software bugs

'VP of 24x7' apologises in person for latest TITSUP

Thu 14 Apr 2016 // 06:31 UTC

A couple of days ago Google's cloud went offline, just about everywhere, for 18 minutes. Now the Alphabet subsidiary has explained why and issued a personal apology penned by “Veep for 24x7” Benjamin Treynor Sloss.

And yes, that is Sloss' real title.

Sloss says the problem started when “engineers removed an unused Google Compute Engine (GCE) IP block from our network configuration, and instructed Google’s automated systems to propagate the new configuration across our network.” Google announces the IP blocks it is using to help route traffic into its cloud.

On this occasion, the propagation failed due to “a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management.”

When propagation fails, Google's systems usually fall back to the last known good configuration, the one in place before the change was attempted. But on this occasion “a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.”
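For readers who want the intended behaviour spelled out, here is a minimal Python sketch of a "keep the last known good configuration" rule. It is an illustration only, not Google's code: `NetworkConfig`, `build_config` and the example IP blocks are all hypothetical names and values invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkConfig:
    """Hypothetical stand-in for a generated network configuration."""
    ip_blocks: frozenset


def build_config(primary_blocks, secondary_blocks, last_known_good):
    """Return a new config only if both source files agree.

    primary_blocks and secondary_blocks model the two configuration
    files mentioned in the incident report (illustrative names only).
    """
    if set(primary_blocks) != set(secondary_blocks):
        # The sources disagree: retain the previous known good
        # configuration rather than emit a partial or empty one.
        return last_known_good
    return NetworkConfig(ip_blocks=frozenset(primary_blocks))


good = NetworkConfig(frozenset({"10.0.0.0/16", "10.1.0.0/16", "10.2.0.0/16"}))

# The unused block 10.2.0.0/16 has been removed from one file but not the other.
result = build_config(
    ["10.0.0.0/16", "10.1.0.0/16"],
    ["10.0.0.0/16", "10.1.0.0/16", "10.2.0.0/16"],
    good,
)
print(result == good)  # True: the mismatch leaves the old config in place
```

The bug Google describes amounts to the mismatch branch handing back a configuration with no GCE IP blocks at all, instead of `last_known_good`.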

Google says it has a “canary step” designed to catch messes like that described above.

But the canary had a bug “and thus the push system concluded that the new configuration was valid and began its progressive rollout.”
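As a rough sketch of what a canary gate in front of a progressive rollout is meant to do, consider the Python below. Everything in it is an assumption made for illustration: the function names, the site names and the 50 per cent removal threshold are hypothetical and do not come from Google's report.

```python
def canary_ok(current_blocks, candidate_blocks, max_removed_fraction=0.5):
    """Reject a candidate that announces nothing or drops too many blocks.

    The 0.5 threshold is an arbitrary, hypothetical value.
    """
    if not candidate_blocks:
        return False
    removed = set(current_blocks) - set(candidate_blocks)
    return len(removed) / max(len(current_blocks), 1) <= max_removed_fraction


def progressive_rollout(sites, current_blocks, candidate_blocks):
    """Return the sites that receive the candidate config, in push order."""
    if not canary_ok(current_blocks, candidate_blocks):
        return []  # canary failed: the rollout halts before any site is touched
    return list(sites)


current = {"10.0.0.0/16", "10.1.0.0/16"}
sites = ["site-a", "site-b", "site-c"]

# A config with every block stripped out should never leave the canary step.
blocked = progressive_rollout(sites, current, set())

# Removing a single unused block passes the check and rolls out everywhere.
allowed = progressive_rollout(sites, current, {"10.0.0.0/16"})
```

In the outage, the equivalent of `canary_ok` wrongly reported success, so the rollout went ahead with the empty configuration.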

Once the new configuration reached Google bit barns around the world, those that received the dud information stopped announcing their IP blocks, which made it rather hard to reach them. At first, the Google cloud kept working because traffic bound for an unreachable data centre was routed to another. But the dud IP configuration information was also moving from bit barn to bit barn, pulling them off the net.

The rest is 18 minutes of cloud outage history.

Google says it has found and killed the bugs in its network configuration software responsible for the mess, has “14 distinct engineering changes planned spanning prevention, detection and mitigation” and expects more will follow.

Sloss' apology follows:

“We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect of such outages. This incident report is both longer and more detailed than usual precisely because we consider the April 11th event so important, and we want you to understand why it happened and what we are doing about it. It is our hope that, by being transparent and providing considerable detail, we both help you to build more reliable services, and we demonstrate our ongoing commitment to offering you a reliable Google Cloud platform.”

They're fine words, but the fact remains that Google's cloud has been felled by a typo, bungled change management, lightning, failed automation and an imperfect patch. And those problems all happened since August 2015.
