SaaS

This article is more than 1 year old

Google promises proper patch preparation after new cloud outage

Network egress issues hit Google Compute Engine for the second time in three weeks

Mon 9 Mar 2015 // 08:28 UTC

Google Compute Engine (GCE) users experienced a brownout over the weekend, after an incident that bears plenty of likeness to a worse outage that took down the service in February.

The February FAIL came about when “The internal software system which programs GCE’s virtual network for VM egress traffic stopped issuing updated routing information.”

This new outage, which started at 9:55 AM Saturday March 7th, Pacific Standard Time, was caused by “packet loss on egress network traffic” and meant users experienced symptoms ranging from “... no visible impact, to unusually slow responses, to timeouts attempting to contact the VM.”

Things were back to normal 43 minutes later and Google says virtual machines stayed up, but the cause of the mess this time was a botched patch.

Google's offered the following explanation of the incident:

The root cause of the packet loss was a configuration change introduced to the network stack designed to provide greater isolation between VMs and projects by capping the traffic volume allowed by an individual VM. The configuration change had been tested prior to deployment to production without incident. However as it was introduced into the production environment it affected some VMs in an unexpected manner.

This outage was less severe than the February incident and Google says its engineers are “investigating why the prior testing of the change did not accurately predict the performance of the isolation mechanism in production.” The company has a sterner response this time around, explained as follows.

Future changes will not be applied to production until the test suite has been improved to demonstrate parity with behavior observed in production during this incident. Additionally, Google engineers are immediately amending the rollout protocol for network configuration changes so that future production changes will be applied to a small fraction of VMs at a time, reducing the exposure in the event of undetected behavior.

This new outage was brief and minor. But Google's clearly been caught on the hop with its patching procedures.

Google and its ilk are looked to as the experts in computing at hyperscale. Outages like this suggest we've all got a lot to learn. ®

Topics

Special Features

Vendor Voice

Resources

SaaS

Google promises proper patch preparation after new cloud outage

Network egress issues hit Google Compute Engine for the second time in three weeks

More about

More about

Narrower topics

Broader topics

More about

More about

More about

Narrower topics

Broader topics

TIP US OFF

Other stories you might like

Google fires 28 staff after sit-in protest against Israeli cloud deal ends in arrests

Protest group says Google has fired more staff over sit-ins opposing work for Israel

Google One VPN axed for everyone but Pixel loyalists ... for now

Industrial systems integrating digitalisation

Tokyo wags finger at Google for blocking Yahoo Japan! from using ad tech

Google location tracking deal could be derailed by politics

Google squashes AI teams together in push for fresh models

UK data watchdog questions how private Google's Privacy Sandbox is

Google laying off staff again and moving some roles to 'hubs,' freeing up cash for AI investments

Google will pump more than $100B into AI, says DeepMind boss

Japan turns up heat on Apple, Google with threat of hefty fines

Google cools on cookie phase-out while regulators chew on plans

About Us

Our Websites

Your Privacy