Google cloud wobbles as workers patch wrong routers
Is 'Sorry about that, we promise to learn from our mistakes' any way to run a cloud?
Add another SNAFU to the long list of Google cloud wobbles caused by human error: this time the Alphabet subsidiary decided to patch the wrong routers.
The wobble wasn't a big one: it lasted just 46 minutes and only hit Google Compute Engine instances in the us-central1-f zone. Of course, it wasn't minor if yours was one of the 25 per cent of network flows that just didn't make it into or out of that zone.
Here's Google's explanation of the problem's root cause:
“Google follows a gradual rollout process for all new releases. As part of this process, Google network engineers modified a configuration setting on a group of network switches within the us-central1-f zone. The update was applied correctly to one group of switches, but, due to human error, it was also applied to some switches which were outside the target group and of a different type. The configuration was not correct for them and caused them to drop part of their traffic.”
Face, meet palm. Repeat.
Google says “The traffic loss was detected by automated monitoring, which stopped the misconfiguration from propagating further, and alerted Google network engineers.” Which is just what you'd expect in this age of automation. Sadly, however, “Conflicting signals from our monitoring infrastructure caused some initial delay in correctly diagnosing the affected switches.”
“This caused the incident to last longer than it should have.”
El Reg's cloud desk keeps an eye on status feeds from the big four clouds – AWS, Google, SoftLayer and Azure – and Google's is by far the busiest. A great many of the reports concern human error.
It could well be that rival clouds aren't as forthcoming with reports of messes like this, and that the stream of SNAFUs Google reports is a sign of commendable openness and transparency.
Or it could be a sign of immature processes. Google's notice about the incident says “To prevent recurrence of this issue, Google network engineers are refining configuration management policies to enforce isolated changes which are specific to the various switch types in the network. We are also reviewing and adjusting our monitoring signals in order to lower our response times.”
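For the curious, the kind of guard that notice describes might look something like the minimal Python sketch below. It is an illustration only: Google has not published its tooling, and every name here (Switch, Change, plan_rollout, the group and model labels) is invented for the example. The idea is simply that a change declares one target group and one switch type, and anything that doesn't match both is set aside before a single box is touched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    # Hypothetical inventory record; field names are assumptions.
    name: str
    group: str
    model: str


@dataclass(frozen=True)
class Change:
    # A change is scoped to exactly one group and one switch type.
    setting: str
    target_group: str
    target_model: str


def plan_rollout(change, switches):
    """Split switches into those the change may touch and those it must skip."""
    apply_to, skipped = [], []
    for sw in switches:
        in_group = sw.group == change.target_group
        right_type = sw.model == change.target_model
        if in_group and right_type:
            apply_to.append(sw)
        else:
            skipped.append(sw)
    return apply_to, skipped


inventory = [
    Switch("sw-a1", "group-a", "type-x"),
    Switch("sw-a2", "group-a", "type-x"),
    Switch("sw-b1", "group-b", "type-y"),  # different group and type
    Switch("sw-a3", "group-a", "type-y"),  # right group, wrong type
]
change = Change("example-setting", "group-a", "type-x")
apply_to, skipped = plan_rollout(change, inventory)
print([sw.name for sw in apply_to])
print([sw.name for sw in skipped])
```

A real system would want the skipped list to be loud: a non-empty one is a hint that somebody's target selection was wider than they meant it to be.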
In related news, Google's cloudy operation has announced tweaks to Cloud Trace, a tool it offers to diagnose problems in cloudy applications. ®