Puny humans flub ANOTHER Google cloud patch

Immaturity or radical transparency: you be the judge

No-one says running a globe-spanning cloud is easy, but Google's just borked its own cloud for the fourth and fifth time in a few weeks.

This time the problem was with Google Compute Engine instances in us-central1, which experienced elevated packet loss last Friday. The problem wasn't a shocker – it lasted 14 minutes – but was noticed and duly fixed.

Here's what Google says happened:

“During routine planned maintenance a miscommunication resulted in traffic being sent to a datacenter router that was running a test configuration. This resulted in a saturated link, causing packet loss. The faulty configuration became effective at 02:10 and caused traffic congestion soon after.”

And here's what Google's going to do about it:

“To prevent similar incidents in future, we are changing procedure to include additional validation checks while configuring routers during maintenance activities. We are also implementing a higher degree of automation to remove potential human and communication errors when changing router configurations.”
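Google doesn't say what those validation checks actually look like, but the general idea can be sketched in a few lines. Everything below is illustrative assumption, not Google tooling: the config fields (`profile`, `link_capacity_gbps`, `expected_peak_gbps`), the 80 per cent headroom threshold, and the function itself are all hypothetical.

```python
# A minimal sketch of pre-apply router config validation, assuming a
# hypothetical config dict. None of these field names come from Google.

def validate_router_config(config: dict, environment: str) -> list[str]:
    """Return a list of validation errors; an empty list means safe to apply."""
    errors = []
    # A test configuration must never reach a production router -
    # the failure mode in Friday's incident.
    if config.get("profile") == "test" and environment == "production":
        errors.append("test profile targeted at a production router")
    # Reject configs that would saturate a link: expected peak traffic
    # must fit within link capacity, with headroom to spare.
    capacity = config.get("link_capacity_gbps", 0)
    peak = config.get("expected_peak_gbps", 0)
    if capacity <= 0 or peak > 0.8 * capacity:
        errors.append("expected peak traffic exceeds 80% of link capacity")
    return errors

# The incident's failure mode: a test profile on an undersized link
# in production, which a check like this would have flagged.
bad = {"profile": "test", "link_capacity_gbps": 10, "expected_peak_gbps": 40}
print(validate_router_config(bad, "production"))
```

Automating this sort of gate into the maintenance workflow, rather than relying on humans to eyeball configs, is presumably what Google means by removing "potential human and communication errors".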

The company has since reported another packet loss incident that struck on Sunday, US time.

That incident and last Friday's are nowhere near as bad as February's near-three-hour outage, early March's VM brownout, or the 50-minute App Engine hiccup later that month.

Google nonetheless fronted up with numerous reports on the faults and reasonably detailed descriptions of the fix.

All of which is very nicely transparent and gives your correspondent something to do on a slow-ish news day. But it also gets your correspondent wondering if what we're seeing in this stream of faults is a hint of operational immaturity, or of radical transparency. And if the latter, do cloud users really want or need to know about every operational hiccup? ®


Biting the hand that feeds IT © 1998–2019