Google cloud brownout fix forgets some servers
We fixed all those cloudy containers, says Google, twice!
Containers are so hot right now. So hot that they've just copped their very own cloud outage. And not just an outage but an imperfectly fixed outage.
The SNAFU in question befell Google's Container Engine, which on Friday afternoon West Coast time had trouble “ connecting to the metadata server”. The trouble continued for quite a while – a string of updates over Friday night and Saturday morning (again Pacific Time) explain how Google identified the problem, crafted a fix, implemented it and then declared the problem "should have been resolved for all affected clusters” as of 10:45 Pacific Time on the 25th of October.
Note the “should have”, because at 21:30 on the same day Google had to 'fess up that “We have identified a small number of additional Google Container Engine clusters that were not fixed by the previous round of repair.”
To reconstruct this timeline: on Friday afternoon, Google had a problem. By Coffee o'clock the next day, it declared it had restored normal service. But it hadn't and some people lost a big chunk of their Saturday night cleaning up a corner of the Google cloud that someone didn't fix properly the first time around.
Is this any way to run a cloud?
Yes, mostly, as Google's last cloud glitch was five weeks before this incident. That's a more-than-decent record even if the fix for this fail was flubbed. ®