Google cloud: rubbish at updates, world-class at rapid rollbacks
Another borked software upgrade gives Google's cloud hiccups
Google's revealed that it has once again borked its own cloud with an update.
The latest incident hit last Thursday, when the company made what it's calling “A routine software upgrade to the authorization process in BigQuery”.
That update “had a side effect of reducing the cache hit rate of dataset permission validation … triggered a cascade of live authorization checks that fanned out and amplified throughout the BigQuery service, eventually causing user visible errors as the authorization backends became overwhelmed.”
“As a byproduct, error rates for the service increased as individual requests failed to authorize.”
Users experienced problems for six hours.
Google's promising it will “change the structure of permissions validation so that continual retries will not destabilize the entire service”.
The Reg's cloud desk hopes it is also changing the way it tests patches, because it is racking up quite a list of self-inflicted outages. In March, for example, one flawed fix caused a lengthy VM brownout and another took down App Engine for a time.
Come April and the company said a routine maintenance “resulted in traffic being sent to a datacenter router that was running a test configuration” and caused packet loss.
The good news is that Google fixes this stuff quickly: six-hour wobbles like the BigQuery incident are rather longer than the company's usual time to restore a distressed service. And it's not had outright outages. On the downside, the company keeps experiencing wobbles of its own making, but appears to have a world-class rapid response and rollback capability to keep things moving along. ®