Google BigQuery TITSUP caused by failure to scale-yer workloads
Engineers went head down, bum up but zipped lips gave users the … heebie jeebies
A four-hour outage of Google's BigQuery streaming service has taught the cloud aspirant two harsh lessons: its cloud doesn't always scale as well as it would like, and; it needs to explain itself better during outages.
The Alphabet subsidiary's trouble started last Tuesday when a surge in demand for the BigQuery authorization service “caused a surge in requests … exceeding their current capacity.” Or as we like to say here at El Reg, it experienced a Total Inability To Support Usual Performance and went TITSUP.
As Google explains, “The BigQuery streaming service requires authorization checks to verify that it is streaming data from an authorized entity to a table that entity has permissions to access.” There's a cache between the authorization service and its backend, but because “Because BigQuery does not cache failed authorization attempts, this overload meant that new streaming requests would require re-authorization, thereby further increasing load on the authorization backend.”
As authorization requests piled up, the strain on the already-stressed authorization backend meant “continued and sustained authorization failures which propagated into streaming request and query failures.”
All of which meant that for four hours last Tuesday afternoon (US Pacific time) 73 per cent of BigQuery streaming inserts failed with a 503 error code. At the peak of the problem, the failure rate was 93 per cent.
Google's now figured out that its cache wasn't big enough and that the authorization backend lacked capacity, fascinating admissions given that the premise of cloud is that it will just scale as demand increases. Google seems to have missed some tricks here, as among the tactics in its remediation plan is “improving the monitoring of available capacity on the authorization backend and will add additional alerting so capacity issues can be mitigated before they become cascading failures.”
The second thing Google's promised to do better in future is explain itself.
As the incident report says, “... we have received feedback that our communications during the outage left a lot to be desired. We agree with this feedback.”
“While our engineering teams launched an all-hands-on-deck to resolve this issue within minutes of its detection, we did not adequately communicate both the level-of-effort and the steady progress of diagnosis, triage and restoration happening during the incident. We clearly erred in not communicating promptly, crisply and transparently to affected customers during this incident.”
“We will be addressing our communications — for all Google Cloud systems, not just BigQuery — as part of a separate effort, which has already been launched.”
Google's being a little hard on itself here, as its incident reports are comfortably more detailed than those of its cloudy rivals. But on this occasion it appears users wanted even more.
The ad giant's decision to improve its crash-time comms surely bespeaks its cloudy aspirations: Amazon Web Services and Microsoft's Azure are considered the leaders of the cloud market, with Google and IBM scrapping for third place ahead of fast-growing contenders like OVH and Digital Ocean. And those rivals can all point to Google breaking its own cloud, sometimes with careless errors. ®