Google blames Gfail on 'availability' upgrade

100-minute outage. Won't happen again. Really

Calling Tuesday's one-hour-and-forty-minute Gmail outage a "Big Deal," Google has pinned the breakdown on some recent changes to the request routers that direct queries to the service's web servers.

Ironically, at least some of the changes were meant to improve Gmail's ability to stay online. But Google underestimated the load these changes would place on the routers when it took a relatively small number of servers offline for upgrades.

The company says it will spend the next few weeks correcting the problem, and it continues to boast that despite several conspicuous outages in recent months, Gmail "remains more than 99.9% available to all users."

In a blog post that surfaced on Tuesday night, Google said Tuesday's outage lasted for about 100 minutes. And though it didn't say what percentage of users were affected, it called the breakdown "a Big Deal, and we're treating it as such." Judging from reports from Reg and the Tweetbook set, the outage was worldwide, and Google indicated in an earlier post to its Google Apps Status dashboard that a "majority" of users were affected.

This morning Pacific time, Google took "a small fraction" of its Gmail servers offline for routine upgrades, and this put an unexpected load on its request routers. "We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers - servers which direct web queries to the appropriate Gmail server for response," reads the post from Ben Treynor, the Google vp of engineering who calls himself Site Reliability Czar.

"At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!' This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded."
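The domino effect Treynor describes can be sketched in a few lines of Python. This is a toy model, not Google's routing logic: the function name `simulate_cascade`, the even split of load across routers, and the capacity figures are all assumptions made for illustration. Each round, any router whose share of the traffic exceeds its capacity drops out, and the survivors inherit its load.

```python
def simulate_cascade(capacities, total_load):
    """Return the count of in-service routers after each round of load shedding.

    Assumption (not from Google's post): load is split evenly across the
    routers still in service, and a router leaves the pool as soon as its
    share exceeds its capacity.
    """
    healthy = list(capacities)
    rounds = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        # Routers that can carry their share stay in rotation.
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # stable: nobody is overloaded
        healthy = survivors
        rounds.append(len(healthy))
    return rounds


# Ten hypothetical routers with slightly different capacities (requests/sec).
capacities = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

print(simulate_cascade(capacities, 1000))  # [10]: every router copes
print(simulate_cascade(capacities, 1010))  # [10, 9, 7, 1, 0]: total collapse
```

In this model a one per cent rise in load is the difference between a healthy pool and none at all: the weakest router bows out first, and each departure pushes the per-router share past the next router's limit.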

This meant that who knows how many people were unable to access Gmail via the web - though the service was still available via POP and IMAP. Boasting that the Gmail engineering team was alerted to the problem within seconds - "we take monitoring very seriously" - the company solved the issue by bringing more request routers online. Service was restored at about 2:10pm Pacific.

Google says it will now increase its request router capacity well beyond peak demand - and make some additional tweaks to its infrastructure. "For example," Treynor says, "we have concluded that request routers don't have sufficient failure isolation (i.e. if there's a problem in one datacenter, it shouldn't affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load).

"We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements."
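What "degrade gracefully" might look like can be sketched the same way. Again, this is a toy model under stated assumptions rather than anything Google has described: the function name `degrade_gracefully`, the 100ms baseline latency, and the rule that response time stretches in proportion to overload are all hypothetical. The point is only that every router stays in rotation and answers more slowly, instead of refusing traffic and shifting its load to its neighbours.

```python
def degrade_gracefully(capacities, total_load, base_latency_ms=100.0):
    """Return a per-router latency when overloaded routers slow down
    instead of dropping out.

    Assumptions (illustrative only): load is split evenly, and latency
    grows linearly with the ratio of load share to capacity.
    """
    if not capacities:
        raise ValueError("need at least one router")
    share = total_load / len(capacities)
    # A router under its capacity answers at the baseline latency;
    # an overloaded one answers proportionally slower.
    return [base_latency_ms * max(1.0, share / c) for c in capacities]


# Ten hypothetical routers with slightly different capacities (requests/sec).
capacities = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]

latencies = degrade_gracefully(capacities, 1010)
print(len(latencies))   # all ten routers are still serving
print(max(latencies))   # the weakest router is only marginally slower
```

With slightly more traffic than the weakest router can handle, all ten routers keep serving and only that one slows down, by about one per cent in this model.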

According to Treynor, Google has turned its "full attention to helping ensure this kind of event doesn't happen again." ®
