Facebook blames outage on internal config flaw
Cascading failure feedback loop calamity
Posted in Management, 24th September 2010 09:16 GMT
Facebook has published a detailed explanation of an internal configuration flaw that left the site unavailable for around two and a half hours overnight - the social network's worst downtime in four years.
The outage stemmed from a cascading series of problems involving an error correction system that fed into a feedback loop, which could only be broken by cutting traffic to a database cluster and rebooting the site.
The social network apologised for the downtime, which affected servers worldwide, and promised to redesign the faulty system it used to correct configuration values to prevent future problems in the area. In the meantime, this system has been taken out of commission.
Facebook's statement can be found here. Arbor Networks' chart of traffic flowing to Facebook can be found here.
Thursday night's outage follows similar but less severe problems the day before. ®
COMMENTS
It makes me cross
Each time Facebook publishes any kind of technical information, their post is plagued by hundreds of people commenting and claiming to know how to do it better or fix it. And generally they're talking absolute crap.
Facebook may not be perfect but they know what they're doing.
I'm not sure why it makes me so cross - but it does. It really does!
get a life ?
I laugh at all the people telling facebook users to "get a life" etc.
Ha. Ha. Ha.
I got exactly the same sort of glib condescending shyte from people when I used Cix for the first time, fidonet or indeed email/www.
Standard solution
I am not particularly gifted in the ways of multi-datacentre server management, but the ultimate solution appeared to be to turn it off and on again.
Outsourced to Renholm Industries?
Mr.
Facebook uses Akamai for the static files, such as photos, images, etc.
They don't use it for the main www site normally.
However, yesterday, during the outage, they changed the DNS entry for www.facebook.com to point to:
root@northway# host www.facebook.com.
www.facebook.com is an alias for sorry.ak.facebook.com.edgesuite.net.
sorry.ak.facebook.com.edgesuite.net is an alias for a1030.g.akamai.net.
a1030.g.akamai.net has address 92.122.127.27
a1030.g.akamai.net has address 92.122.127.33
As they said, they needed to stop all traffic to fix the problem, so temporarily diverting to their network of Akamai servers seems to be the way they chose to do it.
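For anyone who wants to trace that alias chain programmatically, here is a rough Python sketch that parses the `host` output quoted above and follows the CNAMEs to the final addresses. The function names `parse_host_output` and `resolve_chain` are made up for illustration, and the parsing assumes the plain "is an alias for" / "has address" line format shown in the lookup; it does not perform a live DNS query.

```python
import re

# Output of `host www.facebook.com.` exactly as quoted in the comment above.
HOST_OUTPUT = """\
www.facebook.com is an alias for sorry.ak.facebook.com.edgesuite.net.
sorry.ak.facebook.com.edgesuite.net is an alias for a1030.g.akamai.net.
a1030.g.akamai.net has address 92.122.127.27
a1030.g.akamai.net has address 92.122.127.33
"""

# Alias targets are printed with a trailing dot, which is dropped here.
ALIAS_RE = re.compile(r"^(\S+) is an alias for (\S+?)\.?$")
ADDR_RE = re.compile(r"^(\S+) has address (\S+)$")


def parse_host_output(text):
    """Return (aliases, addresses) parsed from `host`-style output."""
    aliases = {}    # name -> alias target
    addresses = {}  # name -> list of IPv4 addresses
    for line in text.splitlines():
        line = line.strip()
        match = ALIAS_RE.match(line)
        if match:
            aliases[match.group(1)] = match.group(2)
            continue
        match = ADDR_RE.match(line)
        if match:
            addresses.setdefault(match.group(1), []).append(match.group(2))
    return aliases, addresses


def resolve_chain(name, aliases, addresses):
    """Follow aliases from `name`; return (chain of names, final addresses)."""
    chain = [name]
    seen = {name}
    while chain[-1] in aliases:
        target = aliases[chain[-1]]
        if target in seen:  # guard against an alias loop
            break
        chain.append(target)
        seen.add(target)
    return chain, addresses.get(chain[-1], [])


aliases, addresses = parse_host_output(HOST_OUTPUT)
chain, ips = resolve_chain("www.facebook.com", aliases, addresses)
print(" -> ".join(chain))
print(ips)
```

The chain ends at an akamai.net name rather than a Facebook-owned one, which is what the lookup above shows for the duration of the outage.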
