Facebook blames outage on internal config flaw
Cascading failure feedback loop calamity
Posted in CIO, 24th September 2010 09:16 GMT
Free whitepaper – Enabling Datacenter and Cloud Service Management for Mid-Tier Enterprises
Facebook has published a detailed explanation of an internal configuration flaw that left the site unavailable for around two and a half hours overnight - the social network's worst downtime in four years.
The outage stemmed from a cascading series of problems involving an error correction system that feed into a feedback loop that only cutting traffic to a database cluster and rebooting the site could solve.
The social network apologised for the downtime, which affected servers worldwide, and promised to redesign the faulty system it used to correct configuration values to prevent future problems in the area. In the meantime, this system has been taken out of commission.
Facebook's statement can be found here. Arbor Network's chart of traffic flowing to Facebook can be found here.
Thursday night's outage follows similar but less severe problems the day before. ®
Free whitepaper – Enabling Datacenter and Cloud Service Management for Mid-Tier Enterprises

Enabling Datacenter and Cloud Service Management for Mid-Tier Enterprises
The Register guide to hosted apps
Secure Mobile Working
Office 365 in the real world
Linux on the Desktop
