Facebook blames outage on internal config flaw
Cascading failure feedback loop calamity
Facebook has published a detailed explanation of an internal configuration flaw that left the site unavailable for around two and a half hours overnight - the social network's worst downtime in four years.
The outage stemmed from a cascading series of problems involving an error correction system that feed into a feedback loop that only cutting traffic to a database cluster and rebooting the site could solve.
The social network apologised for the downtime, which affected servers worldwide, and promised to redesign the faulty system it used to correct configuration values to prevent future problems in the area. In the meantime, this system has been taken out of commission.
Thursday night's outage follows similar but less severe problems the day before. ®