Facebook blames 'server config change' for 14-hour outage. Someone run that through the universal liar translator
Is a single tweet enough when millions of people's communications are affected?
Facebook has said a "server configuration change" was to blame for a 14-hour outage of its services, which took down the Facebook social media service, its Messenger and WhatsApp apps, Instagram, and Oculus.
"Yesterday, as a result of a server configuration change, many people had trouble accessing our apps and services. We've now resolved the issues and our systems are recovering. We’re very sorry for the inconvenience and appreciate everyone’s patience," the tech goliath said in a tweet.
The outage started around 0900 Pacific Time (1600 UTC) on Wednesday and wasn't fully resolved until 2300 (0600 UTC) – an extraordinary delay for a service used by billions globally and run by a multi-billion-dollar Silicon Valley monster.
That brief and vague explanation – with no promise of an in-depth report to come – has left users and observers surprised and disappointed. Any company providing a service of similar size and impact, such as a phone operator, would be expected to provide constant updates and make its executives available to publicly explain what went wrong.
So far, Facebook has issued one tweet, and in response to repeat questions from The Register – and no doubt many other publications – has done nothing except repeat the same tweet as a statement.
It's not like Facebook is allergic to revealing technical details about itself: it has a whole sub-site dedicated to its internal software and data-center engineering work, though there's not a word about its latest outage, we note. In contrast, Google suffered a cloud platform outage, too, for about four hours yesterday, and its postmortem is detailed: a key part of its backend storage system was overloaded with requests after changes were made to counter a sudden demand for space to hold object metadata, ultimately causing services to stall.
Unlike almost every other company running a communications service for millions of users, Facebook does not even provide a system status dashboard for the public. It has a dashboard for app developers, but it is borderline useless. "We are currently experiencing issues that may cause some API requests to take longer or fail unexpectedly. We are investigating the issue and working on a resolution," it noted a few hours ago, somewhat stating the bleeding obvious. Since Facebook's network fell over yesterday, third-party applications, funnily enough, were not able to connect, and even now are still experiencing "intermittent errors."
In our experience, communications companies go out of their way to reach out to media outlets and explain major multi-hour outages in order to maintain public confidence in their network; they feel under an obligation to explain what is happening.
Not so Facebook.
Rollback or rolling your eyes?
As for the limited explanation of a "server configuration change" as the source of the problem, that terminology is so vague as to be useless: what sort of change? On what servers? What was the change intended to achieve? Was it tested beforehand? Was it rolled out gradually, or suddenly across all regions – and if the latter, why? Why was a rollback not immediately initiated? And if it was, why didn't it work? Why did it take 14 hours to resolve? All these are questions that you would expect a huge technology company to answer.
Instead, the best explanation we've found is a hypothetical rundown by Facebook's former chief information security officer Alex Stamos who assumes that Facebook engineers did initiate an automated rollback but that "the automated system doesn't know how to handle the problem, and gets stuck in some kind of loop that causes more damage. Humans have to step in, stop it, and restart a complex web of interdependent services on hundreds of thousands of systems."
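To make those questions concrete, here is a minimal sketch in Python of what "rolled out gradually" with a bounded automatic rollback can look like. It is purely illustrative: Facebook has not disclosed its tooling, and every name here (Fleet, staged_rollout, healthy, max_rollback_attempts, the region labels) is a hypothetical we made up for the example. The point is the retry cap, which is what stops automation from spinning in the kind of loop Stamos speculates about and hands the problem to humans instead.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


class RollbackFailed(Exception):
    """Automation has given up; a human operator has to take over."""


@dataclass
class Fleet:
    """Hypothetical fleet: region name -> config version currently live."""
    versions: Dict[str, str]
    log: List[str] = field(default_factory=list)

    def apply(self, region: str, version: str) -> None:
        self.versions[region] = version
        self.log.append(f"{region} -> {version}")


def staged_rollout(
    fleet: Fleet,
    stages: Sequence[Sequence[str]],
    new_version: str,
    healthy: Callable[[str, str], bool],
    max_rollback_attempts: int = 3,
) -> bool:
    """Push new_version one stage at a time.

    Returns True if every stage stayed healthy, False if the change was
    rolled back cleanly. Raises RollbackFailed when the rollback itself
    does not restore health within max_rollback_attempts tries.
    """
    previous = dict(fleet.versions)  # snapshot to roll back to
    touched: List[str] = []

    def all_healthy() -> bool:
        return all(healthy(r, fleet.versions[r]) for r in touched)

    for stage in stages:
        for region in stage:
            fleet.apply(region, new_version)
            touched.append(region)
        if all_healthy():
            continue  # safe to widen the rollout to the next stage
        # Bounded retries: never loop forever on a rollback that is not working.
        for _ in range(max_rollback_attempts):
            for region in touched:
                fleet.apply(region, previous[region])
            if all_healthy():
                return False
        raise RollbackFailed(f"regions {touched} still unhealthy after rollback")
    return True


if __name__ == "__main__":
    fleet = Fleet({"us-west": "v1", "us-east": "v1", "eu": "v1"})
    # Hypothetical health check: the new config breaks only in us-east.
    ok = staged_rollout(
        fleet,
        [["us-west"], ["us-east"], ["eu"]],
        "v2",
        healthy=lambda region, version: not (region == "us-east" and version == "v2"),
    )
    print(ok, fleet.versions)
```

In this toy run the change never reaches the third region: the second stage fails its health check, both touched regions go back to the old version, and the function reports a clean rollback. A change pushed to every region in a single stage gets no such early warning.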
But there is a much bigger question that this outage raises: at what point does Facebook go from a scrappy startup whose downtime doesn't really matter – it's just vacation humble-brags, birthday nights out, and stalker exes – to a service that risks disrupting society, public safety, and even the economy if it fails?
Just this month, US Senator Elizabeth Warren (D-MA) made the argument that services like Facebook, Google, and Amazon have become so large and so fundamental in the digital era that they should be viewed – and legislated as – "platform utilities."
Yeah, no biggy, we just went offline for a day ... Downdetector's outage graph for Facebook
When Facebook even refuses to provide a proper explanation for a 14-hour outage, the argument that there needs to be legislative oversight only grows stronger.
Related to that, yesterday it was revealed that Facebook is being investigated by a grand jury in New York for possible criminal charges thanks to its sharing of people's private data with other technology companies without seeking the consent of, or even informing, those that were affected.
The grand jury has subpoenaed records from two smartphone manufacturers over their secret data-sharing agreements with the antisocial network. And the investigation adds to those already being run by the FTC, the Securities and Exchange Commission (SEC), and the Department of Justice in the US over Facebook's handling of people's profile data in the Cambridge Analytica scandal and related secretive data-exchange programs.
The other big question is how a "server configuration change" led to not just Facebook but also its Messenger, WhatsApp, and Instagram services going down. That would strongly suggest that Facebook has either connected them up or attempted to connect them up at a low level, merging them into one broad platform. In January, it emerged that CEO Mark Zuckerberg had ordered that his instant-chat applications and social network be intertwined. And this month, Zuck alluded to this in an otherwise aimless blog post: "With all the ways people also want to interact privately, there's also an opportunity to build a simpler platform that's focused on privacy first."
So, that "server configuration change" may have been more conspiracy than cockup, a move to bring together Facebook's individual components. An effort so large and complex, it resulted in 14 hours of downtime. That may help explain why the biz is being so secretive about the cause of the outage. Bringing together everything under one roof is certainly one way to avoid potential regulatory break-up.
The company has been repeatedly rebuked – and fined – for trying to treat Instagram and WhatsApp user data as part of a larger Facebook database when that data was provided by users to what were entirely different companies under very different terms and conditions. And for using data in a way that it claimed it wouldn't.
Was the outage a result of Facebook trying to combine systems and so get ahead of regulators, especially when this month a public debate opened up over whether Facebook's takeover of Instagram and WhatsApp should be rolled back? It's all too possible.
Facebook, as ever, has declined to comment. Although, as we have seen in the past, its responses are often full of outright falsehoods so we may not be missing much. ®
Stop press: Chris Cox, who oversaw Facebook's family of apps, has quit, Zuckerberg announced today.