Wonky whitelist update blamed for AT&T's nationwide 911 blackout
FCC report finds lessons, mentions no fines as yet
A wrongly updated whitelist was behind the five-hour nationwide outage of AT&T's emergency 911 service in March, a report by America's phone regulator, the FCC, has revealed.
The whitelist of IP addresses plays a critical role in AT&T's backend systems, but was listed as containing only "customer" rather than "infrastructure" data and so did not go through rigorous checking before it was sent to the live network.
The wrongly updated list removed IP addresses used by one of the two providers that AT&T contracts to run its emergency system: Comtech. That meant that 911 calls made over AT&T's LTE network were prevented from reaching Comtech.
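As a rough illustration of the sort of pre-deployment check that could catch this class of mistake, the sketch below compares an old and a new whitelist and reports any address range marked as critical that the update would drop. It is not AT&T's tooling: the function name, the idea of a separate "critical" list, and the addresses (taken from reserved documentation ranges) are all assumptions made for the example.

```python
import ipaddress


def removed_critical(old, new, critical):
    """Return the critical ranges covered by `old` but not by `new`.

    All three arguments are lists of CIDR strings. The names and the
    notion of a "critical" list are hypothetical, for illustration only.
    """
    old_nets = [ipaddress.ip_network(n) for n in old]
    new_nets = [ipaddress.ip_network(n) for n in new]

    def covered(net, whitelist):
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so compare versions first.
        return any(net.version == w.version and net.subnet_of(w) for w in whitelist)

    missing = []
    for c in critical:
        c_net = ipaddress.ip_network(c)
        if covered(c_net, old_nets) and not covered(c_net, new_nets):
            missing.append(c)
    return missing


# Made-up data: the update drops the range that holds a critical block.
old_list = ["192.0.2.0/24", "198.51.100.0/24"]
new_list = ["192.0.2.0/24"]
critical_ranges = ["198.51.100.16/28"]

dropped = removed_critical(old_list, new_list, critical_ranges)
if dropped:
    print("Blocking deployment, critical ranges removed:", dropped)
```

A check like this only helps if the list of critical ranges is itself kept accurate, which is the classification problem the report describes.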
The result was that 12,600 people across the country heard fast busy signals, endless ringing or silence when trying to make an emergency call and were unable to reach an operator.
The report by the FCC points to a number of failings on AT&T's part, opening the company up to a massive fine from the federal regulator; the report notes it was one of the largest-ever 911 failings.
Aside from the initial mistake of removing Comtech's IP address from the whitelist, and then putting it live without proper testing, AT&T also failed to notify 911 call centers (called public-safety answering points, or PSAPs) of the problem clearly or swiftly.
Phone companies are obliged to let PSAPs know of any network problems within an hour. In this case, AT&T didn't start communicating until three hours after the initial outage, and even then the centers complained that the notifications were vague and didn't help them resolve the issue. Some centers were confused by the fact that they were still receiving calls from AT&T users (those users, it turned out, were mostly using traditional 3G networks rather than voice over LTE).
You tell him
AT&T's internal systems also delayed resolution of the problem: the outage was first reported to its 911 troubleshooting team within 16 minutes, but that team then had to pass the issue to AT&T's voice-over-LTE (VoLTE) team. And then to the Universal Service Platform team. And then to the Core Backbone team. Before finally reaching the IP team.
The IP team very quickly noted that the outage had happened at the same time as it had updated the whitelist, so it simply rolled back to the previous version and within three minutes the problem was resolved. The issue, though, is that it took nearly five hours for the IP team to be looped in. From now on AT&T will send alerts to all teams if there is a 911 outage.
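The IP team's diagnosis came down to spotting that a change and the outage started at about the same time. A minimal sketch of that correlation, assuming a simple change log of (name, deployment time) pairs, might look like the following. The function name, the 30-minute window and the timestamps are invented for illustration and are not drawn from the FCC report.

```python
from datetime import datetime, timedelta


def changes_near(outage_start, changes, window=timedelta(minutes=30)):
    """Return names of changes deployed within `window` before the outage.

    `changes` is a list of (name, deployed_at) pairs. Results are ordered
    most recent first, since the latest change is usually the first suspect.
    """
    hits = [
        (deployed_at, name)
        for name, deployed_at in changes
        if timedelta(0) <= outage_start - deployed_at <= window
    ]
    hits.sort(reverse=True)
    return [name for _, name in hits]


# Invented timestamps, purely to exercise the function.
outage = datetime(2000, 1, 1, 12, 0)
change_log = [
    ("whitelist-update", datetime(2000, 1, 1, 11, 58)),
    ("unrelated-maintenance", datetime(2000, 1, 1, 6, 0)),
    ("post-outage-change", datetime(2000, 1, 1, 12, 5)),
]

print(changes_near(outage, change_log))
```

Running a lookup like this when the first 911 alarm fires, rather than after four handoffs, is in the spirit of AT&T's decision to alert all teams at once.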
While all this was going on, the PSAPs did the best they could to get alternative emergency numbers out there, sending out messages over social media, asking local TV stations to scroll numbers at the bottom of their screens and in one case holding a press conference. But even so, thousands of people were left stranded and unable to contact the emergency services.
"This was unacceptable. In times of trouble, Americans in need must be able to reach Americans who can help," said FCC chair Ajit Pai in response to the report, which was formally acknowledged by the regulator at a meeting on Thursday.
The report notes that AT&T has made changes to its system that should prevent the same situation from happening again. But it also notes that had the company followed best practice guidelines, it would likely not have happened in the first place.
"This outage could have been prevented," said Pai. "It was the result of mistakes made by AT&T. The Bureau's report shows that there were shortfalls in operational redundancies, risk assessment, and stakeholder and consumer outreach.
"Had AT&T followed certain best practices as outlined by the FCC's Communications Security, Reliability and Interoperability Council, this outage would have had much less impact. Indeed, the cause of the outage could and should have been identified and addressed with periodic audits of the network."
What was missing from both the report and Pai's comments was any mention of the fine that looms over AT&T for having failed to maintain its emergency system. Previous 911 outages offer a guide:
- In April 2014, 11 million Americans in seven states lost 911 service for six hours following a coding error. As a result, the FCC fined CenturyLink $16m, Intrado Communications $1.4m, and Verizon $3.4m.
- In August 2014, T‑Mobile US suffered two 911 outages that together lasted three hours. The FCC fined the company $17.5m.
That means AT&T is likely looking at a fine of anywhere between $15m and $20m. ®