You're flowing it wrong: Bad network route between Microsoft, Apple blamed for Azure, O365 MFA outage
Postmortem report emerges for last week's partial downtime
Microsoft says last week's multi-factor authentication (MFA) partial outage, which hit its cloud-based services, was due to a dodgy network route between its servers and Apple's backend.
According to a postmortem penned by the Azure team on Thursday this week, the whole thing kicked off at around 1330 UTC (0630 PDT) on Friday, October 18 when packets went missing between Microsoft and the iGiant's Push Notification Service, aka the APNs.
That set off a chain reaction. With the bad network link preventing MFA tokens from reaching Apple devices via APNs, a backlog of requests built up in the MFA service, which then partially broke down.
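For the curious, the backlog mechanics can be sketched in a few lines of Python. This is a toy model, not Microsoft's code: the function name, the per-minute rates, and the ten-minute degradation window are made-up assumptions, chosen only to show how a queue grows when requests arrive faster than they can be delivered, and how long it takes to drain once the link recovers.

```python
def simulate_backlog(arrivals_per_min, normal_capacity, degraded_capacity,
                     degraded_minutes, total_minutes):
    """Return the queue length at the end of each minute.

    All parameters are hypothetical; none come from Microsoft's report.
    """
    backlog = 0
    history = []
    for minute in range(total_minutes):
        # While the link is degraded, fewer requests can be delivered per minute.
        capacity = degraded_capacity if minute < degraded_minutes else normal_capacity
        # The queue grows by arrivals, shrinks by deliveries, and never goes below zero.
        backlog = max(0, backlog + arrivals_per_min - capacity)
        history.append(backlog)
    return history


# Illustrative numbers only: peak-hour arrivals slightly below normal capacity.
history = simulate_backlog(
    arrivals_per_min=100,
    normal_capacity=120,
    degraded_capacity=40,
    degraded_minutes=10,
    total_minutes=45,
)

print("peak backlog:", max(history))
print("first minute with an empty queue:", history.index(0) + 1)
```

With these invented figures, ten minutes of degradation takes three times as long to clear afterwards, because the service has little spare capacity at peak. That lack of headroom is why packet loss coinciding with a traffic peak hurts more than the same loss at a quiet hour.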
"At 13:30 UTC, severe packet loss was experienced on an external network route between Microsoft and the Apple Push Notification Service (APNs)," Microsoft's techies explained.
"The packet loss was greater in severity and duration than previously encountered. It also coincided with morning peak traffic in North America. This combination of events caused a build-up of unprocessed requests in the MFA service, leading to service degradation and failures in MFA requests."
Not every Azure and Microsoft 365 user relying on Redmond's downed MFA security was locked out. Microsoft noted that those who already had a valid login session could still access its services, though a number of fresh login attempts failed. In fact, the IT titan claims just "0.51% of users in North American [Azure] tenants using the service during the incident" were affected by the roughly two-and-a-half-hour network degradation.
While Azure technicians were alerted within minutes of the partial outage kicking off, actually getting it under control took some time. Eventually, Microsoft decided to put together a hotfix that would bypass APNs altogether to get the rest of the system back online as usual. As far as the Azure side is concerned, the team described the situation thus:
Service monitors detected the build-up of unprocessed requests in the MFA service at 13:38 UTC and engineers were fully engaged by 13:48 UTC. Engineering confirmed the issue was a loss of network connectivity and began troubleshooting within the Microsoft datacenter networks. Engineering determined that the datacenter networks did not experience loss of connectivity and pinpointed the issue as external to the Microsoft datacenter networks.
While further troubleshooting was underway to identify the most impacted network routes, engineering prepared a hotfix to bypass the impacted external service altogether, and to restore MFA functionality. The hotfix was rolled out to one region to validate the effectiveness of the fix. In the meantime, the external network recovered, and packet loss was reduced to normal rates. Engineering paused further rollout of the hotfix. The network issue was confirmed to be mitigated at 15:57 UTC, and the MFA service functionality recovered. The hotfix, which was then redundant, was rolled back.
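The timestamps in Microsoft's account can be turned into elapsed times with a short Python sketch. The event labels and the helper name are our own; the four times are the ones quoted in the postmortem, all in UTC on the same day.

```python
from datetime import datetime

# Times as quoted in the postmortem (UTC, same day); labels are ours.
EVENTS = {
    "packet loss begins": "13:30",
    "monitors detect backlog": "13:38",
    "engineers fully engaged": "13:48",
    "network issue mitigated": "15:57",
}


def minutes_since_start(events):
    """Map each event name to whole minutes elapsed since the earliest event."""
    parsed = {name: datetime.strptime(t, "%H:%M") for name, t in events.items()}
    start = min(parsed.values())
    return {
        name: int((ts - start).total_seconds() // 60)
        for name, ts in parsed.items()
    }


elapsed = minutes_since_start(EVENTS)
for name, minutes in elapsed.items():
    print(f"{minutes:>4} min  {name}")
```

On those figures, detection took eight minutes, engineers were fully engaged after 18, and mitigation came 147 minutes in, which works out to 2 hours 27 minutes from the first packet loss.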
Getting the matter resolved for Microsoft 365, aka Office 365, took longer, we note: that service remained down for a few more hours.
In response to the outage, Microsoft said it will add more network connections within its Azure regions, and introduce logical partitioning to better separate different types of requests. "In-progress fine-grained fault domain isolation work has been accelerated," Team Microsoft explained. "This work builds on the previous fault domain isolation work which limited this incident to North American tenants." ®