What a pain in the Azzz-ure: Microsoft Azure, SharePoint, etc knocked offline by DNS blunder
Technical term for today's three-hour outage is TITSUP: Total Inability To Support Users' Packets
For at least the past hour or two, Microsoft's Azure cloud has been up and down globally due to a DNS configuration mishap.
The platform-wide outage has knackered all sorts of Redmond-hosted systems around the world, from Azure SQL databases and App Services to multi-factor authentication, Microsoft 365 and Teams, Dynamics, SharePoint Online and OneDrive.
As we type, the cloud giant is gradually getting back on its feet, with Azure regions recovering one by one, though your mileage may vary. It appears trouble started around 1945 UTC.
"Customers may experience intermittent connectivity issues with Azure and other Microsoft services (including M365, Dynamics, DevOps, etc)," the Azure status page reads right now at 2128 UTC.
"Engineers are investigating DNS resolution issues affecting network connectivity. Connectivity issues are resulting in downstream impact to Compute, Storage, and Database services, and some customers may be unable to file support requests.
"More information will be provided as it becomes available. Some customers may start to see recovery."
In other words, the all-clear hasn't been given yet, and anything could happen in the next half hour, as they say.
Over on the Microsoft 365 status page, Redmond's techies blamed an internal DNS configuration error for the downtime:
"Users may be unable to access Microsoft 365 services or features.
"More info: Affected services include SharePoint Online, OneDrive for Business, Microsoft Teams, Stream, Power BI, Planner, Forms, PowerApps, Dynamics 365, Intune and Office Licensing.
"We've identified and corrected a DNS configuration issue that prevented users from accessing Microsoft 365 services and features. We've observed an increase in successful connections and our telemetry indicates that all services are recovering. We're continuing to monitor the environment to validate that service has been restored."
Microsoft 365 Status (@MSFT365Status) tweeted on May 2, 2019: "We've identified and corrected a DNS configuration issue that prevented users from accessing Microsoft 365 services. Further details can be found in the admin center under SP178746, OD178975, and MO178979."
Updated to add at 2300 UTC
Microsoft says it's fixed its broken systems, ending today's three-hour outage, and Azure's network infrastructure should more or less be back to normal: "Mitigation has been applied, and most services have recovered, with the exception of a small subset of services who may still experience some impact."
The tech goliath added that a bungled migration of its DNS systems was ultimately to blame: "Engineers identified the underlying root cause as a nameserver delegation change affecting DNS resolution and resulting in downstream impact to Compute, Storage, App Service, AAD, and SQL Database services.
"During the migration of a legacy DNS system to Azure DNS, some domains for Microsoft services were incorrectly updated. No customer DNS records were impacted during this incident, and the availability of Azure DNS remained at 100% throughout the incident. The problem impacted only records for Microsoft services.
"To mitigate, engineers corrected the nameserver delegation issue. Applications and services that accessed the incorrectly configured domains may have cached the incorrect information, leading to a longer restoration time until their cached information expired."