Revealed: How Microsoft DNS went titsup globally on Xbox One launch day

A 'Group Policy Object' snafu brought Redmond to its knees

Exclusive: Microsoft's major outage last week was caused by a policy rollout that derailed its own DNS servers – a blunder that also downed some of the tech giant's internal services.

The outage hit on Thursday, during which key websites such as Xbox.com and Outlook.com were knocked over, connectivity to the Office 365 online software suite axed, and multiple Azure cloud services were cut off from the outside world.

We've also heard from multiple sources that the blunder scuppered parts of Microsoft's on-campus networks as well, rendering systems inaccessible to employees.

For a company that prides itself on becoming a "services and devices" firm, having multiple online services fall offline at once is very bad. Xbox.com, for instance, was taken down just as Microsoft's Xbox One console went on sale worldwide.

Now El Reg can reveal that the root cause of this mega-fail was a flubbed change to an Active Directory Group Policy, which ultimately rendered the company's DNS servers inaccessible.

The mistake "inadvertently blocked incoming DNS queries to Microsoft DNS servers," Microsoft wrote in a "post-incident review" document, seen by The Reg and distributed to affected customers. "All zones owned by Microsoft authoritative DNS infrastructure may not have resolved depending on client-side TTL."

In the report, the company said the outage started at about 10.10pm UTC on Thursday, when "network engineers observed difficulty making changes to DNS records on the authoritative DNS infrastructure". The system was back on track by 11.30pm.

At first, we're told, engineers tried to revert the Group Policy Object change, and started a forced refresh of group policy across DNS server infrastructure. No improvement was observed, and so at 11pm UTC they rebalanced their DNS server infrastructure. This helped, and at 11.15pm they executed a script to reboot the balancing of DNS servers. As this propagated, things got better.

Nonetheless, 80 minutes is not a brilliant amount of time to lock out folks from critical business services. During the outage, users may have had difficulty trying to access crucial online Microsoft services such as Exchange, SharePoint, Lync, and others due to "name resolution issues".

The impact on users also varied according to their own DNS time-to-live (TTL) settings, the company said. To fix the problems Microsoft plans to "improve policy change procedures", it said.
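To see why TTL matters here, consider a rough sketch of a caching resolver during the outage window Microsoft reported (about 10.10pm to 11.30pm UTC). This is our own illustration, not anything from the post-incident review: the function name `failing_time`, the placeholder date, and the assumption that a resolver re-queries the moment a cached record expires are all hypothetical simplifications.

```python
from datetime import datetime, timedelta

# Outage window from the article (UTC). The calendar date is a placeholder;
# only the times of day are taken from the report.
OUTAGE_START = datetime(2000, 1, 1, 22, 10)
OUTAGE_END = datetime(2000, 1, 1, 23, 30)


def failing_time(cached_at, ttl_seconds):
    """Estimate how long a resolver could not resolve a name.

    Assumes (hypothetically) that the resolver refreshes a record exactly
    when its TTL expires, and that every refresh attempted during the
    outage window fails until the window ends.
    """
    ttl = timedelta(seconds=ttl_seconds)
    expires = cached_at + ttl
    # Refreshes before the outage succeed, so roll the expiry forward.
    while expires < OUTAGE_START:
        expires += ttl
    # A cached answer that outlives the outage never triggers a failed lookup.
    if expires >= OUTAGE_END:
        return timedelta(0)
    return OUTAGE_END - expires


if __name__ == "__main__":
    last_cached = datetime(2000, 1, 1, 22, 0)
    for ttl in (300, 3600, 7200):
        print(ttl, failing_time(last_cached, ttl))
```

Under those assumptions, a client with a short TTL loses name resolution for most of the outage, while one holding a long-lived cached answer may never notice, which fits Microsoft's "depending on client-side TTL" caveat.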

It will also "update communication tools to improve resiliency" as it was unable to post to its services' health dashboard during the incident because "the DNS issue impacted internal service".

DNS is a tricky tech to manage when you're a global company fielding a vast quantity of online systems, but it strikes us that Microsoft made DNS changes a single point of failure – and this needs to be dealt with.

We're also confused as to why Microsoft neglects to publish reports like this in the open, instead treating them like valuable corporate information (they aren't) and sending them only to affected customers. If you receive other ones, don't hesitate to get in touch. ®
