Microsoft explanation for Visual Studio online outage leaves open questions
Five-hour lockout caused by errant stored procedure
Microsoft has posted a resolution report on a recent problem with Visual Studio Team Services, a cloud-based code repository and developer collaboration platform.
"Between 09:10 and 14:28 UTC on 04 Feb 2016, customers attempting to log into their Visual Studio Team Services accounts will have been unable to access their accounts," says the report. The reason?
"A SQL stored procedure that was being called was allocating too much memory in one of the critical backend SQL databases. After an extended period of time, this caused the SQL databases to fall into an unresponsive state and resulted in customers being unable to access their VSTS accounts."
The fix took some time to implement because Microsoft's first effort was a failure. "Engineers attempted to failover the SQL database which allowed for temporary mitigation, however the same procedure was quickly allocating memory to the newly assigned databases, which in turn became unresponsive," the post reports.
The resolution that worked was to "manually assign allocation limits for the procedure that was being called."
As is often the case, the official explanation raises as many questions as it answers. What changed to cause a stored procedure suddenly to consume so much memory that it hangs SQL Server? Why can SQL Server itself not manage and prevent the condition? Why is failover not automatic if a service-critical database becomes unresponsive?
According to the product description, Visual Studio Team Services is "built on the enterprise-grade infrastructure of Microsoft Azure, and backed by a 99.9% SLA".®
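For a sense of scale, the outage window quoted in the report can be set against what a 99.9% availability figure would permit. The sketch below is back-of-envelope arithmetic only: it assumes the SLA is measured over a 30-day period and that the whole window counts as downtime, neither of which is stated in the report or the product description. The names `allowed_downtime` and `PERIOD` are illustrative.

```python
from datetime import datetime, timedelta

# Outage window as stated in Microsoft's report (times are UTC).
start = datetime(2016, 2, 4, 9, 10)
end = datetime(2016, 2, 4, 14, 28)
outage = end - start

# ASSUMPTION: a 30-day measurement period; the actual SLA terms may differ.
PERIOD = timedelta(days=30)


def allowed_downtime(sla_percent: float, period: timedelta) -> timedelta:
    """Downtime budget implied by an availability percentage over a period."""
    return period * (1 - sla_percent / 100)


budget = allowed_downtime(99.9, PERIOD)

# Availability over the period if this outage were the only downtime.
availability = 100 * (1 - outage / PERIOD)

print(f"Outage duration: {outage.total_seconds() / 60:.0f} minutes")
print(f"99.9% budget:    {budget.total_seconds() / 60:.1f} minutes")
print(f"Availability:    {availability:.2f}%")
```

Under those assumptions the lockout lasted 318 minutes against a budget of roughly 43 minutes, which would put availability for the period at about 99.26%.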