KCL external review blames whole IT team for mega-outage, leaves managers unshamed
And the Oscar for the best story about why you have recovery tests goes to...
Exclusive An external review into last October's catastrophic data loss at King's College London has placed the blame squarely at the feet of the IT technical team, which it found neither understood nor followed the university's system for backing up data.
The probe found that "IT did a pretty good job of managing the actual incident" but instead of naming individuals whose decisions had led to the catastrophe, it emphasised a series of failures of process, the responsibility for which fell to the department in general.
On October 17, 2016, a hardware failure caused an HP 3PAR storage system, which was propping up the entirety of the UK university's IT estate, to implode, taking out everything from payroll to shared drive access. As the smoke began to clear over the subsequent fortnight, departments across the university found themselves facing "irretrievable data loss" including archived research material as well as original data which had been funded with public money.
An assessment of the incident by HP found that its storage system could not be brought back up after the borked hardware component was replaced because of a firmware flaw, a fix for which had been released weeks beforehand. The review stated that IT "had not had the opportunity to apply this routine firmware update before the incident". HP claimed the update would have prevented the entire service outage.
In response, KCL brought in PA Consulting (themselves no strangers to data loss) to conduct an external review "to understand what went wrong with the infrastructure technology and how it was managed".
The resulting 23-page report, titled the IT Infrastructure Resiliency Review, was published today. Although only available internally at the university, The Register has seen multiple copies with identical hash sums and is making a copy available here [PDF] for those interested in the enormous disaster affecting the public institution.
Among the most anticipated details of the review was whether it would name the managers responsible for the poor decisions that caused the data loss. No names were included.
Instead, the review has stressed failures in process, explicitly stating that "the cause of the backup failure was due to the IT technical team not fully understanding the importance of the tape backups within the overall backup system and not following the backup procedures completely".
"In addition some data has consciously never been backed up on tape due to capacity constraints and the potential impact of this was never communicated to the College," the review added.
A prolonged migration of the data centre in the cellars of KCL's Strand campus to a new facility at Slough had left the university's backup architecture in a "transitional state for a number of years" with systems being migrated on to a 3PAR which, over time, had been left unconfigured to back up this new data.
Tape backups failed regularly and some folders were not backed up properly for several months. Daily Service Review updates incorrectly reported the tape backup status as successful even though the runs contained repeated failures to back up some shared drives.
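For readers wondering how a status report can say "successful" while individual jobs are failing, here is a minimal sketch of the opposite behaviour: a roll-up that marks the whole run as failed if any single job failed. It is purely illustrative and is not taken from the review. The `JobResult` type, the `summarise` function and the target names are all hypothetical, and nothing here reflects KCL's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    target: str      # hypothetical name of the share or folder backed up
    succeeded: bool  # True only if this job completed without error


def summarise(results: list[JobResult]) -> dict:
    """Roll up per-job results; any failure marks the whole run as failed."""
    failed = sorted(r.target for r in results if not r.succeeded)
    return {
        # An empty run is also treated as a failure: no jobs ran, so nothing was backed up.
        "status": "FAILED" if failed or not results else "OK",
        "failed_targets": failed,
    }


# Hypothetical nightly run with one failing shared drive.
nightly = [
    JobResult("payroll", True),
    JobResult("shared-drive-a", False),
    JobResult("research-archive", True),
]
report = summarise(nightly)
print(report["status"], report["failed_targets"])
```

The design choice is simply that the headline status is derived from the worst individual result rather than the most common one, so a single unprotected share cannot hide behind a run that otherwise looks healthy.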
All of this was "compounded by the team not comprehending the business criticality of the data being backed up to tape" and as such "these issues were not escalated".
A KCL spokesperson told The Register: "King's commissioned an independent review to establish the cause of the IT outage in October 2016 and to assess the future resilience of the university's IT infrastructure. The review is now complete and has been presented to the university management team.
"We recognise that there were failings and we sincerely apologise to those who were affected. King's accepts all the recommendations in the report and has already implemented a number of them. We have shared the full report internally and will present an in-depth management response, action plan and timeline for implementing improvements to the Council of the university at the end of March." ®
Get in touch
If you work at KCL and have been affected by the October incident, please contact The Register here, ideally from a non-university email address, to discuss the findings of the external review.