KCL out(r)age continues: Two weeks TITSUP, two weeks to go
Now IT bricks it: A 'series of extremely unlikely events'
Exclusive We are two weeks into the outage issues at King's College London, and a communiqué from IT has warned staff that those issues won't be completely resolved for at least a fortnight more.
As of this morning, KCL's internal website and its software distribution system are still down, while library services and (vitally) payroll were back online before the weekend for staff and students. Tweets from @KingsITsystems continue to celebrate each time another system returns to the network.
A spokesperson for the university confirmed to The Register that its staff do not know when the outage, which appears to be the most significant ever to affect an educational institution, will be resolved. We can, however, reveal how it happened and how it left the university recovering some disks from incremental backups.
The issues at the university began a fortnight ago, when a routine system upgrade failed, causing corruption of both disks in a one-fault-resilient RAID array.
Since then, the university has been working with HP to try to recover the lost data – with "some success, but not complete success" according to an email sent to staff by Nick Leake, the university's CIO, which was seen by The Register.
Consequently, we have taken the decision to revert to the last backup – which was taken on Sunday night. Due to the way that the VS3 dynamically allocates disk space, the impact of the corruption is widespread – the VS3 spreads data around to maximise performance.
We are now in the process of recovering the backed up files, undertaking repairs to the VS3 storage drives, bringing onboard additional storage capacity and considering how we can utilise the facilities in Slough to alleviate workload in the Strand.
A second email admitted that it had been "a very difficult couple of weeks for IT with a major system outage and some data loss due to a series of extremely unlikely events, most of them completely out of our hands. This already has had, and will continue to have, a big impact on the university."
Leake went on to thank staff for their reactions to the issue and for the speed with which they have worked together to get systems back online. He signed off with a sure contender for Quote of the Week:
There will be lessons to be learnt from this incident and it would be helpful if you could note those down as they occur to you.
The greatest lesson to be learned will surely be about failure tolerance, responsibility for which must fall to an executive manager.
Something to add to the story? Get in touch with the author here. ®
TITSUP, in this case, stands for Total Inability To Support Usual Pedagogy.
We used the phrase "RAID Array" as it is how the system was described in the KCL email, but The Reg office is perturbed about whether it is redundant to talk of a Redundant Array of Inexpensive/Independent Disks Array.
And then we were wondering whether it was redundant to talk about redundancy in a redundancy system. It's just as much a Monday for you as it is for us, but if you could weigh in on the matter in the comments section that would be very kind.
Updated to add
An email sent around this morning by Professor Edward Byrne AC, KCL's president and principal, reads:
We will be undertaking a full review into the root causes of the outage and will make the necessary changes and investments to mitigate against this happening again. I would like to acknowledge the huge volume of work done by our IT staff, who have worked 24/7 to get our systems back up and running, and by all of you as you endeavour to minimise the impact on our students. I appreciate this has resulted in considerably more work for you as you strive to continue your day job.