Titsup EMC VNX kit unleashes 5 days of chaos in Sweden
Crashed Tieto system took down bank, pharmacy, schools' website
Tieto, a prominent Swedish IT service supplier, had an EMC array go titsup on 25 November, causing five days of chaos at the Motor Vehicle Inspectorate, the Sollentuna and Nacka municipalities, the City of Stockholm's schools' website and intranet, the National Board of Health and other prominent sites.
The debacle (in Swedish) started when a VNX array at Tieto apparently went bonkers and crashed. The cache failed, the original data on disk was corrupted and the copy of the data was also corrupt.
According to the Computer Sweden (CS) media outlet (Google Translate), a Legato Networker backup of virtual machine data to tape was involved and the backup data could not be read. There is talk of the EMC (Legato) Networker client software not being compatible with Windows Server 2008 R2, which Tieto was using. Tieto could not read Networker tape backups of its virtual Windows servers.
Swedish media reports that an upgrade at the VNX installation, to an NS480 partly for its caching, failed. This is odd as the NS480 is a Celerra product, a precursor to the VNX line which unified Clariion block and Celerra file storage in one product, albeit with two separate component operating systems. On the face of it adding an NS480 to a VNX would be a downgrade and not an upgrade.
Angry Swedes couldn't register car ownership, or have the equivalent of MOTs (vehicle inspection) carried out, medical prescriptions couldn't be processed, and many other IT services using Tieto facilities were paralysed. These included the Central Student Grants Committee, Malmo University and National Board of Health.
EMC's local headman, Robert Ekström, wouldn't talk about it to the press. Neither would Tieto's VP for IT operations, Michael Jupiter.
A CS report quoted Bo Andersson, the CIO of SBAB bank, which was heavily affected despite having a 99.8 per cent uptime agreement with Tieto: "You have to understand the magnitude of what happened. An hour's interruption is very serious, and after four hours an emergency plan is set in motion. One hundred hours is so far beyond anything we've ever been through that I find no words ... We are deeply shocked."
An EMC spokesperson said: "We cannot comment on the specifics of any of our customers, but we are happy to share with you Tieto’s own statement which was posted to their Swedish website in December." Here is an edited version of that statement:
On Friday 25 November, an EMC storage system in one of Tieto’s data centres in Sweden experienced a rare combination of component errors. EMC responded immediately, and within 48 hours the initial storage related issue was resolved. However, the incident triggered a sequence of events requiring a complex and time consuming recovery process affecting approximately 50 of Tieto’s customers.
... Due to the complexity of the recovery process the full system recovery for some of our customers took longer than originally anticipated. Furthermore services critical to the community have been prioritized throughout the recovery process. Of the total amount of affected services, all main services have already been recovered and brought completely back online.
... Tieto and EMC are jointly conducting a technical analysis to unveil the root causes of this incident. The technical analysis will be part of a more comprehensive investigation led by an external investigator and carried out in collaboration with Swedish Civil Contingencies Agency among others. We deeply regret the inconveniences this event has caused for our customers and our customers’ customers.
An EMC array crash causing five days of disruption and data loss doesn't say much for product reliability or Tieto's business continuity arrangements, even if it is a rare occurrence. Two days before the incident EMC published a document on disaster recovery – a tad ironic with hindsight.
No doubt the root cause will be found and shared with Tieto's customers in contract negotiations under conditions of secrecy, and to satisfy them that it won't happen again. It's doubtful if the world at large will find out what happened though. ®
Waiting on more details.
There are two old sayings that come to mind - "a workman is only as good as his tools", and "you need a computer to make a mess, but you need a beancounter to make a real disaster!" At the moment, it looks like someone at EMC or their supplying partner made a mistake in the tools selected for the job or in how they were implemented/upgraded, but it also looks like someone at the customer end hadn't really thought through their DR process. "What happens if I lose my storage device and can't recover from my primary backups?" should be a pretty standard question when designing a proper, belts-and-braces system with DR failover. Do I detect the foul odour of beancounters cutting costs.....?
Rise of the virtual machines
means you can fail lots of customers at once.
@Double disk failure
The failure of more than one item is far, far more likely than the usual calculations for redundancy assume!
That is based on "independent random failures" and does not allow for an external influence (PSU surge, over-temp, etc) stressing multiple items at once, or a bad batch of some component causing much higher failure rates across the board. Also, the strain of a RAID rebuild on traditional HDDs' head servos can provoke them to croak before redundancy is restored, though that should not matter for flash.
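The point above can be made concrete with some back-of-the-envelope arithmetic. All the numbers below (per-disk failure probability, array size, stress multiplier) are illustrative assumptions, not measured rates - the sketch only shows how badly the "independent failures" model underestimates the risk of a second failure during a rebuild once a shared stress is in play.

```python
# Hypothetical figures for illustration only - not measured failure rates.
p_annual = 0.03   # assumed annual failure probability of a single disk
n = 8             # disks in the RAID set

# Naive independent model: probability that at least one of the
# surviving n-1 disks fails during a ~24-hour rebuild window.
p_day = p_annual / 365
p_independent = 1 - (1 - p_day) ** (n - 1)

# Correlated model: a shared stress (rebuild load, PSU surge, bad batch)
# multiplies each survivor's failure rate during the rebuild window.
stress_factor = 50  # assumed multiplier
p_correlated = 1 - (1 - min(p_day * stress_factor, 1.0)) ** (n - 1)

print(f"independent model: {p_independent:.5f}")
print(f"correlated model:  {p_correlated:.5f}")
```

With these assumptions the correlated estimate comes out dozens of times higher than the independent one, which is the commenter's point: redundancy maths done on independent failures flatters the array.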
On a personal note, we have a Sun/Oracle 'open storage' system configured with dual redundancy and tested it for our acceptance by pulling two HDDs from a RAID set - and it failed. It failed more than once, in one case trashing some files (at least ZFS told us which ones!). Years and several firmware revisions later, Oracle has not attempted to find the actual cause, and assures us that because they have not replicated it recently it must somehow have been fixed by other code revisions.
Can someone remind me of who said "if you don't have 3 copies of your data, you don't really have your data"?
Not Even Allowed to Test
Several years ago I was involved with the introduction of a then well-known maker's hardware and worked on the acceptance test schedules. These were taken from the documents used for other suppliers.
The maker in question ran from the negotiations straight to his board-level contacts, claiming that we 'just wanted to break his equipment'. Our point was that other suppliers saw no problem and passed the test in question without issue.
The maker in question is no longer in business; their competitors are still trading. Is there a conclusion lurking there?
Backups are useless
If you don't test restores regularly.
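The commenter's rule can be sketched in a few lines: a backup only counts once a restore has actually been exercised and verified against the original. The tar-based "backup", the file names, and the checksum step below are illustrative stand-ins (nothing here reflects Tieto's or Networker's actual setup), but the shape - back up, restore elsewhere, compare checksums - is the whole point.

```python
# Minimal restore-test sketch: back up a file, restore it to a separate
# location, and verify the restored copy byte-for-byte via a checksum.
# All paths and the tar "backup format" are hypothetical examples.
import hashlib
import pathlib
import tarfile
import tempfile

def sha256(path: pathlib.Path) -> str:
    """Checksum a file so original and restored copy can be compared."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmpdir:
    tmp = pathlib.Path(tmpdir)

    # The "production" data we want to protect.
    original = tmp / "data.txt"
    original.write_text("customer records\n")
    checksum_before = sha256(original)

    # "Backup": write the file into a tar archive.
    archive = tmp / "backup.tar"
    with tarfile.open(archive, "w") as tar:
        tar.add(original, arcname="data.txt")

    # "Restore": extract into a separate directory, never over the source.
    restore_dir = tmp / "restore"
    restore_dir.mkdir()
    with tarfile.open(archive) as tar:
        tar.extractall(restore_dir)

    # The test that makes the backup real: checksums must match.
    restored = restore_dir / "data.txt"
    assert sha256(restored) == checksum_before, "restore verification failed"
    print("restore verified")
```

Run on a schedule against real backup media, this is the difference between having backups and having tapes you hope are backups - which is exactly where Tieto appears to have been caught out.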