The Register® — Biting the hand that feeds IT


Titsup EMC VNX kit unleashes 5 days of chaos in Sweden

Crashed Tieto system took down bank, pharmacy, schools' website


Tieto, a prominent Swedish IT service supplier, had an EMC array go titsup on 25 November, causing five days of chaos at the Motor Vehicle Inspectorate, the Sollentuna and Nacka municipalities, the City of Stockholm's schools' website and intranet, the National Board of Health and other prominent sites.

The debacle (in Swedish) started when a VNX array at Tieto apparently went bonkers and crashed. The cache failed, the original data on disk was corrupted and the copy of the data was also corrupt.

According to the Computer Sweden (CS) media outlet (Google Translate), a Legato Networker backup of virtual machine data to tape was involved and the backup data could not be read. There is talk of the EMC (Legato) Networker client software not being compatible with Windows 2008 R2, which Tieto was using. Tieto could not read Networker tape backups on virtual Windows servers.

Swedish media reports that an upgrade at the VNX installation, to an NS480 partly for its caching, failed. This is odd as the NS480 is a Celerra product, a precursor to the VNX line which unified Clariion block and Celerra file storage in one product, albeit with two separate component operating systems. On the face of it adding an NS480 to a VNX would be a downgrade and not an upgrade.

Angry Swedes couldn't register car ownership or have the equivalent of MOTs (vehicle inspections) carried out, medical prescriptions couldn't be processed, and many other IT services using Tieto facilities were paralysed. These included the Central Student Grants Committee, Malmö University and the National Board of Health.

EMC's local headman, Robert Ekström, wouldn't talk about it to the press. Neither would Tieto's VP for IT operations, Michael Jupiter.

A CS report quoted Bo Andersson, the CIO of SBAB bank, which was heavily affected despite having a 99.8 per cent uptime agreement with Tieto: "You have to understand the magnitude of what happened. An hour's interruption is very serious, and after four hours is an emergency plan set. One hundred hours is so far beyond anything we've ever been through, I find no words ... We are deeply shocked."
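For a sense of scale, here is a back-of-envelope sketch of what a 99.8 per cent uptime figure permits. It assumes the figure is measured over a calendar year of 8,760 hours, which the reports do not actually state, and the function names are purely illustrative.

```python
# Assumption: the uptime figure is measured over one non-leap calendar year.
HOURS_PER_YEAR = 365 * 24  # 8760


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget implied by an uptime percentage over a period."""
    return period_hours * (1 - uptime_percent / 100)


def achieved_uptime_percent(outage_hours: float,
                            period_hours: float = HOURS_PER_YEAR) -> float:
    """Uptime percentage left after a single outage of the given length."""
    return 100 * (1 - outage_hours / period_hours)


budget = allowed_downtime_hours(99.8)
outage = 100  # hours, the figure quoted by SBAB's CIO

print(f"Allowed downtime per year: {budget:.2f} hours")              # 17.52
print(f"Uptime after the outage:   {achieved_uptime_percent(outage):.2f}%")  # 98.86
print(f"Outage vs budget:          {outage / budget:.1f}x")          # 5.7
```

On that assumption the yearly downtime budget is about 17.5 hours, so a single 100-hour outage overshoots it more than five times over.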

An EMC spokesperson said: "We cannot comment on the specifics of any of our customers, but we are happy to share with you Tieto’s own statement which was posted to their Swedish website in December." Here is an edited version of that statement:

On Friday 25 November, an EMC storage system in one of Tieto’s data centres in Sweden experienced a rare combination of component errors. EMC responded immediately, and within 48 hours the initial storage related issue was resolved. However, the incident triggered a sequence of events requiring a complex and time consuming recovery process affecting approximately 50 of Tieto’s customers.

... Due to the complexity of the recovery process the full system recovery for some of our customers took longer than originally anticipated. Furthermore services critical to the community have been prioritized throughout the recovery process. Of the total amount of affected services, all main services have already been recovered and brought completely back online.

... Tieto and EMC are jointly conducting a technical analysis to unveil the root causes of this incident. The technical analysis will be part of a more comprehensive investigation led by an external investigator and carried out in collaboration with Swedish Civil Contingencies Agency among others. We deeply regret the inconveniences this event has caused for our customers and our customers’ customers.

An EMC array crash causing five days of disruption and data loss doesn't say much for product reliability or Tieto's business continuance arrangements, even if it is a rare occurrence. Two days before the incident EMC published a document on disaster recovery – a tad ironic with hindsight.

No doubt the root cause will be found and shared with Tieto's customers in contract negotiations under conditions of secrecy, and to satisfy them that it won't happen again. It's doubtful if the world at large will find out what happened though. ®


Waiting on more details.

There are two old sayings that come to mind - "a workman is only as good as his tools", and "you need a computer to make a mess, but you need a beancounter to make a real disaster!" At the moment, it looks like someone at EMC or their supplying partner made a mistake in the tools selected for the job or how they were implemented/upgraded, but it also looks like someone at the customer end hadn't really thought through their DR process. "What happens if I lose my storage device and can't recover from my primary backups?" should be a pretty standard question when designing a proper, belts-and-braces system with DR failover. Do I detect the foul odour of beancounters cutting costs...?


Rise of the virtual machines

means you can fail lots of customers at once.

Anonymous Coward

@Double disk failure

The failure of more than one item is far, far more likely than the usual calculations for redundancy assume!

Those calculations are based on "independent random failures" and do not allow for an external influence (PSU surge, over-temp, etc) stressing multiple items, or for a bad batch of some component causing much higher failure rates. Also, the strain of a RAID rebuild on a traditional HDD's head servo can provoke the remaining drives to croak before redundancy is restored, though that should not matter for flash.
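A toy illustration of the point about independence, using a simple common-cause model: each of two disks fails on its own with probability `p` during some window, and a shared event (surge, overheating, bad batch) takes out both with probability `c`. The model and every number in it are hypothetical, chosen only to show the shape of the effect, not measured from any real array.

```python
def p_double_failure_independent(p: float) -> float:
    """Both disks fail, assuming the failures are fully independent."""
    return p * p


def p_double_failure_common_cause(p: float, c: float) -> float:
    """Both disks fail when a shared event (probability c) kills both,
    and otherwise each disk fails independently with probability p."""
    return c + (1 - c) * p * p


# Hypothetical figures: 1% per-disk failure chance in the window,
# 0.1% chance of a shared event hitting both disks at once.
p, c = 0.01, 0.001

independent = p_double_failure_independent(p)
correlated = p_double_failure_common_cause(p, c)

print(f"Independent model:  {independent:.6f}")
print(f"Common-cause model: {correlated:.6f}")
print(f"Ratio:              {correlated / independent:.1f}x")  # 11.0
```

Even a shared event ten times rarer than a single-disk failure dominates the result here, which is why redundancy sums that assume independence tend to look better on paper than in the machine room.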

On a personal note, we have a Sun/Oracle 'open storage' system configured with dual redundancy, and tested it for our acceptance by pulling two HDDs from a RAID set. It failed, more than once, and in one case trashed some files (at least ZFS told us which ones!). Years and several firmware revisions later, Oracle has not attempted to find the actual cause, and assures us that because they have not replicated it recently it must somehow have been fixed by other code revisions.

Can someone remind me of who said "if you don't have 3 copies of your data, you don't really have your data"?

