
Diary of a server failure

Uh oh. There go my RAID arrays


Sysadmin blog

I recently experienced a major server failure. This article is my post-mortem.

First, the environment in question:

I have various ESXi servers using 300GB Velociraptors in RAID 5 as local storage. One server that hosted 27 user VMs had two disks in a six-disk RAID 5 fail simultaneously. When I rebooted the server, the RAID card (LSI 1078) decided there were in fact two (broken) six-disk arrays.

Uh oh.

The LSI interface revealed no method by which I might convince the card that these two arrays were in fact one perfectly healthy set. As luck would have it, one of my prototype systems had an Intel RS2BL080 RAID card. (Fantastic piece of kit, by the way.) This controller is based on the newer LSI 2108 chip. I fed it the damaged array and prayed to $deity that it would understand the metadata on the disks.

It did. Better still, it picked all six drives up as members of the same array. The $deity object was, however, instantiated with the “fickle” property: no matter what I changed using the Intel controller, the original 1078 still picked the drives up as two arrays. The Intel controller resided in a Windows system. For added fun, ESXi absolutely refused to install itself onto the system with the Intel card. Though I have had some success with the excellent open source VMFS driver, I wasn’t quite ready to put my faith in it for critical information. How, then, to get at the information on that array?

I cobbled together a third system which would accept an ESXi install. I placed the Intel controller into this third system. It worked beautifully; I was able to get access to the array and the VMs stored within and avoid three long days of restoring from backups.

So what went wrong, and how could I have avoided this?

Two days prior to this incident, I had begun experiencing a spate of drive “failures” in my servers. Drives would drop out of the array for no reason but pass both vendor diagnostics and DBAN. It had been roughly 50 days since I last power cycled the servers: it was the Velociraptor TLER error that supposedly doesn’t exist. My assumption that I had gotten around to updating all of my Velociraptors proved a nearly fatal mistake.
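A quick sanity check on that 50-day figure. This is a sketch under one assumption the article does not state: that the drive firmware tracks uptime in an unsigned 32-bit millisecond counter. If so, the counter wraps just short of 50 days after power-on, which would line up with the timing described above.

```python
# Assumption (not stated in the article): the firmware keeps uptime in an
# unsigned 32-bit counter that ticks once per millisecond.
COUNTER_BITS = 32
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_until_wrap(bits: int = COUNTER_BITS) -> float:
    """Return how many days pass before a millisecond counter of `bits` width wraps."""
    return (2 ** bits) / MS_PER_DAY


wrap_days = days_until_wrap()
print(f"{wrap_days:.1f} days")  # prints "49.7 days"
```

If that is the mechanism, power cycling inside that window would mask the problem, but only a firmware update removes it.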

RAID 5 was a bad choice. Rebuild times are too high. Most arrays contain members of roughly the same age and drive generation; chances of a second drive failing during a rebuild are high. While the incident described here was not caused by a second failure during a rebuild, I’ve recently had two near misses due to RAID 5 rebuild times.
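To put rough numbers on the rebuild-window risk, here is a minimal sketch. It assumes drives fail independently at a constant rate derived from an annualized failure rate (AFR). The AFR and rebuild durations below are illustrative placeholders, not measurements from my arrays. Drives of the same age and generation tend to fail together, so treat the result as a floor rather than an estimate.

```python
import math

HOURS_PER_YEAR = 365 * 24


def second_failure_probability(drives: int, afr: float, rebuild_hours: float) -> float:
    """Chance that at least one surviving drive fails before the rebuild finishes.

    Assumes independent failures with a constant hazard rate (exponential model).
    `afr` is the annualized failure rate as a fraction, e.g. 0.03 for 3%.
    """
    survivors = drives - 1  # one member has already failed
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-survivors * rate_per_hour * rebuild_hours)


# Hypothetical inputs: a six-disk RAID 5, 3% AFR, 12-hour vs 48-hour rebuild.
short_rebuild = second_failure_probability(drives=6, afr=0.03, rebuild_hours=12)
long_rebuild = second_failure_probability(drives=6, afr=0.03, rebuild_hours=48)
print(f"12h rebuild: {short_rebuild:.4%}")
print(f"48h rebuild: {long_rebuild:.4%}")
```

The useful part is the shape, not the absolute values: exposure grows with both the number of surviving members and the length of the rebuild, which is exactly where large RAID 5 sets hurt.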

Local storage for my ESXi servers is also an issue. I did not buy my servers all in one go; capacity evolved organically to meet demand. With no SAN, pulling a terabyte worth of VMs off of server A then uploading them to server B takes over a day. ESXi 4.0 is speed limited when using the vSphere client to move files. (Our servers are not supported by 4.1.)
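For a sense of what "over a day" means in throughput terms, a back-of-the-envelope sketch follows. It assumes a decimal terabyte (10^12 bytes) and exactly 24 hours, so the figures are ceilings on the real rates. The second figure additionally assumes the download and upload legs run one after the other at equal speed.

```python
TERABYTE = 10 ** 12          # decimal terabyte, an assumption
SECONDS_PER_DAY = 24 * 60 * 60


def throughput_mb_s(size_bytes: float, seconds: float) -> float:
    """Average throughput in MB/s (10^6 bytes per second)."""
    return size_bytes / seconds / 1e6


# End to end: one terabyte of VMs relocated from server A to server B in a day.
end_to_end = throughput_mb_s(TERABYTE, SECONDS_PER_DAY)

# Per leg: the data crosses the wire twice (A to client, client to B).
per_leg = throughput_mb_s(2 * TERABYTE, SECONDS_PER_DAY)

print(f"end to end: {end_to_end:.1f} MB/s")  # prints "end to end: 11.6 MB/s"
print(f"per leg:    {per_leg:.1f} MB/s")     # prints "per leg:    23.1 MB/s"
```

Either way, the move runs far below what the local arrays themselves can sustain, which is why an emergency evacuation of a failing host is not a realistic plan without shared storage.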

For all the precautions I might have taken to prevent the error in the first place, the most damning failure was not having a recovery solution in place for this class of error. I had considered the possibility, but did not have more than the most basic plans in place to deal with it. Take-home lesson? If your system can break in a given way, take the time to research exactly how you’ll deal with it when it does. ®
