IT shops rank servers on downtime

IBM Power: good. Windows: not so much

Server vendors make a lot of noise about how reliable their systems are, but how do they really stack up?

It's hard to say. Getting qualitative information out of vendors is easy enough - they all seem to have the most reliable machines ever built - but what about some objective quantitative information that puts these claims to the test? This kind of data is hard to come by, but it does exist.

Laura DiDio, a server analyst who left Yankee Group to start her own firm, Information Technology Intelligence Corp, used to survey CIOs around the globe about the amount of downtime their various server platforms experience in a year. DiDio is still doing that work today and is happy to give El Reg some insight into what she learned from customers in ITIC's server hardware and operating system reliability survey.

The study that ITIC has put together is based on a survey of more than 400 C-level executives at companies in a representative distribution of industries and platforms located in 20 countries worldwide. The study looked not only at how many different kinds of outages were reported on the platforms running at these sites, but also at how long the outages lasted and the experience level of the system administrators at the sites. Not surprisingly, the sites that have the most sophisticated and rugged platforms and the most seasoned system administrators have the least amount of downtime.

The ITIC server reliability study puts server outages (be they caused by either a hardware or a software issue) in one of three buckets. Tier 1 outages are the "stupid stuff," says DiDio, such as someone accidentally powering down a box, which are quickly fixed. Tier 2 outages are trickier and result in system downtime of between 30 minutes and four hours. A crashed application, or getting permissions bollixed up, or some kind of patch gone awry, are the kinds of causes of Tier 2 outages.

Tier 2 outages usually require more than one system administrator to figure out and often require at least one administrator to be on site to physically deal with the box. Tier 3 outages are the worst, and the rarest among enterprise-class servers. These outages can span more than one box in an n-tier application and database setup, and they take more than four hours to resolve. They also can result in lost data and usually cause irritation to end users who can't get into their applications.
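For readers who want to bucket their own incident logs the same way, here is a minimal sketch of the three tiers by outage duration. It is not ITIC's methodology: the function name is hypothetical, the under-30-minute cutoff for Tier 1 is an assumption (the survey only says those outages are quickly fixed), and treating an outage of exactly four hours as Tier 2 is also an assumption.

```python
def classify_outage(duration_minutes: float) -> int:
    """Return 1, 2 or 3 for an outage lasting duration_minutes.

    Assumed thresholds, based on the tiers described in the article:
      Tier 1: under 30 minutes (assumption: "quickly fixed")
      Tier 2: 30 minutes up to and including four hours
      Tier 3: more than four hours
    """
    if duration_minutes < 0:
        raise ValueError("duration cannot be negative")
    if duration_minutes < 30:
        return 1
    if duration_minutes <= 4 * 60:
        return 2
    return 3


# Example: an accidental power-down, a botched patch, a multi-box failure.
for minutes in (5, 90, 300):
    print(minutes, "minutes -> Tier", classify_outage(minutes))
```

Duration is only one part of the picture. The survey also describes Tier 3 outages in terms of lost data and multiple boxes, which a duration-only check like this cannot capture.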

Among the customers surveyed by ITIC, IBM's Power Systems running AIX (this includes older System p and pSeries iron) experienced the least amount of downtime per year, when averaged across all customers using these platforms. AIX shops reported an average of 0.42 Tier 1 incidents per year and 0.34 Tier 2 incidents, and not one customer reported a Tier 3 outage on their AIX boxes. The Power Systems machines running i (and this includes older System i and iSeries iron) had an average of 0.56 Tier 1 outages per year, 0.44 Tier 2 outages per year, and 0.12 Tier 3 outages. So in 2009 at least, the i platform fared a little worse than the AIX platform running on Power iron.

The numbers for the i platform were pretty similar to the numbers reported to ITIC by shops running HP-UX on PA-RISC or Itanium iron or running Solaris on Sparc iron. HP-UX shops deploying HP-UX 11i v3 on older PA-RISC iron reported an average of 0.60 Tier 1 outages per year, followed by 0.43 Tier 2 outages and 0.10 Tier 3 outages. With HP-UX on Itanium, the numbers were a little higher, with an average of 0.65 Tier 1 outages, 0.48 Tier 2 outages, and 0.14 Tier 3 outages. On Sparc boxes running Solaris, customers reported an average of 0.59 Tier 1 outages per year, 0.49 Tier 2 outages, and 0.10 Tier 3 outages.

When you do the math on the outages tracked by ITIC, the average Power Systems-AIX box had less than 15 minutes of unplanned downtime per year, half of what it was last year. HP-UX boxes averaged 36 minutes of unplanned downtime on PA-RISC iron and 39 minutes on Itanium iron. Solaris boxes were in the same ballpark, with 35.4 minutes of downtime, but the aging of Sparc iron (caused in part by concerns among customers about Sun's future when it went onto the financial rocks in early 2008) is pushing up the downtime numbers a little bit here in 2009, according to ITIC's survey results.

Interestingly, servers running Mac OS at the shops polled by ITIC had 37.4 minutes of downtime per year.

Microsoft's Windows Server 2003 and Windows Server 2008 platforms did not fare as well, but they are improving. In 2008, ITIC's survey respondents reported an average of 3.77 hours of unplanned downtime per year for their Windows boxes, but this has shrunk by 35 per cent in 2009 to 2.42 hours of downtime. While Windows servers have more downtime, the percentage of server incidents that make it to the Tier 2 or Tier 3 level is not appreciably higher, with only 29 per cent of total outages landing in these higher tiers this year.
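To put the downtime figures in more familiar terms, here is a rough sketch that checks the year-on-year Windows improvement and converts annual downtime into an availability percentage. The function names are hypothetical, and the conversion assumes a 365-day year of round-the-clock operation, which is not something the survey states.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumption: 24x7 operation, no leap year


def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from before to after."""
    return (before - after) / before * 100


def availability_percent(downtime_hours: float) -> float:
    """Share of the year a server was up, given annual unplanned downtime."""
    return (1 - downtime_hours / HOURS_PER_YEAR) * 100


# Windows Server: 3.77 hours in 2008 down to 2.42 hours in 2009.
print(round(percent_reduction(3.77, 2.42), 1))   # 35.8

# Windows Server in 2009, then a box with 15 minutes of downtime per year.
print(round(availability_percent(2.42), 3))      # 99.972
print(round(availability_percent(15 / 60), 3))   # 99.997
```

The reduction works out to just under 36 per cent, in line with the roughly 35 per cent quoted. On these assumptions, even the Windows figure sits above 99.97 per cent availability, so the gap between platforms shows up in the third and fourth decimal place.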

Those using IBM's Power-AIX machines reported that 19 per cent of their incidents rose to the Tier 2 or Tier 3 level, and Power-i shops reported a similar 21 per cent of incidents at that level. Solaris shops said that 25 per cent of their outages came in at these higher levels.

The other interesting thing that DiDio tracked in the server reliability study is the experience level of the system administrators and the time it takes to patch a server. The average system admin in a Unix shop has 12.7 years of experience (and the average is 11 years for AS/400-i shops), which compares favorably with the seven years of experience for Windows admins, four years for Linux admins, and three years for Mac OS server admins.

"The experience level for Unix and AS/400 administrators is equivalent to having a master craftsman build something for you or a Grade-A mechanic fixing your car," says DiDio.

She added that commercial Linuxes have improved greatly in terms of documentation and that this is being reflected in the average time it takes a system administrator to patch a server. Linux shops reported that it took them anywhere from 15 to 19 minutes to patch a server, with variation depending on the Linux. (Ubuntu shows the greatest improvement among the Linuxes this year.) The Power-based servers took around 11 minutes to patch, on average, whether they were running AIX or i, while Solaris machines took 31 minutes and HP-UX boxes took 33 minutes. Windows Server 2003 machines took an average of 32 minutes to patch, according to survey respondents, and Windows Server 2008 machines took 38 minutes.

"The lesson to learn from all of this," says DiDio, "is that companies should not skimp on training and certification. That's penny wise, but pound foolish."

You can find out more about the ITIC server reliability survey here. ®
