
IT shops rank servers on downtime

IBM Power: good. Windows: not so much


Server vendors make a lot of noise about how reliable their systems are, but how do they really stack up?

It's hard to say. Getting qualitative information out of vendors is easy enough - they all seem to have the most reliable machines ever built - but what about some objective quantitative information that puts these claims to the test? This kind of data is hard to come by, but it does exist.

Laura DiDio, a server analyst who left Yankee Group to start her own firm, Information Technology Intelligence Corp (ITIC), used to survey CIOs around the globe about the amount of downtime their various server platforms experience in a year. DiDio is still doing that work today and was happy to give El Reg some insight into what she learned from customers in ITIC's server hardware and operating system reliability survey.

The study that ITIC has put together is based on a survey of more than 400 C-level executives at companies in a representative distribution of industries and platforms located in 20 countries worldwide. The study looked not only at how many different kinds of outages were reported on the platforms running at these sites, but also at the length of time the outages took and the experience level of the system administrators at the sites. Not surprisingly, the sites with the most rugged platforms and the most seasoned system administrators have the least downtime.

The ITIC server reliability study puts server outages (whether caused by a hardware or a software issue) in one of three buckets. Tier 1 outages are the "stupid stuff," says DiDio, such as someone accidentally powering down a box, and they are quickly fixed. Tier 2 outages are trickier and result in system downtime of between 30 minutes and four hours. A crashed application, bollixed-up permissions, or a patch gone awry are typical causes of Tier 2 outages.

These usually require more than one system administrator to figure out and often require at least one administrator to be on site to physically deal with the box. Tier 3 outages are the worst, and most rare among enterprise-class servers. These outages can span more than one box in an n-tier application and database setup, and they take more than four hours to resolve. They also can result in lost data and usually cause irritation to end users who can't get into their applications.
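The three buckets can be sketched as a simple duration check. This is only an illustration, not ITIC's method: the function name is hypothetical, and the 30-minute ceiling for Tier 1 is an assumption inferred from where Tier 2 begins, since the survey describes Tier 1 only as "quickly fixed."

```python
def classify_outage(duration_minutes: float) -> int:
    """Return the outage tier (1, 2, or 3) for a given outage duration.

    Thresholds follow the tiers described in the text:
      Tier 2: between 30 minutes and four hours
      Tier 3: more than four hours
    Treating everything under 30 minutes as Tier 1 is an assumption.
    """
    if duration_minutes < 0:
        raise ValueError("duration cannot be negative")
    if duration_minutes < 30:
        return 1
    if duration_minutes <= 4 * 60:
        return 2
    return 3


if __name__ == "__main__":
    # Hypothetical outage durations, in minutes.
    for minutes in (5, 45, 240, 600):
        print(minutes, "minutes -> Tier", classify_outage(minutes))
```

Duration is only one axis of the tiers; the survey also weighs how many administrators are needed and whether data is lost, which this sketch ignores.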

Among the customers surveyed by ITIC, IBM's Power Systems running AIX (this includes older System p and pSeries iron) experienced the least downtime per year, averaged across all customers using these platforms. AIX shops reported an average of 0.42 Tier 1 incidents per year and 0.34 Tier 2 incidents, and not one customer reported a Tier 3 outage on their AIX boxes. Power Systems machines running the i operating system (and this includes older System i and iSeries iron) had an average of 0.56 Tier 1 outages per year, 0.44 Tier 2 outages per year, and 0.12 Tier 3 outages. So in 2009 at least, the i platform fared a little worse than the AIX platform running on Power iron.

The numbers for the i platform were pretty similar to the numbers reported to ITIC by shops running HP-UX on PA-RISC or Itanium iron or running Solaris on Sparc iron. HP-UX shops deploying HP-UX 11i v3 on older PA-RISC iron reported an average of 0.60 Tier 1 outages per year, followed by 0.43 Tier 2 outages and 0.10 Tier 3 outages. With HP-UX on Itanium, the numbers were a little higher, with an average of 0.65 Tier 1 outages, 0.48 Tier 2 outages, and 0.14 Tier 3 outages. On Sparc boxes running Solaris, customers reported an average of 0.59 Tier 1 outages per year, 0.49 Tier 2 outages, and 0.10 Tier 3 outages.

When you do the math on the outages tracked by ITIC, the average Power Systems-AIX box had less than 15 minutes of unplanned downtime per year, half of what it was last year. HP-UX boxes averaged just 36 minutes of unplanned downtime on PA-RISC iron and 39 minutes on Itanium iron. Solaris boxes were in the same ballpark, with 35.4 minutes of downtime, but the aging of Sparc iron (caused in part by concerns among customers about Sun's future when it went onto the financial rocks in early 2008) is pushing up the downtime numbers a little bit here in 2009, according to ITIC's survey results.

Interestingly, servers running Mac OS at the shops polled by ITIC had 37.4 minutes of downtime per year.
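To put those minutes in perspective, annual downtime can be converted into an availability percentage. The sketch below is not part of the ITIC study; it assumes a 365-day year, uses a hypothetical helper name, and takes 15 minutes as an upper bound for Power-AIX, since the survey says only "less than 15 minutes."

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def availability_pct(downtime_minutes_per_year: float) -> float:
    """Convert unplanned downtime per year into percentage availability."""
    return 100.0 * (1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR)


# Downtime figures in minutes per year, as quoted in the article.
downtime = {
    "Power/AIX (upper bound)": 15.0,
    "HP-UX on PA-RISC": 36.0,
    "HP-UX on Itanium": 39.0,
    "Solaris on Sparc": 35.4,
    "Mac OS": 37.4,
}

if __name__ == "__main__":
    for platform, minutes in downtime.items():
        print(f"{platform}: {availability_pct(minutes):.4f}% available")
```

On these figures, every platform in the list lands above 99.99 per cent availability for unplanned outages.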

Microsoft's Windows Server 2003 and Windows Server 2008 platforms did not fare as well, but they are improving. In 2008, ITIC's survey respondents reported an average of 3.77 hours of unplanned downtime per year for their Windows boxes, but this has shrunk by 35 per cent in 2009 to 2.42 hours of downtime. While Windows servers have more downtime, the percentage of server incidents that make it to the Tier 2 or Tier 3 level is not appreciably higher, with only 29 per cent of total outages falling into these higher tiers this year.
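The year-over-year improvement is easy to check from the two figures quoted. A minimal sketch follows, with a hypothetical function name; on these inputs it works out to a little under 36 per cent, consistent with the roughly 35 per cent cited.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Windows Server unplanned downtime in hours per year: 2008 vs 2009.
    print(f"{pct_reduction(3.77, 2.42):.1f} per cent less downtime")
```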

Those using IBM's Power-AIX machines reported that 19 per cent of their incidents rose to the Tier 2 or Tier 3 level, and Power-i shops reported a similar 21 per cent of incidents at that level. Solaris shops said that 25 per cent of their outages came in at these higher levels.

The other interesting thing that DiDio tracked in the server reliability study is the experience level of the system administrators and the time it takes to patch a server. The average system admin in a Unix shop has 12.7 years of experience (and the average is 11 years for AS/400-i shops), which compares favorably with seven years of experience for Windows admins, four years for Linux admins, and three years for Mac OS server admins.

"The experience level for Unix and AS/400 administrators is equivalent to having a master craftsman build something for you or a Grade-A mechanic fixing your car," says DiDio.

She added that commercial Linuxes have improved greatly in terms of documentation and that this is being reflected in the average time it takes a system administrator to patch a server. Linux shops reported that it took them anywhere from 15 to 19 minutes to patch a server, with variation depending on the Linux. (Ubuntu shows the greatest improvement among the Linuxes this year.) The Power-based servers took around 11 minutes to patch, on average, whether they were running AIX or i, while Solaris machines took 31 minutes and HP-UX boxes took 33 minutes. Windows Server 2003 machines took an average of 32 minutes to patch, according to survey respondents, and Windows Server 2008 machines took 38 minutes.

"The lesson to learn from all of this," says DiDio, "is that companies should not skimp on training and certification. That's penny wise, but pound foolish."

You can find out more about the ITIC server reliability survey here. ®
