You don't know disk about storage failures

Research shows platters a victim of spin

Knowing just what breaks a storage box is of obvious interest to data center admins. It's quite reasonable to conclude the blame should be heaped on the 80-some platters spinning all day at 7,200 RPM.

But a recent study presented at the USENIX Conference on File and Storage Technologies argues that disk failure isn't nearly the whole story. Other components in a storage subsystem are often the point of failure, although their failings are still treated as disk faults. This results in unnecessary disk replacements — and inevitably an incomplete perspective on storage system resiliency.

The study, titled "Are Disks the Dominant Contributor for Storage Failures?", was authored by Weihang Jiang, Chongfeng Hu, and Yuanyuan Zhou of the University of Illinois department of computer science, and Arkady Kanevsky of Network Appliance.

Over a period of 44 months, the group analyzed storage logs from about 39,000 commercially deployed storage systems. They estimate the systems comprised about 1,800,000 disks in total, hosted in about 155,000 storage shelf enclosures. The researchers examined near-line (backup), low-end, mid-range, and high-end systems.

While the findings do show that disk failures account for 20 to 55 per cent of storage subsystem failures, other components such as physical interconnects (broken wires, shelf enclosure power outages, HBA failures, etc.) and protocol stacks (software bugs and compatibility issues) also account for a significant percentage of problems.

The group states that recent studies on storage failures have fallen short by focusing too narrowly on disk malfunctions. For example, in June, Google released a paper that disputed the reliability claims of disk manufacturers from a user perspective. A good start, write the researchers.

"But as this study indicates, there are other storage subsystem failures besides disk failures that are treated as disk faults and lead to unnecessary disk replacements," their paper claims.

The research indicates that between 27 and 68 per cent of storage subsystem failures come from physical interconnects. Between 5 and 10 per cent are a result of protocol stack errors. Because of these other component failures, even systems built on slower, supposedly more reliable disks, such as near-line backup, fail more often than their disks alone would suggest.

"These results indicate that, to build highly reliable and available storage systems, only using resiliency mechanisms targeting disk failures (e.g. RAID) is not enough," the study states. "We also need to build resiliency mechanisms such as redundant physical interconnects and self-checking protocol stacks to tolerate failures in these storage components."

As an example, in low-end storage systems (defined as having embedded storage heads with shelf enclosures) the annualized failure rate (AFR) is about 4.6 per cent. The AFR for the disks alone is 0.9 per cent, or only about 20 per cent of the overall AFR.

Near-line storage disks (mostly SATA) show a 1.9 per cent AFR, but again the failure rate of the whole storage subsystem is higher, at 3.4 per cent.

So, on their own, low-end disks fail less often than higher-end SATA disks, but total SATA systems fail less often than lower-end systems.
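The arithmetic behind that comparison can be checked in a few lines. The sketch below is ours, not the researchers': it uses only the AFR figures quoted above, and the `disk_share` helper and dictionary layout are illustrative names. It works out what fraction of each subsystem's AFR the disks alone explain.

```python
# AFR figures (per cent) as quoted in the article; the structure and
# function names here are illustrative, not taken from the paper.
afr = {
    "low-end": {"disk": 0.9, "subsystem": 4.6},
    "near-line": {"disk": 1.9, "subsystem": 3.4},
}


def disk_share(disk_afr, subsystem_afr):
    """Return the percentage of subsystem AFR attributable to disk failures."""
    return 100.0 * disk_afr / subsystem_afr


shares = {name: disk_share(v["disk"], v["subsystem"]) for name, v in afr.items()}

for name, share in shares.items():
    remainder = afr[name]["subsystem"] - afr[name]["disk"]
    print(f"{name}: disks explain {share:.0f}% of subsystem AFR; "
          f"{remainder:.1f} points come from other components")
```

On these rounded inputs, disks explain roughly a fifth of low-end subsystem failures and just over half of near-line ones, which is why ranking systems by disk AFR alone gives the wrong order.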

The researchers argue this indicates that "disk failure rate is not indicative of the storage subsystem failure rate," meaning there are other factors, such as shelf enclosure model and network configuration, that strongly affect reliability.

The research team concludes that storage subsystem components cannot be ignored when designing a reliable storage box. They offer some suggestions to improve reliability.

Redundancy mechanisms such as multipathing were able to reduce AFR for storage systems by 30-40 per cent when paths were increased from one to two.

The researchers also recommend spanning a RAID group across multiple shelves — and using fewer disks per shelf, with more shelves in the system. This helps reduce the chances of a shelf failure taking out an entire RAID group.
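To see why spanning helps, here is a minimal sketch of the idea. It is an illustration under our own assumptions, not the paper's method: the shelf count, disks per shelf, group size, and the `build_groups` and `max_lost_to_one_shelf` helpers are all hypothetical. It compares the worst-case number of RAID group members lost to a single shelf failure when groups are packed shelf by shelf versus striped across shelves.

```python
def build_groups(num_shelves, disks_per_shelf, group_size, span=True):
    """Split (shelf, slot) disks into RAID groups of group_size.

    span=True walks slot by slot across shelves, so neighbouring group
    members sit on different shelves; span=False fills one shelf at a time.
    """
    if span:
        disks = [(shelf, slot)
                 for slot in range(disks_per_shelf)
                 for shelf in range(num_shelves)]
    else:
        disks = [(shelf, slot)
                 for shelf in range(num_shelves)
                 for slot in range(disks_per_shelf)]
    return [disks[i:i + group_size] for i in range(0, len(disks), group_size)]


def max_lost_to_one_shelf(group):
    """Worst-case number of group members lost if a single shelf fails."""
    counts = {}
    for shelf, _slot in group:
        counts[shelf] = counts.get(shelf, 0) + 1
    return max(counts.values())


# Hypothetical layout: 8 shelves of 14 disks, RAID groups of 8 disks.
spanned = build_groups(8, 14, 8, span=True)
packed = build_groups(8, 14, 8, span=False)

worst_spanned = max(max_lost_to_one_shelf(g) for g in spanned)
worst_packed = max(max_lost_to_one_shelf(g) for g in packed)
print("spanned:", worst_spanned, "packed:", worst_packed)
```

With this made-up layout, a shelf failure costs a spanned group one disk, which single-parity RAID can survive, while a packed group can lose all eight members at once.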

The full paper is available at the USENIX website. ®
