Feeds

You don't know disk about storage failures

Research shows platters a victim of spin

The essential guide to IT transformation

Knowing just what breaks a storage box is of obvious interest to data center admins. It's quite reasonable to conclude the blame should be heaped on the 80-some platters spinning all day at 7200 RPMs.

But a recent study presented at the USENIX Conference on File and Storage Technologies argues that disk failure isn't nearly the whole story. Other components in a storage subsystem are often the point of failure, although their failings are still treated as disk faults. This results in unnecessary disk replacements — and inevitably an incomplete perspective on storage system resiliency.

The study, titled "Are Disk the Dominant Contributor for Storage Failures?" was authored by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou of the University of Illinois department of computer science and Arkady Kanevsky of Network Appliance.

Over a period of 44 months, the group analyzed storage logs of about 39,000 commercially deployed storage systems. They estimate the systems in total were composed of about 1,800,000 disks hosted in about 155,000 storage shelf enclosures. The researchers examined near-line (backup) disks, low-end, mid-range, and high-end hardware.

While the findings do show disk failures contribute to 20-55 per cent of storage subsystem failures, other components such as physical interconnects (broken wires, shelf enclosure power outages, HBA failures, etc) and protocol stacks (software bugs and compatibility issues) also account for a significant percentage of problems.

The group states that recent studies on storage failures have failed due to an excessive focus on disk malfunctions. For example, in June, Google released a paper that disputed the reliability claims of disk manufacturers from a user perspective. A good start, write the researchers.

"But as this study indicates, there are other storage subsystem failures besides disk failures that are treated as disk faults and lead to unnecessary disk replacements," their paper claims.

The research indicates between 27-68 per cent of storage subsystem failures come from physical interconnects. Between 5-10 per cent are a result of protocol stack errors. Due to component failures, even slower, more reliable disks like near-line backup have higher failure rates.

"These results indicate that, to build highly reliable and available storage systems, only using resiliency mechanisms targeting disk failures (e.g. RAID) is not enough," the study states. "We also need to build resiliency mechanisms such as redundant physical interconnects and self-checking protocol stacks to tolerate failures in these storage components."

As an example, in low-end storage systems (defined as having embedded storage heads with shelf enclosures) the annualized failure rate (AFR) is about 4.6 per cent. The AFR for the disks only is 0.9 per cent, or only 20 per cent of overall AFR.

Near-line storage disks (mostly SATA) show a 1.9 per cent AFR, but again the whole storage subsystem failure is higher, at 3.4 per cent.

So, on their own, low-end disks fail less often than higher-end SATA disks, but total SATA systems fail less often than lower-end systems.

The researchers argue this indicates that "disk failure rate is not indicative of the storage subsystem failure rate," meaning there's other factors for failures such as shelf enclosure model and network configurations that strongly affect reliability.

The research team concludes that storage subsystem components cannot be ignored when designing a reliable storage box. They offer some suggestions to improve reliability.

Redundancy mechanisms such as mulitpathing were able to reduce AFR for storage systems by 30-40 per cent when paths were increased from one to two.

The researchers also recommend spanning a RAID group across multiple shelves — and using fewer disks per shelf, with more shelves in the system. This helps reduce the chances of a shelf failure taking out an entire RAID group.

The full paper is available here at the Usenix website. ®

Boost IT visibility and business value

More from The Register

next story
Pay to play: The hidden cost of software defined everything
Enter credit card details if you want that system you bought to actually be useful
Shoot-em-up: Sony Online Entertainment hit by 'large scale DDoS attack'
Games disrupted as firm struggles to control network
HP busts out new ProLiant Gen9 servers
Think those are cool? Wait till you get a load of our racks
Silicon Valley jolted by magnitude 6.1 quake – its biggest in 25 years
Did the earth move for you at VMworld – oh, OK. It just did. A lot
VMware's high-wire balancing act: EVO might drag us ALL down
Get it right, EMC, or there'll be STORAGE CIVIL WAR. Mark my words
Forrester says it's time to give up on physical storage arrays
The physical/virtual storage tipping point may just have arrived
prev story

Whitepapers

Top 10 endpoint backup mistakes
Avoid the ten endpoint backup mistakes to ensure that your critical corporate data is protected and end user productivity is improved.
Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Backing up distributed data
Eliminating the redundant use of bandwidth and storage capacity and application consolidation in the modern data center.
The essential guide to IT transformation
ServiceNow discusses three IT transformations that can help CIOs automate IT services to transform IT and the enterprise
Next gen security for virtualised datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.