You don't know disk about storage failures

Research shows platters a victim of spin

Knowing just what breaks a storage box is of obvious interest to data center admins. It's quite reasonable to conclude the blame should be heaped on the 80-some platters spinning all day at 7,200 RPM.

But a recent study presented at the USENIX Conference on File and Storage Technologies argues that disk failure isn't nearly the whole story. Other components in a storage subsystem are often the point of failure, although their failings are still treated as disk faults. This results in unnecessary disk replacements — and inevitably an incomplete perspective on storage system resiliency.

The study, titled "Are Disks the Dominant Contributor for Storage Failures?", was authored by Weihang Jiang, Chongfeng Hu, and Yuanyuan Zhou of the University of Illinois department of computer science, and Arkady Kanevsky of Network Appliance.

Over a period of 44 months, the group analyzed storage logs from about 39,000 commercially deployed storage systems. They estimate the systems in total comprised about 1,800,000 disks hosted in about 155,000 storage shelf enclosures. The systems studied spanned near-line (backup), low-end, mid-range, and high-end hardware.

While the findings do show disk failures contribute 20-55 per cent of storage subsystem failures, other components such as physical interconnects (broken wires, shelf enclosure power outages, HBA failures, etc.) and protocol stacks (software bugs and compatibility issues) also account for a significant share of problems.

The group states that recent studies on storage failures fall short because of an excessive focus on disk malfunctions. For example, in June, Google released a paper that disputed the reliability claims of disk manufacturers from a user perspective. A good start, write the researchers.

"But as this study indicates, there are other storage subsystem failures besides disk failures that are treated as disk faults and lead to unnecessary disk replacements," their paper claims.

The research indicates between 27 and 68 per cent of storage subsystem failures come from physical interconnects, while between 5 and 10 per cent result from protocol stack errors. Because of these component failures, even subsystems built on slower, more reliable disks such as near-line backup drives show elevated overall failure rates.

"These results indicate that, to build highly reliable and available storage systems, only using resiliency mechanisms targeting disk failures (e.g. RAID) is not enough," the study states. "We also need to build resiliency mechanisms such as redundant physical interconnects and self-checking protocol stacks to tolerate failures in these storage components."

As an example, in low-end storage systems (defined as having embedded storage heads with shelf enclosures), the annualized failure rate (AFR) is about 4.6 per cent. The AFR for the disks alone is 0.9 per cent, or only about 20 per cent of the overall AFR.

Near-line storage disks (mostly SATA) show a 1.9 per cent AFR, but again the whole storage subsystem failure is higher, at 3.4 per cent.

So, on their own, low-end disks fail less often than near-line SATA disks, yet the near-line systems as a whole fail less often than the low-end systems.
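The arithmetic behind those comparisons can be sanity-checked in a few lines. This is a back-of-the-envelope sketch using only the AFR percentages quoted above, not data or code from the study itself:

```python
# Back-of-the-envelope check of the AFR figures quoted in the article.
# All percentages are as reported; nothing here is independently measured.

def disk_share(disk_afr: float, subsystem_afr: float) -> float:
    """Fraction of subsystem failures attributable to the disks alone."""
    return disk_afr / subsystem_afr

# Low-end systems: 0.9% disk AFR vs 4.6% whole-subsystem AFR
low_end = disk_share(0.9, 4.6)
# Near-line (SATA) systems: 1.9% disk AFR vs 3.4% whole-subsystem AFR
near_line = disk_share(1.9, 3.4)

print(f"low-end disks:  {low_end:.0%} of subsystem failures")   # roughly 20%
print(f"near-line disks: {near_line:.0%} of subsystem failures")  # roughly 56%
```

The two ratios bracket the 20-55 per cent disk-contribution range the paper reports, which is the point of the comparison: the disk AFR alone tells you little about how often the whole subsystem fails.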

The researchers argue this indicates that "disk failure rate is not indicative of the storage subsystem failure rate," meaning that other factors, such as shelf enclosure model and network configuration, strongly affect reliability.

The research team concludes that storage subsystem components cannot be ignored when designing a reliable storage box. They offer some suggestions to improve reliability.

Redundancy mechanisms such as multipathing reduced storage system AFR by 30-40 per cent when the number of interconnect paths was increased from one to two.

The researchers also recommend spanning a RAID group across multiple shelves — and using fewer disks per shelf, with more shelves in the system. This helps reduce the chances of a shelf failure taking out an entire RAID group.
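The benefit of spanning is easy to see from the worst case a single shelf failure can inflict on one RAID group. This sketch assumes an even spread of group members across shelves (the layout itself is an illustration, not the paper's exact scheme):

```python
# Sketch of the shelf-spanning idea (assumed even layout, not the
# paper's exact scheme): spreading a RAID group's members across more
# shelves caps how many members one shelf failure can take out.

def max_members_lost_per_shelf(group_size: int, shelves: int) -> int:
    """Worst-case members of one RAID group lost when a single shelf
    dies, with members spread as evenly as possible across shelves."""
    return -(-group_size // shelves)  # ceiling division

assert max_members_lost_per_shelf(8, 1) == 8  # one shelf: whole group gone
assert max_members_lost_per_shelf(8, 4) == 2  # needs RAID-6 to survive
assert max_members_lost_per_shelf(8, 8) == 1  # even RAID-5 survives
```

With one member per shelf, a shelf outage looks like a single disk failure, which ordinary RAID redundancy already tolerates; packing the group into one shelf turns the same outage into total group loss.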

The full paper is available here at the Usenix website. ®
