The Register® — Biting the hand that feeds IT

Feeds

You don't know disk about storage failures

Research shows platters a victim of spin

SaaS data loss: The problem you didn’t know you had

Knowing just what breaks a storage box is of obvious interest to data center admins. It's quite reasonable to conclude the blame should be heaped on the 80-some platters spinning all day at 7200 RPMs.

But a recent study presented at the USENIX Conference on File and Storage Technologies argues that disk failure isn't nearly the whole story. Other components in a storage subsystem are often the point of failure, although their failings are still treated as disk faults. This results in unnecessary disk replacements — and inevitably an incomplete perspective on storage system resiliency.

The study, titled "Are Disk the Dominant Contributor for Storage Failures?" was authored by Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou of the University of Illinois department of computer science and Arkady Kanevsky of Network Appliance.

Over a period of 44 months, the group analyzed storage logs of about 39,000 commercially deployed storage systems. They estimate the systems in total were composed of about 1,800,000 disks hosted in about 155,000 storage shelf enclosures. The researchers examined near-line (backup) disks, low-end, mid-range, and high-end hardware.

While the findings do show disk failures contribute to 20-55 per cent of storage subsystem failures, other components such as physical interconnects (broken wires, shelf enclosure power outages, HBA failures, etc) and protocol stacks (software bugs and compatibility issues) also account for a significant percentage of problems.

The group states that recent studies on storage failures have failed due to an excessive focus on disk malfunctions. For example, in June, Google released a paper that disputed the reliability claims of disk manufacturers from a user perspective. A good start, write the researchers.

"But as this study indicates, there are other storage subsystem failures besides disk failures that are treated as disk faults and lead to unnecessary disk replacements," their paper claims.

The research indicates between 27-68 per cent of storage subsystem failures come from physical interconnects. Between 5-10 per cent are a result of protocol stack errors. Due to component failures, even slower, more reliable disks like near-line backup have higher failure rates.

"These results indicate that, to build highly reliable and available storage systems, only using resiliency mechanisms targeting disk failures (e.g. RAID) is not enough," the study states. "We also need to build resiliency mechanisms such as redundant physical interconnects and self-checking protocol stacks to tolerate failures in these storage components."

As an example, in low-end storage systems (defined as having embedded storage heads with shelf enclosures) the annualized failure rate (AFR) is about 4.6 per cent. The AFR for the disks only is 0.9 per cent, or only 20 per cent of overall AFR.

Near-line storage disks (mostly SATA) show a 1.9 per cent AFR, but again the whole storage subsystem failure is higher, at 3.4 per cent.

So, on their own, low-end disks fail less often than higher-end SATA disks, but total SATA systems fail less often than lower-end systems.

The researchers argue this indicates that "disk failure rate is not indicative of the storage subsystem failure rate," meaning there's other factors for failures such as shelf enclosure model and network configurations that strongly affect reliability.

The research team concludes that storage subsystem components cannot be ignored when designing a reliable storage box. They offer some suggestions to improve reliability.

Redundancy mechanisms such as mulitpathing were able to reduce AFR for storage systems by 30-40 per cent when paths were increased from one to two.

The researchers also recommend spanning a RAID group across multiple shelves — and using fewer disks per shelf, with more shelves in the system. This helps reduce the chances of a shelf failure taking out an entire RAID group.

The full paper is available here at the Usenix website. ®

Steps to Take Before Choosing a Business Continuity Partner

Latest Comments

Often thought...

... that a lot of claimed "disk failures" are frequently nothing of the sort.

However it's just easier to pull the disc and get it replaced if it's got corrupted.

Home users I bet tend to just blame the disc, and professional storage admins likely are under pressure to get the thing fixed quick. If it's a random corruption from a software glitch then likely they'll just assume it's fixed.

0
0

@Boris H.

NetApp ones.

Their data source is the NetApp AutoSupport database.

0
0

What systems are they testing this theory of theirs?

I can tell you right now that disk failures make up about 35-55% of storage failures.

The only reason to misdiagnose a disk failure is when the Control card is bad a very specific way so it brings only 1 disk down instead of multiples and when the enclosure slot itself is dead. Those 2 would make about sub 5% of all disk failures.

I am talking about EMC equipment that already utilizes multiple host paths as well as multiple paths to each disk within the enclosure itself.

Not a very useful study unless your storage system is bought in electronics store like BestBuy

0
0

More from The Register

SCO vs. IBM battle resumes over ownership of Unix
Zombie lawsuit back and wants to suck the brains out of Linux
 breaking news
You don't need phone lines or cable for ANYTHING, says Dish
The satellite-dish man can sort you out with phone and broadband over the air too
 breaking news
What's HP got under wraps? Looks awfully flash and tape shaped
What happens in Vegas won't stay there - we've got the details
Microsoft borks botnet takedown in Citadel snafu
Stupid Redmond kicked over our honeypots, wail white hats
IBM's $1bn layoffs latest: Now axe swings in US, Canada - reports
Union claims 121 storage bods canned after dismal sales
NetApp musters muscular cluster bluster for ONTAP busters
Storage array OS overhauled to juggle more nodes, go down on you, er, less
HP adds 'Haswell' Xeon E3s to entry ProLiant servers
Gussies up MicroServer for SMBs, adds baby switches