How NOT to evaluate hard disk reliability: Backblaze vs world+dog
Consumer drives beat data centre versions... Yeah, let's put that to bed
HPC blog A few months ago, Brian Beach, a distinguished engineer at cloud backup joint Backblaze, published a set of study-like blog postings relating to his firm's experiences with hard drive lifespan in its 25,000+ spindle environment.
The blogs garnered quite a bit of interest due to the subject matter, and provocative titles like: How Long Do Hard Drives Last?, Enterprise Drives: Fact or Fiction? and What Hard Drive Should I Buy? The blogs raise interesting questions and put forward controversial conclusions.
One of the most contentious claims came from the first blog (El Reg's Simon Sharwood covers it here), where Beach asserts that consumer-grade hard drives are actually more reliable than their supposedly industrial-strength (and definitely more pricey) enterprise drive cousins.
According to Backblaze's research, enterprise drives failed at an annual rate of 4.6 per cent vs. 4.2 per cent for the consumer versions.
The bottom line, according to Beach, is that consumer drives are a better choice (even after factoring in the longer enterprise warranty) due to their higher reliability and lower cost.
Even more contentious is the last blog, which showed Backblaze failure rates by drive manufacturer. The results were pretty stark, with an “Annual Failure Rate” chart that showed Hitachi drives at less than 2 per cent; WD spinners at around 3 per cent; and Seagate drives at an astounding 14 per cent for the 1.5TB flavour, ~9 per cent for 3TB, and a high 3.8 per cent or so for the 4TB version. Yikes! We should stay away from Seagate, then, right?
A bit of digging into the firm's analysis reveals that the foundations underlying the Backblaze conclusions aren’t all that sturdy. Take the data centre vs consumer drive failure rate statistic, for example. To compute annual failure rates, Backblaze compares failures per "drive-years of service", which is the number of each type of drive they have multiplied by years of service – simple, eh?
The problem is that it is comparing 14,719 drive-years of service on its consumer disks vs only 368 drive-years of service on data centre-grade drives. Overall, the enterprise drives had 17 (4.6 per cent) failures while the consumer drives bricked 613 times (4.2 per cent).
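For readers who want to check the sums, here is a minimal sketch of the calculation as the article describes it: failures divided by drive-years of service. The function name is made up for illustration; only the four input figures come from Backblaze's numbers quoted above.

```python
def annual_failure_rate(failures: int, drive_years: float) -> float:
    """Return failures per drive-year of service, as a fraction."""
    if drive_years <= 0:
        raise ValueError("drive_years must be positive")
    return failures / drive_years


# Figures quoted in the article (enterprise vs consumer drives).
enterprise_afr = annual_failure_rate(17, 368)
consumer_afr = annual_failure_rate(613, 14_719)

print(f"enterprise: {enterprise_afr:.1%}")  # 4.6%
print(f"consumer:   {consumer_afr:.1%}")    # 4.2%
```

Note how lopsided the denominators are: the consumer figure rests on roughly 40 times as many drive-years as the enterprise one.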
This is a damned small sample on the data centre drive side of the equation. On 368 drive-years' worth of service, the difference between a 4.2 per cent and a 4.6 per cent annual failure rate is only about 1.5 spindles. In other words, if just two more enterprise drives had survived, the analysis would have shown data centre drives to be more reliable than consumer drives.
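The sensitivity is easy to sketch. The snippet below works out how many fewer enterprise failures would flip the ranking. It then adds a rough 95 per cent interval around each rate. The interval is not something Backblaze or the article computed: it assumes failures behave like a Poisson count and uses a normal approximation, which is crude with only 17 failures. Treat it as an illustration of how wide the uncertainty is, not a rigorous test.

```python
import math

# Figures quoted in the article.
ENTERPRISE_FAILURES, ENTERPRISE_DRIVE_YEARS = 17, 368
CONSUMER_FAILURES, CONSUMER_DRIVE_YEARS = 613, 14_719


def rough_interval(failures, drive_years, z=1.96):
    """Approximate 95% interval for the failure rate.

    Assumption (not from the article): failures are Poisson-distributed,
    so the standard error of the count is sqrt(failures).
    """
    rate = failures / drive_years
    half_width = z * math.sqrt(failures) / drive_years
    return max(0.0, rate - half_width), rate + half_width


consumer_rate = CONSUMER_FAILURES / CONSUMER_DRIVE_YEARS

# Enterprise failure count at which the two rates would be equal.
break_even = consumer_rate * ENTERPRISE_DRIVE_YEARS
# Whole drives that would have had to survive to reverse the ranking.
failures_to_flip = ENTERPRISE_FAILURES - math.floor(break_even)

enterprise_ci = rough_interval(ENTERPRISE_FAILURES, ENTERPRISE_DRIVE_YEARS)
consumer_ci = rough_interval(CONSUMER_FAILURES, CONSUMER_DRIVE_YEARS)
# True when the two intervals share any values.
intervals_overlap = (
    enterprise_ci[0] <= consumer_ci[1] and consumer_ci[0] <= enterprise_ci[1]
)

print(f"break-even enterprise failures: {break_even:.1f}")
print(f"drives needed to flip the result: {failures_to_flip}")
print(f"enterprise interval: {enterprise_ci[0]:.1%} to {enterprise_ci[1]:.1%}")
print(f"consumer interval:   {consumer_ci[0]:.1%} to {consumer_ci[1]:.1%}")
print(f"intervals overlap: {intervals_overlap}")
```

With the unrounded rates the gap comes out slightly larger than the 1.5 spindles you get from the rounded percentages, but the conclusion is the same: two drives decide the outcome, and under these assumptions the enterprise interval comfortably swallows the consumer one.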
Moreover, Backblaze has only run the enterprise drives for two years, compared to the more than four years of mileage on their consumer disks. Beach does acknowledge this fact, but doesn’t see any reason to believe that their enterprise drives will become more reliable in the next three years or to the end of their warranty period.
So what hard drive should I buy? Tell me, tell me!
This blog post (What Hard Drive Should I Buy?) is the one that really got my attention. Looking at big colourful charts tells me that I should avoid Seagate drives like email from a Nigerian bureaucrat who’s just looking for a bit of help getting some money out of his country.
But the real story is a lot more nuanced and complicated. This article from Instrumental CEO Henry Newman does a great job of digging into the guts of the Backblaze analysis and pointing out the shortcomings in their approach.
Henry Newman is a bit of an institution in the HPC and storage world. He’s not what I’d call "reserved" when it comes to sharing his opinions – not a guy who pulls his punches. But he also backs up his opinions with facts and solid research, making him one of my go-to sources.
In his analysis of the analysis, Henry points out that Backblaze’s Seagate results are hugely skewed by two drive models – the 1.5TB Barracuda and Barracuda Green SKUs. Seagate publicly disclosed problems with this drive family back in 2008, so it’s not surprising that these drives have, well, problems, right?
There are also some issues with exactly how they’re evaluating the drives, how much traffic a consumer drive should be expected to handle, and things along those lines. It’s all pretty interesting stuff and points to the need for more rigorous research and testing when it comes to drives and reliability.
One final point: at the end of the “What Hard Drive Should I Buy?” post, Beach discussed what Backblaze is buying today. Right now, their most favoured drive is the ... wait for it ... Seagate 4TB Barracuda – even though it supposedly is less reliable than the WD or Hitachi drives. Huh?
Brian Wilson, CTO and founder of Backblaze (he shares a name with the 71-year-old Beach Boys front man - though we do not think they are one and the same) explains it this way:
Double the reliability is only worth 1/10th of 1 percent cost increase. I posted this in a different forum:
Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)
So the value of higher reliability – in their unique situation – isn’t nearly as much as one might think. Using Brian’s analysis above, this means that a drive that offered double the reliability of the 4TB Seagates (which currently cost around $160) is only worth an additional $0.16 (yeah, sixteen cents) to Backblaze.
A quick check of Western Digital and Hitachi (now owned by WD) 4TB spindles reveals that retail prices of these are roughly $30 - $50 more than the Seagate alternative.
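Wilson's back-of-the-envelope sums can be reproduced in a few lines. This is a sketch using only the figures quoted above; the variable and function names are invented for illustration, and the $5,000 salary saving and $4m fleet cost are Wilson's own rough estimates rather than measured values.

```python
# Figures from Brian Wilson's quote and the article.
DRIVES = 30_000
MINUTES_PER_SWAP = 15
LABOUR_SAVING_USD = 5_000      # Wilson's estimate for halving failures
FLEET_COST_USD = 4_000_000     # Wilson's estimate for 30,000 drives
SEAGATE_4TB_PRICE_USD = 160
RETAIL_PREMIUM_USD = (30, 50)  # WD/Hitachi premium range cited in the article


def replacement_hours(drives, failure_pct, minutes_per_swap=MINUTES_PER_SWAP):
    """Hours of labour spent swapping failed drives."""
    failed = drives * failure_pct / 100
    return failed * minutes_per_swap / 60


hours_at_2pct = replacement_hours(DRIVES, 2)
hours_at_1pct = replacement_hours(DRIVES, 1)

# Exact ratio of the saving to the fleet cost, and Wilson's rounded version.
exact_fraction = LABOUR_SAVING_USD / FLEET_COST_USD
rounded_fraction = 0.001  # "1/10th of 1 percent"

value_exact = exact_fraction * SEAGATE_4TB_PRICE_USD
value_rounded = rounded_fraction * SEAGATE_4TB_PRICE_USD

print(f"labour at 2% failures: {hours_at_2pct:.0f} hours")
print(f"labour at 1% failures: {hours_at_1pct:.0f} hours")
print(f"saving as share of fleet cost: {exact_fraction:.3%}")
print(f"value per $160 drive (rounded share): ${value_rounded:.2f}")
print(f"value per $160 drive (exact share):   ${value_exact:.2f}")
print(f"retail premium: ${RETAIL_PREMIUM_USD[0]} to ${RETAIL_PREMIUM_USD[1]}")
```

Whether you use the rounded or the exact share, the extra reliability is worth cents per drive on this reasoning, against a retail premium measured in tens of dollars.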
So what can we learn from all of this? I think the most important point is that you need to carefully evaluate your information sources. While it’s easy to say “do your own testing”, it’s just not practical in most cases. You’re going to have to rely on third-party sources of information to some extent.
When looking at user experiences, reviews, case studies and the like, you have to factor in how they’re using the product and how well that lines up with your unique needs and requirements.
And remember, as always, that past performance doesn’t necessarily dictate future results and that your mileage will vary. Caveat emptor, y’all... ®