Boffin finds formula for four-year-five-nines disk arrays
Build arrays with spare disks - LOTS of 'em - it's cheaper than fixing drives
Forty-five data disks, ten parity disks, and 33 spares: that's the optimum array size to protect data for four years with no service visits, according to a study published on arXiv.
The problem the study addresses is that the world's rush towards hyperscale data centres puts an awful lot of disks in one place, and the cost of calling someone to replace a dead drive far outweighs the price of the disk.
Led by the University of Houston's Jehan-François Pâris, the group proposes a new disk scheme it reckons can deliver five-nines reliability across the four-year life of commercial disk drives.
They write that with the right model, it's better to design drive arrays that can churn away for the lifetime of the disks without bothering to replace dead drives – an objective they say “cannot be reached with RAID level 6 organisations and would require RAID stripes that could tolerate triple disk failures”.
Standard RAID architectures use parity disks to protect the array against drive failures. However, as the researchers note in their example of four parity disks protecting six data disks, some triple failures can still kill the array: the loss of one data disk and both of its parity disks, say, or the loss of three data disks.
However, even smaller failures are troublesome for a data centre operator. Until a tech can be summoned to slot in a new drive, a RAID array running a drive short is less protected than it should be – an expensive exercise for a lights-out operation in a relatively remote location.
The paper's result is two-fold. The researchers propose that spare disks be built into drive arrays from day one; and on that basis, they modelled the arrangement to work out the optimum array configuration.
To build that model, the study used failure rate data from Backblaze's disks – 5.1 per cent in the first 18 months, 1.4 per cent in the second 18 months, and 11.8 per cent per year thereafter – and assumed an array takes 24 hours to repair itself, given 4 TB drives and a 200 MB/s data transfer rate.
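As a back-of-the-envelope illustration (not the paper's actual model), those piecewise rates imply that a single drive has roughly a four-in-five chance of surviving its whole four-year service life, if we treat each rate as a constant hazard within its period:

```python
import math

# Piecewise annualised failure rates reported by Backblaze, as cited above:
# 5.1% for the first 18 months, 1.4% for the next 18 months, 11.8% thereafter.
periods = [(1.5, 0.051), (1.5, 0.014), (1.0, 0.118)]  # (duration in years, annual rate)

# Assuming a constant hazard within each period (an illustrative simplification),
# the probability a single drive survives the four-year service life is:
survival = math.exp(-sum(years * rate for years, rate in periods))
print(f"{survival:.1%}")  # roughly 80 per cent of drives last the full four years
```

With one in five drives expected to die over four years, a 55-drive array needs a healthy pool of spares – which is where the paper's 33 comes from.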
The model suggests the combination of reliability and low space overhead is at its optimum when an array has 45 data disks, 10 parity disks, and 33 spare disks: the model predicts with high confidence that the array will deliver five-nines reliability, and the space overhead is less than 49 per cent – compared to a RAID array with no spares.
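The overhead figure checks out with simple arithmetic – of the 88 drives in the proposed configuration, the 43 parity and spare disks hold no user data:

```python
# Space overhead of the proposed 45/10/33 configuration.
data, parity, spare = 45, 10, 33
total = data + parity + spare          # 88 drives in the array
overhead = (parity + spare) / total    # fraction of capacity not holding user data
print(f"{overhead:.1%}")               # just under 49 per cent
```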
With fewer data disks, the proportion of spares needed to reach five-nines rises, so that the space overhead can rise to more than 56 per cent – and surprisingly, the model suggests that even an “infinite” number of spare disks in the array doesn't much improve the four-year reliability prediction.
The researchers say their arrangement would only fall short of its claimed five-nines reliability if an array had a bunch of drives that were part of a bad batch.
Importantly, the researchers note, the models suggest their space overhead is comparable to mirroring the whole storage environment. ®