Magic hash maths: Dedupe does not have to mean high compute. Wait, what?
X-IO maths man claims it can minimise mill hash work with buckets of blooms
Analysis X-IO's new deduplicating ISE 900 all-flash array has puzzlingly puny processors yet kicks out good performance when deduping.
We wondered about that, and were pointed to a video of X-IO chief scientist Richard Lary presenting at a Storage Field Day in Denver earlier this year.
The maths is complex but the points made are logical, and they show how X-IO's way of doing dedupe means you don't need processors as powerful as those commonly found in other deduping arrays.
Improving Deduplication via Mathematics with Richard Lary via Vimeo.
In the 37.5-minute video, Lary says dedupe is computationally intensive but there are ways to make it less so without losing much deduplication efficiency, leaving more processor cycles for running data access or other code.
In deduplicating, you calculate a computationally intensive, effectively unique mathematical hash or signature of an incoming lump of data and compare it to a table of existing hashes. If there is a match, it is a duplicate lump and can be replaced with a small reference to that hash, thus saving disk or flash space.
If there isn't a match then it's unique data and needs to be stored with its hash added to the table. A system could have 10^9 to 10^11 entries in its dedupe hash table.
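The basic hash-and-compare loop described above can be sketched in a few lines of Python. This is a toy illustration only, not X-IO's code: the `DedupeStore` class and its method names are made up for the example, SHA-256 is assumed as the signature, and an in-memory dict stands in for the hash table.

```python
import hashlib


class DedupeStore:
    """Toy dedupe store (hypothetical): keeps one copy of each unique chunk."""

    def __init__(self):
        self.table = {}  # signature -> stored chunk
        self.refs = []   # logical layout: one signature per written chunk

    def write(self, chunk: bytes) -> bool:
        """Store a chunk; return True if it was a duplicate."""
        signature = hashlib.sha256(chunk).digest()  # the expensive step
        is_duplicate = signature in self.table
        if not is_duplicate:
            # Unique data: store it and add its hash to the table.
            self.table[signature] = chunk
        # Either way, the logical address only holds a small reference.
        self.refs.append(signature)
        return is_duplicate

    def read(self, index: int) -> bytes:
        """Follow the reference back to the single stored copy."""
        return self.table[self.refs[index]]


store = DedupeStore()
results = [store.write(c) for c in (b"alpha", b"beta", b"alpha")]
```

Here the third write finds its signature already in the table, so only two chunks are physically stored while three are logically addressable. Every write pays for a full cryptographic hash and a table probe, which is the cost the techniques in the talk aim to cut.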
Lary says deduplication is a drive space optimisation that incurs a performance penalty. An implementation can trade off space efficiency for a lower performance penalty.
How? He goes on to talk about proxy signatures, non-crypto-signatures, bucketed hash tables, well-sized Bloom Filters, an array of small blooms in a bouquet rather than a bucket, and then a bucket of blooms.
Basically his method calculates less computationally intensive hashes and performs less processing work in deciding whether they are unique or not. Deduplication space optimisation is compromised a little, but performance goes up significantly.
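To give a feel for how a Bloom filter can cut lookup work, here is a minimal Python sketch. It is not Lary's scheme, whose details the article does not spell out: `bloom_size`, `BloomFilter` and `FilteredIndex` are hypothetical names, CRC-32 and Adler-32 are assumed as the cheap non-crypto checksums, and a plain set of chunks stands in for the full signature table. The sizing uses the standard textbook Bloom filter formulas.

```python
import math
import zlib


def bloom_size(expected_items: int, false_positive_rate: float):
    """Textbook sizing: bit count m and hash count k for n expected items."""
    m = math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2)
    k = max(1, round(m / expected_items * math.log(2)))
    return m, k


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing built from two cheap, non-cryptographic checksums.
        h1 = zlib.crc32(item)
        h2 = zlib.adler32(item) | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely never added"; True means "possibly added".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class FilteredIndex:
    """Hypothetical dedupe index with a Bloom filter in front of the table."""

    def __init__(self, expected_chunks: int, false_positive_rate: float = 0.01):
        m, k = bloom_size(expected_chunks, false_positive_rate)
        self.bloom = BloomFilter(m, k)
        self.table = set()      # stand-in for the full signature table
        self.table_lookups = 0  # counts the expensive probes actually made

    def is_duplicate(self, chunk: bytes) -> bool:
        if self.bloom.might_contain(chunk):
            self.table_lookups += 1  # only now pay for the real lookup
            if chunk in self.table:
                return True
        # Definitely new (or a Bloom false positive): record it.
        self.bloom.add(chunk)
        self.table.add(chunk)
        return False


index = FilteredIndex(expected_chunks=1000)
first = index.is_duplicate(b"alpha")
second = index.is_duplicate(b"alpha")
```

In this sketch the first write of a chunk is rejected by the filter without touching the table at all, and only possible duplicates trigger a real lookup. A false positive costs one wasted probe, never a wrong answer, which is one way of trading a little efficiency for less processing.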
He also says deduplication systems waste a lot of resources cataloguing unique user data that will be overwritten in the near future. X-IO says it has a technique to short-cut this wasted effort but doesn't want to talk about it yet, meaning X-IO's dedupe processing burden will get smaller still.
Buckets of blooms - fragrant or flagrant? You can weigh in below. ®