The cost of not deduping
Think pouring budget down drain
In our homes and offices, duplicated information is such a fact of life that we don’t even think about it. In the digital world, though, we can think about it, and we should, because doing so can prevent a lot of wasteful spending.
Imagine a department of twenty people. Each has their own filed copies of their employment contract and a pension scheme guide; you can picture them filed in twenty desk drawers. More than ninety per cent of the information in these documents is duplicated, but as individuals we don’t think much about it because we each have our own copy.
Now let's computerise this and give the twenty people electronic copies of their HR contract and pension scheme guide, stored on a departmental server’s disk drive and consuming 10MB of capacity per person; that’s 200MB. With deduplication technology we can detect that there are multiple copies of 90 per cent of the information and eliminate them, replacing them with pointers to a single, master copy of the data. Each person keeps 1MB of unique data (20MB in total) plus the one 9MB master copy, so our 200MB just decreased to 29MB, a roughly 7:1 deduplication ratio.
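The arithmetic behind that ratio can be sketched in a few lines. This uses the figures from the example above; the 90 per cent shared fraction is the article's illustrative assumption, not a measured figure:

```python
# Dedup arithmetic for the HR example: 20 staff, 10MB of documents
# each, of which 9MB is identical across every copy.
copies = 20
doc_mb = 10
shared_mb = 9                  # the duplicated 90 per cent
unique_mb = doc_mb - shared_mb # 1MB genuinely unique per person

raw_mb = copies * doc_mb                     # 200MB stored naively
deduped_mb = copies * unique_mb + shared_mb  # unique data + one master copy
ratio = raw_mb / deduped_mb

print(raw_mb, deduped_mb, round(ratio, 1))   # 200 29 6.9
```

The same sum scales directly to the backup figures later in the piece: the higher the shared fraction, the closer the deduped size gets to a single master copy.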
In the digital world duplication is a money-wasting sin
Deduplication technology can be applied with spectacular results to backup data, where, with repetitive daily and weekly backups, there can be a huge amount of redundant data. There can be terabytes of backup data stored on disk in a reasonably large organisation and effective deduplication ratios for full backups can approach 20:1. That means a 20TB backup store can be reduced to 1TB, and a 200TB one to 10TB.
Assume we are using 1TB disk drives: we don’t need 200 of them for the 200TB store; we need only ten. This is the main deduplication saving: you don’t need to buy as much disk capacity. If the backup storage array has RAID protection, you also need less capacity for the RAID parity and mirrored data.
If you send a copy of your backup data off-site for disaster recovery and business continuity reasons, then you only need a network link capable of transmitting 10TB in a reasonable time rather than a much more expensive 200TB one. And the destination data centre’s storage capacity need for this data is 10TB and not 200TB: another saving.
There are further savings as a result. Your data centre power costs go down because, instead of spinning 200 disk drives, you need only ten. That means less heat is generated, so your data centre cooling bill is lowered as well.
With fewer disk drives spinning, the chances of one of them failing are lessened, and your data are therefore more available.
The savings from deduplication are multiplied by these effects.
The technology is beginning to be applied to nearline and primary data as well as to backup data, as processing power increases and the technology is improved. Multi-core storage controllers can deduplicate data pretty quickly and newer technologies such as Permabit’s Albireo take deduplication out of the data access path.
As and when data is stored in the cloud (remote data centres at the end of a wide area network link), deduplication means you only pay for the storage of unique data there, and not for multiple copies of a PowerPoint presentation or an image that has been identically attached to twenty emails. All those repeated clauses in the twenty HR employment contracts and pension scheme guides we mentioned earlier can be stripped out, leaving just the unique data.
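One common way to spot those identical attachments is content hashing: store each payload under a fingerprint of its bytes, so twenty copies collapse to one stored copy plus twenty pointers. A minimal sketch, using SHA-256 as the fingerprint; the store and function names here are made up for illustration:

```python
import hashlib

store = {}   # content hash -> bytes: the single stored copy
refs = []    # one lightweight pointer per "email attachment"

def save_attachment(data: bytes) -> str:
    """Store the payload only if its content hash is unseen."""
    key = hashlib.sha256(data).hexdigest()
    store.setdefault(key, data)   # new content: keep it; duplicate: skip
    refs.append(key)              # every email keeps just the pointer
    return key

deck = b"quarterly-results.pptx contents " * 1000
for _ in range(20):               # the same deck attached to 20 emails
    save_attachment(deck)

print(len(refs), len(store))      # 20 pointers, 1 stored copy
```

Real cloud storage services layer chunking, reference counting and garbage collection on top of this idea, but the saving comes from the same place: identical bytes hash to the same key.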
In our homes and offices duplication of paper-stored information is a fact of life and we don’t even think about it. In the digital world such duplication is a money-wasting sin and we should and must think about it, so we can spend our money where it can do good instead of being poured down a drain. ®
De-dupe is done at block level
De-duplication is a great tool for eliminating the storage cost of exact copies of files. However, as soon as the content of typical office files is changed, they can no longer be de-duplicated at the block level. For example, a Word document that has had one character changed somewhere in it will add an "edited by" tag at the top of the document, thereby "pushing" all the content down a few bytes, so none of the blocks line up for de-dupe.
God help anyone who believes they can achieve anything over 15% storage reduction in a typical environment. The exception might be a large Exchange 2010 installation.
When Winston was working at his terminal rewriting history by 'correcting' specific newspaper articles, I thought the technology to do this was too fanciful. With perfect de-duping in place across the nation, now I see it.
It's about "who" and "whom"
or 1984 or something.
I'd want a contract or something from HR as my own copy, preferably on hard copy, so there's no need to wonder whether things have changed between me getting it and me needing to read it.
I'm not ecstatic about "dematerialisation" in the Land Registry either, but they didn't ask me...