Go forth and deduplicate
Will it benefit my data centre?
Deduplication use cases
Many types of data can benefit from this capacity-reduction potential. Backups are the obvious case: each stream of backup data is very similar to the last, with only a small percentage of data changing between backups. Backups can show deduplication ratios of 20 to one, and often much higher. Virtual-machine images, where each image is largely similar to every other, also deduplicate well, with savings of 90 per cent or more in practice.
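Ratios and percentage savings are two views of the same quantity. Here is a minimal sketch of the conversion, assuming the ratio is defined as logical size divided by stored size; the function names are illustrative and do not come from any product:

```python
def savings_from_ratio(ratio: float) -> float:
    """Fraction of capacity saved for a deduplication ratio of `ratio`:1."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 1.0 - 1.0 / ratio


def ratio_from_savings(savings: float) -> float:
    """Deduplication ratio (x:1) for a fractional saving in [0, 1)."""
    if not 0 <= savings < 1:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)


if __name__ == "__main__":
    # A 20:1 backup ratio and a 90 per cent VM-image saving, as quoted above.
    print(f"20:1 ratio -> {savings_from_ratio(20):.0%} saved")
    print(f"90% saving -> {ratio_from_savings(0.90):.0f}:1 ratio")
```

On this definition, a 20-to-one ratio corresponds to a 95 per cent saving, and a 90 per cent saving to a ratio of 10 to one.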
Deduplication can be used for backup, primary storage, WAN optimisation, archiving, and disaster recovery. In fact, any point where data is stored or transmitted is a candidate.
Points to consider
Deduplication looks like a winner – but, as with all technologies, getting the best from it requires understanding where it works well and where it doesn't, as well as the flavours offered by the various vendors.
Not all data types deduplicate equally well. Some, such as video streams or geophysical data, are problematic: many of them contain little or no repetitive data, and may already be compressed. On the other hand, regardless of data type, backups – which contain large amounts of data that doesn't change on a regular basis – deduplicate well.
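The contrast can be illustrated with a toy measurement. The sketch below assumes fixed-size 4KB blocks fingerprinted with SHA-256; real products may use variable-length chunking and other fingerprints, and the helper names here are hypothetical. Random bytes stand in for already-compressed data, and a run of nearly identical copies stands in for successive backups:

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size; real systems vary


def dedup_ratio(data: bytes, block_size: int = BLOCK_SIZE) -> float:
    """Logical blocks divided by unique blocks, using SHA-256 fingerprints."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not blocks:
        return 1.0
    unique = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique)


def make_backups(base: bytes, generations: int, changed_blocks: int,
                 block_size: int = BLOCK_SIZE) -> bytes:
    """Concatenate successive full backups; each one overwrites a few blocks."""
    current = bytearray(base)
    stream = bytearray()
    for gen in range(generations):
        for n in range(changed_blocks):
            offset = ((gen * changed_blocks + n) * block_size) % len(current)
            current[offset:offset + block_size] = os.urandom(block_size)
        stream += current
    return bytes(stream)


if __name__ == "__main__":
    incompressible = os.urandom(100 * BLOCK_SIZE)
    backups = make_backups(os.urandom(100 * BLOCK_SIZE),
                           generations=10, changed_blocks=2)
    print(f"random data: {dedup_ratio(incompressible):.2f}:1")
    print(f"ten backups: {dedup_ratio(backups):.2f}:1")
```

In this toy setup the random data does not deduplicate at all, while the backup stream shrinks by a large factor that keeps growing as more generations are retained.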
But generally most data types and sources of data have some potential for deduplication – home directories and VM images, for example. Deduplicated data may be slower to access, because reconstituting it (sometimes referred to as "rehydration") may require more processing resources on the storage system, typically in the form of more CPU cycles, than reading a file that has not been deduplicated.
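To make the rehydration cost concrete, here is a minimal sketch of a deduplicating store, again assuming fixed-size chunks and SHA-256 fingerprints. The `ChunkStore` class is hypothetical and held entirely in memory; it is not any vendor's design. Each file is kept as an ordered list of fingerprints, so a read has to look up every chunk and reassemble them:

```python
import hashlib


class ChunkStore:
    """Toy in-memory deduplicating store (hypothetical, illustration only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # fingerprint -> chunk bytes, stored once
        self.recipes = {}  # file name -> ordered list of fingerprints

    def write(self, name, data):
        """Split data into chunks and keep only chunks not seen before."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).digest()
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        self.recipes[name] = recipe

    def read(self, name):
        """Rehydrate: look up each fingerprint and reassemble the file."""
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

    def stored_bytes(self):
        """Physical bytes held, counting each unique chunk once."""
        return sum(len(chunk) for chunk in self.chunks.values())


if __name__ == "__main__":
    store = ChunkStore()
    image = b"A" * 8192 + b"B" * 4096
    store.write("vm1", image)
    store.write("vm2", image)
    assert store.read("vm2") == image
    print(f"logical: {2 * len(image)} bytes, stored: {store.stored_bytes()} bytes")
```

The per-chunk lookups in `read` are the extra work that rehydration adds; in a real system they cost CPU cycles and possibly additional I/O.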
On the other hand, deduplicated data may be faster to access, since less data has to be moved from slow disks. Caching at the storage controller, on flash storage devices, or in the network itself can considerably reduce the overall I/O load on the disk subsystem. But your mileage may vary, and evaluating the benefits requires an understanding of the service you are delivering and the data you are managing.
Most data types will benefit from deduplication, as the overheads are small and outweighed by the significant savings, but high-performance applications that require very fast access to their data are not generally good candidates for deduplication.
The bottom line
Data deduplication helps manage data growth and reduces network bandwidth requirements, and therefore improves capacity and performance efficiencies. Significant cost reductions can be made, from lower administration costs (there's less to manage) to space, power, and cooling outgoings – deduplication helps data centres become greener by reducing the carbon footprint per stored byte.
When evaluating deduplication, the answer to the question "Will it benefit my data centre?" is generally: "It will." The success of deduplication technologies to date should encourage every storage administrator to "go forth and deduplicate". ®
This article was written by Alex McDonald, SNIA Europe UK country committee member, NetApp, based on existing SNIA material. To explore deduplication further, check out this SNIA tutorial: Advanced deduplication concepts (pdf).
To view all of the SNIA tutorials on Data Protection and Management, visit the SNIA Europe website at www.snia-europe.org/en/technology-topics/snia-tutorials/data-protection-and-management.cfm.