Permabit's killer dedupe technology

Component dedupe for primary data storage OEMs

Fast primary data deduplication, a Holy Grail of storage vendors, is set to be a practical reality with Permabit's Albireo product, a software library that can be integrated as a component by storage and application OEMs.

Primary data deduplication has been seen as a CPU-cycle-sucking monster that slows down application performance, applies to data with less redundancy than backup sets, and doesn't scale well due to the amount of potential data to be indexed in the search for duplicates. Yet the spread of virtualised multi-core servers sharing drive arrays means primary data sets are getting larger. As they grow, the opportunity to improve array storage efficiency by removing block-level duplicates becomes more attractive.

NetApp and Nimbus Data are the only storage array suppliers offering primary data deduplication, with NetApp's A-SIS in widespread use and the Nimbus Data flash-only product untried. Various ZFS-based products, such as ones from GreenBytes and a coming one from Compellent using Nexenta software, offer primary data deduplication but have yet to be tested with mainstream and critical business applications such as serving virtual desktop images (VDIs) to client desktops.

Permabit's marketing VP, Mike Ivanov, said: "One of the main drivers is NetApp implementing A-SIS. Everyone else needs to catch up ... What NetApp does now with primary dedupe is the only game in town and so widely successful ... But it's not that scalable ... only 16TB of indexing per volume. We need to scale to many hundreds of terabytes.

"ZFS is open source and puts all of its hash tables into standard filesystem metadata. It falls out of memory eventually, causing disk arrays to do multiple seeks. That's very inefficient. Putting metadata onto solid state drive (SSD) ... increases cost and still affects performance."
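To see why an index can "fall out of memory" as capacity grows, here is a back-of-the-envelope sketch. The 4KB block size and the 40-byte entry (a 32-byte SHA-256 fingerprint plus an 8-byte block address) are illustrative assumptions, not figures from Permabit, NetApp or ZFS.

```python
TB = 10**12  # decimal terabyte

def naive_index_bytes(capacity_bytes, block_size=4096, entry_bytes=32 + 8):
    """Size of a naive one-entry-per-block fingerprint index.

    block_size and entry_bytes are assumed values for illustration only.
    """
    blocks = capacity_bytes // block_size
    return blocks * entry_bytes

# 16TB of indexed data -> about 156GB of index
small = naive_index_bytes(16 * TB)
# 100TB of indexed data -> just under 1TB of index
large = naive_index_bytes(100 * TB)
```

Under these assumptions a naive index for 100TB of data is close to 1TB in size, far more than a typical controller's RAM, which is the gap a low-memory-footprint index is meant to close.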

Deduplication without rehydration

Permabit, an archiving software supplier, has devised patented indexing technology with a low memory footprint that can function in single servers and multi-node grid systems, enabling it to scale to hundreds of terabytes of data. Conceptually it executes in parallel with the data path between a host server and the storage array, and is not an inline, bump-in-the-wire appliance. The software receives a copy of data bound for the array or already in the array, generates unique content fingerprints with a SHA-256 hash, and checks its index to see if sub-file-level block groups have been stored already, meaning they have an index entry. If they have, then Albireo tells the array controller software so it can use that in its standard content metadata.
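The fingerprint-and-lookup step can be sketched in a few lines. This is a minimal illustration using an in-memory dictionary; the class and method names are hypothetical and it is not Albireo's API or its patented index structure.

```python
import hashlib

class FingerprintIndex:
    """Toy duplicate detector: maps SHA-256 fingerprints to block locations."""

    def __init__(self):
        self._seen = {}  # fingerprint bytes -> location of first copy

    def check(self, chunk: bytes, location: int):
        """Return the location of an earlier identical chunk, or None if new.

        New chunks are added to the index as a side effect.
        """
        fingerprint = hashlib.sha256(chunk).digest()
        existing = self._seen.get(fingerprint)
        if existing is None:
            self._seen[fingerprint] = location
            return None
        return existing
```

A caller would pass each chunk with its address; a non-None result is the "known block at a known location" advice that gets handed to the array controller.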

The array controller software can then replace the duplicated blocks with a block reference. There is no need to rehydrate deduplicated data when it is next accessed because the array controller assembles blocks in response to a read request as it normally does - a block is a block is a block. If Albireo were an appliance then it would have to rehydrate data, as it would be the only place in the stack that knows how to find deduplicated blocks and rebuild their data.

Typically a storage array will respond to a read request by fetching blocks from disk to assemble the requested data. With Permabit's approach this is unchanged and when deduplicated blocks have to be fetched they are picked up like any other block. Ivanov said: "Block replacement is in the vendor's metadata … New stuff is added to index. Existing stuff is detected and Albireo sends a signal to the vendor's stack saying it is a known block at a known location. The vendor merges the blocks or extents and frees the allocated blocks for the deduplicated data. Different vendors do it differently and it works with their thin provisioning, whatever."
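The merge-and-free behaviour Ivanov describes can be pictured with a toy block map. This is a sketch under stated assumptions: the `ToyArray` class and its methods are invented for illustration, and real controllers track references and free space very differently.

```python
class ToyArray:
    """Toy array controller: logical addresses point at physical blocks."""

    def __init__(self):
        self.physical = {}   # physical id -> block contents
        self.block_map = {}  # logical address -> physical id
        self._next_id = 0

    def _release(self, pid):
        # Free a physical block once no logical address refers to it.
        if pid not in self.block_map.values():
            self.physical.pop(pid, None)

    def write(self, lba, data):
        old = self.block_map.get(lba)
        pid = self._next_id
        self._next_id += 1
        self.physical[pid] = data
        self.block_map[lba] = pid
        if old is not None:
            self._release(old)

    def merge(self, duplicate_lba, original_lba):
        """Act on dedupe advice: point the duplicate at the original's block."""
        old = self.block_map[duplicate_lba]
        self.block_map[duplicate_lba] = self.block_map[original_lba]
        self._release(old)

    def read(self, lba):
        # The read path is the same whether or not the block was deduplicated.
        return self.physical[self.block_map[lba]]
```

After a merge, `read` follows the same map lookup as before, which is why no separate rehydration step is needed in this model.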

The technology works at a level below file or object storage and need not interfere at all with snapshots, replication or thin provisioning. It does not capture its own copy of the data and so does not become a single point of failure or compromise data integrity.

Ivanov said: "The block interface has fixed chunks. File interface is stream of data to which we apply our segmentation engine and identify optimal boundaries for dedupe. Albireo is content-aware and can pull out images from office files and knows about tar files, things like that."
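The difference between fixed chunks and content-chosen boundaries can be illustrated with a generic rolling-hash chunker. This is a textbook-style gear hash, not Permabit's segmentation engine, and the size limits, table seed and function names are arbitrary assumptions.

```python
import random

# Fixed pseudo-random table so that boundaries depend only on content.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

def fixed_chunks(data, size=4096):
    """Block-interface style: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, min_size=256, max_size=8192, mask_bits=10):
    """Stream style: cut where a rolling hash of recent bytes hits a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left drops bytes older than 32 positions from the hash.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = (h >> (32 - mask_bits)) == 0
        if (length >= min_size and at_boundary) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

With fixed chunks, inserting a single byte near the start of a stream shifts every later chunk, so none of them match what was stored before. With content-defined boundaries the cut points are tied to the data itself, so later chunks typically line up again and can still be detected as duplicates.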

The technology can be used for inline dedupe, post-process dedupe or parallel deduplication, a combination of the two.
