Feeds

Permabit's killer dedupe technology

Component dedupe for primary data storage OEMs

Beginner's guide to SSL certificates

Fast primary data deduplication, a Holy Grail of storage vendors, is set to be a practical reality with Permabit's Albireo product, a software library that can be integrated as a component by storage and application OEMs.

Primary data deduplication has been seen as a CPU-cycle sucking monster that slows down application performance, applies to data with less redundancy than backup sets, and doesn't scale well due to the amount of potential data to be indexed in the search for duplicates. Yet the spread of virtualised multi-core servers sharing drive arrays mean primary data sets are getting larger. The opportunities to improve array storage efficiency by removing block-level duplicates are becoming more attractive as this happens.

NetApp and Nimbus data are the only storage array suppliers offering primary data deduplication, with NetApp's A-SIS in widespread use and the Nimbus Data flash-only product untried. Various ZFS-based products, such as ones from GreenBytes and a coming one from Compellent using Nexenta software, offer primary data deduplication but have yet to be tested with mainstream and critical business applications such as serving virtual desktop images (VDIs) to client desktops.

Permabit's marketing VP, Mike Ivanov, said: "One of the main drivers is NetApp implementing A-SIS. Everyone else needs to catch up ... What NetApp does now with primary dedupe is the only game in town and so widely successful ... But it's not that scalable ... only 16TB of indexing per volume. We need to scale to many hundreds of terabytes.

"ZFS is open source and puts all of hash tables put into standard filesystem metadata. It falls out of memory eventually, causing disk arrays to do multiple seeks. That's very inefficient. Putting metadata onto solid state drive (SSD) ... increases cost and still affects performance."

Deduplication without rehydration

Permabit, an archiving software supplier, has devised patented indexing technology with a low memory footprint that can function in single servers and multi-node grid systems, enabling it to scale to hundreds of terabytes of data. Conceptually it executes parallel to the data path between a host server and the storage array, and is not an inline, bump-in-the-wire appliance. The software receives a copy of data for the array or in the array, generates unique content fingerprints with a SHA-256 hash, and checks its index to see if sub-file-level block groups have been stored already, meaning they have an index entry. If they have than Albireo tells the array controller software so it can use that in its standard content metadata.

This software can then replace the duplicated blocks with a block reference. There is no need to rehydrate deduplicated data when it is next accessed because the array controller assembles blocks in response to a read request as it normally does - a block is a block is a block. If Albireo was an appliance then it would have to rehydrate data as it is the only place in the stack that knows how to find deduplicated blocks and rebuild their data.

Typically a storage array will respond to a read request by fetching blocks from disk to assemble the requested data. With Permabit's approach this is unchanged and when deduplicated blocks have to be fetched they are picked up like any other block. Ivanov said: "Block replacement is in the vendor's metadata … New stuff is added to index. Existing stuff is detected and Albireo sends a signal to the vendor's stack saying it is a known block at a known location. The vendor merges the blocks or extents and frees the allocated blocks for the deduplicated data. Different vendors do it differently and it works with their thin provisioning, whatever."

The technology works at a level below file or object storage and need not interfere at all with snapshots, replication or thin provisioning. It does not capture its own data and become a single point of failure, it not compromising data integrity.

Ivanov said: "The block interface has fixed chunks. File interface is stream of data to which we apply our segmentation engine and identify optimal boundaries for dedupe. Albireo is content-aware and can pull out images from office files and knows about tar files, things like that."

The technology can be used for inline dedupe, post-process dedupe or parallel deduplication, a combination of the two.

Top 5 reasons to deploy VMware with Tegile

Next page: Scaling up and down

More from The Register

next story
Just don't blame Bono! Apple iTunes music sales PLUMMET
Cupertino revenue hit by cheapo downloads, says report
The DRUGSTORES DON'T WORK, CVS makes IT WORSE ... for Apple Pay
Goog Wallet apparently also spurned in NFC lockdown
Cray-cray Met Office spaffs £97m on VERY AVERAGE HPC box
Only 250th most powerful in the world? Bring back Michael Fish
Microsoft brings the CLOUD that GOES ON FOREVER
Sky's the limit with unrestricted space in the cloud
'ANYTHING BUT STABLE' Netflix suffers BIG Europe-wide outage
Friday night LIVE? Nope. The only thing streaming are tears down my face
IBM, backing away from hardware? NEVER!
Don't be so sure, so-surers
Google roolz! Nest buys Revolv, KILLS new sales of home hub
Take my temperature, I'm feeling a little bit dizzy
prev story

Whitepapers

Cloud and hybrid-cloud data protection for VMware
Learn how quick and easy it is to configure backups and perform restores for VMware environments.
Forging a new future with identity relationship management
Learn about ForgeRock's next generation IRM platform and how it is designed to empower CEOS's and enterprises to engage with consumers.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?
New hybrid storage solutions
Tackling data challenges through emerging hybrid storage solutions that enable optimum database performance whilst managing costs and increasingly large data stores.
Security and trust: The backbone of doing business over the internet
Explores the current state of website security and the contributions Symantec is making to help organizations protect critical data and build trust with customers.