Permabit's killer dedupe technology

Component dedupe for primary data storage OEMs

Fast primary data deduplication, a Holy Grail of storage vendors, is set to be a practical reality with Permabit's Albireo product, a software library that can be integrated as a component by storage and application OEMs.

Primary data deduplication has been seen as a CPU-cycle-sucking monster that slows down application performance, applies to data with less redundancy than backup sets, and doesn't scale well due to the amount of potential data to be indexed in the search for duplicates. Yet the spread of virtualised multi-core servers sharing drive arrays means primary data sets are getting larger, and as this happens the opportunities to improve array storage efficiency by removing block-level duplicates become more attractive.

NetApp and Nimbus Data are the only storage array suppliers offering primary data deduplication, with NetApp's A-SIS in widespread use and the Nimbus Data flash-only product untried. Various ZFS-based products, such as those from GreenBytes and a forthcoming one from Compellent using Nexenta software, offer primary data deduplication but have yet to be proven with mainstream and critical business applications such as serving virtual desktop images (VDIs) to client desktops.

Permabit's marketing VP, Mike Ivanov, said: "One of the main drivers is NetApp implementing A-SIS. Everyone else needs to catch up ... What NetApp does now with primary dedupe is the only game in town and so widely successful ... But it's not that scalable ... only 16TB of indexing per volume. We need to scale to many hundreds of terabytes.

"ZFS is open source and puts all of hash tables put into standard filesystem metadata. It falls out of memory eventually, causing disk arrays to do multiple seeks. That's very inefficient. Putting metadata onto solid state drive (SSD) ... increases cost and still affects performance."

Deduplication without rehydration

Permabit, an archiving software supplier, has devised patented indexing technology with a low memory footprint that can function in single servers and multi-node grid systems, enabling it to scale to hundreds of terabytes of data. Conceptually it executes in parallel with the data path between a host server and the storage array, and is not an inline, bump-in-the-wire appliance. The software receives a copy of data destined for the array or already in the array, generates unique content fingerprints with a SHA-256 hash, and checks its index to see if sub-file-level block groups have been stored already, meaning they have an index entry. If they have, then Albireo tells the array controller software so it can use that information in its standard content metadata.
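To make that fingerprint-and-lookup step concrete, here is a minimal Python sketch. The class and method names are illustrative only, not Permabit's API; the point is simply that a hash index either recognises a block group or records it as new.

```python
import hashlib

# Illustrative sketch of a fingerprint index; not Permabit's actual implementation.
class DedupeIndex:
    """Maps SHA-256 fingerprints to the location of the first-seen block group."""

    def __init__(self):
        self._entries = {}  # fingerprint (bytes) -> location token from the array stack

    def lookup_or_add(self, data: bytes, location):
        """Return the existing location if this block group is already indexed,
        otherwise record a new entry and return None (meaning: store it as usual)."""
        fingerprint = hashlib.sha256(data).digest()
        existing = self._entries.get(fingerprint)
        if existing is not None:
            return existing          # duplicate: caller can reference the known block
        self._entries[fingerprint] = location
        return None                  # new content: caller stores it normally


# Usage: the second write of identical content is reported as a known block.
index = DedupeIndex()
first = index.lookup_or_add(b"some 4KB block of data", location="LBA 1024")
dup = index.lookup_or_add(b"some 4KB block of data", location="LBA 2048")
assert first is None and dup == "LBA 1024"
```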

The array controller software can then replace the duplicated blocks with a block reference. There is no need to rehydrate deduplicated data when it is next accessed because the array controller assembles blocks in response to a read request as it normally does - a block is a block is a block. If Albireo were an appliance it would have to rehydrate data, as it would be the only place in the stack that knows how to find deduplicated blocks and rebuild their data.

Typically a storage array will respond to a read request by fetching blocks from disk to assemble the requested data. With Permabit's approach this is unchanged and when deduplicated blocks have to be fetched they are picked up like any other block. Ivanov said: "Block replacement is in the vendor's metadata … New stuff is added to index. Existing stuff is detected and Albireo sends a signal to the vendor's stack saying it is a known block at a known location. The vendor merges the blocks or extents and frees the allocated blocks for the deduplicated data. Different vendors do it differently and it works with their thin provisioning, whatever."
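The bookkeeping Ivanov describes lives in the vendor's own metadata, and each vendor does it differently. The sketch below, with hypothetical structures, shows the general idea: the logical block is pointed at the already-stored copy, the duplicate's space is freed, and reads follow the map exactly as before.

```python
# Hypothetical array-side metadata; real vendors keep this in their own
# extent maps and thin-provisioning tables, in vendor-specific ways.
class ArrayMetadata:
    def __init__(self):
        self.block_map = {}    # logical block address -> physical location
        self.refcounts = {}    # physical location -> number of logical references

    def write_block(self, lba, physical_location):
        self.block_map[lba] = physical_location
        self.refcounts[physical_location] = self.refcounts.get(physical_location, 0) + 1

    def merge_duplicate(self, lba, known_location, freed_location):
        """Point the logical block at the already-stored copy and free the duplicate."""
        self.block_map[lba] = known_location
        self.refcounts[known_location] = self.refcounts.get(known_location, 0) + 1
        # the space that held the duplicate copy can be returned to the free pool
        self.refcounts.pop(freed_location, None)

    def read_block(self, lba):
        # reads are unchanged: follow the map and fetch the block; no rehydration step
        return self.block_map[lba]
```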

The technology works at a level below file or object storage and need not interfere at all with snapshots, replication or thin provisioning. It does not capture its own copy of the data, so it cannot become a single point of failure or compromise data integrity.

Ivanov said: "The block interface has fixed chunks. File interface is stream of data to which we apply our segmentation engine and identify optimal boundaries for dedupe. Albireo is content-aware and can pull out images from office files and knows about tar files, things like that."

The technology can be used for inline dedupe, post-process dedupe or parallel deduplication, a combination of the two.
