Permabit's killer dedupe technology

Component dedupe for primary data storage OEMs

Fast primary data deduplication, a Holy Grail of storage vendors, is set to be a practical reality with Permabit's Albireo product, a software library that can be integrated as a component by storage and application OEMs.

Primary data deduplication has been seen as a CPU-cycle sucking monster that slows down application performance, applies to data with less redundancy than backup sets, and doesn't scale well due to the amount of potential data to be indexed in the search for duplicates. Yet the spread of virtualised multi-core servers sharing drive arrays means primary data sets are getting larger. As they grow, the opportunity to improve array storage efficiency by removing block-level duplicates becomes more attractive.

NetApp and Nimbus Data are the only storage array suppliers offering primary data deduplication, with NetApp's A-SIS in widespread use and the Nimbus Data flash-only product untried. Various ZFS-based products, such as ones from GreenBytes and a coming one from Compellent using Nexenta software, offer primary data deduplication but have yet to be tested with mainstream and critical business applications such as serving virtual desktop images (VDIs) to client desktops.

Permabit's marketing VP, Mike Ivanov, said: "One of the main drivers is NetApp implementing A-SIS. Everyone else needs to catch up ... What NetApp does now with primary dedupe is the only game in town and so widely successful ... But it's not that scalable ... only 16TB of indexing per volume. We need to scale to many hundreds of terabytes.

"ZFS is open source and puts all of the hash tables into standard filesystem metadata. It falls out of memory eventually, causing disk arrays to do multiple seeks. That's very inefficient. Putting metadata onto solid state drive (SSD) ... increases cost and still affects performance."

Deduplication without rehydration

Permabit, an archiving software supplier, has devised patented indexing technology with a low memory footprint that can function in single servers and multi-node grid systems, enabling it to scale to hundreds of terabytes of data. Conceptually it runs in parallel with the data path between a host server and the storage array, and is not an inline, bump-in-the-wire appliance. The software receives a copy of data heading for the array or already in the array, generates unique content fingerprints with a SHA-256 hash, and checks its index to see if sub-file-level block groups have been stored already, meaning they have an index entry. If they have, then Albireo tells the array controller software so it can use that information in its standard content metadata.
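The fingerprint-and-lookup step described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Permabit's patented index: the `FingerprintIndex` class, its method name and the 4KB block size are assumptions made for the example, and a plain in-memory dictionary stands in for the low-memory index the article describes.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; the article does not state one


class FingerprintIndex:
    """Toy index mapping SHA-256 fingerprints to block addresses (hypothetical)."""

    def __init__(self):
        self._index = {}  # fingerprint bytes -> address of first block seen

    def lookup_or_add(self, data: bytes, address: int):
        """Return the address of an identical known block, or None if new.

        New content is recorded in the index; the data itself is never
        stored here, only its fingerprint and location.
        """
        fingerprint = hashlib.sha256(data).digest()
        known = self._index.get(fingerprint)
        if known is not None:
            return known  # advise the array: same content lives at `known`
        self._index[fingerprint] = address
        return None


index = FingerprintIndex()
first = index.lookup_or_add(b"a" * BLOCK_SIZE, address=0)   # new content
second = index.lookup_or_add(b"a" * BLOCK_SIZE, address=7)  # duplicate of block 0
third = index.lookup_or_add(b"b" * BLOCK_SIZE, address=8)   # new content
```

The index only returns advice. What to do with a reported duplicate is left to the caller, which mirrors the article's point that the array controller, not the dedupe library, owns the block metadata.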

The array controller software can then replace the duplicated blocks with a block reference. There is no need to rehydrate deduplicated data when it is next accessed because the array controller assembles blocks in response to a read request as it normally does - a block is a block is a block. If Albireo were an appliance then it would have to rehydrate data, as it would be the only place in the stack that knows how to find deduplicated blocks and rebuild their data.

Typically a storage array will respond to a read request by fetching blocks from disk to assemble the requested data. With Permabit's approach this is unchanged and when deduplicated blocks have to be fetched they are picked up like any other block. Ivanov said: "Block replacement is in the vendor's metadata … New stuff is added to index. Existing stuff is detected and Albireo sends a signal to the vendor's stack saying it is a known block at a known location. The vendor merges the blocks or extents and frees the allocated blocks for the deduplicated data. Different vendors do it differently and it works with their thin provisioning, whatever."
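As a rough sketch of why no rehydration step is needed, consider a toy array that keeps its own logical-to-physical block map. The `ToyArray` class and its `merge` method are hypothetical names invented for this example; real vendors' metadata and space reclamation will differ, as Ivanov notes.

```python
class ToyArray:
    """Toy array controller with its own block map (hypothetical)."""

    def __init__(self):
        self.physical = {}   # physical block id -> bytes
        self.block_map = {}  # logical block address -> physical block id
        self._next_id = 0

    def write(self, lba: int, data: bytes) -> None:
        """Store data in a fresh physical block and map the LBA to it."""
        pid = self._next_id
        self._next_id += 1
        self.physical[pid] = data
        self.block_map[lba] = pid

    def merge(self, duplicate_lba: int, known_lba: int) -> None:
        """Act on dedupe advice: share one physical block, free the other."""
        old = self.block_map[duplicate_lba]
        new = self.block_map[known_lba]
        if old == new:
            return  # already shared
        self.block_map[duplicate_lba] = new
        if old not in self.block_map.values():
            del self.physical[old]  # no LBA references it any more

    def read(self, lbas) -> bytes:
        """Normal read path: follow the block map, no dedupe awareness."""
        return b"".join(self.physical[self.block_map[lba]] for lba in lbas)


array = ToyArray()
array.write(0, b"AAAA")
array.write(1, b"BBBB")
array.write(2, b"AAAA")          # same content as LBA 0
before = array.read([0, 1, 2])
array.merge(duplicate_lba=2, known_lba=0)
after = array.read([0, 1, 2])    # identical bytes, one fewer physical block
```

The `read` method is the same before and after the merge, which is the sense in which deduplicated blocks are "picked up like any other block".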

The technology works at a level below file or object storage and need not interfere at all with snapshots, replication or thin provisioning. It does not capture its own copy of the data, so it does not become a single point of failure, nor does it compromise data integrity.

Ivanov said: "The block interface has fixed chunks. File interface is stream of data to which we apply our segmentation engine and identify optimal boundaries for dedupe. Albireo is content-aware and can pull out images from office files and knows about tar files, things like that."
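The difference between fixed chunks and content-chosen boundaries can be illustrated with a generic content-defined chunking sketch. This is not Albireo's segmentation engine, whose details the article does not give; it is a simple gear-style rolling hash, and the table seed, mask and size limits are arbitrary assumptions.

```python
import random

# Assumed parameters for illustration only.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # per-byte random values


def chunk(data: bytes, mask: int = 0x3F, min_size: int = 16, max_size: int = 256):
    """Split a byte stream where a rolling hash of recent bytes hits a pattern.

    Boundaries depend on nearby content rather than on absolute offsets,
    subject to the minimum and maximum chunk sizes.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


stream_rng = random.Random(1)
stream = bytes(stream_rng.getrandbits(8) for _ in range(4096))
pieces = chunk(stream)
shifted_pieces = chunk(b"X" + stream)  # same stream with one byte prepended
shared = set(pieces) & set(shifted_pieces)
```

With fixed-size blocks, prepending a single byte shifts every block boundary, so none of the later blocks match their earlier versions. With content-defined boundaries the chunker will typically fall back into step after the insertion, so `shared` would be expected to hold most of the original chunks, although the exact count depends on the data and parameters.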

The technology can be used for inline dedupe, post-process dedupe or parallel deduplication, a combination of the two.
