Ex-Sun Micro CTO reveals Greenbytes 'world-beating' dedupe

He IS on advisory board of flash cache latency smasher, but...

El Reg has managed to take a peek at an as-yet-unpublished white paper, written by former Sun Microsystems CTO Randall Chalfant, which claims that storage company Greenbytes' deduplication tech has near-zero latency and is possibly the world's fastest inline deduplication.

It works with 4K blocks of data. This is the sequence of events once data comes in to a Greenbytes system, we're told by Chalfant – a lecturer who sits on the company's industry advisory board:

  1. The Greenbytes system receives the data and it is stored in a write log, which exists on one or more solid-state drives, and then a write acknowledgement is returned to the client so its OS and application can carry on and not have to wait for the data write to complete.
  2. Greenbytes' software specifies three stages of data input: the open stage, the quiescing stage [rendering inactive], and the synching stage. In the open stage, the client is free to write as much data as possible into the memory buffer. Then, every few seconds, a snapshot is taken to freeze and then quiesce [disable] the buffer, in preparation for writing data to disk. During the synching stage, a 256-bit hash is computed for each data block.
  3. The hash is stored in a d-cache (deduplication cache), an assembly of one or more solid-state drives that can be extended easily over time to increase the size of the storage system. The d-cache only holds the dedupe search tables and has a fixed data access latency.
  4. Greenbytes' technology uses the hashes to determine, in virtually constant time, whether there is a block match in the storage system. It calls its look-up algorithm a probabilistic constant-time search.
  5. Computed hashes are looked up to see if they already exist in the d-cache. The d-cache returns an answer in constant time and, if there is no match, a new 4K block of data is written to storage. If there is a match, a pointer gets written instead.
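
Steps 3 to 5 can be illustrated with a minimal Python sketch. This is not Greenbytes' code: the `DedupeStore` class and its field names are invented for illustration, SHA-256 is assumed as the 256-bit hash (the article does not name the hash function), and an in-memory dict stands in for the SSD-backed d-cache.

```python
import hashlib

BLOCK_SIZE = 4096  # 4K blocks, as described in the article


class DedupeStore:
    """Toy block store; a dict stands in for the SSD-backed d-cache."""

    def __init__(self):
        self.d_cache = {}   # hash digest -> index of the stored block
        self.blocks = []    # unique 4K blocks actually written to storage
        self.pointers = []  # logical layout: one block index per incoming block

    def write(self, data: bytes) -> None:
        for offset in range(0, len(data), BLOCK_SIZE):
            # Pad a short final block with zero bytes to a full 4K.
            block = data[offset:offset + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
            digest = hashlib.sha256(block).digest()  # assumed 256-bit hash
            index = self.d_cache.get(digest)         # average constant-time lookup
            if index is None:
                # No match: store the new block and remember its hash.
                index = len(self.blocks)
                self.blocks.append(block)
                self.d_cache[digest] = index
            # Match or not, the logical layout only records a pointer.
            self.pointers.append(index)

    def read(self) -> bytes:
        return b"".join(self.blocks[i] for i in self.pointers)


data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
store = DedupeStore()
store.write(data)
print(len(store.pointers), len(store.blocks))  # prints: 4 2
```

Four logical blocks come in, but only two distinct blocks are stored; the other two writes become pointers.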

To add more detail here, Greenbytes' CTO and founder Bob Petrocelli says:

The width of the hash is actually tunable. We currently allow 128-bit, 192-bit and 256-bit hashes. The default is 256 bits. One of the patent claims deals with the searching approach using hashes. ...

The important point is to realise that the write-coalescing and the actual determination of which blocks to write happens during the transactional phase of the pipeline. There are a lot of complex considerations during this phase.

For example, a block that is overwritten many times, say by an application log etc, will only be written once ([using the] final state of the block, as all block writes are collapsed). A temporary in-memory AVL tree is used for this write coalescing.

The system is zero latency because we are able to back the write immediately, protected by the intent log, and then only later during the transactional phase do we absorb the cost of de-duplication. When we have any duplicate data in the stream, we come out ahead of the game because we end up committing less data to disk.
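
The write coalescing Petrocelli describes can be sketched roughly as follows. The `WriteBuffer` class and its method names are hypothetical, and a plain Python dict (sorted at sync time) is swapped in for the in-memory AVL tree, since the standard library has no AVL tree; the observable effect, one surviving write per block address, is the same.

```python
class WriteBuffer:
    """Toy intent log plus coalescing buffer (a dict replaces the AVL tree)."""

    def __init__(self):
        self.intent_log = []  # every write is logged here before it is acknowledged
        self.pending = {}     # block address -> latest contents for that block

    def write(self, address: int, block: bytes) -> str:
        self.intent_log.append((address, block))  # protect the write first
        self.pending[address] = block             # later writes replace earlier ones
        return "ack"                              # client carries on immediately

    def sync(self):
        """Transactional phase: hand back one write per block, in address order."""
        batch = sorted(self.pending.items())
        self.pending = {}
        self.intent_log.clear()  # the log is no longer needed once data is committed
        return batch


buf = WriteBuffer()
for i in range(1000):
    buf.write(7, f"log entry {i}".encode())  # same block overwritten 1,000 times
buf.write(3, b"other")
batch = buf.sync()
print(len(batch))  # prints: 2
```

A thousand overwrites of block 7 collapse into a single write carrying the final state, which is where the cost of deduplication would then be paid.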

CEO and chairman Steve O'Donnell added: "The searching process uses small parts (64 bits) of the hash to rapidly determine the likelihood of the need to write a block, this dramatically reduces the amount of RAM needed to store the hash and enables the tiny footprint that vIO [Virtual Desktop software] uses inside the Hypervisor."
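
One way a 64-bit slice of the hash could cut RAM use is as a first-pass filter, sketched below. How Greenbytes actually uses the 64-bit portion is not described, so this is an assumption throughout: the `PrefixFilter` class is hypothetical, SHA-256 is assumed, a set of 8-byte prefixes plays the in-RAM structure and a dict plays the full-hash table on the d-cache.

```python
import hashlib


class PrefixFilter:
    """Toy two-level lookup: short prefixes in RAM, full hashes elsewhere."""

    def __init__(self):
        self.prefixes = set()   # in RAM: 8 bytes per known block, not 32
        self.full_hashes = {}   # stand-in for the d-cache: full digest -> block id

    def _remember(self, digest: bytes) -> None:
        self.prefixes.add(digest[:8])
        self.full_hashes[digest] = len(self.full_hashes)

    def is_duplicate(self, block: bytes) -> bool:
        digest = hashlib.sha256(block).digest()
        if digest[:8] not in self.prefixes:
            # Prefix never seen: the block is certainly new, no full lookup needed.
            self._remember(digest)
            return False
        # Prefix seen: probably a duplicate, so confirm against the full hash.
        if digest in self.full_hashes:
            return True
        self._remember(digest)  # rare prefix collision: block is new after all
        return False


f = PrefixFilter()
first = f.is_duplicate(b"x" * 4096)
second = f.is_duplicate(b"x" * 4096)
third = f.is_duplicate(b"y" * 4096)
print(first, second, third)  # prints: False True False
```

The design point is that a miss on the short prefix is conclusive, so only probable duplicates ever need the full-width comparison.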

Greenbytes has protected its dedupe software with many patents and will defend its patents using legal eagles. In fact it has already done so.

Back in 2009, Greenbytes sued Sun Microsystems, which at the time was being acquired by Oracle, for infringing its deduplication patents, claiming Sun had used Greenbytes' own deduplication scheme.

By 2010 this legal dispute was settled and Greenbytes continued to sell and develop its deduplication technology.

The Oracle/Sun ZFS deduplication technology seems not to have been much developed since then. ®
