Ex-Sun Micro CTO reveals Greenbytes 'world-beating' dedupe

He IS on advisory board of flash cache latency smasher, but...


El Reg has managed to take a peek at an as-yet-unpublished white paper, written by former Sun Microsystems CTO Randall Chalfant, which claims that storage company Greenbytes' deduplication tech has near-zero latency and is possibly the world's fastest inline deduplication.

It works with 4K blocks of data. This is the sequence of events once data comes in to a Greenbytes system, we're told by Chalfant – a lecturer who sits on the company's industry advisory board:

  1. The Greenbytes system receives the data and it is stored in a write log, which exists on one or more solid-state drives, and then a write acknowledgement is returned to the client so its OS and application can carry on and not have to wait for the data write to complete.
  2. Greenbytes' software specifies three stages of data input: the open stage, the quiescing stage [rendering inactive], and the synching stage. In the open stage, the client is free to write as much data as possible into the memory buffer. Then, every few seconds, a snapshot is taken to freeze and then quiesce [disable] the buffer, which is done in preparation for writing data to disk. During the synch stage, a 256-bit hash is computed for each data block.
  3. The hash is stored in a d-cache (deduplication cache), an assembly of one or more solid-state drives that can be extended easily over time to increase the size of the storage system. The d-cache only holds the dedupe search tables and has a fixed data access latency.
  4. Greenbytes' technology uses the hashes to determine, in virtually constant time, whether there is a block match in the storage system. It calls its look-up algorithm "probabilistic constant time search".
  5. Computed hashes are looked up to see if they exist already in the d-cache. The d-cache returns an answer in constant time and, if there is no match, a new 4K block of data is written to storage. If there is a match, a pointer gets written instead.
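
The hash-then-look-up sequence in steps 2 to 5 can be sketched in a few lines of Python. This is only an illustration of generic inline block deduplication, not Greenbytes' code: the `DedupeStore` class and its fields are hypothetical, SHA-256 is assumed as the 256-bit hash (the white paper summary does not name one), and an ordinary in-memory dict stands in for the SSD-backed d-cache and its probabilistic constant time search.

```python
import hashlib

BLOCK_SIZE = 4096  # 4K blocks, as described in the article


class DedupeStore:
    """Toy inline-dedupe store; every name here is hypothetical."""

    def __init__(self):
        self.index = {}   # hash digest -> block id (stands in for the d-cache)
        self.blocks = []  # unique block payloads (stands in for backing storage)
        self.refs = []    # one block id (pointer) per logical block written

    def write(self, data: bytes) -> None:
        for off in range(0, len(data), BLOCK_SIZE):
            # Pad a short final block so every stored block is exactly 4K.
            block = data[off:off + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
            digest = hashlib.sha256(block).digest()  # assumed 256-bit hash
            block_id = self.index.get(digest)
            if block_id is None:
                # No match: store the new block and remember its hash.
                block_id = len(self.blocks)
                self.blocks.append(block)
                self.index[digest] = block_id
            # Match or not, the logical layout only records a pointer.
            self.refs.append(block_id)

    def read(self) -> bytes:
        return b"".join(self.blocks[i] for i in self.refs)


store = DedupeStore()
store.write(b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE)
print(len(store.refs), len(store.blocks))  # prints "3 2"
```

Three logical blocks go in but only two are stored, because the repeated block is recorded as a pointer to the first copy.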

To add more detail here, Greenbytes' CTO and founder Bob Petrocelli says:

The width of the hash is actually tunable. We currently allow 128-bit, 192-bit and 256-bit hashes. The default is 256 bits. One of the patent claims deals with the searching approach using hashes. ...

The important point is to realise that the write-coalescing and the actual determination of which blocks to write happens during the transactional phase of the pipeline. There are a lot of complex considerations during this phase.

For example, a block that is overwritten many times, say by an application log etc, will only be written once ([using the] final state of the block, as all block writes are collapsed). A temporary in-memory AVL tree is used for this write coalescing.

The system is zero latency because we are able to ack the write immediately, protected by the intent log, and then only later during the transactional phase do we absorb the cost of de-duplication. When we have any duplicate data in the stream, we come out ahead of the game because we end up committing less data to disk.
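
The write coalescing Petrocelli describes, where repeated overwrites of one block collapse to its final state before anything reaches disk, can be illustrated as follows. This is a minimal sketch under stated assumptions: the `WriteCoalescer` class is hypothetical, and a plain dict sorted at flush time is swapped in for the in-memory AVL tree he mentions, since Python's standard library has no AVL tree. The observable result is the same: one write per block address, in address order.

```python
class WriteCoalescer:
    """Collapses repeated writes to the same block address (hypothetical sketch)."""

    def __init__(self):
        self.pending = {}  # block address -> latest payload for that address

    def write(self, addr: int, payload: bytes) -> None:
        # A later write to the same address simply replaces the earlier one.
        self.pending[addr] = payload

    def flush(self):
        # Emit one (address, payload) pair per block, ordered by address,
        # then start a fresh buffer for the next transaction.
        batch = sorted(self.pending.items())
        self.pending = {}
        return batch


buf = WriteCoalescer()
for i in range(1000):
    buf.write(7, f"log entry {i}".encode())  # same block rewritten 1,000 times
buf.write(3, b"other")
batch = buf.flush()
print(batch)  # prints "[(3, b'other'), (7, b'log entry 999')]"
```

A thousand overwrites of block 7 turn into a single write carrying only the last payload.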

CEO and chairman Steve O'Donnell added: "The searching process uses small parts (64 bits) of the hash to rapidly determine the likelihood of the need to write a block; this dramatically reduces the amount of RAM needed to store the hash and enables the tiny footprint that vIO [Virtual Desktop software] uses inside the Hypervisor."
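
One way to picture the idea of checking a small part of the hash first is sketched below. This is an assumption-laden illustration, not the patented search: the `PrefixFilter` class is hypothetical, SHA-256 is assumed, a Python set of 8-byte (64-bit) prefixes stands in for the RAM-resident structure, and a dict of full digests stands in for the larger table held on SSD.

```python
import hashlib


class PrefixFilter:
    """Keeps only 64-bit hash prefixes in RAM (hypothetical sketch)."""

    def __init__(self):
        self.prefixes = set()   # 8-byte prefixes, cheap enough to hold in RAM
        self.full_table = {}    # full digest -> block id; stands in for the SSD table
        self.full_lookups = 0   # how often the slower full table had to be consulted

    def lookup_or_add(self, block: bytes):
        """Return (block_id, is_duplicate) for a block."""
        digest = hashlib.sha256(block).digest()
        prefix = digest[:8]
        if prefix not in self.prefixes:
            # Prefix never seen: the block is certainly new, no full look-up needed.
            self.prefixes.add(prefix)
            return self._add(digest), False
        # Prefix seen before: probably a duplicate, so confirm with the full hash.
        self.full_lookups += 1
        block_id = self.full_table.get(digest)
        if block_id is None:
            # Rare prefix collision: a different block that shares the first 64 bits.
            return self._add(digest), False
        return block_id, True

    def _add(self, digest: bytes) -> int:
        block_id = len(self.full_table)
        self.full_table[digest] = block_id
        return block_id


f = PrefixFilter()
first = f.lookup_or_add(b"x" * 4096)
second = f.lookup_or_add(b"y" * 4096)
repeat = f.lookup_or_add(b"x" * 4096)
print(first, second, repeat)
```

The RAM saving in this sketch comes from holding 8 bytes per block in memory instead of the full 32-byte digest. The full table is only consulted when the prefix suggests a likely match.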

Greenbytes has protected its dedupe software with many patents and will defend them using legal eagles. In fact, it has already done so.

Back in 2009, Sun Microsystems, which at the time was being acquired by Oracle, sued Greenbytes for infringing its deduplication patents, after Greenbytes claimed Sun had used Greenbytes' own deduplication scheme.

By 2010 this legal dispute was settled and Greenbytes continued to sell and develop its deduplication technology.

The Oracle/Sun ZFS deduplication technology seems not to have been much developed since then. ®
