
Ex-Sun Micro CTO reveals Greenbytes 'world-beating' dedupe

He IS on advisory board of flash cache latency smasher, but...


El Reg has managed to take a peek at an as-yet-unpublished white paper, written by former Sun Microsystems CTO Randall Chalfant, which claims that storage company Greenbytes' deduplication tech has near-zero latency and possibly offers the world's fastest inline deduplication.

It works with 4K blocks of data. This is the sequence of events once data comes in to a Greenbytes system, we're told by Chalfant – a lecturer who sits on the company's industry advisory board:

  1. The Greenbytes system receives the data and it is stored in a write log, which exists on one or more solid-state drives, and then a write acknowledgement is returned to the client so its OS and application can carry on and not have to wait for the data write to complete.
  2. Greenbytes' software specifies three stages of data input: the open stage, the quiescing stage [rendering inactive], and the synching stage. In the open stage, the client is free to write as much data as possible into the memory buffer. Then, every few seconds, a snapshot is taken to freeze and then quiesce [disable] the buffer, which is done in preparation for writing data to disk. During the synch stage, a 256-bit hash is computed for each data block.
  3. The hash is stored in a d-cache (deduplication cache), an assembly of one or more solid-state drives that can be extended easily over time to increase the size of the storage system. The d-cache only holds the dedupe search tables and has a fixed data access latency.
  4. Greenbytes' technology uses the hashes to determine, in virtually constant time, whether there is a block match in the storage system. It calls its look-up algorithm "probabilistic constant time search".
  5. Computed hashes are looked up to see if they exist already in the d-cache. The d-cache returns an answer in constant time, and, if there is no match, a new 4K block of data is written to storage. If there is a match, a pointer gets written instead.
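
The hash-and-look-up steps above can be sketched in a few lines of Python. This is only a toy model, not Greenbytes' implementation: the article does not name the hash function, so SHA-256 is assumed as the 256-bit hash, an in-memory dict stands in for the SSD-based d-cache, and the names DedupeStore, sync and read are made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # the 4K block size mentioned in the article


class DedupeStore:
    """Toy model: a hash index (standing in for the d-cache) plus block storage."""

    def __init__(self):
        self.d_cache = {}  # hash digest -> index of the stored block
        self.blocks = []   # unique 4K blocks actually written
        self.layout = []   # one pointer (block index) per logical block

    def sync(self, data: bytes) -> None:
        """Hash each 4K block and store it only if its hash is new."""
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            # Assumption: SHA-256 is used as the 256-bit hash.
            digest = hashlib.sha256(block).digest()
            index = self.d_cache.get(digest)
            if index is None:
                # No match: write the new block and remember its hash.
                index = len(self.blocks)
                self.blocks.append(block)
                self.d_cache[digest] = index
            # Match or not, the logical block is recorded as a pointer.
            self.layout.append(index)

    def read(self) -> bytes:
        """Rebuild the logical data by following the pointers."""
        return b"".join(self.blocks[i] for i in self.layout)


store = DedupeStore()
payload = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
store.sync(payload)
print(len(store.layout), len(store.blocks))  # 4 logical blocks, 2 stored
```

A dict look-up is constant time on average, which loosely mirrors the constant-time claim, but the sketch says nothing about how the real d-cache achieves it on flash.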

To add more detail here, Greenbytes' CTO and founder Bob Petrocelli says:

The width of the hash is actually tunable. We currently allow 128-bit, 192-bit and 256-bit hashes. The default is 256 bits. One of the patent claims deals with the searching approach using hashes. ...

The important point is to realise that the write-coalescing and the actual determination of which blocks to write happens during the transactional phase of the pipeline. There are a lot of complex considerations during this phase.

For example, a block that is overwritten many times, say by an application log etc, will only be written once ([using the] final state of the block, as all block writes are collapsed). A temporary in-memory AVL tree is used for this write coalescing.

The system is zero latency because we are able to back the write immediately, protected by the intent log, and then only later during the transactional phase do we absorb the cost of de-duplication. When we have any duplicate data in the stream, we come out ahead of the game because we end up committing less data to disk.
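
To illustrate the write coalescing Petrocelli describes, here is a minimal Python sketch. It is an assumption-laden stand-in: Python's standard library has no AVL tree, so a dict keyed by block address plus a sort at commit time is swapped in, and the WriteBuffer class with its write and commit methods is hypothetical.

```python
class WriteBuffer:
    """Toy write path: acknowledge at once, coalesce later."""

    def __init__(self):
        self.intent_log = []  # stand-in for the SSD-backed intent log
        self.pending = {}     # block address -> latest contents (AVL tree stand-in)

    def write(self, address: int, block: bytes) -> str:
        """Log the write and acknowledge immediately; no dedupe work here."""
        self.intent_log.append((address, block))
        self.pending[address] = block  # later writes overwrite earlier ones
        return "ack"

    def commit(self):
        """Return the coalesced writes in address order and reset the buffer."""
        batch = sorted(self.pending.items())
        self.pending = {}
        self.intent_log = []
        return batch


buf = WriteBuffer()
for i in range(100):
    buf.write(7, f"log entry {i}".encode())  # same block rewritten 100 times
buf.write(3, b"other block")

batch = buf.commit()
print(batch)  # [(3, b'other block'), (7, b'log entry 99')]
```

The 101 incoming writes collapse to two blocks, and only those two would go on to be hashed and deduplicated in the later phase.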

CEO and chairman Steve O'Donnell added: "The searching process uses small parts (64 bits) of the hash to rapidly determine the likelihood of the need to write a block, this dramatically reduces the amount of RAM needed to store the hash and enables the tiny footprint that vIO [Virtual Desktop software] uses inside the Hypervisor."
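
One plausible reading of O'Donnell's remark is a two-level look-up: a small in-RAM set of 64-bit hash fragments answers "definitely new" cheaply, and the full hash is consulted only when the fragment has been seen before. The Python sketch below is a guess at that idea, not the patented algorithm. BLAKE2b is assumed purely because it supports the 128-, 192- and 256-bit widths Petrocelli mentions, and the names block_hash and PrefilteredIndex are hypothetical.

```python
import hashlib


def block_hash(block: bytes, bits: int = 256) -> bytes:
    """Hash a block at a tunable width (BLAKE2b assumed for illustration)."""
    if bits not in (128, 192, 256):
        raise ValueError("supported widths are 128, 192 and 256 bits")
    return hashlib.blake2b(block, digest_size=bits // 8).digest()


class PrefilteredIndex:
    """Toy index: 64-bit prefixes in RAM, full hashes in a slower store."""

    def __init__(self):
        self.ram_prefixes = set()  # first 8 bytes (64 bits) of each hash
        self.full_index = {}       # stand-in for the d-cache: full hash -> block id
        self.full_lookups = 0      # how often the slower store was consulted

    def is_duplicate(self, digest: bytes) -> bool:
        if digest[:8] not in self.ram_prefixes:
            return False  # prefix never seen, so the block must be new
        self.full_lookups += 1  # prefix seen: confirm against the full hash
        return digest in self.full_index

    def add(self, digest: bytes, block_id: int) -> None:
        self.ram_prefixes.add(digest[:8])
        self.full_index[digest] = block_id


index = PrefilteredIndex()
digest = block_hash(b"x" * 4096)
first_seen = index.is_duplicate(digest)   # False, answered from RAM alone
index.add(digest, block_id=0)
second_seen = index.is_duplicate(digest)  # True, after one full look-up
print(first_seen, second_seen, index.full_lookups)
```

Keeping 8 bytes per block in RAM instead of 32 is where the memory saving in this sketch comes from; the cost is an occasional full look-up when two different blocks share a prefix.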

Greenbytes has protected its dedupe software with many patents and will defend its patents using legal eagles. In fact it has already done so.

Back in 2009, Sun Microsystems, which at the time was being acquired by Oracle, sued Greenbytes for infringing its deduplication patents, after Greenbytes claimed Sun had used Greenbytes' own deduplication scheme.

By 2010 this legal dispute was settled and Greenbytes continued to sell and develop its deduplication technology.

The Oracle/Sun ZFS deduplication technology seems not to have been much developed since then. ®
