Ex-Sun Micro CTO reveals Greenbytes 'world-beating' dedupe

He IS on advisory board of flash cache latency smasher, but...

El Reg has managed to take a peek at an as-yet-unpublished white paper, written by former Sun Microsystems CTO Randall Chalfant, which claims that storage company Greenbytes' deduplication tech has near-zero latency and is possibly the world's fastest inline deduplication.

It works with 4K blocks of data. This is the sequence of events once data comes in to a Greenbytes system, we're told by Chalfant – a lecturer who sits on the company's industry advisory board:

  1. The Greenbytes system receives the data and it is stored in a write log, which exists on one or more solid-state drives, and then a write acknowledgement is returned to the client so its OS and application can carry on and not have to wait for the data write to complete.
  2. Greenbytes' software specifies three stages of data input: the open stage, the quiescing stage [rendering inactive], and the synching stage. In the open stage, the client is free to write as much data as possible into the memory buffer. Then, every few seconds, a snapshot is taken to freeze and then quiesce [disable] the buffer in preparation for writing data to disk. During the synching stage, a 256-bit hash is computed for each data block.
  3. The hash is stored in a d-cache (deduplication cache), an assembly of one or more solid-state drives that can be extended easily over time to increase the size of the storage system. The d-cache only holds the dedupe search tables and has a fixed data access latency.
  4. Greenbytes' technology uses the hashes to determine, in virtually constant time, whether there is a block match in the storage system. It calls its look-up algorithm "probabilistic constant time search".
  5. Computed hashes are looked up to see if they already exist in the d-cache. The d-cache returns an answer in constant time and, if there is no match, a new 4K block of data is written to storage. If there is a match, a pointer is written instead.
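
As a rough illustration of steps 2 to 5, here is a minimal Python sketch of hash-based block deduplication. It is not Greenbytes' code: the `DedupeStore` class, the use of SHA-256 as the 256-bit hash, and the in-memory dictionary standing in for the SSD-backed d-cache are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # 4K blocks, as described in the article


class DedupeStore:
    """Toy deduplicating block store (hypothetical, for illustration only)."""

    def __init__(self):
        self.index = {}     # hash digest -> position in self.blocks (stand-in for the d-cache)
        self.blocks = []    # unique blocks actually "written to storage"
        self.pointers = []  # one entry per logical block, pointing into self.blocks

    def write(self, data: bytes) -> None:
        for offset in range(0, len(data), BLOCK_SIZE):
            # Pad a short final block with zeros so every block is 4K.
            block = data[offset:offset + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
            # 256-bit hash per block; SHA-256 is an assumption, the article names no algorithm.
            digest = hashlib.sha256(block).digest()
            position = self.index.get(digest)  # average constant-time dict lookup
            if position is None:
                # No match: store the new block and remember its hash.
                position = len(self.blocks)
                self.blocks.append(block)
                self.index[digest] = position
            # Match or not, the logical block is recorded as a pointer.
            self.pointers.append(position)

    def read(self) -> bytes:
        return b"".join(self.blocks[p] for p in self.pointers)
```

Writing three blocks of which two are identical stores only two physical blocks and three pointers; reading back follows the pointers to reassemble the original data.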

To add more detail here, Greenbytes' CTO and founder Bob Petrocelli says:

The width of the hash is actually tunable. We currently allow 128-bit, 192-bit and 256-bit hashes. The default is 256 bits. One of the patent claims deals with the searching approach using hashes. ...

The important point is to realise that the write-coalescing and the actual determination of which blocks to write happens during the transactional phase of the pipeline. There are a lot of complex considerations during this phase.

For example, a block that is overwritten many times, say by an application log, will only be written once ([using the] final state of the block, as all block writes are collapsed). A temporary in-memory AVL tree is used for this write coalescing.

The system is zero latency because we are able to back the write immediately, protected by the intent log, and then only later during the transactional phase do we absorb the cost of de-duplication. When we have any duplicate data in the stream, we come out ahead of the game because we end up committing less data to disk.
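
Petrocelli's description can be pictured with a small Python sketch, offered under stated assumptions rather than as the real implementation. The `WritePipeline` class and its method names are hypothetical, a plain dictionary plus `sorted()` stands in for the in-memory AVL tree, a list stands in for the SSD-backed intent log, and SHA-256 stands in for the unnamed hash.

```python
import hashlib


class WritePipeline:
    """Toy model: acknowledge writes at once, coalesce and dedupe later."""

    def __init__(self):
        self.intent_log = []       # (address, block) pairs; stand-in for the SSD write log
        self.known_hashes = set()  # hashes of blocks already committed
        self.blocks_committed = 0  # blocks that actually reached "disk"

    def write(self, address: int, block: bytes) -> str:
        # Client path: log the write and acknowledge. No hashing happens here.
        self.intent_log.append((address, block))
        return "ack"

    def sync(self) -> int:
        # Freeze the current log; new writes would land in a fresh one.
        frozen, self.intent_log = self.intent_log, []
        # Write coalescing: only the final state of each address survives.
        latest = {}
        for address, block in frozen:
            latest[address] = block
        written = 0
        # Walk addresses in order, as an ordered tree such as an AVL tree would.
        for address in sorted(latest):
            digest = hashlib.sha256(latest[address]).digest()
            if digest not in self.known_hashes:
                self.known_hashes.add(digest)
                written += 1  # a new block is committed; duplicates cost nothing
        self.blocks_committed += written
        return written
```

In this sketch, three overwrites of one address followed by a write of the same contents to a second address result in a single committed block: the overwrites collapse to the final state, and the second address is a duplicate of it.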

CEO and chairman Steve O'Donnell added: "The searching process uses small parts (64 bits) of the hash to rapidly determine the likelihood of the need to write a block, this dramatically reduces the amount of RAM needed to store the hash and enables the tiny footprint that vIO [Virtual Desktop software] uses inside the Hypervisor."
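
One plausible reading of O'Donnell's remark, sketched in Python: keep only a 64-bit slice of each hash in RAM as a quick filter, and consult the full hash table only when the slice matches. The article does not say how Greenbytes actually uses the 64-bit parts, so the `PrefixFilter` class, the choice of the first 8 bytes, and SHA-256 are all assumptions.

```python
import hashlib

PREFIX_BYTES = 8  # 64 bits of the hash kept in RAM


class PrefixFilter:
    """Toy two-level lookup: small in-RAM prefixes, full hashes elsewhere."""

    def __init__(self):
        self.prefixes = set()     # 8 bytes per block, held in memory
        self.full_hashes = set()  # stand-in for the full table on SSD
        self.full_lookups = 0     # how often the full table had to be consulted

    def is_duplicate(self, block: bytes) -> bool:
        digest = hashlib.sha256(block).digest()
        prefix = digest[:PREFIX_BYTES]
        if prefix not in self.prefixes:
            # Prefix never seen: the block is certainly new, so skip the full lookup.
            self.prefixes.add(prefix)
            self.full_hashes.add(digest)
            return False
        # Prefix seen before: probably a duplicate, confirm against the full hash.
        self.full_lookups += 1
        if digest in self.full_hashes:
            return True
        self.full_hashes.add(digest)  # rare prefix collision between different blocks
        return False
```

The memory saving in this sketch comes from holding 8 bytes per block in RAM instead of the full 32-byte hash, at the cost of an occasional confirming lookup.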

Greenbytes has protected its dedupe software with many patents and will defend its patents using legal eagles. In fact it has already done so.

Back in 2009, Sun Microsystems, which at the time was being acquired by Oracle, sued Greenbytes for infringing its deduplication patents, after Greenbytes claimed Sun had used Greenbytes' own deduplication scheme.

By 2010 this legal dispute was settled and Greenbytes continued to sell and develop its deduplication technology.

The Oracle/Sun ZFS deduplication technology seems not to have been much developed since then. ®
