This storage startup dedupes what to do what? How?
Yep, it's complicated. Our man Trev unravels one new firm's tech
Sysadmin blog Two months ago hyper-converged infrastructure appliance startup SimpliVity sued its rival Springpath. I know next to nothing about the lawsuit – nor do I want to – but the whole thing has caused commentators and armchair analysts everywhere to ask what exactly SimpliVity does that is unique.
I have spent the past two months learning exactly this and so, naturally, I will share it with all of you.
Putting the marketing fluff away, SimpliVity's special sauce is its storage. Everything else it does someone else can do, but its approach to storage is unique. In this, SimpliVity serves as a model organism for the consideration of a technology infrastructure startup.
One of the problems with explaining these startups to others is that words represent more than a single, limited meaning. Words such as deduplication, snapshot and replication, while technically accurate, are saddled with decades of baggage surrounding implementation and theoretical design that increasingly don’t apply.
I am going to be perfectly honest when I say that I don't understand how every company out there does their storage. I grok many, maybe even most of them. But even though I do briefings 20 hours a week, I simply don't have enough time to do briefings with all the startups out there and really learn their tech.
Even amongst those startups with whom I do get the opportunity to do briefings, very few have been willing to get really deep into the nitty gritty of what makes them unique. Marketing dominates the calls and they insist on using terminology that doesn't actually convey the differences in implementation that make the startup or product in question worth anyone's attention.
This is one of the reasons that storage reporting is so very inaccurate, and why I deeply respect individuals like The Register's own Chris Mellor, or industry pundits such as Alastair Cooke or Howard Marks. They make a strong effort to learn the new, even if it doesn't easily fit with how the old used to work.
One of the most fun moments in recent months was listening in on a briefing about exactly how SimpliVity's storage works and being able to hear the audible “click” when Mr. Marks fully understood how it all worked. If only all tech journalists and analysts had that time and talent to invest.
To examine the problem further, let’s compare the old with the new.
Traditional deduplication is simple. While implementations vary, the fundamental principle is always the same: every block that is written is hashed, and the hash is stored in an index. If a new block is supposed to be written that matches one of the existing hashes in the index, then that new block is not written; a pointer to the existing block is simply added instead.
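To make that concrete, here is a minimal sketch of the hash-and-index principle in Python. It is a toy, not any vendor's implementation: the class name DedupStore, the choice of SHA-256 and the in-memory dictionaries standing in for disk and index are all my own assumptions for illustration.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store (illustrative only)."""

    def __init__(self, block_size=8192):
        self.block_size = block_size
        self.blocks = {}    # hash -> block bytes (stands in for unique blocks on disk)
        self.refcount = {}  # hash -> number of pointers to that block

    def write(self, data):
        """Split data into fixed-size blocks and return one pointer (hash) per block."""
        pointers = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # First time we see this content: actually store the block.
                self.blocks[digest] = block
            # Seen before or not, the caller only gets a pointer to the block.
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            pointers.append(digest)
        return pointers

    def read(self, pointers):
        """Reassemble the original data by following the pointers."""
        return b"".join(self.blocks[p] for p in pointers)
```

Writing b"aaaabbbbaaaa" with a 4-byte block size yields three pointers but only two stored blocks, because the first and last blocks hash to the same value.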
This is a balancing act on many levels. You need to choose a hash size and index Globally Unique ID (GUID) size small enough that you can store the entire index in RAM, but large enough that you don’t run out of unique identifiers for blocks.
Similarly, you need to choose block sizes that are small enough that there is a decent chance of blocks being the same (thus realising space savings from deduplication) but large enough that the overhead of the metadata doesn’t cancel out the space savings.
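A quick back-of-the-envelope calculation shows why this balancing act hurts. The sketch below assumes a hypothetical 10TiB of capacity, 64 bytes per index entry and the worst case where every block is unique; none of those figures come from any particular product.

```python
TIB = 2**40
GIB = 2**30


def index_footprint(capacity_bytes, block_size, entry_bytes):
    """Return (entries, index size in bytes), assuming every block is unique."""
    entries = capacity_bytes // block_size
    return entries, entries * entry_bytes


if __name__ == "__main__":
    # Hypothetical figures: 10 TiB of capacity, 64 bytes per index entry.
    for block_size in (4096, 8192, 65536):
        entries, size = index_footprint(10 * TIB, block_size, 64)
        print(f"{block_size // 1024} KiB blocks: {entries:,} entries, "
              f"{size / GIB:.1f} GiB of index")
```

Under those assumptions, 8KiB blocks need an 80GiB index, and halving the block size doubles it. Smaller blocks dedupe better but push the index out of RAM; larger blocks keep the index small but match less often.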
Ultimately, however, this is a very primitive approach to space savings. It is resource intensive and doesn’t scale well. It works reasonably well if all storage is owned by a single system, but starts becoming a nightmare when we talk about scale-out clustered storage, let alone storage clusters with nodes in different geographic areas.
Now let's solve some of these problems by being a little more conscious of how data is used than traditional deduplication approaches are. To explain, I’ll need to lean on an analogy and, as this is a blog, I’ll pick the written word as my analogy. (At least it’s not cars!)
Let’s envision each unique data block in a traditional deduplication system as a Unicode character. To write this blog I need letters (both majuscule and minuscule), punctuation, spaces, some quotation marks, and so forth. Many characters are reused frequently, with the letter “e” being the most frequently used in English.
Now, if each character were an 8KiB block, then spending, for example, 512 bits (64 bytes) on the metadata about that block would still produce notable savings, even if all we’re doing is mapping the usage of our “e” blocks.
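The arithmetic behind that claim is easy to check. The sketch below uses the 8KiB block and 512 bits of metadata from the example, and additionally assumes (my simplification) that every reference to a block costs one 64-byte metadata entry.

```python
BLOCK_BYTES = 8 * 1024     # 8 KiB block, as in the example above
METADATA_BYTES = 512 // 8  # 512 bits of metadata = 64 bytes


def bytes_stored(references, deduplicated):
    """Bytes consumed by one distinct block that is written `references` times."""
    if deduplicated:
        # One copy of the block, plus one metadata entry per reference (assumed).
        return BLOCK_BYTES + references * METADATA_BYTES
    return references * BLOCK_BYTES


def savings_fraction(references):
    """Fraction of space saved by deduplicating; negative means a net loss."""
    plain = bytes_stored(references, deduplicated=False)
    dedup = bytes_stored(references, deduplicated=True)
    return 1 - dedup / plain
```

The metadata is under one per cent of the block (64 / 8192), so a block referenced 100 times saves a little over 98 per cent of the space. A block referenced only once comes out slightly negative: the metadata is pure overhead.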
But what if we step back a little and look at the text instead of just the letters? How many times have I used the word “the” or “space” in this article? What if, in addition to creating metadata about each letter of the text I also created metadata about each word? Each sentence? Whole paragraphs and even entire documents?
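To stay with the analogy, here is a toy sketch of what metadata at several levels might look like. To be clear, this is my own illustration of the idea, not SimpliVity's design: the LayeredIndex class, its three levels and the use of SHA-256 are all assumptions.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class LayeredIndex:
    """Toy index holding metadata for characters, words and sentences."""

    def __init__(self):
        self.chars = {}      # hash -> character (the "blocks")
        self.words = {}      # hash -> tuple of character hashes
        self.sentences = {}  # hash -> tuple of word hashes

    def add_sentence(self, sentence: str) -> str:
        """Index a sentence at all three levels and return its identifier."""
        word_ids = []
        for word in sentence.split(" "):
            char_ids = []
            for ch in word:
                cid = digest(ch.encode("utf-8"))
                self.chars.setdefault(cid, ch)
                char_ids.append(cid)
            # A word's identifier is derived from the identifiers of its characters.
            wid = digest("".join(char_ids).encode("ascii"))
            self.words.setdefault(wid, tuple(char_ids))
            word_ids.append(wid)
        # Likewise, a sentence's identifier is derived from its words.
        sid = digest("".join(word_ids).encode("ascii"))
        self.sentences.setdefault(sid, tuple(word_ids))
        return sid

    def read_sentence(self, sid: str) -> str:
        """Rebuild the text by walking sentence -> words -> characters."""
        return " ".join(
            "".join(self.chars[c] for c in self.words[w])
            for w in self.sentences[sid]
        )
```

Adding “the cat saw the dog” twice produces one sentence entry, four word entries and ten character entries. More usefully, two systems holding that sentence can establish that fact by comparing a single identifier instead of every letter.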
This approach has some pretty big ramifications. The first is that the amount of metadata in use climbs. However, that metadata isn’t entirely bound to the individual blocks that are stored. This means that if you want to store a whole bunch of very small, mostly unique bits of data, then this new approach to storage is probably going to be even worse than traditional deduplication, which is in turn worse for that use case than not using deduplication at all.
In the real world, however, data is almost never that unique. Not only is deduplication effective for a huge number of workloads, but the data that is deduplicated almost always has patterns or aggregations of blocks that lend themselves to this enhanced metadata approach. This makes scale-out and geographically dispersed storage not only possible, but frighteningly efficient.