Isilon and a question of Big Data

Or was that ingestion?

The Essential Guide to IT Transformation

El Reg Would scale-out filers benefit from having deduplicated files, assuming that did not reduce performance?

Rob Peglar: In general, the answer is no. First, the assumption is incorrect; any data reduction technique of the three known (compression, deduplication, incrementalization) has a performance (time) implication. In addition, deduplication also has a space implication; the tradeoff is metadata versus data. At large scale, deduplication metadata becomes very significant.  For example, holding hashes (CRCs) of each 4KB of data – a very common granularity – implies 4 trillion items of metadata for a data repository of small size, 4PB.

If each hash structure (CRC & disk pointer, i.e. given a hash, where is its data?) is only 64 bits, or 8 bytes, which is quite small, this means 32TB of hash metadata which must be completely consistent across all nodes at all times. One must not only store that 32TB of data in stable and protected storage, but that storage must also be very fast, nearly as fast as the CPU’s ability to ingest it. It’s cost-prohibitive to have each node with 32TB of RAM just to hold hashes. Plus, even if you did have 32TB of RAM, it also means the CPUs in each node having to read 16TB worth of metadata (in the worst case) for each and every write access to a file, no matter how small – to perform the dedup hash check - and that searching alone is non-trivial, taking significant time.

The basic problem is that data is growing faster than the CPU’s ability to efficiently process the metadata for it. This is why at scale, deduplication is a non-optimal technique.  It may save some ‘end’ space, but consider ‘big data’ as discussed before.  This data is often highly unique and rarely can be deduplicated.  For example, web hits and traffic from end users. Each end user is unique – by definition – and must be able to be identified as such to analytic software. Each hit is at a different point in time, always changing, always incrementing.

Constant streams of new data being ingested are therefore rarely duplicated across the user universe. So, for ‘big data’, deduplication is most often a bad trade-off – even if the CPU were infinitely fast, you wouldn’t save much space on disk. Contrast this with traditional VM or VDI usage, where OS images are immutable and mostly read-only; here, deduplication is a good trade-off. But that is not the problem the industry is trying to solve now. The problem is ingestion, analysis and long-term storage and protection of massive (and growing!) amounts of unique and ever-changing data.

El Reg Could a company have a single logical file store with big data being a sectioned off part of that, rather than having a physically separate big data silo?

Rob Peglar: Most certainly. The ‘sectioning’ is trivial – it could be as simple as a single directory, under which all ‘big data’ is held. Isilon has a huge advantage here, in that the entire filesystem is coherent and sits under a single name – ‘ifs’. One could easily have /ifs/bigdata for all analytic data, and then /ifs/users for home directories and such, /ifs/exchange for mail, and so on. Each directory has its own attributes (metadata) regarding protection levels, tiering/residence, movement, QoS, replication, snaps, and so on.

One realizes the advantages of having multiple file systems for different purposes without the management nightmare of having to administer hundreds or thousands of different filesystems under different mount points, held on different nodes, and so on. At scale, there is a clear advantage to single namespace and single filesystem.

El Reg It was surprising to find out that deduplication was not a useful technology for big data. The dismissal of tape as the best big data archive media was also interesting to hear. We wonder if big data system vendors such as IBM and Oracle, with tape libraries in their product portfolio, will have the same view.

Also, flash is set to play an increasingly important role in big data storage as it will in enterprise storage, generally. Lots of grist here for Rob Peglar's Isilon mill to grind out for customers as he undertakes the CTO America's role. ®

The Essential Guide to IT Transformation

More from The Register

next story
Sysadmin Day 2014: Quick, there's still time to get the beers in
He walked over the broken glass, killed the thugs... and er... reconnected the cables*
Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage
Still doing 'forensics' on the caching layer – Beeb digi wonk
Microsoft says 'weird things' can happen during Windows Server 2003 migrations
Fix coming for bug that makes Kerberos croak when you run two domain controllers
Cisco says network virtualisation won't pay off everywhere
Another sign of strain in the Borg/VMware relationship?
VVOL update: Are any vendors NOT leaping into bed with VMware?
It's not yet been released but everyone thinks it's the dog's danglies
prev story


Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Boost IT visibility and business value
How building a great service catalog relieves pressure points and demonstrates the value of IT service management.
Why and how to choose the right cloud vendor
The benefits of cloud-based storage in your processes. Eliminate onsite, disk-based backup and archiving in favor of cloud-based data protection.
The Essential Guide to IT Transformation
ServiceNow discusses three IT transformations that can help CIO's automate IT services to transform IT and the enterprise.
Maximize storage efficiency across the enterprise
The HP StoreOnce backup solution offers highly flexible, centrally managed, and highly efficient data protection for any enterprise.