Isilon and a question of Big Data

Or was that ingestion?

El Reg: Would scale-out filers benefit from having deduplicated files, assuming that did not reduce performance?

Rob Peglar: In general, the answer is no. First, the assumption is incorrect; any data reduction technique of the three known (compression, deduplication, incrementalization) has a performance (time) implication. In addition, deduplication also has a space implication; the tradeoff is metadata versus data. At large scale, deduplication metadata becomes very significant. For example, holding hashes (CRCs) of each 4KB of data – a very common granularity – implies 4 trillion items of metadata for a data repository of small size, 4PB.

If each hash structure (CRC & disk pointer, i.e. given a hash, where is its data?) is only 64 bits, or 8 bytes, which is quite small, this means 32TB of hash metadata which must be completely consistent across all nodes at all times. One must not only store that 32TB of data in stable and protected storage, but that storage must also be very fast, nearly as fast as the CPU’s ability to ingest it. It’s cost-prohibitive to have each node with 32TB of RAM just to hold hashes. Plus, even if you did have 32TB of RAM, it also means the CPUs in each node having to read 16TB worth of metadata (in the worst case) for each and every write access to a file, no matter how small – to perform the dedup hash check – and that searching alone is non-trivial, taking significant time.
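As a rough check on the arithmetic in this answer, the sketch below computes the size of a deduplication index from the repository size, the chunk granularity and the per-entry size. The function name and the use of binary units are assumptions made for illustration; the 4KB granularity and 8-byte entry come from the interview.

```python
# Binary units (KiB, TiB, PiB) are an assumption; decimal units give similar results.
KIB, TIB, PIB = 2**10, 2**40, 2**50

def dedup_index_size(repo_bytes, chunk_bytes=4 * KIB, entry_bytes=8):
    """Return (number of chunk entries, total index size in bytes).

    Hypothetical helper: one fixed-size entry (hash plus disk pointer)
    per chunk of stored data.
    """
    entries = -(-repo_bytes // chunk_bytes)  # ceiling division
    return entries, entries * entry_bytes

entries_4pb, index_4pb = dedup_index_size(4 * PIB)
entries_16pb, index_16pb = dedup_index_size(16 * PIB)

print(f"4PB:  {entries_4pb:,} entries, {index_4pb / TIB:.0f}TB of index")
print(f"16PB: {entries_16pb:,} entries, {index_16pb / TIB:.0f}TB of index")
```

With these inputs, a 4PB repository works out at roughly one trillion entries and 8TB of index, while the 4 trillion items and 32TB quoted above correspond to about 16PB at the same granularity. Either way, the index is far larger than the RAM of a single node.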

The basic problem is that data is growing faster than the CPU’s ability to efficiently process the metadata for it. This is why at scale, deduplication is a non-optimal technique. It may save some ‘end’ space, but consider ‘big data’ as discussed before. This data is often highly unique and rarely can be deduplicated. For example, web hits and traffic from end users. Each end user is unique – by definition – and must be able to be identified as such to analytic software. Each hit is at a different point in time, always changing, always incrementing.

Constant streams of new data being ingested are therefore rarely duplicated across the user universe. So, for ‘big data’, deduplication is most often a bad trade-off – even if the CPU were infinitely fast, you wouldn’t save much space on disk. Contrast this with traditional VM or VDI usage, where OS images are immutable and mostly read-only; here, deduplication is a good trade-off. But that is not the problem the industry is trying to solve now. The problem is ingestion, analysis and long-term storage and protection of massive (and growing!) amounts of unique and ever-changing data.
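The contrast between cloned OS images and a stream of unique records can be illustrated with a toy chunk-level deduplication pass. This is a minimal sketch, not how Isilon or any particular product works: SHA-256 stands in for the hash, the data sets are synthetic, and the helper names are hypothetical.

```python
import hashlib

CHUNK = 4096  # 4KB granularity, as in the interview

def unique_chunk(i: int) -> bytes:
    """Build a deterministic chunk whose content depends only on i (synthetic data)."""
    return i.to_bytes(8, "big") * (CHUNK // 8)

def dedup_ratio(data: bytes, chunk: int = CHUNK) -> float:
    """Logical chunks divided by unique chunks; 1.0 means no space is saved."""
    seen = set()
    total = 0
    for offset in range(0, len(data), chunk):
        seen.add(hashlib.sha256(data[offset:offset + chunk]).digest())
        total += 1
    return total / len(seen) if seen else 1.0

# Assumed "VDI-like" data: 50 clones of the same 64-chunk image.
image = b"".join(unique_chunk(i) for i in range(64))
vdi_like = image * 50

# Assumed "big data-like" stream: the same volume, but every chunk is different.
stream = b"".join(unique_chunk(i) for i in range(64 * 50))

vdi_ratio = dedup_ratio(vdi_like)
stream_ratio = dedup_ratio(stream)
print(f"cloned images: {vdi_ratio:.1f}x, unique stream: {stream_ratio:.1f}x")
```

The cloned images collapse to a single copy of each chunk, while the unique stream saves nothing yet still pays for one index entry per chunk.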

El Reg: Could a company have a single logical file store with big data being a sectioned off part of that, rather than having a physically separate big data silo?

Rob Peglar: Most certainly. The ‘sectioning’ is trivial – it could be as simple as a single directory, under which all ‘big data’ is held. Isilon has a huge advantage here, in that the entire filesystem is coherent and sits under a single name – ‘ifs’. One could easily have /ifs/bigdata for all analytic data, and then /ifs/users for home directories and such, /ifs/exchange for mail, and so on. Each directory has its own attributes (metadata) regarding protection levels, tiering/residence, movement, QoS, replication, snaps, and so on.

One realizes the advantages of having multiple file systems for different purposes without the management nightmare of having to administer hundreds or thousands of different filesystems under different mount points, held on different nodes, and so on. At scale, there is a clear advantage to single namespace and single filesystem.

El Reg: It was surprising to find out that deduplication was not a useful technology for big data. The dismissal of tape as the best big data archive medium was also interesting to hear. We wonder if big data system vendors such as IBM and Oracle, with tape libraries in their product portfolio, will have the same view.

Also, flash is set to play an increasingly important role in big data storage, as it will in enterprise storage generally. Lots of grist here for Rob Peglar's Isilon mill to grind out for customers as he undertakes the CTO Americas role. ®
