Isilon and a question of Big Data

Or was that ingestion?

El Reg: Would scale-out filers benefit from having deduplicated files, assuming that did not reduce performance?

Rob Peglar: In general, the answer is no. First, the assumption is incorrect; each of the three known data reduction techniques (compression, deduplication, incrementalization) has a performance (time) implication. In addition, deduplication also has a space implication; the tradeoff is metadata versus data. At large scale, deduplication metadata becomes very significant. For example, holding hashes (CRCs) of each 4KB of data – a very common granularity – implies 1 trillion items of metadata for a data repository of small size, 4PB.

If each hash structure (CRC & disk pointer, i.e. given a hash, where is its data?) is only 64 bits, or 8 bytes, which is quite small, this means 8TB of hash metadata which must be completely consistent across all nodes at all times. One must not only store that 8TB of data in stable and protected storage, but that storage must also be very fast, nearly as fast as the CPU’s ability to ingest it. It’s cost-prohibitive to have each node with 8TB of RAM just to hold hashes. Plus, even if you did have 8TB of RAM, it also means the CPUs in each node having to read 4TB worth of metadata (in the worst case) for each and every write access to a file, no matter how small, to perform the dedup hash check. That searching alone is non-trivial, taking significant time.
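The sizing argument above comes down to a few lines of arithmetic. The following is a minimal sketch, assuming decimal units and exactly one fixed-size entry per chunk, with no overhead for the index structure itself. The function name and unit constants are illustrative and do not come from the interview.

```python
# Back-of-the-envelope sizing of a deduplication hash index.
# Assumptions: decimal units (1 KB = 10**3 bytes), one fixed-size entry
# per chunk, and no overhead for the index data structure itself.

KB = 10**3
TB = 10**12
PB = 10**15


def dedup_index_size(repo_bytes, chunk_bytes, entry_bytes):
    """Return (number of index entries, total index size in bytes)."""
    entries = repo_bytes // chunk_bytes
    return entries, entries * entry_bytes


# 4KB chunks and 8-byte entries, as in the interview; the index grows
# linearly with the size of the repository.
for repo_pb in (4, 16):
    entries, index_bytes = dedup_index_size(repo_pb * PB, 4 * KB, 8)
    print(f"{repo_pb} PB repository: {entries:.0e} entries, "
          f"{index_bytes // TB} TB of hash metadata")
```

At 4KB granularity with 8-byte entries, the index is always 1/512 of the data it describes: 4PB of data needs 1 trillion entries (8TB), and 16PB needs 4 trillion entries (32TB).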

The basic problem is that data is growing faster than the CPU’s ability to efficiently process the metadata for it. This is why at scale, deduplication is a non-optimal technique. It may save some ‘end’ space, but consider ‘big data’ as discussed before. This data is often highly unique and rarely can be deduplicated. Take, for example, web hits and traffic from end users. Each end user is unique – by definition – and must be able to be identified as such to analytic software. Each hit is at a different point in time, always changing, always incrementing.

Constant streams of new data being ingested are therefore rarely duplicated across the user universe. So, for ‘big data’, deduplication is most often a bad trade-off – even if the CPU were infinitely fast, you wouldn’t save much space on disk. Contrast this with traditional VM or VDI usage, where OS images are immutable and mostly read-only; here, deduplication is a good trade-off. But that is not the problem the industry is trying to solve now. The problem is ingestion, analysis and long-term storage and protection of massive (and growing!) amounts of unique and ever-changing data.
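The contrast between cloned OS images and unique ingest streams can be illustrated with a toy fixed-size chunk deduplicator. This is a sketch under stated assumptions: SHA-256 stands in for whatever hash a real system would use, the synthetic data generator `make_chunk` is hypothetical, and real images and event logs are far less clean than these stand-ins.

```python
import hashlib

CHUNK = 4096  # 4KB chunks, the granularity mentioned in the interview


def dedup_ratio(data, chunk_size=CHUNK):
    """Logical chunks divided by unique chunks (1.0 means nothing is saved)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(c).digest() for c in chunks}
    return len(chunks) / len(unique)


def make_chunk(n):
    """Hypothetical helper: a 4KB chunk whose content depends only on n."""
    return n.to_bytes(8, "big") * (CHUNK // 8)


# Stand-in for 20 clones of one OS image: the same 64 chunks repeated.
golden_image = b"".join(make_chunk(n) for n in range(64))
vdi_store = golden_image * 20

# Stand-in for an ingest stream of unique events: every chunk differs.
event_stream = b"".join(make_chunk(n) for n in range(64 * 20))

print(f"VDI clones:   {dedup_ratio(vdi_store):.1f}x")
print(f"event stream: {dedup_ratio(event_stream):.1f}x")
```

Both stores hold the same number of chunks and therefore pay the same hashing and lookup cost, but only the cloned images give any space back.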

El Reg: Could a company have a single logical file store with big data being a sectioned off part of that, rather than having a physically separate big data silo?

Rob Peglar: Most certainly. The ‘sectioning’ is trivial – it could be as simple as a single directory, under which all ‘big data’ is held. Isilon has a huge advantage here, in that the entire filesystem is coherent and sits under a single name – ‘ifs’. One could easily have /ifs/bigdata for all analytic data, and then /ifs/users for home directories and such, /ifs/exchange for mail, and so on. Each directory has its own attributes (metadata) regarding protection levels, tiering/residence, movement, QoS, replication, snaps, and so on.

One realizes the advantages of having multiple file systems for different purposes without the management nightmare of having to administer hundreds or thousands of different filesystems under different mount points, held on different nodes, and so on. At scale, there is a clear advantage to single namespace and single filesystem.
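The idea of one namespace with per-directory attributes can be sketched as a simple policy lookup. Everything in this example is hypothetical: the attribute names, their values, and the rule that deeper directories inherit from and override their parents are assumptions made for illustration, not a description of how Isilon stores or resolves its settings.

```python
from pathlib import PurePosixPath

# Hypothetical per-directory policies. The attribute names and values are
# placeholders for illustration, not real filesystem settings.
POLICIES = {
    "/ifs": {"protection": "default", "snapshots": False},
    "/ifs/bigdata": {"protection": "high", "tier": "capacity"},
    "/ifs/users": {"snapshots": True},
    "/ifs/exchange": {"tier": "performance", "replication": True},
}


def effective_policy(path):
    """Merge policies from the root down, so deeper directories override."""
    p = PurePosixPath(path)
    merged = {}
    # p.parents runs from the nearest parent up to "/", so reverse it
    # and finish with the path itself.
    for ancestor in list(p.parents)[::-1] + [p]:
        merged.update(POLICIES.get(str(ancestor), {}))
    return merged


print(effective_policy("/ifs/bigdata/clickstream/2014-10.log"))
print(effective_policy("/ifs/users/alice"))
```

All paths live under one root, so there is a single place to administer, while each subtree still ends up with its own effective settings.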

El Reg: It was surprising to find out that deduplication was not a useful technology for big data. The dismissal of tape as the best big data archive medium was also interesting to hear. We wonder if big data system vendors such as IBM and Oracle, with tape libraries in their product portfolios, will have the same view.

Also, flash is set to play an increasingly important role in big data storage, as it will in enterprise storage generally. Lots of grist here for Rob Peglar's Isilon mill to grind out for customers as he undertakes the CTO Americas role. ®
