Isilon and a question of Big Data

Or was that ingestion?

3 Big data security analytics techniques

El Reg Would scale-out filers benefit from having deduplicated files, assuming that did not reduce performance?

Rob Peglar: In general, the answer is no. First, the assumption is incorrect; any data reduction technique of the three known (compression, deduplication, incrementalization) has a performance (time) implication. In addition, deduplication also has a space implication; the tradeoff is metadata versus data. At large scale, deduplication metadata becomes very significant.  For example, holding hashes (CRCs) of each 4KB of data – a very common granularity – implies 4 trillion items of metadata for a data repository of small size, 4PB.

If each hash structure (CRC & disk pointer, i.e. given a hash, where is its data?) is only 64 bits, or 8 bytes, which is quite small, this means 32TB of hash metadata which must be completely consistent across all nodes at all times. One must not only store that 32TB of data in stable and protected storage, but that storage must also be very fast, nearly as fast as the CPU’s ability to ingest it. It’s cost-prohibitive to have each node with 32TB of RAM just to hold hashes. Plus, even if you did have 32TB of RAM, it also means the CPUs in each node having to read 16TB worth of metadata (in the worst case) for each and every write access to a file, no matter how small – to perform the dedup hash check - and that searching alone is non-trivial, taking significant time.

The basic problem is that data is growing faster than the CPU’s ability to efficiently process the metadata for it. This is why at scale, deduplication is a non-optimal technique.  It may save some ‘end’ space, but consider ‘big data’ as discussed before.  This data is often highly unique and rarely can be deduplicated.  For example, web hits and traffic from end users. Each end user is unique – by definition – and must be able to be identified as such to analytic software. Each hit is at a different point in time, always changing, always incrementing.

Constant streams of new data being ingested are therefore rarely duplicated across the user universe. So, for ‘big data’, deduplication is most often a bad trade-off – even if the CPU were infinitely fast, you wouldn’t save much space on disk. Contrast this with traditional VM or VDI usage, where OS images are immutable and mostly read-only; here, deduplication is a good trade-off. But that is not the problem the industry is trying to solve now. The problem is ingestion, analysis and long-term storage and protection of massive (and growing!) amounts of unique and ever-changing data.

El Reg Could a company have a single logical file store with big data being a sectioned off part of that, rather than having a physically separate big data silo?

Rob Peglar: Most certainly. The ‘sectioning’ is trivial – it could be as simple as a single directory, under which all ‘big data’ is held. Isilon has a huge advantage here, in that the entire filesystem is coherent and sits under a single name – ‘ifs’. One could easily have /ifs/bigdata for all analytic data, and then /ifs/users for home directories and such, /ifs/exchange for mail, and so on. Each directory has its own attributes (metadata) regarding protection levels, tiering/residence, movement, QoS, replication, snaps, and so on.

One realizes the advantages of having multiple file systems for different purposes without the management nightmare of having to administer hundreds or thousands of different filesystems under different mount points, held on different nodes, and so on. At scale, there is a clear advantage to single namespace and single filesystem.

El Reg It was surprising to find out that deduplication was not a useful technology for big data. The dismissal of tape as the best big data archive media was also interesting to hear. We wonder if big data system vendors such as IBM and Oracle, with tape libraries in their product portfolio, will have the same view.

Also, flash is set to play an increasingly important role in big data storage as it will in enterprise storage, generally. Lots of grist here for Rob Peglar's Isilon mill to grind out for customers as he undertakes the CTO America's role. ®

SANS - Survey on application security programs

More from The Register

next story
This time it's 'Personal': new Office 365 sub covers just two devices
Redmond also brings Office into Google's back yard
Kingston DataTraveler MicroDuo: Turn your phone into a 72GB beast
USB-usiness in the front, micro-USB party in the back
Dropbox defends fantastically badly timed Condoleezza Rice appointment
'Nothing is going to change with Dr. Rice's appointment,' file sharer promises
BOFH: Oh DO tell us what you think. *CLICK*
$%%&amp Oh dear, we've been cut *CLICK* Well hello *CLICK* You're breaking up...
Bored with trading oil and gold? Why not flog some CLOUD servers?
Chicago Mercantile Exchange plans cloud spot exchange
Just what could be inside Dropbox's new 'Home For Life'?
Biz apps, messaging, photos, email, more storage – sorry, did you think there would be cake?
IT bods: How long does it take YOU to train up on new tech?
I'll leave my arrays to do the hard work, if you don't mind
prev story


Designing a defence for mobile apps
In this whitepaper learn the various considerations for defending mobile applications; from the mobile application architecture itself to the myriad testing technologies needed to properly assess mobile applications risk.
3 Big data security analytics techniques
Applying these Big Data security analytics techniques can help you make your business safer by detecting attacks early, before significant damage is done.
Five 3D headsets to be won!
We were so impressed by the Durovis Dive headset we’ve asked the company to give some away to Reg readers.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Securing web applications made simple and scalable
In this whitepaper learn how automated security testing can provide a simple and scalable way to protect your web applications.