Big Data is now TOO BIG - and we're drowning in toxic information

Just why are we hoarding every last binary bit?


Open ... and Shut

Unless you have found a clever way of avoiding the internet completely, you no doubt have been warned that THERE IS A BIG DATA EXPLOSION! By many accounts, we are currently drowning in information - from log files to stock charts to customer profiles - and face a host of new products cropping up to help us manage the onslaught. Unfortunately, our fixation on hoarding and storing data may actually be making the problem worse, not better.

This is the message of Nassim Taleb's forthcoming book, Antifragile. Taleb made his name with the influential book The Black Swan, and his theory is bound to ruffle some feathers. As he explains in an excerpt from the book:

In business and economic decision-making, data causes severe side effects - data is now plentiful thanks to connectivity; and the share of spuriousness in the data increases as one gets more immersed into it. A not well-discussed property of data: it is toxic in large quantities - even in moderate quantities.

How can this be? We're told at every turn that more data equals better decisions. Yes, we need to parse all that binary to derive "actionable insights", which is the buzzphrase currently making the rounds of every Big Data startup's VC pitch deck. But once you do, ka-BOOM! Your business will immediately have superhero powers.

Except, of course, that it won't.

According to a Gartner survey, while the volume of corporate data is growing by upwards of 60 per cent each year, the vast majority of respondents (73 per cent) feel their competitors make better use of data than they do. And a mere 17 per cent reveal that they use more than 75 per cent of their data, which suggests most companies collect lots of data and have no clue what to do with them all.

But imagine what will happen when everyone uses data efficiently and to maximum potency: by definition, any competitive advantage will dissipate as all companies (and competitors) become Big Data maestros together. Of course, this will happen at different speeds for different companies, making the race to make sense of corporate data worthwhile.

But it still doesn't tackle Taleb's larger point: the more data we analyse, the more likely our insights from the data will be wrong. Quoting Taleb at length to ensure his point is not lost:

The more frequently you look at data, the more noise you are disproportionally likely to get (rather than the valuable part called the signal); hence the higher the noise to signal ratio. And there is a confusion, that is not psychological at all, but inherent in the data itself.

Say you look at information on a yearly basis, for stock prices or the fertilizer sales of your father-in-law’s factory, or inflation numbers in Vladivostock. Assume further that for what you are observing, at the yearly frequency the ratio of signal to noise is about one to one (say half noise, half signal) — it means that about half of changes are real improvements or degradations, the other half comes from randomness. This ratio is what you get from yearly observations.

But if you look at the very same data on a daily basis, the composition would change to 95 per cent noise, 5 per cent signal. And if you observe data on an hourly basis, as people immersed in the news and markets price variations do, the split becomes 99.5 per cent noise to .5 per cent signal. That is two hundred times more noise than signal — which is why anyone who listens to news (except when very, very significant events take place) is one step below sucker. ...

Now let’s add the psychological to this: we are not made to understand the point, so we overreact emotionally to noise. The best solution is to only look at very large changes in data or conditions, never small ones.
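For readers who want to see where numbers like these could come from, here is a minimal sketch in Python. It rests on an assumption that is ours, not something Taleb spells out in the excerpt: that the "signal" (a steady trend) accumulates in proportion to the time between looks, while the "noise" (random wobble) grows only with the square root of that time, as in a simple random walk. The function name `noise_share` and the one-to-one calibration at the yearly frequency are illustrative choices.

```python
import math


def noise_share(observations_per_year: float) -> float:
    """Fraction of an observed change that is noise, under a toy model.

    Assumption (ours, for illustration): signal grows linearly with the
    interval between observations, noise grows with its square root, and
    the two are equal in size when looking once a year.
    """
    dt = 1.0 / observations_per_year   # interval length, in years
    signal = dt                        # trend accumulated over the interval
    noise = math.sqrt(dt)              # random-walk spread over the interval
    return noise / (signal + noise)


if __name__ == "__main__":
    for label, n in [("yearly", 1), ("daily", 365), ("hourly", 365 * 24)]:
        print(f"{label:>6}: {noise_share(n):.1%} noise")
```

Under these assumptions the yearly split is half and half, the daily split comes out at roughly 95 per cent noise, and the hourly split at about 99 per cent. That is in the same ballpark as the figures quoted above, though the toy model does not reproduce the hourly number exactly.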

None of which is to suggest that there's no value in Big Data. One of the brand-name companies in Big Data, Cloudera, showcases a range of customer stories that describe ways real companies have derived real value from their data. (Disclosure: Cloudera's CEO is on the board of directors of my company, Nodeable.)

But let's not miss the trees for the forest. As Nick Carr writes, commenting on Taleb's findings: "Because we humans seem to be natural-born signal hunters, we're terrible at regulating our intake of information. We'll consume a ton of noise if we sense we may discover an added ounce of signal. So our instinct is at war with our capacity for making sense."

In other words, the problem isn't the data: it's our ability to know when we have enough data.

We're in the midst of a gold rush, when there's such a fever to collect data that we may be overextending ourselves. One former senior IT executive with one of Silicon Valley's largest web companies acknowledged that his company stores every log file - and does absolutely nothing with them. Never has, and likely never will. Some people suggest the answer is to start deleting this data to keep it manageable and to avoid security breaches. Maybe.

But perhaps a better solution would be to carefully consider which data are likely to be of use, and focus on these data. Yes, this runs the risk of overlooking data that could be useful but may not be immediately recognised as such. Splunk, after all, went public on the premise that machine-generated log files are a gold mine for insight into one's business and IT operations, a gold mine that many had previously overlooked.

But we're not currently struggling to collect data. The industry's big need right now is to parse data, and a big part of that surely must be paring down the amount of data we collect in the first place. ®

Matt Asay is senior vice president of business development at Nodeable, offering systems management for managing and analysing cloud-based data. He was formerly SVP of biz dev at HTML5 start-up Strobe and chief operating officer of Ubuntu commercial operation Canonical. With more than a decade spent in open source, Asay served as Alfresco's general manager for the Americas and vice president of business development, and he helped put Novell on its open source track. Asay is an emeritus board member of the Open Source Initiative (OSI). His column, Open...and Shut, appears three times a week on The Register.
