Big Data is now TOO BIG - and we're drowning in toxic information

Just why are we hoarding every last binary bit?

Open ... and Shut

Unless you have found a clever way of avoiding the internet completely, you no doubt have been warned that THERE IS A BIG DATA EXPLOSION! By many accounts, we are currently drowning in information - from log files to stock charts to customer profiles - and face a host of new products cropping up to help us manage the onslaught. Unfortunately, our fixation on hoarding and storing data may actually be making the problem worse, not better.

This is the message of Nassim Taleb's forthcoming book, Antifragile. Taleb made his name with the influential book The Black Swan, and his theory is bound to ruffle some feathers. As he explains in an excerpt from the book:

In business and economic decision-making, data causes severe side effects - data is now plentiful thanks to connectivity; and the share of spuriousness in the data increases as one gets more immersed into it. A not well-discussed property of data: it is toxic in large quantities - even in moderate quantities.

How can this be? We're told at every turn that more data equals better decisions. Yes, we need to parse all that binary to derive "actionable insights", which is the buzzphrase currently making the rounds of every Big Data startup's VC pitch deck. But once you do, ka-BOOM! Your business will immediately have superhero powers.

Except, of course, that it won't.

According to a Gartner survey [PDF], while the volume of corporate data is growing by upwards of 60 per cent each year, the vast majority of respondents (73 per cent) feel their competitors make better use of data than they do. And a mere 17 per cent reveal that they use more than 75 per cent of their data, which suggests most companies collect lots of data and have no clue what to do with them all.

But imagine what will happen when everyone uses data efficiently and to maximum effect: by definition, any competitive advantage will dissipate as all companies become Big Data maestros together. Of course, this will happen at different speeds for different companies, which keeps the race to make sense of corporate data worthwhile.

But that race still doesn't address Taleb's larger point: the more data we analyse, the more likely our insights from the data will be wrong. Quoting Taleb at length to ensure his point is not lost:

The more frequently you look at data, the more noise you are disproportionally likely to get (rather than the valuable part called the signal); hence the higher the noise to signal ratio. And there is a confusion, that is not psychological at all, but inherent in the data itself.

Say you look at information on a yearly basis, for stock prices or the fertilizer sales of your father-in-law’s factory, or inflation numbers in Vladivostok. Assume further that for what you are observing, at the yearly frequency the ratio of signal to noise is about one to one (say half noise, half signal) - it means that about half of changes are real improvements or degradations, the other half comes from randomness. This ratio is what you get from yearly observations.

But if you look at the very same data on a daily basis, the composition would change to 95 per cent noise, 5 per cent signal. And if you observe data on an hourly basis, as people immersed in the news and markets price variations do, the split becomes 99.5 per cent noise to .5 per cent signal. That is two hundred times more noise than signal — which is why anyone who listens to news (except when very, very significant events take place) is one step below sucker. ...

Now let’s add the psychological to this: we are not made to understand the point, so we overreact emotionally to noise. The best solution is to only look at very large changes in data or conditions, never small ones.
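Taleb's excerpt does not say how he arrives at those percentages. One way to get a feel for the effect is a minimal sketch under an assumption that is ours, not his: treat the signal as a steady drift that grows in proportion to the length of the observation interval, and the noise as a random walk whose spread grows with the square root of that interval. The function name `signal_share` and its parameters are hypothetical, chosen only for illustration.

```python
import math


def signal_share(observations_per_year, yearly_signal_to_noise=1.0):
    """Share of an observed change that is signal rather than noise.

    Assumption (not from Taleb's excerpt): signal is a steady drift that
    scales with the interval length, while noise scales with its square
    root, as in a random walk. The yearly ratio defaults to one to one.
    """
    dt = 1.0 / observations_per_year        # interval length, in years
    signal = yearly_signal_to_noise * dt    # drift accumulated over the interval
    noise = math.sqrt(dt)                   # random-walk spread over the interval
    return signal / (signal + noise)


# Compare yearly, daily and hourly observation of the same process.
for label, n in [("yearly", 1), ("daily", 365), ("hourly", 365 * 24)]:
    share = signal_share(n)
    print(f"{label:>6}: {share:6.1%} signal, {1 - share:6.1%} noise")
```

Under this toy model the yearly split is half and half, the daily split lands at roughly 5 per cent signal, and the hourly split at roughly 1 per cent. That is the same direction and order of magnitude as the figures Taleb quotes, though not identical to his 0.5 per cent, so he is presumably working from somewhat different assumptions.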

None of which is to suggest that there's no value in Big Data. One of the brand-name companies in Big Data, Cloudera, showcases a range of customer stories that describe ways real companies have derived real value from their data. (Disclosure: Cloudera's CEO is on the board of directors of my company, Nodeable.)

But let's not miss the trees for the forest. As Nick Carr writes, commenting on Taleb's findings: "Because we humans seem to be natural-born signal hunters, we're terrible at regulating our intake of information. We'll consume a ton of noise if we sense we may discover an added ounce of signal. So our instinct is at war with our capacity for making sense."

In other words, the problem isn't the data: it's our ability to know when we have enough data.

We're in the midst of a gold rush, with such a fever to collect data that we may be overextending ourselves. One former senior IT executive with one of Silicon Valley's largest web companies acknowledged that his company stores every log file - and does absolutely nothing with them. It never has, and likely never will. Some people suggest the answer is to start deleting this data to keep it manageable and to avoid security breaches. Maybe.

But perhaps a better solution would be to carefully consider which data are likely to be of use, and focus on these data. Yes, this runs the risk of overlooking data that could be useful but may not be immediately recognised as such. Splunk, after all, went public on the premise that log files from machine data are a gold mine for insight into one's business and IT operations, a gold mine that many had previously overlooked.

But we're not currently struggling to collect data. The industry's big need right now is to parse data, and a big part of that surely must be paring down the amount of data we collect in the first place.

Matt Asay is senior vice president of business development at Nodeable, offering systems management for managing and analysing cloud-based data. He was formerly SVP of biz dev at HTML5 start-up Strobe and chief operating officer of Ubuntu commercial operation Canonical. With more than a decade spent in open source, Asay served as Alfresco's general manager for the Americas and vice president of business development, and he helped put Novell on its open source track. Asay is an emeritus board member of the Open Source Initiative (OSI). His column, Open...and Shut, appears three times a week on The Register.
