Big Data: Why it's not always that big nor even that clever
And as for data scientists being sexy, well...
Can you imagine your bank calculating your current account using tweets and Facebook posts?
The other change contributing to the purported Big Data paradigm-shift is an explosion in the variety of data. As mentioned above, companies like Google and Facebook have to juggle and create “graphs” for profiles or demographic information from a very large number of sources in as many formats. But this certainly isn’t true of everyone. When people talk about these new, messy collections of unstructured data, they’re almost always referring to data sourced from social networks and blogs.
Will core systems used at banks (to do transaction-processing, say, an area still completely dominated by old-school relational databases) really need to use social-media data? What about inventory systems, or digital catalogs, or systems used by cancer researchers? And what about data that, for whatever reason, can’t be processed in a distributed, stateless way?
Highly unstructured data still occupies a specialized and relatively tiny niche, though it’s a very flashy one. And it’s not as if no system in the world has had to do parsing, translating, or pre-processing to merge data from multiple sources before.
If a company that’s been around for years suddenly argues that it needs Big Data techniques to run its business, it must mean that either the nature of that business has changed fundamentally overnight or it's been hobbling along forever with systems that don’t quite work. Either of those claims would be hard to believe.
Make your system scalable and, hey presto, you can do Big Data
The term Big Data is often used sloppily, if not downright incorrectly. The mere increase in the amount of data being made available for a given application, no matter how big that increase may be, doesn’t automatically make it a Big Data application. The system will need to handle more capacity, but that might require nothing more than a few design tweaks, and maybe not even that if the system was designed to be scalable.
One article I read recently on the subject of textual analysis (looking for patterns across all books by a given author, for example) gave the impression that this was something that had never been done before, because it couldn’t be.
But computer-based textual analysis has been going on forever. I remember being fascinated by studies of the frequencies and patterns of words in Shakespeare’s plays in the 1970s. What is new, if anything, is the number of texts available in digital form, which may itself have spurred interest in super-large-scale book-crunching.
If these texts had been available 20 years ago, and there was interest then in looking for these kinds of patterns, it’s almost inconceivable that scientists would have thrown up their hands and said, “we just can’t do it with the technology we have.” And even without knowing the details of how these kinds of analyses are being done today, I find it very unlikely that there aren’t traditional sorting and searching algorithms used somewhere in the code. There’s nothing inherently Big Data about this.
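To make the point concrete: the core of a word-frequency study of the kind done on Shakespeare's plays decades ago needs nothing more exotic than counting and sorting. A minimal sketch (the sample text and function names are illustrative, not taken from any particular study):

```python
from collections import Counter
import re

def word_frequencies(text, top_n=5):
    """Count word frequencies and return the top_n most common.
    Nothing fancier than hashing and sorting is involved -- the
    'traditional sorting and searching algorithms' of the trade."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

sample = (
    "To be, or not to be, that is the question: "
    "Whether 'tis nobler in the mind to suffer"
)
print(word_frequencies(sample, top_n=3))  # 'to' leads with 3 occurrences
```

Scale the corpus up a millionfold and you may need more machines, but the algorithmic idea is unchanged.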
As for the gesture-level information being collected by Netflix for marketing or advertising purposes (say, what parts of a movie people were likely to skip over, where they paused, or what scenes they watched repeatedly), that data is being analyzed and possibly used for nefarious purposes simply because it’s available.
If DVD-rental companies could somehow have seized that kind of low-level information in the 1990s they probably would have been capable of analyzing it, if they’d chosen to. What’s responsible for this disturbing trend is the ability of movie-rental companies to capture gesture information, their relatively new interest in using it, and the collaboration between companies to fit their individual pieces of these social puzzles together.
“Data” hasn’t become the root of all evil overnight, any more than it’s become the only thing that matters. And blaming Big Data for everything that’s wrong with the world is no better than scrapping all your existing, non-Big Data technology because it’s suddenly “obsolete.” ®
You're kidding, right?
"If a company that’s been around for years suddenly argues that it needs Big Data techniques to run its business, it must mean that either [...] or it's been hobbling along forever with systems that don’t quite work. Either of those claims would be hard to believe."
The second is all too believable, and is keeping me in a job right now ...
It isn't what you've got, it's how you use it.
I talked to a recruitment consultant a while ago who pointed out that all the recruitment companies have gone "big data." That is, they do word frequency analysis on CVs and just search on a big pile of "stuff" and take the top CVs on the list.
So now you have to keep repeating keywords, add abbreviations in brackets and that sort of thing to make sure your CV ends up on page 1 of the search results.
They have replaced personal knowledge and relationships with a technical solution which will inevitably lead to poorer quality but greater quantity of words in people's CVs. I'd be surprised if people weren't already using white-on-white text to bump their CV's visibility to the search engine.
By destructuring the data they've increased their storage costs, removed information from the system and now they have to keep tweaking the systems to stop them being gamed. Sending a slightly irrelevant advert to someone is one thing, but making business decisions about personnel suitability based on this stuff is dangerous. The reason we have structure is because it organises data into easily understood information. A word-cloud from a comment box might be fine for an initial analysis of what people are talking about, but it doesn't tell you what they are saying - the data is there, the information has been removed.
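The gaming incentive the commenter describes falls straight out of the scoring model. A hypothetical sketch of naive keyword-frequency ranking (the function and sample CVs are invented for illustration; real recruitment tools are more elaborate, but the incentive is the same):

```python
def keyword_score(cv_text, keywords):
    """Naive ranking: score a CV by raw keyword frequency.
    Repetition is rewarded; meaning is ignored."""
    words = cv_text.lower().split()
    return sum(words.count(k.lower()) for k in keywords)

keywords = ["python", "sql"]
honest_cv = "Five years of Python development with SQL reporting"
stuffed_cv = "Python Python Python SQL SQL Python delivery work"

# The keyword-stuffed CV outranks the honest one, despite
# carrying less actual information.
print(keyword_score(honest_cv, keywords))   # 2
print(keyword_score(stuffed_cv, keywords))  # 6
```

Once candidates know the score is a word count, padding the document is the rational move, which is exactly the quantity-over-quality spiral described above.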
Every client I've ever worked for has used one or both of the phrases "We handle a huge amount of data" and "I bet you've never seen it this bad". Almost without exception they're processing a very normal amount of data, sometimes in very inefficient ways. The science behind MapReduce is far more important than that specific technology - often there are equally useful techniques that are better suited to a client's needs, however much they might want to install Hadoop etc.
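The "science behind MapReduce" the commenter alludes to — map, shuffle, reduce — can be sketched in a few lines of plain Python, without Hadoop or any cluster at all. This is a toy single-machine illustration of the pattern, not a claim about any particular framework's API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework's
    shuffle/sort step would between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's grouped values -- here, by summing."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big hype", "big claims"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])  # 3
```

The point stands: the technique is older and more general than any one product, and a client whose data fits on one machine may be better served by exactly this kind of simple pipeline than by a Hadoop installation.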
"You may not realize it, but data is far and away the most critical element in any computer system. Data is all-important. It’s the center of the universe."
If you don't realise it then you probably shouldn't be working in IT.
+1 for sarky quotes. Later I'll buy a hot drink from a "coffee scientist".