Big Data: Why it's not always that big nor even that clever
And as for data scientists being sexy, well...
You may not realize it, but data is far and away the most critical element in any computer system. Data is all-important. It’s the center of the universe.
A managing director at JPMorgan Chase was quoted as calling data “the lifeblood of the company.” A major tech conference held recently (with data as its primary focus) included a presentation on how to become “a data-driven organization.”
The Harvard Business Review says “data scientist” will be “the sexiest job of the 21st century.” A separate recent article describing how Netflix is harvesting information about our every gesture, and may transform us from “happy subscribers to mindless puppets”, warned that “the sheer amount of data available to crunch is already phenomenal and is growing at an extraordinary rate.”
Reckless, clueless uses of the term 'Big Data'
All the above quotes come from articles touting, selling or gaping in awe at Big Data, this year’s Mother of All Tech Trends. If you’re a technologist, it’s easy to feel a little inadequate if you’re not singing its praises, which is all the more bewildering because no one seems to know exactly what it is. Well, that’s not quite true. Big Data, strictly speaking, is the product of several forces:
- The vast increase in the quantity of information being collected (and stored, and processed, and analyzed) due to the insatiable appetite of Big Brothers including Google, Facebook and Amazon.
- The heterogeneous nature of this information, which can come from online purchases, Facebook status updates, tweets, shared photos, and check-ins, among other places.
- The demand to crunch these mountains of data as quickly and efficiently as possible.
However, the term gets thrown around recklessly and cluelessly more often than not and, even when it’s used appropriately, is applied much more widely than is warranted.
The three forces mentioned above are real. Google is trying to suck up every bit of information it possibly can, from whatever source, in an attempt to create profiles on as many people as possible. (This can be for good or evil: from products such as Google Now that “anticipate your needs before you do” to selling you as a package to advertisers. But I won’t get into the moral issues here.)
By definition, that torrent of data from every source in the world is not going to be neat, uniform and rectangular. So, yes, Google probably needs a special set of tools to deal with this data, which may be unlike any data processed in the past in volume and variety.
The best-known of these tools are MapReduce, a programming model popularized by Google in which a “map” step turns input records into key/value pairs and a “reduce” step combines the values for each key, and Hadoop, an open-source framework for distributed storage and processing that implements the same model. Using tools like these, a company such as Google can break massive datasets into manageable chunks and process those chunks independently and statelessly on a server farm.
Is it true that this kind of data can’t be managed easily, quickly and without painful pre-processing using a relational database, the designated dinosaur of the Big Data crowd? Possibly.
Google's special needs
Is MapReduce the game-changing data-consolidation technology that its champions claim it is? Almost definitely not: The legitimacy of Google’s patent on the process has been questioned on the grounds that existing products can easily perform the same relatively simple functions. Basic MapReduce examples published on the web consist of a few dozen lines of Java code. There’s nothing particularly revolutionary going on here.
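To give a sense of how little is involved, here is a minimal sketch of the map, shuffle and reduce steps as a word count in plain Python. It runs in a single process and is not Google’s implementation or the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all the values seen for one key."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped on its own and shares no state with the others,
    # which is what would let a cluster hand chunks to separate machines.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is data"]))
```

A real framework adds distribution, scheduling and fault tolerance on top of this, but the logic a programmer writes is roughly this size.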
But let’s assume Google requires these tools to meet its very special needs. And let’s assume all existing tools and database frameworks are inadequate for its purposes. That doesn’t mean Big Data is something that (as its proponents claim) nearly every organization running a big-ish computer application has to confront and deal with using new database and software models.
Large quantities of data, even huge quantities of data, are nothing new. In the investment-banking world, high-frequency-trading systems have always had to handle tremendous numbers of transactions at speeds measured in microseconds; market-data engines that store and process thousands of price ticks per second have existed for years.
Speaking recently to my friend Ken Caldeira, who runs a climate-science lab at Stanford’s Carnegie Institution for Science, I found out, not surprisingly, that he regularly has to deal with “petabytes of data.” Another colleague of mine, a Wall St. quant trained as a physicist who spent several years doing genome work in the 2000s, claims that in his genomics research there were “staggering amounts” of data to analyze.
In the era of Big Data, larger-than-ever datasets are often cited as an issue that nearly everyone has to contend with, and for which the previous generation of tools is practically useless.
But for the most part, Caldeira and my quant friend use… Python scripts and C++. It’s true that many huge data-consumers now make use of massively parallel architecture, clusters, and the cloud, but this move has been going on for more than a decade and, as my quant friend points out, “people confuse doing things in the cloud with what you do in the cloud. Just because the data is in the cloud doesn’t mean you’re doing something different.” Using distributed databases for speed and redundancy makes sense no matter what kind of work you’re doing, given the ever-plummeting cost of hardware.
Can you imagine your bank calculating your current account using tweets and Facebook posts?
The other change contributing to the purported Big Data paradigm-shift is an explosion in the variety of data. As mentioned above, companies like Google and Facebook have to juggle and create “graphs” for profiles or demographic information from a very large number of sources in as many formats. But this certainly isn’t true of everyone. When people talk about these new, messy collections of unstructured data, they’re almost always referring to data sourced from social networks and blogs.
Will core systems used at banks (to do transaction-processing, say, an area still completely dominated by old-school relational databases) really need to use social-media data? What about inventory systems, or digital catalogs, or systems used by cancer researchers? And what about data that, for whatever reason, can’t be processed in a distributed, stateless way?
Highly unstructured data still occupies a specialized and relatively tiny niche, though it’s a very flashy one. And it’s not as if no system in the world has had to do parsing, translating, or pre-processing to merge data from multiple sources before.
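As a rough illustration of that kind of everyday pre-processing, the sketch below normalizes records from a CSV export and a JSON feed into one shape and totals them per customer. The field names (`customer_id`, `amount`, `user`, `total_cents`) and the helper functions are hypothetical, chosen only to show two sources that disagree on format.

```python
import csv
import io
import json


def from_csv(text):
    """Parse CSV rows with hypothetical columns: customer_id, amount."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"customer": row["customer_id"], "amount": float(row["amount"])}


def from_json(text):
    """Parse a JSON list with hypothetical fields: user, total_cents."""
    for item in json.loads(text):
        yield {"customer": item["user"], "amount": item["total_cents"] / 100}


def merge_totals(*sources):
    """Sum the amount per customer across every normalized source."""
    totals = {}
    for source in sources:
        for record in source:
            key = record["customer"]
            totals[key] = totals.get(key, 0.0) + record["amount"]
    return totals


if __name__ == "__main__":
    csv_text = "customer_id,amount\na1,10.50\nb2,3.00\n"
    json_text = '[{"user": "a1", "total_cents": 250}]'
    print(merge_totals(from_csv(csv_text), from_json(json_text)))
```

Each source gets its own small translator, and everything downstream sees a single format. Nothing here depends on the data being “big”.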
If a company that’s been around for years suddenly argues that it needs Big Data techniques to run its business, it must mean that either the nature of that business has changed fundamentally overnight or it's been hobbling along forever with systems that don’t quite work. Either of those claims would be hard to believe.
Make your system scalable and, hey presto, you can do Big Data
The term Big Data is often used sloppily, if not downright incorrectly. The mere increase in the amount of data being made available for a given application, no matter how big that increase may be, doesn’t automatically make it a Big Data application. The system will need to handle more capacity, but that might require nothing more than a few design tweaks, and maybe not even that if the system was designed to be scalable.
One article I read recently on the subject of textual analysis (looking for patterns across all books by a given author, for example) gave the impression that this was something that had never been done before, because it couldn’t be.
But computer-based textual analysis has been going on forever. I remember being fascinated by studies of the frequencies and patterns of words in Shakespeare’s plays in the 1970s. What is new, if anything, is the number of texts available in digital form, which may itself have spurred interest in super-large-scale book-crunching.
If these texts had been available 20 years ago, and there had been interest then in looking for these kinds of patterns, it’s almost inconceivable that scientists would have thrown up their hands and said, “we just can’t do it with the technology we have.” And even without knowing the details of how these kinds of analysis are being done today, I find it very unlikely that there aren’t traditional sorting and searching algorithms used somewhere in the code. There’s nothing inherently Big Data about this.
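For what it’s worth, the core of a word-frequency study fits in a few lines and leans on exactly those traditional tools: counting and sorting. The sketch below is only an assumption about what such an analysis might look like, not a description of any particular study; `word_frequencies` and `top_words` are illustrative names.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def top_words(text, n=3):
    """Return the n most frequent words, ties broken alphabetically."""
    counts = word_frequencies(text)
    # A plain sort: descending count first, then the word itself.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    print(top_words("To be, or not to be, that is the question"))
```

Scaling this to a whole library of texts means more input and more machines, not a different kind of algorithm.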
As for the gesture-level information being collected by Netflix for marketing or advertising purposes (say, what parts of a movie people were likely to skip over, where they paused, or what scenes they watched repeatedly), that data is being analyzed and possibly used for nefarious purposes simply because it’s available.
If DVD-rental companies could somehow have seized that kind of low-level information in the 1990s they probably would have been capable of analyzing it, if they’d chosen to. What’s responsible for this disturbing trend is the ability of movie-rental companies to capture gesture information, their relatively new interest in using it, and the collaboration between companies to fit their individual pieces of these social puzzles together.
“Data” hasn’t become the root of all evil overnight, any more than it’s become the only thing that matters. And blaming Big Data for everything that’s wrong with the world is no better than scrapping all your existing, non-Big Data technology because it’s suddenly “obsolete.” ®