Tell me, professor, what is big data?
A whole different dimension, says data scientist Mark Whitehorn
Big Data may be misunderstood and overhyped - but the promise of data growth enabling a goldmine of insight is compelling. Professor Mark Whitehorn, the eminent data scientist, author and occasional Register columnist, explains what big data is and why it is important.
Sometimes life is generous and hands you an unexpected gift on a plate. Our esteemed editor asked me to write about how big data is misunderstood.
On the very day I was writing, the Grauniad reported that the Oxford English Dictionary has just defined big data as “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges”.
Nooooo!!!!!!!! No. No. No. Really, no.
And I certainly don’t think that defining big data by the three 'V's (velocity, volume and variety) helps to add veracity, validation or value either. So, what is big data?
Little and large
In the beginning there was data. Data is just data.
Data is not large and it is not small
It does not live and it does not die
It does not offer truth and neither does it lie
(with apologies to Michael Moorcock)
In other words, data has always existed as both big data and small data (OK, so “small data” is not a real term, but it is useful here as a distinguisher).
But an often overlooked fact is that all data is very difficult to handle properly. We have used databases since the 1960s but it wasn’t until 1993 that we even understood transactions properly.
I interviewed the late, great, Jim Gray, who said: “I spent decades working on the problem of getting transaction integrity to work at all and on ACID [atomicity, consistency, isolation, durability] properties and how they can be implemented efficiently.”
Jim was awarded a Turing Award (the computing equivalent of a Nobel) for this work. The bottom line is that storing, editing and querying data properly is very, very difficult.
So we may have always had both big and small data but in the early days we focused on the data that was easiest to manipulate. This turned out to be the data which is simple (atomic) and fits neatly into columns and rows. In other words, we focused on data that fits neatly into tables. This is small data.
Another feature of small data is that we typically want to query it by simple sub-setting.
Think about a SQL query:
SELECT Name, DateOfBirth
FROM Employee
WHERE Salary > 40000;
The FROM chooses the table, the SELECT subsets the columns and the WHERE subsets the rows.
SQL is exceptionally good at sub-setting and wildly incompetent at comparing sequential rows. This is because SQL is for set manipulation and sets are by definition unordered; there is no concept of sequential rows in a table.
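For anyone who wants to see sub-setting in action, here is a minimal sketch that runs that kind of query against an in-memory SQLite database from Python. The table name Employee, the sample rows and the function name well_paid are all made up for illustration. Note the explicit ORDER BY: because a table is an unordered set, it is the only way to get rows back in a predictable sequence.

```python
import sqlite3

# Hypothetical sample rows: (Name, DateOfBirth, Salary).
SAMPLE = [
    ("Ada", "1975-03-01", 52000),
    ("Bob", "1982-11-20", 38000),
    ("Cy", "1990-06-15", 41000),
]


def well_paid(rows, threshold=40000):
    """Return (Name, DateOfBirth) for every row whose Salary exceeds threshold."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: the table name Employee is illustrative only.
    conn.execute(
        "CREATE TABLE Employee (Name TEXT, DateOfBirth TEXT, Salary INTEGER)"
    )
    conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)", rows)
    # FROM picks the table, SELECT subsets the columns, WHERE subsets the rows.
    result = conn.execute(
        "SELECT Name, DateOfBirth FROM Employee WHERE Salary > ? ORDER BY Name",
        (threshold,),
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(well_paid(SAMPLE))
```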
Leave the table
So, for the past 40 years we have focused on the easy stuff – tabular data that can be manipulated by sub-setting. In other words, we have focused on small data.
One characteristic of big data is that it does not fit easily into tables – good examples are image and audio files.
Another is that we don’t want to query it by sub-setting; the queries are more complex than that. Some data has both of these characteristics, some just one.
As an example of the first characteristic, image files can be broken down into individual pixels and each one stored as a row in a table, but you probably don’t want to do that. Even if you did, a query looking for all the pictures that contain a vintage Bentley is not simply sub-setting by rows and columns.
As an example of the second, sensor data can be tabularised quite easily (although you will end up with a very narrow, mind-bogglingly deep table), but the queries we run against this kind of data are typically looking for patterns formed by the consideration of sequential rows rather than sub-setting. So SQL is not the query language of choice here.
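To make "patterns formed by sequential rows" concrete, here is a rough sketch in Python. It scans a list of readings for stretches where each value is higher than the one before it. The function name rising_runs, the run_length parameter and the sample temperatures are invented for illustration; real pattern detection on sensor streams is a good deal more sophisticated.

```python
def rising_runs(readings, run_length=3):
    """Return the start index of every window in which the reading rises
    strictly for run_length consecutive steps."""
    starts = []
    # A run of run_length rises spans run_length + 1 readings.
    for i in range(len(readings) - run_length):
        window = readings[i : i + run_length + 1]
        # Compare each reading with its successor: the row order matters here.
        if all(a < b for a, b in zip(window, window[1:])):
            starts.append(i)
    return starts


if __name__ == "__main__":
    temperatures = [70, 70, 71, 73, 76, 75, 74]  # hypothetical sensor trace
    print(rising_runs(temperatures))
```

The point is that the answer depends entirely on the order of the rows, which is exactly what plain sub-setting ignores.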
OK, so big data doesn’t fit well into tables and we need to be able to run queries that are more complex than sub-setting.
Why is big data worth pursuing?
At this point I usually start to describe social data analysis or scanning aerial photographs for hidden aeroplanes (both excellent examples). But I also love machinery and am very taken with the melding of big data and combine harvesters. So we will look at that.
Suppose you manufacture combine harvesters. You notice that sensors are getting very cheap, so you think about covering your next model in sensors (temperature, vibration, pressure, GPS and so on) that can stream data back to headquarters.
OK, before you go any further with this article, stop and think about what use we could make of that data. Try to think outside the box.
In no particular order we could:
- Track the position of the combine in a field, look at the yield per unit area and use that information to give the farmer valuable information about how to, for example, distribute the fertiliser for next year, drain the field and so on.
- Use the yield patterns to predict when the combine will be full so that it can actively communicate with the tractors and trailers that collect the grain and they can turn up at the appointed time and exact place.
- Data collected over several years could be used, along with weather data, to predict the yield in any one year for the entire field.
- Combines are very expensive bits of kit that are used infrequently during the year but intensively in the summer. A broken combine can cost a fortune in lost revenue and missed harvest. It is now possible to scan the sensor data looking for temperature and vibration patterns that are characteristic of an imminent component failure and get the part shipped and fitted before it fails.
- Driving habits can be closely monitored, cross-correlated with fuel consumption and used to advise on driving style and technique.
- The machine can be set to shut down if it moves outside a given area (in other words, is stolen).
Of course, crop growth is influenced by a multitude of factors, not just fertiliser, so if only we fitted a probe that took moisture samples we could …
And so it goes on. Much of this is already being done (Google "telematics combine harvesters").
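As a toy illustration of the second idea in the list above, predicting when the combine will be full, here is a sketch that assumes the machine reports its grain-tank capacity, its current load and a few recent fill rates. The function name, the parameters and the units are all hypothetical; a real system would use the yield patterns themselves rather than a simple average.

```python
def minutes_until_full(capacity_kg, current_kg, recent_rates_kg_per_min):
    """Estimate the minutes left before the grain tank is full, using the
    average of the recent fill rates. Returns None if no estimate is possible."""
    if not recent_rates_kg_per_min:
        return None
    average_rate = sum(recent_rates_kg_per_min) / len(recent_rates_kg_per_min)
    if average_rate <= 0:
        return None  # not filling, so no arrival time to predict
    remaining_kg = max(capacity_kg - current_kg, 0)
    return remaining_kg / average_rate


if __name__ == "__main__":
    # Hypothetical figures: a 9,000 kg tank holding 6,000 kg.
    print(minutes_until_full(9000, 6000, [100, 150, 200]))
```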
Another characteristic of big data is not simply how we collect the data, but why we collect it. In the old days (the small data days) we started by trying to answer specific questions such as “how do we track all of our transactions?” and collected the appropriate data to do that.
We have the technology to collect vast amounts of data so the strategy can shift if we want it to. We can say: “We certainly could collect all of this data at this kind of cost. If we did so, what could we do with it?”
I hasten to add I am not advocating the collection of data purely for the sake of it. Nor am I suggesting putting in vast arrays of sensors, collecting the data and then trying to work out what to do with it.
I am suggesting that you perform mind experiments where you say, “OK, we now have the technology to collect data such as this. What advantage could we reap if we did so?”
However, I am not being dogmatic here. There certainly are times when we have the problem first and big data is the answer.
You have probably heard of the human genome project. A mere $3bn and 13 years' work means that we are the only species we know of that has sequenced its own genome. Which is, I believe, a major turning point for any species (let’s see the dolphins do that one).
But where does it get us? Well, apart from winning the “cleverest species on the planet” award, not very far. You see, it isn’t usually your genome that gets sick; it is your proteome (the sum of all the proteins in the body).
So work is progressing to baseline the human proteome and I am privileged to be a very small part of that project. Computationally, this is a much, much more complex undertaking.
Why? Well, to identify the proteins you have to chop them up into sections (peptides), pop the peptides into a mass spectrometer (basically a very accurate weighing machine) and measure the amount of each fragment. Then you have to back-extrapolate from the identified peptides (in itself a probabilistic process) to the proteins.
One of the many challenges here is that the mass spectrometers output big data. The output is essentially a trace consisting of about 800,000 XY coordinates. A laboratory with 10 instruments could easily produce in the order of 15 to 20 billion data points per day.
To be technical, the mass spectrometers do multiple scans per second so these are actually three-dimensional peaks. And we have to calculate the volume under the peaks.
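For the curious, one simple way to approximate the volume under a peak is to apply the trapezoid rule twice: once along each scan, then across the scans. The sketch below assumes the peak arrives as a grid of intensities, one row per scan. The names volume_under_peak, xs and scan_times are mine, and this is not a description of how the project's own software does it.

```python
def trapezoid(ys, xs):
    """Area under the points (xs[i], ys[i]) by the trapezoid rule."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
        for i in range(len(xs) - 1)
    )


def volume_under_peak(intensity, xs, scan_times):
    """intensity[t][k] is the reading at scan_times[t] and position xs[k].

    Integrate each scan along x first, then integrate those areas over time.
    """
    areas = [trapezoid(row, xs) for row in intensity]
    return trapezoid(areas, scan_times)


if __name__ == "__main__":
    # Hypothetical 3 x 3 grid with a single spike in the middle.
    grid = [
        [0, 0, 0],
        [0, 4, 0],
        [0, 0, 0],
    ]
    print(volume_under_peak(grid, xs=[0, 1, 2], scan_times=[0, 1, 2]))
```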
So, in this case, we did have a problem to solve and collecting specific big data was the way to solve it. However, once we had the data, all manner of other uses for it became apparent to us.
For example, we could use multiple data runs to watch the individual machines for calibration drift. We could take multiple runs using identical samples and cancel out the noise. We could… and so on.
Big data is real. It has always been there, it will always be there. We now have the technology to deal with it. It will change our lives. It is already doing so – from combines to revolutionising medicine. Get some today. ®
Professor Mark Whitehorn is chair of analytics at Dundee University's School of Computing