Feeds

Tell me, professor, what is big data?

A whole different dimension, says data scientist Mark Whitehorn

Application security programs and practises

Mean machine

Suppose you manufacture combine harvesters. You notice that sensors are getting very cheap, so you think about covering your next model in sensors (temperature, vibration, pressure, GPS and so on) that can stream data back to headquarters.

OK, before you go any further with this article, stop and think about what use we could make of that data. Try to think outside the box.

In no particular order we could:

  • Track the position of the combine in a field, look at the yield per unit area and use that information to give the farmer valuable information about how to, for example, distribute the fertiliser for next year, drain the field and so on.
  • Use the yield patterns to predict when the combine will be full so that it can actively communicate with the tractors and trailers that collect the grain and they can turn up at the appointed time and exact place.
  • Data collected over several years could be used, along with weather data, to predict the yield in any one year for the entire field.
  • Combines are very expensive bits of kit that are used infrequently during the year but intensively in the summer. A broken combine can cost a fortune in lost revenue and missed harvest. It is now possible to scan the sensor data looking for temperature and vibration patterns that are characteristic of an imminent component failure and get the part shipped and fitted before it fails.
  • Driving habits can be closely monitored, cross-correlated with fuel consumption and used to advise on driving style and technique.
  • The machine can be set to shut down if it moves outside a given area (in other words, is stolen).

Of course, crop growth is influenced by a multitude of factors, not just fertiliser, so if only we fitted a probe that took moisture samples we could …

And so it goes on. Much of this is already being done (Google "telematics combine harvesters".)

Image of google image

Another characteristic of big data is not simply how we collect the data, but why we collect it. In the old days (the small data days) we started trying to answer specific questions such as “how do we track all of our transactions?” and collect the appropriate data to do that.

We have the technology to collect vast amounts of data so the strategy can shift if we want it to

We have the technology to collect vast amounts of data so the strategy can shift if we want it to. We can say: “We certainly could collect all of this data at this kind of cost. If we did so, what could we do with it?”

I hasten to add I am not advocating the collection of data purely for the sake of it. Nor am I suggesting putting in vast arrays of sensors, collecting the data and then trying to work out what to do with it.

I am suggesting that you perform mind experiments where you say, “OK, we now have the technology to collect data such as this. What advantage could we reap if we did so?”

However I am not being dogmatic here. There certainly are times when we have the problem first and big data is the answer.

You have probably heard of the human genome project. A mere $3bn and 13 years work means that we are the only species we know that has sequenced its own genome. Which is, I believe, a major turning point for any species (let’s see the dolphins do that one).

But where does it get us? Well, apart from winning the “cleverest species on the planet” award, not very far. You see, it isn’t usually your genome that gets sick; it is your proteome (the sum of all the proteins in the body).

So work is progressing to baseline the human proteome and I am privileged to be a very small part of that project. Computationally, this is a much, much more complex undertaking.

Why? Well, to identify the proteins you have to chop them up into sections (peptides), pop the peptides into a mass spectrometer (basically a very accurate weighing machine) and measure the amount of each fragment. Then you have to back-extrapolate from the identified peptides (in itself a probabilistic process) to the proteins.

One of the many challenges here is that the mass spectrometers output big data. The output is essentially a trace consisting of about 800,000 XY coordinates. A laboratory with 10 instruments could easily produce in the order of 15 to 20 billion data points per day.

graph showing mass spectrometers' output

Medical advance

To be technical, the mass spectrometers do multiple scans per second so these are actually three-dimensional peaks. And we have to calculate the volume under the peaks.

graphs showing mass spectrometers output as 3D peaks

So, in this case, we did have a problem to solve and collecting specific big data was the way to solve it. However, once we had the data, all manner of other uses for it became apparent to us.

For example, we could use multiple data runs to watch the individual machines for calibration drift. We could take multiple runs using identical samples and cancel out the noise. We could… and so on.

Big data is real. It has always been there, it will always be there. We now have the technology to deal with it. It will change our lives. It is already doing so – from combines to revolutionising medicine. Get some today. ®

Professor Mark Whitehorn is chair of analytics at Dundee University's School of Computing

Application security programs and practises

More from The Register

next story
BBC goes offline in MASSIVE COCKUP: Stephen Fry partly muzzled
Auntie tight-lipped as major outage rolls on
You! Pirate! Stop pirating, or we shall admonish you politely. Repeatedly, if necessary
And we shall go about telling people you smell. No, not really
Airbus promises Wi-Fi – yay – and 3D movies (meh) in new A330
If the person in front reclines their seat, this could get interesting
UK Parliament rubber-stamps EMERGENCY data grab 'n' keep bill
Just 49 MPs oppose Drip's rushed timetable
There's NOTHING on TV in Europe – American video DOMINATES
Even France's mega subsidies don't stop US content onslaught
Samsung threatens to cut ties with supplier over child labour allegations
Vows to uphold 'zero tolerance' policy on underage workers
Dude, you're getting a Dell – with BITCOIN: IT giant slurps cryptocash
1. Buy PC with Bitcoin. 2. Mine more coins. 3. Goto step 1
ITC: Seagate and LSI can infringe Realtek patents because Realtek isn't in the US
Land of the (get off scot) free, when it's a foreign owner
prev story

Whitepapers

Reducing security risks from open source software
Follow a few strategies and your organization can gain the full benefits of open source and the cloud without compromising the security of your applications.
Consolidation: The Foundation for IT Business Transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.
Application security programs and practises
Follow a few strategies and your organization can gain the full benefits of open source and the cloud without compromising the security of your applications.
Boost IT visibility and business value
How building a great service catalog relieves pressure points and demonstrates the value of IT service management.
Consolidation: the foundation for IT and business transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.