The Register® — Biting the hand that feeds IT

Feeds

Tell me, professor, what is big data?

A whole different dimension, says data scientist Mark Whitehorn

5 ways to reduce advertising network latency

Big Data may be misunderstood and overhyped - but the promise of data growth enabling a goldmine of insight is compelling. Professor Mark Whitehorn, the eminent data scientist, author and occasional Register columnist, explains what big data is and why it is important.

Sometimes life is generous and hands you an unexpected gift on a plate. Our esteemed editor asked me to write about how big data is misunderstood.

On the very day I was writing, the Grauniad reported that the Oxford English Dictionary has just defined big data as “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges".

Nooooo!!!!!!!! No. No. No. Really, no.

And I certainly don’t think that defining big data by the three 'V's (velocity, volume and variety) helps to add veracity, validation or value either. So, what is big data?

Little and large

In the beginning there was data. Data is just data.

Data is not large and it is not small
It does not live and it does not die
It does not offer truth and neither does it lie
(with apologies to Michael Moorcock)

In other words, data has always existed as both big data and small data (OK, so “small data” is not a real term, but it is useful here as a distinguisher).

But an often overlooked fact is that all data is very difficult to handle properly. We have used databases since the 1960s but it wasn’t until 1993 that we even understood transactions properly.

I interviewed the late, great, Jim Gray, who said: “I spent decades working on the problem of getting transaction integrity to work at all and on ACID [atomicity, consistency, isolation, durability] properties and how they can be implemented efficiently.”

Jim was awarded a Turing Award (the computing equivalent of a Nobel) for this work. The bottom line is that storing, editing and querying data properly is very, very difficult.

So we may have always had both big and small data but in the early days we focused on the data that was easiest to manipulate. This turned out to be the data which is simple (atomic) and fits neatly into columns and rows. In other words, we focused on data that fits neatly into tables. This is small data.

Another feature of small data is that we typically want to query it by simple sub-setting.

Think about a SQL query:

SELECT Name, DateOfBirth
FROM Employee
WHERE Salary > 40,000;

The FROM chooses the table, the SELECT subsets the columns and the WHERE subsets the rows.

SQL is exceptionally good at sub-setting and wildly incompetent at comparing sequential rows. This is because SQL is for set manipulation and sets are by definition unordered; there is no concept of sequential rows in a table.

Leave the table

So, for the past 40 years we have focused on the easy stuff – tabular data that can be manipulated by sub-setting. In other words, we have focused on small data.

One characteristic of big data is that it does not fit easily into tables – good examples are image and audio files.

Another is that we don’t want to query it by sub-setting; the queries are more complex than that. Some data has both of these characteristics, some just one.

As an example of the former, image files can be broken down into individual pixels and each one stored as a row in a table, but you probably don’t want to do that. Even if you did, a query looking for all the pictures that contain a vintage Bentley is not simply sub-setting by rows and columns.

As an example of the latter, sensor data can be tabularised quite easily (although you will end up with a very narrow, mind-bogglingly deep table), but the queries we run against this kind of data are typically looking for patterns formed by the consideration of sequential rows rather than sub-setting. So SQL is not the query language of choice here.

OK, so big data doesn’t fit well into tables and we need to be able to run queries that are more complex than sub-setting.

Why is big data worth pursuing?

At this point I usually start to describe social data analysis or scanning aerial photographs for hidden aeroplanes (both excellent examples). But I also love machinery and am very taken with the melding of big data and combine harvesters. So we will look at that.

Supercharge your infrastructure

Next page: Mean machine

Whitepapers

5 ways to reduce advertising network latency
Implementing the tactics laid out in this whitepaper can help reduce your overall advertising network latency.
Avere FXT with FlashMove and FlashMirror
This ESG Lab validation report documents hands-on testing of the Avere FXT Series Edge Filer with the AOS 3.0 operating environment.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.
Email delivery: 4 steps to get more email to the inbox
This whitepaper lists some steps and information that will give you the best opportunity to achieve an amazing sender reputation.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?

More from The Register

next story
Elop's enlarged package claim was a cock-up, admits Nokia chairman
'Twas an 'accident' to say whopping £15.6m payoff was unremarkable
Oracle's Ellison talks up 'ungodly speeds' of in-memory database. SAP: *Cough* Hana
Plus new, RAM-heavy hardware promises 100x performance improvement
BlackBerry Black Friday: $1bn loss as warehouses bulge with hated Z10s
Biz plan in full: (1) Keep pumping out phones NO ONE WANTS (2) ??? (3) Er, no profit
Would you hire a hacker to run your security? 'Yes' say Brit IT bosses
We don't have enough securo bods in the industry either, reckon gloomy BOFHs
OUCH: Google preps ad goo injection for Android mobile Gmail app
Don't worry, fandroids, wallet-plumping serum won't hurt a bit
Global execs name Apple 'most innovative company' – again
Google bumped down to number three by Apple arch-rival Samsung
Google tentacle slips over YouTube comments: Now YOUR MUM is at the top
Ad giant tries to dab some polish on the cesspit of the internet
prev story