Big Data versus small data: Unpicking the paradox
Can NoSQL and relational both be adaptable?
While concepts such as key-value stores and content-specific stores have certainly enriched our environments, the downside of their arrival is that it has created quite a bit of zealotry, as people in opposing technology camps have argued that theirs is the only way.
The resulting noise chamber has produced statements that confuse and mislead those caught in the middle.
Here are two such statements: “The relational model imposes a strict schema on the data to ensure that any question can be asked and answered” and: “NOSQL systems employ ‘schema-less’ data storage to ensure that you will be able to ask, and answer, any question of the data.”
This is a paradox: taken together, the two statements seem to say that both "a strict schema" and "no schema" promote adaptability in terms of querying.
This article tries to resolve this paradox and highlight the differences between Big Data and the “other” kind – something we really ought to call "small data".
Atomic and non-atomic data
One of the many ways of classifying data is to split it into two types: atomic and non-atomic. As it happens I think this is one of the most fundamental divisions we can apply; we’ll use it here because I believe it helps to explain the paradox.
Much of the data in the business world is atomic, for example, HR data. So, what do we mean by atomic?
Imagine that the original specification from the users required the following to be stored about the employees:
- First Name
- Last Name
- Date of Birth
- National Insurance number
Imagine that this transactional database is just at the planning stage. We haven’t gotten as far as an ER model and certainly nowhere near thinking about tables - we simply have a spec from the users.
What structure do we think the data has?
We can argue that it has none: if we think of “structure” in terms of tables and blobs, it clearly has no structure in that sense, because we have yet to make those decisions.
Or we can take another tack. Suppose that the user’s view (when asked) is that each piece of data is always going to be treated as if it is indivisible. In other words, we are told that they will always query on the complete data item (First Name) and no one will ever ask “How many employees have a first name in which the third letter is ‘n’?” In that case we can say that the data is atomic.
Now suppose that the original spec also included, say, an image of the staff member. We might, depending on the spec that we get from the users, also regard this data as atomic. Note that our decision here has little to do with the internal structure of the image file - we don’t care if it is a JPG or a BMP; it is much more to do with the use to which the data will be put.
Suppose that the image is simply there to be displayed. It is never going to be queried as in “Find me all the employees who were photographed in a yellow tie.” In this case the image is atomic in the sense that it will never be decomposed into smaller units. To put that another way, no further information is to be extracted from "inside" the image – it will simply be treated as a whole.
The classification of this data as "atomic" is based on how the users want to query the data, not on whether I (as a database designer) think that the data has any internal structure.
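The distinction can be sketched in a few lines of Python. The names in the list and both helper functions are invented for illustration; the point is only that the same values can be used atomically (compared whole) or non-atomically (opened up and inspected).

```python
# Hypothetical sample data, invented for illustration.
first_names = ["Anna", "Brian", "Gina", "Tom"]


def count_exact(names, wanted):
    """Atomic use: each value is compared as a whole, never opened up."""
    return sum(1 for name in names if name == wanted)


def count_third_letter(names, letter):
    """Non-atomic use: the query reaches inside each value."""
    return sum(1 for name in names if len(name) >= 3 and name[2] == letter)


exact_matches = count_exact(first_names, "Gina")
third_letter_n = count_third_letter(first_names, "n")
```

Nothing about the stored values differs between the two functions; what changes is the question being asked of them.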
If we put this data into a relational table, will that form of storage restrict the queries that can be run against the data? No, because all queries will run against the complete contents of each column in the tables.
We can run queries that, for example, find all the employees in department X who are paid less than $45,000. Indeed a relational database should be able to answer this and all queries that subset, group and aggregate the data based on the atomic values. It is in this sense that we say "relational databases don’t restrict the queries that can be run".
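As a rough sketch of that kind of query, the following uses Python's built-in sqlite3 module. The table name, column names and sample rows are all assumptions made up for this example; the spec above does not define a department or salary field.

```python
import sqlite3

# Hypothetical table, columns and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "first_name TEXT, last_name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        ("Ann", "Jones", "X", 42000),
        ("Bob", "Smith", "X", 51000),
        ("Cat", "Brown", "Y", 39000),
        ("Dan", "White", "X", 44000),
    ],
)

# Subset on whole column values: department X, paid less than 45,000.
low_paid_in_x = conn.execute(
    "SELECT first_name, last_name FROM employee "
    "WHERE department = ? AND salary < ? ORDER BY last_name",
    ("X", 45000),
).fetchall()

# Group and aggregate on the same atomic values.
summary_by_department = conn.execute(
    "SELECT department, COUNT(*), AVG(salary) FROM employee "
    "GROUP BY department ORDER BY department"
).fetchall()
```

Every predicate here compares or aggregates a complete column value, which is why the tabular storage places no limit on this class of question.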
It is also true that there are massive quantities of data out there in the real world that are non-atomic. Consider – yes – the humble Tweet. It has a rich (and sometimes bewildering) internal structure. There is the date/time it was sent, and the text string that it contains (that crucial message to the world: “Got up late today but still had time for a shower.”). Or take satellite images. They aren’t just displayed like an HR photograph; they are there to be dissected. Then there are mass spectrometer files, web logs and on and on and on.
Pick a query, any query...
How am I to store these so that I can query them? Well, one approach is, before I store them, to run functions against them that look inside the non-atomic data and pull out, essentially, atomic data. So, I might create a function that scans satellite images and looks for aircraft. It might return data like:
- Delta winged – 0
- Swept winged – 3
- Straight winged – 2
- Rotating winged – 2
I can store those numbers as atomic data in an elegant, well-formed, relational database and throw away the original image.
You can see where this is going. In selecting the functions I choose to run I have defined exactly the questions that I can ask of the data. If I later want to know the number of King Penguins in the image, tough; I can’t.
If only I had thought ahead! If only I had stored the image file itself (in some non-tabular way), I could write and run my new function and count the King Penguins. But no, I was foolish; I believed those idiots who told me that the relational model allows any question to be asked of the data.
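The trade-off can be illustrated with a toy sketch in Python. Everything in it is hypothetical: a real satellite image would be pixel data and the detectors would be image-analysis code, so a simple list of labels stands in for the image here.

```python
from collections import Counter

# Stand-in for a raw satellite image (hypothetical; real data would be pixels).
raw_image = ["swept", "straight", "penguin", "rotating", "swept",
             "penguin", "straight", "swept", "rotating", "penguin"]

AIRCRAFT_TYPES = ("delta", "swept", "straight", "rotating")


def count_aircraft(image):
    """Extraction chosen up front: one atomic count per wing type."""
    seen = Counter(item for item in image if item in AIRCRAFT_TYPES)
    return {wing: seen[wing] for wing in AIRCRAFT_TYPES}


def count_king_penguins(image):
    """A question nobody thought of when the data first arrived."""
    return sum(1 for item in image if item == "penguin")


# Approach 1: keep only the extracted atomic values and discard the image.
stored_counts = count_aircraft(raw_image)

# Approach 2: keep the raw image, so new functions can still be run later.
stored_image = list(raw_image)
penguins = count_king_penguins(stored_image)
```

With only `stored_counts` to hand, the penguin question has no answer; with `stored_image`, it is just one more function.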
Resolving the paradox
So now we can square the circle and resolve the paradox. The paradox is in the lack of precision in the statements above. More accurate statements would be:
“The relational model imposes a strict schema on the data to ensure that any question can be asked and answered of atomic data.”
“NOSQL systems employ ‘schema-less’ data storage to ensure that you will be able to ask, and answer, any question of Big Data.”
The paradox disappears.
That’s the theory, but suppose I want to store tweets. Let’s imagine a Tweet: “I'm the CFO of a UK bank and I don't like plums.” This is English and it has a grammatical structure, and I would also say that it is not atomic.
So one answer is simply to store the string and not make any decisions about the questions we are going to ask. As we need to answer particular questions we can write specific functions and/or programs which run against the string and extract data – two such programs might be called, for example, ExtractJobTitle, FindFruitFanciers.
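To make that concrete, here is a deliberately naive Python sketch of two such functions. The word lists and matching rules are assumptions invented for this example; real text extraction would need far more than a regular expression.

```python
import re

tweet = "I'm the CFO of a UK bank and I don't like plums."

# Hypothetical vocabularies, invented for illustration.
JOB_TITLES = ("CEO", "CFO", "CTO")
FRUITS = ("apples", "pears", "plums")


def extract_job_title(text):
    """Return the first known job title mentioned in the text, or None."""
    for word in re.findall(r"[A-Za-z']+", text):
        if word in JOB_TITLES:
            return word
    return None


def find_fruit_fanciers(text):
    """Return the fruits the author says they like (dislikes are ignored)."""
    liked = re.findall(r"\bI (?:like|love) (\w+)", text)
    return [fruit for fruit in liked if fruit in FRUITS]


job_title = extract_job_title(tweet)
liked_fruits = find_fruit_fanciers(tweet)
```

The string itself is stored untouched; each new question is answered by writing another function against it.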
Some people might use something like Aster Data to do this and they might say that they had stored some unstructured data and applied the schema later.
Somebody else might use SQL Server to store the string as a text field in a relational table and then write a set of functions called ExtractJobTitle and FindFruitFanciers. That person might say that they had stored some structured data in an agreed schema.
My view is that neither of these statements is a completely accurate description of what is going on.
I would happily store the string in Aster Data but I don’t think a string is inherently unstructured so I wouldn’t say that the data was unstructured. I would, though, agree that Big Data is being stored and the schema is being applied later.
And I would also happily store the string in SQL Server. Since I am aware that the desired analysis requires us to pull information from inside the string, I would agree that the data is being stored by a relational engine, but I would argue that this particular database is not relational.
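The relational variant can be sketched in the same spirit. This example uses SQLite through Python's sqlite3 module purely as a stand-in for SQL Server, and the table layout and the body of the extraction function are assumptions made for illustration.

```python
import re
import sqlite3


def extract_job_title(text):
    """Hypothetical extractor: returns the first known job title, or None."""
    match = re.search(r"\b(CEO|CFO|CTO)\b", text)
    return match.group(1) if match else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO tweet (body) VALUES (?)",
    [
        ("I'm the CFO of a UK bank and I don't like plums.",),
        ("Got up late today but still had time for a shower.",),
    ],
)

# Register the Python function so SQL can call it on the stored strings.
conn.create_function("ExtractJobTitle", 1, extract_job_title)

rows = conn.execute(
    "SELECT id, ExtractJobTitle(body) FROM tweet ORDER BY id"
).fetchall()
```

The table has a fixed schema, yet the interesting structure is only imposed at query time, when the function looks inside the `body` column.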
Inside stored strings
I am being very, very pedantic here. Of course in real life I have stored strings and written functions to look inside them and I would never say: “But this isn’t a relational database because this single column here contains non-atomic data.” Life is far too short to be that picky under normal circumstances; I am just being very precise here because we are discussing structure so specifically.
It is interesting, though, to ponder how different we think the two approaches really are. Both store the string and pull it apart later. You can, it seems, argue that – in the case of the relational example – we are storing the string as non-atomic data and then applying the schema later when we write the function and run it.
Isn’t that a schema-later approach? And doesn’t this help bridge the gap between different schools of thought?