Big Data versus small data: Unpicking the paradox
Can NoSQL and relational both be adaptable?
Pick a query, any query...
How am I to store these so that I can query them? Well, one approach is, before I store them, to run functions against them that look inside the non-atomic data and pull out, essentially, atomic data. So, I might create a function that scans satellite images and looks for aircraft. It might return data like:
- Delta winged – 0
- Swept winged – 3
- Straight winged – 2
- Rotating winged – 2
I can store those numbers as atomic data in an elegant, well-formed, relational database and throw away the original image.
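The approach can be sketched in a few lines. This is a toy illustration, not real image analysis: `count_aircraft_by_wing_type` is a hypothetical stand-in that simply returns the counts from the example above, and the table layout is invented.

```python
import sqlite3

def count_aircraft_by_wing_type(image_bytes):
    # Stand-in for a real image-analysis routine; in this sketch it
    # just returns the counts from the example above.
    return {"delta": 0, "swept": 3, "straight": 2, "rotating": 2}

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE aircraft_counts (
    image_id INTEGER PRIMARY KEY,
    delta INTEGER, swept INTEGER, straight INTEGER, rotating INTEGER)""")

counts = count_aircraft_by_wing_type(b"...raw satellite image bytes...")
conn.execute(
    "INSERT INTO aircraft_counts VALUES (1, :delta, :swept, :straight, :rotating)",
    counts)

# The original image has been thrown away; only the pre-chosen atomic
# facts survive, so a later question about King Penguins cannot be answered.
swept = conn.execute(
    "SELECT swept FROM aircraft_counts WHERE image_id = 1").fetchone()[0]
print(swept)  # 3
```

The atomic counts query beautifully; everything else about the image is gone.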
You can see where this is going. In selecting the functions I choose to run I have defined exactly the questions that I can ask of the data. If I later want to know the number of King Penguins in the image, tough; I can’t.
If only I had thought ahead! If only I had stored the image file itself (in some non-tabular way), I could write and run my new function and count the King Penguins. But no, I was foolish; I believed those idiots who told me that the relational model allows any question to be asked of the data.
Resolving the paradox
So now we can square the circle and resolve the paradox. The paradox is in the lack of precision in the statements above. More accurate statements would be:
“The relational model imposes a strict schema on the data to ensure that any question can be asked and answered of atomic data.”
“NoSQL systems employ ‘schema-less’ data storage to ensure that you will be able to ask, and answer, any question of Big Data.”
The paradox disappears.
That’s the theory, but suppose I want to store tweets. Let’s imagine a tweet: “I’m the CFO of a UK bank and I don’t like plums.” It is English, it has a grammatical structure, and I would also say that it is not atomic.
So one answer is simply to store the string and not make any decisions about the questions we are going to ask. As we need to answer particular questions we can write specific functions and/or programs which run against the string and extract data – two such programs might be called, for example, ExtractJobTitle and FindFruitFanciers.
Some people might use something like Aster Data to do this and they might say that they had stored some unstructured data and applied the schema later.
Somebody else might use SQL Server to store the string as a text field in a relational table and then write a set of functions called ExtractJobTitle and FindFruitFanciers. That person might say that they had stored some structured data in an agreed schema.
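A sketch of what those two functions might look like, whichever engine holds the string. The pattern-matching here is deliberately naive (real extraction would need proper NLP); the point is only that the functions are written after the string has been stored.

```python
import re

tweet = "I'm the CFO of a UK bank and I don't like plums."

def extract_job_title(text):
    # Hypothetical: look for a common C-suite abbreviation in the text.
    m = re.search(r"\b(CEO|CFO|CTO|COO)\b", text)
    return m.group(1) if m else None

def find_fruit_fanciers(text):
    # Hypothetical: list any fruit the text happens to mention.
    fruits = ["plums", "apples", "oranges"]
    return [f for f in fruits if f in text.lower()]

print(extract_job_title(tweet))    # CFO
print(find_fruit_fanciers(tweet))  # ['plums']
```

Nothing about the stored string dictated these questions; the "schema" arrives only when the functions do.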
My view is that neither of these statements is a completely accurate description of what is going on.
I would happily store the string in Aster Data but I don’t think a string is inherently unstructured so I wouldn’t say that the data was unstructured. I would, though, agree that Big Data is being stored and the schema is being applied later.
And I would also happily store the string in SQL Server. Since I am aware that the desired analysis requires us to pull information from inside the string, I would agree that the data is being stored by a relational engine, but not that this particular database is relational.
Inside stored strings
I am being very, very pedantic here. Of course in real life I have stored strings and written functions to look inside them and I would never say: “But this isn’t a relational database because this single column here contains non-atomic data.” Life is far too short to be that picky under normal circumstances; I am just being very precise here because we are discussing structure so specifically.
It is interesting, though, to ponder how different we think the two approaches really are. Both store the string and pull it apart later. You can, it seems, argue that – in the case of the relational example – we are storing the string as non-atomic data and then applying the schema later when we write the function and run it.
Isn’t that a schema-later approach? And doesn’t this help bridge the gap between different schools of thought? ®
In an RDBMS the level at which the data is deemed to be atomic is determined by the database design. Each column holds one piece of information.
If the system using the database needs to be concerned with the individual characters in a string then a string is no longer atomic and storing that data as a string is breaking a fundamental RDBMS rule.
To put it another way, you wouldn't store comma separated values in a database column, you would split those values out into proper columns or rows depending upon the structure of the data. The exception to this would be if the database or the systems that used it were designed to not care what was in the string or care that it was comma separated values; it is just a string to be stored, read and reproduced as is.
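The split the comment describes can be shown with a toy normalisation pass. The table and column names are invented for illustration; the point is that each comma-separated value becomes its own row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The anti-pattern: comma-separated values crammed into one column.
conn.execute("CREATE TABLE bad_orders (order_id INTEGER, items TEXT)")
conn.execute("INSERT INTO bad_orders VALUES (1, 'apple,plum,orange')")

# The normalised form: one row per value.
conn.execute("CREATE TABLE order_items (order_id INTEGER, item TEXT)")
for order_id, items in conn.execute(
        "SELECT order_id, items FROM bad_orders").fetchall():
    for item in items.split(","):
        conn.execute("INSERT INTO order_items VALUES (?, ?)",
                     (order_id, item))

rows = [r[0] for r in conn.execute(
    "SELECT item FROM order_items ORDER BY item")]
print(rows)  # ['apple', 'orange', 'plum']
```

Once split out, each item is atomic and can be queried, joined and constrained like any other column.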
This is also why database designers HATE free text fields: they always end up being abused as data stores for non-atomic data, as a short cut to avoid changing the database and/or user interface.
(I think the DOB was a typo that got past the editor)
As a simple mental exercise...
...every time you see "NoSQL", substitute "dBase". I find it tends to break down the arguments quite succinctly and immediately calls to mind all the perils – the lack of any real integrity constraints and being strongly tied to a particular system, for starters (true, SQL isn't portable either, but porting it is a little fettling as opposed to a complete rewrite). It also shows that it's not a new idea, and indeed it is one that was largely rejected 20 years ago.
What's changed? It's a new generation of cowboys that never learned the lessons of the past and that intrinsically assumes that their particular requirements are somehow unique. Instead of "we've got a lot of critical data here so we need to protect it at all costs" the argument becomes "we've got a lot of critical data here so the rules about how to look after it don't apply to us - they're for lesser beings than us".
Database denormalization tradeoff
Both strategies will enable you to do what you want; they are just implemented differently. The different implementations are a tradeoff between code manageability and data scalability. For example, the question 'find all the employees in department X who are paid less than $45,000' can be implemented in a single database, or in two databases, one storing employee data and the other payment data. In the first case it will be a simple Select From Where. The second case requires selecting 'paid less than $45,000' from the payment DB and then doing look-ups in the employee DB.
The first method is better for code maintenance whilst the second may enable you to scale higher. Even with a single DB you can scale by table partitioning, multiple CPUs and multiple Tablespaces located on different SANs. The second solution allows you to scale higher with cheap hardware.
The single-DB solution is better for complex queries such as 'find all the employees in department X who are paid less than $45,000 and have been on X, Y, and Z training courses'. This will be a simple Select From Where. In the multiple-DB scenario you have to know a little about the underlying data to write an efficient query – e.g. how many people are paid less than $45,000 and how many people have been on X, Y, and Z training courses. Not knowing this information will result in extremely poor DB performance.
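The two shapes of the simpler query above can be sketched side by side. The schemas and figures are invented; SQLite in-memory databases stand in for whatever engines you would really use.

```python
import sqlite3

# Single-database case: both tables live together, so the question is one join.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT)")
db.execute("CREATE TABLE payments (emp_id INTEGER, salary INTEGER)")
db.executemany("INSERT INTO employees VALUES (?,?,?)",
               [(1, "Ann", "X"), (2, "Bob", "X"), (3, "Cat", "Y")])
db.executemany("INSERT INTO payments VALUES (?,?)",
               [(1, 40000), (2, 50000), (3, 30000)])
single = [r[0] for r in db.execute(
    """SELECT e.name FROM employees e
       JOIN payments p ON p.emp_id = e.id
       WHERE e.dept = 'X' AND p.salary < 45000""")]
print(single)  # ['Ann']

# Two-database case: pull the underpaid ids from one store, then look
# each of them up in the other, in application code.
pay_db = sqlite3.connect(":memory:")
pay_db.execute("CREATE TABLE payments (emp_id INTEGER, salary INTEGER)")
pay_db.executemany("INSERT INTO payments VALUES (?,?)",
                   [(1, 40000), (2, 50000), (3, 30000)])
emp_db = sqlite3.connect(":memory:")
emp_db.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT)")
emp_db.executemany("INSERT INTO employees VALUES (?,?,?)",
                   [(1, "Ann", "X"), (2, "Bob", "X"), (3, "Cat", "Y")])

underpaid = [r[0] for r in pay_db.execute(
    "SELECT emp_id FROM payments WHERE salary < 45000")]
two_db = []
for emp_id in underpaid:
    row = emp_db.execute(
        "SELECT name FROM employees WHERE id = ? AND dept = 'X'",
        (emp_id,)).fetchone()
    if row:
        two_db.append(row[0])
print(two_db)  # ['Ann']
```

Same answer both ways, but in the second case the join has moved out of the engine and into your code, which is exactly the manageability cost being traded for scalability.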
In the end a competent database developer knows what tradeoffs to make to ensure scalability whilst maintaining code manageability.
A string is non-atomic because you might like to search inside it. So suppose you have a table in which one column stores 140-character tweets: you might like to search for all tweets that mention "the Register" and get back all the rows that satisfy that condition. In that case the string we are interested in can appear anywhere in the field we are looking at, and the position at which it appears is unimportant. For sentiment analysis we may be interested in the position of the string, so we might want to write a query that returns all the tweets with "the Register" at the end of the tweet, because these tend to be more positive. Again we are looking inside the field and are interested in the properties that the text has.
Numerical values will normally be compared against external values (count > 10) or against a value in another field. You would very rarely in the real world (and I struggle to find an example) want to return all rows where the tens digit is greater than the units digit, i.e. to compare within the decimal number. Nor would we typically look for numbers that contain a value anywhere in the number. So would we ever write a query that returns all the stores that held a stock of oranges where the number of oranges had 12 in it? I.e. 12 oranges, 120 oranges, 312, etc.?
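The two string searches the comment describes can be written as ordinary LIKE queries. The tweets and table are invented; SQLite stands in for any relational engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER, body TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?,?)", [
    (1, "Reading the Register over coffee"),
    (2, "Nothing beats the Register"),
    (3, "Just ate a plum"),
])

# Position-insensitive: the phrase may appear anywhere in the field.
anywhere = [r[0] for r in conn.execute(
    "SELECT id FROM tweets WHERE body LIKE '%the Register%'")]
print(anywhere)  # [1, 2]

# Position-sensitive: only tweets that *end* with the phrase.
at_end = [r[0] for r in conn.execute(
    "SELECT id FROM tweets WHERE body LIKE '%the Register'")]
print(at_end)  # [2]
```

Both queries only work because the engine is willing to look inside the stored string – which is precisely the sense in which that column is not atomic.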
And why did he throw away his original data? That one went right over my head - only a moron assumes that he has extracted every piece of information from original data.
In both models you can only get answers to questions where you (a) have the information necessary to provide the answer and (b) know where the data is. All this crap about schema and non-schema is just babbling nonsense.
If you built an RDB where you are locked into the questions and types of questions you can ask by the structure of the DB or the datatypes you specified in design, then you shouldn't be building RDBs. The only reason to lock down data in the way described is to screw the customer later.
Both of the data models discussed have their uses - the difference is that one approach uses a database and the other one uses data indexing. They are not even mutually exclusive. You could easily use both approaches in the same data model. If you won't use RDBs cos you're a NoSQL fanboi (or vice versa) then you don't know enough to be trusted with anyone's data - go back to school and learn about abstract thinking - most people get it by their teenage years.