Big Data versus small data: Unpicking the paradox
Can NoSQL and relational both be adaptable?
But while concepts such as key value stores and content-specific stores have certainly enriched our environments, the downside to their arrival is that it has created quite a bit of zealotry as people on opposing sides of the technology camp have argued theirs is the only way.
The resulting noise chamber has produced statements that confuse and mislead those caught in the middle.
Here are two such statements: "The relational model imposes a strict schema on the data to ensure that any question can be asked and answered" and: “NOSQL systems employ 'schema-less' data storage to ensure that you will be able to ask, and answer, any question of the data.”
This is a paradox, a paradox that seems to be saying that both "a strict schema" and "no schema" promote adaptability in terms of querying.
This article tries to resolve this paradox and highlight the differences between Big Data and the “other” kind – something we ought to really call "small data".
Atomic and non-atomic data
One of the many ways of classifying data is to split it into two types: atomic and non-atomic. As it happens I think this is one of the most fundamental divisions we can apply; we’ll use it here because I believe it helps to explain the paradox.
Much of the data in the business world is atomic, for example, HR data. So, what do we mean by atomic?
Imagine that the original specification from the users required the following to be stored about the employees:
- First Name
- Last Name
- Date of Birth
- National Insurance number
Imagine that this transactional database is just at the planning stage. We haven’t gotten as far as an ER model and certainly nowhere near thinking about tables - we simply have a spec from the users.
What structure do we think the data has?
We can argue that it has none because we may think of “structure” in terms of tables and blobs, but it clearly has no structure in that sense because we have yet to make those decisions.
Or we can take another tack. Suppose that the user’s view (when asked) is that each piece of data is always going to be treated as if it is indivisible. In other words, we are told that they will always query on the complete data item (First Name) and no one will ever ask “How many employees have a first name in which the third letter is ‘n’?” In that case we can say that the data is atomic.
Now suppose that the original spec also included, say, an image of the staff member. We might, depending on the spec that we get from the users, also regard this data as atomic. Note that our decision here has little to do with the internal structure of the image file - we don’t care if it is a JPG or a BMP; it is much more to do with the use to which the data will be put.
Suppose that the image is simply there to be displayed. It is never going to be queried as in “Find me all the employees who were photographed in a yellow tie.” In this case the image is atomic in the sense that it will never be decomposed into smaller units. To put that another way, no further information is to be extracted from "inside" the image – it will simply be treated as a whole.
The classification of this data as "atomic" is based on how the users want to query the data, not on whether I (as a database designer) think that the data has any internal structure.
If we put this data into a relational table, will that form of storage restrict the queries that can be run against the data? No, because all queries will run against the complete contents of each column in the tables.
We can run queries that, for example, find all the employees in department X who are paid less than $45,000. Indeed a relational database should be able to answer this and all queries that subset, group and aggregate the data based on the atomic values. It is in this sense that we say "relational databases don’t restrict the queries that can be run".
It is also true there are massive quantities of data out there in the real world that are non-atomic. Consider – yes - the humble Tweet. It has a rich (and sometimes bewildering) internal structure. There is the date/time it was sent, and the text string that it contains (that crucial message to the world “Got up late today but still had time for a shower.”). Or take satellite images. They aren’t just displayed like an HR photograph; they are there to be dissected. Then there are mass spectrometer files, web logs and on and on and on.
Next page: Pick a query, any query...
In an RDBMS system the level of at which the data is deemed to be atomic is determined by the database design. Each column holds one piece of information.
If the system using the database needs to be concerned with the individual characters in a string then a string is no longer atomic and storing that data as a string is breaking a fundamental RDBMS rule.
To put it another way, you wouldn't store comma separated values in a database column, you would split those values out into proper columns or rows depending upon the structure of the data. The exception to this would be if the database or the systems that used it were designed to not care what was in the string or care that it was comma separated values; it is just a string to be stored, read and reproduced as is.
This is also why database designers HATE free text fields, they always end up being abused and used as data stores for non-atomic data as a short cut for changing the database and/or user interface
(I think the DOB was a typo that got past the editor)
As a simple mental exercise...
...every time you see "NoSQL" substitute "dBase". I find it tends to break down the arguments quite succinctly and immediately calls to mind all the perils - lack of any real integrity constraints and being strongly tied to a particular system for starters (true, SQL isn't portable either but that's a little fettling as opposed to complete rewrites). It also shows that it's not a new idea and indeed it is one largely rejected 20 years ago.
What's changed? It's a new generation of cowboys that never learned the lessons of the past and that intrinsically assumes that their particular requirements are somehow unique. Instead of "we've got a lot of critical data here so we need to protect it at all costs" the argument becomes "we've got a lot of critical data here so the rules about how to look after it don't apply to us - they're for lesser beings than us".
Database de-normalization Tradeoff
Both strategies will enable you to do what you want, they are just implemented differently. The different implementations are a tradeoff between code manageability and data scalability. E.g. the question 'find all the employees in department X who are paid less than $45,000' can be implemented in single database or two databases, one storing employee data and the other payment data. In the first case it will be a simple Select From Where. The second case requires selecting 'paid less than $45,000' from the Payment DB and then doing look-ups in the Employee DB.
The first method is better for code maintenance whilst the second may enable you to scale higher. Even with a single DB you can scale by table partitioning, multiple CPUs and multiple Tablespaces located on different SANs. The second solution allows you to scale higher with cheap hardware.
The single DB solution is better for complex queries such as 'find all the employees in department X who are paid less than $45,000 and have been on X, Y, and Z training courses. This will be a simple Select From Where. In the Multiple DB scenario you have to know a little about the underlying DB to write an efficient query. E.G. how many people are paid less than $45,000 and how many people have been on X, Y, and Z training courses. Not knowing this information will result in extremely poor DB performance.
In the end a competent database developer knows what tradeoffs to make to ensure scalability whilst maintaining code manageability.
A string is none atomic because you might like to search inside it. So if you have a table, one column stores 140 character tweets, you might like to search for all tweets that contain mention of "the Register" and get all the rows that contain that condition. In that case the string we are interested in, can appear anywhere in the field we are looking at and the position that it appears is unimportant. For sentiment analysis we may be interested in the position of the string, so we might want to write a query that returns all the tweets with "the register" at the end of the tweet, because these tend to be more positive. Again we are looking inside the field and are interested in the properties that the text has.
Numerical values will normally be compared against external values (count >10) or against a value in another field. You would very rarely in the real world (and I struggle to find an example) want to return all rows where the 10's value is greater than the units value, I.E compare within the decimal number. Nor would we typically look for numbers that contain a value anywhere in the number. So would we ever write a query that returns all the stores that held a stock of oranges where the number of oranges had 12 in it ? I.e 12 oranges, 120 oranges, 312 etc ?
And why did he throw away his original data? That one went right over my head - only a moron assumes that he has extracted every piece of information from original data.
In both models you can only get answers to questions where you (a) have the information necessary to provide the answer and (b) know where the data is. All this crap about schema and non-schema is just babbling nonsense.
If you built an RDB where you are locked into the questions and type of questions you can ask by the structure of the DB or the datatypes you specified in design then you shouldn't be building RDBs. The only reason to lockdown data in the way described is to screw the customer later.
Both of the data models discussed have their uses - the difference is that one approach uses a database and the other one uses data indexing. They are not even mutually exclusive. You could easily use both approaches in the same data model. If you won't use RDBs cos you're a NoSQL fanboi (or vice versa) then you don't know enough to be trusted with anyone's data - go back to school and learn about abstract thinking - most people get it by their teenage years.