
Big Data versus small data: Unpicking the paradox

Can NoSQL and relational both be adaptable?


Pick a query, any query...

How am I to store these so that I can query them? Well, one approach is, before I store them, to run functions against them that look inside the non-atomic data and pull out, essentially, atomic data. So, I might create a function that scans satellite images and looks for aircraft. It might return data like:

  • Delta winged – 0
  • Swept winged – 3
  • Straight winged – 2
  • Rotating winged – 2

I can store those numbers as atomic data in an elegant, well-formed, relational database and throw away the original image.

You can see where this is going. In selecting the functions I choose to run I have defined exactly the questions that I can ask of the data. If I later want to know the number of King Penguins in the image, tough; I can’t.

If only I had thought ahead! If only I had stored the image file itself (in some non-tabular way), I could write and run my new function and count the King Penguins. But no, I was foolish; I believed those idiots who told me that the relational model allows any question to be asked of the data.
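The trade-off can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the "image" is a hypothetical list of labelled objects standing in for real pixels, the two counting functions stand in for real image classifiers, and SQLite is used purely because it ships with Python. The table and column names are invented for the example.

```python
import json
import sqlite3

# Hypothetical stand-in for a satellite image: a list of labelled objects.
# A real image would be pixels, and the functions below would be classifiers.
image = [
    "swept_winged", "swept_winged", "straight_winged", "king_penguin",
    "rotating_winged", "swept_winged", "straight_winged", "rotating_winged",
    "king_penguin",
]

WING_TYPES = ("delta_winged", "swept_winged", "straight_winged", "rotating_winged")


def count_aircraft(img):
    """Reduce the non-atomic image to atomic counts, one per wing type."""
    return {wing: img.count(wing) for wing in WING_TYPES}


def count_king_penguins(img):
    """A question nobody anticipated when the aircraft schema was designed."""
    return img.count("king_penguin")


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE aircraft_counts ("
    "image_id INTEGER PRIMARY KEY, delta_winged INTEGER, swept_winged INTEGER, "
    "straight_winged INTEGER, rotating_winged INTEGER)"
)
db.execute("CREATE TABLE raw_images (image_id INTEGER PRIMARY KEY, payload TEXT)")

# Approach 1: store only the atomic results of the function chosen up front.
counts = count_aircraft(image)
db.execute(
    "INSERT INTO aircraft_counts VALUES "
    "(1, :delta_winged, :swept_winged, :straight_winged, :rotating_winged)",
    counts,
)

# Approach 2: also keep the original, non-atomic payload.
db.execute("INSERT INTO raw_images VALUES (1, ?)", (json.dumps(image),))

# The atomic table answers any question about wing counts...
swept_count = db.execute(
    "SELECT swept_winged FROM aircraft_counts WHERE image_id = 1"
).fetchone()[0]

# ...but the penguin question can only be answered from the stored original.
payload = db.execute(
    "SELECT payload FROM raw_images WHERE image_id = 1"
).fetchone()[0]
penguin_count = count_king_penguins(json.loads(payload))
```

Had the `raw_images` row been thrown away, `count_king_penguins` would have nothing to run against, which is exactly the position described above.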

Resolving the paradox

So now we can square the circle and resolve the paradox. The paradox is in the lack of precision in the statements above. More accurate statements would be:

“The relational model imposes a strict schema on the data to ensure that any question can be asked and answered of atomic data.”

“NoSQL systems employ ‘schema-less’ data storage to ensure that you will be able to ask, and answer, any question of Big Data.”

The paradox disappears.

That’s the theory, but suppose I want to store tweets. Let’s imagine a tweet: “I’m the CFO of a UK bank and I don’t like plums.” This is English and it has a grammatical structure, and I would also say that it is not atomic.

So one answer is simply to store the string and not make any decisions about the questions we are going to ask. As we need to answer particular questions we can write specific functions and/or programs which run against the string and extract data – two such programs might be called, for example, ExtractJobTitle, FindFruitFanciers.
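As a rough sketch of what two such programs might look like, here is a Python version. The function names mirror ExtractJobTitle and FindFruitFanciers from the text, but everything else is an assumption made for illustration: the second tweet, the short lists of job titles and fruits, and the deliberately naive pattern matching, which real text would quickly defeat.

```python
import re

# The first tweet is the one from the text; the second is invented.
tweets = [
    "I'm the CFO of a UK bank and I don't like plums.",
    "I'm the CTO of a startup and I like apples.",
]

# Assumed vocabularies; a real extractor would need far more than this.
JOB_TITLES = ("CEO", "CFO", "CTO")
FRUITS = ("plums", "apples", "pears")


def extract_job_title(tweet):
    """Return the first known job title mentioned in the tweet, or None."""
    for title in JOB_TITLES:
        if re.search(rf"\b{title}\b", tweet):
            return title
    return None


def find_fruit_fanciers(all_tweets):
    """Return (tweet index, fruit) pairs where the author says they like a fruit."""
    pattern = re.compile(r"\bI (?:like|love) (" + "|".join(FRUITS) + r")\b")
    fanciers = []
    for index, tweet in enumerate(all_tweets):
        match = pattern.search(tweet)
        if match:
            fanciers.append((index, match.group(1)))
    return fanciers


titles = [extract_job_title(t) for t in tweets]
fanciers = find_fruit_fanciers(tweets)
```

Each function encodes one question. The stored strings are untouched, so a third question only needs a third function.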

Some people might use something like Aster Data to do this and they might say that they had stored some unstructured data and applied the schema later.

Somebody else might use SQL Server to store the string as a text field in a relational table and then write a set of functions called ExtractJobTitle and FindFruitFanciers. That person might say that they had stored some structured data in an agreed schema.
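The relational variant can be sketched the same way. The example below uses SQLite from Python's standard library as a stand-in for SQL Server, which is an assumption made only so the sketch is runnable anywhere; the table name, column names, and the tiny list of job titles are likewise invented. The function is registered with the engine so that it can be called from SQL.

```python
import re
import sqlite3


def extract_job_title(text):
    """Naive extractor: return the first of a few assumed job titles, or None."""
    match = re.search(r"\b(CEO|CFO|CTO)\b", text)
    return match.group(1) if match else None


db = sqlite3.connect(":memory:")

# The agreed schema: one row per tweet, with the whole string in a text column.
db.execute("CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
db.execute(
    "INSERT INTO tweets (body) VALUES (?)",
    ("I'm the CFO of a UK bank and I don't like plums.",),
)

# Register the Python function so SQL queries can look inside the string.
db.create_function("ExtractJobTitle", 1, extract_job_title)

rows = db.execute("SELECT tweet_id, ExtractJobTitle(body) FROM tweets").fetchall()
```

The table has a fixed schema, yet the job title only comes into existence when the function runs at query time.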

My view is that neither of these statements is a completely accurate description of what is going on.

I would happily store the string in Aster Data but I don’t think a string is inherently unstructured so I wouldn’t say that the data was unstructured. I would, though, agree that Big Data is being stored and the schema is being applied later.

And I would also happily store the string in SQL Server. Since I am aware that the desired analysis requires us to pull information from inside the string, I would agree that the data is being stored by a relational engine, but I would say that this particular database is not relational.

Inside stored strings

I am being very, very pedantic here. Of course in real life I have stored strings and written functions to look inside them and I would never say: “But this isn’t a relational database because this single column here contains non-atomic data.” Life is far too short to be that picky under normal circumstances; I am just being very precise here because we are discussing structure so specifically.

It is interesting, though, to ponder how different we think the two approaches really are. Both store the string and pull it apart later. You can, it seems, argue that, in the case of the relational example, we are storing the string as non-atomic data and then applying the schema later when we write the function and run it.

Isn’t that a schema-later approach? And doesn’t this help bridge the gap between different schools of thought? ®
