How NoSQL graph databases still usurp relational dynasties

Turn the tables on tables

Analysis Despite being assaulted from all sides, the relational model for databases is still the king of the hill and it looks like it will not only survive, but thrive as well.

NoSQL databases have become increasingly popular and have been offering a number of data and deployment modes that have overcome the limitations – real or imagined – of their SQL cousins.

NoSQL databases come in a number of guises, but essentially they are designed either to make the life of the programmer easier or to overcome the problem of distributing data at scale.

Models such as key-value stores (where the value can be anything from a simple string to a complex serialized blob) can assist programmers who want to store a lot of simple data quickly, retrieve it at speed, and in many cases don't really care if it persists in the case of a power cut.

Document databases store JSON-style data whilst indexing it for easy retrieval, allowing the programmer to map RESTFUL interfaces to a store without the hassle of converting it to relational format. Programmers call this impedance mismatch, after the term in electronics which stops you plugging any old instrument into the microphone socket of a recording device. The model inside the programmer's head is so very different from the relational model that it's a real hassle to convert between the two, so use a store that matches the model more closely.

Distributing data at scale is harder for relational databases, but curiously it's not because they are relational. The problem is that ACID (atomicity, consistency, isolation, durability) transactions (which are essential if you want your database to be consistent and giving correct answers) are very hard to do at scale, locking tables across continents is not easy, and the process is made harder as you may have multiple write points.

There are ways to do it, two-phase commit or consensus-do work, but they are clunky and will slow your system down. Scalable NoSQL databases solve the problem by either throwing transactions out or by admitting that your distributed system may not give the same answer across all nodes, or may simply refuse to give the answer.

It's not perfect, but at the moment it's what we have.

Where's the response?

Traditional database vendors, though, are fighting back. Microsoft's SQL server (as of version 2016) offers a way to store and retrieve JSON data in a relatively painless way, although the data itself is stored in the relational engine. You should get all the benefits of a SQL engine without that nasty impedance mismatch programmers hate. Oracle's NoSQL goes further, storing JSON or name value pairs across a sharded, shared nothing system.

Among the newer breed of tech firm – those in the cloud – Google has tackled distributed transactions with Spanner, and with F1 built on top of it, allows a SQL-like environment to be distributed (although it makes use of atomic clocks for time synchronization).

There is, however, one problem that relational databases are simply not good at: graphs. Here's a problem: design a relational database that supports threaded message boards that allow threads of any depth that can be queried from any message to give parents, children and siblings.

I've no doubt that many readers can do it and it's obvious from forum software that it can be done, but solutions are complex and can be hard to understand. This is just a special case of a graph where any node can be joined to any other node with a direction attached to the join.

Actually the joins aren't joins, they are edges; if they were joins then we might think of them as relational joins, which they aren't, as they have a direction which a relational join doesn't. In any case, a network of nodes is very hard (but not impossible) to model in a relational database, and can suffer from massive performance problems. Graph databases solve it by actually linking nodes with software edges. Think of your network of friends – you're in the center (naturally), you've got close friends you really like, some of the friends of your friends are OK, and there are some you dislike even if they seem to like each other. Draw it out and you'll soon see how complicated it is; graph databases eat this sort of data up.

Most modern graph databases can handle a huge number of nodes and edges, both of which can have a large number of properties, and better still, some of them even handle transactions. Neo4j, for instance, can handle tens of billions of nodes and relationships, although speed of querying will depend on the amount of RAM on your server.

Distributed graph databases such as Titan can be tuned to support ACID or eventual consistency and can be configured to run on top of popular NoSQL databases. If you look long enough down the back of your sofa, you might even get enough money to buy a Cray graph database appliance, or you could ask international security agencies if you could borrow theirs.

Although the underlying graph is easy to understand in both systems, the only downside is you will have to learn a new querying language, as SQL really isn't built to query this type of structure. Each of the databases has their own language, so choose wisely based on what it is you're trying to do. Also, you need to look at the architecture – will you need vast distribution or will a couple of high-power servers with lots of memory do the job?

Relational engines might someday find a way to optimize graph-style queries, but writing a SQL statement that will find all my friends' friends that are two nodes away should always cause a program a severe migraine. Graph databases and their languages, on the other hand, are specifically designed for the task. JSON, name value pairs, and columns stores could all get eaten by SQL engines and the problem of distributing transactions might be solved, but it's hard to see the relational engine eating the graph problem any time soon. ®


Biting the hand that feeds IT © 1998–2017