NoSQL's CAP theorem busters: We don't drop ACID
'We are not relational fans here'
CAP theorem* holds no fear for six engineers building FoundationDB, the industry’s latest NoSQL candidate. The difference? It adheres to the principles of ACID** found in relational, which previous NoSQLers have tried to replace.
“A lot of people developing NoSQL systems have been discouraged by the CAP theorem and used that as an excuse for not solving some of the hard problems,” FoundationDB co-founder and MIT computer science graduate Dave Rosenthal told The Reg. Rosenthal started FoundationDB in 2009 with Nick Lavezzo and Dave Scherer.
“It’s a heck of a lot easier to build a database without transaction integrity than with it. If I was staring down the barrel of building a really big database I would use CAP theorem as an excuse not to do that,” he says.
Rosenthal’s team is mid-way through what he calls a “soft” alpha, with a beta due in early 2013. Early interest has been substantial: the project surged to the top of Hacker News during the summer. “We got a lot of questions and requests for the software. We’ve been scrambling to keep up with that,” Rosenthal told us.
Rosenthal also reckons that search traffic to the start-up’s site shows people are looking for “ACID NoSQL” or “ACID key value store.”
“The issue for devs is pretty simple: NoSQL helps solve scaling problems, but throws another monkey on your back - writing code without the guarantees of transactions... If you can solve both problems, it's a real win,” he told us.
'We started reading the fine print on the [NoSQL] databases and none provide transactional integrity' - FoundationDB co-founder Dave Rosenthal
The problem comes back to CAP theorem, articulated by University of California Berkeley professor Eric Brewer, which states it’s impossible for a distributed computer system to simultaneously provide all three of the following: consistency of data across all nodes; availability, meaning requests are still served regardless of a failure in the system; and partition tolerance – the ability to continue working should part of the system break down.
It’s the consistency part that puts the “C” in ACID – atomicity, consistency, isolation and durability. ACID is a fundamental principle of relational databases that has helped make them a multi-billion-dollar mainstay of computing.
CAP theorem clearly poses a theoretical problem for cloud computing, where services are being founded on massively distributed servers for their compute and storage.
Hence, we’ve seen a proliferation of NoSQL databases for use in large, distributed data centres that have jettisoned ACID to achieve scale: column stores Cassandra, from Facebook, and Google’s BigTable, and document stores MongoDB and CouchDB.
But recently there’s been a dawning recognition among NoSQL practitioners and those working in Big Data that the fast-iterating data they process needs to be demonstrably reliable, too. The result has been NoSQL databases adding more relational functionality to their software.
Indicative of the growing interest is Spanner from Google, the poster child of web-scale and distributed computing that helped popularise Big Data with MapReduce. Spanner is Google technology that can dictate where data is stored in a distributed cluster - and then time-stamps it so an application knows which version is current.
FoundationDB is Rosenthal’s second tech venture – he was employee-number-one at web-analytics company Visual Sciences that was sold for $60m to WebSideStory. It’s now part of Adobe's online marketing and web analytics unit Omniture.
Rosenthal said he believes FoundationDB is the next generation of the database market. He says he decided to apply himself, and hire a team, dedicated to solving an obvious problem.
Read the fine print
“Several years ago we looked at the market – the NoSQL database market,” he says. “The attributes are attractive - we are not relational fans here, we are more software engineers than DBAs, so that resonated with us. Also the price, and scale, to deploy on the cloud resonated with us.
“But then we started reading the fine print on the databases and none provide transactional integrity.
"Of course we read about the CAP theorem like everybody else did. But what the CAP theorem says isn’t as strong as what some people make it out to be... CAP theorem sounded scary and scared people building NoSQL databases off doing ACID transactions. So we said: ‘What if we throw out all the features and just try to make a fundamental data structure like a key-value store that’s scalable and has true ACID transactions?'."
“We all come from a theoretical comp sci background - not the traditional start-up hack-it-together mindset. We have an East Coast computer scientists' mindset and we realised what the world needed was just not another website; the world needed a solution to solve this problem. We feel too much of smart people’s time is spent trying to half-solve this problem in their own databases, database centres and their companies. We looked at the problem and decided it can be solved.”
So what are Rosenthal and his team proposing?
Rosenthal describes FoundationDB as more like a back-end storage engine than a full SQL database. FoundationDB is a key-value store – the simplest form of NoSQL database. A key-value store exposes a simple API: put, get and delete.
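To make the put, get and delete interface concrete, here is a minimal in-memory sketch. It is an illustration only: the class name `KeyValueStore` and its method signatures are assumptions for this example and are not FoundationDB's actual API.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Hypothetical in-memory store exposing only put, get and delete."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        # Insert the key, or overwrite the value if it already exists.
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        # Return the stored value, or None when the key is absent.
        return self._data.get(key)

    def delete(self, key: bytes) -> None:
        # Removing a missing key is treated as a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put(b"user:1", b"alice")
print(store.get(b"user:1"))
store.delete(b"user:1")
print(store.get(b"user:1"))
```

Everything else a richer database offers, such as indexes or documents, has to be layered on top of these three calls.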
Having stripped back to basics, FoundationDB adds the ACID transactional capabilities using a network topology that breaks up and assigns workloads to different machines in a cluster, with machines using a new language – called Flow.
One class of machines in the database cluster is used for transaction processing and conflict resolution, to ensure the integrity of data and transactions. Dumb nodes are decoupled from smart transactional nodes, which know about ACID properties and transactions. Writes are filtered back to conflict-resolution machines, to enforce ACID.
“That’s very difficult but it enables a couple of really, really cool things,” Rosenthal tells us. “Once you have a database with transactional integrity you can build up rich data models on top of simpler ones. That means for every other NoSQL database on the market your app has to fit the model of the database.
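The article does not say how the conflict-resolution machines decide which writes to accept. One common technique for this kind of job is optimistic concurrency control, sketched below purely as an assumption: a transaction records the version at which it read its data, and the resolver rejects the commit if any key it read has been written since. The names `ConflictResolver`, `current_version` and `try_commit` are hypothetical and do not describe FoundationDB's design.

```python
from typing import Dict, Set


class ConflictResolver:
    """Toy resolver (assumed design): rejects stale read-then-write commits."""

    def __init__(self) -> None:
        self._version = 0                      # global commit counter
        self._last_write: Dict[str, int] = {}  # key -> version of last write

    def current_version(self) -> int:
        return self._version

    def try_commit(self, read_version: int, read_keys: Set[str],
                   write_keys: Set[str]) -> bool:
        # Conflict: some key this transaction read was written after it started.
        for key in read_keys:
            if self._last_write.get(key, 0) > read_version:
                return False
        # No conflict: assign a new version and record the writes under it.
        self._version += 1
        for key in write_keys:
            self._last_write[key] = self._version
        return True


resolver = ConflictResolver()
start = resolver.current_version()

# Two transactions start from the same snapshot and touch the same key.
first = resolver.try_commit(start, {"balance"}, {"balance"})
second = resolver.try_commit(start, {"balance"}, {"balance"})
print(first, second)
```

In this sketch the second transaction is rejected and would have to retry from a fresh snapshot, which is how the storage nodes can stay "dumb" while one component alone enforces ordering.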
“We think one of the reasons there are so many NoSQL databases on the market is because unless your application perfectly matches their data model, it becomes difficult to build data abstractions,” he says.
Take MongoDB: if you’re building an application for mobile phones and want to add location, you’d need a spatial index and any changes to documents would need to be made to both the documents and index. Without transactional capabilities, you cannot update the index data reliably and you have to go through the extra step of writing business and application logic. A transactional database would let you update the data and index and have confidence in the result.
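The document-plus-index case can be sketched with a toy transaction that buffers its writes and applies them together on commit. This is an illustration under stated assumptions: the `Transaction` class, the `set_location` helper and the key layout (`doc:...`, `index:cell:...`) are hypothetical, and the sketch models only all-or-nothing application of writes, not isolation or durability.

```python
from typing import Dict, Optional


class Transaction:
    """Hypothetical transaction: buffers writes, applies them all on commit."""

    _DELETED = object()  # marker for a buffered delete

    def __init__(self, data: Dict[str, str]) -> None:
        self._data = data
        self._writes: Dict[str, object] = {}

    def get(self, key: str) -> Optional[str]:
        # Reads see this transaction's own buffered writes first.
        if key in self._writes:
            value = self._writes[key]
            return None if value is self._DELETED else value
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._writes[key] = value

    def delete(self, key: str) -> None:
        self._writes[key] = self._DELETED

    def commit(self) -> None:
        # Nothing touches the underlying data until this point.
        for key, value in self._writes.items():
            if value is self._DELETED:
                self._data.pop(key, None)
            else:
                self._data[key] = value
        self._writes.clear()


def set_location(data: Dict[str, str], user_id: str, cell: str) -> None:
    """Update a user's location and the spatial index in one transaction."""
    txn = Transaction(data)
    old_cell = txn.get(f"doc:{user_id}:cell")
    if old_cell is not None:
        txn.delete(f"index:cell:{old_cell}:{user_id}")  # drop stale index entry
    txn.put(f"doc:{user_id}:cell", cell)
    txn.put(f"index:cell:{cell}:{user_id}", "")
    txn.commit()


data: Dict[str, str] = {}
set_location(data, "u1", "cellA")
set_location(data, "u1", "cellB")
print(sorted(data))
```

If anything fails before `commit` is called, neither the document nor the index changes, so the two never disagree. Without that guarantee, the application itself has to detect and repair a half-applied update.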
Helping glue FoundationDB together is Flow, a new language that extends C++ with 10 Erlang-inspired keywords and actor-model concurrency. It took Rosenthal’s team three years to design and build Flow’s tools and the mathematical algorithms but just two weeks to actually build Flow, Rosenthal claimed. “Erlang has some really great programming tools for distributed systems but it’s sort of slow; C++ is fast but doesn’t have great tools. Flow adds tools and lets us write in a concise way,” Rosenthal said.
“The first year-and-a-half of development was spent exclusively in simulation – there was no real code to talk to a real network or hard disk, everything was simulated. Once we got programmatically solid in the simulation we started replacing components with their real world counterparts, and then we started working on the performance, which is crucial; performance can always be improved.”
The program was tested in hundreds of thousands of simulations run at night, which injected failures such as lost packets and rebooted machines. Tests were run on a pair of 96-core clusters using a micro SSD and Intel-based servers.
Looking ahead, Rosenthal doesn’t see FoundationDB as replacing NoSQL or relational approaches, nor does he believe Flow will replace SQL. But he does see FoundationDB getting picked up by web start-ups and already established companies who want to grow rather than those trying to do something like replace an Oracle database.
The Stonebraker factor
You might not be surprised to learn, having read this, that Rosenthal agrees with relational database pioneer Michael Stonebraker – one of the main architects of the first relational database, Ingres, and the object-relational DBMS PostgreSQL. Stonebraker has been flamed by the pups of NoSQL for saying they are wrong to dump ACID, while he’s now behind VoltDB – a post-relational database. “A distributed data store without concurrency control is a toy and makes building things on it a lot harder,” he says.
FoundationDB is emerging as a new breed of NoSQL provider that, from the start, realises the rules of decades of computing have survived for a reason. As far as Rosenthal is concerned, FoundationDB will help bring those familiar with relational into the new world of web-scale NoSQL without making trade-offs.
“Relational is not going away - there’s always going to be apps for it,” Rosenthal says. “What we hear from a lot of the people we talk to is they are looking at possibly using NoSQL and they know NoSQL will become more solid, but they know SQL will scale eventually - so they are hedging.
“We think our early adopters are going to be developers thinking about using other NoSQL systems and they’ll look at this and say the performance is great and it has full transactions. That’s worth switching to,” he said. ®
* The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:
- Consistency - equivalent to having a single up-to-date copy of the data;
- high Availability of that data (for updates); and
- tolerance to network Partitions.
** The ACID principles:
- Atomicity - a transaction is all or nothing
- Consistency - only valid data is written to the database
- Isolation - pretend all transactions are happening serially and the data is correct
- Durability - what you write is what you get