MongoDB straps SQL to Google's MapReduce
One toasting too many for NoSQL?
To NoSQLers he's the Devil who flames their work. Bring up his name while interviewing the CEO or founder of any NoSQL start-up, as I have, and the interviewee withers to a tight smile.
Say "Michael Stonebraker" to the database wizards of today, though, and they'll nod sagely at mention of the pioneer of relational database technology and main architect of INGRES; they believe the NoSQL pups of today are simply re-learning the hard lessons Stonebraker learned years ago.
Not so long ago, NoSQL was hailed by technology hipsters based both mentally and physically in Silicon Valley as the next evolutionary step of the database.
Stonebraker's relational baby had hit a wall: it was a system whose rows, columns, locks and triggers could not scale fast, cheaply or dynamically enough, and could not fluidly process the kind of unstructured data fragments that Tweeting and Facebooking sent storming down the pipe.
MongoDB, CouchDB, Cassandra, MapReduce, Hadoop and more: these were the future – scaling through software, not expensive hardware. Crucially, they also dispensed with needing to grapple with another language: SQL. CouchDB devs can code using Erlang instead while MongoDB uses C++.
The problem for NoSQL has been breaking out from the high-octane, big-data worlds of Twitter and Facebook and into the everyday data world of enterprise IT. In this world, the job of building and running database systems cannot – as is currently the case with NoSQL – remain the preserve of a few rocket-scientist-type engineers. Here, salaried jobs rest on the fact that database transactions operate reliably – and reliability is enshrined in the principles of ACID (atomicity, consistency, isolation and durability). But it is a principle that appears to have been sacrificed by NoSQL.
Yet the pendulum is swinging back, and I speak not just of greater tolerance for relational, as Reg regular Matt Asay writes here.
Later this year, MongoDB will creep closer to the world of relational and it will do so in a way that's designed to rectify one of the deficiencies in NoSQL pin-up MapReduce from Google.
MongoDB 2.2, which just hit testing and is due in a couple of months, introduces a programming framework that brings a particular SQL-like feature to this NoSQL database. That feature lets you easily group query results by one or more columns. Called the New Aggregation Framework, it will see MongoDB emulate the familiar SQL group-by function.
MongoDB uses Google's MapReduce for complex analytical tasks; MapReduce lets you batch process petabytes of data using parallel computing while abstracting away the complexity for the programmer. The "map" part of MapReduce provides data transformation while the "reduce" part, er, reduces...
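To make the "map" and "reduce" halves concrete, here is a minimal single-machine sketch in Python. It is not MongoDB's or Google's implementation and does nothing in parallel; the order records, field names and function names are all invented for illustration. It totals order amounts per customer, the sort of job SQL handles with a group-by.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical order records; the field names are invented for illustration.
ORDERS = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 15},
    {"customer": "alice", "amount": 20},
]


def map_order(order):
    """Map step: transform one record into a (key, value) pair."""
    return order["customer"], order["amount"]


def reduce_amounts(left, right):
    """Reduce step: fold two values that share a key into one."""
    return left + right


def map_reduce(records, mapper, reducer):
    """Run the map step, collect values by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        key, value = mapper(record)
        grouped[key].append(value)
    return {key: reduce(reducer, values) for key, values in grouped.items()}


totals = map_reduce(ORDERS, map_order, reduce_amounts)
print(totals)  # {'alice': 50, 'bob': 15}
```

Even for a simple sum, the programmer writes two functions plus the plumbing to connect them, which gives a feel for the verbosity at issue.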
That's where the Framework comes in; while still feeding on MapReduce, it provides a declarative programming model to cut down on the amount of code you hack for queries. It also maps to C++.
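As a rough picture of what "declarative" means here, the sketch below describes a group-by as data, written in the shape of a MongoDB `$group` stage, and hands it to a toy evaluator. The evaluator `run_group` is a hypothetical stand-in written for this example, not MongoDB code; it understands only the `$sum` accumulator, and the order records are invented.

```python
# Hypothetical order records; the field names are invented for illustration.
ORDERS = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 15},
    {"customer": "alice", "amount": 20},
]

# The query is plain data: group by customer, sum the amounts.
# Roughly: SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer
GROUP_STAGE = {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}}


def run_group(records, stage):
    """Toy evaluator for a $group-shaped stage; handles only {"$sum": "$field"}."""
    spec = stage["$group"]
    key_field = spec["_id"].lstrip("$")
    results = {}
    for record in records:
        key = record[key_field]
        row = results.setdefault(key, {"_id": key})
        for out_name, accumulator in spec.items():
            if out_name == "_id":
                continue  # the grouping key is not an accumulator
            source_field = accumulator["$sum"].lstrip("$")
            row[out_name] = row.get(out_name, 0) + record[source_field]
    return list(results.values())


rows = run_group(ORDERS, GROUP_STAGE)
print(rows)  # [{'_id': 'alice', 'total': 50}, {'_id': 'bob', 'total': 15}]
```

The caller states what result is wanted and writes no map or reduce function; how the grouping gets done is left to the engine.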
Dwight Merriman, the CEO of 10gen, which provides MongoDB support and training, told The Reg on a trip to London last week: "We are building the Aggregation Framework to group by - that's consistent with the way people are using MongoDB. It's more like SQL in that it's declarative."
Merriman, who cut his teeth as co-founder and chief technology officer for DoubleClick - the mega ads network bought by Google in 2007 for $3.1bn - doffed his hat to SQL and relational. While he said MapReduce is capable of so much, he also conceded it's "a little verbose".
"SQL and relational are really good at reporting. This [the Framework] is rounding out the solution to be great at that too. SQL group by is very powerful but MapReduce is much more."
Merriman keeps the NoSQL faith, though. He believes the New Aggregation Framework can be even simpler than using SQL as it's implemented in databases such as Oracle. "It's cleaner to build a query," he said. "If you want to build a query for Oracle, for example, you have to do string concatenation to do the SQL statement. We are writing a query generator."
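Merriman's contrast between string concatenation and a generated query can be sketched in a few lines of Python. Both functions below are hypothetical helpers written for this example, not part of any Oracle or MongoDB driver, and the table and column names are invented. The first glues a SQL statement together as text; the second builds a filter as a plain data structure, in the style of a MongoDB query document.

```python
def sql_by_concatenation(table, column, value):
    """Build a SQL statement by gluing strings together."""
    return "SELECT * FROM " + table + " WHERE " + column + " = '" + value + "'"


def query_as_data(column, value):
    """Build a filter as a plain dict, in the style of a MongoDB query document."""
    return {column: value}


surname = "O'Brien"

sql_text = sql_by_concatenation("customers", "surname", surname)
query_doc = query_as_data("surname", surname)

# The apostrophe in the name leaves the concatenated SQL with unbalanced quotes.
print(sql_text)   # SELECT * FROM customers WHERE surname = 'O'Brien'
print(query_doc)  # {'surname': "O'Brien"}
```

The structured form never mixes values into query text, so there is nothing to escape. SQL drivers do offer parameterised statements that avoid the same problem, so this illustrates the concatenation case Merriman describes rather than a limit of SQL itself.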
Merriman also defended NoSQL's ACID compromise. MongoDB restricts atomic operations to the document level, which he argues covers the "majority" of cases.
"You can do atomic operations on a document in MongoDB, it is durable and it's consistent and isolated but it's only ACID at the document level. It won't do it outside of doc because it could be on different servers," Merriman said. "You have enough ACID for an e-commerce system but you wouldn't build a general ledger system."
While there might be trade-offs for web apps, many - especially those bears in enterprise IT - would disagree that something can be "ACID enough" and argue that something is either ACID or it is not. Surrendering ACID is especially risky because devs will naturally assume its properties exist in the database they're targeting and will save their apps or their data should there be a problem. Without ACID in the database, extra care must be taken.
Articulating this concern is MarkLogic, an XML database provider of 10 years whose non-relational document store now plugs into Hadoop but which also plays it old-skool by adhering to ACID. Vice president of product strategy David Gorbet called it "dangerous" to drop ACID.
"Most people assume ACID is across all documents or entities in the data store. If you have to put an asterisk next to that it’s a buyer beware situation," he told us. "If you are building a web site and you know what the data model looks like you can make that decision... if you are building multiple applications on top of a single instance of data and you don’t know all the scenarios - that can be dangerous."
Despite such considerations, Merriman's company claims customers are buying in. It cites Spanish telco Telefonica as one of its customers, now running seven MongoDB projects, up from an initial one.
Have the hatchets been buried? MongoDB at least accepts relational has some good points, but respect is selective it seems. Expect more flames and sulphur. ®