MongoDB straps SQL to Google's MapReduce
One toasting too many for NoSQL?
To NoSQLers he's the Devil who flames their work. Bring up his name while interviewing the CEO or founder of any NoSQL start-up, as I have, and the interviewee withers to a tight smile.
Say "Michael Stonebraker" to the database wizards of today, though, and they'll nod sagely at mention of the pioneer of relational database technology and main architect of INGRES; they believe the NoSQL pups of today are simply re-learning the hard lessons Stonebraker solved years ago.
Not so long ago, NoSQL was hailed by technology hipsters based both mentally and physically in Silicon Valley as the next evolutionary step of the database.
Stonebraker's relational baby had hit a wall: a system whose rows, columns, locks and triggers were unable to scale fast, cheaply or dynamically enough, and unable to process fluidly enough the kind of unstructured data fragments that Tweeting and Facebooking sent storming down the pipe.
MongoDB, CouchDB, Cassandra, MapReduce, Hadoop and more: these were the future – scaling through software, not expensive hardware. Crucially, they also dispensed with needing to grapple with another language: SQL. CouchDB devs can code using Erlang instead while MongoDB uses C++.
The problem for NoSQL has been successfully breaking out from the high-octane, big-data worlds of Twitter and Facebook and into the everyday data world of enterprise IT. In this world, the job of building and running database systems cannot – as is currently the case with NoSQL – remain the preserve of a few rocket-scientist-type engineers. Here, salaried jobs rest on the fact that database transactions operate reliably – and reliability is enshrined in the principles of ACID (atomicity, consistency, isolation and durability). But it is a principle that appears to have been sacrificed by NoSQL.
Yet the pendulum is swinging back, and I speak not just of greater tolerance for relational, as Reg regular Matt Asay writes here.
Later this year, MongoDB will creep closer to the world of relational and it will do so in a way that's designed to rectify one of the deficiencies in NoSQL pin-up MapReduce from Google.
MongoDB 2.2, which just hit testing and is due in a couple of months, introduces a programming framework that brings a particular SQL-like feature to this NoSQL database. That feature lets you easily group query results by one or more columns. Called the New Aggregation Framework, it will see MongoDB emulate the familiar SQL group-by function.
MongoDB uses Google's MapReduce for complex analytical tasks; MapReduce lets you batch process petabytes of data using parallel computing while abstracting away the complexity for the programmer. The "map" part of MapReduce provides data transformation while the "reduce" part, er, reduces...
MapReduce might be a rockstar for NoSQLers and an inspiration for Hadoop, but it's not good for batching up results. That's a problem because customers want the group-by functionality of SQL, the evil language used to manage data in relational databases. You can get this feature right now with MapReduce, yes, but not without custom coding some JavaScript.
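To see why a group-by feels like hard work in MapReduce, here is a minimal sketch of the pattern in Python rather than the JavaScript MongoDB expects. The sample `orders` data, the function names and the single-process `map_reduce` driver are all assumptions made for illustration; a real system would spread the map and reduce steps across many machines.

```python
from collections import defaultdict

# Hypothetical sample data: one dict per order document.
orders = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 20},
    {"customer": "alice", "amount": 12},
]

def map_order(doc):
    """Map step: emit one (key, value) pair per document."""
    yield doc["customer"], doc["amount"]

def reduce_amounts(key, values):
    """Reduce step: collapse every value emitted for one key."""
    return sum(values)

def map_reduce(docs, map_fn, reduce_fn):
    """Toy single-process driver: collect emitted pairs by key, then reduce."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

totals = map_reduce(orders, map_order, reduce_amounts)
print(totals)  # {'alice': 42, 'bob': 20}
```

Two hand-written functions and a driver, all to reproduce what SQL would express as a single `GROUP BY` clause.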
That's where the Framework comes in; while still feeding on MapReduce, it provides a declarative programming model to cut down on the amount of code you hack for queries. It also maps to C++.
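What "declarative" means here can be sketched in a few lines of Python. The `pipeline` below uses the `$group` and `$sum` operators of MongoDB's aggregation framework; the sample `orders` data and the `run_group_stage` evaluator are hypothetical stand-ins so the example runs without a server, and they say nothing about how MongoDB executes the stage internally.

```python
# A pipeline is a list of stages, each a plain dict. This one is roughly:
#   SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer
pipeline = [
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},
]

# Hypothetical sample data: one dict per order document.
orders = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 20},
    {"customer": "alice", "amount": 12},
]

def run_group_stage(docs, stage):
    """Toy evaluator (an assumption, not MongoDB code) for one $group stage
    whose accumulators are all of the form {"$sum": "$field"}."""
    spec = stage["$group"]
    key_field = spec["_id"].lstrip("$")
    # Map each output name to the input field it sums.
    sums = {name: acc["$sum"].lstrip("$")
            for name, acc in spec.items() if name != "_id"}
    results = {}
    for doc in docs:
        key = doc[key_field]
        row = results.setdefault(key, {"_id": key, **{n: 0 for n in sums}})
        for name, field in sums.items():
            row[name] += doc[field]
    return list(results.values())

result = run_group_stage(orders, pipeline[0])
print(result)  # [{'_id': 'alice', 'total': 42}, {'_id': 'bob', 'total': 20}]
```

The query itself is the four-line `pipeline`; everything else is scaffolding. Against a live database, a driver such as PyMongo would take a pipeline of this shape through `collection.aggregate(pipeline)`.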
Dwight Merriman, the CEO of 10gen, which provides MongoDB support and training, told The Reg on a trip to London last week: "We are building the Aggregation Framework to group by - that's consistent with the way people are using MongoDB. It's more like SQL in that it's declarative."
Merriman, who cut his teeth as co-founder and chief technology officer for DoubleClick - the mega ads network bought by Google in 2007 for $3.1bn - doffed his hat to SQL and relational. While saying MapReduce is capable of so much, he also conceded it's "a little verbose".
"SQL and relational are really good at reporting. This [the Framework] is rounding out the solution to be great at that too. SQL group by is very powerful but MapReduce is much more."
Merriman keeps the NoSQL faith, though. He believes the New Aggregation Framework can be even simpler than using SQL as it's implemented in databases such as Oracle. "It's cleaner to build a query," he said. "If you want to build a query for Oracle, for example, you have to do string concatenation to do the SQL statement. We are writing a query generator."
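Merriman's contrast between string concatenation and a generated query can be sketched as follows. Both function names, the table and the field names are hypothetical; the `$match`, `$gte`, `$group` and `$sum` operators are those of MongoDB's aggregation framework. This illustrates the idea only and is not 10gen's query generator.

```python
def sql_by_concatenation(table, group_col, sum_col, min_amount):
    """Assemble the SQL text piece by piece, as string-based clients often do.
    A misplaced space or quote yields a malformed statement."""
    return (
        "SELECT " + group_col + ", SUM(" + sum_col + ") AS total"
        + " FROM " + table
        + " WHERE " + sum_col + " >= " + str(min_amount)
        + " GROUP BY " + group_col
    )

def pipeline_as_data(group_field, sum_field, min_amount):
    """Build the equivalent query as nested data: no statement text to parse."""
    return [
        {"$match": {sum_field: {"$gte": min_amount}}},
        {"$group": {"_id": "$" + group_field,
                    "total": {"$sum": "$" + sum_field}}},
    ]

sql = sql_by_concatenation("orders", "customer", "amount", 10)
pipeline = pipeline_as_data("customer", "amount", 10)
print(sql)
print(pipeline)
```

The structured version can be composed, inspected and extended stage by stage with ordinary list and dict operations, which is presumably what makes it "cleaner to build" in Merriman's eyes.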
Merriman also defended NoSQL's ACID compromise. MongoDB can do atomic operations at the document level, which he argues covers the "majority" of cases.
"You can do atomic operations on a document in MongoDB, it is durable and it's consistent and isolated but it's only ACID at the document level. It won't do it outside of doc because it could be on different servers," Merriman said. "You have enough ACID for an e-commerce system but you wouldn't build a general ledger system."
While there might be trade-offs for web apps, many - especially those bears in enterprise IT - would disagree that something can be "ACID enough" and argue that a database either is ACID or is not. Surrendering ACID is especially risky because devs tend to assume its properties exist in the database they're targeting and will save their apps or data should there be a problem. Without ACID in the database, extra care must be taken.
Articulating this concern is MarkLogic, a 10-year-old XML database provider whose non-relational document store now plugs into Hadoop but which also plays it old-skool by adhering to ACID. Vice president of product strategy David Gorbet called it "dangerous" to drop ACID.
"Most people assume ACID is across all documents or entities in the data store. If you have to put an asterisk next to that it’s a buyer beware situation," he told us. "If you are building a web site and you know what the data model looks like you can make that decision... if you are building multiple applications on top of a single instance of data and you don’t know all the scenarios - that can be dangerous."
Despite such considerations, Merriman's company claims customers are buying in. It cites Spanish telco Telefonica as one of its customers, now running seven MongoDB projects, up from an initial one.
Have the hatchets been buried? MongoDB at least accepts relational has some good points, but respect is selective it seems. Expect more flames and sulphur. ®
COMMENTS
SQL
Just what is the problem so many people have with SQL and relational concepts in general?
SQL is a small, simple language. It's not without a few ugly features, but it's easy to learn, clearly defined and consistent. Relational databases are simple enough that Edgar Codd could define one in 12 rules. (Simple, that is, from the perspective of the schema designer and application developer - DB administration is something else.)
But numerous apparently intelligent developers seem to fear these things as children fear to go in the dark. I'm forever hearing bleats of "do we really have to have a primary key?", as if the cost of the key was being deducted from their salary. Time and again I start work with a database and find that the schema was created by people who couldn't see the point of foreign keys, so the tables are full of junk values. And don't get me started on normalization.
This stuff is all so difficult that we have to re-invent the wheel, only this time we'll make it elliptical.
NoSQL without ACID is like a Ford-Explorer brakes...
Both CouchDB and MongoDB have a friendly JavaScript Object Notation interface that integrates nicely with web technology, but duck the last of the ACID (atomic, consistent, isolated, durable) criteria because “durable” is the only one that really distinguishes a “database” from a memory cache.
MongoDB has added logging primarily for recovery, and with SSD drives and fast tick-time gets pretty close to ACID durability, but without guarantee it is just like driving an old Ford Explorer that carries big loads in comfort and rarely crashes with brake failure: good for moving manure, but not for moving kids.
The current crop of NoSQL is good for social networks but not for financial transactions. The answer is not necessarily transactional logging like {Ingres,DB2,Oracle,SQLServer}, but could use the distributed callback method of message-queues.
CAP Theorem
Y'know, there's a reason why it's called the "CAP Theorem", and not the "CAP Hypothesis".
A database can't be consistent and available and partition-tolerant all at once. That means any database - SQL or not, cloudy or not - can't both be ACID and tolerant of partitioning (due to network problems, for example). And that means any ACID "cloud" database is not so cloudy after all; if it's really distributed, then it's not really ACID.
Many applications can do without true consistency or atomicity (often by repairing the data when the network is re-established), or can live with being unavailable if the network's partitioned. But an always-on, true-ACID database cannot be subject to partitioning. That's a fundamental limit on distributing data.
