SciDB: Relational daddy answers Google, Hadoop, NoSQL
Stonebraker doesn't drop ACID
Battle of the rows
Stonebraker reckons the relational staples such as logging, locking, latching and buffer management, which helped pioneer and maintain a crucial feature of databases - data integrity according to the atomicity, consistency, isolation and durability (ACID) principles - have also become their biggest burden. The processing needed just to make these features work soaks up 90 per cent of a transaction's time in terms of CPU cycles, slowing performance and wasting power.
The serial inventor's answer to this particular problem was initially VoltDB. His database speeds things up by moving data into memory and using distributed data partitioning with multi-core processors and server memory. ACID is retained because VoltDB uses single-threaded partitions that run autonomously while data is replicated in a cluster for high availability.
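The idea of single-threaded partitions can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not VoltDB's actual code: the names `Partition`, `Cluster` and `deposit` are hypothetical, and replication, rollback and error handling are left out. Each partition owns its slice of the in-memory data and runs transactions one at a time, so the data itself needs no locks or latches.

```python
import queue
import threading


class Partition:
    """One shard of the data, touched only by its own worker thread."""

    def __init__(self):
        self.rows = {}              # in-memory key -> value store
        self.inbox = queue.Queue()  # transactions waiting to run
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Transactions run serially, each to completion, so nothing else
        # can see a half-finished update to self.rows.
        # (Assumption: transactions do not raise; a real system would
        # abort and roll back instead.)
        while True:
            txn, reply = self.inbox.get()
            reply.put(txn(self.rows))

    def submit(self, txn):
        reply = queue.Queue(maxsize=1)
        self.inbox.put((txn, reply))
        return reply.get()          # block until the transaction finishes


class Cluster:
    """Routes each transaction to the partition that owns its key."""

    def __init__(self, n_partitions):
        self.partitions = [Partition() for _ in range(n_partitions)]

    def execute(self, key, txn):
        part = self.partitions[hash(key) % len(self.partitions)]
        return part.submit(txn)


def deposit(key, amount):
    """Build a transaction that adds `amount` to the balance under `key`."""
    def txn(rows):
        rows[key] = rows.get(key, 0) + amount
        return rows[key]
    return txn
```

Many clients can call `cluster.execute("acct-1", deposit("acct-1", 1))` at once; because every update to a given key is funnelled through one thread, the read-modify-write inside `deposit` cannot interleave with another.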
VoltDB claims to be 45 times faster than an Oracle relational database on a Dell PowerEdge R610 cluster based on Intel's Xeon 5550 with near-linear scaling on a 12-node cluster. VoltDB was the product of the H-Store project, a collaboration between Stonebraker's MIT home, Brown University, Yale University and Hewlett-Packard Labs.
Before VoltDB, there was Vertica. This used a column-oriented, shared-nothing architecture with a massively parallel processing (MPP) engine and data compression to reduce storage and speed queries. Vertica claims query results between 50 and 200 times faster than databases that store data in rows. Vertica started as the C-Store project also with Brown and MIT, plus Brandeis University and University of Massachusetts, Boston.
Stonebraker reckons columnar databases are quicker than databases that store data in rows because they know what they are looking for. They don't need to waste time sorting rows.
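A rough sketch of the difference between the two layouts, in Python. This is not Vertica's implementation: the sample data and the helper names `to_columns` and `run_length_encode` are made up for illustration, and run-length encoding stands in for whatever compression a real column store uses. The point is that an aggregate over one field only has to read that field's column, and a column of similar values compresses well.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "east", "sales": 120},
    {"id": 2, "region": "east", "sales": 80},
    {"id": 3, "region": "west", "sales": 200},
    {"id": 4, "region": "west", "sales": 50},
]


def to_columns(records):
    """Pivot row records into one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


columns = to_columns(rows)

# Row store: the query walks every record to pull out one field.
total_from_rows = sum(record["sales"] for record in rows)

# Column store: the query reads only the "sales" column.
total_from_columns = sum(columns["sales"])

# Repetitive columns shrink under simple compression.
region_runs = run_length_encode(columns["region"])
```

Both totals come to 450, and `region_runs` is `[["east", 2], ["west", 2]]`: four stored values reduced to two runs.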
VoltDB, Vertica, and - soon - SciDB take Stonebraker into a growing tussle against NoSQL over which architecture is "right" in a fight for mindshare and for customers. SciDB is itself listed as a NoSQL database.
Facing off against SciDB, Vertica and VoltDB in a range of scenarios are Hadoop, MapReduce, Cassandra, CouchDB, Amazon's SimpleDB and Memcached - the latter being the distributed memory caching companion to MySQL used for scale and speed. Helping push them are their creators such as Google and Amazon or startups like Cloudera, mega-scale customers such as Twitter and Facebook, and an army of evangelists convinced that NoSQL is the future.
Sparks flew between Stonebraker and the NoSQL movement in 2008 when the relational expert incensed MapReduce fans in a joint blog post with DeWitt by calling MapReduce a "giant step backward in the programming paradigm for large-scale data intensive applications".
Stonebraker and DeWitt professed amazement at the hype over how MapReduce represented a "paradigm shift in the development of scalable, data-intensive applications" and called MapReduce a good idea for writing "certain types" of general-purpose computations but lacking many tools and features commonly associated with DBMS that users have come to depend on.
Bloggers stormed back, damning these "so-called" database experts for "not getting" data in the cloud and - like jealous suitors jumping to their lover's defense - demanded a retraction of this "highly inaccurate article" as if it had slandered their beloved MapReduce.
Most missed the point: Stonebraker and DeWitt weren't calling MapReduce a bad database. They were picking up on the fact that MapReduce - like its open-source clone Hadoop - is being used as if it were a database, with customers dumping more data into it on a daily basis and then needing to transact on and analyze that data. It's a problem that's been creeping into Memcached and NoSQL, with people now trying to make Memcached and NoSQL work with relational databases.
Was Stonebraker surprised by the flames?
"The NoSQL guys are people who know nothing about databases and their first reaction is to lash out, so I'm not surprised [by the reaction]," he said.
"Talk to the MapReduce guys and they are fanatical about 'not invented here'... MapReduce was written by people who don't understand databases at all," an unapologetic Stonebraker continued. "They produced a thing that worked for their crawling applications. MapReduce was written to support the processing pipeline behind Google."
Turning MapReduce and Hadoop into databases would take a long time and a huge rewrite to inject things like data repositories, indexes, query languages and updates.
Does he recant in the face of such a flaming? Far from it. He's as critical as ever.
"If you are over 35, you are over the hill apparently in math," he claimed. "In computer science, the grey beards like me are still viable, and it's for this reason that what goes around comes around. The young guys haven't seen it before and the problem with our computer science education system is the lessons from the past seem to get lost."
And, it would seem, Google agrees with him.
Accidental SQL supporter
Stonebraker's got little time for those who claim it's the language that's slowing down databases serving big data. Hadoop is written in Java, CouchDB in Erlang, and the in-memory key-value cache Memcached in C. For Stonebraker, the interface is the problem, not the language. Hence VoltDB has been rewritten to remove 90 per cent of the overhead associated with OLTP.
"I'm not a particular fan of SQL but I don't mind it. Jettisoning it just to, say, 'get record' is a huge mistake."
Interestingly, Stonebraker built Ingres around QUEL and left SQL to Ellison. The industry, and history, swung behind SQL, helping catapult Oracle to today's number-one position while Ingres didn't switch to SQL until version six in the mid-1990s - too late to catch Oracle.