Original URL: http://www.theregister.co.uk/2013/05/01/mapr_hadoop_m7_edition_solr/

MapR revs up HBase queries with M7 Hadoop distro

Solr search engine means elephants don't need to chew big data cud

By Timothy Prickett Morgan

Posted in Cloud, 1st May 2013 22:06 GMT

You are not just imagining it. Every commercial distributor of the Hadoop system for storing and chewing through unstructured data has come up with its own way to deliver something akin to SQL query functionality while at the same time boosting the speed of ad hoc queries.

MapR Technologies is one of the earlier Hadoop disties and cooked up a proprietary file system that can look like the familiar Network File System of Unix and Linux or the Hadoop Distributed File System. The company promised last fall to work its file system magic on the HBase database layer in Hadoop, and now that code is ready for prime-time.

MapR gave a preview of its HBase acceleration plans back in October when it first showed off its top-end M7 Edition. MapR now has three different distributions.

M3 Edition is the freebie community version that allows users to mount HDFS just as if it was an NFS volume, among other extra goodies that are added to the Apache Hadoop distribution; it does not have commercial support, but only community support through the web.

With the M5 Edition, MapR offers tech support for a fee and turns on other features of its super-secret file system that allow for mirroring and snapshots of data sets, high availability clustering for the Hadoop JobTracker, a distributed NameNode (for both high availability and cluster scaling), and a heatmap to show what is going on inside of a Hadoop cluster. (Apache Hadoop has no such graphical tool, which makes a Hadoop cluster a bit of a black box when it is running.)

With the M7 Edition, which starts shipping today, MapR is speeding up the HBase database layer for HDFS by pushing it down into that proprietary and most certainly closed source file system.

Just like MapR has been able to speed up HDFS by breaking data into chunks and spreading it all around the MapR distributed file system, the M7 Edition takes chunks of HBase database tables and breaks them up and stores them in the MapR file system in close proximity to their raw files.
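To make the chunking idea concrete, here is a minimal Python sketch of splitting a table into contiguous row-key ranges and spreading those ranges across nodes. MapR's file system is closed source, so nothing here reflects its actual implementation: the function names, the fixed chunk size, and the round-robin placement policy are all assumptions made for illustration.

```python
from bisect import bisect_right


def split_into_chunks(rows, chunk_size):
    """Sort (key, value) rows by key and cut them into contiguous key-range chunks."""
    ordered = sorted(rows)
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]


def place_chunks(chunks, nodes):
    """Assign chunks to nodes round-robin (a hypothetical placement policy)."""
    placement = {node: [] for node in nodes}
    for index, chunk in enumerate(chunks):
        placement[nodes[index % len(nodes)]].append(chunk)
    return placement


def find_chunk(chunks, key):
    """Return the index of the chunk whose key range would contain `key`."""
    first_keys = [chunk[0][0] for chunk in chunks]
    # bisect_right finds the first chunk starting after `key`; step back one.
    return max(bisect_right(first_keys, key) - 1, 0)


rows = [("row%02d" % i, i) for i in range(10)]
chunks = split_into_chunks(rows, chunk_size=4)
placement = place_chunks(chunks, ["node1", "node2"])
```

Because each chunk covers a contiguous key range, a lookup only needs the first key of every chunk to work out which one to visit, which is what `find_chunk` does with a binary search.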

The beauty is that even with this big architectural change, MapR continues to support all of the same HDFS and HBase APIs, and thus applications running atop M7 Edition have no idea that they are not running on the plain vanilla Apache Hadoop.

Well, until they see the performance, says Jack Norris, vice president of marketing at MapR. The company has spent the past six months tweaking and tuning M7 to speed up HBase queries.

Sociocast, one of the early testers for the M7 Edition code, found that the MapR setup ran HBase queries about three times faster than the kosher HBase and was twice as fast as Cassandra, the NoSQL data store that Facebook originally developed to power its inbox search.

Cassandra has what is called "eventual consistency" for its database, which is not a good fit for a lot of enterprise customers, who want absolute consistency. (We went into detail about how the M7 Edition works here.) MapR has done tests that show it can do over 1 million operations per second on M7 Edition on a ten node cluster, and the HBase clone is able to support up to 1 trillion tables.
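For readers who want the difference between "eventual" and "absolute" consistency spelled out, the toy Python model below may help. It is a sketch only and does not represent how Cassandra, HBase, or M7 actually replicate data; the class names and the explicit `sync()` step are invented for illustration.

```python
class Replica:
    """One copy of the data, held as a plain dictionary."""

    def __init__(self):
        self.data = {}


class EventualStore:
    """Writes land on one replica first; the others catch up when sync() runs."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # queued (replica_index, key, value) updates

    def write(self, key, value):
        self.replicas[0].data[key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        """Apply all queued updates, bringing every replica up to date."""
        for index, key, value in self.pending:
            self.replicas[index].data[key] = value
        self.pending.clear()


class StrongStore(EventualStore):
    """A write does not return until every replica has the new value."""

    def write(self, key, value):
        super().write(key, value)
        self.sync()
```

With `EventualStore`, a read from replica 2 straight after a write can still return the old value (or nothing at all) until `sync()` has run. With `StrongStore`, every replica answers with the new value as soon as `write()` returns, at the cost of a slower write.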

It doesn't look like Apache Drill, the project that MapR created to clone Google's Dremel query tool, made it into the M7 Edition release. Dremel is used for Google's BigQuery service (which came out of beta last May), which does SQL queries against the BigTable database layer that rides atop Google's current file system. HBase is a clone of sorts of BigTable, and clearly you need to do Drill to clone Dremel if you want to be all Googley and such. The Hive SQL-alike query tool, which compiles queries down into MapReduce batch jobs, is better than nothing – but just barely.
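To see why compiling SQL down to MapReduce feels batch-like, consider a query along the lines of SELECT page, COUNT(*) FROM visits GROUP BY page. The Python sketch below walks through the map, shuffle, and reduce steps such a query might turn into. It is a simplified illustration, not Hive's actual query planner, and the `visits` records and function names are made up.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (page, 1) pair for every input record."""
    for record in records:
        yield record["page"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to produce the COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(records):
    return reduce_phase(shuffle(map_phase(records)))


visits = [{"page": "/a"}, {"page": "/b"}, {"page": "/a"}]
counts = count_by_page(visits)
```

Every record has to pass through all three phases before a single row of the answer comes back, which is fine for overnight reports and painful for ad hoc poking around.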

In addition to rolling out the M7 Edition, MapR is also doing a private beta for an integrated search engine for Hadoop that it has developed in conjunction with LucidWorks, which was founded in 2007 and is the main company behind the Lucene and Solr search engines, employing about a quarter of the committers on those Apache projects.

Just like HDFS data chunks and HBase table segments are being stored in that proprietary MapR file system, the search engine indexes for data stored in that file system are being spread around the file system and not housed outside of it. That means you can clone, snapshot, and do other natural acts of a file system on the search engine indices. You can also create indices for subsets of data rather than everything.
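As a rough illustration of indexing a subset of the data and snapshotting the result, here is a small Python sketch of an inverted index. It does not use Solr or Lucene and says nothing about how MapR stores its indexes; the `include` filter, the `snapshot` helper, and the sample documents are all hypothetical.

```python
import copy
from collections import defaultdict


def build_index(documents, include=lambda doc_id, text: True):
    """Build a term -> set of document IDs index over the documents that pass `include`."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        if not include(doc_id, text):
            continue
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def snapshot(index):
    """Return an independent point-in-time copy of the index."""
    return copy.deepcopy(index)


def search(index, term):
    """Return the sorted document IDs that contain `term`."""
    return sorted(index.get(term.lower(), set()))


docs = {
    "logs/1": "disk error on node",
    "logs/2": "node restarted",
    "mail/1": "error report",
}
full_index = build_index(docs)
logs_index = build_index(docs, include=lambda doc_id, text: doc_id.startswith("logs/"))
frozen = snapshot(logs_index)
```

Here `logs_index` covers only the documents under `logs/`, and `frozen` keeps answering from the state it captured even if `logs_index` is updated afterwards.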

The Solr search engine capability will be woven into the M3, M5, and M7 editions. It is in private beta testing now, and Norris says there are no plans for a public beta at the moment. It is expected to be generally available in the next quarter and to be used by business managers and analysts who just want to poke around in the data to see what is in there.

MapR did not provide pricing for the M7 Edition, but says that it costs $4,000 per node per year for the M5 Edition and thus you can expect to pay more for the one that gooses HBase speeds. ®