MapR revs up HBase queries with M7 Hadoop distro

Solr search engine means elephants don't need to chew big data cud

You are not just imagining it. Every commercial distributor of the Hadoop system for storing and chewing through unstructured data has come up with its own way to deliver something akin to SQL query functionality while at the same time boosting the speed of ad hoc queries.

MapR Technologies is one of the earlier Hadoop disties and cooked up a proprietary file system that can look like the familiar Network File System of Unix and Linux or the Hadoop Distributed File System. The company promised last fall to work its file system magic on the HBase database layer in Hadoop, and now that code is ready for prime-time.

MapR outlined its HBase acceleration plans back in October when it previewed its top-end M7 Edition. MapR now has three different distributions.

M3 Edition is the freebie community version that allows users to mount HDFS just as if it was an NFS volume, among other extra goodies that are added to the Apache Hadoop distribution; it does not have commercial support, but only community support through the web.

With the M5 Edition, MapR offers tech support for a fee and turns on other features of its super-secret file system that allow for mirroring and snapshots of data sets, high availability clustering for the Hadoop JobTracker, a distributed NameNode (for both high availability and cluster scaling), and a heatmap to show what is going on inside of a Hadoop cluster. (Apache Hadoop has no such graphical tool, which makes a Hadoop cluster a bit of a black box when it is running.)

With the M7 Edition, which starts shipping today, MapR is speeding up the HBase database layer for HDFS by pushing it down into that proprietary and most certainly closed source file system.

Just like MapR has been able to speed up HDFS by breaking data into chunks and spreading it all around the MapR distributed file system, the M7 Edition takes chunks of HBase database tables and breaks them up and stores them in the MapR file system in close proximity to their raw files.

The beauty is that even with this big architectural change, MapR continues to support all of the same HDFS and HBase APIs, and thus applications running atop M7 Edition have no idea that they are not running on the plain vanilla Apache Hadoop.

Well, until they see the performance, says Jack Norris, vice president of marketing at MapR. The company has spent the past six months tweaking and tuning M7 to speed up HBase queries.

Sociocast, one of the early testers for the M7 Edition code, found that the MapR setup ran HBase queries about three times faster than the kosher HBase and was twice as fast as Cassandra, the NoSQL data store that Facebook developed as a replacement for HDFS because it was having such big performance issues with HBase.

Cassandra has what is called "eventual consistency" for its database, which is not a good fit for a lot of enterprise customers, who want absolute consistency. (We went into detail about how the M7 Edition works here.) MapR has done tests that show it can do over 1 million operations per second on M7 Edition on a ten node cluster, and the HBase clone is able to support up to 1 trillion tables.

It doesn't look like Apache Drill, the project that MapR created to clone Google's Dremel query tool, made it into the M7 Edition release. Dremel is used for Google's BigQuery service (which came out of beta last May), which does SQL queries against the BigTable database layer that rides atop Google's current file system. HBase is a clone of sorts of BigTable, and clearly you need to do Drill to clone Dremel if you want to be all Googley and such. The Hive SQL-alike query tool, which compiles queries down into MapReduce batch jobs, is better than nothing – but just barely.
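To make the point about SQL-alike queries compiling down to MapReduce concrete, here is a toy sketch in Python of how a query such as SELECT page, COUNT(*) ... GROUP BY page can be split into a map phase, a shuffle, and a reduce phase. It illustrates the general technique only, not Hive's actual planner or any MapR code; the sample rows and all function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table of page views.
rows = [
    {"page": "/home", "user": "ann"},
    {"page": "/docs", "user": "bob"},
    {"page": "/home", "user": "cat"},
]

def map_phase(row):
    # Emit one (key, value) pair per input row, keyed by the GROUP BY column.
    yield row["page"], 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Collapse each key's values into the aggregate (COUNT(*) here).
    return key, sum(values)

def run_query(table):
    # Run the three phases in sequence over an in-memory table.
    pairs = [pair for row in table for pair in map_phase(row)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(run_query(rows))
```

On a real cluster each phase runs as a batch job across many nodes, with data written out between the phases, which is a large part of why this route is slow for ad hoc queries.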

In addition to rolling out the M7 Edition, MapR is also doing a private beta for an integrated search engine for Hadoop that it has developed in conjunction with LucidWorks, which was founded in 2007 and is the main company behind the Lucene and Solr search engines, employing about a quarter of the committers on those Apache projects.

Just like HDFS data chunks and HBase table segments are being stored in that proprietary MapR file system, the search engine indexes for data stored in that file system are being spread around the file system and not housed outside of it. That means you can clone, snapshot, and do other natural acts of a file system on the search engine indices. You can also create indices for subsets of data rather than everything.

The Solr search engine capability will be woven into the M3, M5, and M7 editions. It is in private beta testing now, and Norris says there are no plans for a public beta at the moment. It is expected to be generally available in the next quarter and to be used by business managers and analysts who just want to poke around in the data to see what is in there.

MapR did not provide pricing for the M7 Edition, but says that it costs $4,000 per node per year for the M5 Edition and thus you can expect to pay more for the one that gooses HBase speeds. ®
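For a rough sense of scale, the M5 figure can be multiplied out per cluster. The sketch below assumes a flat $4,000 per node per year with no volume discounts, which the article does not confirm, and the cluster sizes are arbitrary examples. M7 pricing was not disclosed, so it is not modelled.

```python
# US dollars per node per year for M5 Edition, as quoted in the article.
M5_PRICE_PER_NODE_PER_YEAR = 4_000

def annual_m5_cost(nodes: int) -> int:
    # Assumes flat per-node pricing; real contracts may differ.
    return nodes * M5_PRICE_PER_NODE_PER_YEAR

# Cluster sizes chosen only for illustration.
for nodes in (10, 50, 100):
    print(f"{nodes} nodes: ${annual_m5_cost(nodes):,} per year")
```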
