MapR revs up HBase queries with M7 Hadoop distro

Solr search engine means elephants don't need to chew big data cud

You are not just imagining it. Every commercial distributor of the Hadoop system for storing and chewing through unstructured data has come up with its own way to deliver something akin to SQL query functionality while at the same time boosting the speed of ad hoc queries.

MapR Technologies is one of the earlier Hadoop disties and has cooked up a proprietary file system that can look like the familiar Network File System of Unix and Linux or like the Hadoop Distributed File System. The company promised last fall to work its file system magic on the HBase database layer in Hadoop, and now that code is ready for prime time.

MapR gave a preview of its HBase acceleration plans back in October when it unveiled its top-end M7 Edition. MapR now has three different distributions.

M3 Edition is the freebie community version that allows users to mount HDFS just as if it were an NFS volume, among other extra goodies that are added to the Apache Hadoop distribution; it comes with no commercial support, only community support through the web.

With the M5 Edition, MapR offers tech support for a fee and turns on other features of its super-secret file system that allow for mirroring and snapshots of data sets, high availability clustering for the Hadoop JobTracker, a distributed NameNode (for both high availability and cluster scaling), and a heatmap to show what is going on inside of a Hadoop cluster. (Apache Hadoop has no such graphical tool, which makes a Hadoop cluster a bit of a black box when it is running.)

With the M7 Edition, which starts shipping today, MapR is speeding up the HBase database layer for HDFS by pushing it down into that proprietary and most certainly closed source file system.

Just like MapR has been able to speed up HDFS by breaking data into chunks and spreading it all around the MapR distributed file system, the M7 Edition takes chunks of HBase database tables and breaks them up and stores them in the MapR file system in close proximity to their raw files.
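
The chunk-and-spread idea can be illustrated with a toy model. This is a minimal sketch under loud assumptions, not MapR's actual design: the function names, the fixed chunk size, and the round-robin placement policy are all hypothetical, since the file system is closed source. It only shows how a table sorted by row key can be cut into contiguous chunks, spread over nodes, and searched by key.

```python
from bisect import bisect_right


def split_into_chunks(sorted_keys, chunk_size):
    """Cut a sorted list of row keys into contiguous chunks of at most chunk_size."""
    return [sorted_keys[i:i + chunk_size]
            for i in range(0, len(sorted_keys), chunk_size)]


def place_chunks(chunks, nodes):
    """Spread chunks over nodes round-robin (an assumed placement policy)."""
    placement = {node: [] for node in nodes}
    for index, chunk in enumerate(chunks):
        placement[nodes[index % len(nodes)]].append(chunk)
    return placement


def find_chunk(chunks, row_key):
    """Return the index of the chunk whose key range should hold row_key."""
    start_keys = [chunk[0] for chunk in chunks]
    # The owning chunk is the last one whose first key is <= row_key.
    return max(bisect_right(start_keys, row_key) - 1, 0)


if __name__ == "__main__":
    keys = [f"row{i:03d}" for i in range(10)]
    chunks = split_into_chunks(keys, 4)
    print(place_chunks(chunks, ["node-a", "node-b"]))
    print(find_chunk(chunks, "row005"))
```

Because the chunks stay sorted by their first key, a lookup needs only a binary search over chunk boundaries before it touches a single node.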

The beauty is that even with this big architectural change, MapR continues to support all of the same HDFS and HBase APIs, and thus applications running atop M7 Edition have no idea that they are not running on the plain vanilla Apache Hadoop.

Well, until they see the performance, says Jack Norris, vice president of marketing at MapR. The company has spent the past six months tweaking and tuning M7 to speed up HBase queries.

Sociocast, one of the early testers for the M7 Edition code, found that the MapR setup ran HBase queries about three times faster than the kosher HBase and was twice as fast as Cassandra, the NoSQL data store originally developed at Facebook.

Cassandra has what is called "eventual consistency" for its database, which is not a good fit for a lot of enterprise customers, who want absolute consistency. (We went into detail about how the M7 Edition works here.) MapR has done tests that show it can do over 1 million operations per second on M7 Edition on a ten-node cluster, and the HBase clone is able to support up to 1 trillion tables.

It doesn't look like Apache Drill, the project that MapR created to clone Google's Dremel query tool, made it into the M7 Edition release. Dremel is used for Google's BigQuery service (which came out of beta last May), which does SQL queries against the BigTable database layer that rides atop Google's current file system. HBase is a clone of sorts of BigTable, and clearly you need to do Drill to clone Dremel if you want to be all Googley and such. The Hive SQL-alike query tool, which compiles queries down into MapReduce batch jobs, is better than nothing – but just barely.
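
To see what "compiling a query down into MapReduce" means in practice, here is a toy, in-memory sketch of how a query such as SELECT page, COUNT(*) ... GROUP BY page decomposes into map, shuffle, and reduce steps. It is not Hive's planner or the Hadoop API; every function name is hypothetical, and a real job would run these phases across a cluster as a batch, which is where the latency comes from.

```python
from collections import defaultdict


def map_phase(rows, key_column):
    """Emit a (group key, 1) pair for every input row."""
    for row in rows:
        yield row[key_column], 1


def shuffle(pairs):
    """Gather all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_group_by(rows, key_column):
    """Toy equivalent of SELECT key_column, COUNT(*) ... GROUP BY key_column."""
    return reduce_phase(shuffle(map_phase(rows, key_column)))


if __name__ == "__main__":
    page_views = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
    print(count_group_by(page_views, "page"))
```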

In addition to rolling out the M7 Edition, MapR is also doing a private beta for an integrated search engine for Hadoop that it has developed in conjunction with LucidWorks, which was founded in 2007 and is the main company behind the Lucene and Solr search engines, employing about a quarter of the committers on those Apache projects.

Just like HDFS data chunks and HBase table segments are being stored in that proprietary MapR file system, the search engine indexes for data stored in that file system are being spread around the file system and not housed outside of it. That means you can clone, snapshot, and do other natural acts of a file system on the search engine indices. You can also create indices for subsets of data rather than everything.

The Solr search engine capability will be woven into the M3, M5, and M7 editions. It is in private beta testing now, and Norris says there are no plans for a public beta at the moment. It is expected to be generally available in the next quarter and to be used by business managers and analysts who just want to poke around in the data to see what is in there.

MapR did not provide pricing for the M7 Edition, but says that it costs $4,000 per node per year for the M5 Edition, and thus you can expect to pay more for the one that gooses HBase speeds. ®
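
For a rough sense of scale, the M5 list price quoted above works out as follows. This is a back-of-the-envelope sketch only: the function name is hypothetical, the ten-node cluster size is borrowed from MapR's benchmark setup as an example, and no volume discounts or M7 pricing are assumed because MapR disclosed neither.

```python
M5_PRICE_PER_NODE_PER_YEAR = 4_000  # US dollars, the figure MapR quoted for M5


def annual_m5_cost(node_count):
    """Return the yearly M5 subscription cost in dollars for a cluster."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return node_count * M5_PRICE_PER_NODE_PER_YEAR


if __name__ == "__main__":
    # A ten-node cluster: 10 nodes at $4,000 per node per year.
    print(annual_m5_cost(10))
```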
