Cloudera brings Hadoop to the masses with Solr search

Now anyone can query petabytes of gobbledygook

Gartner critical capabilities for enterprise endpoint backup

Commercial Hadoop supplier Cloudera is adding more methods to extract data stored in the Hadoop Distributed File System by rolling up the Solr search engine and hooking it into its CDH distro.

The company is also banging the drum that we are entering a new era of computing, one in which old-style relational databases will still have a role in transaction processing and analytics – but a much more diminished one.

Solr is an Apache project just like many of the elements of the Hadoop stack and is a variant of the Lucene search engine created by Doug Cutting, one of the creators of the Nutch web crawler fifteen years ago. Cutting was working at search engine giant Yahoo! (before it decided to piggyback on Microsoft Bing for search, and had its own search engine) when he read the Google paper on MapReduce, and extended Nutch with Tom White to create Hadoop. Solr was created by Yonik Seeley (not Cutting as this article originally said).

For the past several years, Cutting has been chief architect at Cloudera, and the wonder is what took so long for search to be added to the Hadoop stack officially. For whatever reason, there is no better person than Cutting to do the job.

Cloudera is not going to be the first commercial Hadoop distributor or large NoSQL data store supplier to adopt Solr as a search engine for unstructured data. DataStax, which has commercialized the Cassandra NoSQL data store originally created by Facebook added Solr search for Cassandra back in March 2012, and MapR Technologies added Solr search to its M7 Hadoop distro last month, but as is the case with Cloudera, it is only in beta at the moment.

Cloudera CEO Mike Olson says that Cloudera Search, as the feature will be known in the CDH stack, has been in private beta for a number of months and is being made available for public beta testing now. Cloudera Manager 4.6, the control freak for the CDH stack, has been tweaked to install Solr search and to monitor it as well.

Cloudera Search is being distributed as a separate download, but the next release of CDH will have Solr search rolled up alongside MapReduce batch and Impala SQL query methods for tickling data stored in HDFS. General availability is expected sometime in the third quarter, but Olson says it is subject to change. And like the Interactive Query (Impala) feature, Cloudera Search (Solr) will have an additional support fee above and beyond the base CDH support fee.

"The key benefit is that anybody can now use this platform," says Olson. "When Hadoop first appeared on the market, the knock against it by the existing analytics vendors was that you had to learn this new MapReduce thing and you have got to be a Java programmer. We have added SQL, but there are people who don't know that language, either. People want to search for data they know exists in their cluster, but with a petabyte of data, there is not set of folders that makes sense any more. What we have learned from Google is that we just want to type terms into a search box."

The Solr search engine can be used to index data as it is being ingested into HDFS or HBase and then embed it into HDFS for future searching. In some cases, using Solr to search through data will be sufficient to the task, and in others, end users will just use Solr to do data exploration before they write a MapReduce routine in Java, kick off a query against an HBase table, or even run Impala SQL queries against HDFS.

Cloudera wants to commercialize a bunch of different engines to search HDFS and HBase

Cloudera wants to commercialize a bunch of different engines to search HDFS and HBase

Providing access to data in HDFS doesn't end at MapReduce, HBase, Impala, and Solr, says Olson. "Watch this space, because we will add other engines over time because what companies want is to access the same data without making copies."

With all of the expanded capabilities of Hadoop, which are speeding up query times as the system moves from batch to near-realtime processing, and the significantly lower cost of storing data in Hadoop compared to traditional data warehouses powered by parallel relational databases, Olson thinks that the center of gravity for analytics is shifting away from relational tools to Hadoop.

"If you are paying by the terabyte, then these numbers on data warehouses get pretty scary pretty fast," says Olson.

Moreover, customers have different kinds of data than these warehouses were designed to store, and they are asking different kinds of questions as well of a mix of data types from varied data sources. The street price of a data warehouse is something on the order of $20,000 per terabyte, according to Olson, while it is on the order of $500 per terabyte for a Hadoop cluster. And so, performing data cleansing and doing extraction/test/load operations on data in a traditional warehouse can be very pricey indeed.

"On a data warehouse, every workload you have is flying first class," he quips. "Hadoop is not only cheaper, but you get a faster time to insight. And, you can move data transformation and analysis to Hadoop and free up capacity on the warehouse to do other work without spending more money there."

It will be many years before most corporations are ready to give up their data marts and data warehouses, but the economics of the situation and the improving query and analytics tools in Hadoop are certainly going to make them stop and think. This is why Teradata and Oracle should probably have their own Hadoop distros at some point, like IBM has, instead of partnering with Hortonworks and Cloudera, respectively.

You want to be the next Red Hat more than you want to be its reseller. ®

Secure remote control for conventional and virtual desktops

More from The Register

next story
The Return of BSOD: Does ANYONE trust Microsoft patches?
Sysadmins, you're either fighting fires or seen as incompetents now
Microsoft: Azure isn't ready for biz-critical apps … yet
Microsoft will move its own IT to the cloud to avoid $200m server bill
US regulators OK sale of IBM's x86 server biz to Lenovo
Now all that remains is for gov't offices to ban the boxes
Flash could be CHEAPER than SAS DISK? Come off it, NetApp
Stats analysis reckons we'll hit that point in just three years
Oracle reveals 32-core, 10 BEEELLION-transistor SPARC M7
New chip scales to 1024 cores, 8192 threads 64 TB RAM, at speeds over 3.6GHz
Object storage bods Exablox: RAID is dead, baby. RAID is dead
Bring your own disks to its object appliances
Nimble's latest mutants GORGE themselves on unlucky forerunners
Crossing Sandy Bridges without stopping for breath
prev story


Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Top 10 endpoint backup mistakes
Avoid the ten endpoint backup mistakes to ensure that your critical corporate data is protected and end user productivity is improved.
Top 8 considerations to enable and simplify mobility
In this whitepaper learn how to successfully add mobile capabilities simply and cost effectively.
Rethinking backup and recovery in the modern data center
Combining intelligence, operational analytics, and automation to enable efficient, data-driven IT organizations using the HP ABR approach.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.