Cloudera brings Hadoop to the masses with Solr search

Now anyone can query petabytes of gobbledygook

Commercial Hadoop supplier Cloudera is adding more methods to extract data stored in the Hadoop Distributed File System by rolling up the Solr search engine and hooking it into its CDH distro.

The company is also banging the drum that we are entering a new era of computing, one in which old-style relational databases will still have a role in transaction processing and analytics – but a much more diminished one.

Solr is an Apache project, like many elements of the Hadoop stack, and is built on the Lucene search engine created by Doug Cutting, one of the creators of the Nutch web crawler fifteen years ago. Cutting was working at search engine giant Yahoo! (back when it still had its own search engine, before it decided to piggyback on Microsoft Bing) when he read the Google paper on MapReduce, and he extended Nutch with Tom White to create Hadoop. Solr itself was created by Yonik Seeley (not Cutting, as this article originally said).

For the past several years, Cutting has been chief architect at Cloudera, and the wonder is why it took so long for search to be officially added to the Hadoop stack. Whatever the reason for the delay, there is no better person than Cutting to do the job.

Cloudera is not the first commercial Hadoop distributor or large NoSQL data store supplier to adopt Solr as a search engine for unstructured data. DataStax, which has commercialized the Cassandra NoSQL data store originally created by Facebook, added Solr search for Cassandra back in March 2012, and MapR Technologies added Solr search to its M7 Hadoop distro last month – though, as with Cloudera, it is only in beta at the moment.

Cloudera CEO Mike Olson says that Cloudera Search, as the feature will be known in the CDH stack, has been in private beta for a number of months and is being made available for public beta testing now. Cloudera Manager 4.6, the control freak for the CDH stack, has been tweaked to install Solr search and to monitor it as well.

Cloudera Search is being distributed as a separate download, but the next release of CDH will have Solr search rolled up alongside MapReduce batch and Impala SQL query methods for tickling data stored in HDFS. General availability is expected sometime in the third quarter, but Olson says it is subject to change. And like the Interactive Query (Impala) feature, Cloudera Search (Solr) will have an additional support fee above and beyond the base CDH support fee.

"The key benefit is that anybody can now use this platform," says Olson. "When Hadoop first appeared on the market, the knock against it by the existing analytics vendors was that you had to learn this new MapReduce thing and you have got to be a Java programmer. We have added SQL, but there are people who don't know that language, either. People want to search for data they know exists in their cluster, but with a petabyte of data, there is no set of folders that makes sense any more. What we have learned from Google is that we just want to type terms into a search box."

The Solr search engine can be used to index data as it is being ingested into HDFS or HBase, with the resulting index stored in HDFS for future searching. In some cases, searching with Solr will be sufficient for the task; in others, end users will use Solr for data exploration before writing a MapReduce routine in Java, kicking off a query against an HBase table, or running Impala SQL queries against HDFS.
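To make that workflow concrete, here is a minimal sketch of what querying an indexed collection looks like through Solr's standard HTTP select handler. The host, port, and collection name ("logs") are placeholders, not anything Cloudera has documented – the actual endpoints depend on how Cloudera Search is deployed on a given cluster.

```python
from urllib.parse import urlencode

def solr_select_url(base_url, collection, query, rows=10):
    """Build a URL for Solr's standard /select search handler.

    base_url and collection are assumptions for illustration; real
    values depend on the cluster's Cloudera Search configuration.
    """
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{params}"

# Free-text search over a hypothetical "logs" collection: no Java,
# no MapReduce job -- just a search-box-style query over HDFS data.
url = solr_select_url("http://search-node:8983", "logs", "error AND timeout")
print(url)
```

Fetching that URL (with any HTTP client) returns JSON results, which is what lets non-programmers explore a petabyte-scale cluster the way they would use a web search box.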

Cloudera wants to commercialize a bunch of different engines to search HDFS and HBase

Providing access to data in HDFS doesn't end with MapReduce, HBase, Impala, and Solr, says Olson. "Watch this space, because we will add other engines over time. What companies want is to access the same data without making copies."

With all of the expanded capabilities of Hadoop, which are speeding up query times as the system moves from batch to near-realtime processing, and the significantly lower cost of storing data in Hadoop compared to traditional data warehouses powered by parallel relational databases, Olson thinks that the center of gravity for analytics is shifting away from relational tools to Hadoop.

"If you are paying by the terabyte, then these numbers on data warehouses get pretty scary pretty fast," says Olson.

Moreover, customers have different kinds of data than these warehouses were designed to store, and they are asking different kinds of questions of a mix of data types from varied sources. The street price of a data warehouse is on the order of $20,000 per terabyte, according to Olson, while a Hadoop cluster comes in at around $500 per terabyte. And so performing data cleansing and extract/transform/load (ETL) operations on data in a traditional warehouse can be very pricey indeed.
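Olson's per-terabyte figures make the gap easy to quantify. The arithmetic below uses the article's order-of-magnitude street prices and assumes a round 1,000 TB per petabyte; the exact numbers will vary by vendor and configuration.

```python
# Order-of-magnitude street prices per Olson: data warehouse at
# $20,000/TB vs a Hadoop cluster at $500/TB.
WAREHOUSE_PER_TB = 20_000
HADOOP_PER_TB = 500

def storage_cost(terabytes, price_per_tb):
    """Total storage cost in dollars for the given capacity."""
    return terabytes * price_per_tb

petabyte = 1_000  # assuming a round 1,000 TB per PB

warehouse = storage_cost(petabyte, WAREHOUSE_PER_TB)  # $20,000,000
hadoop = storage_cost(petabyte, HADOOP_PER_TB)        # $500,000
print(warehouse // hadoop)  # the warehouse costs 40x as much
```

At a petabyte, that is roughly $20m versus $500,000 – a 40x spread, which is why Olson's "paying by the terabyte" numbers get scary so fast.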

"On a data warehouse, every workload you have is flying first class," he quips. "Hadoop is not only cheaper, but you get a faster time to insight. And, you can move data transformation and analysis to Hadoop and free up capacity on the warehouse to do other work without spending more money there."

It will be many years before most corporations are ready to give up their data marts and data warehouses, but the economics of the situation and the improving query and analytics tools in Hadoop are certainly going to make them stop and think. This is why Teradata and Oracle should probably have their own Hadoop distros at some point, like IBM has, instead of partnering with Hortonworks and Cloudera, respectively.

You want to be the next Red Hat more than you want to be its reseller. ®
