Cloudera brings Hadoop to the masses with Solr search

Now anyone can query petabytes of gobbledygook

Commercial Hadoop supplier Cloudera is adding more methods to extract data stored in the Hadoop Distributed File System by rolling up the Solr search engine and hooking it into its CDH distro.

The company is also banging the drum that we are entering a new era of computing, one in which old-style relational databases will still have a role in transaction processing and analytics – but a much diminished one.

Solr is an Apache project, like many elements of the Hadoop stack, and is built on top of the Lucene search engine that Doug Cutting created some fifteen years ago before going on to co-create the Nutch web crawler. Cutting was working at search engine giant Yahoo! (back when it still had its own search engine, before it decided to piggyback on Microsoft's Bing) when he read the Google paper on MapReduce and extended Nutch with Mike Cafarella to create Hadoop. Solr itself was created by Yonik Seeley (not Cutting, as this article originally said).

Cutting has been chief architect at Cloudera for the past several years, and the wonder is why it took so long for search to be added officially to the Hadoop stack. Whatever the reason for the delay, there is no better person than Cutting to do the job.

Cloudera is not the first commercial Hadoop distributor or large NoSQL data store supplier to adopt Solr as a search engine for unstructured data. DataStax, which has commercialized the Cassandra NoSQL data store originally created by Facebook, added Solr search for Cassandra back in March 2012, and MapR Technologies added Solr search to its M7 Hadoop distro last month – though, as is the case with Cloudera, that integration is only in beta at the moment.

Cloudera CEO Mike Olson says that Cloudera Search, as the feature will be known in the CDH stack, has been in private beta for a number of months and is being made available for public beta testing now. Cloudera Manager 4.6, the control freak for the CDH stack, has been tweaked to install Solr search and to monitor it as well.

Cloudera Search is being distributed as a separate download, but the next release of CDH will have Solr search rolled up alongside MapReduce batch and Impala SQL query methods for tickling data stored in HDFS. General availability is expected sometime in the third quarter, but Olson says it is subject to change. And like the Interactive Query (Impala) feature, Cloudera Search (Solr) will have an additional support fee above and beyond the base CDH support fee.

"The key benefit is that anybody can now use this platform," says Olson. "When Hadoop first appeared on the market, the knock against it by the existing analytics vendors was that you had to learn this new MapReduce thing and you have got to be a Java programmer. We have added SQL, but there are people who don't know that language, either. People want to search for data they know exists in their cluster, but with a petabyte of data, there is not set of folders that makes sense any more. What we have learned from Google is that we just want to type terms into a search box."

The Solr search engine can be used to index data as it is ingested into HDFS or HBase, with the resulting indexes stored in HDFS for future searching. In some cases, using Solr to search through data will be sufficient for the task; in others, end users will use Solr to do data exploration before they write a MapReduce routine in Java, kick off a query against an HBase table, or even run Impala SQL queries against HDFS.
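
Cloudera Search wires Solr into the cluster itself, but purely as an illustration of the programming model, here is a minimal sketch using the stock SolrJ client library that indexes a record at ingest time and then runs the kind of free-text query Olson describes. The endpoint URL, collection name, and field names are hypothetical stand-ins, not Cloudera's actual configuration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class SearchSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical Solr endpoint and collection -- adjust for a real cluster.
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://solr-node.example.com:8983/solr/logs").build()) {

                // Index a record as it is ingested.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "event-42");
                doc.addField("text", "checkout service timeout in us-east");
                solr.add(doc);
                solr.commit();

                // The "type terms into a search box" experience, programmatically.
                QueryResponse rsp = solr.query(new SolrQuery("timeout"));
                for (SolrDocument hit : rsp.getResults()) {
                    System.out.println(hit.getFieldValue("id"));
                }
            }
        }
    }

Because Cloudera Search keeps its indexes in HDFS, the same data ingested once can be hit by Solr queries, MapReduce jobs, or Impala SQL without making copies.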

Cloudera wants to commercialize a bunch of different engines to search HDFS and HBase

Providing access to data in HDFS doesn't end at MapReduce, HBase, Impala, and Solr, says Olson. "Watch this space, because we will add other engines over time because what companies want is to access the same data without making copies."

With the expanded capabilities of Hadoop speeding up query times as the system moves from batch to near-realtime processing, and with storing data in Hadoop costing significantly less than in traditional data warehouses powered by parallel relational databases, Olson thinks the center of gravity for analytics is shifting away from relational tools to Hadoop.

"If you are paying by the terabyte, then these numbers on data warehouses get pretty scary pretty fast," says Olson.

Moreover, customers have different kinds of data than these warehouses were designed to store, and they are asking different kinds of questions of a mix of data types from varied data sources. The street price of a data warehouse is something on the order of $20,000 per terabyte, according to Olson, while it is on the order of $500 per terabyte for a Hadoop cluster – a 40-fold gap that at petabyte scale works out to roughly $20m against $500,000. And so, performing data cleansing and doing extract/transform/load (ETL) operations on data in a traditional warehouse can be very pricey indeed.

"On a data warehouse, every workload you have is flying first class," he quips. "Hadoop is not only cheaper, but you get a faster time to insight. And, you can move data transformation and analysis to Hadoop and free up capacity on the warehouse to do other work without spending more money there."

It will be many years before most corporations are ready to give up their data marts and data warehouses, but the economics of the situation and the improving query and analytics tools in Hadoop are certainly going to make them stop and think. This is why Teradata and Oracle should probably have their own Hadoop distros at some point, like IBM has, instead of partnering with Hortonworks and Cloudera, respectively.

You want to be the next Red Hat more than you want to be its reseller. ®
