Original URL: http://www.theregister.co.uk/2011/08/04/dell_cloudera_hadoop_stack/

Dell floats Cloud Hadoop clusters

Stuffed elephant rides PowerEdge iron

By Timothy Prickett Morgan

Posted in Cloud, 4th August 2011 11:00 GMT

Dell might not own a lot of systems or middleware software a la IBM, Oracle, and Hewlett-Packard, but it wants to sell configured stacks tuned for specific work just like its rivals in the systems racket.

The company already peddles pre-configured Ubuntu/Eucalyptus and OpenStack systems for building and managing private clouds, and now, it's expanding out into big data munching with a pre-built Hadoop cluster based on the software stack from Cloudera.

Hadoop is a set of programs originally created by that mimic the functioning of Google's MapReduce data muncher and its related Google File System. Cloudera, which launched in March 2009, is the first of a handful of organizations providing commercial Hadoop distributions.

The company is a big contributor to the open source Apache Hadoop project, along with Yahoo!, where the project was originally bootstrapped. IBM, which could have partnered with Cloudera, chose instead to create its own Hadoop distribution, called InfoSphere BigInsights in May 2010. MapR kicked out its own commercially supported Hadoop in May of this year, and Hortonworks, a spinout directly from Yahoo, joined the fray in June 2011.

Cloudera offers its own open source distro (the Cloudera Distribution for Hadoop (CDH)) and an enterprise version (Cloudera Enterprise) that has extra goodies in it for which it can charge money. These extra goodies are not open sourced.

For its Dell-Cloudera stack, Dell is using the Cloudera Enterprise CDH3 version, which can run atop Red Hat Enterprise Linux 5 and 6, its CentOS 5 clone, SUSE Linux Enterprise Server 11, and Ubuntu Server 10.04 LTS and 10.10. Dell is at the moment putting RHEL in the Dell-Cloudera reference architecture, but you are allowed to tweak this and still order a single stack preconfigured. CDH3 Update 1 was just released on July 22.

The Cloudera stack includes the core Apache Hadoop, which has the MapReduce and distributed file system, as well as the companion Apache Hive (SQL-like query), Pig (a high-level programming language for Hadoop), HBase (a column-oriented distributed data store modeled after Google's BigTable), and ZooKeeper (a configuration server for clusters). The stack also includes Cloudera's SCM Express, a management server for configuring Hadoop clusters.

"There are hundreds of settings for a Hadoop configuration, and this makes specific setting recommendations based on the hardware and software that customers are deploying on," Ed Albanese, head of business development at Cloudera tells El Reg. The settings for deploying on a small Hadoop cluster can be radically different from those needed on a mid-sized or large cluster, and SCM Express makes it possible for companies to get the right settings without having to be Hadoop experts like Yahoo!

Dell-Cloudera block diagram

Block diagram of the Dell-Cloudera Hadoop stack (click to enlarge)

Dell is also tossing in its Crowbar tool, which it created for OpenStack-based private clouds running on its PowerEdge-C servers, which were announced last week. Crowbar works in concert with SCM Express to do BIOS configuration, RAID array setup, network setup, operating system deployment, and manage the provisioning of Hadoop software on the bare metal iron from Dell.

John Igoe, executive director of cloud software solutions at Dell, says that for a typical company that is not all that familiar with Hadoop, it can take days to weeks to manually configure a Hadoop cluster with 20, 30, or 40 nodes. But with the combination of the Cloudera Enterprise software and Crowbar, companies can go from a bare-metal rack to running Hadoop data-crunching jobs in less than a day.

Speed to deployment is probably not as important as ease of use, however. "I think there is a substantial opportunity for Hadoop, but we are still in the product's infancy," says Igoe. "Current Hadoop customers have a very deep software bench. But other companies that want to use Hadoop don't have these skills, and they are looking for is to give them those skills."

Stacking it all up

Dell's reference architecture recommends that customers choose the company's PowerEdge-C 2100 energy efficient rack servers, which are optimized to cram the most computing performance in the smallest space with the least amount of extraneous hardware. This is the same iron that Dell used for its Ubuntu Enterprise Cloud (UEC) pre-fabbed private clouds that came out in March, but is different from the PowerEdge C6100 cookie-sheet servers that are being used in the preconfigured OpenStack clouds launched last week.

Dell-Cloudera servers

Server layout in the Dell-Cloudera Hadoop stack (click to enlarge)

The Dell-Cloudera stack is built under the assumption that companies will start out small but grow fast. It consists of six PowerEdge C2100 server nodes and six (yes, six) PowerConnect 6248 48-port Gigabit Ethernet switches. There are two Hadoop master nodes (sometimes called a name node) that manage the Hadoop Distributed File System (HDFS) and MapReduce task distribution across the cluster. These master nodes can also run ZooKeeper. The edge node, of which there is only one in the base configuration, runs Pig and Hive and is the interface between users and the cluster. The slave nodes, of which there are three, run MapReduce and HDFS store data on local disk drives and chews on it as instructed by the master nodes.

Slaves do all the Hadoop work

Here are the feeds and speeds on the iron underneath the Dell-Cloudera bundle:

Dell-Cloudera server configurations

Hadoop server configuration in the Dell-Cloudera bundle

Dell is using two-socket C2100 servers through the whole cluster, and recommends six-core Xeon E5645 processors running at 2.4GHz and with 12MB of cache for the master and edge nodes; these machines get six 600GB SAS drives. Dell recommends that slave nodes get four-core Xeon E5606 processors running at 2.13GHz and with only 8MB of cache, which is less compute and heat, while jacking up the machines to a dozen 1TB SATA drives spinning at 7200 RPM. The master and edge nodes have RAID data protection, while the slaves are just a bunch of disks hanging off an LSI2008 SATA controller.

This base Hadoop cluster can't do much, obviously, with only three slaves. But you can gang up the slave nodes into pods of 60 slave nodes, and then put a dozen of these together to create a Hadoop cluster that scales up to 720 slaves.

To give you a sense of scale, Yahoo! had 40,000 servers running Hadoop across dozens of clusters at the beginning of this year running various workloads, and expected to significantly increase that server count before the end of 2011. The largest Yahoo! Hadoop clusters have around 4,000 machines.

Under the deal that Dell and Cloudera have worked out, they are offering collaborative support contracts for the hardware or software, according to Albanese. Customers can call either Dell or Cloudera for their tech support questions, but obviously software questions will be routed to Cloudera and hardware issues will fall in Dell's lap. Dell's sales force is the only one peddling the combined solution, however.

Albanese declined to comment on whether other server makers had lined up to create similar Hadoop stacks on their servers. But Hewlett-Packard and Fujitsu would be obvious choices. It seems far more likely that Oracle will create its own Hadoop distribution before partnering with Cloudera or anyone else with a commercially supported Hadoop stack. Silicon Graphics, Cray, and Super Micro as well as a slew of whitebox vendors might be interested in creating their own Hadoop stacks, too. IBM already has its own BigInsights distribution and would probably not be interested unless customers started clamoring for it.

The Dell-Cloudera stack will start shipping later in August, according to Igoe. Pricing for that base configuration, including a year of support for the Cloudera stack, will run between $118,000 and $124,000. The higher price tag is for 24x7 tech support.

Dell has been doing Hadoop engagements, sometimes on the open source software and sometimes in conjunction with Cloudera, and early on companies in the financial services, Internet media and hosting, and retailing businesses have shown the most interest thus far, says Igoe. He offers some interesting observations.

First, brick and mortar retailers as well as online retailers are very keen on using Hadoop to improve their sales and better understand what their customers and prospects are doing (and not doing) when they visit their Web sites. Interestingly, retailers are looking at Hadoop as a means of aggregating log files across their myriad IT equipment behind the firewall as a means of cutting their teeth on Hadoop processing before they apply the technology to their retailing data.

Ditto for a number of hosting companies, who have homegrown and third-party system monitoring tools to keep track of what is going on in their infrastructure. They are looking at Hadoop as an alternative to these tools. Having mastered Hadoop on these internal tasks, companies will build out their clusters and aggregate their Web and customer data and do analytics on it. ®