Feeds

Dell floats Cloud Hadoop clusters

Stuffed elephant rides PowerEdge iron

Internet Security Threat Report 2014

Dell might not own a lot of systems or middleware software a la IBM, Oracle, and Hewlett-Packard, but it wants to sell configured stacks tuned for specific work just like its rivals in the systems racket.

The company already peddles pre-configured Ubuntu/Eucalyptus and OpenStack systems for building and managing private clouds, and now, it's expanding out into big data munching with a pre-built Hadoop cluster based on the software stack from Cloudera.

Hadoop is a set of programs originally created by that mimic the functioning of Google's MapReduce data muncher and its related Google File System. Cloudera, which launched in March 2009, is the first of a handful of organizations providing commercial Hadoop distributions.

The company is a big contributor to the open source Apache Hadoop project, along with Yahoo!, where the project was originally bootstrapped. IBM, which could have partnered with Cloudera, chose instead to create its own Hadoop distribution, called InfoSphere BigInsights in May 2010. MapR kicked out its own commercially supported Hadoop in May of this year, and Hortonworks, a spinout directly from Yahoo, joined the fray in June 2011.

Cloudera offers its own open source distro (the Cloudera Distribution for Hadoop (CDH)) and an enterprise version (Cloudera Enterprise) that has extra goodies in it for which it can charge money. These extra goodies are not open sourced.

For its Dell-Cloudera stack, Dell is using the Cloudera Enterprise CDH3 version, which can run atop Red Hat Enterprise Linux 5 and 6, its CentOS 5 clone, SUSE Linux Enterprise Server 11, and Ubuntu Server 10.04 LTS and 10.10. Dell is at the moment putting RHEL in the Dell-Cloudera reference architecture, but you are allowed to tweak this and still order a single stack preconfigured. CDH3 Update 1 was just released on July 22.

The Cloudera stack includes the core Apache Hadoop, which has the MapReduce and distributed file system, as well as the companion Apache Hive (SQL-like query), Pig (a high-level programming language for Hadoop), HBase (a column-oriented distributed data store modeled after Google's BigTable), and ZooKeeper (a configuration server for clusters). The stack also includes Cloudera's SCM Express, a management server for configuring Hadoop clusters.

"There are hundreds of settings for a Hadoop configuration, and this makes specific setting recommendations based on the hardware and software that customers are deploying on," Ed Albanese, head of business development at Cloudera tells El Reg. The settings for deploying on a small Hadoop cluster can be radically different from those needed on a mid-sized or large cluster, and SCM Express makes it possible for companies to get the right settings without having to be Hadoop experts like Yahoo!

Dell-Cloudera block diagram

Block diagram of the Dell-Cloudera Hadoop stack (click to enlarge)

Dell is also tossing in its Crowbar tool, which it created for OpenStack-based private clouds running on its PowerEdge-C servers, which were announced last week. Crowbar works in concert with SCM Express to do BIOS configuration, RAID array setup, network setup, operating system deployment, and manage the provisioning of Hadoop software on the bare metal iron from Dell.

John Igoe, executive director of cloud software solutions at Dell, says that for a typical company that is not all that familiar with Hadoop, it can take days to weeks to manually configure a Hadoop cluster with 20, 30, or 40 nodes. But with the combination of the Cloudera Enterprise software and Crowbar, companies can go from a bare-metal rack to running Hadoop data-crunching jobs in less than a day.

Speed to deployment is probably not as important as ease of use, however. "I think there is a substantial opportunity for Hadoop, but we are still in the product's infancy," says Igoe. "Current Hadoop customers have a very deep software bench. But other companies that want to use Hadoop don't have these skills, and they are looking for is to give them those skills."

Stacking it all up

Dell's reference architecture recommends that customers choose the company's PowerEdge-C 2100 energy efficient rack servers, which are optimized to cram the most computing performance in the smallest space with the least amount of extraneous hardware. This is the same iron that Dell used for its Ubuntu Enterprise Cloud (UEC) pre-fabbed private clouds that came out in March, but is different from the PowerEdge C6100 cookie-sheet servers that are being used in the preconfigured OpenStack clouds launched last week.

Dell-Cloudera servers

Server layout in the Dell-Cloudera Hadoop stack (click to enlarge)

The Dell-Cloudera stack is built under the assumption that companies will start out small but grow fast. It consists of six PowerEdge C2100 server nodes and six (yes, six) PowerConnect 6248 48-port Gigabit Ethernet switches. There are two Hadoop master nodes (sometimes called a name node) that manage the Hadoop Distributed File System (HDFS) and MapReduce task distribution across the cluster. These master nodes can also run ZooKeeper. The edge node, of which there is only one in the base configuration, runs Pig and Hive and is the interface between users and the cluster. The slave nodes, of which there are three, run MapReduce and HDFS store data on local disk drives and chews on it as instructed by the master nodes.

Beginner's guide to SSL certificates

More from The Register

next story
Docker's app containers are coming to Windows Server, says Microsoft
MS chases app deployment speeds already enjoyed by Linux devs
'Hmm, why CAN'T I run a water pipe through that rack of media servers?'
Leaving Las Vegas for Armenia kludging and Dubai dune bashing
'Urika': Cray unveils new 1,500-core big data crunching monster
6TB of DRAM, 38TB of SSD flash and 120TB of disk storage
Facebook slurps 'paste sites' for STOLEN passwords, sprinkles on hash and salt
Zuck's ad empire DOESN'T see details in plain text. Phew!
SDI wars: WTF is software defined infrastructure?
This time we play for ALL the marbles
Windows 10: Forget Cloudobile, put Security and Privacy First
But - dammit - It would be insane to say 'don't collect, because NSA'
Oracle hires former SAP exec for cloudy push
'We know Larry said cloud was gibberish, and insane, and idiotic, but...'
Symantec backs out of Backup Exec: Plans to can appliance in Jan
Will still provide support to existing customers
prev story

Whitepapers

Forging a new future with identity relationship management
Learn about ForgeRock's next generation IRM platform and how it is designed to empower CEOS's and enterprises to engage with consumers.
Why cloud backup?
Combining the latest advancements in disk-based backup with secure, integrated, cloud technologies offer organizations fast and assured recovery of their critical enterprise data.
Win a year’s supply of chocolate
There is no techie angle to this competition so we're not going to pretend there is, but everyone loves chocolate so who cares.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?
Intelligent flash storage arrays
Tegile Intelligent Storage Arrays with IntelliFlash helps IT boost storage utilization and effciency while delivering unmatched storage savings and performance.