VMware teaches Serengeti big-data virt new Hadoop tricks

Probably shuffling off to Pivotal soon

It comes as no surprise that VMware wants companies to run everything virtually rather than on bare metal, and for several years it has pushed the idea of virtualizing the Hadoop stack to make it run better and be easier to manage. The tool it created to do that, called Project Serengeti, now has some feature tweaks to try to entice more big data cluster builders to give it a whirl.

With Serengeti 0.8.0, released Tuesday, the open source tool for virtualizing Hadoop supports a number of new Hadoop releases and adds features to make it easier to set up HBase data warehouses on top of Hadoop.

The update to Serengeti was announced in a blog post by Richard McDougall, principal engineer in the office of the CTO at the virtualization giant. "Most big-data environments consist of a mix of workloads," McDougall explains. "Serengeti's mission is to enable as many of the big-data family of workloads into the same theme park, all running on a common shared platform."

By virtualizing clusters you can run various parts of the big-data munching tools on shared hardware, dialing up virtual machines running each workload as needed, and dialing them back so other workloads can play.

It's all about elastic scaling, for which you pay a virtualization performance tax. For many workloads, as servers have been crammed to the gills with cores, this overhead has been acceptable.

VMware wants to layer big data tools on top of its ESXi server virtualization

Most companies probably don't think about their Hadoop clusters in this manner, and very likely do think about them as performing very specific functions. They're more worried about the turnaround time for batch jobs and queries and how other applications are dependent on the results of that work, and they don't want to pay a performance overhead for virtualization.

But VMware is going to keep plugging away at the idea that virtualization will allow for mixed-mode use of server clusters for all kinds of big-data jobs. So will Pivotal, once Serengeti moves over to the spinoff sometime later this year, along with the Cloud Foundry platform cloud and EMC's Greenplum data warehouse and Hadoop distribution.

With the Serengeti 0.8.0 release, Cloudera's CDH4 and MapR Technologies' M5 Hadoop distributions are now supported running inside of virtual machine containers. The open source Apache 1.0 distribution was already supported, as were EMC's Greenplum HD 1.2, Cloudera CDH3, and Hortonworks Data Platform 1.0.

With the CDH4 release, Serengeti is aware that you can use the HDFS1 or HDFS2 file systems. It is also aware of the federated NameNode support that Cloudera has built into its Hadoop distro, and knows how to configure these options.

And with MapR distros, Serengeti is similarly aware of the container location database (CLDB) used in the NFS-alike file system that MapR uses instead of HDFS, and is also in the know about the FileServer, JobTracker, and TaskTracker elements of the MapR stack, and how to package these up into virty machines and scale out their performance by replicating copies.

If you are looking to set up an HBase data warehouse, as you can see in the Serengeti 0.8.0 release notes, the VMware tool can create an HBase cluster, with an underlying HDFS file system and linked to the MapReduce data-muncher and the Thrift and RESTful APIs that are used to control HBase.

Serengeti also knows how to configure active and hot standby replicants of the HMaster nodes for the data warehouse, and can scale out HBase RegionServers once the data warehouse is set up atop HDFS. HBase can be deployed in a virtualized manner by Serengeti on top of the Apache Hadoop, Cloudera, Hortonworks, or Greenplum distros – but not MapR distros, for some reason.
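For readers keeping score, the support picture described above can be written down as a small lookup. This is only a sketch of the facts reported here, not Serengeti's own configuration format or API: the `Distro` class, the `HBASE_VENDORS` set, and the helper names are hypothetical, and it assumes that "Cloudera" in the HBase list covers both CDH3 and CDH4, which the release notes would need to confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Distro:
    """One Hadoop distribution named in the article (hypothetical model)."""
    vendor: str
    release: str
    new_in_0_8_0: bool  # True if support arrived with Serengeti 0.8.0


# Distributions the article lists as supported under Serengeti 0.8.0.
DISTROS = [
    Distro("Apache", "Hadoop 1.0", False),
    Distro("Greenplum", "HD 1.2", False),
    Distro("Cloudera", "CDH3", False),
    Distro("Hortonworks", "Data Platform 1.0", False),
    Distro("Cloudera", "CDH4", True),
    Distro("MapR", "M5", True),
]

# Vendors on which the article says HBase can be deployed.
# Assumption: vendor-level granularity; MapR is explicitly excluded.
HBASE_VENDORS = {"Apache", "Cloudera", "Hortonworks", "Greenplum"}


def can_deploy_hbase(distro: Distro) -> bool:
    """Return True if the article lists this distro's vendor for HBase."""
    return distro.vendor in HBASE_VENDORS


def hbase_capable(distros: list[Distro]) -> list[str]:
    """Names of the distros on which HBase deployment is reported to work."""
    return [f"{d.vendor} {d.release}" for d in distros if can_deploy_hbase(d)]


if __name__ == "__main__":
    for name in hbase_capable(DISTROS):
        print(name)
```

Run as a script, it prints every listed distribution except MapR M5, the one newly supported distro that misses out on HBase.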

You can download the virtual machine appliance stuffed with Serengeti 0.8.0 from the VMware site, and it doesn't cost anything to use. ®
