
VMware teaches Serengeti big-data virt new Hadoop tricks

Probably shuffling off to Pivotal soon


It comes as no surprise that VMware wants companies to run everything virtually rather than on bare metal, and for several years it has pushed the idea of virtualizing the Hadoop stack to make it run better and be easier to manage. The tool it created to do that, called Project Serengeti, now has some feature tweaks to try to entice more big-data cluster builders to give it a whirl.

With Serengeti 0.8.0, released Tuesday, the open source tool for virtualizing Hadoop now supports a number of new Hadoop releases plus adds features to make it easier to set up HBase data warehouses on top of Hadoop.

The update to Serengeti was announced in a blog post by Richard McDougall, principal engineer in the office of the CTO at the virtualization giant. "Most big-data environments consist of a mix of workloads," McDougall explains. "Serengeti's mission is to enable as many of the big-data family of workloads into the same theme park, all running on a common shared platform."

By virtualizing clusters you can run various parts of the big-data munching tools on shared hardware, dialing up virtual machines running each workload as needed, and dialing them back so other workloads can play.

It's all about elastic scaling, for which you pay a virtualization performance tax. For many workloads, as servers have been crammed to the gills with cores, this overhead has been acceptable.

VMware wants to layer big data tools on top of its ESXi server virtualization


Most companies probably don't think about their Hadoop clusters in this manner, and very likely do think about them as performing very specific functions. They're more worried about the turnaround time for batch jobs and queries and how other applications are dependent on the results of that work, and they don't want to pay a performance overhead for virtualization.

But VMware is going to keep plugging away at the idea that virtualization will allow for mixed-mode use of server clusters for all kinds of big-data jobs. So will the Pivotal group, once Serengeti moves over to the Pivotal spinoff sometime later this year, along with the Cloud Foundry platform cloud and EMC's Greenplum data warehouse and Hadoop distribution.

With the Serengeti 0.8.0 release, Cloudera's CDH4 and MapR Technologies' M5 Hadoop distributions are now supported running inside of virtual machine containers. The open source Apache 1.0 distribution was already supported, as were EMC's Greenplum HD 1.2, Cloudera CDH3, and Hortonworks Data Platform 1.0.

With the CDH4 release, Serengeti is aware that you can use the HDFS1 or HDFS2 file systems. It is also aware of the federated NameNode support that Cloudera has built into its Hadoop distro, and knows how to configure these options.

And with MapR distros, Serengeti is similarly aware of the container location database (CLDB) used in the NFS-alike file system that MapR uses instead of HDFS, and is also in the know about the FileServer, JobTracker, and TaskTracker elements of the MapR stack, and how to package these up into virty machines and scale out their performance by replicating copies.

If you are looking to set up an HBase data warehouse, as you can see in the Serengeti 0.8.0 release notes, the VMware tool can create an HBase cluster, with an underlying HDFS file system and linked to the MapReduce data-muncher and the Thrift and RESTful APIs that are used to control HBase.

Serengeti also knows how to configure active and hot standby replicants of the HMaster nodes for the data warehouse, and can scale out HBase RegionServers once the data warehouse is set up atop HDFS. HBase can be deployed in a virtualized manner by Serengeti on top of the Apache Hadoop, Cloudera, Hortonworks, or Greenplum distros – but not MapR distros, for some reason.
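For anyone trying to keep the distro support straight, the matrix described above can be jotted down as a few lines of Python. This is only a sketch built from the distributions named in this article: the `Distro` class, the `SUPPORTED` list, and the helper functions are hypothetical names made up for illustration, not part of Serengeti or its command line, and the HBase rule is applied per vendor because the article does not break it down by release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Distro:
    """One Hadoop distribution as named in the article (hypothetical model)."""
    vendor: str
    release: str
    new_in_0_8_0: bool  # True if support arrived with Serengeti 0.8.0


# Distributions the article lists as supported by Serengeti 0.8.0.
SUPPORTED = [
    Distro("Apache", "Hadoop 1.0", False),
    Distro("Greenplum", "HD 1.2", False),
    Distro("Cloudera", "CDH3", False),
    Distro("Hortonworks", "Data Platform 1.0", False),
    Distro("Cloudera", "CDH4", True),
    Distro("MapR", "M5", True),
]

# Vendors on which the article says HBase can be deployed.
# Assumption: the rule applies to every listed release from that vendor.
HBASE_VENDORS = frozenset({"Apache", "Cloudera", "Hortonworks", "Greenplum"})


def can_deploy_hbase(distro: Distro) -> bool:
    """Return True if the article lists this distro's vendor as HBase-capable."""
    return distro.vendor in HBASE_VENDORS


def newly_supported() -> list[str]:
    """Names of the distributions added in the 0.8.0 release."""
    return [f"{d.vendor} {d.release}" for d in SUPPORTED if d.new_in_0_8_0]


if __name__ == "__main__":
    for d in SUPPORTED:
        hbase = "yes" if can_deploy_hbase(d) else "no"
        print(f"{d.vendor} {d.release}: HBase via Serengeti = {hbase}")
```

Run as a script, it prints one line per distribution, with MapR M5 as the lone entry that cannot host a Serengeti-deployed HBase cluster.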

You can download the virtual machine appliance stuffed with Serengeti 0.8.0 here at the VMware site, and it doesn't cost anything to use. ®

