Amazon slides MapR into elastic Hadoop service

Rolls up 2.0 releases for M3 and M5 distros

Hadoop World 2012 MapR Technologies, one of the main distributors of commercial-grade Hadoop data-munching software, has been tapped by Amazon Web Services to be an alternative to the open source Hadoop stack in the Elastic MapReduce service that Amazon sells to people who don't want to manage their own Hadoop clusters.

At the same time, MapR is trotting out the 2.0 release of its M3 open source and M5 open-core Hadoop distributions.

Until this week, if you were using Elastic MapReduce and you went to the configuration file to set up a MapReduce service (which AWS automagically spits out onto an appropriately sized cluster to fit your budget and job size), you were given two options: the open source Hadoop 0.20 or Hadoop 0.20.205 from the Apache Software Foundation.

Starting this week, however, you now have two more options: MapR M3 v1.2 or M5 v1.2, which were announced in December 2011 and therefore have had the kinks worked out of them.

The M3 and M5 v1.2 releases were also packaged up to run inside of a VMware ESXi hypervisor to allow for the creation of a baby demo Hadoop cluster that could run on a laptop or server, and it was not that much of a leap to spin up the distros into Amazon Machine Image (AMI) formats to run atop Amazon's home-tweaked Xen hypervisor used for its EC2 compute cloud and therefore underneath the Elastic MapReduce service.

The big news is not that there is an AMI for running the MapR variant of Hadoop, but rather that Amazon has made the MapR code – rather than Cloudera, Hortonworks, or IBM variants – a default alternative to its own rollup of Apache Hadoop.

Jack Norris, vice president of marketing at MapR, tells El Reg that this is particularly important given that 90 per cent of the Hadoop work running on the Amazon cloud is through Elastic MapReduce, not by companies setting up their own virty clusters on EC2 and S3.

"EMR is really how people consume Hadoop on Amazon," says Norris.

And while the AMI images for the M3 and M5 Hadoop distros are available for companies to license on the Amazon Marketplace that debuted two months ago, and run on clusters they configure themselves, the value in EMR is that it can spawn hundreds of virtual servers running the code in about five minutes, and then get to work data munching. See how many cans of Jolt or Red Bull it takes for you to do the same.

The M3 v1.2 distribution has the same cost on EMR as the two AMIs packaged up by Amazon for the service; if you want to use the M5 distribution, which offers NFS mounting of the Hadoop Distributed File System (HDFS) underneath Hadoop, then it costs an extra 10 cents per hour atop the EMR fees on a standard large instance (m1.large in the AWS lingo), and an extra 72 cents per hour for a dedicated cluster compute eight extra large instance (which is called cc2.8xlarge and which is essentially a whole physical server).
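
To put those surcharges in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the extra charge applies per instance for each hour billed, which the pricing above implies but does not spell out, and it covers only the M5 premium, not the underlying EMR or EC2 fees. The function name and the cluster sizes in the example are made up for illustration.

```python
# M5 surcharges quoted in the article, in US cents per instance-hour.
# Assumption: the surcharge is billed per instance, per hour.
M5_SURCHARGE_CENTS = {
    "m1.large": 10,
    "cc2.8xlarge": 72,
}


def m5_surcharge_dollars(instance_type, nodes, hours):
    """Return the extra cost of M5 over base EMR fees, in dollars."""
    cents = M5_SURCHARGE_CENTS[instance_type] * nodes * hours
    return cents / 100


if __name__ == "__main__":
    # Hypothetical cluster sizes, each running for a full day.
    print(m5_surcharge_dollars("m1.large", 100, 24))
    print(m5_surcharge_dollars("cc2.8xlarge", 10, 24))
```

On those assumptions, a hundred m1.large nodes running M5 for a day adds $240 to the bill, and ten cc2.8xlarge nodes adds $172.80.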

That price includes 24x7 tech support from MapR, which is just thrilled to get a piece of the Amazon action – particularly since MapR's only other route to market is through EMC's Greenplum data warehousing and analytics division.

Amazon EMR configuration screen with MapR options

The M5 release also includes distributed NameNode and JobTracker management nodes for extra resiliency. In addition, Amazon and MapR have done tweaks to both the M3 and M5 code running in the AMI to tune it for the EC2 compute and S3 storage utilities; they have also done work to interface the MapR Hadoopery with Amazon's DynamoDB NoSQL data store and its CloudWatch management tool.

This last item is particularly important because of the data compression that M5 offers to help speed up throughput on Hadoop jobs and the snapshotting capability that M5 has, which allows for point-in-time recovery snapshots to be taken of HDFS and dumped to S3.

Amazon does the Level 1 tech support on the M3 and M5 instances running on EMR, while MapR does Levels 2 and 3 support.

While MapR is rolling out the 2.0 releases of its M3 and M5 distributions this week as well, these are not yet available on Amazon's EMR service. But they will be shortly, says Norris.

MapR M3 and M5 are based on the Apache 1.0 Hadoop stack, with lots of extra patches thrown in by MapR and Amazon. The Amazon tweaks are the tunings for EC2 and S3, while the MapR tweaks are for its in-memory sharding of the NameNode data and replication to disk, which eliminates the single-point-of-failure issue of the standard Hadoop NameNode, which keeps track of which chunks of data are stored on what spindle in what server in the Hadoop cluster.
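
For a feel of why sharding plus replication gets rid of a single point of failure, here is a toy Python sketch. It is emphatically not MapR's implementation, whose internals are not described here: the hashing scheme, the replica placement, and every function name below are assumptions made up for illustration. The idea is simply that each file's metadata hashes to a shard, each shard lives on several nodes, and a lookup falls through to the next live copy when a node dies.

```python
import hashlib


def shard_for(path, num_shards):
    """Map a file path to a metadata shard by hashing it (illustrative)."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def replica_nodes(shard, nodes, copies):
    """Pick `copies` consecutive nodes, starting at the shard's home node."""
    start = shard % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]


def locate(path, nodes, num_shards, copies, down=frozenset()):
    """Return the first live node holding metadata for `path`, or None."""
    shard = shard_for(path, num_shards)
    for node in replica_nodes(shard, nodes, copies):
        if node not in down:
            return node
    return None


if __name__ == "__main__":
    cluster = ["node1", "node2", "node3", "node4"]
    primary = locate("/logs/2012/06/clicks.log", cluster, 8, 3)
    # Knock out the primary: the lookup falls through to another replica.
    fallback = locate("/logs/2012/06/clicks.log", cluster, 8, 3, down={primary})
    print(primary, fallback)
```

With a single NameNode, by contrast, the equivalent of `locate` has exactly one place to look, and if that box is down the whole cluster is blind.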

There's only one NameNode in a normal Hadoop cluster, although the Apache 2.0 stack, which is in alpha testing now, has some replication services to provide HA for this node. Cloudera is using it in its latest release, while Hortonworks is plunking the NameNode in a VMware ESXi VM and using vSphere and Site Recovery Manager high availability extensions to replicate the NameNode.

Norris says that the MapR v2.0 Hadoop distros have features to allow a single cluster to be carved up into isolated sections so you can do multi-tenancy and run multiple MapReduce jobs across those sections rather than having to set up multiple, separate clusters. You can also use MapR internally and do replication out to the Amazon cloud, or do inter-cluster mirroring between different AWS availability zones (which are isolated chunks of EC2 within a single Amazon region).

The 2.0 release has centralized logging and central configuration – you don't have to hop from node to node tweaking the Hadoop cluster or troubleshooting it – and also sports LZ4, LZF, and GZIP compression algorithms. HBase (the distributed database that rides atop HDFS), Pig (the high-level language to create MapReduce routines), and Hive (the ad-hoc query language and data warehousing tool that works with HDFS) have been updated to the latest stable releases in the MapR stacks.

MapR is now supporting SELinux security with its Hadoop distros, and has added SUSE Linux Enterprise Server 11 (including the SP1 and SP2 updates) as an operating system on which M3 or M5 can run. Prior releases as well as the MapR 2.0 releases ran on Red Hat Enterprise Linux 5 and 6, Canonical Ubuntu 9.04 and higher, and CentOS 5 and 6.

The M3 and M5 v2.0 distros are in a public beta now, and the software will be generally available in the third quarter. M3 is free and M5 costs $4,000 per node for the license to the proprietary extensions to the stack and a year of technical support for the code. ®
