Docker ported into Hadoop as benchmarks show SCREAMING FAST performance

Code committers hope unholy union of open source tech will spawn Speedy Gonzales virtualization

The Hadoop community is working on patches that will bring the popular app-containerization technology Docker into the data management system, and independent benchmarks show the tech is far faster than traditional virtualization approaches on several measures.

Docker is an open source Linux containerization technology that uses kernel features such as namespaces and cgroups, along with the LXC tooling built on them, to let an admin run multiple apps with all their dependencies in secure sandboxes on the same underlying Linux OS. That makes it an attractive alternative to typical virtualization, which bundles a copy of the OS with each app.

Benchmarks released on Thursday by an IBM employee show that Docker containerization has some large performance advantages over the KVM hypervisor.

Alongside this, El Reg has discovered some fascinating work by the Hadoop community to bring the tech into the data analysis and management engine itself.

Combined, these crumbs of news lend weight to the idea that Docker could eventually replace traditional virtualization approaches, granting organizations big benefits from an open source tech.

To start with, benchmarks conducted by IBM show that Docker has a number of performance advantages over the KVM hypervisor when running on the open source cloud infrastructure tool OpenStack.

In an informative post published on Thursday, IBM chap Boden Russell goes into further detail about the results.

"From an OpenStack Cloudy operational time perspective (boot, reboot, delete, snapshot, etc.) docker LXC outperformed KVM ranging from 1.09x (delete) to 49x (reboot)," Russell wrote. "Based on the compute node resource usage metrics during the serial VM packing test: Docker LXC CPU growth is approximately 26x lower than KVM. On this surface this indicates a 26x density potential increase from a CPU point of view using docker LXC vs a traditional hypervisor. Docker LXC memory growth is approximately 3x lower than KVM. On the surface this indicates a 3x density potential increase from a memory point of view using docker LXC vs a traditional hypervisor."

Impressive stuff, indeed.
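
Russell's two ratios can be combined with a little arithmetic. The sketch below is ours, not IBM's: it assumes a host is full as soon as any one resource runs out and that per-instance growth is linear, neither of which the post claims. The function name `density_gain` is hypothetical. Under those assumptions the overall density gain is capped by the smallest ratio.

```python
def density_gain(growth_ratios):
    """Return (limiting_resource, gain) when every resource must fit.

    growth_ratios maps a resource name to how many times lower its
    per-instance growth is under containers than under the hypervisor.
    Assumption: the scarcest resource alone decides how many instances fit.
    """
    if not growth_ratios:
        raise ValueError("need at least one resource ratio")
    # The resource with the smallest improvement runs out first.
    limiting = min(growth_ratios, key=growth_ratios.get)
    return limiting, growth_ratios[limiting]


# Ratios quoted in Russell's post: CPU growth 26x lower, memory 3x lower.
quoted = {"cpu": 26.0, "memory": 3.0}
resource, gain = density_gain(quoted)
print(f"limiting resource: {resource}, density gain: about {gain:g}x")
```

Run as written, this prints `limiting resource: memory, density gain: about 3x`. In other words, on a box that is short of RAM the headline 26x CPU figure may matter less than the 3x memory figure.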

Altiscale wants to spin a Docker YARN

Not only does Docker have desirable resource-usage characteristics, but the way it allows devs to package up applications has attracted attention from the open source Hadoop community.

Recently we learned that some people are diligently working to add Docker support to YARN, a crucial component of Apache Hadoop 2.0, with the goal of increasing the usefulness of both techs.

YARN was introduced in version two of Apache Hadoop. It lets the software run multiple applications within Hadoop rather than purely MapReduce jobs. Thanks to this, YARN is helping to transform Hadoop from a batch processing and storage system into a more general tool for manipulating and storing data.

By combining YARN with Docker, the community hopes it can make it trivial for developers to package up an application in a Docker container, then sling it onto the YARN tech as part of a larger Hadoop installation.

Altiscale, the company behind the code contributions that make this possible, was kind enough to answer some of our questions about why this could be useful.

"As a company building a Hadoop as a Service platform, we are particularly interested in YARN as it allows Hadoop to move beyond map-reduce to a much more diverse variety of applications," explained the company's chief executive Raymie Stata to El Reg via email. "One of the key components of YARN that make this possible are containers. The existing YARN container implementation does not adequately provide all the types of isolation required to address a scenario we are noticing with our larger customers – multiple, independent groups in the same organization with different software requirements."

By adding in Docker support, Altiscale hopes it can flatten some of the barriers that lie between enterprise developers and a greater use of Hadoop.

"A common struggle for users is software dependency management," Stata explained. "Docker provides an intriguing approach to solving that problem by allowing users to upload prepackaged environments (or images) into repositories which can then easily be downloaded and run in isolation. For example, there are public repositories in the Docker community called Docker registries which provide a variety of language environments such as Java and Ruby. There is also support for private repositories where containers with more specialized environments can be placed."

Other members of the Hadoop community are keen on the addition of Docker as well.

"Where Docker makes perfect sense for YARN is that we can use Docker Images to fully describe the *entire* unix filesystem image for any YARN container," explained Arun Murthy, a founder and architect at Hortonworks, to El Reg in an email.

"This way, instead of forcing the user to deal with individual files or binaries (as today) we can allow the application to package up the *entire* Unix filesystem image it needs as Docker image and then get perfect predictability, from an environment perspective, at runtime. This is where Docker has the most amount of interest to the YARN/Hadoop community - particularly for people packaging up complex applications which need their own version of perl, python, java, libc etc. etc. ... that is hard to manage on YARN currently."
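
To make Murthy's point concrete, here is a minimal sketch of how a launcher might wrap each job in its own image. It is not the Altiscale patch or any YARN API: the helper `docker_run_command`, the team names and the `example/...` image names are all hypothetical. The only real pieces are the `docker run` flags `--rm`, `--memory` and `--env`. The code builds the command line without executing it, so it runs without Docker installed.

```python
import shlex


def docker_run_command(image, command, memory_mb=None, env=None):
    """Build (but do not execute) a `docker run` argument list.

    Hypothetical helper for illustration; not part of YARN or Docker.
    """
    args = ["docker", "run", "--rm"]
    if memory_mb is not None:
        # Cap the container's memory, e.g. "--memory 2048m".
        args += ["--memory", f"{memory_mb}m"]
    for key, value in sorted((env or {}).items()):
        args += ["--env", f"{key}={value}"]
    args.append(image)
    args += list(command)
    return args


# Hypothetical teams, each pinned to an image carrying its own runtime.
jobs = {
    "analytics": ("example/python-env", ["python", "job.py"]),
    "reporting": ("example/ruby-env", ["ruby", "report.rb"]),
}
for team, (image, command) in sorted(jobs.items()):
    print(team, "->", shlex.join(docker_run_command(image, command, memory_mb=2048)))
```

Because each job names its image, two groups with clashing language versions never have to agree on what is installed on the cluster nodes, which is the scenario Stata describes.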

Adding Docker to YARN looks potentially useful, and is another example of the enthusiasm with which Silicon Valley has adopted the young open source technology.

This follows Red Hat announcing broad support for Docker in its eponymous Linux distribution, and launching a project named "Atomic" built around the tech.

Amazon also recently added Docker support to its "Elastic Beanstalk" platform-as-a-service cloud.

These moves back up an earlier assertion by a Red Hat employee that: "Docker as a packaging tool for shipping software may be a game changer". ®
