Docker ported into Hadoop as benchmarks show SCREAMING FAST performance

Code committers hope unholy union of open source tech will spawn speedy gonzalez virtualization

The Hadoop community is working on patches that will bring the popular app-containerization technology Docker into the data management system, and independent benchmarks show the tech is far faster than traditional virtualization approaches.

Docker is an open source Linux containerization technology that builds on underlying Linux elements such as namespaces, LXC, and cgroups to let an admin run multiple apps, each with all its dependencies, in secure sandboxes on the same underlying Linux OS. That makes it an attractive alternative to typical virtualization, which bundles a copy of the OS with each app.

In a set of benchmarks an IBM employee released on Thursday, the company showed that Docker containerization has some huge advantages over the KVM hypervisor from a performance perspective.

Alongside this, El Reg has discovered some fascinating work by the Hadoop community to bring the tech into the eponymous data analysis and management engine.

Combined, these crumbs of news add weight to the idea that Docker could eventually replace traditional virtualization approaches, granting organizations big benefits from an open source tech.

To start with, benchmarks conducted by IBM show that Docker has a number of performance advantages over the KVM hypervisor when running on the open source cloud infrastructure tool OpenStack.

In an informative post published on Thursday, IBM chap Boden Russell goes into further detail about the results.

"From an OpenStack Cloudy operational time perspective (boot, reboot, delete, snapshot, etc.) docker LXC outperformed KVM ranging from 1.09x (delete) to 49x (reboot)," Russell wrote. "Based on the compute node resource usage metrics during the serial VM packing test: Docker LXC CPU growth is approximately 26x lower than KVM. On this surface this indicates a 26x density potential increase from a CPU point of view using docker LXC vs a traditional hypervisor. Docker LXC memory growth is approximately 3x lower than KVM. On the surface this indicates a 3x density potential increase from a memory point of view using docker LXC vs a traditional hypervisor."
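The two density figures Russell quotes pull in different directions, so it may help to see how they might combine. The sketch below assumes, and this is our assumption rather than anything in Russell's post, that a compute node fills up when its scarcest resource runs out, so the practical density gain is bounded by the smallest per-resource ratio. The function name `density_gain` is ours and purely illustrative.

```python
def density_gain(ratios):
    """Return (resource, gain) for the tightest constraint.

    `ratios` maps a resource name to how many times lower its growth
    was under Docker LXC than under KVM. Assumption: the node is full
    as soon as any one resource is exhausted, so the smallest ratio wins.
    """
    resource = min(ratios, key=ratios.get)
    return resource, ratios[resource]


# Figures quoted from the serial VM packing test.
ratios = {"cpu": 26.0, "memory": 3.0}

resource, gain = density_gain(ratios)
print(f"bounded by {resource}: about {gain:g}x")  # bounded by memory: about 3x
```

Under that assumption the headline 26x CPU figure matters less than the 3x memory figure, unless the workload is light on memory.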

Impressive stuff, indeed.

Altiscale wants to spin a Docker YARN

Not only does Docker have desirable resource-usage characteristics, but the way it allows devs to package up applications has attracted attention from the open source Hadoop community.

Recently we learned that some people are diligently working to add Docker support into a crucial component of Apache Hadoop 2.0 named YARN, with the goal of increasing the usefulness of both techs.

YARN was introduced in version two of Apache Hadoop. It lets the software run multiple applications within Hadoop rather than purely MapReduce jobs. Thanks to this, YARN is helping to transform Hadoop from a batch processing and storage system into a more general tool for manipulating and storing data.

By combining YARN with Docker, the community hopes it can make it trivial for developers to package up an application in a Docker container, then sling it onto the YARN tech as part of a larger Hadoop installation.

Altiscale, the company behind the code contributions that make this possible, was kind enough to answer some of our questions about why this could be useful.

"As a company building a Hadoop as a Service platform, we are particularly interested in YARN as it allows Hadoop to move beyond map-reduce to a much more diverse variety of applications," explained the company's chief executive Raymie Stata to El Reg via email. "One of the key components of YARN that make this possible are containers. The existing YARN container implementation does not adequately provide all the types of isolation required to address a scenario we are noticing with our larger customers – multiple, independent groups in the same organization with different software requirements."

By adding in Docker support, Altiscale hopes it can flatten some of the barriers that lie between enterprise developers and a greater use of Hadoop.

"A common struggle for users is software dependency management," Stata explained. "Docker provides an intriguing approach to solving that problem by allowing users to upload prepackaged environments (or images) into repositories which can then easily be downloaded and run in isolation. For example, there are public repositories in the Docker community called Docker registries which provide a variety of language environments such as Java and Ruby. There is also support for private repositories where containers with more specialized environments can be placed."

Other members of the Hadoop community are keen on the addition of Docker as well.

"Where Docker makes perfect sense for YARN is that we can use Docker Images to fully describe the *entire* unix filesystem image for any YARN container," explained Arun Murthy, a founder and architect at Hortonworks, to El Reg in an email.

"This way, instead of forcing the user to deal with individual files or binaries (as today) we can allow the application to package up the *entire* Unix filesystem image it needs as Docker image and then get perfect predictability, from an environment perspective, at runtime. This is where Docker has the most amount of interest to the YARN/Hadoop community - particularly for people packaging up complex applications which need their own version of perl, python, java, libc etc. etc. ... that is hard to manage on YARN currently."
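To make the packaging idea concrete, here is a minimal sketch of two groups running jobs from their own images, each carrying its own interpreter. It only builds the `docker run` command lines; it is not how the YARN patches launch containers, and the image names, script names and the helper `docker_run_argv` are hypothetical.

```python
import shlex


def docker_run_argv(image, command):
    """Build the argv to run `command` inside `image`, removing the container on exit."""
    return ["docker", "run", "--rm", image, *command]


# Hypothetical images: each team ships the whole filesystem its job needs,
# so the host does not have to provide a matching python or perl.
jobs = {
    "team-a": ("registry.example.com/team-a/etl:1.0", ["python", "etl.py"]),
    "team-b": ("registry.example.com/team-b/report:2.3", ["perl", "report.pl"]),
}

for team, (image, command) in jobs.items():
    # Print the command rather than executing it, so no Docker daemon is needed.
    print(team, shlex.join(docker_run_argv(image, command)))
```

The point of the sketch is that the only thing the host needs to know is the image name; everything else travels inside the image.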

Adding Docker to YARN looks potentially useful, and is another example of the enthusiasm with which Silicon Valley has adopted the young open source technology.

This follows Red Hat announcing broad support for Docker in its eponymous Linux distribution, and launching a project named "Atomic" built around the tech.

Amazon also recently added Docker support to its "Elastic Beanstalk" platform-as-a-service cloud.

These moves back up an earlier assertion by a Red Hat employee that: "Docker as a packaging tool for shipping software may be a game changer". ®
