Apache lets fly Hadoop 1.0 data muncher

This ain't no Dumbo

The Hadoop project at the Apache Software Foundation is beating its chest over delivering v1.0 of the open source MapReduce data analysis tool, its Hadoop Distributed File System (HDFS), and other related code.

While software version and release numbers can sometimes be arbitrary, they are often also symbolic, and in this case the jump up to v1.0 is a little bit of both. But it's clearly intended to signify that the open source software is ready for primetime use among IT organizations that are not data analytics hotshots, but want to be.

This is akin to the position that the open source Linux operating system was in towards the tail end of the dot-com boom, with the Linux 2.2 kernel, which had the stability and scalability that enterprises required as they contemplated alternatives to more expensive Unix and proprietary operating systems.

In the dot-com bust, when Linux began its assault on the data center, the open source code was available for free to those who wanted to roll their own Linuxes, while Red Hat, SUSE, and others offered commercial releases, doing the rolling for you and selling commercial support.

The same thing is happening again with the Hadoop project, with Apache providing a home for Hadoop and its related tools to allow them to be improved and integrated. Meanwhile companies such as Cloudera, Hortonworks, MapR, IBM, Oracle, and a number of smaller players are offering complete Hadoop stacks, or support for parts of the stack. The release number on Apache Hadoop is therefore perhaps not as relevant as the stacks others are providing commercial support for, and what elements they pick and choose from among the various Hadoop trees.

Hadoop and its related HDFS were created by techies at Yahoo! to mimic the prior generation of search engine indexing technology – MapReduce and Google File System to be specific – which was used by Google back in the mid-2000s. Google has since moved on, much as Hadoop MapReduce is being augmented to include NoSQL data stores and expanded to other data crunching algorithms besides mapping and reducing. Hadoop has been an Apache project for the past six years, with heavy contributions from Yahoo! and the Hadoop commercializers.
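For readers who have not met the "mapping and reducing" pattern, here is a minimal single-machine sketch of the idea in Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real cluster would spread each phase across many nodes.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse the list of values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three steps in order on an in-memory list of strings."""
    return reduce_phase(shuffle(map_phase(documents)))
```

Calling `word_count(["big data", "big elephant"])` counts "big" twice and the other two words once each. The point of the split is that the map and reduce steps touch each record or key independently, which is what lets a framework farm them out in parallel.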

Not all of the possible code modules that work with Hadoop have been included in the Hadoop v1.0 release, but the key functionality is in there to start doing real work. However, some useful functions for Hadoop are closed source and under control of the commercial distros.

The core Hadoop code in the v1.0 stack from Apache is based on Hadoop 0.20.205.0 from the 0.20-security code line, which came out in October. In addition to the core MapReduce algorithm and HDFS, Hadoop v1.0 includes improvements to HBase, the column-oriented add-on for HDFS that is modeled on Google's BigTable; WebHDFS, a REST API front-end for HDFS; as well as performance enhancements and bug fixes for Hadoop, HDFS, and HBase. (You can read the release notes here and the v1.0 documentation there.)
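To give a feel for what a REST front-end to the file system means in practice, the sketch below builds a WebHDFS-style URL in Python. WebHDFS exposes HDFS file operations over HTTP under a `/webhdfs/v1` path prefix, with the operation passed as an `op` query parameter. The helper name `webhdfs_url`, the host name, and the port used in the usage note are assumptions for illustration only; check your own cluster's configuration before relying on them.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS-style URL for an absolute HDFS path.

    host and port are whatever your NameNode's HTTP endpoint is
    (assumed values here); extra keyword arguments become query
    parameters alongside the required 'op'.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
```

For example, `webhdfs_url("namenode.example.com", 50070, "/logs/app.log", "OPEN")` yields a URL that an ordinary HTTP client could fetch to read the file, assuming WebHDFS is enabled on that (hypothetical) host.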

This core Hadoop software is akin to a Linux kernel for big data, with other tools layered on top of it, such as HBase, Hive (for ad hoc SQL-like queries), Pig (a high-level programming language for Hadoop), and ZooKeeper (a coordination and configuration service for clusters). These are also Apache projects, as are the related Mahout machine learning add-on and the Cassandra alternative data store for Hadoop.

A full-blown Hadoop distribution with multiple data stores and data munching algorithms, as El Reg outlined last November from Hadoop World, is getting complicated and crowded. Or diverse and more useful, if you want to be nice to the stuffed elephant that is Hadoop's namesake. ®
