Apache lets fly Hadoop 1.0 data muncher

This ain't no Dumbo

The Hadoop project at the Apache Software Foundation is beating its chest for delivering version 1.0 of the open source MapReduce data analysis tool, its Hadoop Distributed File System (HDFS), and other related code.

While software version and release numbers can sometimes be arbitrary, they are often also symbolic, and in this case the jump up to v1.0 is a little bit of both. But it's clearly intended to signify that the open source software is ready for primetime use among IT organizations that are not data analytics hotshots, but want to be.

This is akin to the position that the open source Linux operating system was in towards the tail end of the dot-com boom, with the Linux 2.2 kernel, which had the stability and scalability that enterprises required as they contemplated alternatives to more expensive Unix and proprietary operating systems.

In the dot-com bust, when Linux began its assault on the data center, the open source code was available for free to those who wanted to roll their own Linuxes, while Red Hat, SUSE, and others offered commercial releases, doing the rolling for you and selling commercial support.

The same thing is happening again with the Hadoop project, with Apache providing a home for Hadoop and its related tools to allow them to be improved and integrated. Meanwhile companies such as Cloudera, Hortonworks, MapR, IBM, Oracle, and a number of smaller players are offering complete Hadoop stacks, or support for parts of the stack. The release number on Apache Hadoop is therefore perhaps not as relevant as the stacks others are providing commercial support for, and what elements they pick and choose from among the various Hadoop trees.

Hadoop and its related HDFS were created by techies at Yahoo! to mimic the prior generation of search engine indexing technology – MapReduce and Google File System to be specific – which was used by Google back in the mid-2000s. Google has since moved on, much as Hadoop MapReduce is being augmented to include NoSQL data stores and expanded to other data crunching algorithms besides mapping and reducing. Hadoop has been an Apache project for the past six years, with heavy contributions from Yahoo! and the Hadoop commercializers.
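For readers who have not met the mapping and reducing idea, here is a minimal single-machine sketch in Python. It is not Hadoop code and uses none of Hadoop's APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. It only shows the shape of the computation: emit key-value pairs, group them by key, then collapse each group.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, the step a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all the values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map and reduce calls would be spread across many machines and the records would be read from a distributed file system; this sketch runs everything in one process.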

Not all of the possible code modules that work with Hadoop have been included in the Hadoop v1.0 release, but the key functionality is in there to start doing real work. However, some useful functions for Hadoop are closed source and under control of the commercial distros.

The core Hadoop code in the v1.0 stack from Apache is based on Hadoop 0.20.205.0 from the 0.20-security code line, which came out in October. In addition to the core MapReduce algorithm and HDFS, Hadoop v1.0 includes improvements to HBase, the column-oriented add-on for HDFS that is modeled on Google's BigTable; WebHDFS, a REST API front-end for HDFS; as well as performance enhancements and bug fixes for Hadoop, HDFS, and HBase. (You can read the release notes here and the v1.0 documentation there.)
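To give a feel for what a REST front-end means in practice, here is a small Python sketch that builds WebHDFS request URLs, which expose file system operations under a `/webhdfs/v1` path with an `op` query parameter. The helper name `webhdfs_url`, the host name, the port, and the file paths are all assumptions for illustration; check your own cluster's configuration for the real values. The sketch only builds the URL string and does not contact any server.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS URL for one file system operation.

    host, port: the NameNode's HTTP address (assumed values, cluster-specific).
    path: absolute file system path, such as "/user/reg/logs".
    op: WebHDFS operation name, such as "LISTSTATUS" or "OPEN".
    params: optional dict of extra query parameters, such as {"user.name": "reg"}.
    """
    query = urlencode({"op": op, **(params or {})})
    # quote() leaves "/" alone but escapes spaces and other unsafe characters.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical host and port, for illustration only.
listing = webhdfs_url("namenode.example.com", 50070, "/user/reg/logs", "LISTSTATUS")
```

The resulting URL can be fetched with any HTTP client, which is the appeal of the REST approach: no Hadoop client libraries are needed on the calling side.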

This core Hadoop software is akin to a Linux kernel for big data, with other tools layered on top of it, such as HBase, Hive (for ad hoc SQL-like queries), Pig (a high-level programming language for Hadoop), and ZooKeeper (a configuration server for clusters). These are also Apache projects, as are the related Mahout machine learning add-on and the Cassandra alternative data store for Hadoop.

A full-blown Hadoop distribution with multiple data stores and data munching algorithms, as El Reg outlined last November from Hadoop World, is getting complicated and crowded. Or diverse and more useful, if you want to be nice to the stuffed elephant that is Hadoop's namesake. ®
