Apache lets fly Hadoop 1.0 data muncher

This ain't no Dumbo

The Hadoop project at the Apache Software Foundation is beating its chest for delivering the v1.0 version of the open source MapReduce data analysis tool, its Hadoop Distributed File System (HDFS), and other related code.

While software version and release numbers can sometimes be arbitrary, they are often also symbolic, and in this case the jump up to v1.0 is a little bit of both. But it's clearly intended to signify that the open source software is ready for primetime use among IT organizations that are not data analytics hotshots, but want to be.

This is akin to the position that the open source Linux operating system was in towards the tail end of the dot-com boom, with the Linux 2.2 kernel, which had the stability and scalability that enterprises required as they contemplated alternatives to more expensive Unix and proprietary operating systems.

In the dot-com bust, when Linux began its assault on the data center, the open source code was available for free for those who wanted to roll their own Linuxes, while Red Hat, SUSE, and others put out commercial releases, doing the rolling for you and offering commercial support.

The same thing is happening again with the Hadoop project, with Apache providing a home for Hadoop and its related tools to allow them to be improved and integrated. Meanwhile companies such as Cloudera, Hortonworks, MapR, IBM, Oracle, and a number of other smaller players are offering complete Hadoop stacks, or support for parts of the stack. The release number on Apache Hadoop is therefore perhaps not as relevant as the stacks others are providing commercial support for, and what elements they pick and choose from among the various Hadoop trees.

Hadoop and its related HDFS were created by techies at Yahoo! to mimic the prior generation of search engine indexing technology – MapReduce and Google File System to be specific – which was used by Google back in the mid-2000s. Google has since moved on, much as Hadoop MapReduce is being augmented to include NoSQL data stores and expanded to other data crunching algorithms besides mapping and reducing. Hadoop has been an Apache project for the past six years, with heavy contributions from Yahoo! and the Hadoop commercializers.
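For those who have not met mapping and reducing before, the pattern can be sketched in a few lines of plain Python. This is only an illustration of the idea on a single machine, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for the example, and the real framework spreads these steps across a cluster with the input sitting in HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, the job a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values seen for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["big data", "big elephant"])` counts "big" twice and the other two words once each. The appeal of the split is that every map call and every reduce call is independent of the others, which is what lets a cluster run them in parallel.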

Not all of the possible code modules that work with Hadoop have been included in the Hadoop v1.0 release, but the key functionality is in there to start doing real work. However, some useful functions for Hadoop are closed source and under control of the commercial distros.

The core Hadoop code in the v1.0 stack from Apache is based on Hadoop 0.20.205.0 from the 0.20-security code line, which came out in October. In addition to the core MapReduce algorithm and HDFS, Hadoop v1.0 includes improvements to HBase, the column-oriented add-on for HDFS that is modeled on Google's BigTable; WebHDFS, a REST API front-end for HDFS; as well as performance enhancements and bug fixes for Hadoop, HDFS, and HBase. (The release notes and the v1.0 documentation are available from the Apache Hadoop project.)
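To give a feel for what a REST front-end means in practice, here is a rough sketch of how a WebHDFS request URL is put together. WebHDFS exposes file system operations under a `/webhdfs/v1/` path prefix, with the operation named in an `op` query parameter. The helper name `webhdfs_url` and the host name are hypothetical, and the port (50070, the usual NameNode web port in this generation of Hadoop) is an assumption about the cluster's configuration. The code only builds the URL and makes no network call.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS URL for operation `op` on an absolute HDFS `path`.

    Hypothetical helper for illustration; extra keyword arguments are
    appended as additional query parameters.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    # quote() percent-encodes unsafe characters but leaves "/" intact.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Example: a directory listing request against an assumed NameNode address.
listing = webhdfs_url("namenode.example.com", 50070, "/user/reg/logs", "LISTSTATUS")
```

Any HTTP client can then issue a GET against the resulting URL, which is the point of the exercise: tools that speak HTTP can reach files in HDFS without a Java client.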

This core Hadoop software is akin to a Linux kernel for big data, with other tools such as HBase, Hive (for ad hoc SQL-like queries), Pig (a high-level programming language for Hadoop), and ZooKeeper (a configuration server for clusters) layered around it like the rest of a distribution. These are also Apache projects, as are the related Mahout machine learning add-on and the Cassandra alternative data store for Hadoop.

A full-blown Hadoop distribution with multiple data stores and data munching algorithms, as El Reg outlined last November from Hadoop World, is getting complicated and crowded. Or diverse and more useful, if you want to be nice to the stuffed elephant that is Hadoop's namesake. ®
