Apache lets fly Hadoop 1.0 data muncher

This ain't no Dumbo

The Hadoop project at the Apache Software Foundation is beating its chest for delivering the v1.0 version of the open source MapReduce data analysis tool, its Hadoop Distributed File System (HDFS), and other related code.

While software version and release numbers can sometimes be arbitrary, they are often also symbolic, and in this case the jump up to v1.0 is a little bit of both. But it's clearly intended to signify that the open source software is ready for primetime use among IT organizations that are not data analytics hotshots, but want to be.

This is akin to the position that the open source Linux operating system was in towards the tail end of the dot-com boom, with the Linux 2.2 kernel, which had the stability and scalability that enterprises required as they contemplated alternatives to more expensive Unix and proprietary operating systems.

In the dot-com bust, when Linux began its assault on the data center, the open source code was available for free for those who wanted to roll their own Linuxes, and there were commercial releases from Red Hat, SUSE, and others who would do the rolling for you and offer commercial support.

The same thing is happening again with the Hadoop project, with Apache providing a home for Hadoop and its related tools to allow them to be improved and integrated. Meanwhile companies such as Cloudera, Hortonworks, MapR, IBM, Oracle, and a number of other smaller players are offering complete Hadoop stacks, or support for parts of the stack. The release number on Apache Hadoop is therefore perhaps not as relevant as the stacks others are providing commercial support for, and what elements they pick and choose from among the various Hadoop trees.

Hadoop and its related HDFS were created by techies at Yahoo! to mimic the prior generation of search engine indexing technology – MapReduce and Google File System to be specific – which was used by Google back in the mid-2000s. Google has since moved on, much as Hadoop MapReduce is being augmented to include NoSQL data stores and expanded to other data crunching algorithms besides mapping and reducing. Hadoop has been an Apache project for the past six years, with heavy contributions from Yahoo! and the Hadoop commercializers.
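For readers who have not met the mapping and reducing in question, here is a minimal single-machine sketch of the pattern in plain Python. It is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a real cluster spreads each phase across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over an iterable of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big elephant", "data munching"]))
```

The point of the split is that the map and reduce functions each look at one record or one key at a time, which is what lets a framework run them in parallel across a cluster.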

Not all of the possible code modules that work with Hadoop have been included in the Hadoop v1.0 release, but the key functionality is in there to start doing real work. However, some useful functions for Hadoop are closed source and under control of the commercial distros.

The core Hadoop code in the v1.0 stack from Apache is based on Hadoop 0.20.205.0 from the 0.20-security code line, which came out in October. In addition to the core MapReduce algorithm and HDFS, Hadoop v1.0 includes improvements to HBase, the column-oriented add-on for HDFS that is modeled on Google's BigTable; WebHDFS, a REST API front-end for HDFS; as well as performance enhancements and bug fixes for Hadoop, HDFS, and HBase. (You can read the release notes here and the v1.0 documentation there.)
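To give a feel for what a REST front-end means in practice, the sketch below builds WebHDFS request URLs, which address files as `/webhdfs/v1/<path>?op=<OPERATION>`. The host name, the port 50070, the path, and the user name are assumptions chosen for illustration, and the helper `webhdfs_url` is hypothetical rather than part of any Hadoop client library. It only constructs URLs and does not contact a cluster.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS URL for one file system operation.

    `path` is an absolute HDFS path such as "/user/reg/logs".
    `op` is a WebHDFS operation name such as "LISTSTATUS" or "OPEN".
    `params` holds optional extra query parameters, e.g. {"user.name": "reg"}.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Assumed host, port, and path, for illustration only.
    print(webhdfs_url("namenode.example.com", 50070, "/user/reg/logs", "LISTSTATUS"))
    print(webhdfs_url("namenode.example.com", 50070, "/user/reg/logs/a.txt", "OPEN",
                      {"user.name": "reg"}))
```

Any HTTP client can then issue a GET against such a URL, which is the appeal: no Java client libraries are needed on the calling machine.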

This core Hadoop software is akin to a Linux kernel for big data, with other tools such as HBase, Hive (for ad hoc SQL-like queries), Pig (a high-level programming language for Hadoop), and ZooKeeper (a configuration server for clusters) layered on top of it. These are also Apache projects, as are the related Mahout machine learning add-on and the Cassandra alternative data store for Hadoop.

A full-blown Hadoop distribution with multiple data stores and data munching algorithms, as El Reg outlined last November from Hadoop World, is getting complicated and crowded. Or diverse and more useful, if you want to be nice to the stuffed elephant that is Hadoop's namesake. ®
