Apache lets fly Hadoop 1.0 data muncher

This ain't no Dumbo

The Hadoop project at the Apache Software Foundation is beating its chest for delivering the v1.0 version of the open source MapReduce data analysis tool, its Hadoop Distributed File System (HDFS), and other related code.

While software version and release numbers can sometimes be arbitrary, they are often also symbolic, and in this case the jump up to v1.0 is a little bit of both. But it's clearly intended to signify that the open source software is ready for primetime use among IT organizations that are not data analytics hotshots, but want to be.

This is akin to the position that the open source Linux operating system was in towards the tail end of the dot-com boom, with the Linux 2.2 kernel, which had the stability and scalability that enterprises required as they contemplated alternatives to more expensive Unix and proprietary operating systems.

In the dot-com bust, when Linux began its assault on the data center, the open source code was available for free for those who wanted to roll their own Linuxes, alongside commercial releases from Red Hat, SUSE, and others who would do the rolling for you and offer commercial support.

The same thing is happening again with the Hadoop project, with Apache providing a home for Hadoop and its related tools to allow them to be improved and integrated. Meanwhile, companies such as Cloudera, Hortonworks, MapR, IBM, Oracle, and a number of smaller players are offering complete Hadoop stacks, or support for parts of the stack. The release number on Apache Hadoop is therefore perhaps not as relevant as the stacks for which others are providing commercial support, and which elements they pick and choose from among the various Hadoop trees.

Hadoop and its related HDFS were created by techies at Yahoo! to mimic the prior generation of search engine indexing technology – MapReduce and Google File System to be specific – which was used by Google back in the mid-2000s. Google has since moved on, much as Hadoop MapReduce is being augmented to include NoSQL data stores and expanded to other data crunching algorithms besides mapping and reducing. Hadoop has been an Apache project for the past six years, with heavy contributions from Yahoo! and the Hadoop commercializers.
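
For readers new to the idea, the "mapping and reducing" mentioned above can be sketched in a few lines of Python. This is a toy, single-process word count and not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real cluster would spread each phase across many machines with the data sitting in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all the values collected for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big elephant", "data munching"]))
```

The point of the split is that each call to map_phase and reduce_phase depends only on its own input, which is what lets a framework farm the work out to many nodes.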

Not all of the possible code modules that work with Hadoop have been included in the Hadoop v1.0 release, but the key functionality is in there to start doing real work. However, some useful functions for Hadoop are closed source and under control of the commercial distros.

The core Hadoop code in the v1.0 stack from Apache is based on Hadoop 0.20.205.0 from the 0.20-security code line, which came out in October. In addition to the core MapReduce algorithm and HDFS, Hadoop v1.0 includes improvements to HBase, the column-oriented add-on for HDFS that is modeled on Google's BigTable; WebHDFS, a REST API front-end for HDFS; as well as performance enhancements and bug fixes for Hadoop, HDFS, and HBase. (Release notes and v1.0 documentation are available from the Apache Hadoop project.)

This core Hadoop software is akin to a Linux kernel for big data, with other tools layered on top of it, such as HBase, Hive (for ad hoc SQL-like queries), Pig (a high-level programming language for Hadoop), and ZooKeeper (a configuration server for clusters). These are also Apache projects, as are the related Mahout machine learning add-on and the Cassandra alternative data store for Hadoop.

A full-blown Hadoop distribution with multiple data stores and data munching algorithms, as El Reg outlined last November from Hadoop World, is getting complicated and crowded. Or diverse and more useful, if you want to be nice to the stuffed elephant that is Hadoop's namesake. ®
