IBM punts commercial Hadoop distro

Big Blue elephant in the data center

With so much of its future sales and growth staked on smart infrastructure and the data analytics that enables it, it comes as no surprise that IBM has taken a shine to the open source Hadoop big data crunching software that has found a loving home at the Apache Foundation. Today, IBM announced it has created a commercial version of Hadoop as well as some add-ons and - you guessed it - implementation services to make Hadoop more consumable for the Global 20000.

Not everyone is a Google, where the MapReduce distributed data cruncher and its related file system were created, or even a Yahoo, where Hadoop was nurtured to do what Google does - but in an open source, community-driven fashion. Hadoop is used at Yahoo!, Facebook, and Twitter, and it helps drive a portion of Microsoft's Bing search engine. But it is not widely understood in the corporations where IBM does its business.

Bernie Spang, director of product strategy for database software and systems at IBM, says that the company needs Hadoop to complete its data analytics hat trick. IBM has traditional data warehousing and predictive analytics in its InfoSphere, Cognos, and now SPSS products, which can extract data from transactional systems to help companies make better decisions. And it has the "System S" InfoSphere Streams system, which debuted as a prototype a year ago to mash up streaming data from text, video, and audio streams and mix it with databases to create something that is a bit more real-time than a data warehouse, helping governments and companies wade through mountains of data to make decisions (like trading options a hell of a lot faster than most systems can, as the prototype did).

Spang says that IBM needs to offer a product that does the "big data" crunching that the Googles of the world do because its own customers have loads of structured and unstructured data that can be sucked into a Hadoop file system and chewed on using MapReduce for a wider, finer-grained, and more long-term analysis than can be done with a data warehouse or stream system.

And that is why IBM is creating its own distro of Hadoop, which is called InfoSphere BigInsights. Spang called BigInsights an enterprise-ready version of the Apache Hadoop code that IBM will package up and install for customers who want to build their own Hadoop grids. IBM has done about a dozen Hadoop installations to build up experience setting up the code and systems, and now feels it knows enough to offer commercial support for the Hadoop software, along with services and expertise in how Hadoop can be used for risk management and analysis at financial firms or for all kinds of cross-linking in social networking and online entertainment applications. IBM will plan your Hadoop installation for you, set it up, and even monitor it for you. Just get out that checkbook.

IBM could have just done the easy thing and partnered with Cloudera, which back in March 2009 launched a commercialized version of the Hadoop Distributed File System, the MapReduce parallelization and data-crunching algorithm to chew on Webby data, and the Hive client library associated with Hadoop. But Big Data is important enough that IBM feels compelled to offer its own distro.

While IBM is now a competitor to Cloudera, Big Blue says it will participate with the members of the Apache Hadoop community, singling out Cloudera and Karmasphere, which has created a graphical tool for debugging Hadoop apps, by name.

Cloudera welcomes IBM's arrival. "I am excited to see more organizations like IBM get behind the Apache Hadoop project," said Cloudera's Doug Cutting, the man who founded Hadoop. "IBM has been working for some time on Hadoop-related projects for its internal use such as BigSheets and I am looking forward to their investment in the core open source platform development as well.

"At Cloudera we've seen incredible Hadoop uptake in mainstream enterprises which has been reflected in the growth of our own business. I see no end to the number of applications of this new technology. IBM's entry means more open source contributors will help expand the horizons for Hadoop around the world."

The InfoSphere BigInsights distro will have some home-grown IBM software as well, including a technology preview of something called BigSheets that Spang says is basically a spreadsheet front-end running in a Web browser that is used for consolidating and visualizing the chewed data coming out of Hadoop, which can be terabytes or petabytes of Web pages and other kinds of unstructured data.

As an example of how BigSheets can interface with Hadoop, IBM is working with the British Library to archive and preserve 5 TB of Web pages culled from sites with the .co.uk domain. The BigSheets interface will let researchers, academics, and students chew on this data and search it in more sophisticated ways than is possible using a search engine.

IBM is not divulging its prices for the BigInsights Hadoop distro or what the various installation and support services cost. The BigInsights distro is available today. It is not clear when BigSheets will move from technology preview to production, but you can find out more about the software here. Spang said that IBM has other tools to make Hadoop do more tricks, but it is a fair guess that these will cost more than peanuts. ®
