Feeds

IBM snuggles up BigInsights Hadoop with Cloudera

Who says Blue elephants can't dance?

Maximizing your infrastructure through virtualization

Like all of the other Hadoop disties, Big Blue is upgrading its big data muncher to the most recent Apache modules and integrating it with its own add-ons. And with its InfoSphere BigInsights Enterprise Edition V1.4, IBM is also doing something else interesting: It is letting customers slide in the Cloudera CDH distribution and slide out its own variant of the core Apache Hadoop stack.

IBM started rolling up its own Hadoop back in May 2010, when it became clear that everybody was going to want an elephant in the data center and that IBM had some other software, cooked up at IBM Research and in Software Group. It was also clear at the time – and El Reg said as much – that IBM could have just partnered with Cloudera and been a reseller of its Cloudera Distribution for Apache Hadoop (CDH) and Cloudera Manager control freak to tame the wild elephant. IBM felt at the time that Hadoop was important enough for it to do its own distro and to get its own support revenue stream for the core Apache Hadoop tools and a licensing stream for its add-on.

IBM BigInsights Hadoop logo

Those BigInsights add-ons include BigSheets, a tool that rides atop of the Hadoop Distributed File System and presents a virtual spreadsheet that you can surf through in a Web browser to surf through the unstructured data. Later BigInsights releases include the text analytics engine that IBM put into the Watson QA system that played Jeopardy! and beat humanity as well as Jaql, an SQL-like ad hoc query language that can interface with HBase, the database layer that rides on top of HDFS, indexes created by the Lucene search engine, or databases running on IBM's own Netezza parallel database appliances.

IBM has also created its own BigInsights Scheduler to handle workflow and allocation of MapReduce jobs on Hadoop clusters, which is itself an extension to the Hadoop Fair Scheduler; and a technique called Adaptive MapReduce, which is a tweak on the core MapReduce algorithms in Hadoop that corrals many small jobs running on Hadoop and keeps them from consuming too much cluster resources.

IBM also provides connectors into its DB2 databases, its InfoSphere Warehouse data warehouses, and its Smart Analytics System appliances, which do data warehousing and analytics work on the same iron.

With BigInsights Enterprise Edition V1.4, IBM is updating the core Hadoop components to the 1.0.0 release level and also includes Flume 0.9.4, HBase 0.90.5, Hive 0.8.0, Oozie 2.3.1, Nutch 1.4, Pig 0.9.1, and Zookeeper 3.3.4.

IBM is supporting BigInsights V1.4 only on x86-based systems (Watson ran on Power iron, and it is funny how IBM has not put the latest BigInsights on Power), and server nodes in the Hadoop cluster have to be running Red Hat Enterprise Linux 5 or 6 or SUSE Linux Enterprise Server 11 (at base or SP1 level).

You would think that IBM would have the latest BigInsights running on Power-Linux and AIX if it was so keen to peddle Power machinery. BigInsights V1.3 is supported on PowerLinux servers, and later this year V1.4 will be moved to Power iron and will be supported in synch with x86 iron from there on out.

With the updated BigInsights, IBM is also letting customers swap in the Cloudera CDH variant of the open source Apache Hadoop stack – presumably CDH3, which is the closest thing to the current BigInsights release and not the new CDH4 release announced a month ago that is basically the alpha level of Apache 2.0.0+73 backported to also allow for the old MapReduce methodology as well as for the still-experimental MapReduce2 (MR2 or sometimes known as Yarn).

MapR Technologies and Hortonworks have not pushed their Hadoop stacks this hard and are content, like IBM, to let Apache Hadoop 2.0 mature a bit. In any event, the important thing is that all of the add-ons that IBM created for its own Hadoop distro work with the CDH distro.

IBM is tossing in a limited-use license to its InfoSphere Streams software, the code underneath the System S streaming supercomputer, which originally ran on BlueGene parallel supercomputers and which is used to mash up text, video, and audio streams with databases and other data storehouses in real-time and to create various kinds of trading systems or process control systems that react to all this data rather than wait to be sifted like a data warehouse.

IBM prices its BigInsights distribution using a metric called the Resource Value Unit, or RVU, and it is based on the storage capacity of the Hadoop cluster, unlike the other distros which count server nodes. You take the total capacity under management in the Hadoop cluster and you use these multipliers to come up with an aggregate RVU for the cluster:

  • 1TB - 12TB, 1.00
  • 13TB - 32TB, 0.84
  • 33TB - 64TB, 0.61
  • 65TB - 100TB, 0.46
  • 101TB - 250TB, 0.38
  • 251TB+, 0.30

So let's say you have a 40TB cluster. That's 12TB multiplied by 1.00 RVU in the first tier, 20TB multiplied by 0.84 in the second tier (16.8), and 8TB multiplied by 0.61 (4.88) in the third tier. Add it up, you get 33.68, and round up to 34 total RVUs.

For a developmental license, it costs $5,000 per RVU and for a production environment you are talking $10,000 per RVU. This includes all the IBM software plus the rolled up Apache code (from IBM or Cloudera), the InfoSphere Streams entry license, and a year of support for the whole shebang, and it comes out to $340,000.

BigInsights Enterprise Edition V1.4 ships on July 13. Presumably there will a BigInsights Basic Edition, the first of which IBM rolled out last May. The Basic Edition is a support contract that covers just the core Hadoop stuff plus Jaql; pricing was not announced. ®

The Power of One eBook: Top reasons to choose HP BladeSystem

More from The Register

next story
Sysadmin Day 2014: Quick, there's still time to get the beers in
He walked over the broken glass, killed the thugs... and er... reconnected the cables*
Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage
Still doing 'forensics' on the caching layer – Beeb digi wonk
SHOCK and AWS: The fall of Amazon's deflationary cloud
Just as Jeff Bezos did to books and CDs, Amazon's rivals are now doing to it
BlackBerry: Toss the server, mate... BES is in the CLOUD now
BlackBerry Enterprise Services takes aim at SMEs - but there's a catch
The triumph of VVOL: Everyone's jumping into bed with VMware
'Bandwagon'? Yes, we're on it and so what, say big dogs
Carbon tax repeal won't see data centre operators cut prices
Rackspace says electricity isn't a major cost, Equinix promises 'no levy'
Disaster Recovery upstart joins DR 'as a service' gang
Quorum joins the aaS crowd with DRaaS offering
prev story

Whitepapers

Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Consolidation: The Foundation for IT Business Transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.
Application security programs and practises
Follow a few strategies and your organization can gain the full benefits of open source and the cloud without compromising the security of your applications.
How modern custom applications can spur business growth
Learn how to create, deploy and manage custom applications without consuming or expanding the need for scarce, expensive IT resources.
Securing Web Applications Made Simple and Scalable
Learn how automated security testing can provide a simple and scalable way to protect your web applications.