IBM snuggles up BigInsights Hadoop with Cloudera

Who says Blue elephants can't dance?

Like all of the other Hadoop disties, Big Blue is upgrading its big data muncher to the most recent Apache modules and integrating it with its own add-ons. And with its InfoSphere BigInsights Enterprise Edition V1.4, IBM is also doing something else interesting: It is letting customers slide in the Cloudera CDH distribution and slide out its own variant of the core Apache Hadoop stack.

IBM started rolling up its own Hadoop back in May 2010, when it became clear that everybody was going to want an elephant in the data center and that IBM had some other software, cooked up at IBM Research and in Software Group, to layer on top of it. It was also clear at the time – and El Reg said as much – that IBM could have just partnered with Cloudera and been a reseller of its Cloudera Distribution for Apache Hadoop (CDH) and its Cloudera Manager control freak to tame the wild elephant. IBM felt at the time that Hadoop was important enough for it to do its own distro and to get its own support revenue stream for the core Apache Hadoop tools and a licensing stream for its add-ons.

Those BigInsights add-ons include BigSheets, a tool that rides atop the Hadoop Distributed File System and presents a virtual spreadsheet in a Web browser that you can use to surf through the unstructured data. Later BigInsights releases include the text analytics engine that IBM put into the Watson QA system that played Jeopardy! and beat humanity, as well as Jaql, an SQL-like ad hoc query language that can interface with HBase, the database layer that rides on top of HDFS; with indexes created by the Lucene search engine; or with databases running on IBM's own Netezza parallel database appliances.

IBM has also created its own BigInsights Scheduler, itself an extension to the Hadoop Fair Scheduler, to handle workflow and allocation of MapReduce jobs on Hadoop clusters; and a technique called Adaptive MapReduce, a tweak on the core MapReduce algorithms in Hadoop that corrals the many small jobs running on a cluster and keeps them from consuming too many cluster resources.

IBM also provides connectors into its DB2 databases, its InfoSphere Warehouse data warehouses, and its Smart Analytics System appliances, which do data warehousing and analytics work on the same iron.

With BigInsights Enterprise Edition V1.4, IBM is updating the core Hadoop components to the 1.0.0 release level and is also including Flume 0.9.4, HBase 0.90.5, Hive 0.8.0, Oozie 2.3.1, Nutch 1.4, Pig 0.9.1, and Zookeeper 3.3.4.

IBM is supporting BigInsights V1.4 only on x86-based systems (Watson ran on Power iron, and it is funny how IBM has not put the latest BigInsights on Power), and server nodes in the Hadoop cluster have to be running Red Hat Enterprise Linux 5 or 6 or SUSE Linux Enterprise Server 11 (at base or SP1 level).

You would think that IBM would have the latest BigInsights running on Power-Linux and AIX if it was so keen to peddle Power machinery. BigInsights V1.3 is supported on PowerLinux servers, and later this year V1.4 will be moved to Power iron and will be supported in synch with x86 iron from there on out.

With the updated BigInsights, IBM is also letting customers swap in the Cloudera CDH variant of the open source Apache Hadoop stack. That is presumably CDH3, which is the closest thing to the current BigInsights release, and not the new CDH4 release announced a month ago, which is basically the alpha level of Apache 2.0.0+73 backported to also allow for the old MapReduce methodology as well as for the still-experimental MapReduce2 (MR2, sometimes known as YARN).

MapR Technologies and Hortonworks have not pushed their Hadoop stacks this hard and are content, like IBM, to let Apache Hadoop 2.0 mature a bit. In any event, the important thing is that all of the add-ons that IBM created for its own Hadoop distro work with the CDH distro.

IBM is tossing in a limited-use license to its InfoSphere Streams software, the code underneath the System S streaming supercomputer, which originally ran on BlueGene parallel supercomputers. Streams is used to mash up text, video, and audio streams with databases and other data storehouses in real time, and to create various kinds of trading systems or process control systems that react to all this data as it arrives rather than wait for it to be sifted as in a data warehouse.

IBM prices its BigInsights distribution using a metric called the Resource Value Unit, or RVU, and it is based on the storage capacity of the Hadoop cluster, unlike the other distros which count server nodes. You take the total capacity under management in the Hadoop cluster and you use these multipliers to come up with an aggregate RVU for the cluster:

  • 1TB - 12TB, 1.00
  • 13TB - 32TB, 0.84
  • 33TB - 64TB, 0.61
  • 65TB - 100TB, 0.46
  • 101TB - 250TB, 0.38
  • 251TB+, 0.30

So let's say you have a 40TB cluster. That's 12TB multiplied by 1.00 RVU in the first tier, 20TB multiplied by 0.84 in the second tier (16.8), and 8TB multiplied by 0.61 (4.88) in the third tier. Add it up, you get 33.68, and round up to 34 total RVUs.

For a development license, it costs $5,000 per RVU, and for a production environment you are talking $10,000 per RVU. This includes all the IBM software plus the rolled up Apache code (from IBM or Cloudera), the InfoSphere Streams entry license, and a year of support for the whole shebang. For the 40TB example cluster at 34 RVUs, a production license comes out to $340,000.
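The tiered arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation as the article describes it, not IBM's own pricing tool: the function and variable names are made up for the example, capacities are assumed to be whole terabytes, and rounding up once at the end is taken from the 40TB walk-through.

```python
import math

# Tier table from the article: (top of tier in TB, RVU multiplier per TB).
# math.inf stands in for the open-ended "251TB+" tier.
RVU_TIERS = [
    (12, 1.00),
    (32, 0.84),
    (64, 0.61),
    (100, 0.46),
    (250, 0.38),
    (math.inf, 0.30),
]

# Per-RVU list prices quoted in the article, in US dollars.
PRICE_PER_RVU = {"development": 5_000, "production": 10_000}


def rvus_for_capacity(capacity_tb):
    """Return the rounded-up RVU count for a cluster of capacity_tb terabytes."""
    total = 0.0
    lower = 0  # top of the previous tier
    for upper, multiplier in RVU_TIERS:
        if capacity_tb <= lower:
            break
        # Only the slice of capacity that falls inside this tier is charged
        # at this tier's multiplier.
        total += (min(capacity_tb, upper) - lower) * multiplier
        lower = upper
    # round() trims floating-point noise before rounding up to a whole RVU.
    return math.ceil(round(total, 6))


def license_cost(capacity_tb, license_type="production"):
    """Return the list price in dollars for the given capacity and license type."""
    return rvus_for_capacity(capacity_tb) * PRICE_PER_RVU[license_type]


if __name__ == "__main__":
    print(rvus_for_capacity(40))            # 34
    print(license_cost(40, "production"))   # 340000
    print(license_cost(40, "development"))  # 170000
```

Because each tier only charges for the terabytes that land inside it, doubling the cluster less than doubles the RVU count, which is the point of the sliding multipliers.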

BigInsights Enterprise Edition V1.4 ships on July 13. Presumably there will be a BigInsights Basic Edition, the first of which IBM rolled out last May. The Basic Edition is a support contract that covers just the core Hadoop stuff plus Jaql; pricing was not announced. ®
