IBM snuggles up BigInsights Hadoop with Cloudera

Who says Blue elephants can't dance?

Like all of the other Hadoop disties, Big Blue is upgrading its big data muncher to the most recent Apache modules and integrating it with its own add-ons. And with its InfoSphere BigInsights Enterprise Edition V1.4, IBM is also doing something else interesting: It is letting customers slide in the Cloudera CDH distribution and slide out its own variant of the core Apache Hadoop stack.

IBM started rolling up its own Hadoop back in May 2010, when it became clear that everybody was going to want an elephant in the data center and that IBM had some other software, cooked up at IBM Research and in Software Group, that it could layer on top. It was also clear at the time – and El Reg said as much – that IBM could have just partnered with Cloudera and been a reseller of its Cloudera Distribution for Apache Hadoop (CDH) and the Cloudera Manager control freak to tame the wild elephant. IBM felt at the time that Hadoop was important enough for it to do its own distro and to get its own support revenue stream for the core Apache Hadoop tools and a licensing stream for its add-ons.

Those BigInsights add-ons include BigSheets, a tool that rides atop the Hadoop Distributed File System and presents a virtual spreadsheet in a Web browser that you can use to surf through the unstructured data. Later BigInsights releases include the text analytics engine that IBM put into the Watson QA system that played Jeopardy! and beat humanity, as well as Jaql, an SQL-like ad hoc query language that can interface with HBase, the database layer that rides on top of HDFS, indexes created by the Lucene search engine, or databases running on IBM's own Netezza parallel database appliances.

IBM has also created its own BigInsights Scheduler, which is an extension to the Hadoop Fair Scheduler, to handle workflow and allocation of MapReduce jobs on Hadoop clusters. It has also created a technique called Adaptive MapReduce, a tweak on the core MapReduce algorithms in Hadoop that corrals the many small jobs running on Hadoop and keeps them from consuming too many cluster resources.

IBM also provides connectors into its DB2 databases, its InfoSphere Warehouse data warehouses, and its Smart Analytics System appliances, which do data warehousing and analytics work on the same iron.

With BigInsights Enterprise Edition V1.4, IBM is updating the core Hadoop components to the 1.0.0 release level and is also including Flume 0.9.4, HBase 0.90.5, Hive 0.8.0, Oozie 2.3.1, Nutch 1.4, Pig 0.9.1, and Zookeeper 3.3.4.

IBM is supporting BigInsights V1.4 only on x86-based systems (Watson ran on Power iron, and it is funny how IBM has not put the latest BigInsights on Power), and server nodes in the Hadoop cluster have to be running Red Hat Enterprise Linux 5 or 6 or SUSE Linux Enterprise Server 11 (at base or SP1 level).

You would think that IBM would have the latest BigInsights running on PowerLinux and AIX if it was so keen to peddle Power machinery. BigInsights V1.3 is supported on PowerLinux servers, and later this year V1.4 will be moved to Power iron and will be supported in sync with x86 iron from there on out.

With the updated BigInsights, IBM is also letting customers swap in the Cloudera CDH variant of the open source Apache Hadoop stack. Presumably that means CDH3, which is the closest thing to the current BigInsights release, and not the new CDH4 release announced a month ago, which is basically the alpha level of Apache 2.0.0+73 backported to allow for the old MapReduce methodology as well as for the still-experimental MapReduce2 (MR2, sometimes known as Yarn).

MapR Technologies and Hortonworks have not pushed their Hadoop stacks this hard and are content, like IBM, to let Apache Hadoop 2.0 mature a bit. In any event, the important thing is that all of the add-ons that IBM created for its own Hadoop distro work with the CDH distro.

IBM is tossing in a limited-use license to its InfoSphere Streams software, the code underneath the System S streaming supercomputer, which originally ran on BlueGene parallel supercomputers. Streams is used to mash up text, video, and audio streams with databases and other data storehouses in real time, and to create various kinds of trading systems or process control systems that react to all this data as it arrives rather than waiting for it to be sifted, as in a data warehouse.

IBM prices its BigInsights distribution using a metric called the Resource Value Unit, or RVU, which is based on the storage capacity of the Hadoop cluster, unlike the other distros, which count server nodes. You take the total capacity under management in the Hadoop cluster and apply these multipliers, tier by tier, to come up with an aggregate RVU for the cluster:

  • 1TB - 12TB, 1.00
  • 13TB - 32TB, 0.84
  • 33TB - 64TB, 0.61
  • 65TB - 100TB, 0.46
  • 101TB - 250TB, 0.38
  • 251TB+, 0.30

So let's say you have a 40TB cluster. That's 12TB multiplied by 1.00 RVU in the first tier, 20TB multiplied by 0.84 in the second tier (16.8), and 8TB multiplied by 0.61 (4.88) in the third tier. Add it up, you get 33.68, and round up to 34 total RVUs.

A development license costs $5,000 per RVU, and a production environment costs $10,000 per RVU. This includes all the IBM software plus the rolled-up Apache code (from IBM or Cloudera), the InfoSphere Streams entry license, and a year of support for the whole shebang. For a 40TB production cluster, which works out to 34 RVUs, it comes out to $340,000.
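For anyone who wants to run the numbers on a different cluster size, the tiered RVU arithmetic can be sketched in a few lines of Python. This is an illustration, not IBM's pricing tool: the function names `cluster_rvus` and `license_cost` are made up for this example, and the code assumes that each tier's multiplier applies only to the capacity falling within that tier and that the total is rounded up once at the end, as in the 40TB case.

```python
import math

# (upper bound of tier in TB, RVU multiplier), taken from the tier list.
# None marks the open-ended 251TB+ tier.
TIERS = [
    (12, 1.00),
    (32, 0.84),
    (64, 0.61),
    (100, 0.46),
    (250, 0.38),
    (None, 0.30),
]

# Per-RVU list prices quoted in the article, in US dollars.
PRICE_PER_RVU = {"development": 5_000, "production": 10_000}


def cluster_rvus(capacity_tb):
    """Return the aggregate RVU count for a cluster of capacity_tb terabytes."""
    total = 0.0
    lower = 0  # capacity already accounted for by earlier tiers
    for upper, multiplier in TIERS:
        if capacity_tb <= lower:
            break
        top = capacity_tb if upper is None else min(capacity_tb, upper)
        total += (top - lower) * multiplier
        lower = top
    # Assumption: round up once, on the cluster total. The inner round()
    # only guards against floating-point noise nudging the sum past a
    # whole number.
    return math.ceil(round(total, 6))


def license_cost(capacity_tb, license_type="production"):
    """Return the list price in dollars for the given capacity and license."""
    return cluster_rvus(capacity_tb) * PRICE_PER_RVU[license_type]


if __name__ == "__main__":
    print(cluster_rvus(40))                  # 34
    print(license_cost(40, "production"))    # 340000
    print(license_cost(40, "development"))   # 170000
```

The 40TB case reproduces the 34 RVUs and $340,000 production price worked out above; the same cluster under a development license would come to $170,000 under these assumptions.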

BigInsights Enterprise Edition V1.4 ships on July 13. Presumably there will be a BigInsights Basic Edition, the first of which IBM rolled out last May. The Basic Edition is a support contract that covers just the core Hadoop stuff plus Jaql; pricing was not announced. ®
