Feeds

IBM snuggles up BigInsights Hadoop with Cloudera

Who says Blue elephants can't dance?

Choosing a cloud hosting partner with confidence

Like all of the other Hadoop disties, Big Blue is upgrading its big data muncher to the most recent Apache modules and integrating it with its own add-ons. And with its InfoSphere BigInsights Enterprise Edition V1.4, IBM is also doing something else interesting: It is letting customers slide in the Cloudera CDH distribution and slide out its own variant of the core Apache Hadoop stack.

IBM started rolling up its own Hadoop back in May 2010, when it became clear that everybody was going to want an elephant in the data center and that IBM had some other software, cooked up at IBM Research and in Software Group. It was also clear at the time – and El Reg said as much – that IBM could have just partnered with Cloudera and been a reseller of its Cloudera Distribution for Apache Hadoop (CDH) and Cloudera Manager control freak to tame the wild elephant. IBM felt at the time that Hadoop was important enough for it to do its own distro and to get its own support revenue stream for the core Apache Hadoop tools and a licensing stream for its add-on.

IBM BigInsights Hadoop logo

Those BigInsights add-ons include BigSheets, a tool that rides atop of the Hadoop Distributed File System and presents a virtual spreadsheet that you can surf through in a Web browser to surf through the unstructured data. Later BigInsights releases include the text analytics engine that IBM put into the Watson QA system that played Jeopardy! and beat humanity as well as Jaql, an SQL-like ad hoc query language that can interface with HBase, the database layer that rides on top of HDFS, indexes created by the Lucene search engine, or databases running on IBM's own Netezza parallel database appliances.

IBM has also created its own BigInsights Scheduler to handle workflow and allocation of MapReduce jobs on Hadoop clusters, which is itself an extension to the Hadoop Fair Scheduler; and a technique called Adaptive MapReduce, which is a tweak on the core MapReduce algorithms in Hadoop that corrals many small jobs running on Hadoop and keeps them from consuming too much cluster resources.

IBM also provides connectors into its DB2 databases, its InfoSphere Warehouse data warehouses, and its Smart Analytics System appliances, which do data warehousing and analytics work on the same iron.

With BigInsights Enterprise Edition V1.4, IBM is updating the core Hadoop components to the 1.0.0 release level and also includes Flume 0.9.4, HBase 0.90.5, Hive 0.8.0, Oozie 2.3.1, Nutch 1.4, Pig 0.9.1, and Zookeeper 3.3.4.

IBM is supporting BigInsights V1.4 only on x86-based systems (Watson ran on Power iron, and it is funny how IBM has not put the latest BigInsights on Power), and server nodes in the Hadoop cluster have to be running Red Hat Enterprise Linux 5 or 6 or SUSE Linux Enterprise Server 11 (at base or SP1 level).

You would think that IBM would have the latest BigInsights running on Power-Linux and AIX if it was so keen to peddle Power machinery. BigInsights V1.3 is supported on PowerLinux servers, and later this year V1.4 will be moved to Power iron and will be supported in synch with x86 iron from there on out.

With the updated BigInsights, IBM is also letting customers swap in the Cloudera CDH variant of the open source Apache Hadoop stack – presumably CDH3, which is the closest thing to the current BigInsights release and not the new CDH4 release announced a month ago that is basically the alpha level of Apache 2.0.0+73 backported to also allow for the old MapReduce methodology as well as for the still-experimental MapReduce2 (MR2 or sometimes known as Yarn).

MapR Technologies and Hortonworks have not pushed their Hadoop stacks this hard and are content, like IBM, to let Apache Hadoop 2.0 mature a bit. In any event, the important thing is that all of the add-ons that IBM created for its own Hadoop distro work with the CDH distro.

IBM is tossing in a limited-use license to its InfoSphere Streams software, the code underneath the System S streaming supercomputer, which originally ran on BlueGene parallel supercomputers and which is used to mash up text, video, and audio streams with databases and other data storehouses in real-time and to create various kinds of trading systems or process control systems that react to all this data rather than wait to be sifted like a data warehouse.

IBM prices its BigInsights distribution using a metric called the Resource Value Unit, or RVU, and it is based on the storage capacity of the Hadoop cluster, unlike the other distros which count server nodes. You take the total capacity under management in the Hadoop cluster and you use these multipliers to come up with an aggregate RVU for the cluster:

  • 1TB - 12TB, 1.00
  • 13TB - 32TB, 0.84
  • 33TB - 64TB, 0.61
  • 65TB - 100TB, 0.46
  • 101TB - 250TB, 0.38
  • 251TB+, 0.30

So let's say you have a 40TB cluster. That's 12TB multiplied by 1.00 RVU in the first tier, 20TB multiplied by 0.84 in the second tier (16.8), and 8TB multiplied by 0.61 (4.88) in the third tier. Add it up, you get 33.68, and round up to 34 total RVUs.

For a developmental license, it costs $5,000 per RVU and for a production environment you are talking $10,000 per RVU. This includes all the IBM software plus the rolled up Apache code (from IBM or Cloudera), the InfoSphere Streams entry license, and a year of support for the whole shebang, and it comes out to $340,000.

BigInsights Enterprise Edition V1.4 ships on July 13. Presumably there will a BigInsights Basic Edition, the first of which IBM rolled out last May. The Basic Edition is a support contract that covers just the core Hadoop stuff plus Jaql; pricing was not announced. ®

Security for virtualized datacentres

More from The Register

next story
It's Big, it's Blue... it's simply FABLESS! IBM's chip-free future
Or why the reversal of globalisation ain't gonna 'appen
'Hmm, why CAN'T I run a water pipe through that rack of media servers?'
Leaving Las Vegas for Armenia kludging and Dubai dune bashing
Bitcasa bins $10-a-month Infinite storage offer
Firm cites 'low demand' plus 'abusers'
Facebook slurps 'paste sites' for STOLEN passwords, sprinkles on hash and salt
Zuck's ad empire DOESN'T see details in plain text. Phew!
CAGE MATCH: Microsoft, Dell open co-located bit barns in Oz
Whole new species of XaaS spawning in the antipodes
Microsoft and Dell’s cloud in a box: Instant Azure for the data centre
A less painful way to run Microsoft’s private cloud
AWS pulls desktop-as-a-service from the PC
Support for PCoIP protocol means zero clients can run cloudy desktops
prev story

Whitepapers

Why cloud backup?
Combining the latest advancements in disk-based backup with secure, integrated, cloud technologies offer organizations fast and assured recovery of their critical enterprise data.
A strategic approach to identity relationship management
ForgeRock commissioned Forrester to evaluate companies’ IAM practices and requirements when it comes to customer-facing scenarios versus employee-facing ones.
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.
New hybrid storage solutions
Tackling data challenges through emerging hybrid storage solutions that enable optimum database performance whilst managing costs and increasingly large data stores.