IBM snuggles up BigInsights Hadoop with Cloudera

Who says Blue elephants can't dance?

Like all of the other Hadoop disties, Big Blue is upgrading its big data muncher to the most recent Apache modules and integrating it with its own add-ons. And with its InfoSphere BigInsights Enterprise Edition V1.4, IBM is also doing something else interesting: It is letting customers slide in the Cloudera CDH distribution and slide out its own variant of the core Apache Hadoop stack.

IBM started rolling up its own Hadoop back in May 2010, when it became clear that everybody was going to want an elephant in the data center and that IBM had some other software, cooked up at IBM Research and in Software Group, that it could layer on top. It was also clear at the time – and El Reg said as much – that IBM could have just partnered with Cloudera and been a reseller of its Cloudera Distribution for Apache Hadoop (CDH) and Cloudera Manager control freak to tame the wild elephant. IBM felt at the time that Hadoop was important enough for it to do its own distro and to get its own support revenue stream for the core Apache Hadoop tools and a licensing stream for its add-ons.

Those BigInsights add-ons include BigSheets, a tool that rides atop the Hadoop Distributed File System and presents a virtual spreadsheet that you can use in a Web browser to surf through the unstructured data. Later BigInsights releases include the text analytics engine that IBM put into the Watson QA system that played Jeopardy! and beat humanity, as well as Jaql, an SQL-like ad hoc query language that can interface with HBase, the database layer that rides on top of HDFS; indexes created by the Lucene search engine; or databases running on IBM's own Netezza parallel database appliances.

IBM has also created its own BigInsights Scheduler, itself an extension to the Hadoop Fair Scheduler, to handle workflow and allocation of MapReduce jobs on Hadoop clusters; and a technique called Adaptive MapReduce, which is a tweak on the core MapReduce algorithms in Hadoop that corrals many small jobs running on Hadoop and keeps them from consuming too many cluster resources.

IBM also provides connectors into its DB2 databases, its InfoSphere Warehouse data warehouses, and its Smart Analytics System appliances, which do data warehousing and analytics work on the same iron.

With BigInsights Enterprise Edition V1.4, IBM is updating the core Hadoop components to the 1.0.0 release level and is also including Flume 0.9.4, HBase 0.90.5, Hive 0.8.0, Oozie 2.3.1, Nutch 1.4, Pig 0.9.1, and Zookeeper 3.3.4.

IBM is supporting BigInsights V1.4 only on x86-based systems (Watson ran on Power iron, and it is funny how IBM has not put the latest BigInsights on Power), and server nodes in the Hadoop cluster have to be running Red Hat Enterprise Linux 5 or 6 or SUSE Linux Enterprise Server 11 (at base or SP1 level).

You would think that IBM would have the latest BigInsights running on Power-Linux and AIX if it was so keen to peddle Power machinery. BigInsights V1.3 is supported on PowerLinux servers, and later this year V1.4 will be moved to Power iron and will be supported in synch with x86 iron from there on out.

With the updated BigInsights, IBM is also letting customers swap in the Cloudera CDH variant of the open source Apache Hadoop stack – presumably CDH3, which is the closest thing to the current BigInsights release, and not the new CDH4 release announced a month ago, which is basically the alpha level of Apache 2.0.0+73 backported to also allow for the old MapReduce methodology as well as for the still-experimental MapReduce2 (MR2, sometimes known as YARN).

MapR Technologies and Hortonworks have not pushed their Hadoop stacks this hard and are content, like IBM, to let Apache Hadoop 2.0 mature a bit. In any event, the important thing is that all of the add-ons that IBM created for its own Hadoop distro work with the CDH distro.

IBM is tossing in a limited-use license to its InfoSphere Streams software, the code underneath the System S streaming supercomputer, which originally ran on BlueGene parallel supercomputers. It is used to mash up text, video, and audio streams with databases and other data storehouses in real time and to create various kinds of trading systems or process control systems that react to all this data as it arrives rather than waiting for it to be sifted as in a data warehouse.

IBM prices its BigInsights distribution using a metric called the Resource Value Unit, or RVU, which is based on the storage capacity of the Hadoop cluster, unlike the other distros, which count server nodes. You take the total capacity under management in the Hadoop cluster and apply these multipliers, tier by tier, to come up with an aggregate RVU for the cluster:

  • 1TB - 12TB, 1.00
  • 13TB - 32TB, 0.84
  • 33TB - 64TB, 0.61
  • 65TB - 100TB, 0.46
  • 101TB - 250TB, 0.38
  • 251TB+, 0.30

So let's say you have a 40TB cluster. That's 12TB multiplied by 1.00 RVU in the first tier, 20TB multiplied by 0.84 in the second tier (16.8), and 8TB multiplied by 0.61 (4.88) in the third tier. Add it up, you get 33.68, and round up to 34 total RVUs.

For a development license, it costs $5,000 per RVU, and for a production environment you are talking $10,000 per RVU. This includes all the IBM software plus the rolled up Apache code (from IBM or Cloudera), the InfoSphere Streams entry license, and a year of support for the whole shebang. For the 40TB production cluster in the example, that comes out to $340,000.
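The tiered arithmetic above can be sketched in a few lines of Python. This is only an illustration of the worked example, not IBM's official calculator: the function names are made up here, and it assumes capacity is counted in whole terabytes, that each multiplier applies only to the terabytes falling inside its tier, and that the total is rounded up once at the end. Multipliers are stored as integer hundredths to keep the rounding exact.

```python
# Tier table from the article: (upper bound in TB, multiplier in hundredths).
# None marks the open-ended 251TB+ tier.
TIERS = [(12, 100), (32, 84), (64, 61), (100, 46), (250, 38), (None, 30)]

# Per-RVU list prices quoted in the article, in US dollars.
PRICE_PER_RVU = {"development": 5_000, "production": 10_000}


def cluster_rvus(capacity_tb: int) -> int:
    """Aggregate RVUs for a cluster, rounded up (assumed marginal tiers)."""
    hundredths = 0
    lower = 0  # capacity already covered by earlier tiers
    for upper, multiplier in TIERS:
        if capacity_tb <= lower:
            break
        top = capacity_tb if upper is None else min(capacity_tb, upper)
        hundredths += (top - lower) * multiplier
        if upper is None:
            break
        lower = upper
    return -(-hundredths // 100)  # integer ceiling division


def license_cost(capacity_tb: int, environment: str) -> int:
    """List price in dollars for a 'development' or 'production' license."""
    return cluster_rvus(capacity_tb) * PRICE_PER_RVU[environment]


if __name__ == "__main__":
    print(cluster_rvus(40))                # 34
    print(license_cost(40, "production"))  # 340000
```

Under the same assumptions, a development license for that 40TB cluster would run to half the production price.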

BigInsights Enterprise Edition V1.4 ships on July 13. Presumably there will be a BigInsights Basic Edition, the first of which IBM rolled out last May. The Basic Edition is a support contract that covers just the core Hadoop stuff plus Jaql; pricing was not announced. ®
