Oracle mounts Cloudera's elephant for big data ride
Cloud biz chosen for number-stomping Hadoop stack
When Oracle co-founder and CEO Larry Ellison clambered onto his own Big Data elephant back in October as his company announced the Big Data Appliance, Oracle gave the impression that it would be rolling up its own implementation of the open-source Apache Hadoop data muncher. This turns out not to be true.
In fact, Oracle revealed today, as it began shipping its Hadoop stack (which is named after a stuffed elephant), that it has inked an OEM agreement with Cloudera – the first and thus far the biggest of the commercial Hadoop disties – to supply the Big Data Appliance's mapper and reducer. Specifically, the Big Data Appliance will include the CDH3 version of Hadoop stacked up by Cloudera as well as its Cloudera Manager 3.7 add-on, a control freak that debuted back in December 2011.
Because Oracle is using CDH3 as its core Hadoop, customers will not be limited to using Oracle's own NoSQL Database key-value store, which is based on its acquired BerkeleyDB. This had previously been the only data store that Oracle talked about when the Big Data Appliance was announced last fall at Oracle's OpenWorld extravaganza. Cetin Ozbutun, vice president of data warehousing technologies, tells El Reg that Big Data Appliance customers will be able to store data in the Hadoop Distributed File System (HDFS) that is part of the Apache Hadoop stack if they don't want to run Oracle NoSQL, which runs on top of Oracle's Enterprise Linux, its RHEL-ish re-implementation. And because CDH3 also includes HBase – the column-oriented add-on for HDFS that is modeled on Google's BigTable data store – Oracle's customers will also be able to use HBase if their applications run better on it than on raw HDFS or Oracle NoSQL Database. The Big Data Appliance uses the Community edition of Oracle's NoSQL data store and its HotSpot Java virtual machine running atop Oracle Enterprise Linux.
Oracle could have grabbed the Apache Hadoop code and rolled its own distro, much like it has grabbed Red Hat's Enterprise Linux and forked it a bit to tune it up for its own hardware and software stack while remaining compatible with RHEL. Ozbutun said that Oracle had evaluated this alternative as well as looking at other Hadoop distros from Hortonworks and MapR. Hortonworks is the spinout of the Yahoo! team that was supporting Hadoop inside of the search engine giant, and MapR is the Hadoop commercializer that OEMs its software to EMC's Greenplum data appliance division.
"We did consider a lot of different options, but we thought it best to partner with Cloudera," explained Ozbutun. "Cloudera is obviously the leader in this area, and we have expertise in other areas that are complementary."
Of course, Oracle used to partner with Red Hat, then it sold a clone of Red Hat support, and then it decided to do its own distribution and control all the money and the code. The odds are that over the long haul Oracle will buy Cloudera, buy another distro, or roll its own. Hadoop is just too important for Oracle to not own a piece of it: it's that simple.
This isn't a bedroom mashup project
The important thing about the Big Data Appliance is that it is an engineered system, not just a mashup of Oracle and Cloudera software. Ozbutun says that Oracle has spent months fine-tuning the configuration of the underlying hardware that runs the data munching algorithms, their various data stores, and the plug-ins that let Oracle databases and Hadoop data stores share information.
The Big Data Appliance is a rack of 18 Sun Fire x86 server nodes. The nodes each have two six-core Xeon X5675 processors running at 3.06GHz. The server tops out at 144GB of memory, but this particular configuration has 48GB of main memory (or 4GB per core). The server has a disk controller with 512MB of battery-backed cache memory and a dozen 3TB 7.2K RPM SAS disks (3.5-inchers, and one per core). The servers have two 40Gb/sec InfiniBand ports and four Gigabit Ethernet ports. The rack has two InfiniBand switches that have 32 QDR InfiniBand ports and eight 10GE ports and another plain vanilla 36-port QDR InfiniBand switch. (Oracle is a minority stakeholder in chip- and switch-maker Mellanox Technologies and uses Mellanox's silicon in its own switches.) There's another Ethernet switch in the rack to provide a separate management network for the Hadoop nodes.
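For a sense of scale, the per-node figures above multiply out across the rack. What follows is a back-of-the-envelope sketch rather than any Oracle sizing tool: the constants come from the configuration just described, while the function name `rack_totals` and its `replication` parameter are our own inventions, and the default of three copies is simply stock HDFS behaviour, not something Oracle has confirmed for the appliance.

```python
# Per-node figures as quoted in the article for one Big Data Appliance rack.
NODES = 18
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 6
MEM_GB_PER_NODE = 48
DISKS_PER_NODE = 12
DISK_TB = 3


def rack_totals(replication=3):
    """Multiply the per-node spec out to whole-rack totals.

    `replication` is an assumption: 3 is the stock HDFS default,
    not a setting confirmed for the appliance.
    """
    cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
    raw_disk_tb = NODES * DISKS_PER_NODE * DISK_TB
    return {
        "cores": NODES * cores_per_node,
        "memory_gb": NODES * MEM_GB_PER_NODE,
        "raw_disk_tb": raw_disk_tb,
        # Capacity left once HDFS stores `replication` copies of each block.
        "replicated_data_tb": raw_disk_tb / replication,
        "mem_gb_per_core": MEM_GB_PER_NODE / cores_per_node,
        "disks_per_core": DISKS_PER_NODE / cores_per_node,
    }


if __name__ == "__main__":
    for name, value in rack_totals().items():
        print(f"{name}: {value}")
```

On those assumptions the rack works out to 216 cores, 864GB of memory, and 648TB of raw disk, or roughly 216TB of data if HDFS keeps three copies of everything.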
Ozbutun says that the main thing about Exadata database clusters, Exalogic middleware clusters, and now the Big Data Appliance, is to stop thinking about the oomph in the CPUs and start worrying about having the right mix of system I/O and network bandwidth. A lot of customers overprovision on the CPUs and underprovision on the networking and I/O. And that means they are spending a lot of money on hardware that will not help Hadoop run well. Oracle has spent months configuring and tuning this system and making sure nothing is in the rack that isn't needed because – like other high performance computing customers – Hadoop shops are stingy.
This Big Data Appliance includes a lifetime OEM license to CDH3 and licenses to the core Oracle software, and costs $450,000 per rack. That is considerably less costly than a rack of fully configured Exadata iron – which includes database servers, Exadata storage arrays, and the Oracle 11g R2 database and Real Application Clusters extensions – which will cost you $4.47m at list price on an Exadata X2-2 with 96 cores. The hardware accounts for $1.1m of that at list price, just for comparison.
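To put those list prices side by side, here is a rough sketch of the arithmetic. The dollar figures are the ones quoted above; the function name `price_comparison` is ours, and since Oracle has not said how much of the $450,000 is hardware, the ratios compare the whole Big Data Appliance price against the Exadata figures.

```python
# List prices quoted in the article, in US dollars.
BDA_RACK_PRICE = 450_000           # Big Data Appliance, hardware plus licenses
EXADATA_LIST_PRICE = 4_470_000     # Exadata X2-2 rack, fully configured
EXADATA_HARDWARE_PRICE = 1_100_000  # hardware portion of the Exadata figure


def price_comparison():
    """Compare the appliance's rack price with the Exadata list prices."""
    return {
        # Whatever is left of the Exadata price after the hardware.
        "exadata_software": EXADATA_LIST_PRICE - EXADATA_HARDWARE_PRICE,
        "exadata_vs_bda": EXADATA_LIST_PRICE / BDA_RACK_PRICE,
        "exadata_hw_vs_bda": EXADATA_HARDWARE_PRICE / BDA_RACK_PRICE,
    }


if __name__ == "__main__":
    result = price_comparison()
    print(f"Exadata software portion: ${result['exadata_software']:,}")
    print(f"Exadata rack vs appliance: {result['exadata_vs_bda']:.1f}x")
    print(f"Exadata hardware vs appliance: {result['exadata_hw_vs_bda']:.1f}x")
```

By this reckoning a loaded Exadata rack lists at roughly ten times the appliance, and even the Exadata hardware alone comes to about 2.4 times the price of the whole Big Data Appliance.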
In addition to the Big Data Appliance, Oracle is also rolling out a set of connectors to link Oracle databases to Hadoop data stores. These were hinted at back in the slide presentations by Oracle in October, but were not discussed in any great detail.
The first is Oracle Loader for Hadoop, which moves data from Hadoop data stores into Oracle 11g R2 databases. There are a lot of ways to skin this elephant, but Ozbutun says that this one is unique in that the connector is engineered so the bulk of the data partitioning, conversion, and other work is done by the Hadoop cluster rather than the machine (or machines) running the Oracle databases.
The second connector is Oracle Data Integrator for Hadoop, a twist on the existing Data Integrator tool that can automatically generate MapReduce code to chew on data and bring data sets into view of Oracle databases.
The third connector is called Direct Connector for HDFS, and this essentially lets a section of the HDFS file system holding mapped and reduced data be viewed as an Oracle database table. And finally, the fourth connector is called R Connector for Hadoop. With this, Oracle has indeed taken the open source R statistical analysis package and added optimized math libraries to link it to the various data stores in the Big Data Appliance stack. This connector is not based on the Hadoop-friendly R tools from Revolution Analytics.
The bundle of four Hadoop data connectors from Oracle costs $2,000 per server processor. ®
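For anyone pricing up a full rack, the per-processor connector fee can be multiplied out as follows. This is a sketch under two loud assumptions that Oracle has not confirmed here: that "server processor" means a socket rather than a core, and that every one of the 18 two-socket nodes in a Big Data Appliance rack is licensed. The function name `connector_cost` is ours.

```python
# Figures quoted in the article.
CONNECTOR_PRICE_PER_PROCESSOR = 2_000  # US dollars, bundle of four connectors
BDA_RACK_PRICE = 450_000               # US dollars per rack
NODES_PER_RACK = 18
SOCKETS_PER_NODE = 2


def connector_cost(nodes=NODES_PER_RACK, sockets_per_node=SOCKETS_PER_NODE):
    """Connector bundle cost, ASSUMING one license per socket on every node."""
    return nodes * sockets_per_node * CONNECTOR_PRICE_PER_PROCESSOR


if __name__ == "__main__":
    connectors = connector_cost()
    print(f"Connectors for a full rack: ${connectors:,}")
    print(f"Rack plus connectors: ${BDA_RACK_PRICE + connectors:,}")
```

If both assumptions hold, the connectors add $72,000 to a rack, for $522,000 all in at list price.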