Oracle mounts Cloudera's elephant for big data ride

Cloud biz chosen for number-stomping Hadoop stack


When Oracle co-founder and CEO Larry Ellison clambered onto his own Big Data elephant back in October as his company announced the Big Data Appliance, Oracle gave the impression that it would be rolling up its own implementation of the open-source Apache Hadoop data muncher. That turns out not to be true.

Oracle revealed today, as it began shipping its Hadoop stack (which is named after a stuffed elephant), that it has in fact inked an OEM agreement with Cloudera – the first and thus far the biggest of the commercial Hadoop disties – to supply the Big Data Appliance's mapper and reducer. Specifically, the Big Data Appliance will include the CDH3 version of the Hadoop stacked up by Cloudera as well as its Cloudera Manager 3.7 add-on, a control freak that debuted back in December 2011.

Because Oracle is using CDH3 as its core Hadoop, customers will not be limited to using Oracle's own NoSQL Database key-value store, which is based on its acquired BerkeleyDB. This had previously been the only data store that Oracle talked about when the Big Data Appliance was announced last fall at Oracle's OpenWorld extravaganza. Cetin Ozbutun, vice president of data warehousing technologies, tells El Reg that Big Data Appliance customers will be able to store data in the Hadoop Distributed File System (HDFS) that is part of the Apache Hadoop stack if they don't want to run Oracle NoSQL, which runs on top of Oracle's Enterprise Linux, its RHEL-ish re-implementation. And because CDH3 also includes HBase – the column-oriented add-on for HDFS that is modeled on Google's BigTable data store – Oracle's customers will also be able to use HBase if their applications run better on it than on raw HDFS or Oracle NoSQL Database. The Big Data Appliance uses the Community edition of Oracle's NoSQL data store and its HotSpot Java virtual machine running atop Oracle Enterprise Linux.

Oracle could have grabbed the Apache Hadoop code and rolled its own distro, much like it has grabbed Red Hat's Enterprise Linux and forked it a bit to tune it up for its own hardware and software stack while remaining compatible with RHEL. Ozbutun said that Oracle had evaluated this alternative as well as looking at other Hadoop distros from Hortonworks and MapR. Hortonworks is the spinout of the Yahoo! team that was supporting Hadoop inside of the search engine giant, and MapR is the Hadoop commercializer that OEMs its software to EMC's Greenplum data appliance division.

"We did consider a lot of different options, but we thought it best to partner with Cloudera," explained Ozbutun. "Cloudera is obviously the leader in this area, and we have expertise in other areas that are complementary."

Of course, Oracle used to partner with Red Hat, then it sold a clone of Red Hat support, and then it decided to do its own distribution and control all the money and the code. The odds are that over the long haul Oracle will buy Cloudera, buy another distro, or roll its own. Hadoop is just too important for Oracle to not own a piece of it: it's that simple.

This isn't a bedroom mashup project

The important thing about the Big Data Appliance is that it is an engineered system, not just a mashup of Oracle and Cloudera software. Ozbutun says that Oracle has spent months fine-tuning the configuration of the underlying hardware that runs the data munching algorithms, their various data stores, and the plug-ins that let Oracle databases and Hadoop data stores share information.

The Big Data Appliance is a rack of 18 Sun Fire x86 server nodes. The nodes each have two six-core Xeon X5675 processors running at 3.06GHz. The server tops out at 144GB of memory, but this particular configuration has 48GB of main memory (or 4GB per core). The server has a disk controller with 512MB of battery-backed cache memory and a dozen 3TB 7.2K RPM SAS disks (3.5-inchers, and one per core). The servers have two 40Gb/sec InfiniBand ports and four Gigabit Ethernet ports. The rack has two InfiniBand switches that each have 32 QDR InfiniBand ports and eight 10GE ports, plus another plain vanilla 36-port QDR InfiniBand switch. (Oracle is a minority stakeholder in chip- and switch-maker Mellanox Technologies and uses Mellanox's silicon in its own switches.) There's another Ethernet switch in the rack to provide a separate management network for the Hadoop nodes.
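For anyone who wants the rack-level totals, the per-node figures above multiply out as in the short Python sketch below. It uses only the numbers quoted in this story; the constant and function names are made up for illustration, and the disk figure is raw capacity before any filesystem or replication overhead.

```python
# Per-node figures as quoted in the article; names are illustrative only.
NODES = 18
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 6
MEMORY_GB_PER_NODE = 48
DISKS_PER_NODE = 12
DISK_TB = 3


def rack_totals():
    """Multiply the per-node spec out to whole-rack totals."""
    cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
    return {
        "cores": NODES * cores_per_node,
        "memory_gb": NODES * MEMORY_GB_PER_NODE,
        "disks": NODES * DISKS_PER_NODE,
        # Raw capacity only: no filesystem or replication overhead applied.
        "raw_disk_tb": NODES * DISKS_PER_NODE * DISK_TB,
        "memory_gb_per_core": MEMORY_GB_PER_NODE / cores_per_node,
        "disks_per_core": DISKS_PER_NODE / cores_per_node,
    }


if __name__ == "__main__":
    for name, value in rack_totals().items():
        print(f"{name}: {value}")
```

That works out to 216 cores, 864GB of memory, and 648TB of raw disk per rack, with the 4GB and one disk per core noted above.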

Ozbutun says that the main thing about Exadata database clusters, Exalogic middleware clusters, and now the Big Data Appliance, is to stop thinking about the oomph in the CPUs and start worrying about having the right mix of system I/O and network bandwidth. A lot of customers overprovision on the CPUs and underprovision on the networking and I/O. And that means they are spending a lot of money on hardware that will not help Hadoop run well. Oracle has spent months configuring and tuning this system and making sure nothing is in the rack that isn't needed because – like other high performance computing customers – Hadoop shops are stingy.

The Big Data Appliance includes a lifetime OEM license to CDH3 and licenses to the core Oracle software, and costs $450,000 per rack. That is considerably less costly than a rack of fully configured Exadata iron, which includes database servers, Exadata storage arrays, and the Oracle 11g R2 database and Real Application Clusters extensions, and which will cost you $4.47m at list price on an Exadata X2-2 with 96 cores. The hardware cost within this is $1.1m at list price, just for comparison.
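To put those list prices side by side, here is a rough Python sketch of the ratios. The prices are the ones quoted above, the raw capacity is simply 18 nodes times a dozen 3TB disks, and the names are invented for this example. It is back-of-the-envelope arithmetic, not an Oracle pricing model, and it ignores discounts and support costs.

```python
# List prices quoted in the article, in US dollars; names are illustrative.
BDA_RACK_PRICE = 450_000
EXADATA_LIST_PRICE = 4_470_000      # Exadata X2-2, hardware plus software
EXADATA_HARDWARE_PRICE = 1_100_000  # hardware portion of the Exadata price
BDA_RAW_TB = 18 * 12 * 3            # nodes x disks per node x TB per disk


def price_comparison():
    """Return simple ratios between the quoted list prices."""
    return {
        "exadata_to_bda": EXADATA_LIST_PRICE / BDA_RACK_PRICE,
        "exadata_hw_to_bda": EXADATA_HARDWARE_PRICE / BDA_RACK_PRICE,
        "bda_dollars_per_raw_tb": BDA_RACK_PRICE / BDA_RAW_TB,
    }


if __name__ == "__main__":
    for name, value in price_comparison().items():
        print(f"{name}: {value:.2f}")
```

On these numbers the full Exadata rack lists at roughly ten times the Big Data Appliance, the Exadata hardware alone at a bit under two and a half times, and the appliance comes to a little under $700 per raw terabyte.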

Hooking up

In addition to the Big Data Appliance, Oracle is also rolling out a set of connectors to link Oracle databases to Hadoop data stores. These were hinted at back in the slide presentations by Oracle in October, but were not discussed in any great detail.

The first is Oracle Loader for Hadoop, which moves data from Hadoop data stores into Oracle 11g R2 databases. There are a lot of ways to skin this elephant, but Ozbutun says that this one is unique in that the connector is engineered so the bulk of the data partitioning, conversion, and other work is done by the Hadoop cluster rather than the machine (or machines) running the Oracle databases.

The second connector is Oracle Data Integrator for Hadoop, a twist on the existing Data Integrator tool that can automatically generate MapReduce code to chew on data and bring data sets into view of Oracle databases.

The third connector is called Direct Connector for HDFS, and this essentially lets a section of the HDFS file system holding mapped and reduced data be viewed as an Oracle database table. And finally, the fourth connector is called R Connector for Hadoop. With this, Oracle has taken the open source R statistical analysis package and added optimized math libraries to link it to the various data stores in the Big Data Appliance stack. This connector is not based on the Hadoop-friendly R tools from Revolution Analytics.

The bundle of four Hadoop data connectors from Oracle costs $2,000 per server processor. ®

