Oracle mounts Cloudera's elephant for big data ride

Cloud biz chosen for number-stomping Hadoop stack


When Oracle co-founder and CEO Larry Ellison clambered onto his own Big Data elephant back in October as his company announced the Big Data Appliance, Oracle gave the impression that it would be rolling up its own implementation of the open-source Apache Hadoop data muncher. This turns out not to be true.

In fact, Oracle revealed today as it began shipping its Hadoop stack (which is named after a stuffed elephant) that it has inked an OEM agreement with Cloudera – the first and thus far the biggest of the commercial Hadoop disties – as the Big Data Appliance's mapper and reducer. Specifically, the Big Data Appliance will include the CDH3 version of the Hadoop stacked up by Cloudera as well as its Cloudera Manager 3.7 add-on, a control freak that debuted back in December 2011.

Because Oracle is using CDH3 as its core Hadoop, customers will not be limited to using Oracle's own NoSQL Database key-value store, which is based on its acquired BerkeleyDB. This had previously been the only data store that Oracle talked about when the Big Data Appliance was announced last fall at Oracle's OpenWorld extravaganza. Cetin Ozbutun, vice president of data warehousing technologies, tells El Reg that Big Data Appliance customers will be able to store data in the Hadoop Distributed File System (HDFS) that is part of the Apache Hadoop stack if they don't want to run Oracle NoSQL, which runs on top of Oracle Enterprise Linux, the company's RHEL-ish re-implementation.

And because CDH3 also includes HBase – the column-oriented add-on for HDFS that is modeled on Google's BigTable data store – Oracle's customers will be able to use HBase if their applications run better on it than on raw HDFS or the Oracle NoSQL Database. The Big Data Appliance uses the Community edition of Oracle's NoSQL data store and its HotSpot Java virtual machine, running atop Oracle Enterprise Linux.

Oracle could have grabbed the Apache Hadoop code and rolled its own distro, much like it has grabbed Red Hat's Enterprise Linux and forked it a bit to tune it up for its own hardware and software stack while remaining compatible with RHEL. Ozbutun said that Oracle had evaluated this alternative as well as looking at other Hadoop distros from Hortonworks and MapR. Hortonworks is the spinout of the Yahoo! team that was supporting Hadoop inside of the search engine giant, and MapR is the Hadoop commercializer that OEMs its software to EMC's Greenplum data appliance division.

"We did consider a lot of different options, but we thought it best to partner with Cloudera," explained Ozbutun. "Cloudera is obviously the leader in this area, and we have expertise in other areas that are complementary."

Of course, Oracle used to partner with Red Hat, then it sold a clone of Red Hat support, and then it decided to do its own distribution and control all the money and the code. The odds are that, over the long haul, Oracle will buy Cloudera, buy another distro, or roll its own. Hadoop is just too important for Oracle to not own a piece of it: it's that simple.

This isn't a bedroom mashup project

The important thing about the Big Data Appliance is that it is an engineered system, not just a mashup of Oracle and Cloudera software. Ozbutun says that Oracle has spent months fine-tuning the configuration of the underlying hardware that runs the data munching algorithms, their various data stores, and the plug-ins that let Oracle databases and Hadoop data stores share information.

The Big Data Appliance is a rack of 18 Sun Fire x86 server nodes. The nodes each have two six-core Xeon X5675 processors running at 3.06GHz. The server tops out at 144GB of memory, but this particular configuration has 48GB of main memory (or 4GB per core). The server has a disk controller with 512MB of battery-backed cache memory and a dozen 3TB 7.2K RPM SAS disks (3.5-inchers, and one per core). The servers have two 40Gb/sec InfiniBand ports and four Gigabit Ethernet ports. The rack has two InfiniBand switches that have 32 QDR InfiniBand ports and eight 10GE ports, and another plain vanilla 36-port QDR InfiniBand switch. (Oracle is a minority stakeholder in chip- and switch-maker Mellanox Technologies and uses Mellanox's silicon in its own switches.) There's another Ethernet switch in the rack to provide a separate management network for the Hadoop nodes.
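Scaled up across the rack, those per-node numbers add up quickly. Here's a back-of-the-envelope tally in Python, using only the figures quoted above:

```python
# Aggregate capacity of one Big Data Appliance rack, computed from the
# per-node figures in the spec above.
NODES = 18              # Sun Fire x86 server nodes per rack
CORES_PER_NODE = 12     # two six-core Xeon X5675s
RAM_GB_PER_NODE = 48    # 4GB per core in this configuration
DISKS_PER_NODE = 12     # 3TB 7.2K RPM SAS drives, one per core
DISK_TB = 3

total_cores = NODES * CORES_PER_NODE               # 216 cores
total_ram_gb = NODES * RAM_GB_PER_NODE             # 864GB of memory
total_disk_tb = NODES * DISKS_PER_NODE * DISK_TB   # 648TB of raw disk

print(total_cores, total_ram_gb, total_disk_tb)
```

That works out to 216 cores, 864GB of memory, and 648TB of raw disk per rack, before any HDFS replication overhead is taken into account.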

Ozbutun says that the main thing about Exadata database clusters, Exalogic middleware clusters, and now the Big Data Appliance, is to stop thinking about the oomph in the CPUs and start worrying about having the right mix of system I/O and network bandwidth. A lot of customers overprovision on the CPUs and underprovision on the networking and I/O. And that means they are spending a lot of money on hardware that will not help Hadoop run well. Oracle has spent months configuring and tuning this system and making sure nothing is in the rack that isn't needed because – like other high performance computing customers – Hadoop shops are stingy.

The Big Data Appliance includes a lifetime OEM license to CDH3 plus licenses to the core Oracle software, and costs $450,000 per rack. That is considerably less costly than a rack of fully configured Exadata iron – which includes database servers, Exadata storage arrays, and the Oracle 11g R2 database with the Real Application Clusters extensions. A fully loaded Exadata X2-2 with 96 cores will run you $4.47m at list price, of which the hardware accounts for $1.1m, just for comparison.

Hooking up

In addition to the Big Data Appliance, Oracle is also rolling out a set of connectors to link Oracle databases to Hadoop data stores. These were hinted at back in the slide presentations by Oracle in October, but were not discussed in any great detail.

The first is Oracle Loader for Hadoop, which moves data from Hadoop data stores into Oracle 11g R2 databases. There are a lot of ways to skin this elephant, but Ozbutun says that this connector is unique in that it is engineered so the bulk of the data partitioning, conversion, and other work is done by the Hadoop cluster rather than the machine (or machines) running the Oracle databases.
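The idea of pushing that work out onto the cluster can be sketched in a few lines of Python. This is an illustration of the technique, not Oracle's actual code, and the comma-separated record format here is made up for the example:

```python
# Sketch of the Loader-for-Hadoop idea: do the partitioning and type
# conversion out on the Hadoop cluster, so the database host only has
# to bulk-load files that already match its partitions.

def map_phase(raw_lines, num_db_partitions):
    """Parse raw records and bucket them by target database partition."""
    buckets = {p: [] for p in range(num_db_partitions)}
    for line in raw_lines:
        key, value = line.split(",", 1)
        converted = (key, int(value))         # type conversion on the cluster
        part = hash(key) % num_db_partitions  # partitioning on the cluster
        buckets[part].append(converted)
    return buckets

rows = ["alpha,1", "beta,2", "alpha,3"]
buckets = map_phase(rows, num_db_partitions=4)
print({p: b for p, b in buckets.items() if b})
```

Every record lands in exactly one bucket, and all records with the same key land in the same bucket, so the database side is left with a straight bulk load per partition.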

The second connector is Oracle Data Integrator for Hadoop, a twist on the existing Data Integrator tool that can automatically generate MapReduce code to chew on data and bring data sets into view of Oracle databases.
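The shape of the jobs a code generator like this emits is the standard MapReduce pattern: map emits key-value pairs, the framework groups them by key, and reduce folds each group. A toy version in plain Python (the sales data and region keys are invented for the example):

```python
# A toy map/reduce aggregation, illustrating the sort of job a code
# generator emits: map yields (key, value) pairs, the framework sorts
# and groups by key, and reduce folds each group into one result.
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    region, amount = record
    yield (region, amount)

def reduce_fn(key, values):
    return (key, sum(values))

def run_job(records):
    intermediate = sorted(
        (pair for rec in records for pair in map_fn(rec)),
        key=itemgetter(0),
    )
    return [
        reduce_fn(key, [v for _, v in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]

sales = [("emea", 10), ("apac", 5), ("emea", 7)]
print(run_job(sales))  # [('apac', 5), ('emea', 17)]
```

The point of a tool like Data Integrator is that nobody has to hand-write this boilerplate: the transformation is declared once and the map and reduce functions are generated.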

The third connector is called Direct Connection for HDFS, which essentially allows a section of the HDFS file system holding mapped and reduced data to be viewed as an Oracle database table. And finally, the fourth connector is called R Connector for Hadoop. With this, Oracle has taken the open source R statistical analysis package and added optimized math libraries to link it to the various data stores in the Big Data Appliance stack. This connector is not based on the Hadoop-friendly R tools from Revolution Analytics.
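The external-table trick behind the HDFS connector is conceptually simple: reduced output in HDFS usually sits in tab-delimited "part" files, and the connector presents those files as rows at query time instead of copying them into the database. A minimal sketch of that idea – the file name and column layout are assumptions for the example, not the connector's actual API:

```python
# Sketch of the external-table idea: treat tab-delimited MapReduce
# output files as table rows at read time, without loading them.
import io

def rows_from_part_file(fileobj, columns):
    """Yield each tab-delimited line as a dict, like a table row."""
    for line in fileobj:
        values = line.rstrip("\n").split("\t")
        yield dict(zip(columns, values))

# Stand-in for an HDFS part file such as part-00000.
part_00000 = io.StringIO("emea\t17\napac\t5\n")
table = list(rows_from_part_file(part_00000, columns=("region", "total")))
print(table)
```

The data stays where MapReduce left it; only the schema lives on the database side.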

The bundle of four Hadoop data connectors from Oracle costs $2,000 per server processor. ®

