Oracle: Quit messin' and marry Hadoop!
Why Larry should pop the question
Open...and Shut Oracle isn't the biggest enterprise software vendor, but in 2010 it grew faster than its big-enterprise peers, including Microsoft and IBM, to claim third place. Ever ambitious, Oracle chief executive Larry Ellison is unlikely to be content with the bronze. But it's equally unlikely that relational databases alone will be enough to power Oracle to the top of the enterprise heap.
Oracle needs Apache Hadoop, but risks missing its chance unless it moves quickly.
Hadoop, after all, is becoming the new Linux, with a plethora of companies, big and small, contributing to the Apache Software Foundation-led project and leveraging it in a bevy of new products. Yahoo!, which originally incubated the project, is contemplating spinning off a startup focused on Hadoop. More immediately, this week should see EMC release a Hadoop appliance and software distribution, according to The Wall Street Journal. The two would join IBM, eBay, Amazon, Facebook, and others already using Hadoop.
Oracle, to date, has been missing in action, preferring to push customers to its Exadata appliance in an attempt to leverage the hardware and software assets it acquired from Sun Microsystems.
Good luck with that.
The fact is that while the Visas of the world still see plenty of reason to use Oracle's relational databases, they also can't live without Hadoop and other NoSQL technologies. But among the NoSQL crowd, Hadoop is king, and more and more companies are finding ways to mix Hadoop with their Oracle assets, often in ways that cut Oracle database licensing costs by putting more data into Hadoop.
There's an easy way for Oracle to mitigate the Hadoop threat while simultaneously blessing its customers: buy into Hadoop, either by acquiring an existing Hadoop player or by developing its own Hadoop distribution.
The latter approach is crowded and unlikely to succeed. But the former – an acquisition – fits Oracle's preferred model, and conveniently could be accomplished by buying back one of its former employees: Mike Olson, CEO of Cloudera, the frontrunner among Hadoop companies. Olson was CEO of Sleepycat, the open source database vendor Oracle bought in 2006. Presumably Olson's desk is still vacant and ready for him.
An Oracle acquisition of Cloudera makes sense not only from this personal/personnel perspective, but also because Cloudera has already been working to integrate Hadoop with Oracle databases, through the "Ora-Oop" connector released in 2010. This connector makes it possible for Oracle customers to tap into Hadoop. It also, incidentally, makes it easier for such customers to leave the database giant.
Hadoop tends to be very developer-focused and, as such, not easy for an average DBA to pick up and use – although Cloudera has been working hard to make Hadoop more DBA-friendly. The easier Hadoop becomes, the more likely that DBAs will find Ora-Oop convenient to move data out of expensive Oracle and into open source Hadoop.
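To make the data-offload scenario above concrete, here is a sketch of the kind of Sqoop job an Oracle DBA might run to copy a table into Hadoop. The hostname, credentials, table, and paths are hypothetical; the Ora-Oop connector, when installed, plugs into Sqoop's `--direct` mode for faster, lower-impact reads from Oracle.

```shell
# Pull an Oracle table into HDFS with Sqoop. With the Ora-Oop
# connector installed, --direct routes the import through it
# rather than plain JDBC. All connection details below are
# illustrative placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//oradb.example.com:1521/ORCL \
  --username SCOTT \
  -P \
  --table ORDERS \
  --target-dir /warehouse/orders \
  --num-mappers 8 \
  --direct
```

Once the rows land in HDFS, they can be crunched with MapReduce or Hive on commodity hardware – exactly the data that no longer needs to sit under an Oracle license.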
Oracle is familiar with this kind of threat, having dealt with it before in MySQL. Oracle bought Sun and its MySQL assets, muting MySQL as a threat even while continuing to invest in the open source database. The same could be true of Hadoop.
Left to fester, Hadoop will be viewed as an alternative to Oracle's relational database. But brought into Oracle, Hadoop can be seen for what it should be: an excellent complement to Oracle databases.
As suggested, the easiest way to accomplish this is by buying Cloudera. But with so much at stake in Hadoop, Oracle needs to act fast, because there's no shortage of big players circling Hadoop to take it mainstream in enterprise computing. Cloudera will be on their shopping lists, too. ®
Matt Asay is senior vice president of business development at Strobe, a startup that offers an open source framework for building mobile apps. He was formerly chief operating officer of Ubuntu commercial operation Canonical. With more than a decade spent in open source, Asay served as Alfresco's general manager for the Americas and vice president of business development, and he helped put Novell on its open source track. Asay is an emeritus board member of the Open Source Initiative (OSI). His column, Open...and Shut, appears twice a week on The Register.
Diet modification needed
Stop drinking bleach, Matt. It's not good for you.
I don't think the author of this article should be allowed to write about Apache Hadoop; it's painful to read. I hope nobody actually believes a word this person says.
1. The only official release of Apache Hadoop comes from the Apache Software Foundation; the latest of the 0.20 releases, 0.20.203, came out yesterday with lots of bug fixes from Yahoo! and Cloudera in it.
2. Any other so-called "distribution" of Hadoop is not "a distribution" unless it is just the Apache release packaged for easy installation (as Thomas Koch does for Debian); otherwise it is a derivative work, containing code that is not in the Apache release.
3. Such derivative works can be open source (Cloudera) or closed source (EMC, IBM).
4. Any closed source derivative work forces the distributor to maintain their branch indefinitely.
5. Any derivative work forces the developer to test at the same scale as Y! and Facebook (thousands of machines, tens of PB of storage), or they cannot claim that it scales up.
6. Any closed source derivative work will only support bug fixes and patches at a rate determined by the closed source developer team, and provided at a cost determined by the price of that developer team.
7. Apache only provide support for the official apache release. If you use Cloudera or EMC: go talk to them about problems.
8. People who are not part of the Apache developer and user community do not get their needs addressed in the Apache releases, because we are unaware of them.
9. We, the Apache developer team, have no need to take on random patches from developers of closed source derivative works unless we can see tangible benefits.
10. Finally, any derivative work that pulls out large amounts of the Hadoop codebase (e.g. Brisk, EMC Enterprise HD) cannot call itself a version of Hadoop. It is not. We, the Apache community, define the interfaces and what "100% compatible" means. When someone like EMC declares their derivative work "certified 100% compatible", that is a meaningless statement. Only the official Apache Hadoop release is, implicitly, 100% compatible with Apache Hadoop.
11. We reserve the right to change the semantics and interfaces to meet the community needs, on the schedule that suits the development community.
12. The rules of using the term "Hadoop" are defined in the Apache license, and it is not legal to say "a distribution of Hadoop" if it is in fact a derivative work. This is why Cloudera call their software "Cloudera’s Distribution including Apache Hadoop". EMC, Brisk and others are sailing close to the wind here.
13. The fact that Oracle are now subpoenaing Apache in the Oracle/Google lawsuit means that the relationship between Oracle and Apache has reached a low point – even after Apache left the Java Community Process due to Oracle's unwillingness to meet its legal requirements to provide the Technology Compatibility Kit without imposing Field of Use restrictions.
14. Because of (#13), it's hard to see a team of Oracle developers being trusted or welcome in the Hadoop community. You can't serve subpoenas on the ASF and then say "we'd like to help develop a technology of yours that threatens our entire business model and margins". They won't be trusted.
I have a term for the EMC-style not-quite-Hadoop products that use the same interfaces but offer unknown semantics and a cost model on a par with the vendor's existing enterprise product line. It is "Enterprisey Hadoop". This is not Apache Hadoop supported in the Enterprise, it is some derivative work that pretends to be Hadoop but misses the point about affordable scalability through commodity hardware and an open source codebase.
SteveL, Apache Hadoop Committer. All comments are personal opinions only, etc.
I could be wrong, but don't Oracle already have Coherence, which, although slightly different, is still basically a Hadoop-style map cache?