Original URL: http://www.theregister.co.uk/2008/08/11/hadoop_dziuba/
Hadoop: When grownups do open source
On the emasculation of Twitter and Dirty Harry
Fail and You
Hadoop is a library for writing distributed data processing programs using the MapReduce framework. It's got all the makings of a blogosphere hit: cluster computing, large datasets, parallelism, algorithms published by Google, and open source. Every four days or so, a nerd will discover Hadoop, write a “Basic MapReduce Tutorial with Hadoop” tutorial on his blog with some trivial examples, and feel satisfied with himself for educating the world about a yet-undiscovered gem. Comparatively, very few people actually use Hadoop in practice, and those who do don't write about it. Why? Because they're adults who don't care about getting on the front page of Digg.
Hadoop was birthed to the world by Doug Cutting, the man who gave us the Lucene search library and the Nutch web crawler. It's a little known fact that Cutting only sleeps three hours per night, and not because he is tired, but rather as an act of mercy for his keyboard. Furthermore, Hadoop isn't open source because Doug wanted to help other developers solve distributed computing problems. It's open source because he thought you could learn a thing or two from his code.
Despite being a canon of Java engineering, Hadoop is actually pretty useful, if you've got a problem it can solve.
Now, that's a tricky caveat that isn't well understood by many contemporary programmers. A lot of developers (and this is most prevalent in the Web 2.0 world), walk balls-first into any given task, assuming that scalability is going to be issue number one. If you've got a big data processing problem that needs to scale, then Hadoop will probably help you out. In that case, you can read any one of the over nine thousand regurgitations of the same word count tutorial on some nerd's blog. Otherwise, you can do like the web scalability Bible beaters and completely miss the point of Hadoop, contort your code into something that resembles MapReduce, and let those throbbing nuts of yours hang out in the breeze, because shit, you just built one scalable-ass system. Sure, it runs several times slower than your old code used to, but it's parallelized, and that's really all that matters.
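For anyone who has somehow dodged every one of those tutorials, here is roughly what the model looks like. This is a minimal sketch, assuming a single machine and plain Python. None of it is Hadoop's API, and the names map_phase, shuffle, reduce_phase and word_count are made up for illustration. And yes, it's word count.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse all the values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence, all in one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["fail and you", "You fail"]))
```

Each map_phase call touches only its own line and each reduce_phase call only its own key, which is what lets a framework spread them across many machines. On one box with a small input, a plain loop does the same job with less ceremony.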
Put Me In, Coach
When it comes to open source, Hadoop is a good example of what separates the men from the boys. Being a top-level Apache project, there's significant backing behind it. With corporate sponsorship, a software project can go far. When I say “corporate sponsorship”, I'm referring to companies that actually make money, which brings us to the comedic open source attempts of Web 2.0.
Twitter, which is widely accepted as the drum major of the Web 2.0 failure parade, released an open source project called Starling in January of this year. Starling is the Ruby-based messaging system that runs Twitter's backend. Yes, Twitter, the nonprofit web service known widely for its downtime, dropped its disaster-producing shitpile on the world. Why? Maybe they thought more competent developers would fix their problems. The more likely scenario is that they wanted to get a quickie beatoff from the fake tech media to make themselves look more important. I am guessing this is why no code has been released for Starling since it was open sourced. Oops.
Twitter decided they would be cute and trendy. They wrote their code in Ruby: the official state language of the hipster-developer nation. Doug Cutting, on the other hand, decided he would get shit done, and wrote Hadoop in Java. Starling was hidden away in some corner and forgotten (it's hosted at RubyForge...what the fuck is that?). Hadoop lives prominently at the Apache Software Foundation. Starling is a re-hash of an existing Java Enterprise API called JMS that has several open source implementations. Hadoop is an implementation of Google's MapReduce, a system that publicly only existed on paper. Hadoop has the added benefit of actually working.
Perhaps this is a feast of apples and oranges. Starling is a messaging system, Hadoop is a distributed data processing system. Don't worry. There's plenty of failure to be had elsewhere in the open source world. Let's take, for example, a project called Starfish, which is a pure-Ruby implementation of MapReduce. Eh, well that's not entirely accurate. Starfish is a MapReduce-inspired framework that's simple enough for even Ruby developers to understand. That means there's no actual “reduce” phase in the MapReduce, and it works on MySQL database records. In other words, this project is virtually useless in every way, aside from getting the author a quick beatoff from the blogosphere. It's a half-baked implementation of an algorithm from Google, it's written in Ruby and it integrates with Rails. That's so warm and fuzzy it could turn Clint Eastwood gay.
But hey, taking the time to actually understand something is way harder than writing an open source Ruby implementation of it.
More Than Just Data Processing
Along with the data processing framework, Doug Cutting also included a fault tolerant, replicated, distributed file system with Hadoop just because fuck you. Called HDFS, it is inspired by Google's GFS, again something that only existed to the public in the form of an academic paper. HDFS is designed to integrate perfectly with the MapReduce framework, but you can also use it on its own. If you need to replicate large files across many machines, then HDFS has got what you need.
Why don't you see distributed filesystems coming out of places like Twitter? Because that shit is hard. It could also be that companies like Twitter are blissfully unaware of any data storage medium other than MySQL. In either case, data processing gets you more pedantic nerd-cred than data storage. Processing large amounts of data lets a nerd excuse himself for being a shut-in, because shit, he's doing important work in there. He's got to command an army of machines, all working in unison. That will show the football team who the real man is. Sure beats talking to girls.
Projects like Hadoop are few and far between. Everybody wants to work on an interesting problem like distributed data processing, but few people actually can. It's the disparity between “want” and “can” that brings us failures like Twitter.
Sometimes, your best just isn't good enough. ®
Ted Dziuba is a co-founder at Milo.com. You can read his regular Reg column, Fail and You, every other Monday.