Feeds

Hadoop: When grownups do open source

On the emasculation of Twitter and Dirty Harry

High performance access to file storage

Fail and You Hadoop is a library for writing distributed data processing programs using the MapReduce framework. It's got all the makings of a blogosphere hit: cluster computing, large datasets, parallelism, algorithms published by Google, and open source. Every four days or so, a nerd will discover Hadoop, write a “Basic MapReduce Tutorial with Hadoop” tutorial on his blog with some trivial examples, and feel satisfied with himself for educating the world about a yet-undiscovered gem. Comparatively, very few people actually use Hadoop in practice, and those who do don't write about it. Why? Because they're adults who don't care about getting on the front page of Digg.

Hadoop was birthed to the world by Doug Cutting, the man who gave us the Lucene search library and the Nutch web crawler. It's a little known fact that Cutting only sleeps three hours per night, and not because he is tired, but rather as an act of mercy for his keyboard. Furthermore, Hadoop isn't open source because Doug wanted to help other developers solve distributed computing problems. It's open source because he thought you could learn a thing or two from his code.

Despite being a canon of Java engineering, Hadoop is actually pretty useful, if you've got a problem it can solve.

Now, that's a tricky caveat that isn't well understood by many contemporary programmers. A lot of developers (and this is most prevalent in the Web 2.0 world), walk balls-first into any given task, assuming that scalability is going to be issue number one. If you've got a big data processing problem that needs to scale, then Hadoop will probably help you out. In that case, you can read any one of the over nine thousand regurgitations of the same word count tutorial on some nerd's blog. Otherwise, you can do like the web scalability Bible beaters and completely miss the point of Hadoop, contort your code into something that resembles MapReduce, and let those throbbing nuts of yours hang out in the breeze, because shit, you just built one scalable-ass system. Sure, it runs several times slower than your old code used to, but it's parallelized, and that's really all that matters.

Put Me In, Coach

When it comes to open source, Hadoop is a good example of what separates the men from the boys. Being a top-level Apache project, there's significant backing behind it. With corporate sponsorship, a software project can go far. When I say “corporate sponsorship”, I'm referring to companies that actually make money, which brings us to the comedic open source attempts of Web 2.0.

Twitter, which is widely accepted as the drum major of the Web 2.0 failure parade, released an open source project called Starling in January of this year. Starling is the Ruby-based messaging system that runs Twitter's backend. Yes, Twitter, the nonprofit web service known widely for its downtime, dropped its disaster-producing shitpile on the world. Why? Maybe they thought more competent developers would fix their problems. The more likely scenario is that they wanted to get a quickie beatoff from the fake tech media to make themselves look more important. I am guessing this is why no code has been released for Starling since it was open sourced. Oops.

Twitter decided they would be cute and trendy. They wrote their code in Ruby: the official state language of the hipster-developer nation. Doug Cutting, on the other hand, decided he would get shit done, and wrote Hadoop in Java. Starling was hidden away in some corner and forgotten (it's hosted at RubyForge...what the fuck is that?). Hadoop lives prominently at the Apache Software Foundation. Starling is a re-hash of an existing Java Enterprise API called JMS that has several open source implementations. Hadoop is an implementation of Google's MapReduce, a system that publicly only existed on paper. Hadoop has the added benefit of actually working.

Perhaps this is a feast of apples and oranges. Starling is a messaging system, Hadoop is a distributed data processing system. Don't worry. There's plenty of failure to be had elsewhere in the open source world. Let's take, for example, a project called Starfish), which is a pure-Ruby implementation of MapReduce. Eh, well that's not entirely accurate. Starfish is a MapReduce-inspired framework that's simple enough for even Ruby developers to understand. That means there's no actual “reduce” phase in the MapReduce, and it works on MySQL database records. In other words, this project is virtually useless in every way, aside from getting the author a quick beatoff from the blogosphere. It's a half-baked implementation of an algorithm from Google, it's written in Ruby and it integrates with Rails. That's so warm and fuzzy it could turn Clint Eastwood gay.

But hey, taking the time to actually understand something is way harder that writing an open source Ruby implementation of it.

More Than Just Data Processing

Along with the data processing framework, Doug Cutting also included a fault tolerant, replicated, distributed file system with Hadoop just because fuck you. Called HDFS, it is inspired by Google's GFS, again something that only existed to the public in the form of an academic paper. HDFS is designed to integrate perfectly with the MapReduce framework, but you can also use it on its own. If you need to replicate large files across many machines, then HDFS has got what you need.

Why don't you see distributed filesystems coming out of places like Twitter? Because that shit is hard. It could also be that companies like Twitter are blissfully unaware of any data storage medium other than MySQL. In either case, data processing gets you more pedantic nerd-cred than data storage. Processing large amounts of data lets a nerd excuse himself for being a shut-in, because shit, he's doing important work in there. He's got to command an army of machines, all working in unison. That will show the football team who the real man is. Sure beats talking to girls.

Projects like Hadoop are few and far between. Everybody wants to work on an interesting problem like distributed data processing, but few people actually can. It's the disparity between “want” and “can” that brings us failures like Twitter.

Sometimes, your best just isn't good enough. ®

Ted Dziuba is a co-founder at Milo.com You can read his regular Reg column, Fail and You, every other Monday.

High performance access to file storage

More from The Register

next story
Seagate brings out 6TB HDD, did not need NO STEENKIN' SHINGLES
Or helium filling either, according to reports
European Court of Justice rips up Data Retention Directive
Rules 'interfering' measure to be 'invalid'
Dropbox defends fantastically badly timed Condoleezza Rice appointment
'Nothing is going to change with Dr. Rice's appointment,' file sharer promises
Cisco reps flog Whiptail's Invicta arrays against EMC and Pure
Storage reseller report reveals who's selling what
Just what could be inside Dropbox's new 'Home For Life'?
Biz apps, messaging, photos, email, more storage – sorry, did you think there would be cake?
IT bods: How long does it take YOU to train up on new tech?
I'll leave my arrays to do the hard work, if you don't mind
Amazon reveals its Google-killing 'R3' server instances
A mega-memory instance that never forgets
USA opposes 'Schengen cloud' Eurocentric routing plan
All routes should transit America, apparently
prev story

Whitepapers

Mainstay ROI - Does application security pay?
In this whitepaper learn how you and your enterprise might benefit from better software security.
Five 3D headsets to be won!
We were so impressed by the Durovis Dive headset we’ve asked the company to give some away to Reg readers.
3 Big data security analytics techniques
Applying these Big Data security analytics techniques can help you make your business safer by detecting attacks early, before significant damage is done.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Mobile application security study
Download this report to see the alarming realities regarding the sheer number of applications vulnerable to attack, as well as the most common and easily addressable vulnerability errors.