
Hadoop: When grownups do open source

On the emasculation of Twitter and Dirty Harry


Hadoop is a library for writing distributed data processing programs using the MapReduce framework. It's got all the makings of a blogosphere hit: cluster computing, large datasets, parallelism, algorithms published by Google, and open source. Every four days or so, a nerd will discover Hadoop, post a “Basic MapReduce Tutorial with Hadoop” on his blog with some trivial examples, and feel satisfied with himself for educating the world about a yet-undiscovered gem. Comparatively, very few people actually use Hadoop in practice, and those who do don't write about it. Why? Because they're adults who don't care about getting on the front page of Digg.

Hadoop was birthed to the world by Doug Cutting, the man who gave us the Lucene search library and the Nutch web crawler. It's a little-known fact that Cutting sleeps only three hours per night, not because he is tired, but as an act of mercy for his keyboard. Furthermore, Hadoop isn't open source because Doug wanted to help other developers solve distributed computing problems. It's open source because he thought you could learn a thing or two from his code.

Despite being a paragon of Java engineering, Hadoop is actually pretty useful, if you've got a problem it can solve.

Now, that's a tricky caveat that isn't well understood by many contemporary programmers. A lot of developers (and this is most prevalent in the Web 2.0 world) walk balls-first into any given task, assuming that scalability is going to be issue number one. If you've got a big data processing problem that needs to scale, then Hadoop will probably help you out. In that case, you can read any one of the over nine thousand regurgitations of the same word count tutorial on some nerd's blog. Otherwise, you can do like the web scalability Bible beaters and completely miss the point of Hadoop, contort your code into something that resembles MapReduce, and let those throbbing nuts of yours hang out in the breeze, because shit, you just built one scalable-ass system. Sure, it runs several times slower than your old code used to, but it's parallelized, and that's really all that matters.
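
Since we're on the subject, here's the gist of that famous word count job, so you don't have to go digging for regurgitation number nine thousand and one. This is a minimal sketch against Hadoop's newer org.apache.hadoop.mapreduce API (the early releases used the older mapred package, so treat it as illustrative, not gospel): the map phase tokenizes each line and emits (word, 1) pairs, and the reduce phase sums the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Package it into a jar and run it with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>, with the input sitting in HDFS. Congratulations, you are now qualified to write a blog post about it.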

Put Me In, Coach

When it comes to open source, Hadoop is a good example of what separates the men from the boys. As a top-level Apache project, it has significant backing behind it. With corporate sponsorship, a software project can go far. When I say “corporate sponsorship”, I'm referring to companies that actually make money, which brings us to the comedic open source attempts of Web 2.0.

Twitter, which is widely accepted as the drum major of the Web 2.0 failure parade, released an open source project called Starling in January of this year. Starling is the Ruby-based messaging system that runs Twitter's backend. Yes, Twitter, the nonprofit web service known widely for its downtime, dropped its disaster-producing shitpile on the world. Why? Maybe they thought more competent developers would fix their problems. The more likely scenario is that they wanted a quickie beatoff from the fake tech media to make themselves look more important, which would explain why no code has been released for Starling since it was open sourced. Oops.

Twitter decided they would be cute and trendy. They wrote their code in Ruby: the official state language of the hipster-developer nation. Doug Cutting, on the other hand, decided he would get shit done, and wrote Hadoop in Java. Starling was hidden away in some corner and forgotten (it's hosted at RubyForge...what the fuck is that?). Hadoop lives prominently at the Apache Software Foundation. Starling is a re-hash of an existing Java Enterprise API called JMS that has several open source implementations. Hadoop is an implementation of Google's MapReduce, a system that publicly only existed on paper. Hadoop has the added benefit of actually working.
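
For the curious, this is the wheel Twitter reinvented. A minimal JMS sketch, here using Apache ActiveMQ as the open source implementation (the broker URL and queue name are made up for illustration): open a connection and a session, push a message onto a queue, pop it back off.

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class QueueDemo {
  public static void main(String[] args) throws Exception {
    // Connect to a broker; ActiveMQ is one of several open source JMS implementations.
    ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
    Connection conn = factory.createConnection();
    conn.start();

    Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
    Queue queue = session.createQueue("tweets"); // queue name is illustrative

    // Producer: enqueue a message.
    MessageProducer producer = session.createProducer(queue);
    producer.send(session.createTextMessage("is eating a sandwich"));

    // Consumer: dequeue it again.
    MessageConsumer consumer = session.createConsumer(queue);
    TextMessage msg = (TextMessage) consumer.receive(1000); // wait up to one second
    System.out.println(msg.getText());

    conn.close();
  }
}
```

That's it. That's the product.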

Perhaps this is a feast of apples and oranges; Starling is a messaging system, Hadoop is a distributed data processing system. Don't worry. There's plenty of failure to be had elsewhere in the open source world. Let's take, for example, a project called Starfish, which is a pure-Ruby implementation of MapReduce. Eh, well, that's not entirely accurate. Starfish is a MapReduce-inspired framework that's simple enough for even Ruby developers to understand. That means there's no actual “reduce” phase in its MapReduce, and it works on MySQL database records. In other words, this project is virtually useless in every way, aside from getting the author a quick beatoff from the blogosphere. It's a half-baked implementation of an algorithm from Google, it's written in Ruby, and it integrates with Rails. That's so warm and fuzzy it could turn Clint Eastwood gay.

But hey, taking the time to actually understand something is way harder than writing an open source Ruby implementation of it.

More Than Just Data Processing

Along with the data processing framework, Doug Cutting also included a fault-tolerant, replicated, distributed file system with Hadoop, just because fuck you. Called HDFS, it is inspired by Google's GFS, again something that existed publicly only in the form of an academic paper. HDFS is designed to integrate perfectly with the MapReduce framework, but you can also use it on its own. If you need to replicate large files across many machines, then HDFS has got what you need.
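
Using HDFS on its own is about as hard as using java.io. Here's a hedged sketch with the org.apache.hadoop.fs.FileSystem client API (the namenode address and file paths are invented for illustration): write a file with a replication factor of three, and the namenode and datanodes handle spreading the blocks around for you.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the namenode (older releases call this key fs.default.name).
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");
    FileSystem fs = FileSystem.get(conf);

    // Write a file replicated across three datanodes.
    Path path = new Path("/logs/access.log");
    FSDataOutputStream out = fs.create(path, (short) 3);
    out.writeBytes("GET / 200\n");
    out.close();

    // Read it back from whichever replica answers.
    FSDataInputStream in = fs.open(path);
    byte[] buf = new byte[64];
    int n = in.read(buf);
    System.out.println(new String(buf, 0, n));
    in.close();
  }
}
```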

Why don't you see distributed filesystems coming out of places like Twitter? Because that shit is hard. It could also be that companies like Twitter are blissfully unaware of any data storage medium other than MySQL. In either case, data processing gets you more pedantic nerd-cred than data storage. Processing large amounts of data lets a nerd excuse himself for being a shut-in, because shit, he's doing important work in there. He's got to command an army of machines, all working in unison. That will show the football team who the real man is. Sure beats talking to girls.

Projects like Hadoop are few and far between. Everybody wants to work on an interesting problem like distributed data processing, but few people actually can. It's the disparity between “want” and “can” that brings us failures like Twitter.

Sometimes, your best just isn't good enough. ®

Ted Dziuba is a co-founder at Milo.com. You can read his regular Reg column, Fail and You, every other Monday.
