Feeds

Hadoop: When grownups do open source

On the emasculation of Twitter and Dirty Harry

Application security programs and practises

Fail and You Hadoop is a library for writing distributed data processing programs using the MapReduce framework. It's got all the makings of a blogosphere hit: cluster computing, large datasets, parallelism, algorithms published by Google, and open source. Every four days or so, a nerd will discover Hadoop, write a “Basic MapReduce Tutorial with Hadoop” tutorial on his blog with some trivial examples, and feel satisfied with himself for educating the world about a yet-undiscovered gem. Comparatively, very few people actually use Hadoop in practice, and those who do don't write about it. Why? Because they're adults who don't care about getting on the front page of Digg.

Hadoop was birthed to the world by Doug Cutting, the man who gave us the Lucene search library and the Nutch web crawler. It's a little known fact that Cutting only sleeps three hours per night, and not because he is tired, but rather as an act of mercy for his keyboard. Furthermore, Hadoop isn't open source because Doug wanted to help other developers solve distributed computing problems. It's open source because he thought you could learn a thing or two from his code.

Despite being a canon of Java engineering, Hadoop is actually pretty useful, if you've got a problem it can solve.

Now, that's a tricky caveat that isn't well understood by many contemporary programmers. A lot of developers (and this is most prevalent in the Web 2.0 world), walk balls-first into any given task, assuming that scalability is going to be issue number one. If you've got a big data processing problem that needs to scale, then Hadoop will probably help you out. In that case, you can read any one of the over nine thousand regurgitations of the same word count tutorial on some nerd's blog. Otherwise, you can do like the web scalability Bible beaters and completely miss the point of Hadoop, contort your code into something that resembles MapReduce, and let those throbbing nuts of yours hang out in the breeze, because shit, you just built one scalable-ass system. Sure, it runs several times slower than your old code used to, but it's parallelized, and that's really all that matters.

Put Me In, Coach

When it comes to open source, Hadoop is a good example of what separates the men from the boys. Being a top-level Apache project, there's significant backing behind it. With corporate sponsorship, a software project can go far. When I say “corporate sponsorship”, I'm referring to companies that actually make money, which brings us to the comedic open source attempts of Web 2.0.

Twitter, which is widely accepted as the drum major of the Web 2.0 failure parade, released an open source project called Starling in January of this year. Starling is the Ruby-based messaging system that runs Twitter's backend. Yes, Twitter, the nonprofit web service known widely for its downtime, dropped its disaster-producing shitpile on the world. Why? Maybe they thought more competent developers would fix their problems. The more likely scenario is that they wanted to get a quickie beatoff from the fake tech media to make themselves look more important. I am guessing this is why no code has been released for Starling since it was open sourced. Oops.

Twitter decided they would be cute and trendy. They wrote their code in Ruby: the official state language of the hipster-developer nation. Doug Cutting, on the other hand, decided he would get shit done, and wrote Hadoop in Java. Starling was hidden away in some corner and forgotten (it's hosted at RubyForge...what the fuck is that?). Hadoop lives prominently at the Apache Software Foundation. Starling is a re-hash of an existing Java Enterprise API called JMS that has several open source implementations. Hadoop is an implementation of Google's MapReduce, a system that publicly only existed on paper. Hadoop has the added benefit of actually working.

Perhaps this is a feast of apples and oranges. Starling is a messaging system, Hadoop is a distributed data processing system. Don't worry. There's plenty of failure to be had elsewhere in the open source world. Let's take, for example, a project called Starfish), which is a pure-Ruby implementation of MapReduce. Eh, well that's not entirely accurate. Starfish is a MapReduce-inspired framework that's simple enough for even Ruby developers to understand. That means there's no actual “reduce” phase in the MapReduce, and it works on MySQL database records. In other words, this project is virtually useless in every way, aside from getting the author a quick beatoff from the blogosphere. It's a half-baked implementation of an algorithm from Google, it's written in Ruby and it integrates with Rails. That's so warm and fuzzy it could turn Clint Eastwood gay.

But hey, taking the time to actually understand something is way harder that writing an open source Ruby implementation of it.

More Than Just Data Processing

Along with the data processing framework, Doug Cutting also included a fault tolerant, replicated, distributed file system with Hadoop just because fuck you. Called HDFS, it is inspired by Google's GFS, again something that only existed to the public in the form of an academic paper. HDFS is designed to integrate perfectly with the MapReduce framework, but you can also use it on its own. If you need to replicate large files across many machines, then HDFS has got what you need.

Why don't you see distributed filesystems coming out of places like Twitter? Because that shit is hard. It could also be that companies like Twitter are blissfully unaware of any data storage medium other than MySQL. In either case, data processing gets you more pedantic nerd-cred than data storage. Processing large amounts of data lets a nerd excuse himself for being a shut-in, because shit, he's doing important work in there. He's got to command an army of machines, all working in unison. That will show the football team who the real man is. Sure beats talking to girls.

Projects like Hadoop are few and far between. Everybody wants to work on an interesting problem like distributed data processing, but few people actually can. It's the disparity between “want” and “can” that brings us failures like Twitter.

Sometimes, your best just isn't good enough. ®

Ted Dziuba is a co-founder at Milo.com You can read his regular Reg column, Fail and You, every other Monday.

Bridging the IT gap between rising business demands and ageing tools

More from The Register

next story
Apple fanbois SCREAM as update BRICKS their Macbook Airs
Ragegasm spills over as firmware upgrade kills machines
Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage
Still doing 'forensics' on the caching layer – Beeb digi wonk
Attack of the clones: Oracle's latest Red Hat Linux lookalike arrives
Oracle's Linux boss says Larry's Linux isn't just for Oracle apps anymore
THUD! WD plonks down SIX TERABYTE 'consumer NAS' fatboy
Now that's a LOT of porn or pirated movies. Or, you know, other consumer stuff
EU's top data cops to meet Google, Microsoft et al over 'right to be forgotten'
Plan to hammer out 'coherent' guidelines. Good luck chaps!
US judge: YES, cops or feds so can slurp an ENTIRE Gmail account
Crooks don't have folders labelled 'drug records', opines NY beak
Manic malware Mayhem spreads through Linux, FreeBSD web servers
And how Google could cripple infection rate in a second
prev story

Whitepapers

Top three mobile application threats
Prevent sensitive data leakage over insecure channels or stolen mobile devices.
Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Top 8 considerations to enable and simplify mobility
In this whitepaper learn how to successfully add mobile capabilities simply and cost effectively.
Application security programs and practises
Follow a few strategies and your organization can gain the full benefits of open source and the cloud without compromising the security of your applications.
The Essential Guide to IT Transformation
ServiceNow discusses three IT transformations that can help CIO's automate IT services to transform IT and the enterprise.