Hadoop: When grownups do open source

On the emasculation of Twitter and Dirty Harry

Hadoop is a library for writing distributed data processing programs using the MapReduce framework. It's got all the makings of a blogosphere hit: cluster computing, large datasets, parallelism, algorithms published by Google, and open source. Every four days or so, a nerd will discover Hadoop, write a “Basic MapReduce with Hadoop” tutorial on his blog with some trivial examples, and feel satisfied with himself for educating the world about a yet-undiscovered gem. Comparatively few people actually use Hadoop in practice, and those who do don't write about it. Why? Because they're adults who don't care about getting on the front page of Digg.
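
Since the blogosphere insists, here it is: the word count job that every four-day wonder keeps rediscovering, sketched against Hadoop's stock org.apache.hadoop.mapreduce API. The reducer doubles as a combiner, which is safe because addition doesn't care what order it happens in; input and output paths come in on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // safe: addition is associative
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Congratulations. You have now read your nine-thousand-and-first word count tutorial.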

Hadoop was birthed to the world by Doug Cutting, the man who gave us the Lucene search library and the Nutch web crawler. It's a little-known fact that Cutting only sleeps three hours per night, and not because he is tired, but rather as an act of mercy for his keyboard. Furthermore, Hadoop isn't open source because Doug wanted to help other developers solve distributed computing problems. It's open source because he thought you could learn a thing or two from his code.

Despite being a paragon of Java engineering, Hadoop is actually pretty useful, if you've got a problem it can solve.

Now, that's a tricky caveat that isn't well understood by many contemporary programmers. A lot of developers (and this is most prevalent in the Web 2.0 world) walk balls-first into any given task, assuming that scalability is going to be issue number one. If you've got a big data processing problem that needs to scale, then Hadoop will probably help you out. In that case, you can read any one of the over nine thousand regurgitations of the same word count tutorial on some nerd's blog. Otherwise, you can do like the web scalability Bible-beaters and completely miss the point of Hadoop, contort your code into something that resembles MapReduce, and let those throbbing nuts of yours hang out in the breeze, because shit, you just built one scalable-ass system. Sure, it runs several times slower than your old code used to, but it's parallelized, and that's really all that matters.

Put Me In, Coach

When it comes to open source, Hadoop is a good example of what separates the men from the boys. As a top-level Apache project, it has significant backing behind it. With corporate sponsorship, a software project can go far. When I say “corporate sponsorship”, I'm referring to companies that actually make money, which brings us to the comedic open source attempts of Web 2.0.

Twitter, which is widely accepted as the drum major of the Web 2.0 failure parade, released an open source project called Starling in January of this year. Starling is the Ruby-based messaging system that runs Twitter's backend. Yes, Twitter, the nonprofit web service known widely for its downtime, dropped its disaster-producing shitpile on the world. Why? Maybe they thought more competent developers would fix their problems. The more likely scenario is that they wanted to get a quickie beatoff from the fake tech media to make themselves look more important. I am guessing this is why no code has been released for Starling since it was open sourced. Oops.

Twitter decided they would be cute and trendy. They wrote their code in Ruby: the official state language of the hipster-developer nation. Doug Cutting, on the other hand, decided he would get shit done, and wrote Hadoop in Java. Starling was hidden away in some corner and forgotten (it's hosted at RubyForge...what the fuck is that?). Hadoop lives prominently at the Apache Software Foundation. Starling is a re-hash of an existing Java Enterprise API called JMS that has several open source implementations. Hadoop is an implementation of Google's MapReduce, a system that publicly only existed on paper. Hadoop has the added benefit of actually working.
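
For the curious, here's roughly the wheel Starling reinvented: a minimal JMS producer, sketched against ActiveMQ (one of those several open source JMS implementations). The broker URL and queue name are invented for this example.

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

import org.apache.activemq.ActiveMQConnectionFactory;

public class EnqueueTweet {
  public static void main(String[] args) throws Exception {
    // ActiveMQ stands in for any JMS provider; the broker URL and
    // queue name are assumptions for the sake of the example.
    ConnectionFactory factory =
        new ActiveMQConnectionFactory("tcp://localhost:61616");
    Connection connection = factory.createConnection();
    connection.start();

    // No transactions, auto-acknowledge: the simplest possible session.
    Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    Queue queue = session.createQueue("tweets");
    MessageProducer producer = session.createProducer(queue);

    // JMS delivery is persistent by default, so the broker keeps the
    // message even if every consumer is down. Take notes, Twitter.
    producer.send(session.createTextMessage("is eating a sandwich"));

    producer.close();
    session.close();
    connection.close();
  }
}
```

Twenty-odd lines against a spec that already had working brokers. But it isn't written in Ruby, so the fake tech media never wrote it up.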

Perhaps this is a feast of apples and oranges. Starling is a messaging system; Hadoop is a distributed data processing system. Don't worry. There's plenty of failure to be had elsewhere in the open source world. Let's take, for example, a project called Starfish, which is a pure-Ruby implementation of MapReduce. Eh, well that's not entirely accurate. Starfish is a MapReduce-inspired framework that's simple enough for even Ruby developers to understand. That means there's no actual “reduce” phase in the MapReduce, and it works on MySQL database records. In other words, this project is virtually useless in every way, aside from getting the author a quick beatoff from the blogosphere. It's a half-baked implementation of an algorithm from Google, it's written in Ruby, and it integrates with Rails. That's so warm and fuzzy it could turn Clint Eastwood gay.

But hey, taking the time to actually understand something is way harder than writing an open source Ruby implementation of it.

More Than Just Data Processing

Along with the data processing framework, Doug Cutting also included a fault-tolerant, replicated, distributed file system with Hadoop, just because fuck you. Called HDFS, it is inspired by Google's GFS, again something that only existed to the public in the form of an academic paper. HDFS is designed to integrate perfectly with the MapReduce framework, but you can also use it on its own. If you need to replicate large files across many machines, then HDFS has got what you need.
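
Here's what using it on its own looks like from a client, as a sketch against Hadoop's FileSystem Java API. The namenode address and file path are made up; point them at your own cluster before getting excited.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    // The namenode address is invented; substitute your cluster's.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    // Write a file. HDFS splits it into blocks and replicates each
    // block across datanodes (three copies by default).
    Path path = new Path("/tmp/hello.txt");
    FSDataOutputStream out = fs.create(path, true); // true = overwrite
    out.writeBytes("replicated across the cluster, no MySQL involved\n");
    out.close();

    // Read it back from whichever replica answers.
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(path)));
    System.out.println(in.readLine());
    in.close();
    fs.close();
  }
}
```

The replication is the point: every block lands on multiple machines, so losing a datanode loses you nothing.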

Why don't you see distributed filesystems coming out of places like Twitter? Because that shit is hard. It could also be that companies like Twitter are blissfully unaware of any data storage medium other than MySQL. In either case, data processing gets you more pedantic nerd-cred than data storage. Processing large amounts of data lets a nerd excuse himself for being a shut-in, because shit, he's doing important work in there. He's got to command an army of machines, all working in unison. That will show the football team who the real man is. Sure beats talking to girls.

Projects like Hadoop are few and far between. Everybody wants to work on an interesting problem like distributed data processing, but few people actually can. It's the disparity between “want” and “can” that brings us failures like Twitter.

Sometimes, your best just isn't good enough. ®

Ted Dziuba is a co-founder at Milo.com. You can read his regular Reg column, Fail and You, every other Monday.
