Hadoop takes Big Data beyond Java

Stuffed elephant mates with Python

From Nutch to Hadoop

He mimicked GFS and MapReduce to break up large chunks of data into small pieces and search them quickly across thousands of servers, building an open-source implementation. Again, it worked - to a point. "We could do demos on 20 machines and actually get some work done, but it wasn't ready to scale to thousands of machines and it wasn't horribly reliable," Cutting said. "This reliability thing was really hard work."

It was Yahoo! that then stepped in, offering the engineers and servers needed to iron out the problems. But Yahoo! had found another use for Hadoop: to quickly analyze huge piles of data distributed in silos of servers and web properties. With Yahoo!'s vice president of Hadoop software development, Eric Baldeschwieler, Cutting split out the distributed computing part of Nutch and put it into Hadoop.

Cutting said researchers in Yahoo! wanted to get access to lots of data sets for things like ads served and web server loads. "If you were a researcher in Yahoo! asking how to make ads more relevant, you didn't have all the data in one place," he said. "They started pulling data together in one place to get some early users - and they loved it."

Suddenly, Yahoo! was quickly analyzing ever-changing data on its pages to make updates in hours that had previously taken weeks, and it was shuffling ads around to follow the latest click traffic.

"What it's all about is getting people a handle on running computation on terabytes of data and getting an answer back in a small amount of time reliably," Cutting said.

With Yahoo! focused on solving cluster security, Cutting is still pushing Hadoop forward and trying to crack the problem of breaking changes. He also wants to take Hadoop a step further by attracting non-Java developers. He's tackling both through the Avro project.

Beyond Java

Avro is a format for data interchange intended to let applications keep reading and processing data after the application that produced it has been updated or changed. A further goal is to let developers write Hadoop applications in languages other than Java, and to let Hadoop support native MapReduce and HDFS clients in languages like Python, C, and C++.
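
The idea of data surviving an application change can be sketched in a few lines of Python. This is only a rough illustration, not Avro's actual API: the `resolve` function, the `PageView` schemas, and the field names are all hypothetical, and plain dictionaries stand in for Avro's real schema and binary encoding. It shows a newer reader filling in a default for a field the older writer never knew about, and ignoring a field it no longer needs.

```python
# Simplified illustration of schema resolution; NOT the Avro library's API.
# All names below (PageView, url, clicks, region, resolve) are hypothetical.

# Schema used by an older version of the application when it wrote the data.
WRITER_SCHEMA = {
    "name": "PageView",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "clicks", "type": "int"},
    ],
}

# Schema expected by a newer version: it no longer reads "clicks" and it
# adds "region", with a default for data written before the field existed.
READER_SCHEMA = {
    "name": "PageView",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "region", "type": "string", "default": "unknown"},
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    written = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            # Field exists in both schemas: copy the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field is new to the reader: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields only the writer knew about (here "clicks") are dropped.
    return resolved


old_record = {"url": "/index.html", "clicks": 42}
new_view = resolve(old_record, WRITER_SCHEMA, READER_SCHEMA)
print(new_view)
```

Because the data is interpreted against a schema rather than against a compiled Java class, the same approach can in principle be implemented in any language, which is how the format ties into the non-Java goal.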

Meanwhile, Cutting has followed other open sourcers by joining a company that's trying to sell support and services to customers using his pet technology. He joined Cloudera in August 2009. Despite Hadoop's use at some of the largest sites online, Cutting believes Hadoop is worthwhile even if you're running a cluster of just 20 nodes, and that it's easier than running a database server to crunch huge piles of data. Cloudera customers include Netflix and Samsung.

And if you don't want to run Hadoop yourself, you can deploy on cloud providers like Amazon and Rackspace that are running Hadoop. "It's a little harder than spreadsheet programming but there are tools that are making it simpler," Cutting reassured us. "The whole goal is to make it fairly simple from the outside and keep the complexity inside."

Cutting may never have planned for where Hadoop is today, but he's not letting delays to version 1.0 obstruct its future either.®
