Hadoop takes Big Data beyond Java

Stuffed elephant mates with Python

From Nutch to Hadoop

He mimicked GFS and MapReduce to break up large chunks of data into small pieces and search them quickly across thousands of servers, building an implementation using open source. Again, it worked - to a point. "We could do demos on 20 machines and actually get some work done, but it wasn't ready to scale to thousands of machines and it wasn't horribly reliable," Cutting said. "This reliability thing was really hard work."
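
The split-then-combine idea behind MapReduce can be sketched in a few lines of Python. This is an illustration only, not Hadoop or Nutch code: it runs in a single process, the word-count task is an assumed example, and every function name is hypothetical.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(lines, chunk_size):
    """Break the input into small pieces, one per (hypothetical) worker."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key is reduced in one place."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, chunk_size=2):
    chunks = split_into_chunks(lines, chunk_size)
    # On a real cluster each chunk would be mapped on a different machine;
    # here the chunks are simply processed one after another.
    mapped = chain.from_iterable(map_chunk(chunk) for chunk in chunks)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    docs = ["big data big cluster", "data on a cluster", "big elephant"]
    print(word_count(docs))
```

Because each chunk is mapped independently, the map step is the part that spreads across machines. The reliability Cutting describes, such as rerunning work when a machine fails, is exactly what this sketch leaves out.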

It was then that Yahoo! stepped in, offering the engineers and servers needed to iron out the problems. But Yahoo! had found another use for Hadoop: to quickly analyze huge piles of data distributed in silos of servers and web properties. With Yahoo!'s vice president of Hadoop software development Eric Baldeschwieler, Cutting split out the distributed computing part of Nutch and put it into Hadoop.

Cutting said researchers in Yahoo! wanted to get access to lots of data sets for things like ads served and web server loads. "If you were a researcher in Yahoo! asking how to make ads more relevant, you didn't have all the data in one place," he said. "They started pulling data together in one place to get some early users - and they loved it."

Suddenly, Yahoo! was quickly analyzing ever-changing data on its pages to make updates in hours that had previously taken weeks, and it was shuffling ads around to follow the latest click traffic.

"What it's all about is getting people a handle on running computation on terabytes of data and getting an answer back in a small amount of time reliably," Cutting said.

With Yahoo! focused on solving cluster security, Cutting is still pushing Hadoop forward and trying to crack the problem of breaking changes. He also wants to take Hadoop a step further by attracting non-Java developers. He's tackling both through the Avro project.

Beyond Java

Avro is a format for data interchange intended to let applications keep calling and processing data after an application has been updated or changed. The goal is also for applications to be written for Hadoop in languages other than Java and to let Hadoop support native MapReduce and HDFS clients in languages like Python, C, and C++.
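
To see how data can stay readable after an application changes, here is a toy Python sketch of the general idea: data written under an old schema is read through a newer schema that supplies defaults for added fields and ignores fields it no longer knows. It does not use the Avro library or its API; the `resolve` function, the field names, and the sample record are all hypothetical.

```python
REQUIRED = object()  # marker: the reader schema gives no default for this field


def resolve(record, reader_schema):
    """Project a record written under an older schema onto the reader's schema.

    reader_schema is a list of (field_name, default) pairs.
    """
    resolved = {}
    for name, default in reader_schema:
        if name in record:
            resolved[name] = record[name]  # field present in the written data
        elif default is not REQUIRED:
            resolved[name] = default       # field added later: use its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields the writer knew about but the reader does not are dropped.
    return resolved


# Written by an old version of the application (hypothetical data).
OLD_RECORD = {"url": "http://example.com/", "clicks": 12, "legacy_id": 7}

# Schema used by the updated application: "region" is new, "legacy_id" is gone.
NEW_SCHEMA = [("url", REQUIRED), ("clicks", 0), ("region", "unknown")]

if __name__ == "__main__":
    print(resolve(OLD_RECORD, NEW_SCHEMA))
```

The updated reader gets a value for `region` even though the old writer never stored one, so the change does not break existing data.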

Meanwhile, Cutting has followed other open sourcers by joining a company that's trying to sell support and services to customers using his pet technology. He joined Cloudera in August 2009. Despite Hadoop's use at some of the largest sites online, Cutting believes Hadoop is a good fit even if you're running clusters of just 20 nodes, and that it's easier than running a database server to crunch huge piles of data. Cloudera customers include Netflix and Samsung.

And if you don't want to run Hadoop yourself, you can deploy on cloud providers like Amazon and Rackspace that are running Hadoop. "It's a little harder than spreadsheet programming but there are tools that are making it simpler," Cutting reassured us. "The whole goal is to make it fairly simple from the outside and keep the complexity inside."

Cutting may never have planned for where Hadoop is today, but he's not letting delays to version 1.0 obstruct its future either.®
