
Yahoo! breeds Pig that talks elephant

Swine talk of a different kind


If there's one place on earth where swine talk is still met with open arms, it's Yahoo!.

Yahoo! is gradually moving its data-heavy web services onto Hadoop - that Google-inspired open-source platform for crunching epic amounts of information across a sea of distributed machines. And to grease the move, the company has developed its own Hadoop programming language. In typical Hadoop fashion, it's known as Pig.

Hadoop mimics Google's MapReduce framework, which maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation. You can write straight to the framework in Java, but Pig aims to put MapReduce coding at a higher level.
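To make the map-then-reduce idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop's Java API, and it does not distribute anything; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, chosen only for illustration. It shows the shape of the model: a mapper emits key/value pairs, the pairs are grouped by key, and a reducer collapses each group into one result.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word, counts):
    """Sum the per-occurrence counts for one word."""
    return sum(counts)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
```

In a real cluster the map and reduce calls would run on many machines at once; the sketch only mirrors the programming model the developer writes against.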

"There was a lot of hype around [Hadoop] MapReduce and it gained a lot of traction, probably because it's a very simple low-level model," says Chris Olston, part of the Yahoo! research team that originated Pig. "But at the same time, people were writing higher-level functions over and over again."

Hadoop MapReduce, for instance, has no "join" operation - a staple of data programming - and Pig makes amends.
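As a rough illustration of what "writing higher-level functions over and over again" can mean for a join, the Python sketch below hand-rolls one common approach, usually called a reduce-side join: each record is tagged with the dataset it came from, records are grouped by the join key, and the reducer pairs up the two sides. This is an assumption about how such a join might be written, not Yahoo!'s code or Hadoop's API, and the names (`tag_mapper`, `join_reducer`, `reduce_side_join`) and sample field names are hypothetical.

```python
from collections import defaultdict


def tag_mapper(source, record, key_field):
    """Emit (join key, (source tag, record)) so the reducer can tell the sides apart."""
    return record[key_field], (source, record)


def join_reducer(tagged_records):
    """Pair every left record with every right record that shares the key."""
    left = [rec for src, rec in tagged_records if src == "left"]
    right = [rec for src, rec in tagged_records if src == "right"]
    return [{**l, **r} for l in left for r in right]


def reduce_side_join(left_rows, right_rows, key_field):
    """Inner join of two lists of dicts on key_field, in map/shuffle/reduce style."""
    groups = defaultdict(list)  # stands in for the framework's shuffle step
    for source, rows in (("left", left_rows), ("right", right_rows)):
        for row in rows:
            key, tagged = tag_mapper(source, row, key_field)
            groups[key].append(tagged)

    joined = []
    for tagged_records in groups.values():
        joined.extend(join_reducer(tagged_records))
    return joined
```

Keys that appear on only one side produce no output, which makes this an inner join. Every team that needs the operation has to write and debug some version of this plumbing, which is the repetition a built-in join removes.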

Hadoop founder Doug Cutting describes Pig as "SQL for MapReduce." But that description might be better applied to Hive, a high-level open-source MapReduce language first developed at Facebook. Pig sits somewhere between Hive and the low-level code of MapReduce.

"Hive is closer to SQL syntax. Pig aims for something that's more of an explicit data flow syntax" Olston tells The Reg. "We wanted to get to something where the common operations like 'join' are built-in - so you just have to write a one-line command to do a 'join' - but at the same time, it retains the explicit data flow aspect of MapReduce. It's in the sweet spot between the two."

In the end, this still puts Hadoop coding in the hands of those who may not be hardcore developers. "You have to be able to write scripts," says Olston. "But you don't have to be a full-fledged programmer."

Pig began life as an Apache Incubator project in the fall of 2007, and in October of 2008 it was accepted as an official Hadoop sub-project. About 30 per cent of all Yahoo! Hadoop jobs are now Pig jobs too, and according to Ajay Anand, director of product management for grid computing at Yahoo!, when new developers join Yahoo!'s Hadoop migration they typically choose to ride the Pig. "It's much easier to get going," he says.

According to Olga Natkovich, who manages the Pig development team, the typical Pig program is about 1/20th as long as an equivalent MapReduce creation - and requires about 1/16th of the development time.

Doug Cutting - the man behind the Lucene search library and the Nutch web crawler - first developed Hadoop after Google kindly published a pair of research papers on MapReduce and its proprietary Google File System (GFS). He envisioned the project as the underpinning for his open-source Nutch web crawler, but Yahoo! soon took an interest and he's now on the company payroll.

Most notably, Hadoop runs Yahoo! Search Webmap, which provides the world’s second most popular search engine with a database of all known web pages – complete with all the metadata needed to understand them. According to Yahoo! grid guru Eric Baldeschwieler, the app draws its web map 33 per cent faster than the company's previous system.

But Hadoop also underpins various Yahoo! content and advertising services. On the content side, for instance, it now powers the real-time automated algorithms that select news stories for the Yahoo! home page.

Cutting named Hadoop after his son's yellow stuffed elephant, and animal references tend to pop up in the names of sub-projects. Thus the Pig. Version 0.2.0 was released last month and is available for download, as is Hadoop itself, which Pig requires. ®
