Feeds

Yahoo! breeds Pig that talks elephant

Swine talk of a different kind

Secure remote control for conventional and virtual desktops

If there's one place on earth where swine talk is still met with open arms, it's Yahoo!.

Yahoo! is gradually moving its data-heavy web services onto Hadoop - that Google-inspired open-source platform for crunching epic amounts of information across a sea of distributed machines. And to grease the move, the company has developed its own Hadoop programming language. In typical Hadoop fashion, it's known as Pig.

Hadoop mimics Google's MapReduce framework, which maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation. You can write straight to the framework in Java, but Pig aims to put MapReduce coding at a higher level.

"There was a lot of hype around [Hadoop] MapReduce and it gained a lot of traction, probably because it's a very simple low-level model," says Chris Olston, part of the Yahoo! research team that originated Pig. "But at the same time, people were writing higher-level functions over and over again."

Hadoop MapReduce, for instance, has no "join" operation - a staple of data programming - and Pig makes amends.

Hadoop founder Dave Cutting describes Pig as "SQL for MapReduce." But that description might be better applied to Hive, a high-level open-source MapReduce language first developed at Facebook. Pig sits somewhere between Hive and the low-level code of MapReduce.

"Hive is closer to SQL syntax. Pig aims for something that's more of an explicit data flow syntax" Olston tells The Reg. "We wanted to get to something where the common operations like 'join' are built-in - so you just have to write a one-line command to do a 'join' - but at the same time, it retains the explicit data flow aspect of MapReduce. It's in the sweet spot between the two."

In the end, this still puts Hadoop coding in the hands of those who may not be hardcore developers. "You have to be able to write scripts," says Olston. "But you don't have to be a full-fledged programmer."

Pig began life as an Apache Incubator project in the fall of 2007, and in October of 2008 it was accepted as an official Hadoop sub-project. About 30 per cent of all Yahoo! Hadoop jobs are now Pig jobs too, and according to Ajay Anand, director of product management for grid computing at Yahoo!, when new developers join Yahoo!'s Hadoop migration they typically choose to ride the Pig. "It's much easier to get going," he says.

According to Olga Natkovich (PowerPoint), who manages the Pig development team, the typical Pig program is about 1/20th as long as an equivalent MapReduce creation - and requires about 1/16th of the development time.

Doug Cutting - the man behind the Lucene search library and the Nutch web crawler - first developed Hadoop after Google kindly published a pair of research papers on MapReduce and its proprietary Google File System (GFS). He envisioned the project as underpinning for his open-source Nutch webcrawler, but Yahoo! soon took an interest and he's now on the company payroll.

Most notably, Hadoop runs Yahoo! Search Webmap, which provides the world’s second most popular search engine with a database of all known web pages – complete with all the metadata needed to understand them. According to Yahoo! grid guru Eric Baldeschwieler, the app draws its web map 33 per cent faster than the company's previous system.

But Hadoop also underpins various Yahoo! content and advertising services. On the content side, for instance, it now powers the real-time automated algorithms that select news stories for the Yahoo! home page.

Cutting named Hadoop after his son's yellow stuffed elephant, and animal references tend to pop in the names of sub-projects. Thus the Pig. Version 0.2.0 was released last month, and you can download it here. Need a Hadoop installation first? Go here. ®

The essential guide to IT transformation

More from The Register

next story
Microsoft boots 1,500 dodgy apps from the Windows Store
DEVELOPERS! DEVELOPERS! DEVELOPERS! Naughty, misleading developers!
'Stop dissing Google or quit': OK, I quit, says Code Club co-founder
And now a message from our sponsors: 'STFU or else'
Apple promises to lift Curse of the Drained iPhone 5 Battery
Have you tried turning it off and...? Never mind, here's a replacement
Uber, Lyft and cutting corners: The true face of the Sharing Economy
Casual labour and tired ideas = not really web-tastic
Mozilla's 'Tiles' ads debut in new Firefox nightlies
You can try turning them off and on again
Linux turns 23 and Linus Torvalds celebrates as only he can
No, not with swearing, but by controlling the release cycle
prev story

Whitepapers

5 things you didn’t know about cloud backup
IT departments are embracing cloud backup, but there’s a lot you need to know before choosing a service provider. Learn all the critical things you need to know.
Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Backing up Big Data
Solving backup challenges and “protect everything from everywhere,” as we move into the era of big data management and the adoption of BYOD.
Consolidation: The Foundation for IT Business Transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?