Yahoo! breeds Pig that talks elephant

Swine talk of a different kind

If there's one place on earth where swine talk is still met with open arms, it's Yahoo!.

Yahoo! is gradually moving its data-heavy web services onto Hadoop - that Google-inspired open-source platform for crunching epic amounts of information across a sea of distributed machines. And to grease the move, the company has developed its own Hadoop programming language. In typical Hadoop fashion, it's known as Pig.

Hadoop mimics Google's MapReduce framework, which maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation. You can write straight to the framework in Java, but Pig aims to put MapReduce coding at a higher level.
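To make the map-then-reduce idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run each chunk on a different machine rather than in a loop.

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit one (word, 1) pair per word in this chunk of input.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    # Group emitted values by key, the step that sits between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)

def word_count(chunks):
    pairs = []
    for chunk in chunks:  # on a cluster, each chunk would be a separate sub-task
        pairs.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["pig talks elephant", "elephant talks"])
```

Even for a word count, the programmer has to spell out three separate steps, which is the kind of low-level work Pig is meant to hide.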

"There was a lot of hype around [Hadoop] MapReduce and it gained a lot of traction, probably because it's a very simple low-level model," says Chris Olston, part of the Yahoo! research team that originated Pig. "But at the same time, people were writing higher-level functions over and over again."

Hadoop MapReduce, for instance, has no "join" operation - a staple of data programming - and Pig makes amends.
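For a sense of what writing a join by hand involves, the sketch below imitates a common MapReduce pattern (tag each record with its source, group by key, pair up within each group) in plain Python. It is an illustration under assumptions, not Hadoop or Pig code: `reduce_side_join` and the `users` and `clicks` sample data are made up for the example.

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Inner-join two lists of (key, value) records on their key."""
    # "Map" step: tag each record with the table it came from.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]

    # "Shuffle" step: group the tagged records by key.
    groups = defaultdict(list)
    for key, record in tagged:
        groups[key].append(record)

    # "Reduce" step: pair every left value with every right value per key.
    joined = []
    for key in sorted(groups):
        lefts = [v for tag, v in groups[key] if tag == "L"]
        rights = [v for tag, v in groups[key] if tag == "R"]
        joined.extend((key, l, r) for l in lefts for r in rights)
    return joined

users = [(1, "alice"), (2, "bob")]                      # hypothetical data
clicks = [(1, "/home"), (1, "/news"), (3, "/mail")]     # hypothetical data
rows = reduce_side_join(users, clicks)
```

Keys that appear on only one side (user 2, click key 3) drop out, as in an inner join. This is the sort of routine Olston says people were rewriting "over and over again".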

Hadoop founder Doug Cutting describes Pig as "SQL for MapReduce." But that description might be better applied to Hive, a high-level open-source MapReduce language first developed at Facebook. Pig sits somewhere between Hive and the low-level code of MapReduce.

"Hive is closer to SQL syntax. Pig aims for something that's more of an explicit data flow syntax," Olston tells The Reg. "We wanted to get to something where the common operations like 'join' are built-in - so you just have to write a one-line command to do a 'join' - but at the same time, it retains the explicit data flow aspect of MapReduce. It's in the sweet spot between the two."

In the end, this still puts Hadoop coding in the hands of those who may not be hardcore developers. "You have to be able to write scripts," says Olston. "But you don't have to be a full-fledged programmer."

Pig began life as an Apache Incubator project in the fall of 2007, and in October of 2008 it was accepted as an official Hadoop sub-project. About 30 per cent of all Yahoo! Hadoop jobs are now Pig jobs, and according to Ajay Anand, director of product management for grid computing at Yahoo!, when new developers join Yahoo!'s Hadoop migration they typically choose to ride the Pig. "It's much easier to get going," he says.

According to Olga Natkovich, who manages the Pig development team, the typical Pig program is about 1/20th as long as an equivalent MapReduce creation - and requires about 1/16th of the development time.

Doug Cutting - the man behind the Lucene search library and the Nutch web crawler - first developed Hadoop after Google kindly published a pair of research papers on MapReduce and its proprietary Google File System (GFS). He envisioned the project as the underpinning for his open-source Nutch web crawler, but Yahoo! soon took an interest and he's now on the company payroll.

Most notably, Hadoop runs Yahoo! Search Webmap, which provides the world’s second most popular search engine with a database of all known web pages – complete with all the metadata needed to understand them. According to Yahoo! grid guru Eric Baldeschwieler, the app draws its web map 33 per cent faster than the company's previous system.

But Hadoop also underpins various Yahoo! content and advertising services. On the content side, for instance, it now powers the real-time automated algorithms that select news stories for the Yahoo! home page.

Cutting named Hadoop after his son's yellow stuffed elephant, and animal references tend to pop up in the names of sub-projects. Thus the Pig. Version 0.2.0 was released last month and is available for download. You'll need a Hadoop installation first. ®
