Feeds

Yahoo! breeds Pig that talks elephant

Swine talk of a different kind

Providing a secure and efficient Helpdesk

If there's one place on earth where swine talk is still met with open arms, it's Yahoo!.

Yahoo! is gradually moving its data-heavy web services onto Hadoop - that Google-inspired open-source platform for crunching epic amounts of information across a sea of distributed machines. And to grease the move, the company has developed its own Hadoop programming language. In typical Hadoop fashion, it's known as Pig.

Hadoop mimics Google's MapReduce framework, which maps data-crunching tasks across distributed machines, splitting them into tiny sub-tasks, before reducing the results into one master calculation. You can write straight to the framework in Java, but Pig aims to put MapReduce coding at a higher level.

"There was a lot of hype around [Hadoop] MapReduce and it gained a lot of traction, probably because it's a very simple low-level model," says Chris Olston, part of the Yahoo! research team that originated Pig. "But at the same time, people were writing higher-level functions over and over again."

Hadoop MapReduce, for instance, has no "join" operation - a staple of data programming - and Pig makes amends.

Hadoop founder Dave Cutting describes Pig as "SQL for MapReduce." But that description might be better applied to Hive, a high-level open-source MapReduce language first developed at Facebook. Pig sits somewhere between Hive and the low-level code of MapReduce.

"Hive is closer to SQL syntax. Pig aims for something that's more of an explicit data flow syntax" Olston tells The Reg. "We wanted to get to something where the common operations like 'join' are built-in - so you just have to write a one-line command to do a 'join' - but at the same time, it retains the explicit data flow aspect of MapReduce. It's in the sweet spot between the two."

In the end, this still puts Hadoop coding in the hands of those who may not be hardcore developers. "You have to be able to write scripts," says Olston. "But you don't have to be a full-fledged programmer."

Pig began life as an Apache Incubator project in the fall of 2007, and in October of 2008 it was accepted as an official Hadoop sub-project. About 30 per cent of all Yahoo! Hadoop jobs are now Pig jobs too, and according to Ajay Anand, director of product management for grid computing at Yahoo!, when new developers join Yahoo!'s Hadoop migration they typically choose to ride the Pig. "It's much easier to get going," he says.

According to Olga Natkovich (PowerPoint), who manages the Pig development team, the typical Pig program is about 1/20th as long as an equivalent MapReduce creation - and requires about 1/16th of the development time.

Doug Cutting - the man behind the Lucene search library and the Nutch web crawler - first developed Hadoop after Google kindly published a pair of research papers on MapReduce and its proprietary Google File System (GFS). He envisioned the project as underpinning for his open-source Nutch webcrawler, but Yahoo! soon took an interest and he's now on the company payroll.

Most notably, Hadoop runs Yahoo! Search Webmap, which provides the world’s second most popular search engine with a database of all known web pages – complete with all the metadata needed to understand them. According to Yahoo! grid guru Eric Baldeschwieler, the app draws its web map 33 per cent faster than the company's previous system.

But Hadoop also underpins various Yahoo! content and advertising services. On the content side, for instance, it now powers the real-time automated algorithms that select news stories for the Yahoo! home page.

Cutting named Hadoop after his son's yellow stuffed elephant, and animal references tend to pop in the names of sub-projects. Thus the Pig. Version 0.2.0 was released last month, and you can download it here. Need a Hadoop installation first? Go here. ®

Internet Security Threat Report 2014

More from The Register

next story
UNIX greybeards threaten Debian fork over systemd plan
'Veteran Unix Admins' fear desktop emphasis is betraying open source
Netscape Navigator - the browser that started it all - turns 20
It was 20 years ago today, Marc Andreeesen taught the band to play
Redmond top man Satya Nadella: 'Microsoft LOVES Linux'
Open-source 'love' fairly runneth over at cloud event
Chrome 38's new HTML tag support makes fatties FIT and SKINNIER
First browser to protect networks' bandwith using official spec
Google+ goes TITSUP. But WHO knew? How long? Anyone ... Hello ...
Wobbly Gmail, Contacts, Calendar on the other hand ...
Admins! Never mind POODLE, there're NEW OpenSSL bugs to splat
Four new patches for open-source crypto libraries
prev story

Whitepapers

Forging a new future with identity relationship management
Learn about ForgeRock's next generation IRM platform and how it is designed to empower CEOS's and enterprises to engage with consumers.
Why and how to choose the right cloud vendor
The benefits of cloud-based storage in your processes. Eliminate onsite, disk-based backup and archiving in favor of cloud-based data protection.
Three 1TB solid state scorchers up for grabs
Big SSDs can be expensive but think big and think free because you could be the lucky winner of one of three 1TB Samsung SSD 840 EVO drives that we’re giving away worth over £300 apiece.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.