Facebook warehousing 180 PETABYTES of data a year

The Social Network open-sources ‘Corona’ tool used to manage the deluge

Facebook’s data warehouses grow by “Over half a petabyte … every 24 hours”, according to a note The Social Network’s Engineering team has issued to explain a new release of open source code.
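
The headline's annual figure follows from simple arithmetic on that daily rate. A quick check, assuming a constant half a petabyte per day (the note says "over", so this is a floor) and a 365-day year, both of which are our assumptions rather than Facebook's numbers:

```python
# Assumed floor: the note says "over half a petabyte" every 24 hours.
PB_PER_DAY = 0.5
# Assumed year length for the back-of-envelope calculation.
DAYS_PER_YEAR = 365

pb_per_year = PB_PER_DAY * DAYS_PER_YEAR
print(pb_per_year)  # 182.5
```

That lands a shade above the 180 petabytes in the headline, which looks like a rounding of the same sum.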

The note says that "ad-hoc queries, data pipelines, and custom MapReduce jobs process this raw data around the clock to generate more meaningful features and aggregations."

But vanilla-flavoured Apache Hadoop can't do that job, so Facebook has created the code in question, dubbed Corona, to extend the big data darling's capabilities so it can manage the deluge of data it collects each day.

The note explains “We initially employed the MapReduce implementation from Apache Hadoop as the foundation of this infrastructure, and that served us well for several years. But by early 2011, we started reaching the limits of that system.”

Those limits saw compute clusters clogged, due to scheduling issues with MapReduce, while resource management struggled to meet Facebook’s enormous demands.

Facebook characterises MapReduce, Hadoop-style, with the following illustration.

Facebook's depiction of Hadoop at work

Corona, by contrast, offers the configuration depicted below.

Facebook's Corona tool

Facebook says Corona rocks for the following reasons:

“Corona introduces a cluster manager whose only purpose is to track the nodes in the cluster and the amount of free resources. A dedicated job tracker is created for each job, and can run either in the same process as the client (for small jobs) or as a separate process in the cluster (for large jobs). One major difference from our previous Hadoop MapReduce implementation is that Corona uses push-based, rather than pull-based, scheduling. After the cluster manager receives resource requests from the job tracker, it pushes the resource grants back to the job tracker. Also, once the job tracker gets resource grants, it creates tasks and then pushes these tasks to the task trackers for running. There is no periodic heartbeat involved in this scheduling, so the scheduling latency is minimized.”
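
The push-based flow in that quote can be sketched in a few lines. This is a toy model written for illustration, not Corona's code: the class and method names (`ClusterManager`, `JobTracker`, `TaskTracker`, `request`, `on_grants`, `release`) are hypothetical, slots stand in for "free resources", and everything runs in one process with direct method calls where the real system would use RPC.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    """A worker node (hypothetical model); records the tasks pushed to it."""
    name: str
    free_slots: int
    running: list = field(default_factory=list)


class ClusterManager:
    """Tracks nodes and free slots only; it knows nothing about job progress."""

    def __init__(self, trackers):
        self.trackers = {t.name: t for t in trackers}
        self.pending = deque()  # (job_tracker, slots still wanted), FIFO

    def request(self, job_tracker, slots):
        self.pending.append((job_tracker, slots))
        self._schedule()  # react at once; no heartbeat interval to wait for

    def release(self, node_name, task):
        tracker = self.trackers[node_name]
        tracker.running.remove(task)
        tracker.free_slots += 1
        self._schedule()  # a freed slot is granted as soon as it appears

    def _schedule(self):
        while self.pending:
            job_tracker, wanted = self.pending[0]
            grants = []
            for tracker in self.trackers.values():
                while tracker.free_slots > 0 and len(grants) < wanted:
                    tracker.free_slots -= 1
                    grants.append(tracker.name)
            # Update the queue before pushing, so state is consistent.
            if len(grants) < wanted:
                self.pending[0] = (job_tracker, wanted - len(grants))
            else:
                self.pending.popleft()
            if grants:
                job_tracker.on_grants(grants, self)  # push grants to the job
            if len(grants) < wanted:
                return  # cluster is full; wait for a release


class JobTracker:
    """One per job; turns resource grants into tasks and pushes them out."""

    def __init__(self, job_id, tasks):
        self.job_id = job_id
        self.todo = deque(tasks)

    def start(self, cluster_manager):
        cluster_manager.request(self, len(self.todo))

    def on_grants(self, node_names, cluster_manager):
        for name in node_names:
            task = (self.job_id, self.todo.popleft())
            # Push the task straight to the granted task tracker.
            cluster_manager.trackers[name].running.append(task)
```

With two two-slot nodes, a three-task job A followed by a two-task job B leaves one of B's tasks queued until `release` frees a slot, at which point the grant and the task are pushed immediately. In a pull-based design the same task would sit idle until the worker's next heartbeat, which is the latency the quote says Corona removes.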

The post also details how Facebook introduced the new tool and, along the way, gives some insight into the scale of the company’s infrastructure: the rollout started with a modestly sized cluster of 500 nodes, to “get feedback from early adopters.”

A 1000-node trial yielded the first scaling problem, before the tool was introduced to all of the company’s servers.

The company has now made Corona available on GitHub. By doing so it has played by the right open source rules, given that the Engineering note suggests the company believes Corona will be a crucial tool “for years to come”.

Given the note says Facebook’s data warehouse “has grown by 2500x in the past four years”, Corona looks to have serious data-handling grunt. And that’s just the warehouse: how much other data Facebook holds is not disclosed. Nor is just what Corona will deliver, in terms of products or data analysis.

It may therefore be sensible, if one were to relax and partake of Corona’s namesake beverage, to admire the technical achievements described here, but to reserve judgement on what they may enable. ®
