Facebook reveals TAO - the data store for its social graph

Calmly serves ONE BEEEELION reads per second

USENIX Facebook has revealed details about Tao, its multi-petabyte data store for the company's social graph.

Though Facebook's social network may have little relevance for IT pros, its internal infrastructure does, because here the social network is dealing with quantities of information so vast that it has to come up with new ways to store, compute, and manage the data.

So the publication of details about Tao at the USENIX conference on Wednesday is novel for two reasons: one, it shows off the scale at which future enterprises are going to have to operate, and two, it highlights some of the design methodologies that modern data systems have brought about within sophisticated tech companies.

"A system like TAO is likely to be useful for any application domain that needs to efficiently generate fine-grained customized content from highly interconnected data," Facebook's employees write in the paper. "The application should not expect the data to be stale in the common case, but should be able to tolerate it. Many social networks fit in this category."

Other applications of a system like Tao could be large data sets relating to wildlife populations over time, or other complex systems with many agents whose relationships to one another are defined by a variety of actions. For the tinfoil-hat aficionados, Tao would also seem to deal with the problems an intelligence agency would run into when trying to keep tabs on all its citizens.

Tao is a read-optimized data store that is deployed at Facebook as a single geographically distributed instance. It lets Facebook engineers read and write information across the company's "social graph", which stores all information about objects on Facebook (people, brands, comments and such) and the associations between them (likes, pokes, tags).
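To make the objects-and-associations model concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not Facebook's code: the class names, field names, and methods (`GraphObject`, `Association`, `TinyGraph`, `assoc_count`, `assoc_range`) are assumptions made for this example, and a real deployment would sit on top of a database and cache rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class GraphObject:
    """A node in the graph: a person, brand, comment and so on."""
    id: int
    otype: str
    data: dict = field(default_factory=dict)


@dataclass
class Association:
    """A typed, directed edge between two objects, e.g. a like or a tag."""
    id1: int
    atype: str
    id2: int
    time: int


class TinyGraph:
    """Hypothetical toy store; all names here are illustrative assumptions."""

    def __init__(self):
        self.objects = {}   # object id -> GraphObject
        self.assocs = {}    # (id1, atype) -> list of Association

    def add_object(self, obj):
        self.objects[obj.id] = obj

    def add_assoc(self, id1, atype, id2, time):
        self.assocs.setdefault((id1, atype), []).append(
            Association(id1, atype, id2, time)
        )

    def assoc_count(self, id1, atype):
        # How many edges of this type leave id1 (e.g. a like count).
        return len(self.assocs.get((id1, atype), []))

    def assoc_range(self, id1, atype, limit):
        # Newest edges first, capped at `limit`.
        edges = self.assocs.get((id1, atype), [])
        return sorted(edges, key=lambda a: a.time, reverse=True)[:limit]
```

Rendering a page then comes down to a handful of lookups of this kind: fetch an object, count its associations of a given type, and list the most recent few.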

It has been built to deal with over a billion reads per second across a data set "of many petabytes," Facebook said. Tao was designed by Facebook to better link together data kept in its main data store (MySQL) and caching layer (memcache), while being able to deal with unpredictable queries on objects.

"The fact Tao is using MySQL is completely hidden away from the client," Facebook director of engineering Venkat Venkataramani tells The Register. "We haven't found anything that is better than MySQL, we are constantly looking at that."

Its API is mapped to a small number of SQL queries, which eases communication with the underlying MySQL database. As Facebook's dataset is too large for a single database, it has instead split the data into logical shards, which are handled by database servers.
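The idea of a small API that maps onto a few SQL queries over sharded data can be sketched as follows. This is a hedged illustration, not Tao's actual schema or routing: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the shard count, the modulo routing rule, the `assoc` table layout, and the function names are all assumptions made for the example.

```python
import sqlite3

NUM_SHARDS = 4  # hypothetical shard count, chosen for the example

# One in-memory SQLite database per logical shard, standing in for
# the MySQL servers described in the article.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute(
        "CREATE TABLE assoc (id1 INTEGER, atype TEXT, id2 INTEGER, time INTEGER)"
    )


def shard_for(object_id):
    """Assumed routing rule: pick a shard from the source object's id."""
    return object_id % NUM_SHARDS


def assoc_add(id1, atype, id2, time):
    """One API call maps to one INSERT on the shard that owns id1."""
    db = shards[shard_for(id1)]
    db.execute("INSERT INTO assoc VALUES (?, ?, ?, ?)", (id1, atype, id2, time))
    db.commit()


def assoc_count(id1, atype):
    """One API call maps to one COUNT query on the shard that owns id1."""
    db = shards[shard_for(id1)]
    row = db.execute(
        "SELECT COUNT(*) FROM assoc WHERE id1 = ? AND atype = ?", (id1, atype)
    ).fetchone()
    return row[0]
```

Because every association is stored on the shard of its source object, a query about one object's likes touches a single database, and the caller never sees which one.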

Tao also has an eventually consistent caching layer, which is sharded on a similar principle and filled with objects, associations, and association counts. The caching layer is crucial for allowing Facebook to speedily load the hundreds of objects and associations that populate any one page on the site.

Because Facebook's dataset is so large, the cache is split into a two-level hierarchy: a few "leader" caches that deal with writes, and a subsidiary "follower" cache that helps with reads. Reads dramatically outnumber writes: Tao typically experiences a billion reads per second versus "millions of writes".
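The split between a write-handling leader and read-serving followers, and the brief staleness that eventual consistency allows, can be shown with a toy model. This is a sketch under stated assumptions rather than Tao's implementation: it uses one leader and two followers, a plain dictionary in place of the database, and a hypothetical `apply_invalidations` step to stand in for messages that a real system would deliver asynchronously.

```python
class Leader:
    """Hypothetical leader cache: the only tier that writes to the database."""

    def __init__(self, db):
        self.db = db          # plain dict standing in for the database
        self.cache = {}
        self.followers = []

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.db.get(key)
        return self.cache[key]

    def write(self, key, value, origin=None):
        self.db[key] = value
        self.cache[key] = value
        # Tell every other follower to drop its copy. Delivery is
        # deferred, so those followers may briefly serve stale data.
        for follower in self.followers:
            if follower is not origin:
                follower.pending_invalidations.append(key)


class Follower:
    """Hypothetical follower cache: serves reads, forwards writes."""

    def __init__(self, leader):
        self.leader = leader
        self.cache = {}
        self.pending_invalidations = []
        leader.followers.append(self)

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.leader.read(key)  # miss: ask the leader
        return self.cache[key]

    def write(self, key, value):
        self.leader.write(key, value, origin=self)
        self.cache[key] = value  # the writer sees its own write at once

    def apply_invalidations(self):
        # Stand-in for asynchronous invalidation messages arriving.
        for key in self.pending_invalidations:
            self.cache.pop(key, None)
        self.pending_invalidations.clear()
```

In this model a follower that did not make the write keeps returning the old value until its invalidations are applied, after which the next read fetches the new value from the leader. That is the trade the article describes: reads stay fast and available at the cost of occasionally stale data.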

Data is cached in such a way that objects and associations have proximity to one another, Venkataramani says. "An important design decision is to keep the locality the system tries to exploit similar to the locality the workload has," he says. "That was one of the fundamental decisions that allowed us to scale."

Barack Obama's Facebook page, for instance, will generate vast numbers of reads at unpredictable times, and so many of Tao's design considerations revolve around guaranteeing read access to objects – hence its adoption of eventual consistency and high availability, over strong consistency and higher latency.

"Nothing before Facebook has seen this kind of workload," Venkataramani says. "When people think of web-scale apps people think of email, but the workload was very different because everyone checks their own email - you're not looking at other emails. The problem is very different when you take a social network because there are extremely high fan-outs."

Though the number of companies likely to deal with data in this way is quite small for now, studying Tao gives insights into the problems a company will run into when it gets really big, and shows that behind the blue and white bazaar of Facebook there's a rather sophisticated underlay.

"As the world moves more and more to cloud and a lot of data is being managed in bigger data centers I think this may be the starting of an era for new backend architectures," Venkataramani says. ®
