Feeds

Facebook reveals TAO - the data store for its social graph

Calmly serves ONE BEEEELION reads per second

Next gen security for virtualised datacentres

USENIX Facebook has revealed details about Tao, its multi-petabyte data store for the company's social graph.

Though Facebook's social network may have little relevance for IT pros, its internal infrastructure does, because here the social network is dealing with quantities of information so vast that it has to come up with new ways to store, compute, and manage the data.

So the publication of details about Tao at the USENIX conference on Wednesday is novel for two reasons: one, it shows off the scale at which future enterprises are going to have to operate, and two, it highlights some of the design methodologies that modern data systems have brought about within sophisticated tech companies.

"A system like TAO is likely to be useful for any application domain that needs to efficiently generate fine-grained customized content from highly interconnected data," Facebook's employees write in the paper. "The application should not expect the data to be stale in the common case, but should be able to tolerate it. Many social networks fit in this category."

Other applications of a system like Tao could be large data sets relating to wildlife populations over time, or other complex systems with many agents whose relationship to one another is defined by a variety of actions. For the tin foil hat-aficionados, Tao would also seem to deal with the problems an intelligence agency would run into when trying to keep tabs on all its citizens.

Tao is a read-optimized data store that is deployed at Facebook as a single geographically distributed instance. It lets Facebook engineers access and write information across the company's "social graph" which stores all information about objects on Facebook (people, brands, comments and such), and associations (likes, pokes, tags).

It has been built to deal with over a billion reads per second across a data set "of many petabytes," Facebook said. Tao was designed by Facebook to better link together data kept in its main data store (MySQL) and caching layer (memcache), while being able to deal with unpredictable queries on objects.

"The fact Tao is using MySQL is completely hidden away from the client," Facebook director of engineering Venkat Venkataramani, tells The Register. "We haven't found anything that is better than MySQL, we are constantly looking at that."

Its API is mapped to a small amount of SQL queries, which ease communication with the underlying MySQL database. As Facebook's dataset is too large for a single database it has instead split data into logical shards which are handled by database servers.

TAO also has an eventually consistent caching layer which is built via a similar principle and filled with objects, associations, and association counts. The caching layer is crucial for allowing Facebook to speedily load the hundreds of objects and associations that populate any one page on the site.

Because Facebook's dataset is so large, the cache is split into a two-level hierarchy of a few "leader" caches which deal with writes and a subsidiary "follower" cache that helps with reads, which dramatically outnumber writes – Tao typically experiences a billion reads per second versus "millions of writes".

Data is cached in such a way that objects and associations have proximity to one another, Venkataramani says. "An important design decision is to keep the locality the system tries to exploit similar to the locality the workload has," he says. "That was one of the fundamental decisions that allowed us to scale."

Barack Obama's Facebook page, for instance, will generate vast numbers of reads at unpredictable times, and so many of Tao's design considerations revolve around guaranteeing read access to objects – hence its adoption of eventual consistency and high availability, over strong consistency and higher latency.

"Nothing before Facebook has seen this kind of workload," Venkataramani says. "When people think of web-scale apps people think of email, but the workload was very different because everyone checks their own email - you're not looking at other emails. The problem is very different when you take a social network because there are extremely high fan-outs."

Though the number of companies likely to deal with data in this way is quite small for now, studying Tao gives insights into the problems a company will run into when it gets really big, and shows that behind the blue and white bazaar of Facebook there's a rather sophisticated underlay.

"As the world moves more and more to cloud and a lot of data is being managed in bigger data centers I think this may be the starting of an era for new backend architectures," Venkataramani says. ®

Secure remote control for conventional and virtual desktops

More from The Register

next story
HP busts out new ProLiant Gen9 servers
Think those are cool? Wait till you get a load of our racks
Shoot-em-up: Sony Online Entertainment hit by 'large scale DDoS attack'
Games disrupted as firm struggles to control network
Like condoms, data now comes in big and HUGE sizes
Linux Foundation lights a fire under storage devs with new conference
Community chest: Storage firms need to pay open-source debts
Samba implementation? Time to get some devs on the job
Silicon Valley jolted by magnitude 6.1 quake – its biggest in 25 years
Did the earth move for you at VMworld – oh, OK. It just did. A lot
Forrester says it's time to give up on physical storage arrays
The physical/virtual storage tipping point may just have arrived
prev story

Whitepapers

5 things you didn’t know about cloud backup
IT departments are embracing cloud backup, but there’s a lot you need to know before choosing a service provider. Learn all the critical things you need to know.
Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Backing up Big Data
Solving backup challenges and “protect everything from everywhere,” as we move into the era of big data management and the adoption of BYOD.
Consolidation: The Foundation for IT Business Transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?