Feeds

Facebook reveals TAO - the data store for its social graph

Calmly serves ONE BEEEELION reads per second

Security for virtualized datacentres

USENIX Facebook has revealed details about Tao, its multi-petabyte data store for the company's social graph.

Though Facebook's social network may have little relevance for IT pros, its internal infrastructure does, because here the social network is dealing with quantities of information so vast that it has to come up with new ways to store, compute, and manage the data.

So the publication of details about Tao at the USENIX conference on Wednesday is novel for two reasons: one, it shows off the scale at which future enterprises are going to have to operate, and two, it highlights some of the design methodologies that modern data systems have brought about within sophisticated tech companies.

"A system like TAO is likely to be useful for any application domain that needs to efficiently generate fine-grained customized content from highly interconnected data," Facebook's employees write in the paper. "The application should not expect the data to be stale in the common case, but should be able to tolerate it. Many social networks fit in this category."

Other applications of a system like Tao could be large data sets relating to wildlife populations over time, or other complex systems with many agents whose relationship to one another is defined by a variety of actions. For the tin foil hat-aficionados, Tao would also seem to deal with the problems an intelligence agency would run into when trying to keep tabs on all its citizens.

Tao is a read-optimized data store that is deployed at Facebook as a single geographically distributed instance. It lets Facebook engineers access and write information across the company's "social graph" which stores all information about objects on Facebook (people, brands, comments and such), and associations (likes, pokes, tags).

It has been built to deal with over a billion reads per second across a data set "of many petabytes," Facebook said. Tao was designed by Facebook to better link together data kept in its main data store (MySQL) and caching layer (memcache), while being able to deal with unpredictable queries on objects.

"The fact Tao is using MySQL is completely hidden away from the client," Facebook director of engineering Venkat Venkataramani, tells The Register. "We haven't found anything that is better than MySQL, we are constantly looking at that."

Its API is mapped to a small amount of SQL queries, which ease communication with the underlying MySQL database. As Facebook's dataset is too large for a single database it has instead split data into logical shards which are handled by database servers.

TAO also has an eventually consistent caching layer which is built via a similar principle and filled with objects, associations, and association counts. The caching layer is crucial for allowing Facebook to speedily load the hundreds of objects and associations that populate any one page on the site.

Because Facebook's dataset is so large, the cache is split into a two-level hierarchy of a few "leader" caches which deal with writes and a subsidiary "follower" cache that helps with reads, which dramatically outnumber writes – Tao typically experiences a billion reads per second versus "millions of writes".

Data is cached in such a way that objects and associations have proximity to one another, Venkataramani says. "An important design decision is to keep the locality the system tries to exploit similar to the locality the workload has," he says. "That was one of the fundamental decisions that allowed us to scale."

Barack Obama's Facebook page, for instance, will generate vast numbers of reads at unpredictable times, and so many of Tao's design considerations revolve around guaranteeing read access to objects – hence its adoption of eventual consistency and high availability, over strong consistency and higher latency.

"Nothing before Facebook has seen this kind of workload," Venkataramani says. "When people think of web-scale apps people think of email, but the workload was very different because everyone checks their own email - you're not looking at other emails. The problem is very different when you take a social network because there are extremely high fan-outs."

Though the number of companies likely to deal with data in this way is quite small for now, studying Tao gives insights into the problems a company will run into when it gets really big, and shows that behind the blue and white bazaar of Facebook there's a rather sophisticated underlay.

"As the world moves more and more to cloud and a lot of data is being managed in bigger data centers I think this may be the starting of an era for new backend architectures," Venkataramani says. ®

Providing a secure and efficient Helpdesk

More from The Register

next story
Docker's app containers are coming to Windows Server, says Microsoft
MS chases app deployment speeds already enjoyed by Linux devs
IBM storage revenues sink: 'We are disappointed,' says CEO
Time to put the storage biz up for sale?
'Hmm, why CAN'T I run a water pipe through that rack of media servers?'
Leaving Las Vegas for Armenia kludging and Dubai dune bashing
Facebook slurps 'paste sites' for STOLEN passwords, sprinkles on hash and salt
Zuck's ad empire DOESN'T see details in plain text. Phew!
Windows 10: Forget Cloudobile, put Security and Privacy First
But - dammit - It would be insane to say 'don't collect, because NSA'
Symantec backs out of Backup Exec: Plans to can appliance in Jan
Will still provide support to existing customers
prev story

Whitepapers

Forging a new future with identity relationship management
Learn about ForgeRock's next generation IRM platform and how it is designed to empower CEOS's and enterprises to engage with consumers.
Why and how to choose the right cloud vendor
The benefits of cloud-based storage in your processes. Eliminate onsite, disk-based backup and archiving in favor of cloud-based data protection.
Three 1TB solid state scorchers up for grabs
Big SSDs can be expensive but think big and think free because you could be the lucky winner of one of three 1TB Samsung SSD 840 EVO drives that we’re giving away worth over £300 apiece.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.