Facebook: Why our 'next-gen' comms ditched MySQL

Cassandra founder plays Ace of HBase

5 things you didn’t know about cloud backup

About a year ago, when Facebook set out to build its email-meets-chat-meets-everything-else messaging system, the company knew its infrastructure couldn't run the thing. "[The Facebook infrastructure] wasn't really ready to handle a bunch of different forms of messaging and have it happen in real time," says Joel Seligstein, a Facebook engineer who worked on the project. So Seligstein and crew mocked up a multifaceted messaging prototype, tossed it onto various distributed storage platforms, and ran a Big Data bake off.

The winner was HBase, the open source distributed database modeled after Google's proprietary BigTable platform. Facebook was already using MySQL for message storage, the open source Cassandra platform for inbox search, and the proprietary Haystack platform for storing photos. But in the company's mind, HBase was better equipped to handle a new-age messaging system that would seek to seamlessly juggle email, chat, and SMS as well as traditional on-site Facebook messages.

Originally built by Powerset – a semantic search outfit now owned by Microsoft – HBase is part of the Apache Hadoop project, a sweeping effort to mimic Google's back-end infrastructure. Like so many other big-name web outfits – including Microsoft, Yahoo!, and Twitter – Facebook uses the core Hadoop MapReduce platform to handle its back-end number-crunching tasks. It now spans a petabyte-scale cluster. "MapReduce is so useful," says Facebook infrastructure guru Karthik Ranganathan. "It's like the air you breathe."

Like Hadoop MapReduce, HBase uses the distributed Hadoop File System (HDFS), and Facebook chose it in part because it has experience scaling and debugging HDFS. But Hadoop isn't the only distributed platform the company uses on the back end. HBase was chosen, the company says, because it suited the task at hand.

The new Facebook messaging system uses Hbase to store the text and metadata for the messages as well as the indices needed to search these messages – an epic amount of data. Even before the new messaging system was rolled out, Facebook was juggling about 15 billion on-site messages a month (about 14 TB) and 120 billion chat messages (11 TB). It needed about three times more space to store all that data, with the average message going to multiple people, and as the new system adds good, old-fashioned email (from @facebook.com addresses), the storage requirements will only expand. HBase doesn't store everything – Haystack handles attachments and overly large messages – but it handles most of the data that's frequently updated.

"The email workload is a write dominated workload. We need to take a lot of writes very quickly," Ranganathan says. "We used HBase for the data the grows very fast, which is essentially the metadata."

Speaking alongside Seligstein at a recent "tech talk" that examined the infrastructure behind the company's new messaging system, unveiled last month, Ranganathan explained that the company chose HBase in part because its "strong consistency model." As it replicates data across myriad machines, he says, HBase is particularly adept at keeping all those replicas synchronized. "If something fails, we want to be able to read the data from somewhere else," Ranganathan says. "Once it's written, it doesn't matter what replica you go to. It's always in sync."

Facebook also picked HBase because it's designed for automatic failover. When a server goes down, it automatically reverts to another – at least in theory. "When you're talking about a lot of data, you're obviously talking about a lot of machines, and when you're talking about a lot of machines, failure is the norm," Ranganathan says. "When failure is the norm, we want to be able to automatically serve that data without too much intervention or it taking too much time."

At the same time, Ranganathan and crew like the way HBase works to prevent failure. As he puts it, when HBase breaks data into pieces and distributes them across a cluster, it puts multiple and completely separate data "shards" on each machine. This means that if one server dies, its work can be picked up several different machines and not just one. In other words: better load balancing.

If you put your data shards on just a few machines and a machine goes down, Ranganathan says, the burden is picked up by a single server, and this may end-up toppling other servers like dominos. "[The second server] dies, and then two machines' slack has to be taken up by a third, and this thing cascades all the way down," he says. "HBase shards the data into a lot of virtual shards and puts multiple shards on a single machine. So a machine dying spreads [its data] across multiple machines and the utilization of your entire cluster goes up."

Ranganathan also likes HBase's LZO (Lempel-Ziv-Oberhumer) data compression – "it's easier on the CPU," he says – and he likes that when you read, modify, and write in HBase, you can expect all readers to get the same values. "This really helps with counting – number of unread messages and things like that," he says.

But HBase did require modification. Facebook engineer Nicolas Spiegelberg and another company developer spent a year adding commits to the platform, mainly in an effort to minimize data loss. "The goal was zero data loss," says Spiegelberg. Committers updated HDFS Sync support, added some ACID properties, and even redesigned the HBase master.

Ranganathan and company used Haystack for attachment and very large messages because they didn't need the same write speed with these beefy files and Haystack could use far fewer servers in replicating the files across data centers. But they ditched MySQL entirely. They felt it couldn't provide quick access to such a large amount of data. "We have a lot of data that's quick and temporal," Seligstein says. "You need access to your first page of messages quite often."

For many, it's surprising that Facebook didn't put the messaging system on Cassandra, which was originally developed at the company and has since become hugely popular across the interwebs. But Ranganathan and others felt that it was too difficult to reconcile Cassandra's "eventual consistency model" with the messaging system setup.

"If you're going to front the data with some sort of a cache...then it becomes very difficult for use to program a cache where the database heals from underneath," Ranganathan says. "For a product like messaging, we needed a strong consistency model [as HBase offers]." If a user sends an email, he says, you have to be able to tell the user – immediately the email was sent. "If it doesn't show that it's been sent, I think 'Oh, I didn't send it,' and then I send it again. Then I come in the next day and I see it's been sent twice, and I get pissed at Facebook."

He also felt that the system needed HBase physical replication as opposed to Cassandra's logical replication. With Cassandra, he says, if something gets corrupted and you lose a disk and you have to restore the data, you have to restore the entire replica. Because Facebook uses very dense machines, this would mean a very long recovery time.

Some Cassandra backers don't see it that way. But Ranganathan is at least worth listening to. He's one of Cassandra's original authors. ®

Gartner critical capabilities for enterprise endpoint backup

More from The Register

next story
Why has the web gone to hell? Market chaos and HUMAN NATURE
Tim Berners-Lee isn't happy, but we should be
Apple promises to lift Curse of the Drained iPhone 5 Battery
Have you tried turning it off and...? Never mind, here's a replacement
'Stop dissing Google or quit': OK, I quit, says Code Club co-founder
And now a message from our sponsors: 'STFU or else'
Microsoft boots 1,500 dodgy apps from the Windows Store
DEVELOPERS! DEVELOPERS! DEVELOPERS! Naughty, misleading developers!
Linux turns 23 and Linus Torvalds celebrates as only he can
No, not with swearing, but by controlling the release cycle
Scratched PC-dispatch patch patched, hatched in batch rematch
Windows security update fixed after triggering blue screens (and screams) of death
This is how I set about making a fortune with my own startup
Would you leave your well-paid job to chase your dream?
prev story


Best practices for enterprise data
Discussing how technology providers have innovated in order to solve new challenges, creating a new framework for enterprise data.
Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Advanced data protection for your virtualized environments
Find a natural fit for optimizing protection for the often resource-constrained data protection process found in virtual environments.
How modern custom applications can spur business growth
Learn how to create, deploy and manage custom applications without consuming or expanding the need for scarce, expensive IT resources.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?