Facebook: Why our 'next-gen' comms ditched MySQL

Cassandra founder plays Ace of HBase

Combat fraud and increase customer satisfaction

About a year ago, when Facebook set out to build its email-meets-chat-meets-everything-else messaging system, the company knew its infrastructure couldn't run the thing. "[The Facebook infrastructure] wasn't really ready to handle a bunch of different forms of messaging and have it happen in real time," says Joel Seligstein, a Facebook engineer who worked on the project. So Seligstein and crew mocked up a multifaceted messaging prototype, tossed it onto various distributed storage platforms, and ran a Big Data bake off.

The winner was HBase, the open source distributed database modeled after Google's proprietary BigTable platform. Facebook was already using MySQL for message storage, the open source Cassandra platform for inbox search, and the proprietary Haystack platform for storing photos. But in the company's mind, HBase was better equipped to handle a new-age messaging system that would seek to seamlessly juggle email, chat, and SMS as well as traditional on-site Facebook messages.

Originally built by Powerset – a semantic search outfit now owned by Microsoft – HBase is part of the Apache Hadoop project, a sweeping effort to mimic Google's back-end infrastructure. Like so many other big-name web outfits – including Microsoft, Yahoo!, and Twitter – Facebook uses the core Hadoop MapReduce platform to handle its back-end number-crunching tasks. It now spans a petabyte-scale cluster. "MapReduce is so useful," says Facebook infrastructure guru Karthik Ranganathan. "It's like the air you breathe."

Like Hadoop MapReduce, HBase uses the distributed Hadoop File System (HDFS), and Facebook chose it in part because it has experience scaling and debugging HDFS. But Hadoop isn't the only distributed platform the company uses on the back end. HBase was chosen, the company says, because it suited the task at hand.

The new Facebook messaging system uses Hbase to store the text and metadata for the messages as well as the indices needed to search these messages – an epic amount of data. Even before the new messaging system was rolled out, Facebook was juggling about 15 billion on-site messages a month (about 14 TB) and 120 billion chat messages (11 TB). It needed about three times more space to store all that data, with the average message going to multiple people, and as the new system adds good, old-fashioned email (from @facebook.com addresses), the storage requirements will only expand. HBase doesn't store everything – Haystack handles attachments and overly large messages – but it handles most of the data that's frequently updated.

"The email workload is a write dominated workload. We need to take a lot of writes very quickly," Ranganathan says. "We used HBase for the data the grows very fast, which is essentially the metadata."

Speaking alongside Seligstein at a recent "tech talk" that examined the infrastructure behind the company's new messaging system, unveiled last month, Ranganathan explained that the company chose HBase in part because its "strong consistency model." As it replicates data across myriad machines, he says, HBase is particularly adept at keeping all those replicas synchronized. "If something fails, we want to be able to read the data from somewhere else," Ranganathan says. "Once it's written, it doesn't matter what replica you go to. It's always in sync."

Facebook also picked HBase because it's designed for automatic failover. When a server goes down, it automatically reverts to another – at least in theory. "When you're talking about a lot of data, you're obviously talking about a lot of machines, and when you're talking about a lot of machines, failure is the norm," Ranganathan says. "When failure is the norm, we want to be able to automatically serve that data without too much intervention or it taking too much time."

At the same time, Ranganathan and crew like the way HBase works to prevent failure. As he puts it, when HBase breaks data into pieces and distributes them across a cluster, it puts multiple and completely separate data "shards" on each machine. This means that if one server dies, its work can be picked up several different machines and not just one. In other words: better load balancing.

If you put your data shards on just a few machines and a machine goes down, Ranganathan says, the burden is picked up by a single server, and this may end-up toppling other servers like dominos. "[The second server] dies, and then two machines' slack has to be taken up by a third, and this thing cascades all the way down," he says. "HBase shards the data into a lot of virtual shards and puts multiple shards on a single machine. So a machine dying spreads [its data] across multiple machines and the utilization of your entire cluster goes up."

Ranganathan also likes HBase's LZO (Lempel-Ziv-Oberhumer) data compression – "it's easier on the CPU," he says – and he likes that when you read, modify, and write in HBase, you can expect all readers to get the same values. "This really helps with counting – number of unread messages and things like that," he says.

But HBase did require modification. Facebook engineer Nicolas Spiegelberg and another company developer spent a year adding commits to the platform, mainly in an effort to minimize data loss. "The goal was zero data loss," says Spiegelberg. Committers updated HDFS Sync support, added some ACID properties, and even redesigned the HBase master.

Ranganathan and company used Haystack for attachment and very large messages because they didn't need the same write speed with these beefy files and Haystack could use far fewer servers in replicating the files across data centers. But they ditched MySQL entirely. They felt it couldn't provide quick access to such a large amount of data. "We have a lot of data that's quick and temporal," Seligstein says. "You need access to your first page of messages quite often."

For many, it's surprising that Facebook didn't put the messaging system on Cassandra, which was originally developed at the company and has since become hugely popular across the interwebs. But Ranganathan and others felt that it was too difficult to reconcile Cassandra's "eventual consistency model" with the messaging system setup.

"If you're going to front the data with some sort of a cache...then it becomes very difficult for use to program a cache where the database heals from underneath," Ranganathan says. "For a product like messaging, we needed a strong consistency model [as HBase offers]." If a user sends an email, he says, you have to be able to tell the user – immediately the email was sent. "If it doesn't show that it's been sent, I think 'Oh, I didn't send it,' and then I send it again. Then I come in the next day and I see it's been sent twice, and I get pissed at Facebook."

He also felt that the system needed HBase physical replication as opposed to Cassandra's logical replication. With Cassandra, he says, if something gets corrupted and you lose a disk and you have to restore the data, you have to restore the entire replica. Because Facebook uses very dense machines, this would mean a very long recovery time.

Some Cassandra backers don't see it that way. But Ranganathan is at least worth listening to. He's one of Cassandra's original authors. ®

3 Big data security analytics techniques

More from The Register

next story
Ubuntu 14.04 LTS: Great changes, but sssh don't mention the...
Why HELLO Amazon! You weren't here last time
This time it's 'Personal': new Office 365 sub covers just two devices
Redmond also brings Office into Google's back yard
Got Windows 8.1 Update yet? Get ready for YET ANOTHER ONE – rumor
Leaker claims big release due this fall as Microsoft herds us into the CLOUD
Next Windows obsolescence panic is 450 days from … NOW!
The clock is ticking louder for Windows Server 2003 R2 users
OpenBSD founder wants to bin buggy OpenSSL library, launches fork
One Heartbleed vuln was too many for Theo de Raadt
Batten down the hatches, Ubuntu 14.04 LTS due in TWO DAYS
Admins dab straining server brows in advance of Trusty Tahr's long-term support landing
Red Hat to ship RHEL 7 release candidate with a taste of container tech
Grab 'near-final' version of next Enterprise Linux next week
Apple inaugurates free OS X beta program for world+dog
Prerelease software now open to anyone, not just developers – as long as you keep quiet
Windows 8.1, which you probably haven't upgraded to yet, ALREADY OBSOLETE
Pre-Update versions of new Windows version will no longer support patches
prev story


Mobile application security study
Download this report to see the alarming realities regarding the sheer number of applications vulnerable to attack, as well as the most common and easily addressable vulnerability errors.
3 Big data security analytics techniques
Applying these Big Data security analytics techniques can help you make your business safer by detecting attacks early, before significant damage is done.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Securing web applications made simple and scalable
In this whitepaper learn how automated security testing can provide a simple and scalable way to protect your web applications.
Combat fraud and increase customer satisfaction
Based on their experience using HP ArcSight Enterprise Security Manager for IT security operations, Finansbank moved to HP ArcSight ESM for fraud management.