Facebook simulated entire data center prior to launch

Zuckerberg's Project Triforce

Before turning on its new custom-built data center in Prineville, Oregon, Facebook simulated the facility inside one of its two existing data center regions. Known as Project Triforce, the simulation was designed to pinpoint places where engineers had unknowingly fashioned the company's back-end services under the assumption that they would run across only two regions, not three or more.

"We now have hundreds of back-end services, and in going to a third data center, we needed to make sure all of them worked," Facebook's Sanjeev Kumar tells The Register. "These services are designed to work with many different data centers. But the trick is that if you haven't tested them on more than two data centers, you may not catch some practices that subtly crept into the system that would cause it to not work."

In the beginning, Facebook served up its site from leased data center space in Northern California. Then, in 2007, it leased additional space in Northern Virginia, spreading the load from the West Coast of the United States to the East. This move, Kumar says, was relatively simple because at the time the Facebook back-end itself was simple: the software stack consisted of a web tier, a caching tier, a MySQL tier, and just a handful of other services. But today, its infrastructure is significantly more complex, and this created additional worries prior to the launch of the Prineville facility, the first data center designed, built, and owned by Facebook itself – and "open sourced" to the rest of the world.

According to Kumar, Facebook didn't have the option of individually testing each service for use across a third data center region. It needed to test all services concurrently, with real user data, before the third region actually went live. "The number of components in our infrastructure meant that testing each independently would be inadequate: it would be difficult to have confidence that we had full test coverage of all components, and unexpected interactions between components wouldn’t be tested," Kumar writes in a post to the Facebook engineering blog.

"This required a more macro approach – we needed to test the entire infrastructure in an environment that resembled the Oregon data center as closely as possible."

In an effort to improve throughput, the company was also moving to a new MySQL setup that used FlashCache – the open source Linux block cache – and since this required running two MySQL instances on each machine, the company needed to test changes to its software stack as well.
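
The article doesn't spell out the exact layout, but a minimal sketch of what that pairing might look like on a single host is below – a FlashCache device created with the stock flashcache_create tool, plus two mysqld processes split by port and data directory. The device paths, ports, and directories are illustrative assumptions, not Facebook's actual configuration.

```python
# Hypothetical sketch: provision one FlashCache-backed volume and start two
# MySQL instances on a single host. Paths and ports are assumptions.
import subprocess

SSD = "/dev/sdb"          # flash device used as the block cache (assumption)
DISK = "/dev/sdc"         # spinning disk holding the MySQL data (assumption)
CACHE_NAME = "mysqlcache"

def create_flashcache():
    # flashcache_create maps the SSD as a cache in front of the disk
    # ("back" = write-back mode); the result appears under /dev/mapper.
    subprocess.run(
        ["flashcache_create", "-p", "back", CACHE_NAME, SSD, DISK],
        check=True,
    )

def start_mysql_instances():
    # Two mysqld processes on the same machine, separated by port,
    # socket, and data directory.
    for port, datadir in [(3306, "/data/mysql1"), (3307, "/data/mysql2")]:
        subprocess.Popen([
            "mysqld_safe",
            f"--datadir={datadir}",
            f"--port={port}",
            f"--socket={datadir}/mysql.sock",
        ])

if __name__ == "__main__":
    create_flashcache()
    start_mysql_instances()
```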

So, Kumar and his team commandeered a cluster within one of the company's Virginia data centers and used it to simulate the new Prineville facility. This data center simulation spanned tens of thousands of machines, and it was tested with live Facebook traffic. "A cluster of thousands of machines was the smallest thing we could use to serve production traffic, and production traffic had to be used to ensure it hit all the use cases," Kumar tells us.

To facilitate the creation of its simulated data center, the company built a new software suite known as Kobold, which automated the configuration of each machine. "Kobold gives our cluster deployment team the ability to build up and tear down clusters quickly, conduct synthetic load and power tests without impacting user traffic, and audit our steps along the way," Kumar says. The entire cluster was online within 30 days, and it started serving production traffic within 60 days.
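
Kobold itself isn't public, so the following is only a hypothetical sketch of what that kind of build-up, synthetic-load, and tear-down automation might look like. The class and method names are invented for illustration and are not Facebook's API.

```python
# Hypothetical illustration of Kobold-style cluster automation; none of these
# names come from Facebook's actual tooling.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    role: str          # e.g. "web", "cache", "mysql"

class ClusterBuilder:
    def __init__(self, hosts):
        self.hosts = hosts

    def build_up(self):
        # Configure every machine for its role before any user traffic arrives.
        for host in self.hosts:
            print(f"configuring {host.name} as {host.role}")

    def synthetic_load_test(self):
        # Exercise the cluster with generated traffic so power draw and
        # throughput can be audited without touching real users.
        print(f"driving synthetic load against {len(self.hosts)} hosts")

    def tear_down(self):
        # Return machines to the general pool once the test cluster is done.
        for host in self.hosts:
            print(f"wiping and releasing {host.name}")

if __name__ == "__main__":
    cluster = ClusterBuilder([Host("web001", "web"), Host("db001", "mysql")])
    cluster.build_up()
    cluster.synthetic_load_test()
    cluster.tear_down()
```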

The only thing the company didn't replicate was Prineville's MySQL database layer. This would have meant buying a whole new set of physical machines – the company uses a different breed of machine for MySQL than for other pieces of its back-end – and it couldn't justify the cost. The machines in the simulation cluster will eventually be folded back into the everyday operations of the Virginia data center region, but at this point, the region has all the MySQL machines it needs.

The simulation began in October, about three months before the Prineville data center was turned on. Since the simulation was even further from the company's Northern California data center than Prineville is, it duplicated the inter-data center latency the company would experience – and then some. "The latency we needed to worry about was between Prineville and Northern California," Kumar says. "Any issue you might see in Prineville, you would definitely see in Virginia." The latency between Prineville and Northern California is about 10 to 20 milliseconds (one way), and between Virginia and Northern California is roughly 70 milliseconds.
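
A rough back-of-the-envelope check, using the one-way figures quoted above, shows why Virginia made a conservative stand-in: any service that makes sequential cross-region calls accrues more delay from Virginia than it ever would from Prineville. The round-trip counts in the sketch below are illustrative assumptions, not measured Facebook figures.

```python
# Back-of-the-envelope latency comparison using the one-way figures quoted
# in the article; the number of sequential round trips is an assumption.
PRINEVILLE_ONE_WAY_MS = 20   # upper end of the 10-20 ms range
VIRGINIA_ONE_WAY_MS = 70

def added_delay(one_way_ms, round_trips):
    # Each sequential cross-region call costs a full round trip.
    return 2 * one_way_ms * round_trips

for trips in (1, 3, 5):
    print(f"{trips} round trip(s): "
          f"Prineville ~{added_delay(PRINEVILLE_ONE_WAY_MS, trips)} ms, "
          f"Virginia ~{added_delay(VIRGINIA_ONE_WAY_MS, trips)} ms")
```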

And yes, there were cases where the company's services were ill-suited to a trio of data centers, but these were relatively minor problems. "None of the problems we discovered involved a software component that was fundamentally coded for only two data centers and needed major changes," Kumar explains. "These were all situations where, say, someone just didn't realize that service A depended on service B."

This sort of thing shouldn't crop up again as the company moves from three data centers to four and beyond. "One to two is a big change, and two to four is a big change," Kumar says. "But after that, it tends not to be a big deal."

Plus, the Virginia simulation is still up and running. In essence, Facebook has already tested its infrastructure across a fourth data center. ®
