Facebook simulated entire data center prior to launch

Zuckerberg's Project Triforce

Before turning on its new custom-built data center in Prineville, Oregon, Facebook simulated the facility inside one of its two existing data center regions. Known as Project Triforce, the simulation was designed to pinpoint places where engineers had unknowingly built the company's back-end services under the assumption that they would run across only two regions, not three or more.

"We now have hundreds of back-end services, and in going to a third data center, we needed to make sure all of them worked," Facebook's Sanjeev Kumar tells The Register. "These services are designed to work with many different data centers. But the trick is that if you haven't tested them on more than two data centers, you may not catch some practices that subtly crept into the system that would cause it to not work."

In the beginning, Facebook served up its site from leased data center space in Northern California. Then, in 2007, it leased additional space in Northern Virginia, spreading the load from the West Coast of the United States to the East. This move, Kumar says, was relatively simple because at the time, the Facebook back-end was relatively simple. Facebook's software stack consisted of a web tier, a caching tier, a MySQL tier, and just a handful of other services. But today, its infrastructure is significantly more complex, and this created additional worries prior to the launch of the Prineville facility, the first data center designed, built, and owned by Facebook itself – and "open sourced" to the rest of the world.

According to Kumar, Facebook didn't have the option of individually testing each service for use across a third data center region. It needed to test all services concurrently, with real user data, before the third region actually went live. "The number of components in our infrastructure meant that testing each independently would be inadequate: it would be difficult to have confidence that we had full test coverage of all components, and unexpected interactions between components wouldn’t be tested," Kumar writes in a post to the Facebook engineering blog.

"This required a more macro approach – we needed to test the entire infrastructure in an environment that resembled the Oregon data center as closely as possible."

In an effort to improve throughput, the company was also moving to a new MySQL setup that used FlashCache – the open source Linux block cache – and since this required the use of two MySQL instances on each machine, the company needed to test changes to its software stack as well.
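
For illustration only, here is a hedged sketch of what two MySQL instances per host might look like at the configuration level, with each instance's data directory sitting on a FlashCache-backed mount. The ports, paths and mysqld_multi-style layout are assumptions, not a description of Facebook's actual setup:

```python
# Hedged sketch: generate [mysqldN] stanzas for two MySQL instances on one
# host, each with its datadir on a FlashCache-backed mount. Ports, paths and
# mount names are assumptions; mysqld_multi (a standard MySQL tool) can start
# multiple instances from a config file laid out this way.

from textwrap import dedent

def mysqld_stanza(instance: int, port: int, datadir: str) -> str:
    return dedent(f"""\
        [mysqld{instance}]
        port    = {port}
        socket  = /var/run/mysqld/mysqld{instance}.sock
        datadir = {datadir}
    """)

if __name__ == "__main__":
    # Each datadir sits on a mount backed by a flashcache device (SSD in
    # front of spinning disk), so both instances benefit from the cache.
    print(mysqld_stanza(1, 3306, "/data/flashcache1/mysql"))
    print(mysqld_stanza(2, 3307, "/data/flashcache2/mysql"))
```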

So, Kumar and his team commandeered a cluster within one of the company's Virginia data centers and used it to simulate the new Prineville facility. This data center simulation spanned tens of thousands of machines, and it was tested with live Facebook traffic. "A cluster of thousands of machines was the smallest thing we could use to serve production traffic, and production traffic had to be used to ensure it hit all the use cases," Kumar tells us.
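
One common way to put real users onto a new cluster is to hash them into a deterministic slice and route that slice to the new machines. The sketch below is an assumption-laden illustration of that idea – the cluster names, hashing scheme and five per cent figure are invented, not Facebook's:

```python
# Hedged sketch: hash users into a deterministic slice and send that slice's
# requests to the simulation cluster so it serves real production traffic.
# The cluster names, hashing scheme and percentage are invented.

import hashlib

SIMULATION_PERCENT = 5  # share of users served by the simulated "Prineville" cluster

def in_simulation_slice(user_id: str) -> bool:
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < SIMULATION_PERCENT

def route(user_id: str) -> str:
    # Everyone else keeps hitting the established Virginia cluster.
    return "virginia-simulation" if in_simulation_slice(user_id) else "virginia-prod"

if __name__ == "__main__":
    for uid in (f"user{n}" for n in range(20)):
        print(uid, "->", route(uid))
```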

To facilitate the creation of its simulated data center, the company built a new software suite known as Kobold, which automated the configuration of each machine. "Kobold gives our cluster deployment team the ability to build up and tear down clusters quickly, conduct synthetic load and power tests without impacting user traffic, and audit our steps along the way," Kumar says. The entire cluster was online within 30 days, and it started serving production traffic within 60 days.
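
Kobold's interface isn't public, but the workflow Kumar describes – build up, load-test, audit, tear down – looks roughly like the hypothetical sketch below, in which every class and method name is invented:

```python
# Hypothetical sketch of the kind of workflow a tool like Kobold automates:
# build a cluster up, run synthetic load/power tests with audit logging, and
# tear it back down. The API here is invented for illustration.

import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("cluster-deploy")

class ClusterDeployment:
    def __init__(self, name: str, machines: list[str]):
        self.name = name
        self.machines = machines

    def build_up(self) -> None:
        for host in self.machines:
            # In a real tool this would image the machine, push configs,
            # and register it with service discovery.
            log.info("configured %s for cluster %s", host, self.name)

    def synthetic_load_test(self) -> None:
        # Drive artificial requests so power draw and throughput can be
        # measured without touching user traffic.
        log.info("ran synthetic load/power test on %s", self.name)

    def tear_down(self) -> None:
        for host in self.machines:
            log.info("released %s back to the pool", host)

if __name__ == "__main__":
    cluster = ClusterDeployment("virginia-simulation", [f"host{i:04d}" for i in range(3)])
    cluster.build_up()
    cluster.synthetic_load_test()
    cluster.tear_down()
```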

The only thing the company didn't replicate was Prineville's MySQL database layer. This would have meant buying a whole new set of physical machines – the company uses a different breed of machines for MySQL than for other pieces of its back-end – and it couldn't justify the cost. The machines in the simulation cluster will eventually be folded back into the everyday operations of the Virginia data center region, but at this point, the region has all the MySQL machines it needs.

The simulation began in October, about three months before the Prineville data center was turned on. Since the simulation was even further from the company's Northern California data center than Prineville is, it duplicated the inter-data center latency the company would experience – and then some. "The latency we needed to worry about was between Prineville and Northern California," Kumar says. "Any issue you might see in Prineville, you would definitely see in Virginia." The latency between Prineville and Northern California is about 10 to 20 milliseconds (one way), while between Virginia and Northern California it is roughly 70 milliseconds.
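
A back-of-the-envelope calculation shows why the Virginia stand-in is a conservative test: a request chain that makes sequential cross-region calls pays the one-way latency several times over, and pays far more of it at 70 milliseconds than at 15. The call count below is an assumed figure chosen purely for illustration; the latencies are the ones quoted above:

```python
# Back-of-the-envelope numbers: one-way latencies of roughly 10-20 ms
# (Prineville to Northern California) and roughly 70 ms (Virginia to
# Northern California). The number of sequential calls is assumed.

ONE_WAY_MS = {
    "Prineville -> Northern California": 15,
    "Virginia -> Northern California": 70,
}

SEQUENTIAL_CALLS = 4  # assumed dependent cross-region round trips per request

for path, one_way in ONE_WAY_MS.items():
    round_trip = 2 * one_way
    total = SEQUENTIAL_CALLS * round_trip
    print(f"{path}: {round_trip} ms per round trip, {total} ms over {SEQUENTIAL_CALLS} calls")
```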

And yes, there were cases when the company's services were ill-suited for a trio of data centers, but these were relatively minor problems. "None of the problems we discovered involved a software component that was fundamentally coded for only two data centers and needed major changes," Kumar explains. "These were all situations where, say, someone just didn't realize that service A depended on service B."
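
Catching that sort of oversight amounts to walking a declared dependency graph and flagging anything a service needs that isn't yet deployed in the new region. A minimal sketch, with an invented set of services and dependencies:

```python
# Hedged sketch: walk a declared dependency graph and flag anything a service
# needs (directly or transitively) that isn't deployed in the new region.
# The services and edges are invented for illustration.

DEPENDENCIES = {
    "service_a": ["service_b", "cache"],
    "service_b": ["mysql"],
    "cache": [],
    "mysql": [],
}

DEPLOYED_IN_NEW_REGION = {"service_a", "cache", "mysql"}  # service_b was overlooked

def missing_dependencies(service: str) -> set[str]:
    missing, seen, stack = set(), set(), list(DEPENDENCIES.get(service, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if dep not in DEPLOYED_IN_NEW_REGION:
            missing.add(dep)
        stack.extend(DEPENDENCIES.get(dep, []))
    return missing

if __name__ == "__main__":
    for svc in sorted(DEPLOYED_IN_NEW_REGION):
        gaps = missing_dependencies(svc)
        if gaps:
            print(f"{svc} is not ready in the new region: missing {sorted(gaps)}")
```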

This sort of thing shouldn't crop up again as the company moves from three data centers to four and beyond. "One to two is a big change, and two to four is a big change," Kumar says. "But after that, it tends not to be a big deal."

Plus, the Virginia simulation is still up and running. In essence, Facebook has already tested its infrastructure across a fourth data center. ®
