Facebook simulated entire data center prior to launch
Zuckerberg's Project Triforce
Before turning on its new custom-built data center in Prineville, Oregon, Facebook simulated the facility inside one of its two existing data center regions. Known as Project Triforce, the simulation was designed to pinpoint places where engineers had unknowingly fashioned the company's back-end services under the assumption that they would run across only two regions, not three or more.
"We now have hundreds of back-end services, and in going to a third data center, we needed to make sure all of them worked," Facebook's Sanjeev Kumar tells The Register. "These services are designed to work with many different data centers. But the trick is that if you haven't tested them on more than two data centers, you may not catch some practices that subtly crept into the system that would cause it to not work."
In the beginning, Facebook served up its site from leased data center space in Northern California. Then, in 2007, it leased additional space in Northern Virginia, spreading the load from the West Coast of the United States to the East. This move, Kumar says, was relatively simple because at the time, the Facebook back-end was relatively simple. Facebook's software stack consisted of a web tier, a caching tier, a MySQL tier, and just a handful of other services. But today, its infrastructure is significantly more complex, and this created additional worries prior to the launch of the Prineville facility, the first data center designed, built, and owned by Facebook itself – and "open sourced" to the rest of the world.
According to Kumar, Facebook didn't have the option of individually testing each service for use across a third data center region. It needed to test all services concurrently, with real user data, before the third region actually went live. "The number of components in our infrastructure meant that testing each independently would be inadequate: it would be difficult to have confidence that we had full test coverage of all components, and unexpected interactions between components wouldn’t be tested," Kumar writes in a post to the Facebook engineering blog.
"This required a more macro approach – we needed to test the entire infrastructure in an environment that resembled the Oregon data center as closely as possible."
In an effort to improve throughput, the company was also moving to a new MySQL setup that used the FlashCache – the open source Linux block cache – and since this required the use of two MySQL instances on each machine, the company needed to test changes to its software stack as well.
So, Kumar and his team commandeered a cluster within one of Facebook's Virginia data centers and used it to simulate the new Prineville facility. This data center simulation spanned tens of thousands of machines, and it was tested with live Facebook traffic. "A cluster of thousands of machines was the smallest thing we could use to serve production traffic, and production traffic had to be used to ensure it hit all the use cases," Kumar tells us.
To facilitate the creation of its simulated data center, the company built a new software suite known as Kobold, which automated the configuration of each machine. "Kobold gives our cluster deployment team the ability to build up and tear down clusters quickly, conduct synthetic load and power tests without impacting user traffic, and audit our steps along the way," Kumar says. The entire cluster was online within 30 days, and it started serving production traffic within 60 days.
The only thing the company didn't replicate was Prineville's MySQL database layer. This would have meant buying a whole new set of physical machines – the company uses a different breed of machine for MySQL than for other pieces of its back-end – and it couldn't justify the cost. The machines in the simulation cluster will eventually be folded back into the everyday operations of the Virginia data center region, but at this point, the region has all the MySQL machines it needs.
The simulation began in October, about three months before the Prineville data center was turned on. Since the simulation was even further from the company's Northern California data center than Prineville is, it duplicated the inter-data center latency the company would experience – and then some. "The latency we needed to worry about was between Prineville and Northern California," Kumar says. "Any issue you might see in Prineville, you would definitely see in Virginia." The latency between Prineville and Northern California is about 10 to 20 milliseconds (one way), and between Virginia and Northern California is roughly 70 milliseconds.
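The logic here can be made concrete with a back-of-the-envelope check using the figures from the article. The names, the timeout budget, and the helper function below are hypothetical illustrations, not anything from Facebook's actual tooling; only the latency figures come from the piece.

```python
# One-way cross-coast latencies (milliseconds), as quoted in the article.
# Virginia's latency to Northern California exceeds Prineville's, so the
# Virginia simulation is a conservative stand-in: any latency-triggered
# failure that would surface in Prineville surfaces in Virginia too.
ONE_WAY_MS = {"prineville": 20, "virginia": 70}

def would_timeout(region: str, budget_ms: float, round_trips: int = 1) -> bool:
    """Return True if cross-coast round trips alone exceed a service's
    (hypothetical) latency budget when served from the given region."""
    return 2 * ONE_WAY_MS[region] * round_trips > budget_ms

# A service allowing 100 ms for one cross-coast round trip would be fine
# from Prineville (40 ms) but would blow its budget from Virginia (140 ms),
# so the simulation flags it before Prineville ever goes live.
print(would_timeout("prineville", 100))  # False
print(would_timeout("virginia", 100))    # True
```

This is the sense in which "any issue you might see in Prineville, you would definitely see in Virginia": the simulated region strictly dominates the real one on latency.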
And yes, there were cases when the company's services were ill-suited for a trio of data centers, but these were relatively minor problems. "None of the problems we discovered involved a software component that was fundamentally coded for only two data centers and needed major changes," Kumar explains. "These were all situations where, say, someone just didn't realize the service A depended on service B."
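The class of bug Kumar describes – an undeclared assumption that a dependency is co-located – is the kind of thing a simple audit over a dependency graph can surface. The sketch below is purely illustrative; the service names and the dependency map are invented, and nothing suggests Facebook's tooling works this way.

```python
# Hypothetical dependency map: which services each service calls.
DEPENDS_ON = {
    "newsfeed": {"cache", "ranker"},
    "ranker": {"cache"},
    "cache": set(),
}

def missing_dependencies(deployed: set[str]) -> dict[str, set[str]]:
    """For each service deployed in a region, report any dependencies
    that are absent from that region -- the 'service A depended on
    service B' surprises described in the article."""
    return {
        svc: DEPENDS_ON[svc] - deployed
        for svc in deployed
        if DEPENDS_ON[svc] - deployed
    }

# A new region brought up without the ranker exposes newsfeed's
# hidden dependency on it.
print(missing_dependencies({"newsfeed", "cache"}))
# {'newsfeed': {'ranker'}}
```

In practice, of course, the whole point of Project Triforce was that such a static audit wasn't trusted to be complete – only serving production traffic through the simulated region would shake out every assumption.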
This sort of thing shouldn't crop up again as the company moves from three data centers to four and beyond. "One to two is a big change, and two to three is a big change," Kumar says. "But after that, it tends not to be a big deal."
Plus, the Virginia simulation is still up and running. In essence, Facebook has already tested its infrastructure across a fourth data center. ®
Another thing that I see with CS guys straight from Uni is that you've got to school them in the way things are done in business. You see incredibly piss-poor treatment of students, who these days are customers paying many thousands of pounds, by their IT / IS services departments. My partner was at Oxford a few years back and their webmail would regularly be taken down in the middle of the day for upgrade work. Having never worked in business, she didn't really understand that this isn't normal and shouldn't be put up with. Now move that to a new graduate going into business, and there just isn't the uptime and customer service ethos drummed into you at Uni that you really need to hit the ground running.
Re: Up to a point.
Of course, hiring the RIGHT developers to start with is usually cheaper than hiring the wrong developers and then throwing hardware at the problem this causes...
the importance of planning for growth
I have a dev right now who is telling me we can just throw hardware at a design we are building. My old database and design experience tells me no amount of hardware makes up for sloppy code; it follows the 2nd Law of Thermodynamics. We are bringing in an architect and a performance analyst.