Salesforce's data-center design: 'Go for web scale, and build it out of s**t!'
Oracle sits at the core, but after that things get ugly - and cheap!
RICON West 2013 If you're building systems to run at a large scale, then rather than waste time and money trying to avoid any failure, you need to suck it up and accept that faults will happen – and make sure you've got enough cheap gear to recover.
So says Salesforce, which argued that this strategy can save you a bunch of cash you'd otherwise spend on expensive hardware, and makes it easier for your applications to survive catastrophes.
In a candid keynote speech at the RICON West distributed systems conference on Tuesday, Salesforce architect and former Amazon infrastructure brain Pat Helland talked up Salesforce's internal "Keystone" system: this technology lets the company provide greater backup and replication capabilities for data stored in Oracle without having to spend Oracle prices on supporting infrastructure.
"You need to have enterprise-trust with web-class resilience," Helland said. "Failures are normal, mean time between failure makes failures common. You have to expect things to break so you have to layer it with immutable data*.
"The ideal design approach is 'web scale and I want to build it out of shit'."
Salesforce's Keystone system takes data from Oracle and then layers it on top of a set of cheap infrastructure running on commodity servers, Helland explained. The Oracle technology gives Salesforce confidence and consistency, he said, while the secondary layer of commodity systems and open-source software gives the company greater flexibility and a cheaper way of providing storage infrastructure.
What's inside Keystone
Keystone consists of several clusters of storage servers running 10 four-terabyte drives and two 750GB SSDs apiece, he explained. Salesforce has a preference for buying "the shittiest SSDs money can buy," he said, then designing systems to route around failures. Keystone's storage underbelly consists of clusters of tens to hundreds of nodes, possibly scaling up to thousands.
The design methodology behind Keystone comes from two techniques named ROC - recovery-oriented computing - and SOFT - storage over flaky technologies. ROC comes from some research done at the University of California at Berkeley in the mid-2000s and is about designing systems to rapidly recover from failures, while SOFT is Helland's own term for building storage systems that offer guarantees despite running on cheap-as-chips hardware.
'You get better system behavior if you assume everything is a hunk of crap'
Salesforce is designing systems using these techniques so that it can better deal with the flakiness of commodity infrastructure, without having to upgrade to more expensive systems, he said. "Storage servers may crash and they can lie with their frickin' teeth. You get better system behavior if you assume everything is a hunk of crap."
Keystone has four main elements: a Catalog for keeping track of data, a Store for storing it, a Vault for long-term storage, and a Pump for shuffling data between systems over WAN.
The Catalog provides an intermediary between storage systems, such as Oracle, and secondary storage systems built on commodity hardware. Primary storage can point to Keystone, which then points to secondary systems, making it easy to shift the location of secondary data without having to fiddle with the primary system.
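The indirection described here can be illustrated with a minimal sketch. Nothing below comes from Salesforce: the class, the key format, and the location strings are all hypothetical, and the only point being shown is that the primary record holds a stable key while the Catalog alone knows where the secondary copy lives.

```python
# Sketch of the indirection idea only; class and field names are hypothetical.


class Catalog:
    """Maps stable keys held by primary storage to secondary locations."""

    def __init__(self) -> None:
        self._locations: dict[str, str] = {}

    def register(self, key: str, location: str) -> None:
        self._locations[key] = location

    def relocate(self, key: str, new_location: str) -> None:
        # Only the catalog entry changes; primary storage is never touched.
        if key not in self._locations:
            raise KeyError(key)
        self._locations[key] = new_location

    def resolve(self, key: str) -> str:
        return self._locations[key]


catalog = Catalog()
catalog.register("blob-0001", "cluster-a/node-17")

# The primary record stores only the stable key, never the physical location.
primary_row = {"id": 42, "attachment_key": "blob-0001"}

before = catalog.resolve(primary_row["attachment_key"])
catalog.relocate("blob-0001", "cluster-b/node-03")
after = catalog.resolve(primary_row["attachment_key"])

print(before, "->", after)
```

The primary row is identical before and after the move, which is the property the article attributes to the Catalog.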
The Store keeps hold of the data that is fed into the Catalog, and sees Salesforce adopt the design approach pioneered by major companies – such as Google, Facebook, and Amazon – of using large quantities of low-cost hardware to provide backend storage while achieving good guarantees and reliability through a software layer.
This design approach has one glaring problem: failures. These happen a lot, Helland explained. About 4 per cent or more of SATA hard drives fail each year, he said, so a data center with 1,200 of Salesforce's storage-stuffed servers, at 10 drives apiece, will lose 480 of its 12,000 drives every 12 months. Storage therefore needs to be triple replicated to deal with these failures. Servers, meanwhile, fail at a shade under one per cent per year, he explained, so replicas must also be kept at a distance from one another, in separate racks.
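The failure arithmetic can be checked with a few lines of Python. This is a back-of-envelope sketch using only the figures quoted in the talk; the variable names are illustrative, and the server failure rate is rounded up to a flat one per cent as an assumption.

```python
# Back-of-envelope failure estimate using the figures quoted in the talk.
# All names here are illustrative; none come from Salesforce's systems.

SERVERS = 1_200            # storage servers in the example data center
DRIVES_PER_SERVER = 10     # ten 4TB SATA drives per server
DRIVE_AFR = 0.04           # about 4 per cent of drives fail per year
SERVER_AFR = 0.01          # assumption: "a shade under" 1 per cent, rounded up


def expected_failures(units: int, annual_failure_rate: float) -> float:
    """Expected number of failed units per year for a fleet."""
    return units * annual_failure_rate


total_drives = SERVERS * DRIVES_PER_SERVER
drive_failures = expected_failures(total_drives, DRIVE_AFR)
server_failures = expected_failures(SERVERS, SERVER_AFR)

print(f"drives in fleet:        {total_drives}")
print(f"drive failures / year:  {drive_failures:.0f}")
print(f"drive failures / day:   {drive_failures / 365:.2f}")
print(f"server failures / year: {server_failures:.0f}")
```

At these rates the fleet loses more than one drive a day on average, which is why recovery has to be routine rather than exceptional.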
This brings about problems with maintaining consistency when pulling from a replica, and so requires a caching layer to store the locations of the data and keep everything up to date. This is built on cheap consumer-grade SSDs, he said, which have awful reliability.
"Consumer-grade SSDs will just die - the device stops less than 1 percent per year," he said. They also wear out after about 3,000 write cycles, so with a 750GB SSD you can probably write 2.25PB to it before it fails, he said.
They also sometimes suffer from bit rot: in his experience, the cheapest possible devices produce one uncorrected bit error for every 10^14 bits (11.3TB) written. By comparison, top-of-the-line enterprise-class SSDs have a rot rate of one bit in every 10^19 (1.08EB), but they are much more expensive.
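The SSD figures are simple unit conversions, sketched below with illustrative names. The sketch assumes error rates of one bit per 10^14 and one per 10^19 bits and treats 750GB as decimal gigabytes; it also suggests that the quoted 11.3TB and 1.08EB are close to the binary-unit (TiB and EiB) values rather than the decimal ones.

```python
# Unit conversions for the SSD figures quoted in the talk.
# Names are illustrative only.

SSD_CAPACITY_BYTES = 750 * 10**9   # 750GB, decimal gigabytes
WRITE_CYCLES = 3_000               # quoted wear-out limit


def bytes_between_errors(bits_per_error: int) -> int:
    """Bytes written, on average, per uncorrected bit error."""
    return bits_per_error // 8


endurance_bytes = SSD_CAPACITY_BYTES * WRITE_CYCLES
consumer = bytes_between_errors(10**14)
enterprise = bytes_between_errors(10**19)

# Expected uncorrected bit errors over the cheap drive's whole write lifetime.
lifetime_errors = endurance_bytes // consumer

print(f"endurance:  {endurance_bytes / 10**15:.2f} PB")
print(f"consumer:   {consumer / 10**12:.1f} TB = {consumer / 2**40:.2f} TiB")
print(f"enterprise: {enterprise / 10**18:.2f} EB = {enterprise / 2**60:.2f} EiB")
print(f"lifetime errors on a consumer SSD: {lifetime_errors}")
```

On these numbers a consumer drive written to its wear limit would be expected to return corrupt bits well over a hundred times, so the software above it cannot take any read on trust.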
Modern software needs to be built to fail, he said, because if you design it in a monolithic, interlinked manner, then a simple hardware brownout can ripple through the entire system and take you offline.
"If everything in the system can break it's more robust if it does break. If you run around and nobody knows what happens when it breaks then you don't have a robust system," he said. ®
* Salesforce is able to use its cheap'n'cheerful infrastructure and still offer strong guarantees to users because Keystone exclusively relies on immutable data structures, Helland said in a chat with El Reg today.
This means once a block of data is stored in the system, it is given a unique ID number and then split into fragments smaller than a megabyte, which each have their own IDs. These fragments are then run through a cyclic-redundancy check (CRC) algorithm that calculates a simple fingerprint from the contents of each fragment.
This metadata is combined by Keystone into a naming scheme that can detect changes to the data, whether by bit rot or other failures: if the contents of a fragment are altered, a subsequently calculated fingerprint will differ from the original value, which will alert the system.
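A minimal sketch of the fragment-and-fingerprint idea follows, using Python's standard zlib.crc32. It is not Keystone's actual scheme: the 512KB fragment size, the UUID block IDs, and the "block/index/crc" name format are all assumptions made for illustration.

```python
import uuid
import zlib
from dataclasses import dataclass

# Assumption: any fragment size under 1MB would fit the description.
FRAGMENT_SIZE = 512 * 1024


@dataclass(frozen=True)  # frozen: a stored fragment is never modified
class Fragment:
    block_id: str
    index: int
    crc: int
    data: bytes

    @property
    def name(self) -> str:
        # Hypothetical naming scheme: the fingerprint is part of the name.
        return f"{self.block_id}/{self.index:06d}/{self.crc:08x}"

    def is_intact(self) -> bool:
        # Recompute the CRC and compare it with the value recorded at write time.
        return zlib.crc32(self.data) == self.crc


def store_block(payload: bytes) -> list[Fragment]:
    """Split a block into fragments, each with its own ID and CRC."""
    block_id = uuid.uuid4().hex
    chunks = (
        payload[off:off + FRAGMENT_SIZE]
        for off in range(0, len(payload), FRAGMENT_SIZE)
    )
    return [
        Fragment(block_id, i, zlib.crc32(chunk), chunk)
        for i, chunk in enumerate(chunks)
    ]


fragments = store_block(b"x" * (2 * FRAGMENT_SIZE + 10))

# Simulate bit rot: flip one bit in a copy of the first fragment's data.
good = fragments[0]
rotted = Fragment(
    good.block_id, good.index, good.crc,
    bytes([good.data[0] ^ 0x01]) + good.data[1:],
)

print(len(fragments), good.is_intact(), rotted.is_intact())
```

Because the data never changes after it is written, any mismatch between the recorded and recomputed fingerprint can only mean corruption, so the reader can discard that copy and fetch another replica.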
By using such techniques, Salesforce is able to tolerate inexpensive but troublesome hardware in its backend, increase overall reliability for its users, and give its programmers more freedom when writing distributed apps.
"What we do that is special is we make it look just like the classic enterprise IT while we're doing the cloud," Helland said.