Salesforce's data-center design: 'Go for web scale, and build it out of s**t!'

Oracle sits at the core, but after that things get ugly - and cheap!

RICON West 2013 If you're building systems to run at a large scale, then rather than waste time and money trying to avoid any failure, you need to suck it up and accept that faults will happen – and make sure you've got enough cheap gear to recover.

So says Salesforce, which argued that this strategy can save you a bunch of cash you'd otherwise spend on expensive hardware, and makes it easier for your applications to survive catastrophes.

In a candid keynote speech at the RICON West distributed systems conference on Tuesday, Salesforce architect and former Amazon infrastructure brain Pat Helland talked up Salesforce's internal "Keystone" system: this technology lets the company provide greater backup and replication capabilities for data stored in Oracle without having to spend Oracle prices on supporting infrastructure.

"You need to have enterprise-trust with web-class resilience," Helland said. "Failures are normal, mean time between failure makes failures common. You have to expect things to break so you have to layer it with immutable data*.

"The ideal design approach is 'web scale and I want to build it out of shit'."

Salesforce's Keystone system takes data from Oracle and layers it on top of cheap infrastructure running on commodity servers, Helland explained. The Oracle technology gives Salesforce confidence and consistency, he said, while the secondary layer of commodity systems and open-source software gives the company greater flexibility and a cheaper way of providing storage infrastructure.

What's inside Keystone

Keystone consists of several clusters of storage servers, each running 10 four-terabyte drives and two 750GB SSDs, he explained. Salesforce has a preference for buying "the shittiest SSDs money can buy," he said, then designing systems to route around failures. Keystone's storage underbelly consists of clusters of tens to hundreds of nodes, possibly scaling up to thousands.

The design methodology behind Keystone comes from two techniques named ROC (recovery-oriented computing) and SOFT (storage over flaky technologies). ROC comes from research done at the University of California at Berkeley in the mid-2000s and is about designing systems to rapidly recover from failures, while SOFT is Helland's own term for building storage systems that offer guarantees despite their cheap-as-chips hardware.

'You get better system behavior if you assume everything is a hunk of crap'

Salesforce is designing systems using these techniques so that it can better deal with the flakiness of commodity infrastructure, without having to upgrade to more expensive systems, he said. "Storage servers may crash and they can lie with their frickin' teeth. You get better system behavior if you assume everything is a hunk of crap."

Keystone has four main elements: a Catalog for keeping track of data, a Store for storing it, a Vault for long-term storage, and a Pump for shuffling data between systems over the WAN.

The Catalog provides an intermediary between primary storage systems, such as Oracle, and secondary storage systems built on commodity hardware. Primary storage can point to Keystone, which then points to secondary systems, making it easy to shift the location of secondary data without having to fiddle with the primary system.
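The indirection described here can be pictured with a small sketch. This is not Salesforce's code: the `Catalog` class, its methods, the `blob_ref` field, and the cluster names are all hypothetical, chosen only to show why a stable reference in primary storage lets secondary data move freely.

```python
from typing import Dict


class Catalog:
    """Hypothetical lookup table: stable object ID -> current secondary location."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, object_id: str, location: str) -> None:
        """Record where an object's secondary copy currently lives."""
        self._locations[object_id] = location

    def relocate(self, object_id: str, new_location: str) -> None:
        """Move an object; only the catalog entry changes."""
        if object_id not in self._locations:
            raise KeyError(f"unknown object: {object_id}")
        self._locations[object_id] = new_location

    def resolve(self, object_id: str) -> str:
        """Return the current location for a stable object ID."""
        return self._locations[object_id]


catalog = Catalog()
catalog.register("obj-0001", "store-cluster-a")

# The primary system (Oracle, in the article) holds only the stable ID.
primary_row = {"account": "Acme", "blob_ref": "obj-0001"}

before = catalog.resolve(primary_row["blob_ref"])
catalog.relocate("obj-0001", "store-cluster-b")
after = catalog.resolve(primary_row["blob_ref"])
print(before, "->", after)
```

The row in the primary system is never rewritten during the move, which is the property the Catalog is said to provide.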

The Store keeps hold of the data that is fed into the Catalog, and sees Salesforce adopt the design approach pioneered by major companies such as Google, Facebook, and Amazon: using large quantities of low-cost hardware to provide backend storage, while achieving good guarantees and reliability through a software layer.

This design approach has one glaring problem: failures. These happen a lot, Helland explained. About 4 percent or more of SATA hard drives fail each year, he said, so a data center with 1,200 of Salesforce's storage-stuffed servers, or 12,000 drives at 10 per server, will lose 480 drives every 12 months. Storage therefore needs to be triple replicated to deal with these failures. Servers, meanwhile, fail at a shade under one percent per year, he explained, so data must also be replicated at a distance, beyond the rack infrastructure that holds the original.
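The 480-drive figure follows from the numbers quoted in the talk. As a back-of-the-envelope check, assuming 10 drives per server as described earlier and treating the "shade under one percent" server failure rate as an upper bound of 1 percent:

```python
# Figures as quoted in the article; the 1 percent server rate is an upper bound.
SERVERS = 1_200
DRIVES_PER_SERVER = 10
ANNUAL_DRIVE_FAILURE_RATE = 0.04
ANNUAL_SERVER_FAILURE_RATE = 0.01


def expected_annual_failures(units: int, annual_failure_rate: float) -> float:
    """Expected number of failed units per year at a constant failure rate."""
    return units * annual_failure_rate


total_drives = SERVERS * DRIVES_PER_SERVER
drive_failures = expected_annual_failures(total_drives, ANNUAL_DRIVE_FAILURE_RATE)
server_failures = expected_annual_failures(SERVERS, ANNUAL_SERVER_FAILURE_RATE)

print(f"drives in the data center: {total_drives}")
print(f"expected drive failures per year: {drive_failures:.0f}")
print(f"expected server failures per year (upper bound): {server_failures:.0f}")
```

That works out to more than one dead drive a day, plus up to a dozen whole-server losses a year.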

This brings about problems with maintaining consistency when pulling from a replica, and so requires a caching layer that stores the locations of the data and keeps everything up to date. That caching layer is built on cheap consumer-grade SSDs, he said, which have awful reliability.

"Consumer-grade SSDs will just die - the device stops less than 1 percent per year," he said. They also wear out after about 3,000 write cycles, so a 750GB SSD can probably absorb about 2.25PB of writes before it fails, he said.
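The endurance figure is simply capacity multiplied by write cycles. A quick check, assuming decimal units (1GB = 10^9 bytes, 1PB = 10^15 bytes) and that every cycle rewrites the full drive:

```python
# Figures as quoted in the article; decimal units are an assumption.
SSD_CAPACITY_BYTES = 750 * 10**9
WRITE_CYCLES = 3_000
BYTES_PER_PB = 10**15

lifetime_write_bytes = SSD_CAPACITY_BYTES * WRITE_CYCLES
lifetime_write_pb = lifetime_write_bytes / BYTES_PER_PB

print(f"lifetime writes: {lifetime_write_pb} PB")
```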

They also sometimes suffer from bit rot. In his experience with the cheapest possible systems, this amounts to one uncorrected bit error for every 10^14 bits (11.3TB) written. By comparison, top-of-the-line enterprise-class SSDs have a rot rate of one bit in every 10^19 (1.08EB), but they are much more expensive.
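The byte figures in parentheses make sense if the error rates are read as one bit in 10^14 and one bit in 10^19, and the sizes as binary units (2^40 and 2^60 bytes). Both readings are assumptions made here, not statements from the talk. The sketch below also estimates, under those assumptions, how many uncorrected errors a consumer SSD might see across the 2.25PB of lifetime writes mentioned above.

```python
BITS_PER_BYTE = 8
TIB = 2**40  # assumption: the article's "TB" is a binary terabyte
EIB = 2**60  # assumption: the article's "EB" is a binary exabyte

CONSUMER_BITS_PER_ERROR = 10**14
ENTERPRISE_BITS_PER_ERROR = 10**19


def bytes_per_error(bits_per_error: int) -> float:
    """Bytes written, on average, per uncorrected bit error."""
    return bits_per_error / BITS_PER_BYTE


consumer_tib = bytes_per_error(CONSUMER_BITS_PER_ERROR) / TIB
enterprise_eib = bytes_per_error(ENTERPRISE_BITS_PER_ERROR) / EIB

# Derived estimate: errors over a 750GB drive's 3,000 full write cycles.
lifetime_write_bits = 750 * 10**9 * 3_000 * BITS_PER_BYTE
expected_errors = lifetime_write_bits / CONSUMER_BITS_PER_ERROR

print(f"consumer: one error per {consumer_tib:.2f} TiB written")
print(f"enterprise: one error per {enterprise_eib:.2f} EiB written")
print(f"expected uncorrected errors over drive life: {expected_errors:.0f}")
```

On these assumptions a cheap drive would return bad bits many times before it wears out, which is why the design has to detect corruption in software.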

Modern software needs to be built to fail, he says, because if you design it in a monolithic, interlinked manner, then a simple hardware brownout can ripple through the entire system and take you offline.

"If everything in the system can break it's more robust if it does break. If you run around and nobody knows what happens when it breaks then you don't have a robust system," he says. ®


* Salesforce is able to use its cheap'n'cheerful infrastructure and still offer strong guarantees to users because Keystone exclusively relies on immutable data structures, Helland said in a chat with El Reg today.

This means once a block of data is stored in the system, it is given a unique ID number and then split into fragments smaller than a megabyte, which each have their own IDs. These fragments are then run through a cyclic-redundancy check (CRC) algorithm that calculates a simple fingerprint from the contents of each fragment.

This metadata is combined by Keystone into a naming scheme that can detect changes to the data, whether by bit rot or other failures: if the contents of a fragment are altered, a subsequently calculated fingerprint will differ from the original value, which will alert the system.
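A minimal sketch of the idea, not Keystone's actual implementation: the 512KB fragment size, the `block_id/index` naming, and the use of CRC-32 via Python's `zlib.crc32` are all assumptions made for illustration. The article says only that fragments are under a megabyte and fingerprinted with a CRC.

```python
import zlib
from dataclasses import dataclass
from typing import List

FRAGMENT_SIZE = 512 * 1024  # assumption: any size under 1MB would do


@dataclass(frozen=True)  # frozen: a stored fragment is never modified
class Fragment:
    fragment_id: str
    data: bytes
    crc: int


def split_into_fragments(block: bytes, block_id: str,
                         size: int = FRAGMENT_SIZE) -> List[Fragment]:
    """Split a block into fixed-size fragments, each with its own ID and CRC."""
    fragments = []
    for index, offset in enumerate(range(0, len(block), size)):
        chunk = block[offset:offset + size]
        fragments.append(Fragment(f"{block_id}/{index}", chunk, zlib.crc32(chunk)))
    return fragments


def is_intact(fragment: Fragment) -> bool:
    """Recompute the fingerprint and compare it with the stored one."""
    return zlib.crc32(fragment.data) == fragment.crc


block = bytes(range(256)) * 5000  # 1,280,000 bytes of sample data
fragments = split_into_fragments(block, "block-42")

# Simulate bit rot: flip one bit but keep the originally recorded CRC.
good = fragments[0]
rotted = Fragment(good.fragment_id, bytes([good.data[0] ^ 0x01]) + good.data[1:], good.crc)

print(len(fragments), is_intact(good), is_intact(rotted))
```

Because data is never updated in place, a mismatch can only mean corruption, and the system can fetch another replica instead of trusting the damaged copy.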

By using such techniques, Salesforce is able to tolerate inexpensive but troublesome hardware in its backend, increase overall reliability for its users, and give its programmers more freedom when writing distributed apps.

"What we do that is special is we make it look just like the classic enterprise IT while we're doing the cloud," Helland said.
