Salesforce's data-center design: 'Go for web scale, and build it out of s**t!'

Oracle sits at the core, but after that things get ugly - and cheap!

RICON West 2013 If you're building systems to run at a large scale, then rather than waste time and money trying to avoid any failure, you need to suck it up and accept that faults will happen – and make sure you've got enough cheap gear to recover.

So says Salesforce, which argued that this strategy can save you a bunch of cash you'd otherwise spend on expensive hardware, and makes it easier for your applications to survive catastrophes.

In a candid keynote speech at the RICON West distributed systems conference on Tuesday, Salesforce architect and former Amazon infrastructure brain Pat Helland talked up Salesforce's internal "Keystone" system: this technology lets the company provide greater backup and replication capabilities for data stored in Oracle without having to spend Oracle prices on supporting infrastructure.

"You need to have enterprise-trust with web-class resilience," Helland said. "Failures are normal, mean time between failure makes failures common. You have to expect things to break so you have to layer it with immutable data*.

"The ideal design approach is 'web scale and I want to build it out of shit'."

Salesforce's Keystone system takes data from Oracle and layers it on top of cheap infrastructure running on commodity servers, Helland explained. The Oracle technology gives Salesforce confidence and consistency, he said, while the secondary layer of commodity systems and open-source software gives the company greater flexibility and a cheaper way of providing storage infrastructure.

What's inside Keystone

Keystone consists of several clusters of storage servers, each packing 10 four-terabyte drives and two 750GB SSDs, he explained. Salesforce has a preference for buying "the shittiest SSDs money can buy," he said, then designing systems to route around failures. Keystone's storage underbelly consists of clusters of tens to hundreds of nodes, possibly scaling up to thousands, he explained.

The design methodology behind Keystone comes from two techniques: ROC (recovery-oriented computing) and SOFT (storage over flaky technologies). ROC comes out of research done at the University of California, Berkeley in the mid-2000s and is about designing systems to recover rapidly from failures, while SOFT is Helland's own term for building storage systems that offer guarantees despite running on cheap-as-chips hardware.

'You get better system behavior if you assume everything is a hunk of crap'

Salesforce is designing systems using these techniques so that it can better deal with the flakiness of commodity infrastructure, without having to upgrade to more expensive systems, he said. "Storage servers may crash and they can lie with their frickin' teeth. You get better system behavior if you assume everything is a hunk of crap."

Keystone has four main elements: a Catalog for keeping track of data, a Store for storing it, a Vault for long-term storage, and a Pump for shuffling data between systems over the WAN.

The Catalog acts as an intermediary between primary storage systems, such as Oracle, and the secondary storage systems built on commodity hardware. Primary storage can point to Keystone, which then points to the secondary systems, making it easy to shift the location of secondary data without having to fiddle with the primary system.
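To make that indirection concrete, here is a minimal Python sketch; the class and method names are invented for illustration and are not Salesforce's actual API.

```python
# Illustrative sketch only: a toy catalog that decouples primary records
# from the physical location of their secondary copies. All names here
# (KeystoneCatalog, register, relocate) are invented for this example.
class KeystoneCatalog:
    def __init__(self):
        # data_id -> list of (node, path) locations of secondary copies
        self._locations = {}

    def register(self, data_id, locations):
        """Primary storage only needs to remember the data_id; the catalog
        owns the mapping to wherever the secondary copies actually live."""
        self._locations[data_id] = list(locations)

    def lookup(self, data_id):
        return self._locations.get(data_id, [])

    def relocate(self, data_id, new_locations):
        """Secondary copies can be moved to different nodes without the
        primary (Oracle) side ever having to change its pointer."""
        self._locations[data_id] = list(new_locations)


catalog = KeystoneCatalog()
catalog.register("blob-42", [("node-a", "/data/blob-42"), ("node-b", "/data/blob-42")])
catalog.relocate("blob-42", [("node-c", "/data/blob-42")])
print(catalog.lookup("blob-42"))   # [('node-c', '/data/blob-42')]
```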

The Store keeps hold of the data tracked by the Catalog, and sees Salesforce adopt the design approach pioneered by major companies such as Google, Facebook, and Amazon: using large quantities of low-cost hardware to provide backend storage, while achieving good guarantees and reliability through a software layer.
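As a very rough illustration of that software layer, the sketch below writes each piece of data to several unreliable nodes and simply routes around any that fail. The Node class, its failure rate, and the replication factor of three (mentioned below) are assumptions for the example, not Keystone internals.

```python
# Toy sketch: get reliability out of flaky boxes by writing every item to
# several of them. Node, put() and REPLICATION_FACTOR are illustrative.
import random

REPLICATION_FACTOR = 3   # the article notes storage is triple-replicated


class Node:
    """Stand-in for a commodity storage server that may fail at any time."""
    def __init__(self, name):
        self.name = name
        self.blobs = {}

    def put(self, fragment_id, data):
        if random.random() < 0.01:            # simulate a flaky box
            raise IOError(f"{self.name} is down")
        self.blobs[fragment_id] = data


def replicated_write(nodes, fragment_id, data):
    """Write to REPLICATION_FACTOR distinct nodes, skipping any that fail."""
    written = 0
    for node in random.sample(nodes, len(nodes)):
        try:
            node.put(fragment_id, data)
            written += 1
        except IOError:
            continue                           # route around the failure
        if written == REPLICATION_FACTOR:
            return True
    return False                               # not enough healthy nodes
```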

This design approach has one glaring problem: failures. These happen a lot, Helland explained. About 4 per cent or more of SATA hard drives fail each year, he said, so a data center with 1,200 of Salesforce's storage-stuffed servers (12,000 drives in all) will lose around 480 drives every 12 months. Storage therefore needs to be triple-replicated to cope with these failures. Servers themselves fail at a shade under one per cent per year, he explained, so data must also be replicated across racks, well away from any single rack's infrastructure.
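Those figures are easy to sanity-check with a quick back-of-the-envelope calculation using the numbers quoted above:

```python
# Back-of-the-envelope check of the failure figures quoted above.
servers = 1_200
drives_per_server = 10
annual_drive_failure_rate = 0.04              # "about 4 per cent or more"

total_drives = servers * drives_per_server    # 12,000 drives
failed_drives_per_year = total_drives * annual_drive_failure_rate
print(failed_drives_per_year)                 # 480.0 - more than one a day
```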

This brings problems with maintaining consistency when pulling from a replica, and so requires a caching layer to store the locations of the data and keep everything up to date. This layer is built on cheap consumer-grade SSDs, he said, which have awful reliability.

"Consumer-grade SSDs will just die - the device stops less than 1 percent per year," he said. They also wear out after about 3,000 write cycles, so with a 750GB SSD you can probably write 2.25PBs to it before it fails, he said.

They also sometimes suffer from bit rot, which in his experience with the cheapest possible drives means one uncorrected bit error for every 10^14 bits (about 11.3TB) written. By comparison, top-of-the-line enterprise-class SSDs have a rot rate of one bit in every 10^19 (1.08EB), but they are much more expensive.
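The wear-out and bit-rot arithmetic can likewise be reproduced from the figures above (the quoted 11.3TB and 1.08EB read as binary tebibytes and exbibytes):

```python
# Reproducing the SSD wear and bit-rot arithmetic quoted above.
ssd_capacity_gb = 750
write_cycles = 3_000
lifetime_writes_pb = ssd_capacity_gb * write_cycles / 1_000_000
print(lifetime_writes_pb)                     # 2.25 PB before wear-out


# One uncorrected bit error per 10^14 bits (cheap) vs 10^19 bits (enterprise).
def bits_to_tib(bits):
    return bits / 8 / 2**40                   # bits -> bytes -> tebibytes


print(bits_to_tib(10**14))                    # ~11.37 TiB, the 11.3TB above
print(bits_to_tib(10**19) / 2**20)            # ~1.08 EiB between errors
```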

Modern software needs to be built to fail, he said, because if you design it in a monolithic, interlinked manner, then a simple hardware brownout can ripple through the entire system and take you offline.

"If everything in the system can break it's more robust if it does break. If you run around and nobody knows what happens when it breaks then you don't have a robust system," he says. ®

Bootnote

* Salesforce is able to use its cheap'n'cheerful infrastructure and still offer strong guarantees to users because Keystone exclusively relies on immutable data structures, Helland said in a chat with El Reg today.

This means once a block of data is stored in the system, it is given a unique ID number and then split into fragments smaller than a megabyte, each of which gets its own ID. These fragments are then run through a cyclic-redundancy check (CRC) algorithm that calculates a simple fingerprint from the contents of each fragment.
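A hypothetical sketch of that fragment-and-fingerprint step, using Python's zlib CRC32; the fragment size, ID scheme, and function name are assumptions, not Keystone's actual code:

```python
import uuid
import zlib

# Hypothetical sketch of the fragment-and-fingerprint step described above.
FRAGMENT_SIZE = 1024 * 1024 - 1   # keep each fragment under a megabyte


def fragment_and_fingerprint(blob: bytes):
    """Split an immutable blob into sub-megabyte fragments, assign each
    an ID and record a CRC32 fingerprint of its contents."""
    block_id = uuid.uuid4().hex
    fragments = []
    for offset in range(0, len(blob), FRAGMENT_SIZE):
        chunk = blob[offset:offset + FRAGMENT_SIZE]
        fragments.append({
            "fragment_id": f"{block_id}-{offset // FRAGMENT_SIZE}",
            "crc32": zlib.crc32(chunk),
        })
    return block_id, fragments
```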

This metadata is combined by Keystone into a naming scheme that can detect changes to the data, whether from bit rot or other failures: if the contents of a fragment are altered, a subsequently calculated fingerprint will differ from the original value, which will alert the system.
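Continuing the sketch above, detection then just means recomputing the fingerprint on read and falling back to another replica when it doesn't match:

```python
import zlib

# Continuation of the sketch above: verify a fragment's CRC32 fingerprint on
# read, falling back to another replica if the stored copy has rotted.
def read_with_repair(fragment_meta, fetchers):
    """fetchers: callables, one per replica, each returning the raw bytes."""
    for fetch in fetchers:
        chunk = fetch()
        if zlib.crc32(chunk) == fragment_meta["crc32"]:
            return chunk                       # fingerprint matches: clean copy
    raise IOError(f"all replicas of {fragment_meta['fragment_id']} failed CRC")
```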

By using such techniques, Salesforce is able to tolerate inexpensive but troublesome hardware in its backend, increase overall reliability for its users, and give its programmers more freedom when writing distributed apps.

"What we do that is special is we make it look just like the classic enterprise IT while we're doing the cloud," Helland said.
