Google: Servers are DIMM witted

Servers in the wild have a touch of Alzheimer's

Heat and stress testing of computer components in the lab does not necessarily predict how those components will behave in the field, according to a study done by Google.

When you are Google, and you have millions of server nodes in production using a mix of different technologies, you can actually study component failures with a statistically significant sample. That is what Google has done, tracking memory failures in a subset of its servers over the past two and a half years.

Google techies Eduardo Pinheiro and Wolf-Dietrich Weber and their collaborator, Bianca Schroeder of the University of Toronto, have produced a research paper on the subject, entitled DRAM Errors in the Wild: A Large-Scale Field Study. In it, they point out that the number of soft errors seen in the field - where error correction algorithms fix the flipped bits and keep a server running - is lower than lab tests would lead you to expect. This is good. But the number of hard errors - such as when bits get stuck, a machine crashes, and you need to replace a memory module - is a lot higher than current lab tests from memory and server makers might suggest.

Google tracked memory errors on six different server platforms in its data centres from January 2006 through June 2008. Three of the six platforms had hardware memory scrubbing technology that washes single-bit soft errors out of the memory system at a rate of about 1GB every 45 minutes, according to Google. The other three platforms lacked such memory scrubbing electronics, which means soft single-bit errors can accumulate and turn into multi-bit errors.
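
Why accumulation matters is easy to see in a toy simulation. The sketch below (plain Python, with invented error rates purely for illustration, not Google's numbers) models single-bit flips landing at random in ECC-protected words: standard single-bit-correcting ECC can fix one bad bit per word, so a word is only lost when a second flip lands before a scrub pass clears the first.

```python
import random

# Toy model: single-bit soft errors land at random in ECC words.
# Single-bit-correcting ECC fixes one bad bit per word, so a word is
# lost only when a second flip arrives before a scrub clears the first.
# All rates below are invented for illustration - not Google's figures.

WORDS = 1_000_000          # number of ECC-protected words in the module
FLIPS_PER_HOUR = 50        # invented soft-error arrival rate
HOURS = 24 * 365           # simulate one year

def simulate(scrub_interval_hours):
    random.seed(42)
    flipped = set()        # words currently holding one corrected bad bit
    uncorrectable = 0
    for hour in range(HOURS):
        for _ in range(FLIPS_PER_HOUR):
            w = random.randrange(WORDS)
            if w in flipped:
                uncorrectable += 1   # second flip in the same word
            else:
                flipped.add(w)
        if scrub_interval_hours and hour % scrub_interval_hours == 0:
            flipped.clear()          # scrub pass rewrites corrected data
    return uncorrectable

print("no scrubbing:", simulate(0))
print("daily scrub: ", simulate(24))
print("hourly scrub:", simulate(1))
```

Run it and the unscrubbed module racks up orders of magnitude more multi-bit hits than the hourly-scrubbed one - the birthday-paradox effect the scrubbing hardware exists to head off.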

Google would not say how many machines were in the sample, saying only that over the 30-month study the sample racked up "many millions" of DIMM-days. The servers in the sample used a mix of 1 GB, 2 GB, and 4 GB DIMMs across DDR1, DDR2, and FB-DIMM memory types. Google does not discuss what processor architecture it uses, but there is little doubt that most - if not all - of its machines are x64 (with perhaps a few still plain x86).

Google ran a monitor program that logged correctable errors, uncorrectable errors, CPU utilization, temperature, and memory allocation, so it could look for relationships among them.
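
Google's collector is in-house code, but the same raw signals are visible on any reasonably modern Linux box: the kernel's EDAC subsystem exports per-memory-controller error counters under /sys/devices/system/edac. A minimal polling sketch in the same spirit might look like this (the paths are standard Linux interfaces; the 60-second interval and the choice of sensors are our assumptions, not Google's tooling):

```python
#!/usr/bin/env python3
"""Minimal DIMM-error poller: logs EDAC error counters alongside load
and temperature. A sketch against stock Linux interfaces, not Google's tool."""

import glob
import time

def read_int(path):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None      # counter or sensor not present on this box

def sample():
    row = {"ts": time.time()}
    # Correctable/uncorrectable error counts per memory controller (EDAC).
    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        name = mc.rsplit("/", 1)[-1]
        row[name + "_ce"] = read_int(mc + "/ce_count")
        row[name + "_ue"] = read_int(mc + "/ue_count")
    # One-minute load average as a cheap stand-in for CPU utilization.
    with open("/proc/loadavg") as f:
        row["load1"] = float(f.read().split()[0])
    # First thermal zone, millidegrees C (which sensor this maps to varies).
    t = read_int("/sys/class/thermal/thermal_zone0/temp")
    row["temp_c"] = t / 1000.0 if t is not None else None
    return row

if __name__ == "__main__":
    while True:
        print(sample(), flush=True)
        time.sleep(60)   # assumed polling interval
```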

One of the interesting bits is that Google discovered that some servers are just plain crankier than others, which is something that system administrators can attest to, even for identical machines. "Some machines develop a very large number of correctable errors compared to others," the authors of the study write. "We find that for all platforms, 20 per cent of the machines with errors make up more than 90 per cent of all observed errors for that platform."
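
That 20/90 split is a Pareto-style concentration measure, and it is trivial to compute from per-machine error counts. A quick sketch (with synthetic, deliberately long-tailed counts standing in for the fleet logs, which are not public):

```python
# What share of all errors comes from the worst 20% of erroring machines?
# Counts are synthetic - drawn from a heavy-tailed Pareto distribution.
import random

random.seed(0)
counts = sorted((int(random.paretovariate(1.2)) for _ in range(10_000)),
                reverse=True)

top20 = counts[: max(1, len(counts) // 5)]   # worst 20% of erroring machines
share = sum(top20) / sum(counts)
print(f"top 20% of machines account for {share:.0%} of all errors")
```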

Across all the server platforms Google tested, 8.2 per cent of memory modules were hit by correctable errors, and a DIMM that is on the blink averages almost 4,000 correctable errors per year. Some of the six server types Google monitored had much higher error rates than others, but the reasons why were not obvious.

"There is not one memory technology that is clearly superior to the others when it comes to error behaviour," the authors write. So that isn't it. Whatever the problem is, it was not attributable to different memory manufacturers - Google couldn't find any correlation between who made the memory and error rates. Pinheiro, Weber, and Schroeder speculate that higher memory error rates are caused by DIMM layout and differences in the error correction algorithms used by different memory makers.

Interestingly, the platforms that did not have chipkill error correction - which can recover from multi-bit errors in the memory subsystem - had lower correctable error rates, but those servers could not survive multi-bit errors. Clearly, there is some kind of tradeoff here. Google's research also suggests that more powerful error correction (chipkill versus normal single-bit-correcting ECC) can reduce unrecoverable error rates by a factor of 4 to 10.

The point is, memory error rates on servers are much higher than the lab tests done to date might suggest. Depending on the server platform, Google said it saw per-DIMM correctable error rates that work out to something on the order of 25,000 to 75,000 failures in time (FIT - that is, failures per billion hours of operation) per Mbit. Prior lab tests (which use higher utilization or temperature to simulate longer stretches of time) showed failure rates of between 200 and 5,000 FIT per Mbit. That is a huge difference, and you can see why Google invented its Google File System and does its massive clustering on the cheap.
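
The FIT arithmetic is worth seeing once, since it is pure unit conversion. Assuming a 1GB DIMM (8,192 Mbit - the module size is our assumption) clocking the article's ballpark 4,000 correctable errors a year, a back-of-the-envelope check lands comfortably inside Google's quoted 25,000 to 75,000 FIT per Mbit range:

```python
# Convert an observed per-DIMM error rate into FIT per Mbit.
# FIT = failures (here: correctable errors) per billion device-hours.
errors_per_year = 4000          # article's ballpark for an erroring DIMM
dimm_mbit = 1 * 1024 * 8        # assumed 1GB DIMM = 8,192 Mbit
hours_per_year = 365.25 * 24    # ~8,766

errors_per_hour = errors_per_year / hours_per_year
fit_per_dimm = errors_per_hour * 1e9        # scale to a billion hours
fit_per_mbit = fit_per_dimm / dimm_mbit

print(f"{fit_per_mbit:,.0f} FIT per Mbit")  # ~55,700: inside 25k-75k
```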

The other interesting finding in the research, and one that system admins will nod their heads at almost immediately, is that the number of correctable errors increases as memory modules age, with error rates spiking after 10 to 18 months in the field. The incidence of uncorrectable errors goes down over time, however, as crappy components are replaced and the hardy ones remain in the systems.

Google's research also suggests that faster and denser memory technologies have had no appreciable effect on memory error rates, contrary to what many server vendors and customers have feared - hence the invention of chipkill to compensate. And while higher temperatures can push error rates up, the effect is not as large as many would think. Instead, error rates are strongly correlated with utilization rates on the DIMMs. Temperature is not the biggest cause of stress - swapping data in and out is. ®
