Google: Servers are DIMM witted

Servers in the wild have a touch of Alzheimer's


The heat and stress testing of computer components in the lab does not necessarily bear out how components will behave in the field, according to a study done by Google.

When you are Google, and you have millions of server nodes in production using a mix of different technology, you can actually study component failures with a statistically significant sample. That is what Google has done, tracking memory failures in a subset of its servers over the past two and a half years.

Google techies Eduardo Pinheiro and Wolf-Dietrich Weber and their collaborator, Bianca Schroeder of the University of Toronto, have produced a research paper on the subject, entitled DRAM Errors in the Wild: A Large-Scale Field Study. In it, they point out that the number of soft errors - where error correction algorithms can keep a server running after fixing the memory errors - is lower than you might expect in the field based on lab tests. This is good. But the number of hard errors - such as when bits get stuck and a machine crashes and you need to replace a memory module - is a lot higher than current lab tests from memory and server makers might suggest.

Google ran its memory crash tests on six different server platforms in its data centres from January 2006 through June 2008. Three of the six platforms had hardware memory scrubbing technologies that allowed single-bit soft errors to be washed out of memory systems, at a rate of about 1 GB every 45 minutes, according to Google. The other three platforms didn't have such memory scrubbing electronics, which means soft single-bit errors can accumulate and turn into multi-bit errors.
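To put that scrubbing rate in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the scrubber works through memory at a steady 1 GB per 45 minutes, as quoted above. The function name and the per-server memory sizes are hypothetical, chosen only for illustration, since Google does not say how much memory each machine carries.

```python
def full_scrub_hours(memory_gb, gb_per_pass=1.0, minutes_per_pass=45.0):
    """Hours needed to scrub `memory_gb` of memory once at the quoted rate.

    Assumes a constant scrub rate (an idealisation, not a measured figure).
    """
    return memory_gb / gb_per_pass * minutes_per_pass / 60.0


# Hypothetical per-server memory sizes, not figures from the study.
for size_gb in (4, 8, 16):
    print(f"{size_gb} GB: {full_scrub_hours(size_gb):.1f} hours per full pass")
```

On those assumptions, a single-bit error could sit in memory for hours before the scrubber comes round again, which is the window in which a second error in the same word would turn it into a multi-bit error.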

Google would not say how many machines were in the sample, but rather said that in the 30-month study, the sample had an aggregate of "many millions" of DIMM-days. The servers in the sample used a mix of 1 GB, 2 GB, and 4 GB DIMMs, and DDR1, DDR2, and FB-DIMM memory types. Google does not discuss what processor architecture it uses, but there is little doubt that most - if not all - of Google's machines use the x64 architecture (with maybe some still being 32-bit x86).

Google had a monitor program that logged correctable errors, uncorrectable errors, CPU utilization, temperature, and memory allocation to see what the relationships were.

One of the interesting bits is that Google discovered that some servers are just plain crankier than others, which is something that system administrators can attest to, even for identical machines. "Some machines develop a very large number of correctable errors compared to others," the authors of the study write. "We find that for all platforms, 20 per cent of the machines with errors make up more than 90 per cent of all observed errors for that platform."
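A sysadmin with per-machine error logs could check for the same skew in their own fleet. The Python sketch below computes the share of all errors produced by the worst 20 per cent of machines. The function name `top_share` and the error counts are made up for illustration. They are not data from the Google study.

```python
def top_share(error_counts, top_fraction=0.2):
    """Fraction of all errors produced by the worst `top_fraction` of machines."""
    if not error_counts:
        raise ValueError("need at least one machine")
    ranked = sorted(error_counts, reverse=True)         # worst offenders first
    top_n = max(1, round(len(ranked) * top_fraction))   # always count at least one machine
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


# Hypothetical correctable-error counts for ten machines that logged errors.
counts = [5200, 3100, 40, 25, 12, 9, 6, 4, 3, 1]
print(f"Top 20% of machines account for {top_share(counts):.1%} of errors")
```

If the result for a platform comes out well above 0.2, a handful of cranky boxes are doing most of the damage, which is the pattern the study describes.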

Across all server platforms tested by Google and all of their DIMMs, 8.2 per cent of the memory modules have correctable errors and an average DIMM has almost 4,000 correctable errors per year, if it is on the blink. Some of the server types among the six that Google monitored had much higher error rates than others, but the reasons why were not obvious.

"There is not one memory technology that is clearly superior to the others when it comes to error behaviour," the authors write. So that isn't it. Whatever the problem is, it was not attributable to different memory manufacturers - Google couldn't find any correlation between who made the memory and error rates. Pinheiro, Weber, and Schroeder speculate that higher memory error rates are caused by DIMM layout and differences in the error correction algorithms used by different memory makers.

Interestingly, the platforms that did not have chipkill error correction - which can recover from multiple bit errors in memory subsystems - had lower correctable error rates, but their servers could not survive multi-bit errors. Clearly, there is some kind of tradeoff here. But Google's research also suggests that more powerful error correction (chipkill versus normal ECC scrubbing) can reduce unrecoverable error rates by a factor of 4 to 10.

The point is, memory error rates on servers are much higher than the lab tests done to date might suggest. Depending on the server platform, Google said it saw per-DIMM correctable error rates that convert to something on the order of 25,000 to 75,000 FIT (failures in time, meaning failures per billion hours of operation) per Mbit. Compared to this, prior lab tests (using the stresses of higher utilization or temperature to simulate a longer time) showed a failure rate of between 200 and 5,000 FIT per Mbit. This is a huge difference, and you can see now why Google invented its Google File System and massive clustering done on the cheap.
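For readers who want to see how a per-DIMM error count turns into a FIT-per-Mbit figure, here is a minimal Python sketch of the unit conversion. It assumes the standard definition of FIT (failures per billion device-hours), a 365-day year, and 1 GB = 8,192 Mbit. The function name and the sample inputs are hypothetical, and this is not necessarily the exact method the paper's authors used.

```python
HOURS_PER_YEAR = 24 * 365   # assumes a 365-day year
MBIT_PER_GB = 8 * 1024      # 1 GB = 8,192 Mbit


def fit_per_mbit(errors_per_year, dimm_gb):
    """Convert errors per DIMM per year into FIT per Mbit.

    FIT = failures per 10**9 hours of operation.
    """
    errors_per_hour = errors_per_year / HOURS_PER_YEAR
    fit_per_dimm = errors_per_hour * 1e9          # scale to a billion hours
    return fit_per_dimm / (dimm_gb * MBIT_PER_GB)  # normalise by capacity


# Hypothetical inputs: 3,000 correctable errors a year on a 1 GB DIMM.
print(f"{fit_per_mbit(3000, 1):,.0f} FIT per Mbit")
```

Because the result is normalised by capacity, the same yearly error count on a 4 GB DIMM works out to a quarter of the FIT-per-Mbit figure of a 1 GB DIMM.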

The other interesting finding in the research, and one that system admins will nod their heads at almost immediately, is that the number of correctable errors increases as memory modules age, with error rates spiking up after between 10 and 18 months in the field. The incidence of uncorrectable errors goes down over time, however, as crappy components are replaced and hardy ones are left in the systems.

Google's research also suggests that faster and denser memory technologies have not appreciably increased memory error rates, contrary to what many server vendors and customers feared - hence the invention of chipkill, to compensate. And while higher temperatures can cause higher memory error rates, the effect is smaller than many would think. Instead, error rates are strongly correlated with the utilization rates on the DIMMs. Temperature is not the biggest cause of stress - swapping data in and out is. ®
