Google: Servers are DIMM-witted

Servers in the wild have a touch of Alzheimer's

The heat and stress testing of computer components in the lab does not necessarily bear out how components will behave in the field, according to a study done by Google.

When you are Google, and you have millions of server nodes in production using a mix of different technology, you can actually study component failures with a statistically significant sample. That is what Google has done, tracking memory failures in a subset of its servers over the past two and a half years.

Google techies Eduardo Pinheiro and Wolf-Dietrich Weber and their collaborator, Bianca Schroeder of the University of Toronto, have produced a research paper on the subject, entitled DRAM Errors in the Wild: A Large-Scale Field Study. In it, they point out that the number of soft errors - where error correction algorithms can keep a server running after fixing the memory errors - is lower in the field than lab tests would lead you to expect. This is good. But the number of hard errors - such as when bits get stuck, a machine crashes, and you need to replace a memory module - is a lot higher than current lab tests from memory and server makers might suggest.

Google ran its memory crash tests on six different server platforms in its data centres from January 2006 through June 2008. Three of the six platforms had hardware memory scrubbing technologies that allow single-bit soft errors to be washed out of memory, at a rate of about 1GB every 45 minutes, according to Google. The other three platforms didn't have such memory scrubbing electronics, which means soft single-bit errors can accumulate and turn into multi-bit errors.
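
Google's paper doesn't model this, but the intuition behind scrubbing is easy to sketch: SECDED-style ECC can fix one flipped bit per word, so the danger is a second flip landing in the same word before the first has been cleaned up. The toy simulation below (all sizes and flip rates are invented for illustration, not taken from the study) shows how periodic scrubbing keeps single-bit errors from piling up into uncorrectable ones.

```python
# Toy model: each memory word is protected by SECDED ECC, which corrects a
# single flipped bit but cannot correct two. A scrub pass rewrites any word
# holding exactly one flipped bit. All numbers here are invented for
# illustration; nothing comes from the Google study.
import random

random.seed(1)

WORDS = 50_000        # memory words in the toy DIMM
FLIP_P = 1e-3         # chance a word picks up one new bit flip in a given hour
HOURS = 24 * 7        # simulate one week

def run(scrub_every_hours):
    flips = [0] * WORDS
    uncorrectable = 0
    for hour in range(1, HOURS + 1):
        for w in range(WORDS):
            if random.random() < FLIP_P:
                flips[w] += 1
                if flips[w] == 2:            # second flip before a scrub: ECC can't fix it
                    uncorrectable += 1
        if scrub_every_hours and hour % scrub_every_hours == 0:
            flips = [0 if f == 1 else f for f in flips]   # scrub repairs single-bit words
    return uncorrectable

print("no scrubbing:   ", run(0), "uncorrectable words")
print("daily scrubbing:", run(24), "uncorrectable words")
```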

Google would not say how many machines were in the sample, saying only that over the 30-month study the sample racked up an aggregate of "many millions" of DIMM-days. The servers in the sample used a mix of 1GB, 2GB, and 4GB DIMMs, and DDR1, DDR2, and FB-DIMM memory types. Google does not discuss what processor architecture it uses, but there is little doubt that most - if not all - of its machines use the x64 architecture (with maybe a few still being plain x86).

Google had a monitor program that logged correctable errors, uncorrectable errors, CPU utilization, temperature, and memory allocation to see what the relationships were.
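
Google hasn't published that monitor, but on Linux the kernel's EDAC driver exposes per-memory-controller correctable and uncorrectable error counters through sysfs, which is enough to build a crude version of the same thing. A minimal sketch follows - the sysfs paths are the standard EDAC interface, while the polling interval and output format are assumptions, and temperature collection is left out:

```python
#!/usr/bin/env python3
# Minimal sketch of a memory-error logger in the spirit of Google's monitor:
# it polls the Linux EDAC sysfs counters alongside the 1-minute load average.
# The /sys/devices/system/edac paths are the kernel's standard EDAC interface;
# the polling interval and output format are assumptions.
import glob
import os
import time

EDAC_GLOB = "/sys/devices/system/edac/mc/mc*"

def read_int(path):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def sample():
    record = {"ts": int(time.time()), "load1": os.getloadavg()[0]}
    for mc in glob.glob(EDAC_GLOB):
        name = os.path.basename(mc)
        record[name + "_ce"] = read_int(os.path.join(mc, "ce_count"))  # correctable errors
        record[name + "_ue"] = read_int(os.path.join(mc, "ue_count"))  # uncorrectable errors
    return record

if __name__ == "__main__":
    while True:
        print(sample(), flush=True)
        time.sleep(600)   # ten-minute poll, an arbitrary choice
```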

One of the interesting bits is that Google discovered that some servers are just plain crankier than others, which is something that system administrators can attest to, even for identical machines. "Some machines develop a very large number of correctable errors compared to others," the authors of the study write. "We find that for all platforms, 20 per cent of the machines with errors make up more than 90 per cent of all observed errors for that platform."
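
The paper doesn't publish per-machine counts, but the kind of concentration it describes is easy to check against any error log you gather yourself. A small sketch with made-up numbers (purely illustrative, not Google's data):

```python
# Illustrative only: the per-machine correctable-error counts below are
# invented, not taken from the Google paper. The point is the calculation:
# what share of all errors comes from the top 20 per cent of error-prone
# machines?
counts = [48000, 9500, 3100, 850, 400, 120, 60, 22, 9, 3]  # errors per machine

counts.sort(reverse=True)
top_n = max(1, len(counts) // 5)           # top 20 per cent of machines with errors
share = sum(counts[:top_n]) / sum(counts)
print(f"top {top_n} machines account for {share:.0%} of all errors")  # ~93% here
```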

Across all the server platforms Google tested and all of their DIMMs, 8.2 per cent of the memory modules see correctable errors in a given year, and a DIMM that is on the blink averages almost 4,000 correctable errors a year. Some of the server types among the six that Google monitored had much higher error rates than others, but the reasons why were not obvious.

"There is not one memory technology that is clearly superior to the others when it comes to error behaviour," the authors write. So that isn't it. Whatever the problem is, it was not attributable to different memory manufacturers - Google couldn't find any correlation between who made the memory and error rates. Pinheiro, Weber, and Schroeder speculate that higher memory error rates are caused by DIMM layout and differences in the error correction algorithms used by different memory makers.

Interestingly, the platforms that did not have chipkill error correction - which can recover from multi-bit errors in memory subsystems - had lower correctable error rates, but their servers could not survive multi-bit errors. Clearly, there is some kind of tradeoff here. But Google's research also suggests that more powerful error correction (chipkill versus normal ECC scrubbing) can reduce uncorrectable error rates by a factor of 4 to 10.

The point is, memory error rates on servers are much higher than the lab tests done to date might suggest. Depending on the server platform, Google said it saw per-DIMM correctable error rates that convert to something on the order of 25,000 to 75,000 failures in time (FIT - failures per billion hours of operation) per Mbit. Prior lab tests, which use the stress of higher utilization or temperature to simulate longer periods of operation, showed failure rates of between 200 and 5,000 FIT per Mbit. This is a huge difference, and you can see now why Google invented its Google File System and built its massive clusters on the cheap.
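
To see where numbers of that magnitude come from, here is a back-of-the-envelope conversion from the per-DIMM rate quoted above into FIT per Mbit; the 4,000 errors-a-year figure is from the study, while the 1GB module size is an assumption picked from the capacities in Google's sample.

```python
# Back-of-the-envelope conversion of a per-DIMM correctable-error rate into
# FIT per Mbit. The 4,000 errors/year figure is the per-affected-DIMM average
# quoted above; the 1GB module size is an assumption (Google's sample mixed
# 1GB, 2GB and 4GB DIMMs).
errors_per_dimm_year = 4000
dimm_capacity_mbit   = 1 * 1024 * 8        # 1GB module = 8,192 Mbit
hours_per_year       = 24 * 365.25

errors_per_mbit_hour = errors_per_dimm_year / (dimm_capacity_mbit * hours_per_year)
fit_per_mbit = errors_per_mbit_hour * 1e9  # FIT = failures per billion device-hours

print(f"{fit_per_mbit:,.0f} FIT per Mbit")  # roughly 56,000 - inside the 25,000-75,000 range
```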

The other interesting finding in the research, and one that system admins will nod their heads at almost immediately, is that the number of correctable errors increases as memory modules age, with error rates spiking up after between 10 and 18 months in the field. The incidence of uncorrectable errors goes down over time, however, as crappy components are replaced and hardy ones are left in the systems.

Google's research also suggests that faster and denser memory technologies have had no appreciable effect on memory error rates, contrary to what many server vendors and customers have feared - hence the invention of chipkill to compensate. And while higher temperatures can cause higher memory error rates, the effect is not as large as many would think. Instead, error rates are strongly correlated with utilization rates on the DIMMs. Temperature is not the biggest cause of stress - swapping data in and out is. ®
