
Google: Servers are DIMM witted

Servers in the wild have a touch of Alzheimer's


Heat and stress testing of computer components in the lab does not necessarily predict how those components will behave in the field, according to a study done by Google.

When you are Google, and you have millions of server nodes in production using a mix of different technology, you can actually study component failures with a statistically significant sample. That is what Google has done, tracking memory failures in a subset of its servers over the past two and a half years.

Google techies Eduardo Pinheiro and Wolf-Dietrich Weber and their collaborator, Bianca Schroeder of the University of Toronto, have produced a research paper on the subject, entitled DRAM Errors in the Wild: A Large-Scale Field Study. In it, they point out that the number of soft errors - where error correction algorithms can keep a server running after fixing the memory errors - is lower than you might expect in the field based on lab tests. This is good. But the number of hard errors - such as when bits get stuck and a machine crashes and you need to replace a memory module - is a lot higher than current lab tests from memory and server makers might suggest.
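To make the distinction concrete, here is a toy sketch (plain Python, our own illustration - real server ECC uses beefier SEC-DED and chipkill codes, not this) of a Hamming(7,4) code: a single flipped bit is corrected on the fly, while two flipped bits defeat it.

```python
# Toy Hamming(7,4) single-error-correcting code: illustrates why a single
# "soft" bit flip is correctable while a multi-bit error is not.

def encode(d):
    """d is a list of 4 data bits; returns a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2,3,6,7
    p4 = d2 ^ d3 ^ d4          # parity over positions 4,5,6,7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(c):
    """Returns (data bits, syndrome). A non-zero syndrome points at the bit
    assumed to be in error; it is only trustworthy for single-bit flips."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    c = c[:]
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the suspected bit
    return [c[2], c[4], c[5], c[6]], syndrome

word = [1, 0, 1, 1]
code = encode(word)

one_flip = code[:]; one_flip[5] ^= 1                         # single-bit "soft" error
print(decode(one_flip)[0] == word)                           # True: fixed transparently

two_flips = code[:]; two_flips[1] ^= 1; two_flips[5] ^= 1    # multi-bit error
print(decode(two_flips)[0] == word)                          # False: mis-corrected, data is wrong
```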

Google ran its memory crash tests on six different server platforms in its data centres from January 2006 through June 2008. Three of the six platforms had hardware memory scrubbing technologies that allowed single-bit soft errors to be washed out of memory systems, at a rate of about 1 GB every 45 minutes, according to Google. The other three platforms didn't have such memory scrubbing electronics, which means soft single-bit errors can accumulate and turn into multi-bit errors.
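For a sense of what that scrubbing rate means in practice, a quick back-of-the-envelope sketch (our arithmetic, not Google's) for the DIMM sizes used in the study:

```python
# Rough scrub-pass arithmetic based on the ~1 GB per 45 minutes rate quoted above.
# The 1, 2 and 4 GB capacities are the DIMM sizes mentioned in the article;
# everything else here is illustrative.

SCRUB_RATE_GB_PER_MIN = 1.0 / 45.0   # roughly 1 GB scrubbed every 45 minutes

for dimm_gb in (1, 2, 4):
    minutes = dimm_gb / SCRUB_RATE_GB_PER_MIN
    print(f"{dimm_gb} GB DIMM: full scrub pass ~ {minutes:.0f} min ({minutes / 60:.1f} h)")
```

At that rate a 4 GB module takes roughly three hours to scrub end to end, which is why errors still have a window in which to pile up.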

Google would not say how many machines were in the sample, but rather said that in the 30-month study, the sample had an aggregate of "many millions" of DIMM-days. The servers in the sample used a mix of 1 GB, 2 GB, and 4 GB DIMMs, and DDR1, DDR2, and FB-DIMM memory types. Google does not discuss what processor architecture it uses, but there is little doubt that most - if not all - of Google's machines are x64 (with maybe some still being x86) architecture.

Google had a monitor program that logged correctable errors, uncorrectable errors, CPU utilization, temperature, and memory allocation to see what the relationships were.
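Google does not describe that monitor in any detail, but on a modern Linux box you can collect the equivalent correctable and uncorrectable counts yourself from the kernel's EDAC sysfs interface. A minimal sketch, assuming EDAC is enabled and the usual /sys/devices/system/edac/mc layout:

```python
# Minimal sketch of the kind of telemetry collector described above, reading
# correctable (ce_count) and uncorrectable (ue_count) error counters per memory
# controller from the Linux EDAC sysfs tree. Our own illustration, not Google's code.
import glob
import time

def read_edac_counts():
    counts = {}
    for mc in glob.glob("/sys/devices/system/edac/mc/mc*"):
        with open(mc + "/ce_count") as f:
            ce = int(f.read().strip())
        with open(mc + "/ue_count") as f:
            ue = int(f.read().strip())
        counts[mc] = (ce, ue)
    return counts

if __name__ == "__main__":
    while True:
        for mc, (ce, ue) in sorted(read_edac_counts().items()):
            print(f"{mc}: correctable={ce} uncorrectable={ue}")
        time.sleep(600)   # poll every ten minutes
```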

One of the interesting bits is that Google discovered that some servers are just plain crankier than others, which is something that system administrators can attest to, even for identical machines. "Some machines develop a very large number of correctable errors compared to others," the authors of the study write. "We find that for all platforms, 20 per cent of the machines with errors make up more than 90 per cent of all observed errors for that platform."
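Computing that sort of concentration figure is straightforward enough; the sketch below uses made-up per-machine error counts purely to illustrate the method.

```python
# Illustrative calculation of the concentration statistic quoted above: what
# fraction of the machines with errors accounts for 90 per cent of all errors?
# The per-machine counts below are invented; only the method is the point.

def share_of_machines_for_error_fraction(error_counts, fraction=0.90):
    counts = sorted((c for c in error_counts if c > 0), reverse=True)
    target = fraction * sum(counts)
    running, machines = 0, 0
    for c in counts:
        running += c
        machines += 1
        if running >= target:
            break
    return machines / len(counts)

sample = [6000, 2000, 30, 20, 15, 12, 10, 8, 3, 2]   # hypothetical per-machine error counts
print(f"{share_of_machines_for_error_fraction(sample):.0%} "
      "of the error-prone machines account for 90% of errors")
```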

Across all server platforms tested by Google and all of their DIMMs, 8.2 per cent of the memory modules have correctable errors and an average DIMM has almost 4,000 correctable errors per year, if it is on the blink. Some of the server types among the six that Google monitored had much higher error rates than others, but the reasons why were not obvious.

"There is not one memory technology that is clearly superior to the others when it comes to error behaviour," the authors write. So that isn't it. Whatever the problem is, it was not attributable to different memory manufacturers - Google couldn't find any correlation between who made the memory and error rates. Pinheiro, Weber, and Schroeder speculate that higher memory error rates are caused by DIMM layout and differences in the error correction algorithms used by different memory makers.

Interestingly, the platforms that did not have chipkill error correction - which can recover from multiple bit errors in memory subsystems - had lower correctable error rates, but their servers could not survive multi-bit errors. Clearly, there is some kind of tradeoff here. But Google's research also suggests that more powerful error correction (chipkill versus normal ECC scrubbing) can reduce uncorrectable error rates by a factor of 4 to 10.

The point is, memory error rates on servers are much higher than the lab tests done to date might suggest. Depending on the server platform, Google said it saw per-DIMM correctable error rates on the order of 25,000 to 75,000 failures in time (FIT - that is, errors per billion hours of operation) per Mbit. By comparison, prior lab tests (which use higher utilization or temperature to simulate longer periods of operation) showed failure rates of between 200 and 5,000 FIT per Mbit. This is a huge difference, and you can see now why Google invented its Google File System and does its massive clustering on the cheap.
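As a sanity check on those figures (our own arithmetic, not the paper's), converting the roughly 4,000 correctable errors per year per affected DIMM quoted above into FIT per Mbit - assuming a 1 GB DIMM - lands inside that 25,000 to 75,000 range:

```python
# Back-of-the-envelope check of the FIT figures quoted above. FIT = failures in
# time = errors per billion device-hours. The ~4,000 correctable errors per
# affected DIMM per year comes from earlier in the article; the 1 GB DIMM size
# (8,192 Mbit) and the tidy round numbers are our assumptions.

ERRORS_PER_DIMM_PER_YEAR = 4000
HOURS_PER_YEAR = 365 * 24            # 8,760 hours
DIMM_MBIT = 1 * 1024 * 8             # 1 GB expressed in megabits

errors_per_hour = ERRORS_PER_DIMM_PER_YEAR / HOURS_PER_YEAR
fit_per_dimm = errors_per_hour * 1e9          # scale to a billion hours
fit_per_mbit = fit_per_dimm / DIMM_MBIT

print(f"~{fit_per_mbit:,.0f} FIT per Mbit")   # roughly 56,000, inside the quoted range
```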

The other interesting finding in the research, and one that system admins will nod their heads at almost immediately, is that the number of correctable errors increases as memory modules age, with error rates spiking up after between 10 and 18 months in the field. The incidence of uncorrectable errors goes down over time, however, as crappy components are replaced and hardy ones are left in the systems.

Google's research also suggests that faster and denser memory technologies have had no appreciable effect on memory error rates, contrary to what many server vendors and customers have feared - hence the invention of chipkill to compensate. And while higher temperatures can cause higher memory error rates, the effect is not as large as many would expect. Instead, error rates are strongly correlated with utilization rates on the DIMMs. Temperature is not the biggest cause of stress - swapping data in and out is. ®
