Data Storage

Google Finds DRAM Errors More Common Than Believed

Posted by kdawson
from the forget-me-not dept.
An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.

  • Percentage? (Score:4, Interesting)

    by Runaway1956 (1322357) * on Tuesday October 06, 2009 @02:57PM (#29661065) Homepage Journal

    "a mean of 3,751 correctable errors per DIMM per year."

    I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
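
    For anyone who does want the math, here is a rough sketch of it. It takes only the 3,751 figure quoted above and assumes a DIMM powered on for a full 8,760-hour year. The result is a mean across all DIMMs, so it says nothing about how the errors are spread between individual modules.

```python
# Figure quoted in the comment above; everything else is plain arithmetic.
ERRORS_PER_DIMM_PER_YEAR = 3751      # mean correctable errors, as quoted
HOURS_PER_YEAR = 24 * 365            # 8760, ignoring leap years (assumption)

errors_per_hour = ERRORS_PER_DIMM_PER_YEAR / HOURS_PER_YEAR
hours_between_errors = HOURS_PER_YEAR / ERRORS_PER_DIMM_PER_YEAR

print(f"{errors_per_hour:.3f} correctable errors per DIMM-hour")          # 0.428
print(f"one correctable error every {hours_between_errors:.2f} hours")    # 2.34
```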

  • Re:Percentage? (Score:3, Interesting)

    by Red Flayer (890720) on Tuesday October 06, 2009 @03:13PM (#29661277) Journal
    Humorous ordering of replies to this article.

    Your post:

    Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.

    Post before yours:

    From the study's abstract:
    "We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."

    The 'components bit' of your post may be spot-on, but the juxtaposition of your temperature claim, along with the previous poster's quoting of the abstract FTA, is funny (to me, anyway).

  • by eison (56778) <pkteison AT hotmail DOT com> on Tuesday October 06, 2009 @03:14PM (#29661301) Homepage

    I've always thought it would be a nice-to-have feature for my home system to have ECC - perhaps it would degrade less over time and misbehave less if it could detect and fix some errors. But my normal sources don't seem to stock many choices. E.g. Newegg appears to have 2 motherboards to choose from, both for AMD CPUs, nothing for Intel. Fry's appears to have one, same, AMD only. Is this just the way things are, or do I need to be looking somewhere else? Would picking one of these motherboards end up not working out well for my gaming rig?

  • Dell (Score:5, Interesting)

    by ^_^x (178540) on Tuesday October 06, 2009 @03:20PM (#29661389)

    In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...

  • Re:ZFS (Score:3, Interesting)

    by profplump (309017) <zach-slashjunk@kotlarek.com> on Tuesday October 06, 2009 @03:24PM (#29661439)

    Adding checksumming adds another place for errors to occur though -- if data is written correctly but the checksum is miscalculated, either before it is stored or when the data is being verified -- you'll end up throwing out perfectly good data. If you also have redundancy you're probably willing to live with that, but if you're running on a single disk, ZFS is just adding more opportunities for data corruption in RAM.
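
    The failure mode described above can be sketched in a few lines. This is only an illustration: CRC32 from Python's `zlib` stands in for whatever checksum a filesystem actually uses, and `store` and `verify` are hypothetical helper names, not ZFS functions.

```python
import zlib

def store(data: bytes):
    """Return the data with its CRC32, as a checksumming store might keep it."""
    return data, zlib.crc32(data)

def verify(data: bytes, checksum: int) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(data) == checksum

data, checksum = store(b"perfectly good block")

# Simulate a single bit flipping in RAM in the stored checksum, not the data.
corrupted_checksum = checksum ^ (1 << 7)

print(verify(data, checksum))            # True
print(verify(data, corrupted_checksum))  # False: intact data gets rejected
```

    With a second copy to fall back on, the rejected block is merely re-read; without one, the verifier has no way to tell a bad checksum from bad data.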

  • by jhfry (829244) on Tuesday October 06, 2009 @03:25PM (#29661453)

    Read the article and remember they are talking averages here.

    They give it away with this line:

    Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems

    Essentially, only 8% of their ECC DIMMs reported ANY errors in a given year.

    Also this was pretty telling:

    Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.

    And this:

    For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.

    So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.

    I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues... that DIMMs are actually more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With billions of instructions per second, even an astronomically low average of undetected CPU errors would guarantee an error at least as often as failed DIMMs.

    What I did take from the article was that without ECC RAM, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.

  • by MattRog (527508) on Tuesday October 06, 2009 @03:28PM (#29661483)

    RAM is dirt cheap and most server systems support significantly more RAM than most people bother to install. For critical systems, ECC works but that doesn't prevent everything (double bit errors etc.). Is it time for a Redundant Array of Inexpensive DIMMs? Many HA servers now support Memory Mirroring (aka RAID-1 http://www.rackaid.com/resources/rackaid-blog/server-dysfunction/memory_mirroring_to_the_rescue/ [rackaid.com]) but should there be more research into different RAID levels for memory (RAID5-6, 10, etc?)
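
    As a rough illustration of what a RAID-5-style scheme would mean for memory, the sketch below keeps an XOR parity block over three data "banks" and rebuilds one of them after a simulated failure. The bank contents and the `xor_blocks` helper are made up for the example; this is not how any particular memory controller works.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical memory banks holding three bytes each.
banks = [b"\x10\x20\x30", b"\x0f\x0e\x0d", b"\xaa\xbb\xcc"]
parity = xor_blocks(banks)

# Pretend bank 1 has failed: rebuild it from the survivors plus parity.
survivors = [banks[0], banks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == banks[1])  # True
```

    Note that plain XOR parity can only rebuild a bank that is already known to be bad; it cannot say which bank holds the wrong value, so some separate error detection is still needed.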

  • Re:Percentage? (Score:3, Interesting)

    by HornWumpus (783565) on Tuesday October 06, 2009 @03:31PM (#29661513)

    IIRC ECC RAM has extra bits and hardware to fix any single bit error and record that it happened.

    Regular RAM only has parity, which can tell the MB the data is suspect but not which bit flipped. Kernel panic, Blue Screen, Guru Meditation #, whatever.

    It's the same RAM, just arranged differently on the DIMM.

    I once had a dual Pentium PRO that required ECC RAM. BIOS recorded 0 ECC errors in the three years or so that it was my primary machine. Which is what the Google study would lead me to expect.
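
    The "extra bits to fix any single bit error" idea can be shown with a toy Hamming(7,4) code: four data bits, three check bits. Real ECC memory uses a wider code over a much larger word, so treat this purely as a small-scale sketch; `encode` and `decode` are names chosen for the example.

```python
def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    # Bit positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity over positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity over positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity over positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s3       # the syndrome is the flipped position
    if pos:
        c[pos - 1] ^= 1              # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], pos

word = [1, 0, 1, 1]
codeword = encode(word)
codeword[4] ^= 1                     # simulate a single flipped bit at position 5
recovered, position = decode(codeword)
print(recovered == word, position)   # True 5
```

    The reported position is the "record that it happened" part: the hardware can both repair the word and log where the flip occurred. Two flips in the same word would be miscorrected by this toy code.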

  • Re:Percentage? (Score:2, Interesting)

    by skirtsteak_asshat (1622625) on Tuesday October 06, 2009 @03:37PM (#29661603)
    Well, consider that they had a board CUSTOM MADE for them, which means custom BIOS fitments, custom feature implementations, custom BUGS. Then add the reality that is DRAM - an imperfect 'art' form of data storage and retrieval. No two chips are EXACTLY the same... though very close. Manufacturing defects may not manifest themselves under normal conditions, and require heating/cooling cycles or fluctuating voltages to break down. Running ECC performs a basic parity check, nothing more, and it's still possible to pass bad bits with ECC enabled, just much less likely. The idea is that you can't really test subcomponents individually and have them check out, and then assemble a system and expect it to just 'work'. Some RAM is pretty damn finicky. Standards are anything but.
  • by sshir (623215) on Tuesday October 06, 2009 @03:38PM (#29661627)
    Seriously. If you download a lot, and I do, you see quite a few checksum mismatches in the log.
    Especially if the torrent is old. Some of them may be sabotage activity, but I doubt that, considering the kind of files.

    They are not transmission errors: TCP/IP checks for that. Not hard drive errors - again checksums. They can be intrasystem transmission errors though.

    I remember folks who did complete checkers wrote that they had a lot of them too.
  • Re:Dell (Score:5, Interesting)

    by Jah-Wren Ryel (80510) on Tuesday October 06, 2009 @03:43PM (#29661723)

    Hyundai is Hynix and they are the second largest DRAM manufacturer by marketshare (roughly 20% second to Samsung's 30%).

    It's no surprise that you've only seen Hynix brand fail in Dells, chances are they are in 90%+ of Dell (and HP and Apple) boxes because they primarily buy from Hynix in the first place. It's selection bias.

  • Re:Bus errors! (Score:2, Interesting)

    by dotgain (630123) on Tuesday October 06, 2009 @03:44PM (#29661745) Homepage Journal
    I had one mobo, can't remember brand/model exactly but CPU was an AMD K6-2 450MHz, and back then we ran XFree86 which came as seven gzipped tarballs (if you compile from source). I think it was file number three that would never gunzip on my PC, "invalid compressed data - CRC error", but the MD5 checked out, so I tried it on another machine and it was indeed fine (and this is back when MD5 was thought secure).

    This machine compiled a lot of source (it was a Gentoo box), so surely if errors like these had been happening frequently we'd have known from heaps of signal-elevens killing the compiles all the time, right?

    ~24 hours of Memtest86 revealed nothing. Googling at the time found someone with the exact same mobo+CPU having problems gunzipping the exact same file (with the correct MD5), and I wondered if there was some specific bit-pattern in the file (or gunzip's state) that b0rked on my mobo. In retrospect I should have tried Solaris x86 on the same machine to try gunzipping the file.

  • Re:Percentage? (Score:0, Interesting)

    by Anonymous Coward on Tuesday October 06, 2009 @03:58PM (#29661983)

    The mobos used by Google are the cheapest boards they can get made. There is NO testing until they hit the datacenter floor. Crap mobo plus poor environment (high heat and vibration + poor power controls) makes for a high failure rate. ECC RAM has an odd number of memory chips. The odd chip allows for the parity RAM. Google memory has even chip counts since non-ECC RAM is much MUCH less expensive. So the BIOS is custom and carves out ECC function from non-ECC RAM.

  • Radiation Effects (Score:5, Interesting)

    by Maximum Prophet (716608) on Tuesday October 06, 2009 @04:01PM (#29662023)
    At Purdue, many years ago, one of the engineers mapped the ECC RAM errors in a room with hundreds of SPARCstations and found that it was mostly in a cone shape pointed toward the window. That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.
  • by rdebath (884132) on Tuesday October 06, 2009 @04:30PM (#29662421)

    The TCP/IP checksums are really weak: only 16 bits, and a rather poor algorithm anyway. So more than one in 65 thousand errors will go undetected by a TCP/IP checksum. And that's not including buggy network adaptors and drivers that 'fix' or ignore the checksums.

    If you're transferring gigabytes of data you really need something a lot better.

    Still, that's probably not the most common source of errors. The same problem exists when data is transferred across an IDE or SCSI bus: if there's a checksum at all it's very weak, and the amounts of data transferred across a disk bus are scary.
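
    To make the weakness of the 16-bit TCP checksum concrete, here is a minimal sketch of a ones' complement checksum in the style of RFC 1071. It covers only a payload (the real TCP checksum also covers the header and a pseudo-header), and the sample bytes are arbitrary. Because the checksum is just a sum, reordering 16-bit words leaves it unchanged.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole 16-bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

original = b"\x12\x34\xab\xcd\x00\x01"
reordered = b"\xab\xcd\x12\x34\x00\x01"      # first two 16-bit words swapped

print(internet_checksum(original) == internet_checksum(reordered))  # True
```

    The same applies to any pair of errors that cancel out in the sum, which is why large transfers are usually verified with a stronger hash on top.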

  • by RAMMS+EIN (578166) on Tuesday October 06, 2009 @04:46PM (#29662659) Homepage Journal

    When I was building the computer I'm typing this on, I had the grand idea of building it with so much RAM that I could basically work from RAM. Meaning, for example, that all my running programs and the project I was working on would have to fit in RAM.

    Of course, with such a dream, I was concerned about the reliability of my memory. So I wanted ECC. I found out that having ECC memory is not just a matter of buying ECC memory. There are different kinds of ECC memory, and you need to find a combination of memory, motherboard, and CPU that works together. Many sites that offer CPUs and/or motherboards don't list support for ECC among the specifications. Searching for it is difficult, because searching for "ECC" also returns hits for things like "non-ECC" and "ECC: no".

    Finally, I found a combination of motherboard and CPU that would support unbuffered ECC DDR2, and a matching pair of memory modules to go with it. And then, when I got all the parts, the RAM didn't fit in the motherboard. Turns out the RAM was FB-DIMM, which had not been listed in the advertisement. I gave up and just bought 2GB of non-ECC RAM to just get the system working. The FB-DIMM (all 8GB of it) is still sitting here, because I haven't found anyone who wants to buy it from me.

    Lessons learned: 1. The saying "the nice thing about standards is that there are so many to choose from" is still relevant. I don't know why there have to be so many hardware interfaces to memory chips, but there are, so be careful. 2. Apparently, nobody really cares about ECC RAM, otherwise information would be easier to find. 3. Apparently, AMD CPUs and matching motherboards more commonly support ECC RAM than Intel parts and matching motherboards.

  • by TJamieson (218336) on Tuesday October 06, 2009 @04:51PM (#29662725)

    I think OP's point was, say you have 4G of non-ECC RAM. It would be neat if you could turn that into, say, 2G of "RAID RAM".

  • Re:Percentage? (Score:4, Interesting)

    by wagnerrp (1305589) on Tuesday October 06, 2009 @05:54PM (#29663529)
    Actually, they are custom motherboards. They are a non-standard form factor, using a custom 12V power connector, instead of a normal ATX/EPS plug. When you figure they're buying tens of thousands of these systems, why would you not have an OEM build you custom boards?
