Google Finds DRAM Errors More Common Than Believed

Posted by kdawson
from the forget-me-not dept.
An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.

  • Re:Percentage? (Score:5, Informative)

    by gspear (1166721) on Tuesday October 06, 2009 @03:05PM (#29661171)
    From the study's abstract:

    "We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."

  • Re:Percentage? (Score:5, Informative)

    by Runaway1956 (1322357) * on Tuesday October 06, 2009 @03:06PM (#29661191) Homepage Journal

    No, I don't believe so. They use server boards, custom made to their specs. And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all. http://news.cnet.com/8301-1001_3-10209580-92.html [cnet.com] If you're really interested, that story gives you a starting point to google from.

  • Bus errors! (Score:5, Informative)

    by redelm (54142) on Tuesday October 06, 2009 @03:11PM (#29661251) Homepage
    Hard DRAM errors are rather hard to explain if the cells are good -- maybe a bad write. After much DRAM testing (I use memtest86+ weeklong), I've yet to find bad cells.

    What I have seen (and generated) is the occasional (2-3/day) bus error with specific (nasty) data patterns, usually at a few addresses. I write that off to mobo trace design and crosstalk between the signals. Failing to round the corners sufficiently, or leaving spurs, is the likely problem. I think HyperTransport is a balanced design (push-pull differential like Ethernet) and should be less susceptible.

  • Re:ZFS (Score:3, Informative)

    by fuzzyfuzzyfungus (1223518) on Tuesday October 06, 2009 @03:13PM (#29661287) Journal
    Just as likely to crash, less likely to silently scribble bits of nonsense all over the filesystem before doing so...

    Obviously, not having RAM errors would be even nicer; but, if you can at least detect trouble when it arises rather than well afterwards, you can avoid having it propagate further, and get away with using cheap redundancy instead of expensive perfection.
  • Re:Bus errors! (Score:3, Informative)

    by marcansoft (727665) <hector@@@marcansoft...com> on Tuesday October 06, 2009 @03:16PM (#29661327) Homepage

    I had a RAM stick (256MB DDR I think) with a stuck bit once. At first I just noticed a few odd kernel panics, but then I got a syntax error in a system Perl script. One letter had changed from lowercase to uppercase. That's when I ran memtest86 and found the culprit.

    At the time, a "mark pages of memory bad" patch for the kernel did the trick and I happily used that borked stick for a year or so.
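    The lowercase-to-uppercase symptom is what a single stuck bit would look like: in ASCII, the two cases of a letter differ only in bit 5 (0x20). A minimal Python sketch of that effect, assuming a stuck-at-0 fault; `stuck_at_zero` and the sample word are hypothetical, not taken from memtest86 or the kernel patch:

```python
def stuck_at_zero(byte: int, bit: int) -> int:
    """Model a memory cell whose given bit always reads back as 0."""
    return byte & ~(1 << bit) & 0xFF

original = "print"
# Bit 5 (0x20) is the only difference between 'p' (0x70) and 'P' (0x50).
corrupted = chr(stuck_at_zero(ord(original[0]), 5)) + original[1:]
print(original, "->", corrupted)
```

    One wrong bit is enough to turn a valid keyword into an unknown identifier, which is why the fault showed up as a syntax error rather than a crash.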

  • Re:Percentage? (Score:5, Informative)

    by jasonwc (939262) on Tuesday October 06, 2009 @03:18PM (#29661345)
    The article suggests that errors are less likely on systems with few DIMMs and on those which are less heavily used, and that there was no significant difference among types of RAM or vendors, at least with regard to ECC RAM. Thus, laptop and desktop users, who likely have only 2 or 3 DIMMs and make only casual use of their systems, have a lower risk of errors. Non-ECC RAM may in general be of lower quality than ECC RAM, and thus more prone to error, but its usage is also less mission-critical. In addition, ECC RAM is usually used in systems with many DIMMs that are run 24/7/365.

    Good news
    The study had several findings that are good news for consumers:

            * Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn't necessary.
            * The problem isn't getting worse. The latest, most dense generations of DRAM perform as well, error wise, as previous generations.
            * Heavily used systems have more errors - meaning casual users have less to worry about.
            * No significant differences between vendors or DIMM types (DDR1, DDR2 or FB-DIMM). You can buy on price - at least for the ECC-type DIMMs they investigated.
            * Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems.
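
    To put a number on "fewer DIMMs = fewer error problems": a rough Python sketch, assuming each DIMM independently has the quoted 8% yearly chance of seeing at least one error. Independence is an assumption made here for illustration, not a claim from the study, and `p_any_dimm_errors` is a hypothetical helper.

```python
def p_any_dimm_errors(n_dimms: int, p_per_dimm: float = 0.08) -> float:
    """Chance that at least one of n_dimms sees an error in a year,
    assuming DIMMs fail independently (an illustrative assumption)."""
    return 1.0 - (1.0 - p_per_dimm) ** n_dimms

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} DIMMs: {p_any_dimm_errors(n):.1%}")
```

    Under that assumption a two-DIMM desktop sits at roughly 15% per year, while an eight-DIMM server is close to a coin flip.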
  • by DAldredge (2353) <SlashdotEmail@GMail.Com> on Tuesday October 06, 2009 @03:40PM (#29661659) Journal
    A lot of the AMD boards support ECC RAM but newegg doesn't show it. Most every AM2 motherboard supports it. My main workstation at home is a Phenom II with 8GB ECC RAM mainly for that reason.
  • by vadim_t (324782) on Tuesday October 06, 2009 @03:43PM (#29661719) Homepage

    ECC is slower by something like 1%, which is completely unnoticeable since RAM contributes relatively little to the overall system performance. 2x faster RAM won't make things run twice as fast, because normally CPU caches get a > 90% hit ratio. Otherwise things would be incredibly slow, as the fastest RAM still is horribly slow and has a horrible latency compared to the cache.
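    The cache argument can be checked with the usual average-access-time formula. A small Python sketch; the latencies and the exact 90% hit ratio are assumed round numbers, not measurements, and the result is memory access time only, not whole-system performance:

```python
def average_access_ns(hit_ratio: float, cache_ns: float, ram_ns: float) -> float:
    """Average memory access time: hits are served by cache, misses by RAM."""
    return hit_ratio * cache_ns + (1.0 - hit_ratio) * ram_ns

HIT_RATIO = 0.90  # assumed; the comment only says "> 90%"
CACHE_NS = 2.0    # assumed cache latency
RAM_NS = 60.0     # assumed RAM latency

baseline = average_access_ns(HIT_RATIO, CACHE_NS, RAM_NS)
with_ecc = average_access_ns(HIT_RATIO, CACHE_NS, RAM_NS * 1.01)   # RAM 1% slower
with_fast_ram = average_access_ns(HIT_RATIO, CACHE_NS, RAM_NS / 2) # RAM 2x faster

ecc_slowdown = with_ecc / baseline - 1.0
fast_ram_speedup = baseline / with_fast_ram
print(f"ECC slowdown: {ecc_slowdown:.2%}, 2x RAM speedup: {fast_ram_speedup:.2f}x")
```

    With these numbers a 1% RAM penalty costs under 1% in average access time, and doubling RAM speed gives well under a 2x gain; a higher hit ratio shrinks both effects further.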

  • Re:Percentage? (Score:4, Informative)

    by poetmatt (793785) on Tuesday October 06, 2009 @03:57PM (#29661961) Journal

    uh, article showed that temperature has nothing to do with it.

    the rest is accurate.

  • Re:Percentage? (Score:3, Informative)

    by osu-neko (2604) on Tuesday October 06, 2009 @04:33PM (#29662467)

    ... Running ECC performs a basic parity check, nothing more...

    Not [wikipedia.org] exactly [wikipedia.org]...

  • by phantomcircuit (938963) on Tuesday October 06, 2009 @04:39PM (#29662547) Homepage

    The checksum used by TCP is several orders of magnitude more likely to match a corrupted packet than the checksum used by bittorrent. (citation [psu.edu])

    More than likely these are transmission errors where the TCP checksum matched but the bittorrent checksum did not.
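    A quick way to see how weak the TCP checksum is: it is a 16-bit ones' complement sum, so any reordering of 16-bit words leaves it unchanged, while SHA-1 (the per-piece hash BitTorrent uses) catches it. A simplified Python sketch that checksums only a payload, ignoring the TCP header and pseudo-header; the sample payloads are made up:

```python
import hashlib

def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

good = b"ABCDEFGH"
bad = b"CDABEFGH"  # first two 16-bit words swapped in transit

same_checksum = internet_checksum(good) == internet_checksum(bad)
same_sha1 = hashlib.sha1(good).digest() == hashlib.sha1(bad).digest()
print("checksum match:", same_checksum, "| SHA-1 match:", same_sha1)
```

    The corrupted payload passes the 16-bit check and fails the hash, which matches the behavior described above.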

  • Re:Percentage? (Score:3, Informative)

    by phantomcircuit (938963) on Tuesday October 06, 2009 @04:44PM (#29662617) Homepage

    UPS - Uninterruptible Power Supply

    Now many UPSs also include a Power Conditioner, but a UPS is not a power conditioner.

  • by MattRog (527508) on Tuesday October 06, 2009 @04:53PM (#29662769)

    No, not really.

    RAID-5 allows for disk failure via distributed block parity. ECC recovers single bit error.

    The "Memory RAID" design should prevent a larger issue (multi-bit/DIMM failure/etc. that ECC cannot prevent) from taking the whole system out.

    I would imagine that ECC memory would be used in conjunction with higher-level striping or mirroring to prevent and recover from both failures.
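    For contrast with ECC, here is what RAID-5 style parity does: it rebuilds a whole missing block by XOR, but only when you already know which block is gone. A minimal Python sketch with made-up block contents; `xor_blocks` is a hypothetical helper and no real RAID layout (striping, rotation of the parity block) is modeled:

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)

d0, d1, d2 = b"block-0!", b"block-1!", b"block-2!"
parity = xor_blocks(d0, d1, d2)

# Pretend the device holding d1 died: XOR of the survivors and parity restores it.
rebuilt_d1 = xor_blocks(d0, d2, parity)
print(rebuilt_d1)
```

    Parity alone cannot tell which block is wrong if a bit silently flips; ECC can locate and fix a single bad bit, which is why the two are complementary rather than equivalent.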

  • Re:Percentage? (Score:4, Informative)

    by PitaBred (632671) <slashdot.pitabred@dyndns@org> on Tuesday October 06, 2009 @05:15PM (#29663065) Homepage
    Did you even read the article? They found that heat WAS NOT one of the factors. Which makes the rest of your statement seem like just as much bullshit.
  • by PitaBred (632671) <slashdot.pitabred@dyndns@org> on Tuesday October 06, 2009 @05:22PM (#29663153) Homepage
    The article states 5-6%, which jibes with benchmarks I've found [computerpoweruser.com].
  • by evil-barn (464762) on Tuesday October 06, 2009 @06:01PM (#29663605)

    You can do this. My IBM x3550 servers (which are ancient) have this option. It's set by jumpers on the motherboard.

  • Re:Percentage? (Score:5, Informative)

    by Austerity Empowers (669817) on Tuesday October 06, 2009 @07:28PM (#29664455)

    I work on server design, specifically motherboards. ECC is a feature, it helps prevent bit errors from passing through undetected. It is not a method for preventing errors from happening in the first place, nor does it influence the number of bit errors. That is a property of the motherboard design, the chipset, the DIMM PCB and the DRAM. Second, just because you provide a spec for a mobo, does not mean that it is all inclusive. Generally people specify form factor, power, features. They don't specify quality and in most cases don't give criteria for what it means for a feature to "work". In fact most customers I've talked to don't really understand what quality means from hardware (and sometimes in general). Hardware management, much like software, is designed with similar principles of impact/effort: if customers don't care, we don't test. In other words if it ain't listed on the box, or the salesman won't write it down, just assume it wasn't done.

    In spite of the fact that computer motherboards are digital electronics, there is in fact anything but a binary determination of "work" and "not work". Digital signals are an engineering approximation, one which falls apart at high speeds, dense routing and inexpensive design. Well designed and tested motherboards have a well known bit error rate, and reliable companies will not ship a new design until they meet their target. I do this on systems I design, but they aren't cheap, not by a lot. It is a very expensive, time consuming process, one which most companies really want to get rid of. Not all systems are so thoroughly tested, in fact the vast majority of boards out there, server or otherwise, aren't tested much at all.

    Forking over money for ECC is very similar to paying the mob to protect you. Yes, it will give you more peace of mind, but what you really want is to not be having these problems to begin with. For people who care about data integrity, you should be asking what the bit error rate is and how they know. If they don't know, then you don't want it, ECC or no ECC. Don't assume "the industry" is equal, and don't assume that because a vendor's product X is really good that their product Y is really good too: you WILL be wrong, particularly on computers.

  • by Anonymous Coward on Tuesday October 06, 2009 @08:54PM (#29665117)
    Here's the technique I use on Linux, for a K10. The scrubber can be accessed via the PCI config space of vendor:device 1022:1203, using registers starting at offset 0x40, just after the 64-byte standard PCI config space.
    • Turn off ECC error reporting with the low 3 bits of 40.L
    • Turn on ECC (bit 22 of 44.L)
    • Set the scrub address to 0 (64 bits in 5C.L and 60.L), with the lsbit set to 1 (write back after correction)
    • Set the scrub rate to the maximum of 64 bytes/40 ns (1.5 GiB/s) using lsbyte of 58.L
    • Set the L1, L2, and L3 cache scrub rates to the AMD-recommended values (other bytes of 58.L).
    • Wait 6 seconds (5.37, actually) for 8 GiB of memory to be scrubbed
    • Set the scrub rate to 2^13 times less (0.66 GiB/hour) to scrub 8 GiB every 12 hours
    • Enable ECC error reporting.

    The commands to do this are:

    • setpci -v -d 1022:1203 40.L=0:3 44.L=00400000:00400000 5C.L=1,0 58.L=0F121001
    • sleep 6
    • setpci -v -d 1022:1203 58.L=0E:FF 40.L=3:3

    You can watch the scrub address register incrementing using
    setpci -d 1022:1203 60.L 5C.L

    Similar commands work on the K8 (single-core Athlon 64), but the device is :1103, and leave the msbyte of 58.L alone (there is no L3 cache scrubber).
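
    The timing figures in the steps above (1.5 GiB/s, 5.37 seconds, 0.66 GiB/hour, 12 hours) can be reproduced with a few lines of arithmetic. A Python sketch using only the numbers quoted in the comment; it does not touch any hardware register:

```python
GIB = 2**30
LINE_BYTES = 64      # bytes scrubbed per step, as quoted
FAST_STEP_NS = 40    # fastest scrub interval, as quoted
MEMORY_GIB = 8       # memory size used in the comment

fast_rate = LINE_BYTES / (FAST_STEP_NS * 1e-9)  # bytes per second
fast_gib_per_s = fast_rate / GIB                # about 1.5 GiB/s
full_pass_s = MEMORY_GIB * GIB / fast_rate      # about 5.37 s

slow_rate = fast_rate / 2**13                   # rate after slowing by 2^13
slow_gib_per_h = slow_rate * 3600 / GIB         # about 0.66 GiB/hour
slow_pass_h = MEMORY_GIB * GIB / slow_rate / 3600  # about 12 hours

print(f"{fast_gib_per_s:.2f} GiB/s, full pass {full_pass_s:.2f} s")
print(f"{slow_gib_per_h:.2f} GiB/h, full pass {slow_pass_h:.1f} h")
```

    The numbers line up with the comment: one fast pass over 8 GiB takes a little over five seconds, and the background rate covers the same memory roughly twice a day.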

  • Re:Percentage? (Score:5, Informative)

    by Mr Z (6791) on Wednesday October 07, 2009 @10:30AM (#29669689) Homepage Journal

    "Regular RAM" has neither parity nor ECC.

    The original PC added a 9th bit to each byte, creating parity RAM. It was unique among personal computers at the time. None (or nearly none) of the original PC's contemporaries did this. But, since IBM did, many clones followed suit in the PC space. Macs, notably, didn't support ECC for many, many years, but if you pop open a Columbia Data Products PC [textfiles.com], you'll see parity RAM. (Note "128K RAM with parity" in that scan.) IBM went with byte parity in part because bytes were the smallest memory unit the CPU read or wrote to the memory. With byte parity, every memory access could be protected.
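
    As a concrete picture of that 9th bit: a minimal Python sketch of byte parity. Even parity is chosen here purely for illustration (the comment does not say which polarity the PC used), and `parity_bit` and `looks_ok` are hypothetical helpers:

```python
def parity_bit(byte: int) -> int:
    """Even parity: the extra bit makes the total count of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2

def looks_ok(byte: int, stored_parity: int) -> bool:
    """True when the byte read back still agrees with its stored parity bit."""
    return parity_bit(byte) == stored_parity

value = 0x41                      # 'A'
stored = parity_bit(value)        # written alongside the byte

single_flip = value ^ 0x20        # one bit flipped
double_flip = value ^ 0x03        # two bits flipped

print(looks_ok(value, stored), looks_ok(single_flip, stored), looks_ok(double_flip, stored))
```

    Parity detects any single flipped bit in the byte but cannot say which bit it was, and it misses an even number of flips entirely; that gap is what ECC closes.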

    This ratio of 9/8 stuck with the PC's memory system over the years, following it to ever wider interfaces. That includes the 16 bit buses of the 286 and 386SX, the 32-bit buses of the 386DX and 486, and the 64 bit bus of the original Pentium. While many manufacturers made the byte parity optional as a cost saver, it was still rather common.

    Once you get to 64 bits, you have 8 extra parity bits for a total memory width of 72 bits. This is enough bits to implement a single-error correct, double-error detect Hamming code [wikipedia.org] on the 64-bit data. As long as you always read or write in multiples of 64 bits, you can also generate the Hamming code on writes and check it on reads.
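
    Here is a textbook version of that 72-bit code in Python: 64 data bits, 7 Hamming check bits at the power-of-two positions, and 1 overall parity bit. It is a sketch of the general technique only; real memory controllers use their own bit layouts (often a different SEC-DED code), and `encode`/`decode` are hypothetical names.

```python
CHECK_POS = [1 << i for i in range(7)]                      # 1, 2, 4, ..., 64
DATA_POS = [p for p in range(1, 72) if p not in CHECK_POS]  # the 64 data slots

def encode(data: int) -> list:
    """Return 72 bits: index 0 is overall parity, 1..71 is the Hamming word."""
    word = [0] * 72
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for c in CHECK_POS:
        # Check bit c makes parity even over every position whose index has bit c set.
        word[c] = sum(word[p] for p in range(1, 72) if p & c) % 2
    word[0] = sum(word) % 2  # overall parity turns SEC into SEC-DED
    return word

def decode(word: list):
    """Return (status, data); data is None when the word cannot be repaired."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 72):
        if word[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a clean word
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < 72:
        word[syndrome] ^= 1  # single error: the syndrome is its position
        status = "corrected"
    elif overall_odd:
        return "uncorrectable", None
    else:
        return "double-error", None  # even parity but nonzero syndrome
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= word[pos] << i
    return status, data

clean = encode(0xDEADBEEFCAFEF00D)
one_bad = list(clean)
one_bad[37] ^= 1
two_bad = list(one_bad)
two_bad[5] ^= 1
print(decode(clean)[0], decode(one_bad)[0], decode(two_bad)[0])
```

    One flipped bit anywhere in the 72 is located and repaired; two flipped bits are reported but not repaired, which is the "single-error correct, double-error detect" behavior described above.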

    Note that caveat: "As long as you always read or write in multiples of 64 bits." By the time you get to the 486 era, on-board L1 caches started to become standard equipment. Caches can turn a single byte read or write into a multiple byte line-fill (assuming they do read-allocate and write-allocate). They can also make writes wider. In write-back mode, they tend to write back the entire cache line if any portion was updated. In write-through mode, they could theoretically package additional bytes from the cache line to go with whatever bytes the CPU wrote to get to a minimum data size. (I don't know if the 486 or Pentium actually did this, FWIW. I'm speaking of general principles of operation.)

    The combination of caches and wider buses made ECC practical for PC hardware starting with the Pentium. That's why you started to see it in that time frame and not before.

    BTW, the error rate for individual DRAM bit flips should increase as the bits get smaller. It doesn't surprise me that your Pentium Pro's bits never flipped. It was probably built around 16 megabit DRAM chips, or maybe 64 megabit. If you compare a 16 megabit DRAM chip to a 1 gigabit DRAM chip of the same physical size, the bit cells on the gigabit chip are 1/64th the size. That means far fewer electrons holding the bit. As you can imagine, that might increase the likelihood of error per bit. Google's study didn't show an increase in error rate across memory technologies, but its window of memory technologies didn't stretch back 15 years to the Pentium Pro era.

    There's also just the total quantity of memory. Your Pentium Pro system probably had at most 128MB. Compare that to a modern system with 4GB. A 4GB system has 32x the memory of a 128MB system. Even if the per-bit error rate remained constant, there are 32x as many bits, so 32x as many errors. Modern systems also implement scrubbing, meaning they actively read all of memory in the background looking for errors. Older systems just waited for the CPU to access a word with a bad bit to raise an error. This also makes the observed error rate drastically different, since many errors would go by unnoticed in a system without scrubbing, but would get proactively noticed (and fixed) in a system with scrubbing.
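
    The two ratios used in this comment (64x density, 32x capacity) and the linear scaling of errors with bit count can be written out directly. A small Python sketch; the per-bit error rate is an arbitrary placeholder, not a figure from the study:

```python
MIB = 2**20
GIB = 2**30

# Chip density: 1 gigabit vs 16 megabit on the same die area.
density_ratio = (1 * GIB) // (16 * MIB)       # bits per chip, new / old

# System capacity: 4 GB vs 128 MB.
capacity_ratio = (4 * GIB) // (128 * MIB)

def expected_errors(total_bits: int, per_bit_rate: float) -> float:
    """Expected error count if every bit fails independently at the same rate."""
    return total_bits * per_bit_rate

PER_BIT_RATE = 1e-12  # hypothetical errors per bit per year, for illustration only
old_errors = expected_errors(128 * MIB * 8, PER_BIT_RATE)
new_errors = expected_errors(4 * GIB * 8, PER_BIT_RATE)
print(density_ratio, capacity_ratio, new_errors / old_errors)
```

    Whatever the real per-bit rate is, holding it constant makes the expected error count grow by the same factor of 32 as the memory size.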

    FWIW, I run my systems these days with ChipKill ECC enabled and scrubbing enabled. Not taking chances. I'll give up 3-5% on performance since most of the time I won't notice it.
