Google Finds DRAM Errors More Common Than Believed 333
An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.
Percentage? (Score:4, Interesting)
"a mean of 3,751 correctable errors per DIMM per year."
I'm much to lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
Re:Percentage? (Score:3, Interesting)
Your post:
Post before yours:
The 'components bit' of your post may be spot-on, but the juxtaposition of your temperature claim, along with the previous poster's quoting of the abstract FTA, is funny (to me, anyway).
ECC on a home system? (Score:5, Interesting)
I've always thought it would be a nice-to-have feature for my home system to have ECC - perhaps it might degrade over time and misbehave less if it could detect and fix some errors. But my normal sources don't seem to stock many choices. E.g. Newegg appears to have 2 motherboards to choose from, both for AMD CPUs, nothing for Intel. Frys appears to have one, same, AMD only. Is this just the way things are, or do I need to be looking somewhere else? Would picking one of these motherboards end up in not working out well for my gaming rig?
Dell (Score:5, Interesting)
In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...
Re:ZFS (Score:3, Interesting)
Adding checksumming adds another place for errors to occur though -- if data is written correctly but the checksum is-miscalculated, either before it is stored or when the data is being verified -- you'll end up throwing out perfectly good data. If you also have redundancy you're probably willing to live with that, but if you're running on single disk ZFS is just adding more opportunities for data corruption in RAM.
Misleading, to say the very least. (Score:5, Interesting)
Read the article and remember they are talking averages here.
They give it away with this line:
Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems
Essentially, only 8% of their ECC DIMM's reported ANY errors in a given year.
Also this was pretty telling:
Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.
And this:
For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.
So essentially, they are saying that only 8% of DIMMSs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.
I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues... that DIMMs are actually more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With Billions of instructions per second, even an astronomically low average of undetected cpu errors would guarantee an error at least as often as failed DIMMs.
What I did take from the article was that without ECC ram, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.
"RAID"-style system for RAM... (Score:4, Interesting)
RAM is dirt cheap and most server systems support significantly more RAM than most people bother to install. For critical systems, ECC works but that doesn't prevent everything (double bit errors etc.). Is it time for a Redundant Array of Inexpensive DIMMs? Many HA servers now support Memory Mirroring (aka RAID-1 http://www.rackaid.com/resources/rackaid-blog/server-dysfunction/memory_mirroring_to_the_rescue/ [rackaid.com]) but should there be more research into different RAID levels for memory (RAID5-6, 10, etc?)
Re:Percentage? (Score:3, Interesting)
IIRC ECC ram has extra bits and hardware to fix any single bit error and record that it happened.
Regular ram only has parity which can tell the MB the data is suspect but not which bit flipped. Kernel panic, Blue Screen, Guru Meditation# whatever.
It's the same RAM, just arranged differently on the DIMM.
I once had a dual Pentium PRO that required ECC RAM. BIOS recorded 0 ECC errors in the three years or so that was my primary machine. Which is what the Google study would lead me to expect.
Re:Percentage? (Score:2, Interesting)
Want to confirm? Look at your bittorrent log. (Score:5, Interesting)
Especially if the torrent is old. Some of them may be sabotage activity, but I doubt that, considering kind of files.
They are not transmission errors: TCP-IP checks for that. Not hard drive errors - again checksums. They can be intrasystem transmission errors though.
I remember folks who did complete checkers wrote that they had a lot of them too.
Re:Dell (Score:5, Interesting)
Hyundai is Hynix and they are the second largest DRAM manufacturer by marketshare (roughly 20% second to Samsung's 30%).
Its no surprise that you've only seen Hynix brand fail in Dells, chances are they are in 90%+ of Dell (and HP and Apple) boxes because they primarily buy from Hynix in the first place. Its selection bias.
Re:Bus errors! (Score:2, Interesting)
This machine compiled a lot of source (it was a Gentoo box), so surely if errors like these had been happening frequently we'd have known from heaps of signal-elevens killing the compiles all the time, right?
~24 hours of Memtest86 revealed nothing. Googling at the time found someone with the exact same mobo+CPU having problems gunzipping the exact same file (with the correct MD5), and I wondered if there was some specific bit-pattern in the file (or gunzip's state) that b0rked on my mobo. In retrospect I should have tried Solaris x86 on the same machine to try gunzipping the file.
Re:Percentage? (Score:0, Interesting)
the mobo's used by google are the cheapest boards they can get made. There is NO testing until they hit the datacenter floor. Crap mobo plus poor environment (high heat and vibration + poor power controls) makes for a high failure rate. ECC ram has an odd number of memory chips. The odd chip allows for the parity ram. Google memory has even chip counts since non-ECC ram is much MUCH less expensive. So the bios is custom and carves out ECC function from non-ECC ram
Radiation Effects (Score:5, Interesting)
Re:Want to confirm? Look at your bittorrent log. (Score:5, Interesting)
The TCP/IP checksums are really weak, only 16bits and rather a poor algorithm anyway. So more than one in 65 thousand errors will be undetected by a TCP/IP checksum. And that's not including buggy network adaptors and drivers that 'fix' or ignore the checksums.
If you're transferring gigabytes of data you really need something a lot better.
Still that's probably not the most common source of errors. You see the same problem exists when data is transferred across an IDE or SCSI bus if there's a checksum at all it's very weak and the amounts of data transferred across a disk bus are scary.
Difficult to find parts that support ECC (Score:5, Interesting)
When I was building the computer I'm typing this on, I had the grand idea of building it with so much RAM that I could basically work from RAM. Meaning, for example, that all my running programs and the project I was working on would have to fit in RAM.
Of course, with such a dream, I was concerned about the reliability of my memory. So I wanted ECC. I found out that having ECC memory is not just a matter of buying ECC memory. There are different kinds of ECC memory, and you need to find a combination of memory, motherboard, and CPU that works together. Many sites that offer CPUs and/or motherboards don't list support for ECC among the specifications. Searching for it is difficult, because searching for "ECC" also returns hits for things like "non-ECC" and "ECC: no".
Finally, I found a combination of motherboard and CPU that would support unbuffered ECC DDR2, and a matching pair of memory modules to go with it. And then, when I got all the parts, the RAM didn't fit in the motherboard. Turns out the RAM was FB-DIMM, which had not been listed in the advertisement. I gave up and just bought 2GB of non-ECC RAM to just get the system working. The FB-DIMM (all 8GB of it) is still sitting here, because I haven't found anyone who wants to buy it from me.
Lessons learned: 1. The saying "the nice thing about standards is that there are so many to choose from" is still relevant. I don't know why there have to be so many hardware interfaces to memory chips, but there are, so be careful. 2. Apparently, nobody really cares about ECC RAM, otherwise information would be easier to find. 3. Apparently, AMD CPUs and matching motherboards more usually support ECC RAM than Intel parts and matching motherboards.
Re:"RAID"-style system for RAM... (Score:3, Interesting)
I think OP's point was, say you have 4G of non-ECC RAM. It would be neat if you could turn that into, say, 2G of "RAID RAM".
Re:Percentage? (Score:4, Interesting)