Google Finds DRAM Errors More Common Than Believed - Slashdot



Posted by kdawson on Tuesday October 06, 2009 @02:57PM from the forget-me-not dept.

An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.

This discussion has been archived. No new comments can be posted.


Percentage? (Score:4, Interesting)

by Runaway1956 ( 1322357 ) * writes: on Tuesday October 06, 2009 @02:57PM (#29661065) Homepage Journal

"a mean of 3,751 correctable errors per DIMM per year."
I'm much too lazy to do the math. Let's round up: 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
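
A rough sketch of that math, assuming the quoted mean applies evenly to a single DIMM over a non-leap year. The 1 GiB DIMM size is a hypothetical value chosen only to get a per-bit figure; the thread does not give one.

```python
# Back-of-the-envelope rates for the figure quoted above.
ERRORS_PER_DIMM_PER_YEAR = 3751   # mean quoted from the study
HOURS_PER_YEAR = 365 * 24         # 8760, ignoring leap years

errors_per_hour = ERRORS_PER_DIMM_PER_YEAR / HOURS_PER_YEAR
hours_between_errors = HOURS_PER_YEAR / ERRORS_PER_DIMM_PER_YEAR

# Hypothetical DIMM size (assumption, not from the thread).
DIMM_BYTES = 1 * 1024**3
errors_per_bit_per_year = ERRORS_PER_DIMM_PER_YEAR / (DIMM_BYTES * 8)

print(f"{errors_per_hour:.3f} correctable errors per DIMM per hour")
print(f"roughly one every {hours_between_errors:.2f} hours")
print(f"{errors_per_bit_per_year:.2e} errors per bit per year")
```

Per bit the rate is tiny, but per DIMM it works out to an error every couple of hours on average, so whether it counts as "vanishingly small" depends on which denominator you pick.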

Re:Percentage? (Score:3, Interesting)

by Red Flayer ( 890720 ) writes: on Tuesday October 06, 2009 @03:13PM (#29661277) Journal

Humorous ordering of replies to this article.

Your post:
Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.
Post before yours:
From the study's abstract:
"We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."
The 'components bit' of your post may be spot-on, but the juxtaposition of your temperature claim, along with the previous poster's quoting of the abstract FTA, is funny (to me, anyway).

ECC on a home system? (Score:5, Interesting)

by eison ( 56778 ) writes: <pkteison&hotmail,com> on Tuesday October 06, 2009 @03:14PM (#29661301) Homepage

I've always thought it would be a nice-to-have feature for my home system to have ECC - perhaps it would misbehave less as it degrades over time if it could detect and fix some errors. But my normal sources don't seem to stock many choices. E.g. Newegg appears to have 2 motherboards to choose from, both for AMD CPUs, nothing for Intel. Frys appears to have one, same, AMD only. Is this just the way things are, or do I need to be looking somewhere else? Would picking one of these motherboards end up not working out well for my gaming rig?

Dell (Score:5, Interesting)

by ^_^x ( 178540 ) writes: on Tuesday October 06, 2009 @03:20PM (#29661389)

In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...

Re:ZFS (Score:3, Interesting)

by profplump ( 309017 ) writes: <zach-slashjunk@kotlarek.com> on Tuesday October 06, 2009 @03:24PM (#29661439)

Adding checksumming adds another place for errors to occur though: if data is written correctly but the checksum is miscalculated, either before it is stored or when the data is being verified, you'll end up throwing out perfectly good data. If you also have redundancy you're probably willing to live with that, but if you're running on a single disk, ZFS is just adding more opportunities for data corruption in RAM.

Misleading, to say the very least. (Score:5, Interesting)

by jhfry ( 829244 ) writes: on Tuesday October 06, 2009 @03:25PM (#29661453)

Read the article and remember they are talking averages here.
They give it away with this line:
Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems
Essentially, only 8% of their ECC DIMMs reported ANY errors in a given year.
Also this was pretty telling:
Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.

And this:
For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.
So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.
I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues... that DIMMs are actually more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With billions of instructions per second, even an astronomically low average of undetected CPU errors would guarantee an error at least as often as failed DIMMs.
What I did take from the article was that without ECC ram, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.
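
A small sketch of how skewed those averages are. It assumes the 3,751 figure quoted elsewhere in the thread is a mean over all DIMMs, which the thread does not confirm, and it treats the 20%/90% split as exact.

```python
# How concentrated are the errors, if the quoted numbers are taken at face value?
MEAN_ERRORS_ALL_DIMMS = 3751    # per DIMM per year (assumed to be a fleet-wide mean)
FRACTION_WITH_ERRORS = 0.08     # share of DIMMs reporting any error in a year

# If 92% of DIMMs report nothing, the whole mean is carried by the other 8%.
mean_among_affected = MEAN_ERRORS_ALL_DIMMS / FRACTION_WITH_ERRORS

# Among machines with errors: 20% of them account for 90% of the errors.
TOP_SHARE, TOP_ERRORS = 0.20, 0.90
per_machine_top = TOP_ERRORS / TOP_SHARE                # relative errors per "bad" machine
per_machine_rest = (1 - TOP_ERRORS) / (1 - TOP_SHARE)   # relative errors per other machine
ratio = per_machine_top / per_machine_rest

print(f"mean among DIMMs with errors: about {mean_among_affected:,.0f} per year")
print(f"a top-20% machine sees about {ratio:.0f}x the errors of the rest")
```

Under those assumptions a handful of bad machines dominates the totals, which is consistent with the point that the headline mean says little about a typical DIMM.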

"RAID"-style system for RAM... (Score:4, Interesting)

by MattRog ( 527508 ) writes: on Tuesday October 06, 2009 @03:28PM (#29661483)

RAM is dirt cheap and most server systems support significantly more RAM than most people bother to install. For critical systems, ECC works but it doesn't prevent everything (double bit errors etc.). Is it time for a Redundant Array of Inexpensive DIMMs? Many HA servers now support Memory Mirroring (aka RAID-1, see http://www.rackaid.com/resources/rackaid-blog/server-dysfunction/memory_mirroring_to_the_rescue/) but should there be more research into different RAID levels for memory (RAID5-6, 10, etc?)
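
To make the RAID-5 idea concrete, here is a toy sketch of XOR parity across equal-sized blocks. The "DIMMs" are just hypothetical byte strings; this is an illustration of the parity trick, not a description of how any real memory controller works.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical "data DIMMs" holding equal-sized contents.
dimms = [b"alpha123", b"bravo456", b"delta789"]
parity = xor_blocks(dimms)   # contents of the extra "parity DIMM"

# Pretend DIMM 1 has failed and is known to be the bad one.
failed = 1
survivors = [d for i, d in enumerate(dimms) if i != failed]
rebuilt = xor_blocks(survivors + [parity])

print(rebuilt == dimms[failed])
```

Note the catch: parity alone can rebuild a block that is known to be bad, but it cannot tell which block silently flipped a bit. Something like ECC or a checksum is still needed to locate the error.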

Re:Percentage? (Score:3, Interesting)

by HornWumpus ( 783565 ) writes: on Tuesday October 06, 2009 @03:31PM (#29661513)

IIRC ECC ram has extra bits and hardware to fix any single bit error and record that it happened.
Regular ram only has parity which can tell the MB the data is suspect but not which bit flipped. Kernel panic, Blue Screen, Guru Meditation# whatever.
It's the same RAM, just arranged differently on the DIMM.
I once had a dual Pentium PRO that required ECC RAM. The BIOS recorded 0 ECC errors in the three years or so that it was my primary machine. Which is what the Google study would lead me to expect.
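
The difference between "fix any single bit error" and "tell you the data is suspect" can be shown with a toy Hamming(7,4) code next to a single parity bit. This is a classroom-sized sketch; ECC DIMMs typically use a wider SECDED code over 64 data bits, and the function names here are made up for the example.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(c):
    """Return (corrected codeword, 1-based error position, or 0 if clean)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # the syndrome names the flipped bit
    if position:
        c[position - 1] ^= 1
    return c, position

def parity_ok(bits):
    """Even parity: detects an odd number of flips but cannot locate them."""
    return sum(bits) % 2 == 0

data = [1, 0, 1, 1]
word = hamming74_encode(data)
damaged = list(word)
damaged[4] ^= 1                        # flip the bit at position 5
fixed, position = hamming74_correct(damaged)
print(fixed == word, position)

with_parity = data + [sum(data) % 2]   # plain parity-protected word
with_parity[2] ^= 1                    # flip one bit
print(parity_ok(with_parity))          # the error is detected, but not where
```

The Hamming syndrome points at the exact bit to flip back, which is what lets the hardware correct and log the error instead of halting.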

Re:Percentage? (Score:2, Interesting)

by skirtsteak_asshat ( 1622625 ) writes: on Tuesday October 06, 2009 @03:37PM (#29661603)

Well, consider that they had a board CUSTOM MADE for them, which means custom BIOS fitments, custom feature implementations, custom BUGS. Then add the reality that is DRAM - an imperfect 'art' form of data storage and retrieval. No two chips are EXACTLY the same... though very close. Manufacturing defects may not manifest themselves under normal conditions, and require heating/cooling cycles or fluctuating voltages to break down. Running ECC performs a basic parity check, nothing more, and it's still possible to pass bad bits with ECC enabled, just much less likely. The idea is that you can't really test subcomponents individually and have them check out, and then assemble a system and expect it to just 'work'. Some RAM is pretty damn finicky. Standards are anything but.

Want to confirm? Look at your bittorrent log. (Score:5, Interesting)

by sshir ( 623215 ) writes: on Tuesday October 06, 2009 @03:38PM (#29661627)

Seriously. If you download a lot, and I do, you see quite a few checksum mismatches in the log.
Especially if the torrent is old. Some of them may be sabotage activity, but I doubt that, considering the kind of files.

They are not transmission errors: TCP/IP checks for that. Not hard drive errors - again checksums. They can be intrasystem transmission errors though.

I remember folks who did complete checkers wrote that they had a lot of them too.
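
For reference, a minimal sketch of the kind of per-piece check a BitTorrent client performs. BitTorrent (v1) stores a SHA-1 hash per piece; the tiny piece size and helper names below are hypothetical, chosen to keep the example short.

```python
import hashlib

PIECE_LENGTH = 16   # hypothetical; real torrents use far larger pieces

def piece_hashes(data, piece_length=PIECE_LENGTH):
    """SHA-1 digest of each fixed-size piece, as stored in a torrent's metadata."""
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]

def bad_pieces(data, expected, piece_length=PIECE_LENGTH):
    """Indexes of pieces whose hash does not match the expected one."""
    actual = piece_hashes(data, piece_length)
    return [i for i, (got, want) in enumerate(zip(actual, expected)) if got != want]

original = bytes(range(64))
expected = piece_hashes(original)

received = bytearray(original)
received[37] ^= 0x04                   # one flipped bit somewhere along the way
mismatches = bad_pieces(bytes(received), expected)
print(mismatches)
```

The check only says that a piece arrived or was stored wrong. It cannot say whether the flip happened on the wire, on a bus, or in RAM, which is why the log alone does not settle the cause.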

Re:Dell (Score:5, Interesting)

by Jah-Wren Ryel ( 80510 ) writes: on Tuesday October 06, 2009 @03:43PM (#29661723)

Hyundai is Hynix and they are the second largest DRAM manufacturer by marketshare (roughly 20% second to Samsung's 30%).
It's no surprise that you've only seen Hynix brand fail in Dells, chances are they are in 90%+ of Dell (and HP and Apple) boxes because they primarily buy from Hynix in the first place. It's selection bias.

Re:Bus errors! (Score:2, Interesting)

by dotgain ( 630123 ) writes: on Tuesday October 06, 2009 @03:44PM (#29661745) Homepage Journal

I had one mobo, can't remember the brand/model exactly, but the CPU was an AMD K6-2 450MHz, and back then we ran XFree86, which came as seven gzipped tarballs (if you compiled from source). I think it was file number three that would never gunzip on my PC, "invalid compressed data - CRC error", but the MD5 checked out, so I tried it on another machine and it was indeed fine (and this is back when MD5 was thought secure).
This machine compiled a lot of source (it was a Gentoo box), so surely if errors like these had been happening frequently we'd have known from heaps of signal-elevens killing the compiles all the time, right?
~24 hours of Memtest86 revealed nothing. Googling at the time found someone with the exact same mobo+CPU having problems gunzipping the exact same file (with the correct MD5), and I wondered if there was some specific bit-pattern in the file (or gunzip's state) that b0rked on my mobo. In retrospect I should have tried Solaris x86 on the same machine to try gunzipping the file.

Re:Percentage? (Score:0, Interesting)

by Anonymous Coward writes: on Tuesday October 06, 2009 @03:58PM (#29661983)

The mobos used by Google are the cheapest boards they can get made. There is NO testing until they hit the datacenter floor. Crap mobo plus poor environment (high heat and vibration + poor power controls) makes for a high failure rate. ECC RAM has an odd number of memory chips. The odd chip allows for the parity RAM. Google memory has even chip counts since non-ECC RAM is much MUCH less expensive. So the BIOS is custom and carves out ECC function from non-ECC RAM.

Radiation Effects (Score:5, Interesting)

by Maximum Prophet ( 716608 ) writes: on Tuesday October 06, 2009 @04:01PM (#29662023)

At Purdue, many years ago, one of the engineers mapped the ECC RAM errors in a room with hundreds of SPARCstations and found that they were mostly in a cone shape pointed toward the window. That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.

Re:Want to confirm? Look at your bittorrent log. (Score:5, Interesting)

by rdebath ( 884132 ) writes: on Tuesday October 06, 2009 @04:30PM (#29662421)

The TCP/IP checksums are really weak, only 16 bits and rather a poor algorithm anyway. So more than one in 65 thousand errors will be undetected by a TCP/IP checksum. And that's not including buggy network adaptors and drivers that 'fix' or ignore the checksums.
If you're transferring gigabytes of data you really need something a lot better.
Still, that's probably not the most common source of errors. The same problem exists when data is transferred across an IDE or SCSI bus: if there's a checksum at all, it's very weak, and the amounts of data transferred across a disk bus are scary.
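
A quick sketch of why the 16-bit Internet checksum is weak. The function follows the ones' complement sum described in RFC 1071; the payloads are made-up test bytes. Because the sum is order-independent, swapped 16-bit words and offsetting changes slip through.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

payload    = b"\x12\x34\xab\xcd\x00\x01\xff\xfe"
swapped    = b"\xab\xcd\x12\x34\x00\x01\xff\xfe"   # first two words exchanged
offsetting = b"\x12\x35\xab\xcc\x00\x01\xff\xfe"   # one word +1, another -1
one_flip   = b"\x12\x34\xab\xcd\x00\x01\xff\xff"   # a single flipped bit

for name, blob in [("payload", payload), ("swapped", swapped),
                   ("offsetting", offsetting), ("one_flip", one_flip)]:
    print(f"{name:10s} {internet_checksum(blob):#06x}")
```

The single flipped bit changes the checksum, but the swapped and offsetting variants produce the same value as the original payload, so that corruption would pass unnoticed at this layer.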

Difficult to find parts that support ECC (Score:5, Interesting)

by RAMMS+EIN ( 578166 ) writes: on Tuesday October 06, 2009 @04:46PM (#29662659) Homepage Journal

When I was building the computer I'm typing this on, I had the grand idea of building it with so much RAM that I could basically work from RAM. Meaning, for example, that all my running programs and the project I was working on would have to fit in RAM.
Of course, with such a dream, I was concerned about the reliability of my memory. So I wanted ECC. I found out that having ECC memory is not just a matter of buying ECC memory. There are different kinds of ECC memory, and you need to find a combination of memory, motherboard, and CPU that works together. Many sites that offer CPUs and/or motherboards don't list support for ECC among the specifications. Searching for it is difficult, because searching for "ECC" also returns hits for things like "non-ECC" and "ECC: no".
Finally, I found a combination of motherboard and CPU that would support unbuffered ECC DDR2, and a matching pair of memory modules to go with it. And then, when I got all the parts, the RAM didn't fit in the motherboard. Turns out the RAM was FB-DIMM, which had not been listed in the advertisement. I gave up and just bought 2GB of non-ECC RAM to just get the system working. The FB-DIMM (all 8GB of it) is still sitting here, because I haven't found anyone who wants to buy it from me.
Lessons learned: 1. The saying "the nice thing about standards is that there are so many to choose from" is still relevant. I don't know why there have to be so many hardware interfaces to memory chips, but there are, so be careful. 2. Apparently, nobody really cares about ECC RAM, otherwise information would be easier to find. 3. Apparently, AMD CPUs and matching motherboards more commonly support ECC RAM than Intel parts and matching motherboards.

Re:"RAID"-style system for RAM... (Score:3, Interesting)

by TJamieson ( 218336 ) writes: on Tuesday October 06, 2009 @04:51PM (#29662725)

I think OP's point was, say you have 4G of non-ECC RAM. It would be neat if you could turn that into, say, 2G of "RAID RAM".

Re:Percentage? (Score:4, Interesting)

by wagnerrp ( 1305589 ) writes: on Tuesday October 06, 2009 @05:54PM (#29663529)

Actually, they are custom motherboards. They are a non-standard form factor, using a custom 12V power connector, instead of a normal ATX/EPS plug. When you figure they're buying tens of thousands of these systems, why would you not have an OEM build you custom boards?

