An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.
"a mean of 3,751 correctable errors per DIMM per year."
I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
"We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."
Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.
Yeah, but let's look at the more common situation - a home. Variable temperatures, most likely QUITE variable power quality, low-quality PSU, and almost certainly no UPS to make up for it. Add that to low-quality commodity components (mobo & RAM).
I'd not be surprised to find the problem much more prevalent in non-datacenter environments.
Switching to high-quality memory, PSU & UPS has made my systems unbelievably reliable the last several years. YMMV, but I doubt by much.
The article suggests that errors are less likely on systems with few DIMMs and on those which are less heavily used, and that there was no significant difference among types of RAM or vendors, at least with regard to ECC RAM. Thus, laptop and desktop users, who likely have only 2 or 3 DIMMs and make only casual use of their systems, have a lower risk of errors. Non-ECC RAM may in general be of lower quality than ECC RAM, and thus more prone to error, but its usage is also less mission-critical. In addition, ECC RAM is usually used in systems with many DIMMs that are run 24/7/365.
Good news
The study had several findings that are good news for consumers:
* Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn't necessary.
* The problem isn't getting worse. The latest, most dense generations of DRAM perform as well, error wise, as previous generations.
* Heavily used systems have more errors - meaning casual users have less to worry about.
* No significant differences between vendors or DIMM types (DDR1, DDR2 or FB-DIMM). You can buy on price - at least for the ECC-type DIMMS they investigated.
* Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems.
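To put a rough number on "fewer DIMMs = fewer error problems", here is a minimal sketch. It assumes each DIMM independently has the quoted 8% yearly chance of reporting at least one error. That independence is my assumption, not the study's (the summary itself says errors cluster on certain machines), and the function name is made up for illustration.

```python
# Assumption: every DIMM independently has the ~8% annual chance of
# reporting at least one error quoted above. Real errors cluster on
# particular machines, so treat this as a rough approximation only.
P_DIMM = 0.08


def p_machine_affected(n_dimms: int, p_dimm: float = P_DIMM) -> float:
    """Chance that at least one of n_dimms reports an error in a year."""
    return 1.0 - (1.0 - p_dimm) ** n_dimms


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} DIMMs: {p_machine_affected(n):.1%}")
```

Under that assumption a 2-DIMM desktop sits at about 15% per year, and the figure climbs quickly as slots are filled.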
IIRC ECC ram has extra bits and hardware to fix any single bit error and record that it happened.
Regular ram only has parity which can tell the MB the data is suspect but not which bit flipped. Kernel panic, Blue Screen, Guru Meditation# whatever.
It's the same RAM, just arranged differently on the DIMM.
I once had a dual Pentium PRO that required ECC RAM. BIOS recorded 0 ECC errors in the three years or so that was my primary machine. Which is what the Google study would lead me to expect.
The original PC added a 9th bit to each byte, creating parity RAM. It was unique among personal computers at the time. None (or nearly none) of the original PC's contemporaries did this. But, since IBM did, many clones followed suit in the PC space. Macs, notably, didn't support ECC for many, many years, but if you pop open a Columbia Data Products PC [textfiles.com], you'll see parity RAM. (Note "128K RAM with parity" in that scan.) IBM went with byte parity in part because bytes were the smallest memory unit the CPU read or wrote to the memory. With byte parity, every memory access could be protected.
This ratio of 9/8 stuck with the PC's memory system over the years, following it to ever wider interfaces. That includes the 16 bit buses of the 286 and 386SX, the 32-bit buses of the 386DX and 486, and the 64 bit bus of the original Pentium. While many manufacturers made the byte parity optional as a cost saver, it was still rather common.
Once you get to 64 bits, you have 8 extra parity bits for a total memory width of 72 bits. This is enough bits to implement a single-error correct, double-error detect Hamming code [wikipedia.org] on the 64-bit data. As long as you always read or write in multiples of 64 bits, you can also generate the Hamming code on writes and check it on reads.
Note that caveat: "As long as you always read or write in multiples of 64 bits." By the time you get to the 486 era, on-board L1 caches started to become standard equipment. Caches can turn a single byte read or write into a multiple byte line-fill (assuming they do read-allocate and write-allocate). They can also make writes wider. In write-back mode, they tend to write back the entire cache line if any portion was updated. In write-through mode, they could theoretically package additional bytes from the cache line to go with whatever bytes the CPU wrote to get to a minimum data size. (I don't know if the 486 or Pentium actually did this, FWIW. I'm speaking of general principles of operation.)
The combination of caches and wider buses made ECC practical for PC hardware starting with the Pentium. That's why you started to see it in that time frame and not before.
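For anyone curious what "single-error correct, double-error detect" over 64 data bits plus 8 check bits looks like, here is a minimal sketch of a textbook extended Hamming (72,64) code. The bit layout (index 0 as the overall parity bit, check bits at the power-of-two positions) and the function names are my own choices for illustration. Real memory controllers may use a different SEC-DED construction, so don't read this as any chipset's actual scheme.

```python
from typing import List, Optional, Tuple

# Illustrative layout (an assumption, not a real controller's):
# index 0 holds the overall parity bit, indices 1..71 are Hamming positions.
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]
DATA_POS = [i for i in range(1, 72) if i not in PARITY_POS]  # 64 positions


def encode(data: int) -> List[int]:
    """Encode a 64-bit integer into a 72-bit codeword (list of 0/1)."""
    code = [0] * 72
    for k, pos in enumerate(DATA_POS):
        code[pos] = (data >> k) & 1
    for p in PARITY_POS:
        # Check bit p makes parity even over all positions with bit p set.
        code[p] = sum(code[i] for i in range(1, 72) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity enables double-error detection
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the word is uncorrectable."""
    syndrome = 0
    for i in range(1, 72):
        if code[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2  # 1 means an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < 72:
        fixed[syndrome] ^= 1  # single flip; syndrome 0 means the overall bit
        status = "corrected"
    else:
        return "uncorrectable", None  # even number of flips, or out of range
    data = sum(fixed[pos] << k for k, pos in enumerate(DATA_POS))
    return status, data


if __name__ == "__main__":
    word = 0xDEADBEEFCAFEF00D
    cw = encode(word)
    cw[37] ^= 1  # one flipped bit: correctable
    print(decode(cw)[0])
    cw[5] ^= 1   # a second flipped bit: detected, not correctable
    print(decode(cw)[0])
```

Note that the whole 72-bit word has to be read or written together for the check bits to make sense, which is the "multiples of 64 bits" caveat above.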
BTW, the error rate for individual DRAM bit flips should increase as the bits get smaller. It doesn't surprise me that your Pentium Pro's bits never flipped. It was probably built around 16 megabit DRAM chips, or maybe 64 megabit. If you compare a 16 megabit DRAM chip to a 1 gigabit DRAM chip of the same physical size, the bit cells on the gigabit chip are 1/64th the size. That means far fewer electrons holding the bit. As you can imagine, that might increase the likelihood of error per bit. Google's study didn't show an increase in error rate across memory technologies, but its window of memory technologies didn't stretch back 15 years to the Pentium Pro era.
There's also just the total quantity of memory. Your Pentium Pro system probably had at most 128MB. Compare that to a modern system with 4GB. A 4GB system has 32x the memory of a 128MB system. Even if the per-bit error rate remained constant, there are 32x as many bits, so 32x as many errors. Modern systems also implement scrubbing, meaning they actively read all of memory in the background looking for errors. Older systems just waited for the CPU to access a word with a bad bit to raise an error. This also makes the observed error rate drastically different, since many errors would go by unnoticed in a system without scrubbing, but would get proactively noticed (and fixed) in a system with scrubbing.
FWIW, I run my systems these days with ChipKill ECC enabled and scrubbing enabled. Not taking chances. I'll give up 3-5% on performance since most of the time I won't notice it.
Yeah, but let's look at the more common situation - a home. Variable temperatures, most likely QUITE variable power quality, low-quality PSU, and almost certainly no UPS to make up for it. Add that to low-quality commodity components (mobo & RAM).
The vast majority of people have laptops now, which come with a built-in UPS.
Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.
Post before yours:
From the study's abstract:
"We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking
I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
Except it depends on how the modules were originally tested. The study is saying that they break more than previously thought, rather than that they break a lot. If they were originally tested in a stressed system similar to Google's, and Google is finding that they have far more errors than they should, then their study is still valid.
"a mean of 3,751 correctable errors per DIMM per year."
Hey, the ECC did its job! Let's all go home.
I'm much too lazy to do the math.
I tried, based on the abstract. Wound up getting a figure of 8% of 2 gigabyte systems having 10 RAM failures per hour and the other 92% being just peachy. While a few bits going south is AFAIK the most common failure state for RAM, some of those RAM sticks must be complete no-POST duds and some are errors-up-the-wazoo massive swaths of RAM corrupted, so that throws my back of the envelope math WAY off....
In other words, big numbers make Gronk head hurt. Gronk go make fire. Gronk go make boat. Gronk go make fire-in-a-boat. Gronk no happy with fire-in-a-boat. Boat no work, and fire no work, all at same time.
Sorry, lost my thread there. So yeah, complex numbers, hard math, random assumptions that bugger our conclusions and maybe bugger theirs.
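For what it's worth, the back-of-the-envelope version can be redone with the two figures quoted in this thread. This is only a sketch: it assumes the 3,751 mean is taken over all DIMMs, that every error lands on the 8% of DIMMs that report any, and that a "2 gigabyte system" means two 1 GB DIMMs that are both affected. None of those assumptions comes from the paper.

```python
MEAN_CE_PER_DIMM_YEAR = 3751  # "a mean of 3,751 correctable errors per DIMM per year"
FRACTION_AFFECTED = 0.08      # "Only 8% of DIMMs had errors per year"
HOURS_PER_YEAR = 24 * 365     # 8760

# Assumption: the mean covers ALL DIMMs, and all errors fall on the 8%
# of DIMMs that report any errors at all.
per_affected_dimm_year = MEAN_CE_PER_DIMM_YEAR / FRACTION_AFFECTED
per_affected_dimm_hour = per_affected_dimm_year / HOURS_PER_YEAR

# Assumption: a "2 gigabyte system" is two 1 GB DIMMs, both affected.
per_system_hour = 2 * per_affected_dimm_hour

if __name__ == "__main__":
    print(f"errors per affected DIMM per year: {per_affected_dimm_year:,.0f}")
    print(f"errors per affected DIMM per hour: {per_affected_dimm_hour:.1f}")
    print(f"errors per 2-DIMM system per hour: {per_system_hour:.1f}")
```

That lands at roughly 5 errors per hour per affected DIMM, the same ballpark as the 10-per-hour figure above, and it ignores the skew toward a few very bad machines entirely.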
The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
The problem with something like this is the assumption that Google world == real world.
This RAM is all running on custom Google boards that no one else has access to, with custom power supplies in custom cases in custom storage units. To the researchers' credit, they split things by platform later on, but that just means Google-custom-jobbie-1 and Google-custom-jobbie-2, not Intel board/Asus board/Gigabyte board. Without listing the platforms down to chipsets and CPU types (not gonna happen), it's hard to compare data and check methodology.
While Google is the only place you're going to find literal metric tons of RAM to play with, the common factor that it's all Google might be throwing the numbers. At least some confirmation that these numbers hold at someone else's data center would be nice.
But then, I didn't RTWholeFA, so maybe I missed something.
No, I don't believe so. They use server boards, custom made to their specs. And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all. http://news.cnet.com/8301-1001_3-10209580-92.html [cnet.com] If you're really interested, that story gives you a starting point to google from.
No, I don't believe so. They use server boards, custom made to their specs.
I suppose it depends on how you define "server board". Room for tons of ECC RAM and two CPUs is server or serious-workstation class (or maybe I-just-use-Notepad-and-my-sales-guy-is-on-commission class), but I think once you're on to custom boards that only use certain voltages of electricity, you've moved into a class by yourself.
And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all.
Section 7: "All DIMMs were equipped with error correcting logic (ECC) to correct at least single bit errors."
Actually, they are custom motherboards. They are a non-standard form factor, using a custom 12V power connector, instead of a normal ATX/EPS plug. When you figure they're buying tens of thousands of these systems, why would you not have an OEM build you custom boards?
I work on server design, specifically motherboards. ECC is a feature, it helps prevent bit errors from passing through undetected. It is not a method for preventing errors from happening in the first place, nor does it influence the number of bit errors. That is a property of the motherboard design, the chipset, the DIMM PCB and the DRAM. Second, just because you provide a spec for a mobo, does not mean that it is all inclusive. Generally people specify form factor, power, features. They don't specify quality and in most cases don't give a criteria for what it means for a feature to "work". In fact most customers I've talked to don't really understand what quality means from hardware (and sometimes in general). Hardware management, much like software, is designed with similar principles of impact/effort: if customers don't care, we don't test. In other words if it ain't listed on the box, or the salesman won't write it down, just assume it wasn't done.
In spite of the fact that computer motherboards are digital electronics, there is in fact anything but a binary determination of "work" and "not work". Digital signals are an engineering approximation, one which falls apart at high speeds, dense routing and inexpensive design. Well designed and tested motherboards have a well known bit error rate, and reliable companies will not ship a new design until they meet their target. I do this on systems I design, but they aren't cheap, not by a lot. It is a very expensive, time consuming process, one which most companies really want to get rid of. Not all systems are so thoroughly tested, in fact the vast majority of boards out there, server or otherwise, aren't tested much at all.
Forking money for ECC is very similar to paying the mob to protect you. Yes, it will give you more peace of mind, but what you really want is to not be having these problems to begin with. For people who care about data integrity, you should be asking what the bit error rate is and how they know. If they don't know, then you don't want it, ECC or no ECC. Don't assume "the industry" is equal, and don't assume that because a vendor's product X is really good that their product Y is really good too: you WILL be wrong, particularly on computers.
Comparing ECC to mob protection is not a very good analogy. ECC lets you detect and in some cases fix memory errors. The key is the detection part.
If you get a single bit error which results in corrupt data, unless you verify that data some other way you won't know about it unless you have ECC. Verifying data multiple times is computationally expensive and degrades performance, and most server OSs and software don't do it anyway.
As well as error detection the fact that you know it was the memory which corru
Did you even read the article? They found that heat WAS NOT one of the factors. Which makes the rest of your statement seem like just as much bullshit.
Hard DRAM errors are rather hard to explain if the cells are good -- maybe a bad write. After much DRAM testing (I use memtest86+ weeklong), I've yet to find bad cells.
What I have seen (and generated) is the occasional (2-3/day) bus error with specific (nasty) data patterns, usually at a few addresses. I write that off to mobo trace design and crosstalk between the signals. Failing to round the corners sufficiently, or leaving spurs, is the likely problem. I think HyperTransport is a balanced design (push-pull differential like Ethernet) and should be less susceptible.
I had a RAM stick (256MB DDR I think) with a stuck bit once. At first I just noticed a few odd kernel panics, but then I got a syntax error in a system Perl script. One letter had changed from lowercase to uppercase. That's when I ran memtest86 and found the culprit.
At the time, a "mark pages of memory bad" patch for the kernel did the trick and I happily used that borked stick for a year or so.
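The lowercase-to-uppercase symptom fits a single stuck bit nicely: in ASCII, upper and lower case differ only in bit 5 (0x20). A minimal sketch, assuming the stuck cell happened to be that bit and read back as 0; which bit was actually stuck in that stick is not known, and the helper name is made up.

```python
CASE_BIT = 0x20  # bit 5 is the only difference between 'a' (0x61) and 'A' (0x41)


def stuck_at_zero(byte: int, mask: int) -> int:
    """Model a memory cell whose masked bit always reads back as 0."""
    return byte & ~mask & 0xFF


# Assumption: the first byte of this keyword sits on the faulty cell.
original = b"print"
corrupted = bytes([stuck_at_zero(original[0], CASE_BIT)]) + original[1:]

if __name__ == "__main__":
    print(original, "->", corrupted)
```

A byte that already has that bit clear (an uppercase letter, say) passes through such a cell unchanged, which is why a fault like this can hide for a long time.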
I find that more often than not, when people get blue screens or frequent crashes, it's due to a bad RAM chip. I think it's kind of a bad thing that most motherboards don't really test the RAM when you boot up. Usually running the real RAM test will pick up on most memory errors. You don't even need to run memtest. Sure you save a few seconds on boot up, but it's often better to know there is a problem with your memory than go on for months thinking there is some other problem.
I've always thought it would be a nice-to-have feature for my home system to have ECC - perhaps it might degrade over time and misbehave less if it could detect and fix some errors. But my normal sources don't seem to stock many choices. E.g. Newegg appears to have 2 motherboards to choose from, both for AMD CPUs, nothing for Intel. Frys appears to have one, same, AMD only. Is this just the way things are, or do I need to be looking somewhere else? Would picking one of these motherboards end up in not working out well for my gaming rig?
ECC is slower by something like 1%, which is completely unnoticeable since RAM contributes relatively little to the overall system performance. 2x faster RAM won't make things run twice as fast, because normally CPU caches get a > 90% hit ratio. Otherwise things would be incredibly slow, as the fastest RAM still is horribly slow and has a horrible latency compared to the cache.
A lot of the AMD boards support ECC RAM but newegg doesn't show it. Most every AM2 motherboard supports it. My main workstation at home is a Phenom II with 8GB ECC RAM mainly for that reason.
In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...
Hyundai is Hynix and they are the second largest DRAM manufacturer by marketshare (roughly 20% second to Samsung's 30%).
It's no surprise that you've only seen Hynix brand fail in Dells, chances are they are in 90%+ of Dell (and HP and Apple) boxes because they primarily buy from Hynix in the first place. It's selection bias.
Read the article and remember they are talking averages here.
They give it away with this line:
Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems
Essentially, only 8% of their ECC DIMMs reported ANY errors in a given year.
Also this was pretty telling:
Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.
And this:
For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.
So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.
I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues... that DIMMs are actually more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With Billions of instructions per second, even an astronomically low average of undetected cpu errors would guarantee an error at least as often as failed DIMMs.
What I did take from the article was that without ECC ram, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.
RAM is dirt cheap and most server systems support significantly more RAM than most people bother to install. For critical systems, ECC works but that doesn't prevent everything (double bit errors etc.). Is it time for a Redundant Array of Inexpensive DIMMs? Many HA servers now support Memory Mirroring (aka RAID-1 http://www.rackaid.com/resources/rackaid-blog/server-dysfunction/memory_mirroring_to_the_rescue/ [rackaid.com]) but should there be more research into different RAID levels for memory (RAID5-6, 10, etc?)
Seriously. If you download a lot, and I do, you see quite a few checksum mismatches in the log.
Especially if the torrent is old. Some of them may be sabotage activity, but I doubt that, considering kind of files.
They are not transmission errors: TCP-IP checks for that. Not hard drive errors - again checksums. They can be intrasystem transmission errors though.
I remember folks who did complete checkers wrote that they had a lot of them too.
The TCP/IP checksums are really weak, only 16bits and rather a poor algorithm anyway. So more than one in 65 thousand errors will be undetected by a TCP/IP checksum. And that's not including buggy network adaptors and drivers that 'fix' or ignore the checksums.
If you're transferring gigabytes of data you really need something a lot better.
Still, that's probably not the most common source of errors. The same problem exists when data is transferred across an IDE or SCSI bus: if there's a checksum at all it's very weak, and the amounts of data transferred across a disk bus are scary.
The checksum used by TCP is several orders of magnitude more likely to match a corrupted packet than the checksum used by bittorrent. (citation [psu.edu])
More than likely these are transmission errors where the TCP checksum matched but the bittorrent checksum did not.
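To illustrate how weak the TCP checksum is compared with a cryptographic hash, here is a minimal sketch of the 16-bit ones' complement sum in the style of RFC 1071, next to SHA-1 (the per-piece hash used by the original BitTorrent protocol). The sample payloads are made up. Because the sum is just addition, swapping two 16-bit words leaves it unchanged.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Made-up payloads for illustration.
good = b"AABBCCDD"
swapped = b"BBAACCDD"  # first two 16-bit words exchanged
flipped = b"AABBCCDE"  # a single bit changed in the last byte

if __name__ == "__main__":
    for name, payload in (("good", good), ("swapped", swapped), ("flipped", flipped)):
        print(f"{name:8s} tcp-style={internet_checksum(payload):#06x} "
              f"sha1={hashlib.sha1(payload).hexdigest()[:12]}")
```

The single bit flip is caught by both, but the reordered payload sails past the 16-bit sum and only the SHA-1 comparison notices it.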
At Purdue, many years ago, one of the engineers mapped the ECC RAM errors in a room with hundreds of sparc stations and found that it was mostly in a cone shape pointed toward the window. That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.
That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.
Alpha radiation is stopped by a sheet of office paper. It certainly wouldn't make it through the window, through the machine case, electromagnetic shield, circuit board, chip case, and into the silicon. Even beta radiation would be unlikely to make it that far.
What is much more likely: thermal effects. IE, infrared from the sun heating up machines near the window.
My takeaway from this paper is that maybe google should hire more technicians who are experienced with non-ecc ram systems. They even believed, prior to this study, that soft errors were the most common error state. I could have told you from the start that was bunk. In over 15 years of burn-in tests as part of pc maintenance, the number of soft-errors observed is... 0. Either the hardware can make it through the test with no error, or there is a DIMM that will produce several errors over a 24 hour test. This doesn't mean that random soft errors never happen when I'm not looking/testing, but the 'conventional wisdom' that soft errors are the predominant memory error doesn't even pass the laugh test.
From looking at the numbers on this report, I get the feeling that hardware vendors are using ECC as an excuse to overlook flaws on flaky hardware. I would now be really interested in a study that compares the real world reliability of ECC vs non-ECC hardware that has been properly QC'd. I'll wager the results would be very interesting, even if ECC still proves itself worth the extra money.
When I was building the computer I'm typing this on, I had the grand idea of building it with so much RAM that I could basically work from RAM. Meaning, for example, that all my running programs and the project I was working on would have to fit in RAM.
Of course, with such a dream, I was concerned about the reliability of my memory. So I wanted ECC. I found out that having ECC memory is not just a matter of buying ECC memory. There are different kinds of ECC memory, and you need to find a combination of memory, motherboard, and CPU that works together. Many sites that offer CPUs and/or motherboards don't list support for ECC among the specifications. Searching for it is difficult, because searching for "ECC" also returns hits for things like "non-ECC" and "ECC: no".
Finally, I found a combination of motherboard and CPU that would support unbuffered ECC DDR2, and a matching pair of memory modules to go with it. And then, when I got all the parts, the RAM didn't fit in the motherboard. Turns out the RAM was FB-DIMM, which had not been listed in the advertisement. I gave up and just bought 2GB of non-ECC RAM to just get the system working. The FB-DIMM (all 8GB of it) is still sitting here, because I haven't found anyone who wants to buy it from me.
Lessons learned: 1. The saying "the nice thing about standards is that there are so many to choose from" is still relevant. I don't know why there have to be so many hardware interfaces to memory chips, but there are, so be careful. 2. Apparently, nobody really cares about ECC RAM, otherwise information would be easier to find. 3. Apparently, AMD CPUs and matching motherboards more usually support ECC RAM than Intel parts and matching motherboards.
Just as likely to crash, less likely to silently scribble bits of nonsense all over the filesystem before doing so...
Obviously, not having RAM errors would be even nicer; but, if you can at least detect trouble when it arises rather than well afterwards, you can avoid having it propagate further, and get away with using cheap redundancy instead of expensive perfection.
Adding checksumming adds another place for errors to occur though: if data is written correctly but the checksum is miscalculated, either before it is stored or when the data is being verified, you'll end up throwing out perfectly good data. If you also have redundancy you're probably willing to live with that, but if you're running on a single disk, ZFS is just adding more opportunities for data corruption in RAM.
Percentage? (Score:4, Interesting)
"a mean of 3,751 correctable errors per DIMM per year."
I'm much too lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
Re:Percentage? (Score:5, Informative)
"We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account."
Re: (Score:2)
Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.
Re:Percentage? (Score:5, Insightful)
Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.
Yeah, but let's look at the more common situation - a home. Variable temperatures, most likely QUITE variable power quality, low-quality PSU, and almost certainly no UPS to make up for it. Add that to low-quality commodity components (mobo & RAM).
I'd not be surprised to find the problem much more prevalent in non-datacenter environments.
Switching to high-quality memory, PSU & UPS has made my systems unbelievably reliable the last several years. YMMV, but I doubt by much.
Re:Percentage? (Score:5, Informative)
Good news
The study had several findings that are good news for consumers:
* Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn't necessary.
* The problem isn't getting worse. The latest, most dense generations of DRAM perform as well, error wise, as previous generations.
* Heavily used systems have more errors - meaning casual users have less to worry about.
* No significant differences between vendors or DIMM types (DDR1, DDR2 or FB-DIMM). You can buy on price - at least for the ECC-type DIMMS they investigated.
* Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems.
Re: (Score:3, Interesting)
IIRC ECC ram has extra bits and hardware to fix any single bit error and record that it happened.
Regular ram only has parity which can tell the MB the data is suspect but not which bit flipped. Kernel panic, Blue Screen, Guru Meditation# whatever.
It's the same RAM, just arranged differently on the DIMM.
I once had a dual Pentium PRO that required ECC RAM. BIOS recorded 0 ECC errors in the three years or so that was my primary machine. Which is what the Google study would lead me to expect.
Re:Percentage? (Score:5, Informative)
"Regular RAM" has neither parity nor ECC.
The original PC added a 9th bit to each byte, creating parity RAM. It was unique among personal computers at the time. None (or nearly none) of the original PC's contemporaries did this. But, since IBM did, many clones followed suit in the PC space. Macs, notably, didn't support ECC for many, many years, but if you pop open a Columbia Data Products PC [textfiles.com], you'll see parity RAM. (Note "128K RAM with parity" in that scan.) IBM went with byte parity in part because bytes were the smallest memory unit the CPU read or wrote to the memory. With byte parity, every memory access could be protected.
This ratio of 9/8 stuck with the PC's memory system over the years, following it to ever wider interfaces. That includes the 16 bit buses of the 286 and 386SX, the 32-bit buses of the 386DX and 486, and the 64 bit bus of the original Pentium. While many manufacturers made the byte parity optional as a cost saver, it was still rather common.
Once you get to 64 bits, you have 8 extra parity bits for a total memory width of 72 bits. This is enough bits to implement a single-error correct, double-error detect Hamming code [wikipedia.org] on the 64-bit data. As long as you always read or write in multiples of 64 bits, you can also generate the Hamming code on writes and check it on reads.
Note that caveat: "As long as you always read or write in multiples of 64 bits." By the time you get to the 486 era, on-board L1 caches started to become standard equipment. Caches can turn a single byte read or write into a multiple byte line-fill (assuming they do read-allocate and write-allocate). They can also make writes wider. In write-back mode, they tend to write back the entire cache line if any portion was updated. In write-through mode, they could theoretically package additional bytes from the cache line to go with whatever bytes the CPU wrote to get to a minimum data size. (I don't know if the 486 or Pentium actually did this, FWIW. I'm speaking of general principles of operation.)
The combination of caches and wider buses made ECC practical for PC hardware starting with the Pentium. That's why you started to see it in that time frame and not before.
BTW, the error rate for individual DRAM bit flips should increase as the bits get smaller. It doesn't surprise me that your Pentium Pro's bits never flipped. It was probably built around 16 megabit DRAM chips, or maybe 64 megabit. If you compare a 16 megabit DRAM chip to a 1 gigabit DRAM chip of the same physical size, the bit cells on the gigabit chip are 1/64th the size. That means far fewer electrons holding the bit. As you can imagine, that might increase the likelihood of error per bit. Google's study didn't show an increase in error rate across memory technologies, but its window of memory technologies didn't stretch back 15 years to the Pentium Pro era.
There's also just the total quantity of memory. Your Pentium Pro system probably had at most 128MB. Compare that to a modern system with 4GB. A 4GB system has 32x the memory of a 128MB system. Even if the per-bit error rate remained constant, there are 32x as many bits, so 32x as many errors. Modern systems also implement scrubbing, meaning they actively read all of memory in the background looking for errors. Older systems just waited for the CPU to access a word with a bad bit to raise an error. This also makes the observed error rate drastically different, since many errors would go by unnoticed in a system without scrubbing, but would get proactively noticed (and fixed) in a system with scrubbing.
FWIW, I run my systems these days with ChipKill ECC enabled and scrubbing enabled. Not taking chances. I'll give up 3-5% on performance since most of the time I won't notice it.
Re: (Score:3, Insightful)
Yeah, but let's look at the more common situation - a home. Variable temperatures, most likely QUITE variable power quality, low-quality PSU, and almost certaily no UPS to make up for it. Add that to low-quality commodity components (mobo & RAM).
The vast majority of people have laptops now, which come with a built-in UPS.
Re: (Score:3, Informative)
UPS - Uninterruptible Power Supply
Many UPSs now also include a power conditioner, but a UPS is not a power conditioner.
Re: (Score:3, Interesting)
Your post:
Post before yours:
Re: (Score:2)
I'm much to lazy to do the math. Let's round up - 4k errors per year has to be a vanishingly small percentage for a system that is up 24/7/365, or 5 nines. The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
Except it depends on how the modules were originally tested. The study is saying that they break more than previously thought, rather than that they break a lot. If they were originally tested in a stressed system similar to Google's, and Google is finding that they have far more errors than they should, then the study is still valid.
Re:Percentage? (Score:4, Insightful)
"a mean of 3,751 correctable errors per DIMM per year."
Hey, the ECC did its job! Let's all go home.
I'm much to lazy to do the math.
I tried, based on the abstract. Wound up getting a figure of 8% of 2 gigabyte systems having 10 RAM failures per hour and the other 92% being just peachy. While a few bits going south is AFAIK the most common failure state for RAM, some of those RAM sticks must be complete no-POST duds and some are errors-up-the-wazoo massive swaths of RAM corrupted, so that throws my back of the envelope math WAY off....
In other words, big numbers make Gronk head hurt. Gronk go make fire. Gronk go make boat. Gronk go make fire-in-a-boat. Gronk no happy with fire-in-a-boat. Boat no work, and fire no work, all at same time.
Sorry, lost my thread there. So yeah, complex numbers, hard math, random assumptions that bugger our conclusions and maybe bugger theirs.
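For the record, the back-of-the-envelope math can be sketched in a few lines of Python. This is only a rough reconstruction under loud assumptions: all errors land on the 8% of DIMMs that see any, they are spread evenly across those DIMMs and across the year, and a 2 gigabyte system holds two DIMMs (that last number is a guess, not something from the study).

```python
# Rough reconstruction of the envelope math; every input is either a figure
# quoted in this thread or a labeled assumption.
MEAN_ERRORS_PER_DIMM_YEAR = 3751    # quoted mean, correctable errors per DIMM per year
FRACTION_DIMMS_WITH_ERRORS = 0.08   # quoted: about 8% of DIMMs see errors in a year
HOURS_PER_YEAR = 24 * 365
DIMMS_PER_SYSTEM = 2                # assumption: 2GB system built from two DIMMs

# If the other 92% of DIMMs see nothing, the affected ones carry the whole mean.
per_affected_dimm_year = MEAN_ERRORS_PER_DIMM_YEAR / FRACTION_DIMMS_WITH_ERRORS
per_affected_dimm_hour = per_affected_dimm_year / HOURS_PER_YEAR
per_affected_system_hour = per_affected_dimm_hour * DIMMS_PER_SYSTEM

print(f"{per_affected_dimm_hour:.1f} errors/hour per affected DIMM")
print(f"{per_affected_system_hour:.1f} errors/hour per affected system")
```

That lands in the neighborhood of 10 per hour for an affected system, and it is exactly the "spread evenly" assumption that the no-POST duds and massively corrupted sticks would break.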
The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.
The problem with something like this is the assumption that Google world == real world.
This RAM is all running on custom Google boards that no one else has access to, with custom power supplies in custom cases in custom storage units. To the researchers' credit, they split things by platform later on, but that just means Google-custom-jobbie-1 and Google-custom-jobbie-2, not Intel board/Asus board/Gigabyte board. Without listing the platforms down to chipsets and CPU types (not gonna happen), it's hard to compare data and check methodology.
While Google is the only place you're going to find literal metric tons of RAM to play with, the common factor that it's all Google might be throwing the numbers. At least some confirmation that these numbers hold at someone else's data center would be nice.
But then, I didn't RTWholeFA, so maybe I missed something.
Re:Percentage? (Score:5, Informative)
No, I don't believe so. They use server boards, custom made to their specs. And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all. http://news.cnet.com/8301-1001_3-10209580-92.html [cnet.com] If you're really interested, that story gives you a starting point to google from.
Re: (Score:3, Insightful)
No, I don't believe so. They use server boards, custom made to their specs.
I suppose it depends on how you define "server board". Room for tons of ECC RAM and two CPUs is server or serious-workstation class (or maybe I-just-use-Notepad-and-my-sales-guy-is-on-commission class), but I think once you're on to custom boards that only use certain voltages of electricity, you've moved into a class by yourself.
And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all.
Section 7: "All DIMMs were equipped with error correcting logic (ECC) to correct at least single bit errors."
So, yes, it's ECC.
Re:Percentage? (Score:4, Interesting)
Re:Percentage? (Score:5, Funny)
Re:Percentage? (Score:5, Informative)
I work on server design, specifically motherboards. ECC is a feature: it helps prevent bit errors from passing through undetected. It is not a method for preventing errors from happening in the first place, nor does it influence the number of bit errors. That is a property of the motherboard design, the chipset, the DIMM PCB and the DRAM. Second, just because you provide a spec for a mobo does not mean that it is all-inclusive. Generally people specify form factor, power, features. They don't specify quality and in most cases don't give criteria for what it means for a feature to "work". In fact most customers I've talked to don't really understand what quality means from hardware (and sometimes in general). Hardware management, much like software, is designed with similar principles of impact/effort: if customers don't care, we don't test. In other words, if it ain't listed on the box, or the salesman won't write it down, just assume it wasn't done.
In spite of the fact that computer motherboards are digital electronics, there is in fact anything but a binary determination of "work" and "not work". Digital signals are an engineering approximation, one which falls apart at high speeds, dense routing and inexpensive design. Well-designed and tested motherboards have a well-known bit error rate, and reliable companies will not ship a new design until they meet their target. I do this on systems I design, but they aren't cheap, not by a long shot. It is a very expensive, time-consuming process, one which most companies really want to get rid of. Not all systems are so thoroughly tested; in fact the vast majority of boards out there, server or otherwise, aren't tested much at all.
Forking over money for ECC is very similar to paying the mob to protect you. Yes, it will give you more peace of mind, but what you really want is to not be having these problems to begin with. If you care about data integrity, you should be asking what the bit error rate is and how they know. If they don't know, then you don't want it, ECC or no ECC. Don't assume "the industry" is equal, and don't assume that because a vendor's product X is really good that their product Y is really good too: you WILL be wrong, particularly on computers.
Re: (Score:3, Insightful)
Comparing ECC to mob protection is not a very good analogy. ECC lets you detect and in some cases fix memory errors. The key is the detection part.
If you get a single bit error which results in corrupt data, unless you verify that data some other way you won't know about it unless you have ECC. Verifying data multiple times is computationally expensive and degrades performance, and most server OSs and software don't do it anyway.
As well as error detection the fact that you know it was the memory which corru
Re: (Score:3, Insightful)
Then, for leaping god's sake, tell us who you work for!
Re:Percentage? (Score:4, Informative)
uh, article showed that temperature has nothing to do with it.
the rest is accurate.
Re: (Score:3, Informative)
... Running ECC performs a basic parity check, nothing more...
Not [wikipedia.org] exactly [wikipedia.org]...
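To make the "not exactly" concrete: a single parity bit can only tell you that an odd number of bits flipped, while a SECDED code (single error correct, double error detect) can locate and fix a one-bit error and still flag a two-bit error. Below is a toy sketch in Python on 4 data bits using an extended Hamming code. Real ECC DIMMs typically protect 64 data bits with 8 check bits; the function names here are made up for illustration and this is not how any particular memory controller implements it.

```python
from typing import List, Optional, Tuple

def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8                      # c[0] = overall parity, c[1..7] = Hamming(7,4)
    c[3], c[5], c[6], c[7] = d       # data bits sit at the non-power-of-two positions
    c[1] = c[3] ^ c[5] ^ c[7]        # check bit covering positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]        # check bit covering positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]        # check bit covering positions 4, 5, 6, 7
    c[0] = sum(c[1:]) % 2            # overall parity makes double errors detectable
    return c

def decode(codeword: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); data is None when the error is uncorrectable."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):          # XOR of the positions of all set bits
        if c[pos]:
            syndrome ^= pos
    overall = sum(c) % 2             # 0 for a valid codeword
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1             # single error; syndrome 0 means c[0] itself flipped
        status = "corrected"
    else:
        return None, "uncorrectable" # two bits flipped: detected but not fixable
    return c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3), status

word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[6] ^= 1

print(decode(word))       # (11, 'ok')
print(decode(one_flip))   # (11, 'corrected')
print(decode(two_flips))  # (None, 'uncorrectable')
```

A bare parity bit would have flagged `one_flip` without being able to repair it, and would have passed `two_flips` as clean.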
Re:Percentage? (Score:4, Informative)
Bus errors! (Score:5, Informative)
What I have seen (and generated) is the occasional (2-3/day) bus error with specific (nasty) data patterns, usually at a few addresses. I write that off to mobo trace design and crosstalk between the signals. Failing to round the corners sufficiently, or leaving spurs, is the likely problem. I think HyperTransport is a balanced design (push-pull differential like Ethernet) and should be less susceptible.
Re: (Score:3, Informative)
I had a RAM stick (256MB DDR I think) with a stuck bit once. At first I just noticed a few odd kernel panics, but then I got a syntax error in a system Perl script. One letter had changed from lowercase to uppercase. That's when I ran memtest86 and found the culprit.
At the time, a "mark pages of memory bad" patch for the kernel did the trick and I happily used that borked stick for a year or so.
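The lowercase-to-uppercase symptom is what a single bad bit looks like in ASCII text: upper- and lowercase letters differ only in bit 5 (0x20). A small Python illustration follows. Which bit was actually stuck on that stick, and whether it was stuck at 0, is an assumption here, and the sample line and helper name are invented.

```python
def read_with_stuck_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return the data as it would read back if one bit of one byte were stuck at 0."""
    out = bytearray(data)
    out[byte_index] &= ~(1 << bit) & 0xFF   # force the chosen bit to 0
    return bytes(out)

line = b"print $total;"                      # made-up line standing in for the script
print(read_with_stuck_bit(line, 0, 5))       # b'Print $total;'
```

A byte that already has that bit clear reads back unchanged, which fits a stuck bit staying invisible until the wrong data happens to land on it.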
Re: (Score:3, Insightful)
ECC on a home system? (Score:5, Interesting)
I've always thought ECC would be a nice-to-have feature for my home system: as the hardware degrades over time, perhaps it would misbehave less if it could detect and fix some errors. But my normal sources don't seem to stock many choices. E.g. Newegg appears to have 2 motherboards to choose from, both for AMD CPUs, nothing for Intel. Fry's appears to have one, same story, AMD only. Is this just the way things are, or do I need to be looking somewhere else? Would picking one of these motherboards work out badly for my gaming rig?
Re: (Score:2)
ECC is slightly slower.
Re: (Score:3, Informative)
ECC is slower by something like 1%, which is completely unnoticeable since RAM contributes relatively little to the overall system performance. 2x faster RAM won't make things run twice as fast, because normally CPU caches get a > 90% hit ratio. Otherwise things would be incredibly slow, as the fastest RAM still is horribly slow and has a horrible latency compared to the cache.
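The usual way to put numbers on this is average memory access time: hit time plus miss rate times miss penalty. Here is a sketch in Python; the cycle counts and the 95% hit ratio are illustrative assumptions, not measurements of any real machine.

```python
def avg_access_time(hit_cycles: float, ram_cycles: float, hit_ratio: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_cycles + (1.0 - hit_ratio) * ram_cycles

HIT_CYCLES = 4      # assumed cache hit latency
RAM_CYCLES = 200    # assumed cost of going out to RAM
HIT_RATIO = 0.95    # assumed; the post only says "> 90%"

baseline = avg_access_time(HIT_CYCLES, RAM_CYCLES, HIT_RATIO)
fast_ram = avg_access_time(HIT_CYCLES, RAM_CYCLES / 2, HIT_RATIO)      # RAM twice as fast
ecc_ram = avg_access_time(HIT_CYCLES, RAM_CYCLES * 1.01, HIT_RATIO)    # RAM 1% slower

print(f"baseline: {baseline:.1f} cycles")
print(f"2x faster RAM: {fast_ram:.1f} cycles ({baseline / fast_ram:.2f}x faster access)")
print(f"1% slower RAM: {ecc_ram:.1f} cycles ({(ecc_ram / baseline - 1) * 100:.2f}% slower access)")
```

With these made-up numbers, halving RAM latency speeds up the average access by well under 2x, and a 1% RAM penalty costs less than 1% even before counting the time the CPU spends not waiting on memory at all.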
Re: (Score:3, Informative)
Re:ECC on a home system? (Score:5, Informative)
Dell (Score:5, Interesting)
In my experience at work ordering Dell desktops and laptops, by far the most common defect is 1-3% of machines with bad RAM. Typically it's made by Hynix, occasionally Hyundai, and I've never seen other brands fail. On many occasions though, I've predicted Hynix, pulled it, and sure enough theirs was the piece causing the errors in Memtest86+...
Re:Dell (Score:5, Interesting)
Hyundai is Hynix, and they are the second largest DRAM manufacturer by market share (roughly 20%, second to Samsung's 30%).
It's no surprise that you've only seen Hynix-brand RAM fail in Dells; chances are it's in 90%+ of Dell (and HP and Apple) boxes because they primarily buy from Hynix in the first place. It's selection bias.
I thought that an inability to recall events (Score:4, Funny)
Misleading, to say the very least. (Score:5, Interesting)
Read the article and remember they are talking averages here.
They give it away with this line:
Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems
Essentially, only 8% of their ECC DIMMs reported ANY errors in a given year.
Also this was pretty telling:
Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent.
And this:
For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform.
So essentially, they are saying that only 8% of DIMMs reported errors, 90% of which were on 20% of the machines that had errors, mostly because of motherboard issues... yet DIMMs are less reliable than previously thought.
I would imagine that if you removed all of the bad motherboards, power supplies, environmental, and other issues... that DIMMs are actually more reliable than I previously thought, not less! I wonder what percentage of CPU operations yield incorrect results. With billions of instructions per second, even an astronomically low average rate of undetected CPU errors would guarantee an error at least as often as failed DIMMs.
What I did take from the article was that without ECC RAM, you have no way of knowing that your RAM has errors. I guess I should rethink my belief that ECC was a waste of money.
"RAID"-style system for RAM... (Score:4, Interesting)
RAM is dirt cheap and most server systems support significantly more RAM than most people bother to install. For critical systems, ECC works but that doesn't prevent everything (double bit errors etc.). Is it time for a Redundant Array of Inexpensive DIMMs? Many HA servers now support Memory Mirroring (aka RAID-1 http://www.rackaid.com/resources/rackaid-blog/server-dysfunction/memory_mirroring_to_the_rescue/ [rackaid.com]) but should there be more research into different RAID levels for memory (RAID5-6, 10, etc?)
Re: (Score:3, Insightful)
ECC IS Raid5 for RAM....
Re: (Score:3, Interesting)
I think OP's point was, say you have 4G of non-ECC RAM. It would be neat if you could turn that into, say, 2G of "RAID RAM".
Want to confirm? Look at your bittorrent log. (Score:5, Interesting)
Especially if the torrent is old. Some of them may be sabotage, but I doubt that, considering the kind of files.
They are not transmission errors: TCP/IP checks for that. Not hard drive errors either: again, checksums. They can be intrasystem transmission errors, though.
I remember folks who did complete checkers wrote that they had a lot of them too.
Re:Want to confirm? Look at your bittorrent log. (Score:5, Interesting)
The TCP/IP checksums are really weak: only 16 bits, and rather a poor algorithm anyway. So more than one in 65 thousand errors will go undetected by a TCP/IP checksum. And that's not including buggy network adaptors and drivers that 'fix' or ignore the checksums.
If you're transferring gigabytes of data you really need something a lot better.
Still, that's probably not the most common source of errors. The same problem exists when data is transferred across an IDE or SCSI bus: if there's a checksum at all, it's very weak, and the amounts of data transferred across a disk bus are scary.
Re:Want to confirm? Look at your bittorrent log. (Score:5, Informative)
The checksum used by TCP is several orders of magnitude more likely to match a corrupted packet than the checksum used by BitTorrent. (citation [psu.edu])
More than likely these are transmission errors where the TCP checksum matched but the BitTorrent checksum did not.
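A quick way to see how a corrupted packet can slip past TCP: the Internet checksum is a 16-bit ones' complement sum, so it cannot notice 16-bit words arriving in a different order, among other things. The sketch below is a minimal Python version of that checksum next to SHA-1, which is what BitTorrent (v1) uses for piece hashes. The sample bytes are invented and this is not a model of any particular failure in the wild.

```python
import hashlib

def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style used by TCP/IP."""
    if len(data) % 2:
        data += b"\x00"                          # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]    # add up big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16) # fold carries back in
    return ~total & 0xFFFF

good = b"\x12\x34\x56\x78\x9a\xbc"
bad = b"\x56\x78\x12\x34\x9a\xbc"                # first two 16-bit words swapped

print(internet_checksum(good) == internet_checksum(bad))          # True
print(hashlib.sha1(good).digest() == hashlib.sha1(bad).digest())  # False
```

Since addition doesn't care about order, the swapped payload checks out fine at the TCP layer and only the piece hash catches it.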
Radiation Effects (Score:5, Interesting)
clearly not a radiation engineer (Score:5, Insightful)
That window looked out to a pile of coal, so the culprit was assumed to be low level alpha radiation.
Alpha radiation is stopped by a sheet of office paper. It certainly wouldn't make it through the window, through the machine case, electromagnetic shield, circuit board, chip case, and into the silicon. Even beta radiation would be unlikely to make it that far.
What is much more likely: thermal effects, i.e., infrared from the sun heating up machines near the window.
Lessons learned from *Non* ECC RAM (Score:4, Insightful)
My takeaway from this paper is that maybe Google should hire more technicians who are experienced with non-ECC RAM systems. They even believed, prior to this study, that soft errors were the most common error state. I could have told you from the start that was bunk. In over 15 years of burn-in tests as part of PC maintenance, the number of soft errors observed is... 0. Either the hardware can make it through the test with no error, or there is a DIMM that will produce several errors over a 24 hour test. This doesn't mean that random soft errors never happen when I'm not looking/testing, but the 'conventional wisdom' that soft errors are the predominant memory error doesn't even pass the laugh test.
From looking at the numbers on this report, I get the feeling that hardware vendors are using ECC as an excuse to overlook flaws on flaky hardware. I would now be really interested in a study that compares the real world reliability of ECC vs non-ECC hardware that has been properly QC'd. I'll wager the results would be very interesting, even if ECC still proves itself worth the extra money.
Difficult to find parts that support ECC (Score:5, Interesting)
When I was building the computer I'm typing this on, I had the grand idea of building it with so much RAM that I could basically work from RAM. Meaning, for example, that all my running programs and the project I was working on would have to fit in RAM.
Of course, with such a dream, I was concerned about the reliability of my memory. So I wanted ECC. I found out that having ECC memory is not just a matter of buying ECC memory. There are different kinds of ECC memory, and you need to find a combination of memory, motherboard, and CPU that works together. Many sites that offer CPUs and/or motherboards don't list support for ECC among the specifications. Searching for it is difficult, because searching for "ECC" also returns hits for things like "non-ECC" and "ECC: no".
Finally, I found a combination of motherboard and CPU that would support unbuffered ECC DDR2, and a matching pair of memory modules to go with it. And then, when I got all the parts, the RAM didn't fit in the motherboard. Turns out the RAM was FB-DIMM, which had not been listed in the advertisement. I gave up and just bought 2GB of non-ECC RAM to just get the system working. The FB-DIMM (all 8GB of it) is still sitting here, because I haven't found anyone who wants to buy it from me.
Lessons learned: 1. The saying "the nice thing about standards is that there are so many to choose from" is still relevant. I don't know why there have to be so many hardware interfaces to memory chips, but there are, so be careful. 2. Apparently, nobody really cares about ECC RAM, otherwise information would be easier to find. 3. Apparently, AMD CPUs and matching motherboards support ECC RAM more often than Intel parts and matching motherboards.
Re:ZFS (Score:5, Insightful)
Changing your file system solves RAM errors how?
Re: (Score:3, Informative)
Obviously, not having RAM errors would be even nicer; but, if you can at least detect trouble when it arises rather than well afterwards, you can avoid having it propagate further, and get away with using cheap redundancy instead of expensive perfection.
Re: (Score:3, Interesting)
Adding checksumming adds another place for errors to occur though -- if data is written correctly but the checksum is miscalculated, either before it is stored or when the data is being verified -- you'll end up throwing out perfectly good data. If you also have redundancy you're probably willing to live with that, but if you're running on a single disk, ZFS is just adding more opportunities for data corruption in RAM.
Re: (Score:3, Funny)
Re:Gentoo?? (Score:5, Funny)
I would suspect that it has no bearing on you at all. Simply chanting "Gentoo Gentoo Gentoo" should cure any and all hardware errors. You're safe, AC.
I'll keep this fool occupied, someone go call the guys in white coats for me.
Re: (Score:3, Funny)
If you use Gentoo, you'll have to make your own DRAM from the schematics.