Data Storage Technology

Reliability of Computer Memory? 724

olddoc writes "In the days of 512MB systems, I remember reading about cosmic rays causing memory errors and how errors become more frequent with more RAM. Now, home PCs are stuffed with 6GB or 8GB and no one uses ECC memory in them. Recently I had consistent BSODs with Vista64 on a PC with 4GB; I tried memtest86 and it always failed within hours. Yet when I ran 64-bit Ubuntu at 100% load and using all memory, it ran fine for days. I have two questions: 1) Do people trust a memtest86 error to mean a bad memory module or motherboard or CPU? 2) When I check my email on my desktop 16GB PC next year, should I be running ECC memory?"
  • Answers (Score:5, Interesting)

    by jawtheshark ( 198669 ) * <slashdot@nosPAm.jawtheshark.com> on Monday March 30, 2009 @02:10AM (#27384835) Homepage Journal

    1) Yes

    2) No

    Now to be serious. Home PCs do not yet come with 6GB or 8GB. Most new home PCs still seem to have between 1GB and 4GB, and the 4GB variety is rare because most home PCs still ship with a 32-bit operating system. 3GB seems to be the sweet spot for higher-end home PCs. Your home PC will most likely not have 16GB next year. Your workstation at work, perhaps, but even that is doubtful.

    At the risk of sounding like "640KByte is enough for everyone", I have to ask why you think you need 16GB to check your email next year. I'm typing this on a 6-year-old computer, I'm running quite a few applications at the same time, and I know a second user is logged in. Current memory usage: 764MB of RAM. As a general rule, I know that Windows XP runs fine on 512MB of RAM and is comfortable with 1GB. The same is true for GNU/Linux running GNOME.

    Now, at work with Eclipse loaded, a couple of application servers, a database and a few VMs... Yeah, there indeed you get memory-starved quickly. You have to keep in mind that such a usage pattern is not that of a typical office worker. I can imagine that a heavy Photoshop user would want every bit of RAM he can get, too. The Word-wielding office worker? I don't think so.

    Now, I can't speak for Vista. I heard it runs well on 2GB systems, but I can't say. I got a new work laptop last week and briefly booted into Vista. It felt extremely sluggish, and my machine does have 4GB of RAM. Anyway, I didn't bother; I put Debian Lenny/amd64 on it and didn't look back.

    In my view, you have quite a twisted sense of reality regarding the computers people actually use.

    Oh, and frankly... if cosmic rays were a big issue by now with these large memories, don't you think more people would be complaining? I can't say why Ubuntu/amd64 ran fine on your machine. Perhaps GNU/Linux has built-in error correction and marks bad RAM as "bad".

  • by Anonymous Coward on Monday March 30, 2009 @02:14AM (#27384849)

    Another nice tool is prime95. I've used it when doing memory overclocking and it seemed to find the threshold fairly quickly. Of course your comment still stands - even if a software tool says the memory is good, it might not necessarily be true.
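
    Prime95's torture test is essentially a known-answer test: it repeats a deterministic, memory-heavy computation and flags any run whose result differs from the first one. A rough sketch of that idea in C (illustrative only, not prime95's actual FFT code; the buffer size and constants are arbitrary):

    /* Known-answer stress test sketch: repeat a deterministic computation
     * over a large buffer and compare checksums across runs.  Any mismatch
     * points at flaky RAM, CPU, or the memory controller. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define BUF_WORDS (64u * 1024u * 1024u)   /* 256 MB of 32-bit words */

    static uint32_t run_once(uint32_t *buf)
    {
        uint32_t x = 0x12345678u, sum = 0;
        for (size_t i = 0; i < BUF_WORDS; i++) {   /* fill deterministically */
            x = x * 1664525u + 1013904223u;        /* simple LCG */
            buf[i] = x;
        }
        for (size_t i = 0; i < BUF_WORDS; i++)     /* read back and fold */
            sum += buf[i] * 2654435761u;
        return sum;
    }

    int main(void)
    {
        uint32_t *buf = malloc(BUF_WORDS * sizeof *buf);
        if (!buf) { perror("malloc"); return 1; }

        uint32_t expected = run_once(buf);
        for (int pass = 1; ; pass++) {             /* run until stopped */
            uint32_t got = run_once(buf);
            if (got != expected) {
                printf("MISMATCH on pass %d: %08x != %08x\n", pass, got, expected);
                return 1;
            }
            printf("pass %d OK\n", pass);
        }
    }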

  • RAID(?) for RAM (Score:5, Interesting)

    by Xyde ( 415798 ) <slashdot@purrrrTIGER.net minus cat> on Monday March 30, 2009 @02:19AM (#27384875)

    With memory becoming so plentiful these days (granted, I haven't seen many home PCs with 6 or 8GB, but we're getting there), it seems that a single error on a large-capacity chip is becoming more and more trivial. Isn't it a waste to throw away a whole DIMM? Why isn't it possible to "remap" a known-bad address, or allocate some amount of RAM for parity the way software like PAR2 works? Hard drive manufacturers already remap bad blocks on new drives. Also, it seems to me that, being a solid-state device, small failures in RAM aren't necessarily indicative of a failing component the way bad sectors on a hard drive are. Am I missing something really obvious here, or is it really just easier/cheaper to throw it away?
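
    On the parity idea: the simplest software version is exactly what RAID-4 and PAR2 do: keep an XOR of N data blocks and, if one block is known to be bad (an erasure), rebuild it from the remaining blocks plus the parity. A toy sketch in C just to show the arithmetic (block count and sizes are made up; real schemes like Chipkill do this in the memory controller, not in software):

    /* XOR-parity sketch: one parity block protects N data blocks against
     * the loss of any single block whose index is known (an "erasure"). */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define NBLOCKS 4
    #define BLKSIZE 8

    static void compute_parity(uint8_t data[NBLOCKS][BLKSIZE], uint8_t parity[BLKSIZE])
    {
        memset(parity, 0, BLKSIZE);
        for (int b = 0; b < NBLOCKS; b++)
            for (int i = 0; i < BLKSIZE; i++)
                parity[i] ^= data[b][i];
    }

    /* Rebuild block `bad` by XOR-ing the parity with every other block. */
    static void rebuild(uint8_t data[NBLOCKS][BLKSIZE], const uint8_t parity[BLKSIZE], int bad)
    {
        memcpy(data[bad], parity, BLKSIZE);
        for (int b = 0; b < NBLOCKS; b++)
            if (b != bad)
                for (int i = 0; i < BLKSIZE; i++)
                    data[bad][i] ^= data[b][i];
    }

    int main(void)
    {
        uint8_t data[NBLOCKS][BLKSIZE] = {"block0", "block1", "block2", "block3"};
        uint8_t parity[BLKSIZE];

        compute_parity(data, parity);
        memset(data[2], 0xFF, BLKSIZE);            /* simulate a dead block */
        rebuild(data, parity, 2);
        printf("recovered: %s\n", (char *)data[2]); /* prints "block2" */
        return 0;
    }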

  • Re:Error response (Score:2, Interesting)

    by gabebear ( 251933 ) on Monday March 30, 2009 @02:24AM (#27384899) Homepage Journal
    Anyone else have RAM modules degrade over time? I've never seen this.

    I always buy faster modules than I'm actually using. I usually test the system with memtest at a higher frequency than it's going to run at. My last build overclocked to [2.7GHz CPU, 1066 FSB, 1066/CL7 DDR3] with memtest still reporting no errors; I run it at [2.1GHz, 800, 800/CL7] (a.k.a. stock speed).
  • Depends (Score:5, Interesting)

    by gweihir ( 88907 ) on Monday March 30, 2009 @02:24AM (#27384901)

    My experience with a server that recorded about 15TB of data is something like 6 bit errors per year that could not be traced to any source. This was a server with ECC RAM, so the problem likely occurred in buses, network cards, and the like, not in RAM.

    For non-ECC memory, I would strongly suggest running memtest86+ for at least a day before using the system, and if it gives you errors, replace the memory. I had one very persistent bit error in a PC in a cluster that actually required 2 days of memtest86+ to show up once, but occurred about once per hour for some computations. I also had another bit error that memtest86+ did not find, but the Linux command-line memory tester found after about 12 hours.

    The problem here is that different testing/usage patterns result in different occurrence probabilities for weak bits, i.e. bits that only sometimes fail. Any failure in memtest86+ or any other RAM tester indicates a serious problem. The absence of errors in a RAM test does not mean the memory is necessarily fine.
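
    For the curious, the user-space testers mentioned above boil down to hammering a chunk of memory with a series of patterns, precisely because a weak bit may only fail under a particular pattern. A stripped-down sketch in C of that kind of loop (illustrative only; real testers such as memtest86+ use many more patterns, test the address lines too, and run outside the OS or with the buffer locked):

    /* Minimal pattern test: write a pattern to a large buffer, read it back,
     * and report any words that differ.  Repeats with several patterns because
     * weak bits often fail only under specific data patterns. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define WORDS (32u * 1024u * 1024u)   /* 128 MB of 32-bit words */

    static const uint32_t patterns[] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u,
        0x0000FFFFu, 0xFFFF0000u, 0xDEADBEEFu
    };

    int main(void)
    {
        volatile uint32_t *buf = malloc(WORDS * sizeof *buf);
        if (!buf) { perror("malloc"); return 1; }

        for (unsigned p = 0; p < sizeof patterns / sizeof patterns[0]; p++) {
            uint32_t pat = patterns[p];
            for (size_t i = 0; i < WORDS; i++)     /* write pass */
                buf[i] = pat;
            for (size_t i = 0; i < WORDS; i++)     /* verify pass */
                if (buf[i] != pat)
                    printf("error at word %zu: wrote %08x, read %08x\n",
                           i, pat, buf[i]);
            printf("pattern %08x done\n", pat);
        }
        free((void *)buf);
        return 0;
    }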

    That said, I do not believe memory errors have become more common on a per-computer basis. RAM has become larger, but also more reliable. Of course, people participating in the stupidity called "overclocking" will see a lot more memory errors, and other errors as well. But a well-designed system with quality hardware and a thorough initial test should typically not have memory issues.

    However, there is "quality" hardware that gets it wrong. My ASUS board sets the timing for 2 and 4 memory modules to the values for 1 module. This resulted in stable 1- and 2-module operation, but got flaky with 4 modules. I finally moved to ECC memory before I figured out that I had to set the correct timings manually. (No BIOS upgrade available that fixed this...) The board has "professional" in its name, but apparently "professional" does not include use of generic (Kingston, no less) memory modules. Other people have memory issues with this board as well that they could not fix this way; it seems that sometimes a design is just bad, or even reputable manufacturers do not spend much effort fixing issues in some cases. I can only advise you to do a thorough forum search before buying a specific motherboard.

     

  • by a09bdb811a ( 1453409 ) on Monday March 30, 2009 @02:41AM (#27384967)

    Is ECC memory worth the money in a machine you use to check your E-mail?

    Unbuffered ECC is only a few $ more than unbuffered non-ECC. It's only 9 chips per side instead of 8, after all. The performance impact is marginal.

    I see no reason not to use ECC, except that Intel doesn't want you to. It seems they want to keep ECC as a 'server' feature (as if your desktop at home isn't 'serving' you your data). So none of their consumer chipsets support it, and the i7's memory controller doesn't either. AMD doesn't play that game with their chips, but it seems only ASUS actually implements the ECC support on most of their boards.

  • Re:RAID(?) for RAM (Score:3, Interesting)

    by Rufus211 ( 221883 ) <rufus-slashdotNO@SPAMhackish.org> on Monday March 30, 2009 @02:46AM (#27384995) Homepage

    You just described ECC scrubbing [wikipedia.org] and Chipkill [wikipedia.org]. The technology has been around for a while, but it costs more than $0 to implement, so most people don't bother. As with most RAS [wikipedia.org] features, most people don't know anything about them, so they would rather pay $50 less than have a strange feature that could end up saving them hours of downtime. At the same time, if you actually know what these features are and you need them, you're probably going to be willing to shell out the money to pay for them.
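
    For anyone wondering what scrubbing actually buys you: the memory controller periodically reads every line, so a correctable single-bit error gets fixed and written back before a second flip in the same word makes it uncorrectable. The real thing lives in the chipset, but conceptually it is just a slow background walk over memory, something like this purely illustrative C fragment:

    /* Conceptual patrol-scrub loop: touch every cache line of a region at a
     * low rate.  On ECC hardware, the read itself triggers detection and
     * (for single-bit errors) correction; this userspace version only shows
     * the access pattern and cannot correct anything by itself. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CACHE_LINE 64

    static void patrol_scrub(volatile uint8_t *region, size_t len)
    {
        for (;;) {                                 /* run forever in background */
            for (size_t off = 0; off < len; off += CACHE_LINE) {
                (void)region[off];                 /* read one byte per line */
                if ((off & 0xFFFFF) == 0)          /* throttle roughly every 1 MB */
                    usleep(1000);
            }
            sleep(60);                             /* pause between full passes */
        }
    }

    int main(void)
    {
        size_t len = 256u * 1024u * 1024u;         /* scrub a 256 MB buffer */
        uint8_t *region = calloc(1, len);
        if (!region) return 1;
        patrol_scrub(region, len);                 /* never returns */
    }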

  • Re:The truth (Score:5, Interesting)

    by Mr Z ( 6791 ) on Monday March 30, 2009 @02:49AM (#27385013) Homepage Journal

    Note: having more memory increases your error rate assuming a constant rate of error (per megabyte) in the memory. However, if the error rate drops as technology advances, adding more memory does not necessarily result in a higher system error rate. And based on what I've seen, this most definitely seems to be the case.

    Actually, error rates per bit are increasing, because bits are getting smaller and fewer electrons are holding the value for your bit. An alpha particle whizzing through your RAM will take out several bits if it hits the memory array at the right angle. Previously, the bits were so large that there was a good chance the bit wouldn't flip. Now they're small enough that multiple bits might flip.
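
    To put a rough number on the rates involved: soft-error rates are usually quoted in FIT (failures per billion device-hours) per megabit, and published figures span a huge range. Taking a purely hypothetical 1,000 FIT per Mbit as a round number, an 8GB non-ECC system would see on the order of a bit flip or two per day:

    /* Back-of-envelope soft-error estimate.  The FIT/Mbit figure below is a
     * hypothetical round number; published rates span orders of magnitude. */
    #include <stdio.h>

    int main(void)
    {
        double fit_per_mbit = 1000.0;             /* assumed: 1000 failures / 1e9 h / Mbit */
        double mbits        = 8.0 * 1024 * 8;     /* 8 GB = 65536 Mbit */
        double per_hour     = fit_per_mbit * mbits / 1e9;

        printf("expected flips/hour: %.4f\n", per_hour);            /* ~0.066 */
        printf("expected flips/day:  %.2f\n", per_hour * 24);       /* ~1.6   */
        printf("expected flips/year: %.0f\n", per_hour * 24 * 365); /* ~570   */
        return 0;
    }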

    This is why I run my systems with ECC memory and background scrubbing enabled. Scrubbing is where the system actively picks up lines and proactively fixes bit-flips as a background activity. I've actually had a bitflip translate into persistent corruption on the hard drive. I don't want that again.

    FWIW, I work in the embedded space architecting chips with large amounts of on-chip RAM. These chips go into various infrastructure pieces, such as cell phone towers. These days we can't sell such a part without ECC, and customers are always wanting more. We actually characterize our chip's RAM's bit-flip behavior by actively trying to cause bit-flips in a radiation-filled environment. Serious business.

    Now, other errors that parity/ECC used to catch, such as signal integrity issues from mismatched components or devices pushed beyond their margins... Yeah, I can see improved technology helping that.

  • Re:Paranoia? (Score:5, Interesting)

    by dgatwood ( 11270 ) on Monday March 30, 2009 @02:54AM (#27385029) Homepage Journal

    The probability of a cosmic ray at precisely the right angle and speed to cause a single bit error and cause an app to crash is somewhere on the same order as your chances of getting hit by a car, getting struck by lightning, getting torn apart by rabid wolves, and having sex in the back of a red 1948 Buick convertible at a drive-in movie theater on Tuesday night, Feb. 29th under a blue moon... all at the same time.... Sure, given enough bits, it's bound to happen sooner or later, but it isn't something I'd worry about. :-)

    The probability of RAM just plain being defective---failing to operate correctly due to bugs in handling of certain low power states, having actual bad bits, having insufficient decoupling capacitance to work correctly in the presence of power supply rail noise, etc---is probably several hundred thousand orders of magnitude greater (probably on the order of a one in several thousand chance of a given part being bad versus happening to a given part a few times before the heat death of the universe).

    Memory test failures (other than mapping errors) are pretty much always caused by hardware failing. If running memtest86 in Linux works correctly for days, this probably means one of three things:

    • A. Linux is detecting the bad part and is mapping out the RAM in question.
    • B. The Linux VM system doesn't move things around RAM as much as Windows. Thus, random chunks of code don't end up there, and the few that do are in rarely used parts of background daemons or unused kernel modules so you don't notice the problem.
    • C. Linux power management isn't as rough on the RAM or CPU as Windows. Dodgy RAM/CPUs are most likely to fail when you take them through power state changes like putting the machine to sleep or switching the CPU into or out of an idle state. If Linux is making power state changes less frequently, is not using some of the lowest power states, is not stepping clock speeds, is not dropping the RAM refresh rate in sleep mode, etc., then you are less likely to see memory corruption. Similarly, power state changes can increase the rate of crashes due to a defective CPU or memory controller (northbridge).

    I couldn't tell you which of these is the case without swapping out parts, of course. You should definitely take the time to replace whatever is bad even if it seems to be "working" in Linux. In the worst case, you have a few bad bits of RAM, they're somewhere in the middle of your disk cache in Linux, and you are slowly and silently corrupting data periodically on its way out to disk.... You definitely need to figure out what's wrong with the hardware and why it is only failing in Windows, and it sounds like the only way to do that is to swap out parts, boot into Windows, and see if the problem is still reproducible in under a couple of days, repeating with different part swaps until the problem goes away. Don't forget to try a different power supply.
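
    One cheap way to test for that last scenario before swapping parts is an end-to-end integrity check: stream a seeded pseudorandom pattern out to a big file, read it back, and compare against the regenerated stream; any mismatch means something (RAM, controller, or disk) mangled the data in flight. A rough, simplified sketch in C (hypothetical file name, no cache-dropping, minimal error handling):

    /* Write ~1 GB of seeded pseudorandom data to a file, then read it back
     * and verify against the regenerated stream.  A mismatch means the data
     * was corrupted somewhere between RAM and disk and back. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define CHUNK   (1u << 20)                    /* 1 MB chunks */
    #define NCHUNKS 1024u                         /* ~1 GB total */

    static uint32_t state;
    static void fill_chunk(uint8_t *buf, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {       /* xorshift32 stream */
            state ^= state << 13;
            state ^= state >> 17;
            state ^= state << 5;
            memcpy(buf + i, &state, 4);
        }
    }

    int main(void)
    {
        static uint8_t out[CHUNK], back[CHUNK];
        FILE *f = fopen("integrity.bin", "wb");
        if (!f) { perror("fopen"); return 1; }

        state = 0xC0FFEE01u;                      /* seed */
        for (unsigned c = 0; c < NCHUNKS; c++) {
            fill_chunk(out, CHUNK);
            fwrite(out, 1, CHUNK, f);
        }
        fclose(f);

        f = fopen("integrity.bin", "rb");
        if (!f) { perror("fopen"); return 1; }
        state = 0xC0FFEE01u;                      /* same seed -> same stream */
        for (unsigned c = 0; c < NCHUNKS; c++) {
            fill_chunk(out, CHUNK);
            if (fread(back, 1, CHUNK, f) != CHUNK || memcmp(out, back, CHUNK)) {
                printf("corruption detected in chunk %u\n", c);
                return 1;
            }
        }
        fclose(f);
        printf("all %u chunks verified OK\n", NCHUNKS);
        return 0;
    }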

  • Re:Paranoia? (Score:1, Interesting)

    by Jamie's Nightmare ( 1410247 ) on Monday March 30, 2009 @03:41AM (#27385233)
    This is a load of crap and your Pro-Linux bias stinks up to high heaven.

    If running memtest86 in Linux works correctly for days, this probably means one of three things:

    First of all, you don't run Memtest86 under Windows, Linux, or any other operating system. Why? Because you can't test memory that is in use by any other program. This already tells us that you probably haven't used Memtest86 recently enough to remember you would run this from a bootable CD or Floppy. It's downhill from here.

    A. Linux is detecting the bad part and is mapping out the RAM in question.

    No. Linux doesn't do this. Can you imagine the extra overhead of double checking every single read and write to RAM? Jesus Christ.

    B. The Linux VM system doesn't move things around RAM as much as Windows.

    Nice, baseless troll argument.

    C. Linux power management isn't as rough on the RAM or CPU as Windows.

    Isn't as rough? Because half the time it doesn't work as intended? So now a negative becomes a plus? Give us a break.

  • by Idaho ( 12907 ) on Monday March 30, 2009 @04:20AM (#27385409)

    My experience with memtest is that you can trust the results if it says the memory is bad; however, if the memory passes, it could still be bad.

    I wonder how strongly RAM stability depends on power fluctuations. While you're testing memory using memtest, the GPU is not used at all, for example. When playing a game and/or running some heavy compile jobs, on the other hand, overall power usage will be much higher. I wonder if this affects RAM stability, especially if the power supply is not really up to par?

    If so, you might never find out about such a problem by using (only) memtest.

  • Re:Yawn (Score:1, Interesting)

    by easyTree ( 1042254 ) on Monday March 30, 2009 @04:35AM (#27385489)

    I'd mod you down but you're already at -1. Stop whining about kdawson and whine about the posts instead! n00b

  • Re:Paranoia? (Score:3, Interesting)

    by dgatwood ( 11270 ) on Monday March 30, 2009 @04:37AM (#27385503) Homepage Journal

    You're right that I've never run memtest86 at all. I hadn't regularly worked with any hardware based on an Intel architecture until about two years ago, and haven't experienced any RAM problems in that relatively short period. That is the sole valid criticism in your post, and even that was redundant. The rest of your post consists of you putting words in my mouth that I did not say.

    Regarding point A, many Linux systems do perform at least rudimentary RAM checks. What I said was that it is remotely possible that it got lucky and detected the problem during such screening, then flagged that page of physical RAM as defective. I never said anything about checking every write to RAM. That was you putting words in my mouth, and completely ludicrous words that I'd have to know almost nothing about hardware to say, at that. Nice straw man.

    Regarding point B, that's not a baseless troll argument by any stretch of the imagination. First, running a lean Linux distro will almost certainly thrash pages around far less than 64-bit Vista, simply because the OS uses far less RAM. Second, last time I used it, Linux wired down a -lot- of pages in the kernel. All of those pages are just going to sit there. If anything, this was a criticism of Linux's tendency to wire too many pages, not any sort of "pro-Linux" comment. Maybe it might be taken to mean that Linux is less likely to eject pages belonging to one process in favor of another process---indeed, my experience has been that it does seem to do so less frequently than some other operating systems, though this can be either good or bad depending on the workload in question---but that was in no way implied by my previous comment, nor certainly was there any value judgment on my part as to whether such behavior is good or bad.

    Likewise on point C., I was actually being harshly critical of Linux's power management, albeit without coming right out and saying it. Nowhere in my statement did I in ANY way insinuate that failing to switch into the lowest power states was in any way a good thing. It isn't. Poor power management leads to diminished battery life in portables and increased electric bills from computers of all types.

    Before you go painting me as a pro-Linux troll, you need to learn some reading comprehension skills and stop trying to put words in my mouth. It only makes you look like a troll yourself.

  • by Sen.NullProcPntr ( 855073 ) on Monday March 30, 2009 @05:39AM (#27385807)

    While you're testing memory using Memtest, the GPU is not used at all, for example. When playing a game and/or running some heavy compile-jobs, on the other hand, overall power usage will be much higher.

    I think memtest is a good first-level test - it will pinpoint gross errors in memory, but it probably won't detect more subtle problems. For me, the best extended test is to enable all the OpenGL screen savers and let the system run overnight, cycling through each of them. If the system doesn't crash under this, it will probably be solid under a normal load. For me this has been the best test of overall system stability. Unfortunately, if it fails you won't know exactly what is wrong.

  • by rant64 ( 1148751 ) on Monday March 30, 2009 @07:00AM (#27386141)
    Which will work, as a matter of fact, given the proper hardware: http://www.microsoft.com/whdc/system/pnppwr/hotadd/hotaddmem.mspx [microsoft.com]
  • Re:Surprise? (Score:3, Interesting)

    by robthebloke ( 1308483 ) on Monday March 30, 2009 @07:39AM (#27386319)
    I have never in my lifetime managed to banjax an install of Firefox/Safari/IE to the point it wouldn't work or uninstall, which begs the question: what the hell are you doing to it? (If I didn't know any better, I'd suspect the cause may be downloading too much pron?)
  • by Joce640k ( 829181 ) on Monday March 30, 2009 @07:52AM (#27386389) Homepage

    I've had a lot more success with Microsoft's RAM tester, free download here: http://oca.microsoft.com/en/windiag.asp [microsoft.com]

    See, good things do come out of Redmond!

  • Re:Surprise? (Score:3, Interesting)

    by Zero__Kelvin ( 151819 ) on Monday March 30, 2009 @08:15AM (#27386547) Homepage

    "Vista is as reliable as Linux."

    I can definitely attest to this fact! The family computer dual-boots Vista (it shipped with the 64-bit machine, and is 32-bit of course) and Mandriva Linux 2009 x86_64. Vista has been used to view Oprah's website with its proprietary garbage, but other than that it is unused and unmolested. It is a stock install. No third-party stuff has been added other than iTunes. I recently had to install iTunes to restore my iPod after trashing the filesystem, and I can tell you Vista was very reliable. I could rely on it to apply updates in the background without my knowledge and interrupt the install process to reboot. Whenever I wanted to take a coffee break I could set it off installing iTunes, go off and have my coffee, and rely on it either having failed with some obscure error or still being busy with the task! I reliably had iTunes installed in just 1 hour and a few minutes!

    Compare and contrast that to Linux. I can't rely on it to still be busy doing a trivial task after a coffee break. I cannot rely on it to apply updates without my knowledge. I can't rely on the occasional opportunity to see my pretty splash boot screen when I am forced to reboot, since I never am. Worse still, I usually don't even get to see my pretty splash screen for OS-level updates unless the kernel or glibc is updated! I'm damn lucky I have been using it on my laptop for years even though it still isn't ready for the desktop yet, or I'd never get to see what this thing looks like when it boots!

    So yes, they are each very reliable in their own special way. Frankly, I learned how to drink coffee at the computer without spilling it quite some time ago, so I far prefer the kind of reliability Linux offers, but clearly your experience differs. I'm going to go out on a limb and say you don't use both Vista and Linux. You use Vista, tried Linux, had no idea what you were doing and "wrecked" your Linux installation. Conclusion: There is no difference between Vista and Linux! I can fsck 'em both up!

    If you don't want to take my word for it, take the word of every other competent systems-level software engineer out there. You won't find anyone saying Vista is great unless they are making money off of it, yet Linux has dedicated developers working on it who make NO money for doing so. Ask yourself this question: if M$ open-sourced the code tomorrow, do you think they would have any more skilled people working on it than they do today? Absolutely not. Nobody competent and in their right mind would invest their time and effort in developing that garbage without getting paid for it. It really is that simple.

  • Re:Surprise? (Score:3, Interesting)

    by Zero__Kelvin ( 151819 ) on Monday March 30, 2009 @08:20AM (#27386583) Homepage

    "Windows stopped being generally unstable years ago."

    Agreed. They have moved away from generalization to specialization now, and Vista is much more specific about how, when, and where it is unstable. Essentially, they pushed the crashes out of the kernel, and all the applications now act funny or crash instead of crashing the kernel.

    "People who will sit and tell me with a straight face that Vista, in their experience, is unstable are either very unlucky ..."

    Saying they are unlucky, when they are unfortunate enough to be stuck using Vista, is redundant.

  • Re:Surprise? (Score:3, Interesting)

    by poetmatt ( 793785 ) on Monday March 30, 2009 @08:33AM (#27386705) Journal

    What?

    Vista is not 100% stable, never has been, and obviously never will be. Do you think it's magically immune to its own BSODs? I run 64-bit Vista myself, and it's "better than XP", but not stable. Apps still get random errors, etc.

    Windows is as stable as it will ever be; at least with Ubuntu you can have a month's uptime and be fine. Now if only Wine was 100% there for gaming (it's getting there).

  • Re:Surprise? (Score:5, Interesting)

    by Lumpy ( 12016 ) on Monday March 30, 2009 @09:07AM (#27387003) Homepage

    Vista can hose its user profiles easily, and users then get the white-screen loading bug, which causes lots of problems and even causes networking to fail for that user.

    It's a profile problem that can be fixed easily by creating a new profile and deleting the old one, but that is way out of the ability of most users.

    This happens a LOT with home users. Out of the last 30 Vista support calls I got, 6 were this problem of corrupt user profiles.

    Honestly, user profiles under Windows have sucked since the Windows 2000 days.

  • by mysticgoat ( 582871 ) on Monday March 30, 2009 @09:46AM (#27387447) Homepage Journal

    I run memtest86 overnight (12+ hrs) as a routine part of the initial evaluation of a sick machine. Occasionally it finds errors after several hours that were not present on a single pass test. The last instance was a few months ago: a single stuck bit in one of the progressive pattern memory tests that only showed up after 4+ hours of repetitive testing. Replacing that mem module cured WinXP of a lot of weird flakey behavior involving IEv7 and Word.

    The overnight memtest86 runs have only kicked out errors that were not found on single pass testing maybe 3 or 4 times in the last 10 years. But it happens.

  • by MikeBabcock ( 65886 ) <mtb-slashdot@mikebabcock.ca> on Monday March 30, 2009 @09:52AM (#27387521) Homepage Journal

    Actually, it's worth noting that several motherboards on the market automatically overclock the timings on the board under high-load situations to improve performance. These same situations may not arise while simply running memtest86[+].

    I've often thought that throwing in a copy of Folding@Home or Distributed.NET running in the background would be fun while memory testing, to juice the CPU and test the system under a heavier load.

    Unfortunately, isolating the memory to run said software and relocating it periodically could be a pain.
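
    A crude version that doesn't try to isolate or relocate anything is easy to hack up, though: spin a few threads doing floating-point busy-work to load the CPU and the power rails while another thread runs a simple write/verify pattern over a large buffer. An illustrative sketch using POSIX threads (thread count, buffer size, and patterns are arbitrary):

    /* Load the CPU with floating-point busy-work while a separate thread
     * runs a simple write/verify pattern test over a large buffer. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <pthread.h>

    #define WORDS    (32u * 1024u * 1024u)    /* 128 MB test buffer */
    #define NLOADERS 3                        /* CPU-burner threads */

    static volatile int keep_running = 1;
    static volatile double sink;              /* keeps the optimizer honest */

    static void *cpu_burner(void *arg)
    {
        (void)arg;
        double x = 1.0001;
        while (keep_running)
            x = x * 1.0000001 + 0.000001;     /* pointless FP churn */
        sink = x;
        return NULL;
    }

    static void *mem_tester(void *arg)
    {
        volatile uint32_t *buf = arg;
        uint32_t pats[] = { 0xAAAAAAAAu, 0x55555555u, 0x00000000u, 0xFFFFFFFFu };
        for (int pass = 0; pass < 10; pass++) {
            uint32_t pat = pats[pass % 4];
            for (size_t i = 0; i < WORDS; i++) buf[i] = pat;
            for (size_t i = 0; i < WORDS; i++)
                if (buf[i] != pat)
                    printf("error at word %zu under load\n", i);
            printf("pass %d done\n", pass);
        }
        keep_running = 0;
        return NULL;
    }

    int main(void)
    {
        uint32_t *buf = malloc(WORDS * sizeof *buf);
        if (!buf) { perror("malloc"); return 1; }

        pthread_t loaders[NLOADERS], tester;
        for (int i = 0; i < NLOADERS; i++)
            pthread_create(&loaders[i], NULL, cpu_burner, NULL);
        pthread_create(&tester, NULL, mem_tester, buf);

        pthread_join(tester, NULL);
        for (int i = 0; i < NLOADERS; i++)
            pthread_join(loaders[i], NULL);
        free(buf);
        return 0;
    }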

  • Re:Paranoia? (Score:3, Interesting)

    by pz ( 113803 ) on Monday March 30, 2009 @10:23AM (#27387901) Journal

    The real issue with memory cells flipping is not cosmic rays -- at least not with terrestrially deployed memory; it's alpha particle emissions from the radioactive decay of the plastics in the memory package. Yes, the plastics surrounding the silicon.

    A lot of work has been done to reduce the radioactivity of plastics used in IC packaging from normal background levels that you don't worry about in day-to-day life, to as quiet as possible, by carefully selecting source materials that have few naturally occurring radioisotopes.

    From my chip-designer days I recall that the minimum charge required on a dynamic memory cell (like the ones in your computer's DRAM) to prevent spurious bit flips is one million electrons, give-or-take. The various designs back then were coming up with ways to reduce the footprint of the elements used to store that charge.

    That said, it's been about 10 years since I've been in that line of work, and things have probably changed -- strike that, they've definitely changed -- substantially.
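
    For a sense of scale on those numbers: a million electrons is roughly 0.16 pC, and an alpha particle stopped in silicon liberates about one electron-hole pair per 3.6 eV it deposits. The particle energy and the fraction of charge collected at the storage node below are assumptions, purely to show the orders of magnitude:

    /* Back-of-envelope: critical charge of ~1e6 electrons vs. the charge an
     * alpha particle can liberate in silicon (~3.6 eV per electron-hole pair).
     * Particle energy and collection fraction are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const double q_e        = 1.602e-19;      /* electron charge, C */
        const double q_crit     = 1e6 * q_e;      /* ~0.16 pC critical charge */
        const double alpha_ev   = 5e6;            /* assume a 5 MeV alpha */
        const double ev_per_ehp = 3.6;            /* eV per e-h pair in Si */
        const double collected  = 0.3;            /* assume 30% reaches the node */

        double q_alpha = alpha_ev / ev_per_ehp * q_e * collected;
        printf("critical charge:    %.3g C\n", q_crit);   /* ~1.6e-13 C */
        printf("alpha deposit:      %.3g C\n", q_alpha);  /* ~6.7e-14 C */
        printf("ratio (alpha/crit): %.2f\n", q_alpha / q_crit);
        return 0;
    }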

  • by Westmalle ( 633931 ) on Monday March 30, 2009 @11:36AM (#27388983)

    http://www.ida.liu.se/~abdmo/SNDFT/docs/ram-soft.html [ida.liu.se]

    This references an IBM study, which is what I think I actually remember but could not find quickly this morning.

    "In a study by IBM, it was noted that errors in cache memory were twice as common above an altitude of 2600 feet as at sea level. The soft error rate of cache memory above 2600 feet was five times the rate at sea level, and the soft error rate in Denver (5280 feet) was ten times the rate at sea level."

    IBM Research is a wonderful resource in the area of soft errors. I do remember reading exactly your quote; I didn't bother to track down the exact article, but it should be part of this special issue http://www.research.ibm.com/journal/rd40-1.html [ibm.com]. The banner article mentions Denver but doesn't have the exact quote; the web suggests it would be "Terrestrial Cosmic Rays", the second article in that issue. They have a more recent special issue on the same subject: http://www.research.ibm.com/journal/rd52-3.html [ibm.com]
