Hardware

Salvaging Defective DRAM 220

An anonymous reader writes "Ever wonder what happens to DRAM that fails quality assurance testing during manufacturing? Turns out a lot of it becomes 'downgrade' memory and ends up in OEM memory modules. Last resort: use it in an answering machine, where the sampled audio can be very tolerant of bit errors."
This discussion has been archived. No new comments can be posted.

  • Comment removed (Score:3, Insightful)

    by account_deleted ( 4530225 ) on Sunday March 09, 2003 @02:38PM (#5472042)
    Comment removed based on user account deletion
    • by acidrain69 ( 632468 ) on Sunday March 09, 2003 @03:18PM (#5472227) Journal
      There are a lot of peeps complaining about substandard ram. If you had RTFA, you'd realize that the downgrade ram is reconfigured to skip the bad parts in the chips, so that it comes out as a normal module. Just because there is a faulty bit or 10 in a module doesn't mean the rest of that module is bound to fail. It could just have been an imperfection in the silicon or the circuit process.

      The downgrade ram has to pass further tests to ensure the detours around the bad parts worked.

      Granted, I probably wouldn't use this stuff in a mission critical server, but if you are buying for a mission critical server, you should be getting ECC registered with lifetime warranties anyway. Now for a small web or file server, or even a desktop, I'd use this.

      Other people have mentioned memtest86. This program is your friend. Don't even bother with BIOS POST tests of RAM, just use this every once in a while if you REALLY want to find the problems. Too bad it won't run on my alpha server :(
      • by alienw ( 585907 ) <alienw.slashdotNO@SPAMgmail.com> on Sunday March 09, 2003 @04:16PM (#5472470)
        If the chip is half-bad, there are good chances that it has defects in the other half. Usually, it's a problem with the process and not just random quirks. It's just that one half works better than the other. In fact, many windows crashes are not caused by Windows, but by bad RAM. And good luck finding anything with memtest86. Once, I ran that program for about 3 hours on a machine with bad RAM. It didn't find anything. When I replaced one of the sticks, all the problems went away.
        • by Anonymous Coward
          "If the chip is half-bad, there are good chances that it has defects in the other half."

          Actually, no, that is not correct. Errors are caused by a localized defect which affects what is really (in human terms) a small point on the die. A particle of contaminant, for instance, only a micron or two in size.

          Ever wonder why NAND FLASH (used in Smart Media, Compact FLASH, etc) is cheaper than NOR FLASH (called linear FLASH, used for BIOS and other code storage, etc)? Because not only is it designed to be fault correcting, but the spec allows for up to a certain number of sectors to be completely bad (uncorrectable by the on board ECC bits). This means higher yield since many more get to pass in spite of defects.

          J
        • >When I replaced one of the sticks, all the problems went away.

          The big question is:

          Did you replace it with an identical type and speed of RAM? Or did it perhaps have fewer chips?

          memtest86 may not detect overclocked RAM, and on some boards, if the RAM is double sided, the extra "stress" on the bus of a poorly designed board may be enough to cause errors when reading or writing the RAM.

          I've seen other strange effects that only happen to windows, such as a board that detects a full complement of 384 MB of RAM in the BIOS (1 each of 256 MB and 128 MB) but only 128 MB in windows. Moving the RAM about on the board would cause windows to _sometimes_ detect the rest of the memory. Swapping the 256 MB stick with another machine's 256 MB caused both machines to reliably detect and use the memory.

          While I never bothered with memtest86, I'm betting it would see the same amount of memory as the BIOS.

          Can you tell I hate modern memory modules yet? :-)
        • If the chip is half-bad, there are good chances that it has defects in the other half. Usually, it's a problem with the process and not just random quirks.

          Not true. All processes are subject to variation.

          When a wafer is produced with hundreds or thousands of discrete die on it, some are always better than others. For instance, in the 5" process where the first Pentiums were fabricated, you could have a yield of 60%-80% good die with those 60%-80% spanning a whole range of marked chip speeds. Same process, same wafer, different MHz. Different price when sold.

          If you've ever seen a fab in production, you would also see steps where manual (vacuum wand) handling is needed. Even in the filtered air of a clean room, the open movement of a wafer handled like this often leads to particles becoming affixed to the surface. The smaller the process (e.g. 0.09u vs. 0.9u), the more damage a single particle can do.

          Process washings with chemicals or pure water do a good job of assuring no (well, few) particles stay affixed, but even so, some steps of metrology show that all cannot be avoided.

          Will a single particle hurt a single die? Maybe. Maybe not. It depends on where it lands and at what step in the process.

          Once the die are tested for yield and function and sorted by this performance, they are sold in batches.

          Not every die is tested completely though, but rather a restrictive set of "tell-tale" measurements are taken on most (at good fabs) and exhaustive testing done only on a small sample. Lots of statistical analysis helps know what to test and how hard to test it.

          Move to the final assembler, and all sorts of production glitches can cause bad modules. Primarily though, either minimally qualifying RAM or random sample tested RAM makes it into generic modules. Still, all the other components, the circuit board, connectors and solder itself can contribute to problems.

          In any case, the bad part in any chip is likely local because even minimal QA testing will eliminate obvious or widespread failures.

          Of course, piss-poor process does yield chips more prone to failure by breakdown of the traces or local thermal failure due to bubbles, impurities, or poor assembly.

      • by chrysrobyn ( 106763 ) on Sunday March 09, 2003 @04:16PM (#5472474)

        There are a lot of peeps complaining about substandard ram. If you had RTFA, you'd realize that the downgrade ram is reconfigured to skip the bad parts in the chips, so that it comes out as a normal module. Just because there is a faulty bit or 10 in a module doesn't mean the rest of that module is bound to fail. It could just have been an imperfection in the silicon or the circuit process.

        You have made a statement that makes it very clear you are a very educated layman, not someone in the field. What you've said is true to the first order, but not inherently true.

        Wafers have what can be measured as "defect density", and observe a phenomenon called "defect clustering". Defects are not always hit or miss, open or short, some of them are latent or resistive. As the part ages (diffuses), electromigrates or observes hot electron effects, all parts will decrease in quality. Downgrade RAM, so to speak, would be most likely to have additional cells fail due to the above effects -- because it had failures that made it marginal in the first place. Testing methodologies at higher quality manufacturers build in guardbands to make sure that nobody ever experiences the defects when used in-spec. (This is why many overclockers lose their chips after only a year or two, they cause latent defects to surface and suddenly the chip won't even operate at nominal frequencies; the guardband effect also explains to a great degree why many chips can be overclocked in the first place.)

        I'm not dis'n you, just trying to fill in a few more holes.

      • Granted, I probably wouldn't use this stuff in a mission critical server, but if you are buying for a mission critical server, you should be getting ECC registered with lifetime warranties anyway. Now for a small web or file server, or even a desktop, I'd use this.

        I've always been a bit dubious as to the value of ECC memory, and whether it might not just be a bit of a sales tactic. Yes, I realize that it's theoretically possible for solid state storage to spontaneously fail. But it's also theoretically possible for any number of other things to break, and spontaneous RAM failure seems very, very low on the list of things to worry about.

        I can't help but think that ECC memory is more useful from a marketing standpoint than a practical standpoint.
        • Re:ECC worth it? (Score:4, Informative)

          by Fulcrum of Evil ( 560260 ) on Sunday March 09, 2003 @06:30PM (#5473060)

          But it's also theoretically possible for any number of other things to break, and spontaneous RAM failure seems very, very low on the list of things to worry about.

          Well, the thing about RAM failure is that, unless you do something like ECC, you won't detect the errors until it causes a crash. Probably, you'll lose some data to corruption first. The other thing is that RAM errors can be induced by bad power or other transient problems. Finally, it does happen, so better safe than sorry - you're spending $2k on a server, so why cheap out on a $50 part?

        • Re:ECC worth it? (Score:4, Informative)

          by PurpleFloyd ( 149812 ) <`zeno20' `at' `attbi.com'> on Sunday March 09, 2003 @08:25PM (#5473589) Homepage
          ECC isn't there for the tiny chance that one, and only one, chip on the module would catch fire and die. It's there so that any random "bit rot" (single-bit errors) is caught and corrected before it causes damage. All RAM is susceptible to this; it can be caused by cosmic rays (!) or by radioactive decay (can't remember if it's alpha or beta) of minute quantities of radioisotopes in the chips' substrate. While it will only happen once in every ten years or so on average, it does happen and can cause a system crash. ECC is about reducing the possible risk (it would have to flip 3 bits simultaneously to fool ECC RAM).
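To illustrate how a single flipped bit can be located and corrected, here is a toy Hamming(7,4) code in Python. This is an assumption for illustration only: real ECC modules use a wider code (64 data bits plus 8 check bits), and the function names below are made up for this sketch.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome); a non-zero syndrome is the flipped position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = hamming74_encode(word)
    stored[4] ^= 1             # simulate one bit of "bit rot" at position 5
    recovered, position = hamming74_decode(stored)
    print(recovered == word, position)
```

This small code only corrects single-bit errors; two or more simultaneous flips would be miscorrected, which is why real modules add an extra check bit to at least detect double errors.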
          • It's there so that any random "bit rot" (single-bit errors) is caught and corrected before it causes damage.

            That can't be right, though. The system *bus* doesn't have ECC, and by virtue of its far greater area, would be much more vulnerable to errors induced by interference than the RAM.
      • Agreed.

        I only buy Samsung ram for example because they know good quality. Also they have great ddram -3200 while other ram manufacturers are not starting to introduce it.

        I would rather pay an extra $15 per module than put up with blue screens of death and Linux kernel panics.

    • In spite of what some other posters are saying, in the large HPUX server market, HP memory is much more reliable than Kingston memory, it's also much more expensive. Having said that I have seen more memory failures attributed to non-HP memory (not just Kingston) than anything. If your downtime is not worth more then go for less expensive memory. I have also seen a client try to install his own memory and destroy a system board in the process. No it was not covered under the HP warranty.
      • Re:HP HP-UX memory. (Score:3, Informative)

        by Teancom ( 13486 )
        What is this "Kingston" memory of which you speak? AFAIK, Kingston does not make ram, they throw other people's die on their modules (and sometimes they just buy the modules whole). It's pretty much a crap-shoot of whether or not you're getting samsung, hynix, micron (who just signed a deal to start selling to them again), or etc. So saying kingston memory is crap would be akin to saying dell makes crappy hard drives...

        Not a flame, just a clarification :-)
        • Yes, you are of course correct. The point is, as you say, pretty much of a crap shoot. Maybe that's why some people are very happy and some are not. BTW I don't know whom HP buys their memory from !
  • by MosesJones ( 55544 ) on Sunday March 09, 2003 @02:39PM (#5472049) Homepage

    "Oh you left a message on the answering machine, naah I didn't get it must be the defective DRAM chips they use. Now you've managed to track me down using a detective agency I'll be sure to send you the cheque next week"

  • by SatanicPuppy ( 611928 ) <Satanicpuppy.gmail@com> on Sunday March 09, 2003 @02:40PM (#5472053) Journal
    You just explained a lot about my fricking answering machine! I thought that no one ever called! And now I find out it is low grade ram? My god! I may really HAVE a social life!
  • by Anonymous Coward on Sunday March 09, 2003 @02:44PM (#5472070)
    Ever wonder what happens to DRAM that fails quality assurance testing during manufacturing?

    No. I figured they forgot about it.

    • it was put onto eBay.

      As a tip to Linux users with bad ram, try append="mem=fooM" where foo is an amount of ram below the broken area.
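A rough sketch of how the value for that parameter could be worked out, assuming you already know the lowest faulty physical address (for example from a memory tester's report). The helper name and the sample address are hypothetical; only the `mem=` parameter itself comes from the tip above.

```python
MIB = 1024 * 1024


def mem_limit_param(lowest_bad_addr):
    """Return a 'mem=NM' kernel parameter that stays below the first bad address."""
    usable_mib = lowest_bad_addr // MIB   # round down to a whole megabyte
    if usable_mib < 1:
        raise ValueError("bad address too low to leave any usable RAM")
    return f"mem={usable_mib}M"


if __name__ == "__main__":
    # Hypothetical first failing address, a bit past the 195 MB mark.
    print(mem_limit_param(0x0C345678))
```

Everything above the limit is thrown away, so this only makes sense when the bad area sits near the top of the module.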
      • Re:I figured (Score:4, Informative)

        by AvitarX ( 172628 ) <me@brandywinehund r e d .org> on Sunday March 09, 2003 @03:29PM (#5472270) Journal
        Real advice for Linux users with bad ram.

        Run memtest86 and use the output for the badram patch for the kernel.

        That will actually work and cut a very minimal amount of ram out.

        • when you can get help on /.!?

          My server, and its cheapass ram, thank you.
        • Umm... you've never actually tried that, have you? Memtest takes about 20-40 hours to find the bad clusters. I'd rather pay $30 for a new module.
          • Re:I figured (Score:3, Interesting)

            by dasunt ( 249686 )

            Badram requires a simple download, dd to floppy, booting off the floppy, and making sure it started up okay. Then, you can leave it alone for a day while you let it make passes.

            Anyways, assuming you are buying new ram and you want to be sure it's okay, you'd have to do the same thing. And some older laptops have integrated onboard memory - the badram patch can work around that.

            I have a 64M proprietary memory stick for an old toshiba laptop that will be arriving soon in the email - I will be using badram to test that when it comes.

  • by rickthewizkid ( 536429 ) on Sunday March 09, 2003 @02:45PM (#5472082)
    ...I forgot it. Musta been that defective OEM memory module I had implanted in my skull...

    -RickTheWizKid
  • by Ari Rahikkala ( 608969 ) on Sunday March 09, 2003 @02:45PM (#5472083) Journal
    BadRAM patch [vanrein.org].
  • by Anonymous Coward on Sunday March 09, 2003 @02:45PM (#5472085)
    Summary: This page proposes an approach to support RAMs with defective addresses. This may open interesting business perspectives, where those RAMs can be sold under a white label for less money rather than discarded without any profit.

    the url is:
    http://rick.vanrein.org/linux/badram/

  • by Anonymous Coward on Sunday March 09, 2003 @02:45PM (#5472089)
    and I suddenly thought, hmmm what happens to that defective DRAM, I open up Mozilla and what do I find? An answer to my question.
  • by Ogrez ( 546269 ) on Sunday March 09, 2003 @02:46PM (#5472093)
    This is the prime example of why I tell people I know not to buy ram off of the internet unless it's from a major company that has good support. Too many people buy 15-90 day warranty ram because it's cheap, and when it fails they are upset that they have to replace it. If you pay a bit more money you get lifetime warranty ram... and why do you think they are willing to warranty it that long? Because they know it works. People don't understand the testing process and think they are getting the same product buying cheap ram, as opposed to inexpensive ram...

    • Very true. I'll never buy generic RAM simply because it is more likely to malfunction. I've seen this happen to many of my friends. I'll stick with my Kingston RAM. The extra price is worth the warranty and the ability to sleep knowing my RAM won't mess my computer.
      • by sjames ( 1099 ) on Sunday March 09, 2003 @03:29PM (#5472273) Homepage Journal

        If only that was the worst of it.

        Generic RAM is also in the habit of mis-reporting its capabilities in SPD. The problem was so bad with 512M sticks back when that was the biggest available, many BIOS would automatically disregard SPD and choose the slowest settings when a 512M stick was detected.

        Better brand names don't appear to have that problem.

    • I know not to buy ram off of the internet unless its from a major company that has good support.

      I have to disagree.

      You can do quite well by simply buying very inexpensive RAM...that's rated higher than the RAM you're trying to get. Instead of buying brand name, expensive PC100 RAM, I bought cheap, generic PC133 RAM. No problems, all good...cheap.
  • Use memtest86 (Score:5, Interesting)

    by Black Parrot ( 19622 ) on Sunday March 09, 2003 @02:48PM (#5472100)

    ...and read its documentation to find out how to make Linux skip any defects it finds.

    • Run the full test through a few times. On slow machines, this can take a day or three.

      About a third of all the machines I've dealt with have had memory errors at some point in their lifespan. Most of those errors were only found after a day of tests.

  • recycling the chips (Score:5, Interesting)

    by v1 ( 525388 ) on Sunday March 09, 2003 @02:55PM (#5472135) Homepage Journal
    I recall seeing an article a while ago where companies were buying defective memory modules and running them in external testing units, which would identify which chip(s) on the stick were bad. I'm assuming they'd then unsolder the bad chip and recover one from another module. At that time some of those sticks had 8 chips on each side, so you could recover 15 good sticks from 16 bad ones. Considering the price of memory a few yrs ago, it was probably a worthwhile venture. Nowadays though, it's probably not worth anyone's time.
  • Tech support is busy right now...please leave a message and we'll get back to you...
  • well (Score:5, Funny)

    by odyrithm ( 461343 ) on Sunday March 09, 2003 @03:00PM (#5472162)
    I sh*t you not.. they make great keyring fobs! just dont let your gf see it ;)
    • Re:well (Score:4, Funny)

      by garbs ( 121069 ) on Sunday March 09, 2003 @03:10PM (#5472199)

      just dont let your gf see it ;)


      No problem with that happening with most of the slashdot visitors.
    • Re:well (Score:2, Funny)

      by Anonymous Coward
      just dont let your gf see it ;)

      Dude, if you have a pc-board keyring fob, this problem is totally irrelevant.

      Rob
    • why use useless memory as a keyring fob? get a small working usb drive as a keyring fob! it's better looking and you can actually use it for something!
    • Re:well (Score:5, Interesting)

      by wik ( 10258 ) on Sunday March 09, 2003 @04:06PM (#5472438) Homepage Journal
      Not to mention, give you hell at the airport. The security guys in Pittsburgh told me to put my keys in the little bucket, then when they looked closer, told me to put them through the X-ray machine.

      They were looking at the old 256k SIMM PCB (all chips removed) and asking "is that a computer chip"? Funny how they pointed at that and missed my Intel keyring fob with a real processor die on it.
      • I've had a dead memory keychain for about four years. I fly typically about two round-trips a year, and this past year I've flown more (including El Al, a.k.a. Just Because You're Paranoid Doesn't Mean They Aren't Out to Get You, and an international flight less than two weeks after Sept. 11, 2001). I have never been given any grief about my keys at the airport. They never even seem to notice it.

        Memory keychains are nice for opening boxes, too. And to the grandparent poster, my girlfriend thinks it's cool.
        • And to the grandparent poster, my girlfriend thinks it's cool.

          I should have explained my reasons for saying "dont let your gf see it", mine used to keep asking every so often "whats that again?" and after the 20th time you have to draw the line and part with it (the fob not the gf ;), oh and she's not at all stupid, she just switches off whenever computers come up.
      • Hmm I'm surprised you got that through

        *note to terrorists, if wishing to hijack planes, use sharpened RAM sticks*
    • by SuperQ ( 431 )
      bah.. several of my gf's saw my ram keyfob, and wanted one for their keys. My current gf gave me a couple Cray system boards as gifts.. I have my X-MP and Y-MP system boards at work.. and she keeps her Y-MP board on our key pedestal by our front door. Damn, those Y-MP system boards are heavy.
    • Why not let your gf see it?
      I've had 1/3rd of an old-school 32 pin SIMM as my keychain fob for quite a few years now (the ring goes through the hole.) The SIMM only had 3 chips on it, on one side, so it's nice and compact, and still had a nice geek appeal.

      It's actually helped me find people with similar interests. I had my keys in my hand one day while getting a sub at subway, and the guy behind the counter (probably about 17) said something along the lines of, "Cool, old school RAM. Haven't seen that for a while." We then had a little talk about back in the day (I'm only 21), since there were no customers in the store.
      • Why not let your gf see it?

        They ask too many questions, then every so often come out with "whats that again?", it can get tedious. ;)
  • by IvyMike ( 178408 ) on Sunday March 09, 2003 @03:04PM (#5472175)

    There are some things in the article that are pretty out of date:

    To reduce the test time, parallel chip testing usually is accomplished with eight to 16 chips in a row.

    That's pretty low parallelism; there are memory testers out there that test over 200 devices at a time right now. And even the older, more common systems are probably testing 64 in parallel.

    A special ink jet color marks the good dies.

    This hasn't been true for years. Each device's pass/fail status is stored in a database, along with all other test results, and the whole process is automated enough that good die are binned out automatically. No need to physically mark the chip.

    Due to the imperfection of the process, a percentage of the DRAM die contains some faulty cells.

    That percentage is 100%. At modern memory sizes, you never get a perfect device without going through repair.

  • by WolfWithoutAClause ( 162946 ) on Sunday March 09, 2003 @03:12PM (#5472209) Homepage
    I vaguely remember reading about a kernel patch that analysed faulty RAM, worked out what was wrong with it, and then modified the virtual memory handling so that you could carry on using it; at reduced capacity anyway.

    Needless to say I find this very cool indeed, but I'm not sure I'd want to run it on my high availability, mission-critical web server for a bank ;-)

  • by udif ( 32355 ) on Sunday March 09, 2003 @03:19PM (#5472233)
    It's quite simple. Really.

    DRAM chips usually have either 4, 8 or 16 bits per word. In order to construct a DIMM, 64 bits are needed. This means that with 4 bit DRAMs, you need 16 chips, with 8 bit DRAMs you need 8 chips, and with 16 bit DRAMs you need 4 chips. Usually you will see only the 4 or 8 bit DRAMs, because these occupy less board area for the same capacity. 16 bit DRAMs are only used for low capacity DIMMs.

    When your DIMM supports ECC, it's 72 bits wide, which makes it more complicated. Usually it's made of 18 4-bit chips, or 9 8-bit chips.

    (back in the 30 and 72 pin SIMM days, when memories were 8 or 32 bit wide, you could see ECC SIMMs that use 3 chip for 2x4+1=9 bits, or 2x16+4=36 bits).

    If you see DIMMs with 12 chips, this is usually a cheap OEM DIMM using partially good DRAMs.

    The best way to identify such a DIMM is to write down the markings on ALL the chips on it, and look them up on the internet. You then sum up all the DRAM bit widths, and see what you come up with:

    If it's 64 bits, it's a normal DIMM.

    If it's 72 bits, it's probably an ECC DIMM.

    If it's more, it's probably a DIMM using partially good DRAMs.
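The rule of thumb above can be sketched in a few lines of Python. The function name and the sample chip lists are made up for illustration, and the sketch assumes, as the comment does, that every chip on the module contributes to one 64 or 72 bit word.

```python
def classify_dimm(chip_widths):
    """Guess the module type from the data width (in bits) of each chip on it."""
    total = sum(chip_widths)
    if total == 64:
        return "normal DIMM"
    if total == 72:
        return "probably an ECC DIMM"
    if total > 72:
        return "probably built from partially good DRAMs"
    return "unexpected total, re-check the chip markings"


if __name__ == "__main__":
    print(classify_dimm([8] * 8))    # eight 8-bit chips
    print(classify_dimm([4] * 18))   # eighteen 4-bit chips
    print(classify_dimm([8] * 12))   # twelve 8-bit chips, the suspicious case
```

The widths have to come from the chip datasheets; counting chips alone is not enough, since 4, 8 and 16 bit parts can look alike.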

  • Does it get worse? (Score:2, Interesting)

    by Looke ( 260398 )
    I wonder, do RAM faults develop over time, or are they introduced in manufacturing? That is, if you have some bad RAM, and correct it with Linux BadRAM, can you then be reasonably safe you won't get more faults?

    Dead pixels on LCD screens are like this, if you don't have any dead pixels, you'll never get any. But how about RAM?
  • by Anonymous Coward
    The Sinclair Spectrum used half working 32k memory chips for cost reasons. In the later models, the computer used the same system, even though by then they were using mostly working chips as the cost of memory had fallen.

    You can get an extra 16k on most speccys by soldering a couple of links.
  • I'm not surprised that they use crap parts in answering machines. I've bought three digital answering machines over the years, and each one went flaky and died within a couple of years. Each time one died, I returned to my original General Electric dual-tape machine.

    Eventually, I learned my lesson: If it ain't broke, don't fix it. My mechanical answering machine is now 18 years old and running fine. I have no plans of ever replacing it again.

    • This is why I stopped using answering machines and started using the voice messaging at the telco. Millions of little plastic boxes eating up electricity in millions of homes is bound to be less efficient than voice messaging at a central server.

      The services at the telco let people leave messages when I am on the phone.
  • If you happen to get hold of some defective DRAM then there is an excellent kernel patch called badram. This will allow you to mark off all faulty bits and use the ram with no performance lost. So provided you've got enough slots, you can have 4G (or 3.99G) at no cost!
  • by hexdcml ( 553714 ) <hexdcml AT hotmail DOT com> on Sunday March 09, 2003 @03:42PM (#5472331)
    Did anyone else read the title as "Salvaging DRM"? Hmmm, for a minute there I thought answering machines were DRM protected.
  • by Artifex ( 18308 ) on Sunday March 09, 2003 @03:46PM (#5472351) Journal
    Seriously, I've had some of their OEM memory as part of a package deal, and it was very nasty stuff.

    What's worse, before they would take it back, they wanted to "test" it, testing being limited to a couple runs of PC-Doctor, which is totally lightweight.

    To make a long story short, they refused to take it back the first time, later it blew up my motherboard. They replaced the motherboard (it was part of the package) and sent me home, where I discovered my Athlon XP was also damaged. I took it up there, and they wanted to run PC-Doctor on it, but the "technician" (hah!) cracked the CPU while putting it in a "test board," so "oops, I guess we're replacing that."

    P.S. One of the guys at the return desk who I got to know quite well told me, when I asked him why the "test boards" they were using always changed, that he thought they were boards that belonged to customers. Whether that meant boards in for repairs, or returned boards, I don't know or care - either is bad news.

    P.P.S. This was at the Fry's in Wilsonville, Oregon. There is also an idiotic troll in the service department there who, after ignoring me waiting at an empty counter for 10 minutes while he chatted on the phone, wanted to charge me for a "missing" monitor stand on a monitor I was returning, refusing for 15 minutes to look in the bottom of the box under the styrofoam because monitor stands always come attached to the monitors, didn't you know? He finally looked when I demanded to talk to the manager, and of course it was there. I had a long discussion with the manager anyway over his, and their, incompetence (I reminded him of the memory fiasco) but the troll was still lurking there the last time I dropped by for consumables, which is all I will ever buy from Fry's, now. You can't miss him - he looks like he'd feel more at home in a raincoat, instead of his cheesy lab coat, roaming a playground on a sunny day.
  • by HaloZero ( 610207 ) <protodeka@@@gmail...com> on Sunday March 09, 2003 @04:05PM (#5472430) Homepage
    ...no, I never really did wonder what happened to DRAM that failed the everpresent quality-assurance testing. Never really occurred to me. So nyar.
  • by forged ( 206127 ) on Sunday March 09, 2003 @04:19PM (#5472489) Homepage Journal
    I just realized after reading the article that the risk taken when buying OEM ram just isn't worth the $10 (or whatever) excuse for price difference anymore. What's the cost of 256MB nowadays? $20-$40 depending on the type. Big deal. I'm putting $10 extra to ensure my computer will live through the next few years w/o mysterious system crashes...
  • DRAM Testing (Score:5, Informative)

    by gravygraphics ( 548287 ) on Sunday March 09, 2003 @04:56PM (#5472613)
    The RAM manufacturers are under horrendously heavy economic pressures. They are dropping like flies. So the name of the DRAM game is costs. To keep costs low, you have to keep the device small (so you can make more of them at once on a wafer). To keep the device small, you have to push your fab capability to the limit. An old DRAM mantra is if yield is too high, you are going to go out of business. In other words, if you aren't pushing the technology (and getting failures because of it), then you won't be competitive in the next generation.

    There is no way under those pressures that a company will make a perfect device. They build "redundancy" into the devices. This redundancy is used to fix the devices before they ship out of the factory (usually you can't fix the device after you package it). So most all DRAMs had some failures that were fixed during testing of the device.

    Most of the memory manufacturers test to JEDEC specifications. These specifications may be too tight for some applications. For example, in a well ventilated case, your RAM won't hit 90 degrees C, but that is where the manufacturers test it.

    So the "recovery" market buys bad devices from the major manufacturers and retests them to loosened specs. Some devices pass the loosened tests and are considered good. Other devices fail, but only in a single I/O. These devices are sorted so that custom DIMMs can be made that use extra devices to make up for the bad I/Os.

    As far as using your POST memory test to figure out if you have bad RAM, forget about it. A POST memory test is like making a visual inspection of a jet engine... necessary but not sufficient. There are many failure types in DRAM. The easiest are stuck-at's. A stuck-at is a bit that is stuck at a value. A POST will find this (and probably not much else). Other failure mechanisms are:

    • cell coupling (program a bit and another bit also sets)
    • column coupling (cell is correct, but when trying to read it, a neighboring column makes the other column read incorrectly)
    • bank noise (activating/precharging one bank causes a read/write on another bank to fail)
    • retention (how long a cell will keep its state until refreshed)
    • sense failures (sense amp can't measure the voltage differential when a cell is read)
    And a bunch of others. These are only the device errors. This doesn't take into account power problems, bad connections, leakage between traces, etc.
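To make the stuck-at case concrete, here is a small Python simulation of the kind of write-then-read-back pass a simple memory check performs. The `FaultyMemory` class and the fault locations are invented for this sketch; it models only stuck-at bits and none of the coupling, noise, retention or sense failures listed above.

```python
class FaultyMemory:
    """Byte-wide memory model where chosen bits are stuck at 0 or 1."""

    def __init__(self, size, stuck_bits=None):
        self.cells = [0] * size
        self.stuck = stuck_bits or {}   # address -> (bit index, stuck value)

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        value = self.cells[addr]
        if addr in self.stuck:
            bit, stuck_value = self.stuck[addr]
            if stuck_value:
                value |= 1 << bit       # bit always reads as 1
            else:
                value &= ~(1 << bit)    # bit always reads as 0
        return value


def find_stuck_cells(mem, size):
    """Write a few fixed patterns everywhere and report addresses that misread."""
    bad = set()
    for pattern in (0x00, 0xFF, 0x55, 0xAA):
        for addr in range(size):
            mem.write(addr, pattern)
        for addr in range(size):
            if mem.read(addr) != pattern:
                bad.add(addr)
    return sorted(bad)


if __name__ == "__main__":
    memory = FaultyMemory(64, {10: (3, 1), 40: (0, 0)})
    print(find_stuck_cells(memory, 64))
```

A pass like this costs only a handful of reads and writes per cell, which is why it is cheap enough for a quick check and also why it misses every fault that depends on neighbors, timing or temperature.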

    There are three key enablers that help the manufacturers find these device errors.

    First are test modes that manufacturers put in their devices. Test modes help the manufacturer cut down on test time and artificially stress the device to pull out failures quickly. These test modes differ between manufacturers and can't be accessed from the computer. Accidentally putting your RAM into test mode would be a very bad thing. There is nothing that a POST test (or other tests from the computer) can do to access these test modes.

    Second is controlling the temperature. DRAM's are very temperature sensitive. A DRAM that passes when you first turn on your computer may fail once the computer gets hot. Obviously right when you turn the computer on isn't the right time to do tests that are sensitive to temperature.

    The third thing you need to know is how the cells of the memory are laid out into the addresses and I/O's of the device. Each device design is different. Even in the same manufacturer the mapping of outside addresses to physical rows and columns will not be the same. To make matters worse, the devices have been repaired, so each device will have the "spare" rows and columns in different places. So each device of each DIMM will have a different mapping of outside addresses to internal locations.
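
    A rough sketch of why repairs change which rows are neighbors. The row count, the repair tables and the idea that spares sit just past the normal rows are all assumptions for illustration, not any real device's layout.

```python
# Hypothetical model: each device has its own table of failed rows that were
# swapped for spare rows at the factory. Spares are assumed to sit just past
# the normal rows; real layouts differ per design.

NUM_ROWS = 8192

def physical_row(logical_row, repairs):
    """Return the physical row that actually stores a logical row."""
    return repairs.get(logical_row, logical_row)

def physically_adjacent(row_a, row_b, repairs):
    """True if the two logical rows end up in neighboring physical rows."""
    return abs(physical_row(row_a, repairs) - physical_row(row_b, repairs)) == 1

unrepaired = {}
device_a = {1234: NUM_ROWS + 0}                      # one repaired row
device_b = {77: NUM_ROWS + 0, 4000: NUM_ROWS + 1}    # two repaired rows

print(physically_adjacent(1233, 1234, unrepaired))  # True
print(physically_adjacent(1233, 1234, device_a))    # False
print(physically_adjacent(77, 4000, device_b))      # True
```

    Two devices of the same design can therefore disagree about which external addresses sit next to each other, and nothing visible from the outside tells you which is which.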

    Why is this important? Consider the cell coupling case. Two cells share charge. If you don't know which cells are adjacent, you have to read every other cell to make sure it didn't change. So to test every cell (N) you have to read every other cell (N-1). That means this "pattern" is going to take ~N^2. Now you have to do the same thing with inverted data. Make that 2N^2. Let's just think about a 256Mbit device (common on 256MByte DIMMs). Let's assume we can access addresses at 100MHz (we can't go that fast, but it doesn't matter). This pattern takes somewhere around 45 years... for one device.
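
    The arithmetic, spelled out as a quick back-of-the-envelope script (one access per cycle at 100MHz, exactly 2N^2 accesses, nothing else counted):

```python
# Back-of-the-envelope estimate for the exhaustive coupling pattern.
cells = 256 * 2**20            # 256 Mbit device: N = 268,435,456 cells
accesses = 2 * cells**2        # ~N^2 reads, repeated with inverted data
rate_hz = 100e6                # assumed access rate: 100 MHz

seconds = accesses / rate_hz
years = seconds / (365.25 * 24 * 3600)
print(round(years, 1))  # 45.7
```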

    So it is impossible to fully test a memory device unless you are the manufacturer. This includes the memory recovery people, UNLESS they have a deal with the manufacturer to share the layout of the device.

    Now on the other hand, most of the memories aren't all that scrambled up. And even memories that are heavily scrambled still have some regular structures. So if you assume that the external memory addresses directly map to internal, you can test quite a few of the couplings. You may get lucky and find the problem. This is why exhaustive memory tests sometimes find a problem with the bad memory. It also explains why the tests run so many different data patterns at the memory.
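
    Here is a toy sketch of that shortcut: assume the cells next to address A are A-1 and A+1, and only check those. That is O(N) instead of O(N^2). The CoupledMemory class and the fault locations are invented for illustration.

```python
# Toy coupling-fault model plus a neighbor-only test that ASSUMES external
# addresses map linearly to physical cells. Hypothetical, for illustration.

class CoupledMemory:
    """Memory where writing 1 to an aggressor cell also sets its victim cell."""

    def __init__(self, size, coupling):
        self.cells = [0] * size
        self.coupling = dict(coupling)  # aggressor address -> victim address

    def write(self, addr, bit):
        self.cells[addr] = bit
        victim = self.coupling.get(addr)
        if victim is not None and bit == 1:
            self.cells[victim] = 1  # charge leaks into the victim

    def read(self, addr):
        return self.cells[addr]


def neighbor_coupling_test(mem, size):
    """Set each cell to 1 and check that its assumed neighbors stayed 0."""
    failures = []
    for addr in range(size):
        neighbors = [a for a in (addr - 1, addr + 1) if 0 <= a < size]
        for a in neighbors + [addr]:
            mem.write(a, 0)          # clear the cell and its assumed neighbors
        mem.write(addr, 1)           # disturb the cell under test
        for a in neighbors:
            if mem.read(a) != 0:
                failures.append((addr, a))
        mem.write(addr, 0)
    return failures


# Victim really is at the next address: the shortcut finds it.
found = neighbor_coupling_test(CoupledMemory(1024, {100: 101}), 1024)
# Victim is physically adjacent but lives at a scrambled address: missed.
missed = neighbor_coupling_test(CoupledMemory(1024, {100: 613}), 1024)
print(found)   # [(100, 101)]
print(missed)  # []
```

    When the guess about adjacency is right you get lucky; when the layout is scrambled or repaired, the same test passes a bad part.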

    What can you do? Buy good memory from the top manufacturers (there are some really smart people who put together these tests for the manufacturers). Make sure the names of major manufacturers are stamped on the chips. Micron, Samsung, Hynix, and Infineon are the big four. Beware of the off brands.

    Note: Part of this comment was copied from a post I made at Macintouch. They had a thread on bad RAM a while ago at http://www.macintouch.com/badram01.html [macintouch.com]

  • by hedley ( 8715 ) <hedley@pacbell.net> on Sunday March 09, 2003 @05:46PM (#5472854) Homepage Journal
    I wo~nder when we wil+ see tho&e defec%ive ch)ps in ou! deskt{p mach?nes?
  • This is probably a good place to mention badram [vanrein.org], the linux kernel patch that lets you use slightly defective memory modules.

    You can use memtest to generate a list of bad areas in ram, and the badram patch reserves those blocks of memory on boot such that nobody can ever use them, effectively giving you a working stick of ram, only a little bit smaller than it is marked for.
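
    A sketch of the idea, assuming (as I understand the BadRAM docs) that the patch takes address/mask pairs and treats an address A as bad when (A & mask) == (address & mask). The helper names, the 4 KiB page mask and the exact boot option formatting below are my own assumptions, so check the patch's documentation before using anything like this.

```python
# Hypothetical helpers: turn a list of failing byte addresses into
# page-granular (address, mask) pairs. Assumes 32-bit physical addresses
# and 4 KiB pages; neither is taken from the BadRAM patch itself.

PAGE_MASK = 0xFFFFF000

def badram_pairs(bad_addresses, mask=PAGE_MASK):
    """Collapse failing addresses to one (page_address, mask) pair per page."""
    pages = sorted({addr & mask for addr in bad_addresses})
    return [(page, mask) for page in pages]

def is_excluded(addr, pairs):
    """Assumed matching rule: addr is bad if it equals a pattern under its mask."""
    return any((addr & m) == (f & m) for f, m in pairs)

def boot_option(pairs):
    """Format the pairs as a comma-separated list (assumed option syntax)."""
    return "badram=" + ",".join(f"0x{f:08x},0x{m:08x}" for f, m in pairs)

bad = [0x01234ABC, 0x01234FF0, 0x0A000010]
pairs = badram_pairs(bad)
print(boot_option(pairs))
print(is_excluded(0x01234000, pairs), is_excluded(0x01235000, pairs))
```

    Masking at page granularity throws away a whole page for one bad bit, which is the "only a little bit smaller" trade-off.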

    If you're like me, you have a couple of cheapo sticks from who knows where that don't exactly work, and this patch is perfect for reviving those sticks.
  • Turns out a lot of it ends up as 'downgrade' memory and ends up in OEM memory modules.
    • Actually Dell is one of the few OEM's you can trust.

      They are more expensive than the el-cheapo ones but they only use things like Crucial or Micron RAM, Asus motherboards, reliable Intel chipsets and so on.

      Dell makes a lot of money from business and small business customers, and support and quality are critical for them. Other OEM's like Gateway are another story. If my parents wanted a new computer I would most certainly recommend a Dell.

      They also get bulk discounts for being so big, so buying a Dell with all the higher quality components is not that much more expensive than building your own PC using cheap parts.

  • His name was Jim-- no, Joe. Anyway, his number is
    555-653 ... 635 ... 563 ... dammit.

"When the going gets tough, the tough get empirical." -- Jon Carroll
