Data Storage Hardware

Ask Slashdot: How Do SSDs Die?

First time accepted submitter kfsone writes "I've experienced, first-hand, some of the ways in which spindle disks die, but either I've yet to see an SSD die or I'm not looking in the right places. Most of my admin-type friends have theories on how an SSD dies, but admit that none of them has actually seen commercial-grade drives die or deteriorate. In particular, the failure process seems like it should be more clinical than with spindle drives. If you have X drives of the same SSD model, none of them suffers a manufacturing defect, and you repeat the same series of operations on all of them, they should all die at around the same time. If that's correct, then what happens to SSDs in RAID? Either all your drives will start to fail together, or at some point your drives will become out of sync in terms of volume sizing. So, have you had to deliberately EOL corporate-grade SSDs? Do they die with dignity or go out with a bang?"
  • Re:CRC Errors (Score:3, Interesting)

    by Quakeulf ( 2650167 ) on Tuesday October 16, 2012 @12:29PM (#41670255)
    How big in terms of gigabytes were they? I have two OCZ disks myself and they have been fine so far. The biggest is 64 GB, the smallest 32 GB. I was thinking of upgrading to a 256 GB SSD at some point, but what might kill it is something I honestly haven't thought about and would like some input on. My theory is that heat and a faulty power supply would play major roles; I'm not so sure about physical impact, although a hard enough knock would presumably break it.
  • SSDs do fail (Score:1, Interesting)

    by Anonymous Coward on Tuesday October 16, 2012 @12:32PM (#41670309)

    Pretty much all SSDs have more than 8 chips in a configuration similar to RAID 0. If any single chip has a problem, the entire drive is useless. I've seen SSDs fail, from the cheap 40 GB Patriots all the way up to the high-end Fusion-io drives. *Most* of them died after power cycles; I guess if they are going to fail, that is usually when it happens. At least with mechanical disks you can spend some cash and have the data recovered after a failure.
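
To make the striping point concrete: under the simplifying assumption that chip failures are independent and any one chip failing loses the whole drive, spreading data across more chips multiplies the exposure. A toy calculation; the per-chip failure probability is made up purely for illustration:

```python
def drive_failure_probability(p_chip: float, n_chips: int) -> float:
    """Chance that at least one of n independent chips fails. With data striped
    across every chip and no internal redundancy, one chip failing loses the drive."""
    return 1.0 - (1.0 - p_chip) ** n_chips

# Purely illustrative: assume a 1% chance that any given chip fails in some period.
for n in (1, 8, 16):
    print(f"{n:2d} chips -> {drive_failure_probability(0.01, n):.1%} chance of drive failure")
```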

  • by theNetImp ( 190602 ) on Tuesday October 16, 2012 @12:33PM (#41670319)

    So, by that reasoning, if you have a RAID of 15 drives used for storing images, and those images never change (they are written once and never overwritten), then the SSDs should theoretically never die, because they are only ever reading those bits now?

  • Re:Umm (Score:5, Interesting)

    by MightyMartian ( 840721 ) on Tuesday October 16, 2012 @12:42PM (#41670465) Journal

    Too true. Years ago we bought batches of Seagate Atlas drives, and pretty much all of them started dying within weeks of each other. They were still under warranty, so we got a bunch more of the same drives, and lo and behold, within nine months they were crapping out again. It absolutely amazed me how closely together the drives failed.

  • by Baloroth ( 2370816 ) on Tuesday October 16, 2012 @12:45PM (#41670523)

    So, by that reasoning, if you have a RAID of 15 drives used for storing images, and those images never change (they are written once and never overwritten), then the SSDs should theoretically never die, because they are only ever reading those bits now?

    Reading flash is not 100% non-destructive: if you never rewrite, cells near each cell being read (which is probably all of them) will degrade over time. I believe the stored data will also degrade over long periods in any case, but I'm not sure. But if you rewrite the data every year or so, the drives could probably last decades.
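
A minimal sketch of the kind of periodic refresh suggested above: read each file in an archive and rewrite it in place so the controller re-programs the cells. The mount point is a placeholder, and whether a userspace rewrite actually lands on freshly programmed flash is ultimately up to the drive's controller, so treat this as illustrative only.

```python
import os
import shutil
import tempfile

def refresh_file(path: str) -> None:
    """Rewrite a file's contents so the SSD controller re-programs its cells."""
    dir_name = os.path.dirname(path) or "."
    # Copy to a temp file in the same directory, then atomically replace the original.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as tmp, open(path, "rb") as src:
            shutil.copyfileobj(src, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        shutil.copystat(path, tmp_path)  # preserve timestamps and permissions
        os.replace(tmp_path, path)       # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)
        raise

def refresh_tree(root: str) -> None:
    """Walk an archive directory and refresh every file in it."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            refresh_file(os.path.join(dirpath, name))

if __name__ == "__main__":
    refresh_tree("/mnt/image-archive")  # hypothetical mount point
```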

  • by Anonymous Coward on Tuesday October 16, 2012 @12:50PM (#41670595)
    I had an 80 GB Intel X-25M fail in an interesting manner. Windows machine, formatted NTFS, Cygwin environment. The drive had been in use for about a year, and the "wear indicator" still read 100% fine. The only thing wrong with it was that it had been mostly full (70 of 80 GB), but wear leveling should have mitigated that. It had barely a terabyte written to it over its short life.

    Total time from system operational to BSOD was about ten minutes. I first noticed difficulties when I invoked a script that called a second script, and the second script was missing. "ls -l" on the missing script confirmed it wasn't present. While scratching my head about $PATH settings, knowing damn well I hadn't changed anything, a few minutes later I discovered I also couldn't find /bin/ls.exe. In a DOS prompt that was already open, I could DIR C:\cygwin\bin - the directory was present, ls.exe was present, but it wasn't anything the OS was capable of executing. Sensing imminent data loss, and panic mounting, I did an XCOPY /S /E... etc. to salvage what I could from the failing SSD.

    Of the files I recovered by copying them from the then-mortally-wounded system, I was able to diff them against a valid backup. Most of the recovered files were OK, but several had 65536-byte blocks consisting of nothing but zeroes.

    Around this point, the system (unsurprisingly, as executables and swap and heaven knows what else were being riddled with 64K blocks of zeroes) crashed. On reboot, Windows attempted (and predictably failed) to recover (asinine that Windows tries to write to itself on boot, but also asinine of me not to power the thing down and yank the drive, LOL). The system did recognize it as an 80 GB drive and attempted to boot itself - Windows logo, recovery console, and all.

    On an attempt to mount the drive from another boot disk, the drive still appeared as an 80 GB drive once; unfortunately, it couldn't remain mounted long enough for me to attempt further file recovery or forensics.

    A second attempt - and all subsequent attempts - to mount the drive showed it as an 8MB (yes, eight megabytes) drive.

    I'll bet most of the data's still there. (The early X-25Ms didn't use encryption). What's interesting is that the newer drives have a similar failure mode [intel.com] that's widely recognized as a firmware bug. If there were a way to talk to the drive over its embedded debugging port (like the Seagate Barracuda fix from a few years ago), I'll bet I could recover most of the data.

    (I don't actually need the data, as I got it all back from backups, but it's an interesting data recovery project for a rainy day. I'll probably just desolder the chips and read the raw data off 'em. Won't work for encrypted drives, but it might work for this one.)

  • Re:Umm (Score:3, Interesting)

    by Anonymous Coward on Tuesday October 16, 2012 @12:51PM (#41670615)

    Warranty replacement drives are refurbished, meaning they've already failed once. I've never had a refurb drive last a full year without failing. It's gotten bad enough that I don't bother sending them back for warranty replacement anymore.

  • by AmiMoJo ( 196126 ) on Tuesday October 16, 2012 @12:51PM (#41670625) Homepage Journal

    I had an Intel SSD run out of spare capacity and it was not fun. Windows kept forgetting parts of my profile and resetting things to default or reverting back to backup copies. The drive didn't report a SMART failure either, even with Intel's own SSD monitoring tool. I had to run a full SMART "surface scan" before it figured it out.

    That sums up the problem. The controller doesn't start reporting failures early enough, and the OS just tries to cope as best it can, leaving the user to figure out what is happening.

  • Re:CRC Errors (Score:5, Interesting)

    by Synerg1y ( 2169962 ) on Tuesday October 16, 2012 @12:53PM (#41670647)

    OCZ makes several different product lines of SSDs, and each line has its own quirks, so generalizing about OCZ's QA issues isn't accurate. I've always had good luck with the Vertex 3s, both for myself and for the people I install them for. I've had an SSD die once, and it looked identical to a spinning-disk failure from a chkdsk point of view. I can't remember what kind it was, either an OCZ or a Corsair, but I can name a ton of drives from both of those brands that are still going after 2-3+ years.

  • by cellocgw ( 617879 ) <cellocgw.gmail@com> on Tuesday October 16, 2012 @12:54PM (#41670653) Journal

    With traditional mechanical drives, you usually get a clicking noise accompanied by a window of time in which you can offload data from the drive before it fails completely.

    OK, so I'm sure some enterprising /.-er can write a script that watches the SSD controller and issues some clicks to the sound card when cells are marked as failed.
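
In that spirit, a rough sketch of such a watcher, assuming smartmontools (smartctl) is installed and that the drive reports a reallocated-sector count (attribute names vary by vendor, and many SSDs expose wear through different attributes entirely). The device path, polling interval, and terminal-bell "click" are all placeholders.

```python
import subprocess
import sys
import time

DEVICE = "/dev/sda"                  # placeholder device
POLL_SECONDS = 600                   # check every ten minutes
ATTRIBUTE = "Reallocated_Sector_Ct"  # attribute 5; SSD vendors often use other attributes

def read_raw_value(device: str, attribute: str) -> int:
    """Parse one attribute's raw value out of `smartctl -A` output.
    Assumes the raw value is a plain number in the last column."""
    out = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if attribute in line:
            return int(line.split()[-1])
    raise KeyError(f"{attribute} not reported by {device}")

def click() -> None:
    """'Click' by writing the terminal bell; swap in real audio output if you like."""
    sys.stdout.write("\a")
    sys.stdout.flush()

if __name__ == "__main__":
    last = read_raw_value(DEVICE, ATTRIBUTE)
    while True:
        time.sleep(POLL_SECONDS)
        current = read_raw_value(DEVICE, ATTRIBUTE)
        if current > last:
            for _ in range(current - last):
                click()  # one click per newly remapped sector
            last = current
```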

  • by SydShamino ( 547793 ) on Tuesday October 16, 2012 @12:58PM (#41670727)

    For flash memory it is the erase cycles, not the write cycles, that drive life.
    http://en.wikipedia.org/wiki/Flash_memory [wikipedia.org]

    The quantum tunneling effect described for the erase process can weaken the insulation around the isolated gate, eventually preventing that gate from holding its charge. That's the typical end-of-life scenario for a bit of flash memory.

    You generally don't say that writes are end-of-life because you could, in theory, write the same pattern to the same byte over and over again (without erasing it) and not cause reduction in part life. Or, since bits erase high and write low, you could write the same byte location eight times, deasserting one new bit each time, then erase the whole thing once, and the total would still only be "one" cycle.
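
A toy model of those semantics: an erase sets every bit in the block to 1 (0xFF), and a program operation can only clear bits, so programming is effectively a bitwise AND with the current contents. That is why eight writes that each deassert one more bit still cost only a single eventual erase. The sketch below ignores real-world details like page-level programming granularity and program-disturb limits.

```python
class ToyFlashBlock:
    """Grossly simplified NAND block: erase sets bits, program can only clear them."""

    def __init__(self, size: int = 16):
        self.cells = bytearray([0xFF] * size)  # erased state: all bits 1
        self.erase_cycles = 0                  # the count that actually wears the block

    def erase(self) -> None:
        self.cells = bytearray([0xFF] * len(self.cells))
        self.erase_cycles += 1

    def program(self, offset: int, value: int) -> None:
        # Programming can only pull bits from 1 to 0, never back to 1.
        self.cells[offset] &= value

block = ToyFlashBlock()
# Deassert one new bit in the same byte on each of eight writes...
for bit in range(8):
    block.program(0, 0xFF & ~(1 << bit))
print(hex(block.cells[0]), block.erase_cycles)  # 0x0 0 -- eight writes, zero erase cycles
# ...only returning bits to 1 requires erasing the whole block.
block.erase()
print(hex(block.cells[0]), block.erase_cycles)  # 0xff 1
```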

  • Re:Umm (Score:5, Interesting)

    by Anonymous Coward on Tuesday October 16, 2012 @01:01PM (#41670779)

    Google published a study of their own consumer-grade drives and found the same thing: if a drive survives the first month of load, it will likely go on to work for years, but if it throws even just SMART errors in the first 30 days, it is likely to be dodgy.

  • Theory or Practice? (Score:5, Interesting)

    by rabtech ( 223758 ) on Tuesday October 16, 2012 @01:21PM (#41671051) Homepage

    In theory they should degrade to read-only just as others have pointed out in other posts, allowing you to copy data off them.

    In reality, just like modern hard drives, they have unrecoverable firmware bugs, fuses that can blow with a power surge, controller chips that can burn up, etc.

    And just like hard drives, when that happens you should in theory still be able to read the data off the flash chips, but revisions to the controller, firmware, etc. make that more or less successful depending on the manufacturer. You also can't just pop the board off the drive like you can with an HDD; you need really good surface-mount resoldering capability.

    So the answer is "it depends"... If the drive itself doesn't fail but reaches the end of its useful life, or was put on the shelf less than 10 years ago (the charge in flash cells does slowly drain away), then the data should be readable or mostly readable.

    If the drive itself fails, good luck. Maybe you can bypass the fuse, maybe you can re-flash the firmware, or maybe it's toast. Get ready to pay big bucks to find out.

    P.S. OCZ is fine for build it yourself or cheap applications but be careful. They have been known to buy X-grade flash chips for some of their product lines - chips the manufacturers list as only good for kid toys or non-critical, low-volume applications. Don't know if they are still doing it but I avoid their stuff.
    Intel's drives are the best and have the most-tested firmware but you pay for it. Crucial is Micron's consumer brand and tends to be pretty good given they make the actual flash - they are my go-to brand right now. Samsung isn't always the fastest but seems to be reliable.

    Do your research and focus on firmware and reliability, not absolute maximum throughput/IOPS.

  • Re:Umm (Score:4, Interesting)

    by infodragon ( 38608 ) on Tuesday October 16, 2012 @02:04PM (#41671685)

    [Sarcasm]Nothing like 20/20 hindsight... If I had done anything like trying to rebuild the array it would have fallen apart... Oh wait... If I had followed what you suggested I would have been SCREWED.[/Sarcasm]

    I made a decision based on the information on hand. The rebuild would have taken more than a few hours; an 80 GB disk was SLOW, i.e. first-gen SATA. By executing the DB dump I was hitting less than half the disk's capacity on reads, versus 100% of its capacity on writes for a rebuild. It would be significantly faster to retrieve the data than to rebuild, and that time window was critical: 2 hours of reads vs. 4+ hours of writes. I also knew I had all the data on hand, and the scripts for rebuilding the entire DB on a different server are tested monthly. The decision was easy! Grab the DB data now, redeploy on another system, and address the issue on the spot. The system ended up being down 3 hours rather than 24+.

    Secondly, the failure was abrupt, with no SMART messages; I couldn't trust the other drives not to have the same non-reporting issue. I made a choice on the spot on how to proceed, knowing full well I might have signed my own 24-hour torture warrant. Fortunately the worst case didn't happen, and I learned a critical lesson.

    A bit more information...

    ±30 minutes on each one:
    First disk failed...
    2 hours later, the second disk failed...
    2 hours later, the third disk failed.
    2 hours later, the fourth disk failed.
    16 hours later, the fifth disk failed.

  • Re:Umm (Score:5, Interesting)

    by kasperd ( 592156 ) on Tuesday October 16, 2012 @03:12PM (#41672625) Homepage Journal

    That is why one uses RAID 6 with lower tier drives and hot spares.

    The best argument for RAID 6 is bad sectors discovered during reconstruction. Assume one of your disks has a bad sector somewhere. Unless you periodically read through all your disks, you may not notice this for a long time. Now assume a different disk dies [phuket-data-wizards.com]. Reconstruction starts writing to your hot spare, but during reconstruction an unreadable sector is found on one of the remaining drives. On RAID 5, that means data loss.

    I have on one occasion been assigned the task of recovering from pretty much that situation, and some of the data did not exist anywhere else. In the end my only option was to retry reading the bad media over and over until, on one pass, I got lucky.

    With RAID 6 you are much better off. If one disk is completely lost and you start reconstructing to the hot spare, you can tolerate lots of bad sectors. As long as you are not so unlucky as to find bad sectors at the exact same location on two different drives, reconstruction will succeed. An intelligent RAID 6 system will even correct bad sectors in the process: when a bad sector is detected during reconstruction, the data for both the bad sector and the corresponding location on the hot spare can be reconstructed simultaneously and written to the respective disks.

    At the end of the reconstruction you not only have reconstructed the lost disk, you have also reconstructed all the bad sectors found on any of the drives. Should one of the disks run out of space for remapping bad sectors in the process, then that disk is next in line to be replaced.
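
To put rough numbers on why an unreadable sector during a rebuild is so likely, here is a back-of-the-envelope sketch. It assumes an unrecoverable read error (URE) rate of one per 10^14 bits read, a figure commonly quoted for consumer drives (enterprise drives are usually rated much better), and independent errors; both are simplifications.

```python
import math

# Rough odds that a RAID 5 rebuild trips over at least one unrecoverable
# read error (URE) -- exactly the case that RAID 6 survives.
# Illustrative assumptions: independent errors, 1 URE per 1e14 bits read.

URE_PER_BIT = 1e-14

def p_rebuild_hits_ure(surviving_disks: int, disk_bytes: float) -> float:
    """Probability of at least one URE while reading every surviving disk in full
    (Poisson approximation: 1 - exp(-expected number of UREs))."""
    bits_read = surviving_disks * disk_bytes * 8
    return 1.0 - math.exp(-URE_PER_BIT * bits_read)

TB = 1e12  # decimal terabyte, as drives are marketed
for disks, size_tb in [(5, 1), (8, 2), (12, 3)]:
    p = p_rebuild_hits_ure(disks - 1, size_tb * TB)
    print(f"{disks} x {size_tb} TB array, RAID 5 rebuild: ~{p:.0%} chance of hitting a URE")
```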

  • by guruevi ( 827432 ) on Tuesday October 16, 2012 @05:11PM (#41674051)

    The ultra-cheap SSDs in my servers lasted only 3 months. The four OCZ Vertex 3 IOPS drives have so far lasted over a year with ~2 TB processed per disk; the two Intel SLC and two MLC drives are already past 2 years, over which time they have processed ~10 TB each (those were all enterprise grade or close to it). They are in a 60 TB array doing caching, so they regularly get read, written, and deleted. I have some OCZ Talos (SAS) drives as well; one was DOA and another died early, but I simply shipped them off for RMA and had replacements within a couple of days. The rest of them are doing well, over 6 months and going.

    Several other random ones still work fine in random desktop machines and workstations.

    As far as spare room on those devices goes, depending on the manufacturing process you get between 5 and 20% unused space where 'bad' blocks come to live. I haven't had one develop bad blocks, so most of mine have gone out with a bang: usually they just stop responding and drop out, totally dead. I would definitely recommend RAID 6 or mirrors, as they do die just like normal hard drives (I just had 3 identical 3 TB drives die in the last week).
