Data Storage Hardware

Ask Slashdot: How Do SSDs Die? 510

Posted by timothy
from the whimpery-bang dept.
First time accepted submitter kfsone writes "I've experienced, first-hand, some of the ways in which spindle disks die, but either I've yet to see an SSD die or I'm not looking in the right places. Most of my admin-type friends have theories on how an SSD dies but admit none of them has actually seen commercial grade drives die or deteriorate. In particular, the failure process seems like it should be more clinical than spindle drives. If you have X many of the same SSD drive and none of them suffer manufacturing defects, if you repeat the same series of operations on them they should all die around the same time. If that's correct, then what happens to SSDs in RAID? Either all your drives will start to fail together or at some point, your drives will become out of sync in terms of volume sizing. So, have you had to deliberately EOL corporate grade SSDs? Do they die with dignity or go out with a bang?"

  • CRC Errors (Score:5, Informative)

    by Anonymous Coward on Tuesday October 16, 2012 @12:22PM (#41670125)

    I had 2 out of 5 SSDs fail (OCZ) with CRC errors, I'm guessing faulty cells.

  • They shrink (Score:2, Informative)

    by Anonymous Coward on Tuesday October 16, 2012 @12:26PM (#41670193)

    The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.

  • by Anonymous Coward on Tuesday October 16, 2012 @12:29PM (#41670259)

    From what I understand, SSDs die because of "write-burnout" if they are flash-based, and from what I understand the majority of SSDs are flash-based now. So while I haven't actually had a drive fail on me, I assume that I would still be able to read data off a failing drive and restore it, making it an ideal failure path. I did a google search and found a good article on the issue: http://www.makeuseof.com/tag/data-recovered-failed-ssd/

  • wear leveling (Score:2, Informative)

    by Anonymous Coward on Tuesday October 16, 2012 @12:29PM (#41670261)

    SSDs use wear leveling algorithms to optimize each memory cell's lifespan; meaning that it keeps track of how many times each cell was written and it ensures that all cells are being utilized evenly. When the cells fail, they're being kept track of and the drive does not attempt to write to that cell any longer. When enough cells have failed the capacity of the drive will shrink noticeably. At that point it is probably wise to replace it. For a RAID configuration the wear level algorithm would presumably still work as the RAID algorithm pumps even amounts of data to each drive (whether it is mirrored or striped). When any of the drives are shrinking in size it is presumably time to replace the array.
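
    A minimal sketch of the wear-leveling idea described above, assuming a toy drive where every write lands on the least-worn usable block and a block is retired once it hits a fixed erase limit. The class name, block count, and limit are made up for illustration; real controllers are far more involved.

```python
class WearLevelingFlash:
    """Toy model: each block has an erase counter and a retirement limit."""

    def __init__(self, num_blocks, max_erases):
        self.erase_counts = [0] * num_blocks
        self.max_erases = max_erases      # hypothetical endurance limit
        self.bad_blocks = set()           # blocks the controller no longer writes to

    def usable_blocks(self):
        return [b for b in range(len(self.erase_counts)) if b not in self.bad_blocks]

    def write(self):
        """Place one write on the least-worn usable block; return its index."""
        candidates = self.usable_blocks()
        if not candidates:
            raise IOError("no writable blocks left")
        block = min(candidates, key=lambda b: self.erase_counts[b])
        self.erase_counts[block] += 1
        if self.erase_counts[block] >= self.max_erases:
            self.bad_blocks.add(block)    # retire the worn-out block
        return block


flash = WearLevelingFlash(num_blocks=4, max_erases=3)
for _ in range(10):
    flash.write()
print(flash.erase_counts, sorted(flash.bad_blocks))
```

    Because writes always go to the least-worn block, the counters never drift more than one apart, which is why the blocks in this model wear out in a cluster rather than one at a time.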

  • by dublin (31215) on Tuesday October 16, 2012 @12:31PM (#41670275) Homepage

    In general, if the SSD in question has a well-designed controller (Intel, SandForce), then write performance will begin to drop off as bad blocks start to accumulate on the drive. Eventually, wear levelling and write cycles have taken their toll, and the disk can no longer write at all. At this point, the controller does all it can: it effectively becomes a read-only disk. It should operate in this mode until something else catastrophic (tin migration, capacitor failure, etc.) keeps the entire drive from working.

    BTW - I haven't seen this either, but that's the degradation profile that's been presented to me in several presentations by the folks making SSD drives and controllers. (Intel had a great one a few years back - don't have a link to it handy, though...)

  • by PeeAitchPee (712652) on Tuesday October 16, 2012 @12:31PM (#41670287)
    With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely. In my experience, though SSDs don't fail as often, when they do, it's sudden and catastrophic. Having said that, I've only seen one fail out of the ~10 we've deployed here (and it was in a laptop versus traditional desktop / workstation). So BACK IT UP. Just my $0.02.
  • Re:They shrink (Score:5, Informative)

    by tgd (2822) on Tuesday October 16, 2012 @12:32PM (#41670307)

    The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.

    Filesystems, generally speaking, aren't resilient to the underlying disk geometry changing after they've been laid down. There's reserved space to replace bad cells as they start to die, but the disk won't shrink. Eventually, though, you get parts of the disk dying in an unrecoverable way and the drive is toast.

  • Re:Umm (Score:5, Informative)

    by kelemvor4 (1980226) on Tuesday October 16, 2012 @12:32PM (#41670315)

    It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.

    Never heard of that. I've got about 450 servers each with a raid1 and raid10 array of physical disks. We always buy everything together, including all the disks. If one fails we get alerts from the monitoring software and get a technician to the site that night for a disk replacement. I think I've seen one incident in the past 14 years I've been in this department where more than one disk failed at a time.

    My thought on buying them separately is that you run the risk of getting devices with different firmware levels or other manufacturer revisions which would be less than ideal when raided together. Not to mention you have a mess for warranty management. We replace systems (disks included) when the 4 year warranty expires.

  • Bang! (Score:5, Informative)

    by greg1104 (461138) <gsmith@gregsmith.com> on Tuesday October 16, 2012 @12:33PM (#41670325) Homepage

    All three of the commercial grade SSD failures I've cleaned up after (I do PostgreSQL data recovery) just died. No warning, no degrading in SMART attributes; works one minute, slag heap the next. Presumably some sort of controller level failure. My standard recommendation here is to consider them no more or less reliable than traditional disks and always put them in RAID-1 pairs. Two of the drives were Intel X25 models, the other was some terrible OCZ thing.

    Out of more current drives, I was early to recommend Intel's 320 series as a cheap consumer solution reliable for database use. The majority of those I heard about failing died due to firmware bugs, typically destroying things during the rare (and therefore not well tested) unclean shutdown / recovery cases. The "Enterprise" drive built on the same platform after they tortured consumers with those bugs for a while is their 710 series, and I haven't seen one of those fail yet. That's not across a very large installation base nor for very long yet though.

  • Re:Umm (Score:5, Informative)

    by StoneyMahoney (1488261) on Tuesday October 16, 2012 @12:35PM (#41670347)

    The rationale behind splitting hard drives in a RAID between a number of manufacturers' batches, even for identical drives, is to try to keep a problem with an entire batch that's slipped past QA from taking out an entire array of drives simultaneously.

    I'm paranoid, but am I paranoid enough....?

  • by vlm (69642) on Tuesday October 16, 2012 @12:36PM (#41670357)

    My experience was system crash due to corruption of loaded executables, then at the hard reboot it fails the e2fsck because the "drive" is basically unwritable so the e2fsck can't complete.

    It takes a long time to kill a modern SSD... this failure was from back when a CF plugged into a PATA-to-CF adapter was exotic even by /. standards

  • by MRGB (2743757) on Tuesday October 16, 2012 @12:38PM (#41670389)
    I have seen SSD death many times and it is a strange sight indeed. What is interesting about it when compared to normal drives is that when normal drives fail it is - mostly - an all-or-nothing ordeal. A bad spot on a drive is a bad spot on a drive. With SSDs you can have a bad spot one place, reboot, and you get a bad spot in another place. Windows loaded on an SSD will exhibit all kinds of bizarre behaviour. Sometimes it will hang, sometimes it will blue-screen, sometimes it will boot normally until it tries to read or write to that random bad spot. Rebooting is like rolling the dice to see what it will do next - that is, until it fails completely.
  • Re:CRC Errors (Score:5, Informative)

    by Anonymous Coward on Tuesday October 16, 2012 @12:39PM (#41670415)

    OCZ has some pretty notorious QA issues with a few lines of their SSDs, especially if your firmware isn't brand spanking new at all times.

    I'd google your drive info to see if yours are on death row. They seem a little small (old) for that, since I only know of problems with their more recent, bigger drives.

  • by PRMan (959735) on Tuesday October 16, 2012 @12:49PM (#41670583)
    I have had two SSD crashes. One was on a very cheap Zelman 32GB drive which never really worked (OK, about twice). The other was on a Kingston 64GB that I have in my server. When it gets really hot in the room (over 100, so probably over 120 for the drive itself in the case), it will crash. But when it cools down, it works perfectly well.
  • SSD wear cliff (Score:5, Informative)

    by RichMan (8097) on Tuesday October 16, 2012 @12:50PM (#41670609)

    SSDs have an advertised capacity N and an actual capacity M, where M > N. In general, the bigger M is relative to N, the better the performance and lifetime of the drive. As it wears, it will "silently" mark blocks bad and reduce M. Your write performance will degrade. If you have good analysis tools, they will tell you when a lot of blocks are getting near end of life and when M is being reduced.

    Blocks near end of life are also more likely to get read errors. The drive firmware is supposed to juggle things around so all of the blocks approach end of life at about the same time. With a soft read error the block will be moved to a more reliable portion of the SSD. That means increased wear.

    1. Watch write performance/spare block count
    2. If you get any read errors do a block life audit
    3. When you get into life limiting events things accelerate to bad due to the mitigation behaviors

    Be careful: depending on the sensitivity of the firmware, it may let you get closer to catastrophe before warning you. Consumer-grade drives are more likely to cut it close.
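
    A rough sketch of the "watch the spare block count" advice above, assuming the spare pool is simply M - N and shrinks by one for each retired block. The function names and the 20% warning threshold are arbitrary choices for illustration, not vendor values.

```python
def spare_blocks_remaining(actual_blocks, advertised_blocks, retired_blocks):
    """Spare pool left: (M - N) minus blocks already retired."""
    return actual_blocks - advertised_blocks - retired_blocks


def spare_status(actual_blocks, advertised_blocks, retired_blocks, warn_fraction=0.2):
    """Return 'ok', 'warn', or 'exhausted' based on the remaining spare pool.

    warn_fraction is an arbitrary illustrative threshold.
    """
    initial = actual_blocks - advertised_blocks
    remaining = spare_blocks_remaining(actual_blocks, advertised_blocks, retired_blocks)
    if remaining <= 0:
        return "exhausted"
    if remaining / initial <= warn_fraction:
        return "warn"
    return "ok"


# Hypothetical drive: M = 1100 blocks, N = 1000 advertised, so 100 spares.
for retired in (0, 85, 100):
    print(retired, spare_status(1100, 1000, retired))
```

    The point of the warning band is the one made above: once the pool is nearly gone, things go bad quickly, so the alert has to fire well before zero.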

  • by SydShamino (547793) on Tuesday October 16, 2012 @12:51PM (#41670633)

    In theory, yes. In flashROM devices the erase process is the aging action. Your write-once-never-erase-read-only flash should last until A) enough charge manages to leak out of gates that you get bit errors, or B) the part fails due to corrosion or other long-term aging issue, similar to any piece of electronics.

    If you have raw access to the flashROM you could in theory write the same data into the same unerased bytes to recover from bit errors (if you had an uncorrupted copy), so only aging failures would occur. But of course you can't do this with an SSD as you have no direct access to the memory, and the controller A) wouldn't let you write into unerased space, and B) wouldn't write the data into the exact same place again anyway.

  • by dougmc (70836) <dougmc+slashdot@frenzied.us> on Tuesday October 16, 2012 @12:54PM (#41670657) Homepage

    With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely.

    Usually? No.

    This does happen sometimes, but it certainly doesn't happen "usually". There are enough different failure mechanisms for hard drives that there isn't any one "usual" method --

    1- drive starts reporting read and/or write errors occasionally, but otherwise seems to keep working
    2- drive just suddenly stops working completely all at once
    3- drive starts making noise (and performance usually drops massively), but the drive still works.
    4- drive seems to keep working, but smart data starts reporting all sorts of problems.

    Personally, I've had #1 happen more often than anything else, usually with a healthy serving of #4 at about the same time or shortly before. #2 is the next most common failure mode, at least in my experience.

  • Re:They shrink (Score:5, Informative)

    by v1 (525388) on Tuesday October 16, 2012 @12:55PM (#41670669) Homepage Journal

    The sectors you are talking about are often referred to as "remaps" (or "spares"), which is also used to describe the number of blocks that have been remapped. Strategies vary, but an off-the-cuff average would be around one available spare per 1000 allocatable blocks. Some firmware will only use a spare from the same track, other firmware will pull the next nearest available spare. (allowing an entire track to go south)

    The more blocks they reserve for spares, the lower the total capacity count they can list, so they don't tend to be too generous. Besides, if your drive is burning through its spares at any substantial rate, doubling the number of spares on the drive won't actually end up buying you much time, and certainly won't save any data.

    But with the hundreds of failing disks I've dealt with, when more than ~5 blocks have gone bad, the drive is heading out the door fast. Remaps only hide the problem at that point. If your drive has a single block fail when trying to write, it will be remapped silently and you won't ever see the problem unless you check the remap counter in smart. If it gets an unreadable block on a read operation, you will probably see an io error however. Some drives will immediately remap it, but most don't and will conduct the remap when you next try to write to that cell. (otherwise they'd have to return fictitious data, like all zeros)

    So I don't particularly like automatic silent remaps. I'd rather know when the drive first looks at me funny so I can make sure my backups are current and get a replacement on order, and swap it out before it can even think about getting worse. I prefer to replace a drive on MY terms, on MY schedule, not when it croaks and triggers any grade of crisis. There are legitimate excuses for downtime, but a slowly failing drive shouldn't be one of them.

    All that said, on multiple occasions I've tried to cleanse a drive of IO errors by doing a full zero-it format. All decent OBCCs on drives should verify all writes, so in theory this should purge the drive of all IO errors, provided all available spares have not already been used. The last time I did this on a 1TB Hitachi that had ONE bad block on it, it still had one bad block (via read verify) when the format was done. The write operation did not trigger a remap, (and I presume it wasn't verified, as the format didn't fail) and I don't understand that. If it were out of remaps, the odds of it being ONE short of what it needed is essentially zero. So I wonder in reality just how many drive manufacturers aren't even bothering with remapping bad blocks. All I can attribute this to is crappy product / firmware design.
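
    The remap behaviour described in this comment (a bad block found on read returns an I/O error and is only swapped for a spare on the next write) can be sketched as a toy model. Everything here, including the class name and the way bad blocks are marked, is a simplification for illustration and not how any particular firmware works.

```python
class RemappingDrive:
    """Toy drive that remaps a bad block to a spare on the next write to it."""

    def __init__(self, num_blocks, num_spares):
        self.data = {}                       # physical location -> payload
        self.remap = {}                      # logical block -> spare location
        self.free_spares = list(range(num_blocks, num_blocks + num_spares))
        self.bad = set()                     # physical locations known to be bad
        self.pending = set()                 # logical blocks that failed a read

    def _physical(self, lba):
        return self.remap.get(lba, lba)

    def read(self, lba):
        phys = self._physical(lba)
        if phys in self.bad:
            self.pending.add(lba)            # remember it; remap on next write
            raise IOError(f"unreadable block {lba}")
        return self.data.get(phys, b"")

    def write(self, lba, payload):
        phys = self._physical(lba)
        if phys in self.bad:
            if not self.free_spares:
                raise IOError("out of spares")
            phys = self.free_spares.pop(0)   # silently substitute a spare
            self.remap[lba] = phys
            self.pending.discard(lba)
        self.data[phys] = payload

    def remap_count(self):
        return len(self.remap)


drive = RemappingDrive(num_blocks=8, num_spares=2)
drive.write(3, b"old")
drive.bad.add(3)                             # simulate the block going bad
try:
    drive.read(3)
    read_failed = False
except IOError:
    read_failed = True
drive.write(3, b"new")                       # triggers the deferred remap
print(read_failed, drive.read(3), drive.remap_count())
```

    In this model the only trace of the failure after the rewrite is the remap counter, which is the same reason the comment recommends checking that counter rather than trusting the absence of errors.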

  • by YesIAmAScript (886271) on Tuesday October 16, 2012 @12:55PM (#41670673)

    It's statistical, not fixed rate. Some cells wear faster than others due to process variations, and the failures don't show up to you until there are uncorrectable errors. If one chip gets 150 errors spread out across the chip, and another gets 150 in critical positions (near to each other), then the latter one will show failures while the first one keeps going.

    So yeah, when one goes, you should replace them all. But they won't all go at once.

    Also note most people who have seen SSD failures have probably seen them fail due to software bugs in their controllers, not inherent inability to store data due to wear.
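
    To illustrate the "statistical, not fixed rate" point, here is a small simulation sketch. It assumes ideal wear leveling (every cell gets the same number of writes), a per-cell endurance drawn from a normal distribution to stand in for process variation, and a drive that dies once more cells have worn out than it has spares. All the numbers are invented.

```python
import random


def drive_lifetime(rng, cells=1000, spares=20, mean_endurance=3000, spread=300):
    """Write cycles until more cells have worn out than there are spares.

    Every cell sees the same write count; only the per-cell endurance
    varies, standing in for manufacturing process variation.
    """
    endurances = sorted(rng.gauss(mean_endurance, spread) for _ in range(cells))
    # The drive dies when the (spares + 1)-th weakest cell wears out.
    return endurances[spares]


rng = random.Random(2012)                    # fixed seed so runs are repeatable
lifetimes = [drive_lifetime(rng) for _ in range(8)]
print([round(t) for t in lifetimes])
```

    Under these assumptions eight "identical" drives with identical workloads fail at different cycle counts, but within a fairly narrow band, which fits the advice above: they won't all go at once, but when one goes the rest are not far behind.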

  • by SeanTobin (138474) <byrdhuntr@hotmai3.1415926l.com minus pi> on Tuesday October 16, 2012 @12:58PM (#41670729)

    I recently had an "old" (circa 2008) 64GB SSD die on me. Its death followed this pattern:

    • Inexplicable system slowdowns. In hindsight, this should have been a warning alarm.
    • System crash, followed by a failure to boot due to an unclean NTFS volume which couldn't be fixed by chkdsk
    • Failed to mount r/w under Ubuntu. Debug logs showed that the volume was unclean and all writes failed with a timeout
    • Successful r/o mount showed that the filesystem was largely intact
    • Successful dd imaged the drive and allowed a restore to a new drive.

    After popping a new disk in and doing a partition resize, my system was back up and running with no data loss. Of all the storage hardware failures I've experienced, this was probably the most pain-free as the failure caused the drive to simply degrade into a read-only device.
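
    A minimal sketch of the imaging step in the list above, written in Python rather than dd. It copies a source block by block and zero-fills any block that fails to read so offsets stay aligned, roughly what dd does with conv=noerror,sync. The function name and file names are hypothetical, and the demo runs on a temporary file; pointing it at a real device would need appropriate permissions and a read-only source.

```python
import os
import tempfile


def rescue_copy(src_path, dst_path, block_size=4096):
    """Copy src to dst block by block; zero-fill blocks that cannot be read.

    Returns the byte offsets of blocks that failed. A short read is treated
    as a failure here, which is a simplification.
    """
    failed_offsets = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)      # works for files and block devices
        for offset in range(0, size, block_size):
            want = min(block_size, size - offset)
            try:
                src.seek(offset)
                chunk = src.read(want)
            except OSError:
                chunk = b""
            if len(chunk) != want:
                failed_offsets.append(offset)
                chunk = chunk.ljust(want, b"\0")   # pad to keep offsets aligned
            dst.write(chunk)
    return failed_offsets


with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "failing.img")    # stand-in for the dying drive
    dst_file = os.path.join(tmp, "rescued.img")
    with open(src_file, "wb") as f:
        f.write(os.urandom(10000))
    bad = rescue_copy(src_file, dst_file)
    with open(src_file, "rb") as a, open(dst_file, "rb") as b:
        identical = a.read() == b.read()
print(bad, identical)
```

    Reading in small blocks matters on a failing drive: one unreadable region costs a single block of the image rather than aborting the whole copy.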

  • Re:Umm (Score:5, Informative)

    by CaptSlaq (1491233) on Tuesday October 16, 2012 @01:00PM (#41670763)

    I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! back to backup tapes! Better to have replacement drives in boxes waiting.

    This. Your spares closet is your best friend in the enterprise. Ensure you keep it stocked.

  • Bathtub Curve (Score:5, Informative)

    by Onymous Coward (97719) on Tuesday October 16, 2012 @01:05PM (#41670833) Homepage

    The bathtub curve [wikimedia.org] is widely used in reliability engineering. It describes a particular form of the hazard function which comprises three parts:

    • The first part is a decreasing failure rate, known as early failures.
    • The second part is a constant failure rate, known as random failures.
    • The third part is an increasing failure rate, known as wear-out failures.
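
    One common way to draw such a curve is to add three hazard terms: a Weibull hazard with shape below 1 (early failures), a constant (random failures), and a Weibull hazard with shape above 1 (wear-out). The sketch below does that; the shape, scale, and constant values are invented purely to produce the bathtub shape and are not fitted to any drive data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Sum of decreasing, constant, and increasing hazard terms (t in hours)."""
    early = weibull_hazard(t, shape=0.5, scale=2000.0)      # decreasing with t
    random_failures = 0.0001                                # constant
    wear_out = weibull_hazard(t, shape=5.0, scale=40000.0)  # increasing with t
    return early + random_failures + wear_out


for hours in (10, 1000, 20000, 40000, 60000):
    print(hours, bathtub_hazard(hours))
```

    With these made-up parameters the rate starts high, bottoms out in mid-life, and climbs again as the wear-out term takes over.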
  • Re:CRC Errors (Score:5, Informative)

    by Dishwasha (125561) on Tuesday October 16, 2012 @01:11PM (#41670925)

    I've had over 10 replacements on the original OCZ Vertex 160GB drives and an unnecessary motherboard replacement on my laptop that I eventually figured out was due to the laptop battery reaching the end of its life and not providing enough voltage. Unfortunately OCZ's engineers did not design the drives to handle loss of voltage and the drives absolutely corrupt. Eventually OCZ sneakily modified their warranty to include not providing warranty when the drives don't receive enough power rather than getting their engineers to just fix the problem. I'm actually running on a Vertex 3 and as of yet have not had that problem, but I am crossing my fingers.

  • Re:Umm (Score:5, Informative)

    by Bob the Super Hamste (1152367) on Tuesday October 16, 2012 @01:37PM (#41671247) Homepage
    For those who are interested the white paper is titled "Failure Trends in a Large Disk Drive Population" and can be found here [googleusercontent.com]. It is a fairly short read (13 total pages) and quite interesting if you are into monitoring stuff.
  • Re:They shrink (Score:3, Informative)

    by Bob the Super Hamste (1152367) on Tuesday October 16, 2012 @01:49PM (#41671427) Homepage
    From my understanding this is exactly the type of thing that S.M.A.R.T is going to detect along with a number of other issues. If you are interested I suggest checking out the paper from Google entitled "Failure Trends in a Large Disk Drive Population" as they made extensive use of S.M.A.R.T and tracked an extremely large number of drives for a number of years for the analysis. [slashdot.org]
  • Re:CRC Errors (Score:5, Informative)

    by anne on E. mouse cow (867445) on Tuesday October 16, 2012 @01:53PM (#41671497) Journal

    http://www.behardware.com/articles/862-7/components-returns-rates-6.html [behardware.com]

    Personally, I'm glad my SSDs aren't OCZ.

  • Re:They shrink (Score:4, Informative)

    by v1 (525388) on Tuesday October 16, 2012 @02:09PM (#41671785) Homepage Journal

    SMART is implemented in different ways by different manufacturers. The idea is that the host can ask the peripheral "what value does slot xx contain?" This can refer to an instantaneous condition, such as the temperature of the hard drive, a static value such as how many spares are currently available, a semidynamic value such as is this hard drive failing, and a dynamic value such as how many remap operations have occurred. There's a short list of "basic/standard" values, and then there's the "extended/optional" metrics that not all devices need to support. Each smart slot will also specify the min and max values. If any smart slot has a value outside its allowed range, overall smart status will report as failing. Once a drive toggles over to failing, there's no going back, unless you figure out a way to reset the counters.

    One of the standard set is the "is the hard drive failing" metric. It allows the host to get a simple yes/no answer to summarize whether any of the metrics have gone beyond their tolerated values. For example, one drive I worked with recently was allowed to overtemp twice. If it had experienced a third overtemp during its lifetime, the drive would then permanently fail the overall test. This allows the host to "check smart status" without really having to think much about what it's doing. This is the basic test that most modern OS's check to see if a hard drive needs to be replaced. You usually need to run a special tool to check individual values being returned by smart. These tools need to have a list of what each slot means, and often will report fairly meaningless information near the end of the list, where they don't know what this 23 means in slot 85 etc.

    Other known values may slowly increment over the lifetime of the drive, such as "head re-calibrations", "remaps", SMS head parks, max g forces experienced, etc. You'd have to compare their current values with their claimed limits to see how close each of these metrics is to causing overall smart to toggle to failed. Without knowing what the metric is, or what its expected limit is, the numbers aren't useful.
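
    The overall pass/fail summary described above can be sketched as a simple threshold check. This assumes the common convention where each attribute carries a normalized value (higher is healthier) and a vendor threshold, and counts as failing once the value drops to or below the threshold. The attribute names and numbers are illustrative only, and the model leaves out the "once failed, always failed" behaviour.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int       # normalized current value (higher is healthier)
    threshold: int   # vendor-set failure threshold

    def failing(self):
        return self.value <= self.threshold


def overall_status(attributes):
    """Return ('PASSED' | 'FAILING', names of the attributes past threshold)."""
    failing = [a.name for a in attributes if a.failing()]
    return ("FAILING" if failing else "PASSED", failing)


# Illustrative values only.
healthy = [
    SmartAttribute(5, "Reallocated_Sector_Ct", value=100, threshold=36),
    SmartAttribute(9, "Power_On_Hours", value=98, threshold=0),
]
worn = [
    SmartAttribute(5, "Reallocated_Sector_Ct", value=20, threshold=36),
    SmartAttribute(9, "Power_On_Hours", value=60, threshold=0),
]
print(overall_status(healthy))
print(overall_status(worn))
```

    Looking at the per-attribute margin (value minus threshold) instead of only the summary is what lets you see a drive heading toward failure before the overall flag flips.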

  • Re:Umm (Score:4, Informative)

    by Bob the Super Hamste (1152367) on Tuesday October 16, 2012 @02:53PM (#41672395) Homepage
    Mostly the methodology, as well as the way it disproves some of the standard thinking (that heat or activity kills drives). They were looking for a leading indicator for all drive failures (was some error reported before a given drive crapped out?), which they didn't find, as a large portion of the drives just crapped out without warning. But any drives that did start to report warnings were very likely to crap out shortly (I think their threshold was 60 days), which does help to prevent downtime. Interestingly, I had to look into disk monitoring at my job and ran across that paper, implemented some automated S.M.A.R.T. monitoring, and one of the disks in a box had tossed some errors. People complained because my code was alarming on this issue, so they thought my code was bad. A couple of days later the drive gave up the ghost and I was vindicated.
  • by Anonymous Coward on Tuesday October 16, 2012 @03:22PM (#41672751)

    A: Memory cells begin to die off faster than the SSD's controller can annotate them as bad and reallocate the memory which initially shows up as major slowdown, then as crc32 errors which increase in frequency and severity due to overwrites not completing correctly. The issue accelerates until the drive becomes unusable. This failure is usually due to heavy use, age and cheap, cheap memory.

    B: A solder joint on a chip cracks and takes out the chip and, since the entire array of chips is set up RAID0 style, the entire drive is mysteriously dead one day. This occurs due to extreme differences between the hot and cold temperatures the drive is exposed to, not by itself but by other components; lead-free solder has multiple metals in it which expand and contract at different rates, so as you heat up and cool down you cause extreme contraction and expansion. Like bending a fork too many times, microfractures form which eventually coalesce to become one big open in the circuit.

    C: Shorting of the internal chip components causing the infamous "black glass" situation where the voltage and grounding planes of the chip short out, heat up, and you get to see black glass on the very top of the chip and sometimes a small distortion.

    D: Firmware memory fails. Shows up as every single weird issue you can imagine.

    E: Defects in the drive such as poor connectors between the die and external connectors, or lack of shock resistance during shipping for certain solder joints, usually the drives fail quick and hard.

    All of the above are basically possible, save for Point A, on a regular hard drive.

    Fact: If a harddrive goes, drivesavers can toss it under an electron microscope and recover the data. SSDs have no known recovery methodologies because the above failure modes usually physically destroy the data.

    Point A makes RAID arrays using SSDs particularly interesting since if you purchase a box of drives with similar serial numbers and start running them at the same load over time, you're bound to end up with them failing near the same point in time. Thankfully, however, different cells on each drive are going to fail at different times. The majority of harddrive failures are mechanical in nature as wear occurs at different rates for different disks.

    SSDs are GREAT for certain applications where shock resistance and speed are key; you can get 15 times the random read/write at 1/100th the latency out of an SSD than you can out of the priciest harddrive, and for a fraction of the cost a server racked with drives can fully saturate its network ports. For doing large-volume data projects or running a fully virtualized infrastructure that needs tons of I/O, there really is, IMO, no other option. Doing so, however, without backups upon backups is suicide for the same reason running a SAN indefinitely without a backup is suicide. Thankfully running VMs makes backing up and restoring a breeze.

  • Re:CRC Errors (Score:5, Informative)

    by ZedNaught (533388) on Tuesday October 16, 2012 @04:13PM (#41673343)
    Firmware release notes, from January 13th, 2012: "Correct a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point. The condition will allow the end user to successfully update firmware, and poses no risk to user or system data stored on the drive."
  • by bbasgen (165297) on Tuesday October 16, 2012 @05:22PM (#41674215) Homepage
    I have ordered approximately 500 Intel SSDs over the past 18 months (320 series and the 520 series primarily). To date, we have had exactly one fail to my knowledge. It was a 320 series 160 GB with a known firmware issue. We have around 80 of that type and size, and the drive that failed did so on first image. We RMA'ed the drive and got a replacement.
