Ask Slashdot: How Do SSDs Die?

Posted by timothy on Tuesday October 16, 2012 @12:20PM from the whimpery-bang dept.

First time accepted submitter kfsone writes "I've experienced, first-hand, some of the ways in which spindle disks die, but either I've yet to see an SSD die or I'm not looking in the right places. Most of my admin-type friends have theories on how an SSD dies but admit none of them has actually seen commercial grade drives die or deteriorate. In particular, the failure process seems like it should be more clinical than spindle drives. If you have X many of the same SSD drive and none of them suffer manufacturing defects, if you repeat the same series of operations on them they should all die around the same time. If that's correct, then what happens to SSDs in RAID? Either all your drives will start to fail together or at some point, your drives will become out of sync in terms of volume sizing. So, have you had to deliberately EOL corporate grade SSDs? Do they die with dignity or go out with a bang?"

CRC Errors (Score:5, Informative)

by Anonymous Coward writes: on Tuesday October 16, 2012 @12:22PM (#41670125)

I had 2 out of 5 SSDs fail (OCZ) with CRC errors, I'm guessing faulty cells.

- Re: (Score:3, Interesting)
  
  by Quakeulf ( 2650167 ) writes:
  
  How big in terms of gigabytes were they? I have two disks from OCZ myself and they have been pretty fine so far. The biggest is 64 gb, the smallest is 32 gb. I was thinking of upgrading to a 256 gb SSD at some point but not knowing what might kill it is something I honestly have not thought of, and would like some input on. My theory is heat and a faulty power supply would play major roles in this, but not so sure about physical impact although to some extent it would break it.
  - Re:CRC Errors (Score:5, Informative)
    
    by Anonymous Coward writes: on Tuesday October 16, 2012 @12:39PM (#41670415)
    
    OCZ has some pretty notorious QA issues with a few lines of their SSDs, especially if your firmware isn't brand spanking new at all times.
    I'd google your drive info to see if yours are on death row. They seem a little small (old) for that, since I only know of problems with their more recent, bigger drive.
    
    - Re:CRC Errors (Score:5, Interesting)
      
      by Synerg1y ( 2169962 ) writes: on Tuesday October 16, 2012 @12:53PM (#41670647)
      
      OCZ makes several different product lines of SSDs, and each line has its own quirks, so generalizing OCZ's QA issues isn't accurate. I've always had good luck with the Vertex 3s, both for myself & people I install them for. I've had an SSD die once and it looked identical to a spinning disk failure from a chkdsk point of view. I can't remember what kind it was, either an OCZ or a Corsair, but I can name a ton of both of those brands that are still going 2-3y+.
      
      - Re:CRC Errors (Score:5, Informative)
        
        by MrL0G1C ( 867445 ) writes: on Tuesday October 16, 2012 @01:53PM (#41671497) Journal
        
        http://www.behardware.com/articles/862-7/components-returns-rates-6.html [behardware.com]
        Personally, I'm glad my SSDs aren't OCZ.
        
        Re:CRC Errors (Score:4, Insightful)
        
        by markhahn ( 122033 ) writes: on Tuesday October 16, 2012 @02:27PM (#41672047)
        
        this is not very useful, as it mainly points out that the initial generations of commodity SSDs were immature. not to mention that return rates contain other phenomena than wear or even failure.
        
      - Re: (Score:3)
        
        by clarkn0va ( 807617 ) writes:
        
        Exactly right. I've used, sold and supported dozens of SSDs. Most were Vertex or Agility (1, 2, 3 and 4), and I've yet to see a single one fail. By contrast, I sold exactly three OCZ Petrols and had 4 failures! The last two were RMA replaced by Agility 3 and Octane, respectively, so obviously OCZ has seen a problem there.
        Similarly, I sold a batch of a dozen or so Kingston budget drives and saw nearly half of them fail around the 1 year mark. I've used a couple Corsair drives and had issues with them not comi
      - Re: (Score:3)
        
        by ArhcAngel ( 247594 ) writes:
        
        I suspected you weren't talking about the Vertex line. I got a 60GB Vertex III last year and shortly after installation it started randomly disappearing during intense gaming. I thought it was due to heat so I bolted a fan/heat sink to it but that didn't seem to help. I struggled with the issue for over six months until OCZ finally released a firmware update (it had released several prior) that fixed it. I never lost any data but it was a little unnerving every time it would vanish.
    - Re:CRC Errors (Score:5, Informative)
      
      by Dishwasha ( 125561 ) writes: on Tuesday October 16, 2012 @01:11PM (#41670925)
      
      I've had over 10 replacements on the original OCZ Vertex 160GB drives and an unnecessary motherboard replacement on my laptop that I eventually figured out was due to the laptop battery reaching the end of its life and not providing enough voltage. Unfortunately OCZ's engineers did not design the drives to handle loss of voltage and the drives absolutely corrupt. Eventually OCZ sneakily modified their warranty to include not providing warranty when the drives don't receive enough power rather than getting their engineers to just fix the problem. I'm actually running on a Vertex 3 and as of yet have not had that problem, but I am crossing my fingers.
      
      - Re:CRC Errors (Score:5, Funny)
        
        by lytles ( 24756 ) writes: on Tuesday October 16, 2012 @01:40PM (#41671295) Homepage
        
        power corrupts. absolute power corrupts absolutely
        
        Re:CRC Errors (Score:5, Funny)
        
        by Mattcelt ( 454751 ) writes: on Tuesday October 16, 2012 @01:44PM (#41671353)
        
        And a lack of power enables corruption. QED
        
      - Re:CRC Errors (Score:5, Insightful)
        
        by Dishwasha ( 125561 ) writes: on Tuesday October 16, 2012 @03:39PM (#41672967)
        
        I would counter-argue that any flash drive manufacturer is asking for massive RMAs when the device is clearly targeted for the laptop market (otherwise they would manufacture it in a 3.5" format) where the operating environment is guaranteed to be running on a battery for long periods of time. Any research into battery operation would expose you to the vast differences in operating voltage as batteries discharge as well as the age of the battery. It is just bad engineering to not take this into account.
        Reformatting the drive was not an option because the drive wouldn't even detect in the BIOS unless the special factory jumper was set which is a non-operational mode for the drive. This problem was reproduced over 10 times with over 10 different drives of the same model Vertex. Slightly bad power caused the entire drive to be rendered unusable. Amazingly, none of the other hardware in the laptop had any problem with the power (i.e. screen, cpu, memory, other spindle-based hard drive, gpu, etc.). As I said, bad engineering.
        
        Re: (Score:3)
        
        by filthpickle ( 1199927 ) writes:
        
        otherwise they would manufacture it in a 3.5" format
        The standard form factor for SSD's is 2.5" no matter how you intend to use them. I am not really commenting on what you say aside from that. I was honestly curious when I read that because I have never seen a 3.5" SSD (I haven't looked very hard). There are a few from OCZ on newegg but that's all a brief scan could find.
  - - Re:CRC Errors (Score:5, Insightful)
      
      by arth1 ( 260657 ) writes: on Tuesday October 16, 2012 @03:16PM (#41672681) Homepage Journal
      
      I am running (6) OCZ Vertex2 256GB drives under heavy use 24/7. Almost 2 years on have only had one fail and it still works, just started kicking random errors.
      Your failure rate of > 8% per year isn't very reassuring.
      
- Re: (Score:3)
  
  by AmiMoJo ( 196126 ) writes:
  
  You could be more specific. Errors on reading or writing?
  I have had a couple of SSDs die. The first was an Intel and ran out of spare capacity after about 18 months, resulting in write failures and occasional CRC errors on read. The other was an Adata and just died completely one day, made the BIOS hang for about five minutes before deciding nothing was connected.
- - - - Re:CRC Errors (Score:5, Informative)
        
        by ZedNaught ( 533388 ) writes: on Tuesday October 16, 2012 @04:13PM (#41673343)
        
        Firmware release notes, from January 13th, 2012: "Correct a condition where an incorrect response to a SMART counter will cause the m4 drive to become unresponsive after 5184 hours of Power-on time. The drive will recover after a power cycle, however, this failure will repeat once per hour after reaching this point. The condition will allow the end user to successfully update firmware, and poses no risk to user or system data stored on the drive."
        
Umm (Score:5, Insightful)

by The MAZZTer ( 911996 ) writes: <.moc.liamg. .ta. .tzzagem.> on Tuesday October 16, 2012 @12:23PM (#41670151) Homepage

It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.

- Re:Umm (Score:5, Informative)
  
  by kelemvor4 ( 1980226 ) writes: on Tuesday October 16, 2012 @12:32PM (#41670315)
  
  It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.
  Never heard of that. I've got about 450 servers each with a raid1 and raid10 array of physical disks. We always buy everything together, including all the disks. If one fails we get alerts from the monitoring software and get a technician to the site that night for a disk replacement. I think I've seen one incident in the past 14 years I've been in this department where more than one disk failed at a time.
  
  My thought on buying them separately is that you run the risk of getting devices with different firmware levels or other manufacturer revisions which would be less than ideal when raided together. Not to mention you have a mess for warranty management. We replace systems (disks included) when the 4 year warranty expires.
  
  - Re:Umm (Score:5, Informative)
    
    by StoneyMahoney ( 1488261 ) writes: on Tuesday October 16, 2012 @12:35PM (#41670347)
    
    The rationale behind splitting hard drives in a RAID between a number of manufacturers' batches, even for identical drives, is to try and avoid a problem with an entire batch that's slipped past QA taking out an entire array of drives simultaneously.
    I'm paranoid, but am I paranoid enough....?
    
    - Re:Umm (Score:5, Insightful)
      
      by statusbar ( 314703 ) writes: <jeffk@statusbar.com> on Tuesday October 16, 2012 @12:40PM (#41670431) Homepage Journal
      
      I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! back to backup tapes! Better to have replacement drives in boxes waiting.
      
      - Re:Umm (Score:5, Informative)
        
        by CaptSlaq ( 1491233 ) writes: on Tuesday October 16, 2012 @01:00PM (#41670763)
        
        I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! back to backup tapes! Better to have replacement drives in boxes waiting.
        This. Your spares closet is your best friend in the enterprise. Ensure you keep it stocked.
        
        Re:Umm (Score:5, Insightful)
        
        by Anonymous Coward writes: on Tuesday October 16, 2012 @02:26PM (#41672029)
        
        I've seen two instances where a drive failed. Each time there were no handy replacement drives. Within a week a second drive died the same way as the first! back to backup tapes! Better to have replacement drives in boxes waiting.
        This. Your spares closet is your best friend in the enterprise. Ensure you keep it stocked.
        And locked. And don't label them "spares". Label them "cold swap fallback device" or something that management won't see as something "extra" that can be "repurposed" (i.e. stolen)
        
      - Re: (Score:3)
        
        by Sparticus789 ( 2625955 ) writes:
        
        Also useful to set up your RAID with hot-swap drives. In a 16-drive array, I like to set up with RAID 6 and one hot-swap drive. That way, I can actually lose 2 drives, then one more drive once the hot swap has been populated.
      - Re: (Score:3)
        
        by mlts ( 1038732 ) * writes:
        
        That is why one uses RAID 6 with lower tier drives and hot spares.
        Lower tier drives (SATA) need RAID 6 and hot spares, because it takes a long time (days) to rebuild a failed drive, which leaves a large window for another drive failure to happen.
        Upper tier drives (FC/SSD) are far faster, so the window of vulnerability is a lot less, so RAID 5 is more useful. Even then, it doesn't hurt to have a hot spare, so no tech is needed in case of a drive failure. You just change out the failed drive at one's relati
        
        Re:Umm... (Score:3, Funny)
        
        by daha ( 1699052 ) writes:
        
        That is why one uses RAID 6 with lower tier drives and hot spares.
        Works great until 3 drives in the RAID fail.
        Better make it a RAID-60 just to be safe. And maybe mirror that too.
        
        Re: (Score:3)
        
        by ArsonSmith ( 13997 ) writes:
        
        Could run a cubed strip of raid 6 arrays in a RAID666
        
        Re:Umm (Score:5, Interesting)
        
        by kasperd ( 592156 ) writes: on Tuesday October 16, 2012 @03:12PM (#41672625) Homepage Journal
        
        That is why one uses RAID 6 with lower tier drives and hot spares.
        Best argument for RAID 6 is bad sectors discovered during reconstruction. Assume one of your disks has a bad sector somewhere. Unless you periodically read through all your disks, you may not notice this for a long time. Now assume a different disk dies [phuket-data-wizards.com]. Reconstruction starts writing on your hot spare. But during reconstruction an unreadable sector is found on a different drive. On RAID 5, that means data loss.
        
        I have on one occasion been assigned the task of recovering from pretty much that situation. And some of the data did not exist anywhere else. In the end my only option was to retry reading the bad media over and over until on one pass I got lucky.
        
        With RAID 6 you are much better off. If one disk is completely lost and you start reconstructing to the hot spare, you can tolerate lots of bad sectors. As long as you are not so unlucky to find bad sectors in the exact same location on two different drives, reconstruction will succeed. An intelligent RAID 6 system will even correct bad sectors in the process. When a bad sector is detected during this reconstruction, the data for both the bad sector as well as this location on the hot spare are reconstructed simultaneously and both can be written to the respective disk.
        
        At the end of the reconstruction you not only have reconstructed the lost disk, you have also reconstructed all the bad sectors found on any of the drives. Should one of the disks run out of space for remapping bad sectors in the process, then that disk is next in line to be replaced.
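
To make the "one dead disk plus one bad sector" case above concrete, here is a minimal Python sketch of a dual-parity stripe. It is a toy, not any particular controller's implementation: the P/Q construction over GF(2^8) (polynomial 0x11d, generator 2) is an assumption picked for illustration, and the names `gf_mul`, `parity` and `recover_two` are made up. It shows a stripe that has lost two data blocks being rebuilt from the survivors plus both parities.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result

def gf_pow(a, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    # Every nonzero element satisfies a^255 == 1, so a^254 is its inverse.
    return gf_pow(a, 254)

def parity(blocks):
    """Return (P, Q): P is plain XOR, Q weights block i by 2^i in GF(2^8)."""
    p = bytearray(len(blocks[0]))
    q = bytearray(len(blocks[0]))
    for i, block in enumerate(blocks):
        coef = gf_pow(2, i)
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coef, byte)
    return bytes(p), bytes(q)

def recover_two(blocks, p, q, x, y):
    """Rebuild missing data blocks x and y (entries at x and y are ignored)."""
    pxy, qxy = bytearray(p), bytearray(q)
    for i, block in enumerate(blocks):
        if i in (x, y):
            continue
        coef = gf_pow(2, i)
        for j, byte in enumerate(block):
            pxy[j] ^= byte                 # leaves Dx ^ Dy
            qxy[j] ^= gf_mul(coef, byte)   # leaves 2^x*Dx ^ 2^y*Dy
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    inv = gf_inv(gx ^ gy)
    dx = bytes(gf_mul(inv, qxy[j] ^ gf_mul(gy, pxy[j])) for j in range(len(p)))
    dy = bytes(pxy[j] ^ dx[j] for j in range(len(p)))
    return dx, dy

stripe = [b"disk0 data", b"disk1 data", b"disk2 data", b"disk3 data"]
p, q = parity(stripe)
damaged = list(stripe)
damaged[1] = None   # the disk that died outright
damaged[3] = None   # a bad sector found on another disk during the rebuild
dx, dy = recover_two(damaged, p, q, 1, 3)
print(dx, dy)
```

With only `p` available, the two missing blocks collapse into the single value `Dx ^ Dy`, which cannot be split back apart; that is the RAID 5 data-loss case described above.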
        
        Re: (Score:3)
        
        by sirsnork ( 530512 ) writes:
        
        This is exactly why most RAID cards do patrol reads during low activity.
        Of course, that assumes you use a real RAID card rather than software RAID. I'm not aware of any software raid implementation that does patrol reads
      - Re: (Score:3)
        
        by David_Hart ( 1184661 ) writes:
        
        Wait... If you are running RAID-5 without a hot spare or two, you are just doing it wrong....
      - Re:Umm (Score:4, Insightful)
        
        by NeverVotedBush ( 1041088 ) writes: on Tuesday October 16, 2012 @02:15PM (#41671885)
        
        When a drive fails and a RAID goes into reconstruction (if you are set up that way), that's when you are significantly more likely to have another drive fail due to all the extra activity across the RAID.
        
        We see it all the time on a big array. One must hustle to repair/rebuild the RAID... ;-)
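
For a rough feel of why the length of the rebuild window matters, here is a back-of-the-envelope Python sketch. Everything in it is an assumption for illustration: drives are treated as failing independently at a constant rate, the 5% annual failure rate is invented, and `stress_factor` is a hypothetical knob standing in for the extra rebuild activity mentioned above.

```python
def p_second_failure(surviving_drives, rebuild_days, annual_failure_rate,
                     stress_factor=1.0):
    """Chance that at least one surviving drive fails before the rebuild ends.

    Assumes independent drives with a constant failure rate; stress_factor
    multiplies that rate for the duration of the rebuild.
    """
    daily_survival = (1.0 - annual_failure_rate) ** (1.0 / 365.0)
    per_drive_survival = daily_survival ** (rebuild_days * stress_factor)
    return 1.0 - per_drive_survival ** surviving_drives

# 16-drive array with one drive already dead; all numbers are illustrative.
p_slow = p_second_failure(15, rebuild_days=3.0, annual_failure_rate=0.05)
p_fast = p_second_failure(15, rebuild_days=0.25, annual_failure_rate=0.05)
print(f"3-day rebuild: {p_slow:.4f}   6-hour rebuild: {p_fast:.4f}")
```

The independence assumption is the optimistic part: same-batch drives or a shared power problem, as reported elsewhere in this thread, would make a second failure more likely than this model suggests.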
        
    - Re:Umm (Score:5, Insightful)
      
      by ByOhTek ( 1181381 ) writes: on Tuesday October 16, 2012 @12:42PM (#41670459) Journal
      
      In general, if you get such an issue, it will happen early on in the life of the drives (one coworker had what he called the 30-day thrash rule - he would plan ahead and get a huge number of drives - the cheapest available meeting requirements, including avoiding manufacturers we had issues with previously, take a handful, and thrash 'em for 30 days. If nothing bad happened, he'd either keep up 30 day thrashes on sets of hard drives, pulling out the duds, or just return the whole lot).
      
      - Re:Umm (Score:5, Interesting)
        
        by Anonymous Coward writes: on Tuesday October 16, 2012 @01:01PM (#41670779)
        
        Google published a study they did of their own consumer grade drives, and found the same thing. If the drive survives the first month of load, it will likely go on to work for years, but if it throws even just SMART errors in the first 30 days, it is likely to be dodgy.
        
        Re:Umm (Score:5, Informative)
        
        by Bob the Super Hamste ( 1152367 ) writes: on Tuesday October 16, 2012 @01:37PM (#41671247) Homepage
        
        For those who are interested the white paper is titled "Failure Trends in a Large Disk Drive Population" and can be found here [googleusercontent.com]. It is a fairly short read (13 total pages) and quite interesting if you are into monitoring stuff.
        
        Re:Umm (Score:4, Informative)
        
        by Bob the Super Hamste ( 1152367 ) writes: on Tuesday October 16, 2012 @02:53PM (#41672395) Homepage
        
        Mostly the methodology, as well as it disproving some of the standard thought (that heat or activity kills drives). They were looking for some leading indicator for all drive failures (was some error reported before a given drive crapped out), which is what they didn't find, as a large portion of the drives just crapped out without warning. But any drives that did start to report warnings were very likely to crap out shortly (I think their threshold was 60 days), which does help to prevent downtime. Interestingly, I had to look into disk monitoring at my job and ran across that paper, implemented some automated S.M.A.R.T. monitoring, and one of the disks in a box had tossed some errors. People complained because my code was alarming on this issue, so they thought my code was bad. A couple of days later the drive gave up the ghost and I was vindicated.
        
      - Bathtub Curve (Score:5, Informative)
        
        by Onymous Coward ( 97719 ) writes: on Tuesday October 16, 2012 @01:05PM (#41670833) Homepage
        
        The bathtub curve [wikimedia.org] is widely used in reliability engineering. It describes a particular form of the hazard function which comprises three parts:
        The first part is a decreasing failure rate, known as early failures.
        The second part is a constant failure rate, known as random failures.
        The third part is an increasing failure rate, known as wear-out failures.
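
As an illustration of the three parts listed above, here is a small Python sketch that builds a bathtub-shaped hazard by adding a decreasing Weibull hazard, a constant, and an increasing Weibull hazard. The Weibull form is one common modelling choice, not something the quoted text specifies, and every parameter value below is made up purely to produce the shape.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k/s) * (t/s)**(k-1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    """Sum of three hazards; all parameters are invented for illustration."""
    early = weibull_hazard(t, shape=0.5, scale=2000.0)      # decreasing (k < 1)
    random_failures = 1.0 / 50000.0                         # constant
    wear_out = weibull_hazard(t, shape=5.0, scale=40000.0)  # increasing (k > 1)
    return early + random_failures + wear_out

for hours in (10, 100, 1000, 10000, 30000, 40000):
    print(f"{hours:>6} h  hazard = {bathtub_hazard(hours):.2e} per hour")
```

The printed rate falls over the early hours, flattens out, and climbs again toward the end of life, which is the shape the 30-day thrash rule earlier in the thread is trying to exploit.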
        
    - Re:Umm (Score:5, Interesting)
      
      by MightyMartian ( 840721 ) writes: on Tuesday October 16, 2012 @12:42PM (#41670465) Journal
      
      Too true. Years ago we bought batches of Seagate Atlas drives, and all of them pretty much started dying within weeks of each other. They were still under warranty, so we got a bunch more of the same drives, and lo and behold within nine months they were crapping out again. It absolutely amazed me how closely together the drives crapped out.
      
      - Re: (Score:3, Interesting)
        
        by Anonymous Coward writes:
        
        Warranty replacement drives are refurbished, meaning they've already failed once. I've never had a refurb drive last a full year without failing. It's gotten bad enough that I don't bother sending them back for warranty replacement anymore.
        
        Re: (Score:3)
        
        by Lumpy ( 12016 ) writes:
        
        I do. Get the refurb back and sell it on ebay for 50% of the going price. you at least get some money back.
      - Re: (Score:3)
        
        by Electricity Likes Me ( 1098643 ) writes:
        
        Western Digital's warranty is still 3 years, although their drives straight up lie about reallocated sector counts in SMART (whereas Seagate does not). This makes failure planning hard, since you can't see if a drive is throwing bad sectors until you run out of replacements and get an uncorrectable error (i.e. data construction).
        Most of my WDs are in a RAIDZ3 though, so it's not so much of a problem.
    - Re: (Score:3)
      
      by Spazmania ( 174582 ) writes:
      
      I lost a server once where the drive batch had a 60% failure rate after 6 months. Unless you're intentionally building the raid for performance (vice reliability), you definitely want to pull drives from as many different manufacturers and batches as you can.
    - Re: (Score:3)
      
      by infodragon ( 38608 ) writes:
      
      Never paranoid enough when dealing with data! I had a RAID 5 (5 disks) of Seagate 80GB SATA disks; 4 failed within an 8 hour window, the 5th failed within 24 hours of the first; this was 3 months after purchase. It was a HUGE PITA. First drive failed and I started an immediate DB dump to an NFS mount. 20GB and 2 hours later the second disk failed and RAID was dead. I ran the other three disks just to see what would happen...
      I will NEVER, EVER run two storage media (spinning platter, SSD, ...) from th
      - Re:Umm (Score:4, Interesting)
        
        by infodragon ( 38608 ) writes: on Tuesday October 16, 2012 @02:04PM (#41671685)
        
        [Sarcasm]Nothing like 20/20 hindsight... If I had done anything like trying to rebuild the array it would have fallen apart... Oh wait... If I had followed what you suggested I would have been SCREWED.[/Sarcasm]
        I made a decision based on the information on hand. The rebuild would have taken more than a few hours; the 80GB disks were SLOW, i.e. first gen SATA. By executing the DB dump I was hitting less than 1/2 the disk capacity on read rather than 100% disk capacity on a write. It would be significantly faster to retrieve the data than to rebuild. That time window was critical, 2 hours of read vs 4+ hours of write. I also knew I had all the data on hand and all the scripts tested monthly for rebuilding the entire DB on a different server. The decision was easy! Grab the DB data now, redeploy on another system and address the issue on the spot. The system ended up being down 3 hours rather than 24+.
        Secondly, the failure was abrupt with no SMART messages, so I couldn't trust the others to not have the same non-reporting issues. I made a choice on the spot on how to proceed knowing full well I may have signed my own 24h torture warrant. Fortunately I didn't have the worst case happen and I learned a critical lesson.
        A bit more information...
        +- 30 minutes on each one
        First disk failed...
        2 hours later second disk failed...
        2 hours later third disk failed.
        2 hours later 4th disk failed
        16 hours later 5th disk failed.
        
        Re: (Score:3)
        
        by Binestar ( 28861 ) writes:
        
        Those drives were hit with some sort of power issue. Even same batch of drives it's way too close together for manufacturing flaw. Congrats on getting the data off quickly though.
- Re: (Score:2)
  
  by hawguy ( 1600213 ) writes:
  
  It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too. Same would apply to SSDs.
  I've heard the same, but judging from the serial numbers on disks in our major-vendor storage array, they seem to use same-lot disks, here's a few serial numbers from one disk shelf (partially obscured):
  xx-xxxxxxxx4406
  xx-xxxxxxxx4409
  xx-xxxxxxxx4419
  xx-xxxxxxxx4435
  xx-xxxxxxxx4448
  xx-xxxxxxxx4460
  xx-xxxxxxxx4468
  They look close enough to be from the same manufacturing lot. Unless the disk manufacturer randomizes disks before applying serial numbers when selling to this storage vendor. We do lose disks occasionall
- Re: (Score:3)
  
  by kasperd ( 592156 ) writes:
  
  It was my understanding that for traditional drives in a RAID you don't want to get all the same type of drive all made around the same time since they will fail around the same time too.
  It's a claim I have seen often, but I have never seen evidence to support it. A much more likely scenario, which can easily be misdiagnosed as simultaneous drive failures is the following. One disk gets a bad sector, which at first goes unnoticed. A second disk dies. During reconstruction the first bad sector is noticed and
- - Re:Umm (Score:5, Insightful)
    
    by Anonymous Coward writes: on Tuesday October 16, 2012 @12:31PM (#41670293)
    
    yeah, sounds like submitter may be mildly deficient
    
    Which is why he's asking.
    Fuck people who ask questions when they don't know something, right?
    
They shrink (Score:2, Informative)

by Anonymous Coward writes:

The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.
- Re:They shrink (Score:5, Informative)
  
  by tgd ( 2822 ) writes: on Tuesday October 16, 2012 @12:32PM (#41670307)
  
  The drives will shrink down to nothing. I believe that the drive controller considers a sector dead after 100,000 writes.
  Filesystems, generally speaking, aren't resilient to the underlying disk geometry changing after they've been laid down. There's reserved space to replace bad cells as they start to die, but the disk won't shrink. Eventually, though, you get parts of the disk dying in an unrecoverable way and the drive is toast.
  
  - Re:They shrink (Score:5, Informative)
    
    by v1 ( 525388 ) writes: on Tuesday October 16, 2012 @12:55PM (#41670669) Homepage Journal
    
    The sectors you are talking about are often referred to as "remaps" (or "spares"), which is also used to describe the number of blocks that have been remapped. Strategies vary, but an off-the-cuff average would be around one available spare per 1000 allocatable blocks. Some firmware will only use a spare from the same track, other firmware will pull the next nearest available spare. (allowing an entire track to go south)
    The more blocks they reserve for spares, the lower the total capacity count they can list, so they don't tend to be too generous. Besides, if your drive is burning through its spares at any substantial rate, doubling the number of spares on the drive won't actually end up buying you much time, and certainly won't save any data.
    But with the hundreds of failing disks I've dealt with, when more than ~5 blocks have gone bad, the drive is heading out the door fast. Remaps only hide the problem at that point. If your drive has a single block fail when trying to write, it will be remapped silently and you won't ever see the problem unless you check the remap counter in smart. If it gets an unreadable block on a read operation, you will probably see an io error however. Some drives will immediately remap it, but most don't and will conduct the remap when you next try to write to that cell. (otherwise they'd have to return fictitious data, like all zeros)
    So I don't particularly like automatic silent remaps. I'd rather know when the drive first looks at me funny so I can make sure my backups are current and get a replacement on order, and swap it out before it can even think about getting worse. I prefer to replace a drive on MY terms, on MY schedule, not when it croaks and triggers any grade of crisis. There are legitimate excuses for downtime, but a slowly failing drive shouldn't be one of them.
    All that said, on multiple occasions I've tried to cleanse a drive of IO errors by doing a full zero-it format. All decent OBCCs on drives should verify all writes, so in theory this should purge the drive of all IO errors, provided all available spares have not already been used. The last time I did this on a 1TB Hitachi that had ONE bad block on it, it still had one bad block (via read verify) when the format was done. The write operation did not trigger a remap, (and I presume it wasn't verified, as the format didn't fail) and I don't understand that. If it were out of remaps, the odds of it being ONE short of what it needed is essentially zero. So I wonder in reality just how many drive manufacturers aren't even bothering with remapping bad blocks. All I can attribute this to is crappy product / firmware design.
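
The remap behaviour described in this comment (a failed write is remapped silently, a failed read surfaces as an I/O error and the remap waits for the next write) can be sketched as a toy model in Python. This is not how any real firmware is written; `ToyDrive`, `DriveIOError` and the one-spare-per-1000-blocks figure are illustrative, the last taken from the off-the-cuff average above.

```python
class DriveIOError(Exception):
    """Raised where a real drive would return an I/O error to the host."""

class ToyDrive:
    def __init__(self, blocks, spares):
        self.data = {}               # physical block -> payload
        self.remap = {}              # logical block -> spare physical block
        self.free_spares = list(range(blocks, blocks + spares))
        self.bad_physical = set()    # physical blocks that have gone bad
        self.remap_count = 0         # what a SMART remap counter would report

    def _physical(self, lba):
        return self.remap.get(lba, lba)

    def write(self, lba, payload):
        if self._physical(lba) in self.bad_physical:
            if not self.free_spares:
                raise DriveIOError(f"block {lba}: out of spares")
            # Silent remap: the host sees a successful write.
            self.remap[lba] = self.free_spares.pop(0)
            self.remap_count += 1
        self.data[self._physical(lba)] = payload

    def read(self, lba):
        physical = self._physical(lba)
        if physical in self.bad_physical:
            # No remap here: there is no good data to put in the spare.
            raise DriveIOError(f"block {lba}: unreadable")
        return self.data.get(physical, b"\x00")

drive = ToyDrive(blocks=1000, spares=1)
drive.write(7, b"old")
drive.bad_physical.add(7)            # simulate the block going bad
try:
    drive.read(7)
except DriveIOError as err:
    print("read failed:", err)
drive.write(7, b"new")               # triggers the deferred remap
print(drive.read(7), "remaps so far:", drive.remap_count)
```

In this model a full zero-fill would remap every bad block until the spares ran out, which is the outcome the Hitachi drive in the comment above did not deliver.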
    
    - - Re: (Score:3, Informative)
        
        by Bob the Super Hamste ( 1152367 ) writes:
        
        From my understanding this is exactly the type of thing that S.M.A.R.T is going to detect along with a number of other issues. If you are interested I suggest checking out the paper from Google entitled "Failure Trends in a Large Disk Drive Population" as they made extensive use of S.M.A.R.T and tracked an extremely large number of drives for a number of years for the analysis. [slashdot.org]
      - Re:They shrink (Score:4, Informative)
        
        by v1 ( 525388 ) writes: on Tuesday October 16, 2012 @02:09PM (#41671785) Homepage Journal
        
        SMART is implemented in different ways by different manufacturers. The idea is that the host can ask the peripheral "what value does slot xx contain?" This can refer to an instantaneous condition, such as the temperature of the hard drive, a static value such as how many spares are currently available, a semidynamic value such as is this hard drive failing, and a dynamic value such as how many remap operations have occurred. There's a short list of "basic/standard" values, and then there's the "extended/optional" metrics that not all devices need to support. Each smart slot will also specify the min and max values. If any smart slot has a value outside its allowed range, overall smart status will report as failing. Once a drive toggles over to failing, there's no going back, unless you figure out a way to reset the counters.
        One of the standard set is the "is the hard drive failing" metric. It allows the host to get a simple yes/no answer to summarize whether any of the metrics have gone beyond their tolerated values. For example, one drive I worked with recently was allowed to overtemp twice. If it had experienced a third overtemp during its lifetime, the drive would then permanently fail the overall test. This allows the host to "check smart status" without really having to think much about what it's doing. This is the basic test that most modern OS's check to see if a hard drive needs to be replaced. You usually need to run a special tool to check individual values being returned by smart. These tools need to have a list of what each slot means, and often will report fairly meaningless information near the end of the list, where they don't know what this 23 means in slot 85 etc.
        Other known values may slowly increment over the lifetime of the drive, such as "head re-calibrations", "remaps", SMS head parks, max g forces experienced, etc. You'd have to compare their current values with their claimed limits to see how close each of these metrics is to causing overall smart to toggle to failed. Without knowing what the metric is, or what its expected limit is, the numbers aren't useful.
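The slot/limit scheme described above can be sketched in a few lines of Python. This is only an illustration of the logic as the comment describes it, not how any particular drive or tool implements SMART; the slot numbers, names, and limits are made up, and the `already_failed` flag stands in for the "once failing, no going back" behaviour.

```python
from dataclasses import dataclass


@dataclass
class SmartSlot:
    slot_id: int   # hypothetical slot number
    name: str      # hypothetical label; real tools need a vendor table for this
    value: int     # current value reported by the drive
    min_ok: int    # lowest tolerated value
    max_ok: int    # highest tolerated value

    def in_range(self) -> bool:
        return self.min_ok <= self.value <= self.max_ok

    def headroom(self) -> int:
        """Distance from the nearest limit (negative once outside the range)."""
        return min(self.value - self.min_ok, self.max_ok - self.value)


def overall_status(slots, already_failed=False):
    """Return True if the drive should report 'failing'.

    The result is sticky: a drive that has failed before keeps failing
    even if every slot is back inside its range.
    """
    return already_failed or any(not s.in_range() for s in slots)


if __name__ == "__main__":
    slots = [
        SmartSlot(1, "overtemp_events", 2, 0, 2),   # allowed to overtemp twice
        SmartSlot(2, "remaps", 40, 0, 100),
    ]
    print(overall_status(slots))   # both slots are still inside their limits
    slots[0].value = 3             # third overtemp
    print(overall_status(slots))
```

The `headroom` helper is the "how close is each metric to tripping the overall flag" comparison mentioned in the last paragraph; without the name and limits for a slot it has nothing meaningful to compute.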
        
- Re: (Score:3)
  
  by klui ( 457783 ) writes:
  
  Newer disks' cells aren't rated for more than approximately 5000 writes due to process shrink. You're basically hoping the manufacturer's write leveling firmware is enough to compensate.
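For a feel of what a ~5000-cycle rating means in practice, here is a back-of-the-envelope estimate in Python. It assumes perfect wear leveling and a constant write amplification factor, both of which are simplifications, and the example numbers (64 GB drive, 20 GB of writes per day, amplification of 2) are invented for illustration.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_writes_gb_per_day,
                             write_amplification=1.0):
    """Rough endurance estimate assuming every cell wears evenly.

    capacity_gb            -- usable flash capacity
    pe_cycles              -- rated program/erase cycles per cell
    host_writes_gb_per_day -- data the host writes per day
    write_amplification    -- flash writes per host write (assumed constant)
    """
    if min(capacity_gb, pe_cycles, host_writes_gb_per_day, write_amplification) <= 0:
        raise ValueError("all inputs must be positive")
    total_flash_writes_gb = capacity_gb * pe_cycles
    host_write_budget_gb = total_flash_writes_gb / write_amplification
    return host_write_budget_gb / host_writes_gb_per_day / 365.0


if __name__ == "__main__":
    years = estimated_lifetime_years(64, 5000, 20, write_amplification=2.0)
    print(f"{years:.1f} years")
```

The estimate is only as good as the leveling firmware: if writes concentrate on a subset of blocks, those blocks hit the limit long before this figure suggests.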
How do SSD's die (Score:5, Funny)

by AwesomeMcgee ( 2437070 ) writes: on Tuesday October 16, 2012 @12:26PM (#41670201)

Screaming in agony, hissing bits and bleeding jumperless in the night

Firmware bugs (Score:2)

by Anonymous Coward writes:

Didn't happen to me, but a number of people with the same Intel SSD reported that they booted up and the SSD claimed to be 8MB and required a secure wipe before it could be reused. Supposedly it's fixed in the new firmware, but I'm still crossing my fingers every time I reboot that machine.
- Re: (Score:2)
  
  by greg1104 ( 461138 ) writes:
  
  That's the Intel 320 series drives. They didn't release a version of those drives claimed suitable for commercial work until the "8MB bug" was sorted out, as the much more expensive 710 series.
- Firmware bugs killed my OCZ Vertex 2 (Score:3)
  
  by ThreeDayMonk ( 673466 ) writes:
  
  I always expected the cells to go first. I was careful to avoid unnecessary writes. In the end, though, it was a known bug that killed the drive. Well, I didn't know about it, of course, until it was too late. If I'd known, I'd have updated the drive firmware to one that didn't have a catastrophic bug.
  I replaced it with a Samsung. The RMA'd replacement OCZ is still sitting in its packet on my desk.
Flash SSD has Write Limitations so... (Score:2, Informative)

by Anonymous Coward writes:

From what I understand, SSDs die because of "write-burnout" if they are flash based, and from what I understand the majority of SSDs are flash based now. So while I haven't actually had a drive fail on me, I assume that I would be able to still read data off a failing drive and restore it, making it an ideal failure path. I did a google search and found a good article on the issue: http://www.makeuseof.com/tag/data-recovered-failed-ssd/
- Re: (Score:3)
  
  by Auroch ( 1403671 ) writes:
  
  From what I understand, SSDs die because of "write-burnout" if they are flash based, and from what I understand the majority of SSDs are flash based now. So while I haven't actually had a drive fail on me, I assume that I would be able to still read data off a failing drive and restore it, making it an ideal failure path. I did a google search and found a good article on the issue: http://www.makeuseof.com/tag/data-recovered-failed-ssd/ [makeuseof.com]
  Which is why you can do the same from a failed usb flash drive?
  
  It's a nice theory, but it's highly dependent on the controller.
- Re:Flash SSD has Write Limitations so... (Score:5, Interesting)
  
  by SydShamino ( 547793 ) writes: on Tuesday October 16, 2012 @12:58PM (#41670727)
  
  For flash memory it is the erase cycles, not the write cycles, that drive life.
  http://en.wikipedia.org/wiki/Flash_memory [wikipedia.org]
  The quantum tunneling effect described for the erase process can weaken the insulation around the isolated gate, eventually preventing that gate from holding its charge. That's the typical end-of-life scenario for a bit of flash memory.
  You generally don't say that writes are end-of-life because you could, in theory, write the same pattern to the same byte over and over again (without erasing it) and not cause reduction in part life. Or, since bits erase high and write low, you could write the same byte location eight times, deasserting one new bit each time, then erase the whole thing once, and the total would still only be "one" cycle.
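The "bits erase high and write low" point can be made concrete with a toy model. The Python class below is a deliberately simplified sketch: real flash erases whole blocks rather than single bytes, and many NAND parts restrict how often a page may be partially programmed. The class name and its fields are invented for illustration.

```python
class FlashByte:
    """Toy model of one byte of flash: erase sets bits, programming clears them."""

    def __init__(self):
        self.value = 0xFF       # erased state: all bits high
        self.erase_cycles = 0   # the counter that actually ages the cell

    def program(self, data: int) -> int:
        """Programming can only pull bits low, so the result is a bitwise AND."""
        self.value &= data & 0xFF
        return self.value

    def erase(self) -> None:
        """Erasing restores all bits to 1 and costs one cycle of life."""
        self.value = 0xFF
        self.erase_cycles += 1


if __name__ == "__main__":
    cell = FlashByte()
    # Eight separate writes, each clearing one more bit, with no erase in between.
    for bit in range(8):
        cell.program(0xFF & ~(1 << bit))
    print(hex(cell.value), cell.erase_cycles)
    cell.erase()
    print(hex(cell.value), cell.erase_cycles)
```

In this model the eight programming steps leave the erase counter untouched, and only the final erase counts as a cycle, which is the accounting the comment describes.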
  
wear leveling (Score:2, Informative)

by Anonymous Coward writes:

SSDs use wear leveling algorithms to optimize each memory cell's lifespan; meaning that it keeps track of how many times each cell was written and it ensures that all cells are being utilized evenly. When the cells fail, they're being kept track of and the drive does not attempt to write to that cell any longer. When enough cells have failed the capacity of the drive will shrink noticeably. At that point it is probably wise to replace it. For a RAID configuration the wear level algorithm would presumabl
if they die at the same time repeatably.. (Score:2)

by gl4ss ( 559668 ) writes:

by performing same set of actions, in unreasonable time, then with 99.999%(the more drives, add 9's) probability it's a bug in the firmware/controller. afaik there shouldn't be such drives on market anymore..
otherwise the nands shouldn't die at the same time. shitty nands I suppose will die faster (a bad batch is shitty).
some drive controllers have counters about the nand use - but they shouldn't all blow up when it hits 0, at which point you're recommended to replace them.
I haven't had one die, though I do
They usually die gracefully... (Score:5, Informative)

by dublin ( 31215 ) writes: on Tuesday October 16, 2012 @12:31PM (#41670275) Homepage

In general, if the SSD in question has a well-designed controller (Intel, SandForce), then write performance will begin to drop off as bad blocks start to accumulate on the drive. Eventually, wear levelling and write cycles have taken their toll, and the disk can no longer write at all. At this point, the controller does all it can: it effectively becomes a read-only disk. It should operate in this mode until something else catastrophic (tin migration, capacitor failure, etc.) keeps the entire drive from working.
BTW - I haven't seen this either, but that's the degradation profile that's been presented to me in several presentations by the folks making SSD drives and controllers. (Intel had a great one a few years back - don't have a link to it handy, though...)

- X-25M Death: Firmware bug too? (Score:5, Interesting)
  
  by Anonymous Coward writes: on Tuesday October 16, 2012 @12:50PM (#41670595)
  
  I had an 80G Intel X-25M fail in an interesting manner. Windows machine, formatted NTFS, Cygwin environment. Drive had been in use for about a year, "wear indicator" still read 100% fine. Only thing wrong with it is that it had been mostly (70 out of 80G full) filled, but wear leveling should have mitigated that. It had barely a terabyte written to it over its short life.
  Total time from system operational to BSOD was about ten minutes. I first noticed difficulties when I invoked a script that called a second script, and the second script was missing. "ls -l" on the missing script confirmed that the other script wasn't present. While scratching my head about $PATH settings and knowing damn well I hadn't changed anything, a few minutes later, I discovered I also couldn't find /bin/ls.exe. In a DOS prompt that was already open, I could DIR C:\cygwin\bin - the directory was present, ls.exe was present, but it wasn't anything that the OS was capable of executing. Sensing imminent data loss, and panic mounting, I did an XCOPY /S /E... etc to salvage what I could from the failing SSD.
  Of the files I recovered by copying them from the then-mortally-wounded system, I was able to diff them against a valid backup. Most of the recovered files were OK, but several had 65536-byte blocks consisting of nothing but zeroes.
  Around this point, the system (unsurprisingly, as executables and swap and heaven knows what else were being riddled with 64K blocks of zeroes) crashed. On reboot, Windows attempted (and predictably failed) to recover (asinine that Windows tries to write to itself on boot, but also asinine of me to not power the thing down and yank the drive, LOL.) The system did recognize it as an 80G drive and attempted to boot itself - Windows logo, recovery console, and all.
  On an attempt to mount the drive from another boot disk, the drive still appeared as an 80G drive once; unfortunately, it couldn't remain mounted long enough for me to attempt further file recovery or forensics.
  A second attempt - and all subsequent attempts - to mount the drive showed it as an 8MB (yes, eight megabytes) drive.
  I'll bet most of the data's still there. (The early X-25Ms didn't use encryption). What's interesting is that the newer drives have a similar failure mode [intel.com] that's widely recognized as a firmware bug. If there were a way to talk to the drive over its embedded debugging port (like the Seagate Barracuda fix from a few years ago), I'll bet I could recover most of the data.
  (I don't actually need the data, as I got it all back from backups, but it's an interesting data recovery project for a rainy day. I'll probably just desolder the chips and read the raw data off 'em. Won't work for encrypted drives, but it might work for this one.)
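The symptom described above, recovered files containing 65536-byte runs of nothing but zeroes, is easy to screen for. The Python sketch below scans a file for aligned all-zero blocks; the function name and the 64K default are just chosen to match the anecdote. Legitimate files can contain zero-filled regions too, so a hit is only a hint, and a diff against a known-good backup (as the poster did) remains the real check.

```python
def find_zero_blocks(path, block_size=65536):
    """Return the byte offsets of aligned, full-size blocks that are all zeroes."""
    zero_block = bytes(block_size)   # block_size zero bytes to compare against
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            # Only full blocks count; a short zero tail at EOF is ignored.
            if block == zero_block:
                offsets.append(offset)
            offset += len(block)
    return offsets


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        hits = find_zero_blocks(name)
        if hits:
            print(f"{name}: {len(hits)} all-zero 64K block(s), first at offset {hits[0]}")
```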
  
- Re:They usually die gracefully... (Score:5, Interesting)
  
  by AmiMoJo ( 196126 ) writes: on Tuesday October 16, 2012 @12:51PM (#41670625) Homepage Journal
  
  I had an Intel SSD run out of spare capacity and it was not fun. Windows kept forgetting parts of my profile and resetting things to default or reverting back to backup copies. The drive didn't report a SMART failure either, even with Intel's own SSD monitoring tool. I had to run a full SMART "surface scan" before it figured it out.
  That sums up the problem. The controller doesn't start reporting failures early enough and the OS just tries to deal with it as best as possible, leaving the user to figure out what is happening.
  
- Re: (Score:3)
  
  by justthinkit ( 954982 ) writes:
  
  I am curious if Event Viewer data is helpful as the SSD starts to fail.
Re: (Score:2)

by account_deleted ( 4530225 ) writes:

Comment removed based on user account deletion
- - Re: (Score:3)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
  - Re: (Score:2)
    
    by neokushan ( 932374 ) writes:
    
    Nobody on Slashdot will ever have to worry about those.
They die without warning and without recourse (Score:4, Informative)

by PeeAitchPee ( 712652 ) writes: on Tuesday October 16, 2012 @12:31PM (#41670287)

With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely. In my experience, though SSDs don't fail as often, when they do, it's sudden and catastrophic. Having said that, I've only seen one fail out of the ~10 we've deployed here (and it was in a laptop versus traditional desktop / workstation). So BACK IT UP. Just my $0.02.

- Re:They die without warning and without recourse (Score:5, Informative)
  
  by PRMan ( 959735 ) writes: on Tuesday October 16, 2012 @12:49PM (#41670583)
  
  I have had two SSD crashes. One was on a very cheap Zelman 32GB drive which never really worked (OK, about twice). The other was on a Kingston 64GB that I have in my server. When it gets really hot in the room (over 100, so probably over 120 for the drive itself in the case), it will crash. But when it cools down, it works perfectly well.
  
- Re:They die without warning and without recourse (Score:5, Interesting)
  
  by cellocgw ( 617879 ) writes: <cellocgw@gmail . c om> on Tuesday October 16, 2012 @12:54PM (#41670653) Journal
  
  With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely.
  OK, so I'm sure some enterprising /.-er can write a script that watches the SSD controller and issues some clicks to the sound card when cells are marked as failed.
  
- Re:They die without warning and without recourse (Score:5, Informative)
  
  by dougmc ( 70836 ) writes: <dougmc+slashdot@frenzied.us> on Tuesday October 16, 2012 @12:54PM (#41670657) Homepage
  
  With traditional mechanical drives, you usually get a clicking noise accompanied by a time period where you can offload data from the drive before it fails completely.
  Usually? No.
  This does happen sometimes, but it certainly doesn't happen "usually". There are enough different failure mechanisms for hard drives that there isn't any one "usual" method --
  1- drive starts reporting read and/or write errors occasionally, but otherwise seems to keep working
  2- drive just suddenly stops working completely all at once
  3- drive starts making noise (and performance usually drops massively), but the drive still works.
  4- drive seems to keep working, but smart data starts reporting all sorts of problems.
  Personally, I've had #1 happen more often than anything else, usually with a healthy serving of #4 at about the same time or shortly before. #2 is the next most common failure mode, at least in my experience.
  
Bang! (Score:5, Informative)

by greg1104 ( 461138 ) writes: <gsmith@gregsmith.com> on Tuesday October 16, 2012 @12:33PM (#41670325) Homepage

All three of the commercial grade SSD failures I've cleaned up after (I do PostgreSQL data recovery) just died. No warning, no degrading in SMART attributes; works one minute, slag heap the next. Presumably some sort of controller level failure. My standard recommendation here is to consider them no more or less reliable than traditional disks and always put them in RAID-1 pairs. Two of the drives were Intel X25 models, the other was some terrible OCZ thing.
Out of more current drives, I was early to recommend Intel's 320 series as a cheap consumer solution reliable for database use. The majority of those I heard about failing died due to firmware bugs, typically destroying things during the rare (and therefore not well tested) unclean shutdown / recovery cases. The "Enterprise" drive built on the same platform after they tortured consumers with those bugs for a while is their 710 series, and I haven't seen one of those fail yet. That's not across a very large installation base nor for very long yet though.

- Re:Bang! (Score:5, Funny)
  
  by ColdWetDog ( 752185 ) writes: on Tuesday October 16, 2012 @12:48PM (#41670567) Homepage
  
  Does anyone else find this sort of thing upsetting? I grew up during that period of time when tech failed dramatically on TV and in movies. Sparks, flames, explosions - crew running around randomly spraying everything with fire extinguishers. Klaxons going off. Orders given and received. Damage control reports.
  None of this 'oh snap, the hard drive died'.
  Personally, I think the HD (and motherboard) manufacturers ought to climb back on the horse. Make failure modes exciting again. Give us a run for the money. It can't be hard - there still must be plenty of bad electrolytic capacitors out there.
  How about a little love?
  
Data corruption, then fails e2fsck upon boot (Score:4, Informative)

by vlm ( 69642 ) writes: on Tuesday October 16, 2012 @12:36PM (#41670357)

My experience was system crash due to corruption of loaded executables, then at the hard reboot it fails the e2fsck because the "drive" is basically unwritable so the e2fsck can't complete.
It takes a long time to kill a modern SSD... this failure was from back when a CF plugged into a PATA-to-CF adapter was exotic even by /. standards

Dunno about how, but I do know WHEN (Score:2)

by davidwr ( 791652 ) writes:

Like spinning drives, silicon drives always die when it will do the most damage [wikipedia.org].
Like right before you find out all your backups are bad.
I have seen SSD death (Score:5, Informative)

by MRGB ( 2743757 ) writes: on Tuesday October 16, 2012 @12:38PM (#41670389)

I have seen SSD death many times and it is a strange sight indeed. What is interesting about it when compared to normal drives is that when normal drives fail it is - mostly - an all-or-nothing ordeal. A bad spot on a drive is a bad spot on a drive. With SSDs you can have a bad spot in one place, reboot, and get a bad spot in another place. Windows loaded on an SSD will exhibit all kinds of bizarre behaviour. Sometimes it will hang, sometimes it will blue-screen, sometimes it will boot normally until it tries to read or write to that random bad spot. Rebooting is like rolling the dice to see what it will do next - that is, until it fails completely.

1 failed SSD experienced... (Score:2)

by StoneyMahoney ( 1488261 ) writes:

Only seen a single SSD fail. It was a Mini-PCIex unit in a Dell Mini 9. I suspect the actual failure may have been atypical, as it seems it failed in just the right place to render the filesystem unwritable, although you could read from fairly hefty sections of it. It was immediate and irreparable, although I suspect SSD manufacturers use better quality than that built-to-a-price (possibly counterfeit) POS.
Had one die twice (Score:2)

by bstrobl ( 1805978 ) writes:

Had an aftermarket SSD for a macbook air fail twice in 2 years (threw it out and placed an original hdd after that). Both times the system decided not to boot and could not find the SSD.

In both cases I have suspected that the Indilinx controller gave way. This seems mirrored in quite a few cases with the experience of others who had drives with these chips in them.

In an ideal scenario the controller should be able to handle the eventual wearout of the disk by finding other memory cells to write to. An
I had one fail (Score:3)

by kelemvor4 ( 1980226 ) writes: on Tuesday October 16, 2012 @12:41PM (#41670445)

I had a FusionIO IODrive fail a few weeks ago. It was running a data array on a Windows 2008 R2 server. It manifested itself by giving errors in the Windows event log and causing long boot times (even though it was not a boot device). The device was still accessible, but slower than normal. I think the answer to your question will probably vary greatly both by manufacturer and also based on what part of the device failed. The SSDs I've used generally come with a fairly large amount of "backup" memory on them so that if a cell begins to fail, the card marks the cell bad and uses one from one of the backup chips. Much like how hard drives deal with bad sectors. As I understand it, the SSD is somehow able to detect the failure before data is lost and begin using the backup chips transparently and automatically, vs having to do a scandisk or similar to do the same on a physical disk. That may very well vary by manufacturer as well.

Peacefully, with their loved ones at their bedside (Score:3)

by Revotron ( 1115029 ) writes: on Tuesday October 16, 2012 @12:41PM (#41670451)

as the disk controller reads them their last rites before they integrate with the great RAID array in the sky.

Oblig: T. S. Eliot (Score:5, Funny)

by stevegee58 ( 1179505 ) writes: on Tuesday October 16, 2012 @12:43PM (#41670489) Journal

Not with a bang but a whimper.

Yes they do fail (Score:3)

by AnalogDiehard ( 199128 ) writes: on Tuesday October 16, 2012 @12:45PM (#41670519)

We use SSDs in a few Windows machines at work. Running 24/7/365 production. We were replacing them every couple of years.

My SSD is bad! (Score:3)

by dittbub ( 2425592 ) writes: on Tuesday October 16, 2012 @12:46PM (#41670541)

I have a G.Skill Falcon 64GB SSD that is failing on me. Windows chkdsk started seeing "bad sectors" (whatever this means for an SSD... I think it's really slow parts of the SSD) and kept seeing more and more, and Windows would not boot. A fresh install of Windows would crash within a day or two. I did a "secure erase" and that seemed to do the job; a chkdsk found no "bad sectors". But a few weeks later chkdsk found 4 bad sectors. But it's going on a month now and I have yet to have Windows fail.

SSD wear cliff (Score:5, Informative)

by RichMan ( 8097 ) writes: on Tuesday October 16, 2012 @12:50PM (#41670609)

SSDs have an advertised capacity N and an actual capacity M, where M > N. In general, the bigger M is relative to N, the better the performance and lifetime of the drive. As it wears, it will "silently" mark blocks bad and reduce M. Your write performance will degrade. If you have good analysis tools, they will tell you when the drive starts getting a lot of blocks near end of life and when M is being reduced.
Blocks near end of life are also more likely to get read errors. The drive firmware is supposed to juggle things around so all of the blocks near end of life at about the same time. With a soft read error, the block's data will be moved to a more reliable portion of the SSD. That means increased wear.
1. Watch write performance/spare block count
2. If you get any read errors, do a block life audit
3. When you get into life-limiting events, things accelerate to bad due to the mitigation behaviors
Be careful: depending on the sensitivities of the firmware, it will let you get closer to catastrophe before warning you. Consumer-grade drives are more likely to let you get closer.
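The N-versus-M bookkeeping above can be sketched as a small calculation. This Python snippet is only an illustration under simple assumptions: capacities are counted in blocks, a retired block just shrinks the spare pool, and the 25% warning threshold is an arbitrary number picked for the example, not a vendor figure.

```python
def spare_blocks(actual_blocks, advertised_blocks, retired_blocks=0):
    """Blocks still available beyond the advertised capacity (M - retired - N)."""
    return actual_blocks - retired_blocks - advertised_blocks


def overprovisioning_ratio(actual_blocks, advertised_blocks, retired_blocks=0):
    """Remaining spare capacity as a fraction of the advertised capacity."""
    return spare_blocks(actual_blocks, advertised_blocks, retired_blocks) / advertised_blocks


def spare_health(actual_blocks, advertised_blocks, retired_blocks, warn_fraction=0.25):
    """Classify the spare pool; warn_fraction is a made-up threshold."""
    initial = actual_blocks - advertised_blocks
    remaining = spare_blocks(actual_blocks, advertised_blocks, retired_blocks)
    if remaining <= 0:
        return "out of spares"
    if remaining <= initial * warn_fraction:
        return "replace soon"
    return "ok"


if __name__ == "__main__":
    # Hypothetical drive: M = 1100 blocks of flash behind N = 1000 advertised.
    for retired in (0, 80, 100):
        print(retired, spare_health(1100, 1000, retired))
```

Watching the trend matters more than any single reading: per point 3, once blocks start being retired the extra relocation work speeds up the wear on the rest.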

No, they don't all age the same. (Score:4, Informative)

by YesIAmAScript ( 886271 ) writes: on Tuesday October 16, 2012 @12:55PM (#41670673)

It's statistical, not fixed rate. Some cells wear faster than others due to process variations, and the failures don't show up to you until there are uncorrectable errors. If one chip gets 150 errors spread out across the chip, and another gets 150 in critical positions (near to each other), then the latter one will show failures while the first one keeps going.
So yeah, when one goes, you should replace them all. But they won't all go at once.
Also note most people who have seen SSD failures have probably seen them fail due to software bugs in their controllers, not inherent inability to store data due to wear.
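The "150 errors spread out versus 150 errors close together" point comes down to how many bad bits land in the same error-corrected unit. The Python sketch below counts errors per ECC codeword; the codeword size and the number of correctable bits per codeword are invented round numbers, not figures from any real controller.

```python
from collections import Counter


def uncorrectable_codewords(error_positions, codeword_bits=4096,
                            correctable_per_codeword=8):
    """Return the indices of codewords holding more bit errors than ECC can fix.

    error_positions          -- bit offsets of failed cells on the chip
    codeword_bits            -- bits covered by one ECC codeword (assumed)
    correctable_per_codeword -- bit errors the ECC can repair per codeword (assumed)
    """
    per_codeword = Counter(pos // codeword_bits for pos in error_positions)
    return sorted(cw for cw, count in per_codeword.items()
                  if count > correctable_per_codeword)


if __name__ == "__main__":
    spread = [i * 4096 for i in range(150)]   # one bad bit in each of 150 codewords
    clustered = list(range(150))              # 150 bad bits inside a single codeword
    print(uncorrectable_codewords(spread))
    print(uncorrectable_codewords(clustered))
```

Both chips in the example have the same number of failed cells, but only the clustered one exceeds what the correction can hide, so only that one shows a failure to the host.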

Still relevant? (Score:3)

by Oscaro ( 153645 ) writes: on Tuesday October 16, 2012 @12:58PM (#41670723) Homepage

After reading this horror story I came to the conclusion that SSDs are not for me. I wonder if it's still true.
Super Talent 32 GB SSD, failed after 137 days
OCZ Vertex 1 250 GB SSD, failed after 512 days
G.Skill 64 GB SSD, failed after 251 days
G.Skill 64 GB SSD, failed after 276 days
Crucial 64 GB SSD, failed after 350 days
OCZ Agility 60 GB SSD, failed after 72 days
Intel X25-M 80 GB SSD, failed after 15 days
Intel X25-M 80 GB SSD, failed after 206 days
http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html [codinghorror.com]

First hand experience here (Score:5, Informative)

by SeanTobin ( 138474 ) writes: <<byrdhuntr> <at> <hotmail.com>> on Tuesday October 16, 2012 @12:58PM (#41670729)
I recently had an "old" (circa 2008) 64GB SSD die on me. Its death followed this pattern:
- Inexplicable system slowdowns. In hindsight, this should have been a warning alarm.
- System crash, followed by a failure to boot due to an unclean NTFS volume which couldn't be fixed by chkdsk
- Failed to mount r/w under Ubuntu. Debug logs showed that the volume was unclean and all writes failed with a timeout
- Successful r/o mount showed that the filesystem was largely intact
- Successful dd imaged the drive and allowed a restore to a new drive.
After popping a new disk in and doing a partition resize, my system was back up and running with no data loss. Of all the storage hardware failures I've experienced, this was probably the most pain-free as the failure caused the drive to simply degrade into a read-only device.
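For anyone who wants to see what the imaging step amounts to, here is a rough Python sketch of a block-by-block copy that keeps going past read errors, similar in spirit to `dd conv=noerror,sync`. It is not what the poster ran (they used dd), the function name is made up, and it assumes the source path is something the OS lets you open read-only and seek on, such as a block device on Linux or an ordinary file.

```python
import os


def image_device(src_path, dst_path, block_size=4096):
    """Copy src to dst block by block, writing zeroes where a read fails.

    Returns the list of offsets that could not be read, so the holes in
    the image are known afterwards.
    """
    bad_offsets = []
    flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)   # O_BINARY only exists on Windows
    src = os.open(src_path, flags)
    try:
        size = os.lseek(src, 0, os.SEEK_END)
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    os.lseek(src, offset, os.SEEK_SET)
                    data = os.read(src, want)
                except OSError:
                    bad_offsets.append(offset)
                    data = bytes(want)   # zero filler keeps image offsets aligned
                if not data:             # unexpected end of input
                    break
                dst.write(data)
                offset += len(data)
    finally:
        os.close(src)
    return bad_offsets
```

Calling `image_device("/dev/sdX", "rescue.img")` (with a placeholder device name) would need root on most systems; a purpose-built tool that retries and logs bad regions is a better choice for a drive you actually care about.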
Some anecdotes (Score:3)

by toastyman ( 23954 ) writes: <toasty@dragondata.com> on Tuesday October 16, 2012 @01:15PM (#41670969) Homepage

We've got a fair number of SSDs here. Failures have been really rare. The few that have:
#1 just went dead. Not recognized by the computer at all.
#2 Got stuck in a weird read-only mode. The OS was thinking it was writing to it, but the writes weren't really happening. You'd reboot and all your changes were undone. The OS was surprisingly okay with this, but would eventually start having problems where pieces of the filesystem metadata it cached didn't sync up with new reads. Reads were still okay, and we were able to make a full backup by mounting in read only mode.
#3 Just got progressively slower and slower on writes, but reads were fine.
Overall far lower SSD failure rates than spinning disk failure rates, but we don't have many elderly SSDs yet. We do have a ton of servers running ancient hard drives, so it'll be interesting to see over time.

Theory or Practice? (Score:5, Interesting)

by rabtech ( 223758 ) writes: on Tuesday October 16, 2012 @01:21PM (#41671051) Homepage

In theory they should degrade to read-only just as others have pointed out in other posts, allowing you to copy data off them.
In reality, just like modern hard drives, they have unrecoverable firmware bugs, fuses that can blow with a power surge, controller chips that can burn up, etc.
And just like hard drives, when that happens in theory you should still be able to read the data off the flash chips but there are revisions to the controller, firmware, etc that make that more or less successful depending on the manufacturer. You also can't just pop the board off the drive like with an HDD, you need a really good surface mount resoldering capability.
So the answer is "it depends"... If the drive itself doesn't fail but reaches the end of its useful life or was put on the shelf less than 10 years ago (flash capacitors do slowly drain away) then the data should be readable or mostly readable.
If the drive itself fails, good luck. Maybe you can bypass the fuse, maybe you can re-flash the firmware, or maybe it's toast. Get ready to pay big bucks to find out.
P.S. OCZ is fine for build it yourself or cheap applications but be careful. They have been known to buy X-grade flash chips for some of their product lines - chips the manufacturers list as only good for kid toys or non-critical, low-volume applications. Don't know if they are still doing it but I avoid their stuff.
Intel's drives are the best and have the most-tested firmware but you pay for it. Crucial is Micron's consumer brand and tends to be pretty good given they make the actual flash - they are my go-to brand right now. Samsung isn't always the fastest but seems to be reliable.
Do your research and focus on firmware and reliability, not absolute maximum throughput/IOPs.

Tin whiskers (Score:3)

by AikonMGB ( 1013995 ) writes: on Tuesday October 16, 2012 @01:27PM (#41671117) Homepage

Tin whisker growth is another way not directly related to the flash cells. Commercial electronics use lead-free solder and no real whisker mitigation techniques. Eventually a whisker shorts between two things that shouldn't be shorted, conducts sufficient current for a sufficient amount of time, and poof, your drive is dead.

Experience with all ranges (Score:4, Interesting)

by guruevi ( 827432 ) writes: on Tuesday October 16, 2012 @05:11PM (#41674051)

The ultra-cheap SSDs in my servers lasted only 3 months. The 4 OCZ Vertex 3 IOPS have so far lasted over a year with ~2TB processed per disk; 2 Intel SLCs and 2 MLCs have already lasted over 2 years, over which time they have processed ~10TB each (those were all enterprise grade or close to it). They are in a 60TB array doing caching so they regularly get read/write/deleted. I have some OCZ Talos (SAS) as well, where one was DOA and another died early, but I simply shipped them in for RMA and had another one in a couple of days. But the rest of them are doing well at over 6 months and going.
Several other random ones still work fine in random desktop machines and workstations.
As far as spare room on those devices, depending on the manufacturing process you get between 5 and 20% unused space where 'bad' blocks come to live. I haven't had one with bad blocks so most of mine have gone out with a bang, usually they just stop responding and drop out, totally dead. I would definitely recommend RAID6 or mirrors as they do die just like normal hard drives (I just had 3 identical 3TB drives die in the last week)

Intel SSD in the Enterprise: very low failure rate (Score:5, Informative)

by bbasgen ( 165297 ) writes: on Tuesday October 16, 2012 @05:22PM (#41674215) Homepage

I have ordered approximately 500 Intel SSDs over the past 18 months (320 series and the 520 series primarily). To date, we have had exactly one fail to my knowledge. It was a 320 series 160 GB with the known firmware issue. We have around 80 of that type and size, and the drive that failed did so on first image. We RMA'ed the drive and got a replacement.

- - Re:Die! (Score:5, Funny)
    
    by Anonymous Coward writes: on Tuesday October 16, 2012 @12:27PM (#41670219)
    
    Wow - you've been here a long long time then
    
    - - Re: (Score:3)
        
        by HaZardman27 ( 1521119 ) writes:
        
        why are you complaining about our long standing culture
        
        lister king of smeg (2481612)
        Not your first account I take it?
- Re:When you're nearing maximum write limit (Score:4, Interesting)
  
  by theNetImp ( 190602 ) writes: on Tuesday October 16, 2012 @12:33PM (#41670319)
  
  So by reason of thinking, if you have a RAID of 15 drives for storage of images, these images never change, they are written and never over written, then the SSDs should theoretically never die because they are only reading these bits now?
  
  - Re: (Score:3, Interesting)
    
    by Baloroth ( 2370816 ) writes:
    
    So by reason of thinking, if you have a RAID of 15 drives for storage of images, these images never change, they are written and never over written, then the SSDs should theoretically never die because they are only reading these bits now?
    Reading flash is not 100% non-destructive; if you never do a re-write, cells near each read cell (which is all of them, probably) will degrade over time. I believe the stored data will degrade over long periods of time in any case, but I'm not sure. But if you re-write data every year or so, they could probably last decades.
  - Re:When you're nearing maximum write limit (Score:4, Informative)
    
    by SydShamino ( 547793 ) writes: on Tuesday October 16, 2012 @12:51PM (#41670633)
    
    In theory, yes. In flashROM devices the erase process is the aging action. Your write-once-never-erase-read-only flash should last until A) enough charge manages to leak out of gates that you get bit errors, or B) the part fails due to corrosion or other long-term aging issue, similar to any piece of electronics.
    If you have raw access to the flashROM you could in theory write the same data into the same unerased bytes to recover from bit errors (if you had an uncorrupted copy), so only aging failures would occur. But of course you can't do this with an SSD as you have no direct access to the memory, and the controller A) wouldn't let you write into unerased space, and B) wouldn't write the data into the exact same place again anyway.
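
The point about re-writing into unerased bytes can be illustrated with a toy model. This is only a sketch under simplifying assumptions: erased cells read as 1 bits, programming can only pull bits from 1 to 0, and charge loss lets a programmed 0 drift back to 1. The `FlashBlock` class and its methods are hypothetical names made up for this illustration; they do not correspond to any real controller or library, and real parts add ECC, wear leveling and much more.

```python
ERASED = 0xFF  # assumption: erased cells read back as all 1 bits


class FlashBlock:
    """Toy model of one erase block (hypothetical, not a real controller)."""

    def __init__(self, size=16):
        self.cells = bytearray([ERASED] * size)
        self.erase_count = 0  # erases are the aging action in this model

    def erase(self):
        """Reset every bit to 1; this is the step that wears the block."""
        self.cells = bytearray([ERASED] * len(self.cells))
        self.erase_count += 1

    def program(self, offset, data):
        """Programming can only move bits from 1 to 0, never back to 1."""
        for i, byte in enumerate(data):
            self.cells[offset + i] &= byte

    def leak(self, offset, bit):
        """Simulate charge loss: force one bit of a cell back to 1."""
        self.cells[offset] |= 1 << bit

    def read(self, offset, length):
        return bytes(self.cells[offset:offset + length])


block = FlashBlock()
image = b"JPEG"
block.program(0, image)

# Charge leaks from one gate, so a stored 0 bit now reads as 1.
block.leak(0, 0)
corrupted = block.read(0, len(image))

# Re-programming the same bytes in place restores the lost 0 bit
# without an erase, so no extra wear is recorded.
block.program(0, image)
repaired = block.read(0, len(image))

print(corrupted, repaired, block.erase_count)
```

In this model the repair works only because the error went in the 0-to-1 direction; a bit that wrongly reads 0 could not be fixed without an erase, which is one more reason an SSD controller would not expose this kind of in-place rewrite.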
    
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
- Re: (Score:3)
  
  by bytesex ( 112972 ) writes:
  
  Did you mount it with noatime and nodiratime?
