
SSD Failure Temporarily Halts Linux 3.12 Kernel Work

Posted by Soulskill
from the must-be-nvidia's-fault dept.
jones_supa writes "The sudden death of a solid-state drive in Linus Torvalds' main workstation has led to the work on the 3.12 Linux kernel being temporarily suspended. Torvalds has not been able to recover anything from the drive. Subsystem maintainers who have outstanding pull requests may need to re-submit their requests in the coming days. If the SSD isn't recoverable he will finish out the Linux 3.12 merge window from a laptop."
This discussion has been archived. No new comments can be posted.

  • Really? (Score:5, Insightful)

    by koan (80826) on Wednesday September 11, 2013 @03:52PM (#44822461)

    No backup?

    • Re:Really? (Score:5, Insightful)

      by gagol (583737) on Wednesday September 11, 2013 @03:56PM (#44822543)
      I found spinning rust to at least give some clues prior to a crash and burn. I would say, single SSD is not ready for anything critical, in my opinion. Worst case scenario, you can always get the platters transferred in a good drive and recover from there (pricey, but cheap if data is valuable enough).
      • Re:Really? (Score:5, Funny)

        by SJHillman (1966756) on Wednesday September 11, 2013 @04:06PM (#44822743)

        Maybe Linus doesn't consider Linux to be critical...

        Microsoft sure as hell doesn't seem to find Windows to be critical.

      • Re:Really? (Score:5, Insightful)

        by Anonymous Coward on Wednesday September 11, 2013 @04:06PM (#44822745)

        I used to think that too, until I had a mechanical hard drive experience controller failure without warning. Single drive is not ready for anything critical, regardless of the storage mechanism.

        • Re:Really? (Score:5, Interesting)

          by chuckinator (2409512) on Wednesday September 11, 2013 @04:27PM (#44822979)
          Seconded. I've had a RAID1 mirror on my primary workstation at home for roughly... 4 years. I had one of those "oh, drat, my drive is starting to click, and we all know what that means..." moments and barely had time to back up the /home partition to an external machine while I went hardware shopping. Since that event window closed, that configuration has saved my butt twice. One time, the mirrored pair started to go after kinetic shock from moving to a new residence, and it didn't even stress me out to wait for a new pair from my online vendor of choice. I don't know what happened the second time, but I'm guessing that some bad components on the mobo were dirtying the 5V and 3.3V power rails into the drive connector, because the whole rig decided to go kaput shortly after in a way that forced an upgrade to the latest CPU socket du jour mobo. Thankfully, I was already budgeting for new guts for that rig due to performance demands.
          • RAID (Score:5, Interesting)

            by Larry_Dillon (20347) <dillon.larry@nOsPam.gmail.com> on Wednesday September 11, 2013 @04:57PM (#44823349) Homepage

            I'm not nearly as much of a believer in RAID for the home environment. If you (accidentally) delete something on one drive it's gone from both. Better to buy two drives and do a daily rsync. That way you have a window of opportunity to recover data. Personally, I use rsync without --delete until the second drive starts getting full, then I use the --delete flag to clean up.
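
A minimal sketch of the rsync routine described above, assuming rsync is installed. The paths, the 90% fullness threshold and both function names are hypothetical; only the `-a` and `--delete` flags are real rsync options.

```python
import shutil
import subprocess

def disk_used_fraction(path):
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def build_rsync_command(source, dest, used_fraction, threshold=0.90):
    """Build the rsync argument list.

    --delete is added only once the backup drive is nearly full, so files
    deleted by accident on the source survive on the backup until then.
    The threshold is an illustrative assumption, not a recommendation.
    """
    cmd = ["rsync", "-a"]
    if used_fraction >= threshold:
        cmd.append("--delete")
    # Trailing slash on the source: copy its contents, not the directory itself.
    cmd += [source.rstrip("/") + "/", dest]
    return cmd

# Hypothetical daily run (for example from cron):
#   dest = "/mnt/backup/home"
#   subprocess.run(build_rsync_command("/home/me", dest, disk_used_fraction(dest)), check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to print and inspect the command before letting it delete anything.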

            • RAID 1 with a nightly rsync to an off-site server has worked for me for several years now. The remote server runs zfs so I also take weekly snapshots in case I need to restore something older than last night.

            • by jekewa (751500)

              Accidental deletion is a whole different beast. If you accidentally delete something created between rsync copies it's gone for good, too, and rsync can't save you.

              Unless your tool does some incremental storage for you. For example, Eclipse saves each save in a local history, including deletions, so you can go back in time even if all you did is change the file (which would also have "not there" impact between rsync copies).

              If you need that kind of assurance, you'll need more than rsync or RAID.

            • Re:RAID (Score:5, Interesting)

              by Miamicanes (730264) on Wednesday September 11, 2013 @06:50PM (#44824681)

              The thing that really sucks about SSDs (at least, Sandforce-based drives) is the fact that 99% of their failures are due to firmware bugs that can be simultaneously triggered on an entire array at once (especially the sleep-related bugs). It's a mode of failure the creators of RAID 1, 5, and 10 never anticipated.

              IMHO, the worst thing about SSDs (at least, those with Sandforce controllers) is the fact that they have mandatory full-drive encryption that can't be disabled, using a key you aren't allowed to set or recover, and gets blown away whenever you reflash the firmware. This means, among other things, if the drive's controller gets itself confused:

              * You can't reflash data-recovery firmware onto the drive. The act of flashing it would blow away the encryption key and render the data gone forever.

              * If the drive decides you're trying "too hard" to systematically extract data from it while it's in a confused state, it'll go into "panic mode" by blowing away the encryption key. If this happens, your data is gone forever AND you have to send the drive back to OCZ or whomever you got it from in order to get it unlocked. For your protection, of course. And Hollywood's. Among other things, dd_rescue/ddrescue can trigger panic mode.

              * You can't even do the equivalent of removing the platters from a conventional drive in a clean room and mount them to another drive for reading, because the data on the flash chips is all encrypted, and the key is unrecoverable.

              This is BULLSHIT, and it's why I refuse to buy any more SSDs. I, as an end user, should be able to download a utility from somewhere, reflash the drive to firmware that includes an offline recovery mode that simply dumps the flash chip content from start to finish, and either disable the encryption or set it to a key *I* control, so the 99.99999% of the data on the drive that's good when the embedded firmware freaks out can be dumped and recovered offline.

              If there's a God, Linus will go NUCLEAR over this, get a few seconds on CNN & other networks to rant about the unreliability of SSDs, and scare enough consumers to hit the industry HARD where it'll hurt the most... their bank accounts.

              It might not be possible to make SSDs reliable, but DAMMIT, they should at least be RECOVERABLE. There were goddamn hard drives with recoverable data pulled out of laptops left in safes in the Vistamark hotel when a tower sheared it in half and buried it under flaming rubble, yet an SSD that dies if you so much as look at it the wrong way due to firmware bugs ends up being fundamentally unrecoverable for no hard technical reason.

              And yes, I'm bitter about having my hard drive commit suicide for no reason besides Sandforce Business Policy. As long as they keep making controllers that cause drives to self-destruct at the drop of a hat, I'll keep doing my best to talk people out of buying drives tainted by their controller chips. Sandforce sucks.

      • So you've never had a hard disk controller failure then?

        " Worst case scenario, you can always get the platters transferred in a good drive and recover from there"

        What makes you think you can't take FLASH devices and access them in a similar way to platters? Just like with platters, you won't be able to access data on any damaged portions but unlike with platters it is unlikely that the platters will trash the read/write heads of the new drive.

        • Re: (Score:3, Funny)

          by jimbolauski (882977)

          So you've never had a hard disk controller failure then?

          " Worst case scenario, you can always get the platters transferred in a good drive and recover from there"

          What makes you think you can't take FLASH devices and access them in a similar way to platters? Just like with platters, you won't be able to access data on any damaged portions but unlike with platters it is unlikely that the platters will trash the read/write heads of the new drive.

          I don't know what you're talking about, it's very easy to desolder a couple hundred pins on a board, then install a new chip and resolder the new chip back in. That's just as easy as popping off the back of the HD, removing a couple of screws and pulling out the platter.

        • Re:Really? (Score:5, Informative)

          by Guspaz (556486) on Wednesday September 11, 2013 @04:51PM (#44823277)

          What makes you think you can't take FLASH devices and access them in a similar way to platters?

          Because on most SSDs, the data is encrypted, and on all SSDs, the pages are in an effectively random order. If you've lost the controller, you've lost both the encryption keys and the table that enables a logical platter-style presentation of the pages. No amount of soldering is going to fix those problems.

          • The new drive has a new controller. Where do you think the controller stores all the data it needs to decrypt? Hint: It is in the FLASH devices. I am not saying this will work 100% of the time, since the damaged part might be the component that stores the needed information, but again, that is no different than a platter scenario. There is a reason why data recovery services don't guarantee success with platter based media.
        • > What makes you think you can't take FLASH devices and access them in a similar way to platters?

          Sandforce controllers enforce mandatory AES encryption that can't be disabled, using a key that can't be recovered or set to a known value. So if your controller decides to quit allowing you to access your data, unsoldering the chips won't do you any good, because the values you read from them might as well be random noise.

      • Re:Really? (Score:5, Informative)

        by tlhIngan (30335) <.slashdot. .at. .worf.net.> on Wednesday September 11, 2013 @04:27PM (#44822985)

        I found spinning rust to at least give some clues prior to a crash and burn. I would say, single SSD is not ready for anything critical, in my opinion. Worst case scenario, you can always get the platters transferred in a good drive and recover from there (pricey, but cheap if data is valuable enough).

        Sudden SSD failure is actually not really a failure that's detectable. Good SSDs have tons of metrics available through SMART including media wear indicators that tell you impending failure long before it happens.

        But when an SSD suddenly dies, it's generally because the controller's FTL tables got corrupted. For high performance drives, it's remarkably easy to do as performance is #1, not data safety. There's nothing wrong with the disk or the electronics.

        The FTL (flash translation layer) is what maps a sector the OS uses to the actual flash sector itself. If it gets corrupted, the controller has no way of accessing the right sectors anymore and things go tits up. It's even worse because a lot of metrics are tied to the FTL, including media wear, so losing that data means you can't simply erase and start over - you're completely hooped as the controller cannot access anything.

        If you want to think of it another way, treat it like the super block on a filesystem, and the filesystem tables. Now imagine they get corrupt - the data is useless and recovery is difficult, even though the underlying media is perfectly fine. It's possible to hose it so badly that recovery is impossible.

        For speed, FTL tables are cached - and modern SSDs can easily have 512MB-1GB of DDR memory just to hold the tables. Of course, you can't write-through changes since the tables themselves need to be wear-levelled on the flash media.

        One of the iffiest times for this comes when an SSD is power cycled - pulling the power on an SSD can cause corruption because the tables may be in the middle of an update. But things like firmware bugs and other things can easily corrupt the table as well (think a stray pointer scribbling over the table RAM). A good SSD often has extra capacitance onboard to ensure that on sudden power failure, there is enough backup power to do an emergency commit to flash. This protects against power cycling, but firmware bugs can still destroy the data.

        Of course, SSDs without such features mean the firmware has to be extra careful. And sometimes, such precautions can miss a point in time where you cannot pull the power at all.

        It's sort of reminiscent of that Seagate failure that resulted in a log file reaching a certain size disabling the drive - the data and media were perfectly fine, it's just that the firmware crapped out.
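
To make the FTL point above concrete, here is a toy model, not how any real controller works. The class and its fields are invented for illustration, and it has no garbage collection, wear levelling or caching. It only shows that the data can sit intact in flash while being unreachable once the mapping table is gone.

```python
class ToyFTL:
    """Toy flash translation layer: maps logical sectors to physical pages."""

    def __init__(self, num_pages):
        self.flash = [None] * num_pages  # physical pages (the "media")
        self.table = {}                  # logical sector -> physical page index
        self.next_free = 0               # next unwritten physical page

    def write(self, sector, data):
        # Flash is not overwritten in place: every write lands on a fresh
        # page and the table entry is repointed. The old page becomes stale.
        if self.next_free >= len(self.flash):
            raise RuntimeError("toy model has no garbage collection: out of pages")
        self.flash[self.next_free] = data
        self.table[sector] = self.next_free
        self.next_free += 1

    def read(self, sector):
        page = self.table.get(sector)
        return None if page is None else self.flash[page]


ftl = ToyFTL(8)
ftl.write(0, "boot")
ftl.write(1, "home")
ftl.write(0, "boot-v2")        # rewrite of sector 0 goes to a new page
before = ftl.read(0)           # "boot-v2"

ftl.table.clear()              # simulate a corrupted or lost mapping table
after = ftl.read(0)            # None: the drive now looks blank
still_on_media = "boot-v2" in ftl.flash   # True: the bits are still there
```

Without the table there is also no way to tell which of the two pages for sector 0 is current, which is part of why raw chip dumps are so hard to reassemble.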

        • by djdanlib (732853)

          It would be great if they would mention stability features on the box, or at least in the marketing material. But they don't. It always looks like this: MEMORY! It's quiet! SATA-II maximum bandwidth of 3.0 Gbps! Speed up your desktop! Look at the rebate! Millions of hours MTBF! Low power usage!

    • Re:Really? (Score:5, Funny)

      by Anonymous CowWord (635850) on Wednesday September 11, 2013 @03:59PM (#44822601)

      Haven't you heard?

      "Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)" - Linus Torvalds[1]

      1: https://groups.google.com/forum/#!msg/linux.dev.kernel/2OEgUvDbNbo/bTk-VE1zrnYJ [google.com]

      • "Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)" - Linus Torvalds[1]

        Pfff... That's soooo last century!

        Let me fix that for you, Mr. Torvalds
        "Only wimps use tape backup: real men just upload their important stuff on git, and let the rest of the world clone it"
        Now that sounds more typical for the current decade.

        Oh, and for the MasterCard-Ads like finish:
        "For everyone else, there's the NSA."

        ----

        The funniest part is that he is the actual author of the git scm system which served him as backup this time.

    • Re: (Score:2, Funny)

      by Anonymous Coward

      Ask Obama!

      He's got a backup...

    • Re:Really? (Score:5, Informative)

      by Anonymous Coward on Wednesday September 11, 2013 @04:01PM (#44822645)

      No backup?

      http://lkml.indiana.edu/hypermail/linux/kernel/1309.1/01690.html

      I long ago gave up on doing backups. I have actively moved to a model
      where I use replacable machines instead. I've got the stuff I care
      about generally on a couple of different machines, and then keys etc
      backed up on a separate encrypted USB key.

      So it's inconvenient. Mainly from a timing standpoint. But nothing more.

      Linus

    • No. Not really. He has thousands of them all over the internet [git-scm.com]. How else do you think he is going to finnish the job with his laptop (excuse the pun.)
    • by sjames (1099)

      The world is his backup. No code seems to be lost, just temporarily not where he wants it to be.

    • Re:Really? (Score:5, Funny)

      by stewsters (1406737) on Wednesday September 11, 2013 @04:14PM (#44822871)
      Yeah, I wonder if anyone has ever told him about git. Too bad he didn't back it up. Now we will have to start a new Linux kernel.

      Sarcasm Intended.
    • by Joce640k (829181)

      No backup?

      Didn't he write some source code control system or other to prevent this...?

  • by Sneakernets (1026296) on Wednesday September 11, 2013 @03:55PM (#44822513) Journal
    That's all that Ballmer needs to stop Linux? Just find Torvalds' SSD?
    • Re: (Score:3, Insightful)

      by CastrTroy (595695)
      Makes me wonder what would happen to Linux development if Torvalds was to get hit by a bus, or be incapacitated in some way. Is kernel development that reliant on one person that a single laptop breaking brings everything to a halt?
      • by sjames (1099)

        Only in the short term. In the bus scenario, another leader would be chosen by the developers. There are several good choices there.

  • by ruiner13 (527499) on Wednesday September 11, 2013 @03:55PM (#44822525) Homepage
    Maybe Linus needs to create a backup program like he did when he wanted a better version control system and created git? Also, why is the only copy of the changes on his local workstation and not a server with redundancy? This seems rather amateurish.
  • by IMarvinTPA (104941) <IMarvinTPA@@@IMarvinTPA...com> on Wednesday September 11, 2013 @03:57PM (#44822553) Homepage Journal

    Linus said "So I don't want to necessarily blame the harddisk, since it's just ten
    days since I upgraded the rest of my machine, after it worked years in
    the previous one. That just makes me go "hmm". As far as I know, all
    the fans etc were working fine, but.."

    There's his problem: "after it worked years in the previous [machine]."

    His SSD died a natural death of old age.

    IMarv

    • by kwalker (1383)

      That's not how drives die of old age. A sudden and permanent drive failure like what is described is almost always a controller failure. When mechanical drives die of old age, they generally develop bad sectors and read-errors accumulate on the platter, but you can still read from the un-damaged areas. When SSDs die, those worn-out sectors go read-only or begin throwing similar read/write errors, depending on the firmware.

      After having a 40GB IBM Deathstar suddenly go down in flames, and dozens of "salvage m

    • by citizenr (871508)

      His SSD died a natural death of old age.

      IMarv

      there is NOTHING natural about a drive that disappears without a notice with all of your data.

  • The one (personal) thing storage-related that I'd like to re-iterate is that I think that rotating storage is going the way of the dodo (or the tape). "How do I hate thee, let me count the ways". The latencies of rotational storage are horrendous, and I personally refuse to use a machine that has those nasty platters of spinning rust in them.

    Bet you regret knocking those platters of spinning rust [slashdot.org] now, don't you Mr. Torvalds?

  • by Nick (109)
    Was he too busy treating people horribly to audit his DR procedures?
  • why this news? (Score:5, Insightful)

    by Laxori666 (748529) on Wednesday September 11, 2013 @04:04PM (#44822715) Homepage
    Why is this news... is this our version of People magazine, where instead of hearing about all the details of the Kardashians' lives, we hear about every email or event that happens to Linus?
    • Why is this news... is this our version of People magazine, where instead of hearing about all the details of the Kardashians' lives, we hear about every email or event that happens to Linus?

      It shows that the best or at least most respected in the business can still be stupid when it comes to simple things like backups. Seriously, there is no reason in this day or age to lose more than a couple of transactions if you are careful. Someone kick Linus in the ass for being so sloppy and lazy.

  • I find it amazing to consider that he is not working on a redundant and well backed up machine. Where's last hour's backup? Yesterday's backup? Even pig farmers know to back up their data.

  • I'm no kernel maintainer but...

    If his workstation is so important why doesn't he mirror the disks?
    Back them up regularly?
    Run a remote desktop to a server with the above conditions?

  • ...for over an hour when Torvalds had to make an emergency run to Albertson's for some toilet paper and hostility medication.

  • by Mike_EE_U_of_I (1493783) on Wednesday September 11, 2013 @04:13PM (#44822861)

    I've owned several hundred hard drives over the last 30 years. I've never had an active hard drive just blank out. I have had drives that had not been powered for a couple of years refuse to ever come back. But if I did not feel the need to even power the thing on for years, you can imagine how little I cared for what was on it.

    In the last four years, I've owned around 20 SSDs. I've had five failures. In every single one, the drive just instantly lost everything. Amazingly, in four of the five cases, the drive still worked fine! It had simply lost all the data on it and believed itself to be a blank drive.

    That said, the speed of SSDs makes them worth the risk to me. But I take backups far more seriously than I used to. I need them far more often.

    • by RichMan (8097) on Wednesday September 11, 2013 @04:25PM (#44822971)

      A hard shutdown of high-speed SSD is death. It takes really really good firmware to recover without reinitializing the drive.

      The basic SSD "format" is susceptible to damage on power failure in a way that hard drives are not. The mapping and setup tables of the SSD are critical and constantly in flux, unlike a hard drive, where the mapping is only updated when a failure occurs.
      SSDs need internal power-fail control so they can shut down gracefully, and firmware that supports it.

    • Oh man, that's scary. Any *good* solution? I've heard RAID is a no-no on SSDs as it will shorten their life. Maybe regular BTRFS/LVM snapshots exported to a spinning disk?

    • by Dracos (107777)

      This describes several of the reasons why I will not buy an SSD any time in the near future. Sketchy reliability, indeterminate longevity, inexplicable data loss. Mirroring a turd just means you have multiple turds. I have a few 10+ year old DeskStar drives that I still use and have never given me problems.

  • by Cmdrx (655099) on Wednesday September 11, 2013 @04:28PM (#44822997)
    Now there's a new meaning for Kernel Panic!
  • by AaronW (33736) on Wednesday September 11, 2013 @04:41PM (#44823179) Homepage

    I learned long ago after some close calls to back everything up. In my case for my desktop I store my data on an XFS partition stored on a RAID 5 hard drive array. I also am using Crashplan to back up all of my data, both to a removable hard drive and to the cloud with over 3TB of data backed up. The nice thing about Crashplan is that it continually backs up, taking periodic snapshots so I can restore a previous version of a file if I wish. The main drawbacks of Crashplan are that it runs on Java and can be a memory pig. I pay $6/month for unlimited backup of up to 10 machines and have several computers backed up with them now. With the proper settings on my router I don't even notice all the backup traffic running in the background.

    Since I have had sudden SSD failures in the past I also dump my root XFS filesystem weekly onto my RAID array (it takes under a minute to run xfsdump) and incremental backups nightly and those dumps get backed up on the cloud as well.

    I have found the XFS tools to be quite good at recovery when things go really bad. When running software RAID 1 I had problems where drives would drop out of the array for apparently no reason and I have had several occasions where while rebuilding the other drive would pop out of the array. Switching to an Areca hardware raid controller with battery backed DRAM ended those problems (besides seeing a big performance improvement).

    I have found the RAID controller to work well when drive failure occurs and it even recovered after human error (I accidentally disconnected one of the active drives while it was rebuilding and reconnected it).

    I won't use btrfs yet. The last time I tried it about 6 months ago it was quite slow and I have a lot of concerns about the storage filling up due to COW that have not been adequately addressed as far as I could tell. I tried setting it up for a Cyrus IMAP server on an Intel SSD and it was unusably slow just untarring all the files so I ended up going back to XFS.

    SSDs are still relatively new. I have had issues with some firmware versions and had one fail catastrophically after only 2 weeks of use. I have also had compact flash and SD devices suddenly fail. My experience is that usually mechanical hard drives give some warning (i.e. SMART) and they tend to last years. I have a server I just retired where the hard drive had 10 years on the clock according to SMART.

  • by stox (131684) on Wednesday September 11, 2013 @05:03PM (#44823435) Homepage

    I have a mirrored set of SSDs on all my important machines, and RAID 6 for bulk storage.

    Unlike Linus, I can't afford to lose work.

  • by redelm (54142) on Wednesday September 11, 2013 @05:27PM (#44823743) Homepage

    This might be [electrolytic] capacitor or some other component-level magic-smoke release. There is also the dreaded, much-discussed "wear" from re-writing flash memory -- worse than you think because blocks of 64 KB [typically] have to be erased and re-written to change any byte therein.

    Linus, of all people, ought to know his kernel has options to minimize the re-writes, many of them developed to optimize laptops (like delaying writes). Another thing is to mount partitions (/etc/fstab anyone?) with `noatime` as an option (maybe `nodiratime` too). Un*x and other Linux-like systems by default will re-write the access time for any disk inode read. Turning it off reduces disk write load (and seeks on slow disks). I've had it off for over ten years and not noticed any malperformance, although there are rumored to be some, somewhere.
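
A small sketch for checking which filesystems are mounted without `noatime`, as suggested above. It assumes text in the /proc/mounts layout (device, mount point, filesystem type, comma-separated options). The function name, the list of skipped pseudo-filesystems and the sample text are illustrative assumptions.

```python
def mounts_without_noatime(mounts_text, skip_fstypes=("proc", "sysfs", "tmpfs", "devtmpfs")):
    """Return mount points whose option list does not contain 'noatime'.

    Expects lines in the /proc/mounts layout:
        device mountpoint fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in skip_fstypes:
            continue  # pseudo-filesystems do not write to the disk
        if "noatime" not in options.split(","):
            result.append(mount_point)
    return result


# Made-up sample; on a Linux box one could pass open("/proc/mounts").read() instead.
SAMPLE = """\
/dev/sda1 / ext4 rw,noatime 0 0
/dev/sda2 /home ext4 rw,relatime 0 0
proc /proc proc rw,nosuid 0 0
"""

flagged = mounts_without_noatime(SAMPLE)   # -> ['/home']
```

Splitting the options on commas rather than doing a substring search avoids counting some other option that merely contains the text `noatime`.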
