SSD Failure Temporarily Halts Linux 3.12 Kernel Work 552
jones_supa writes "The sudden death of a solid-state drive in Linus Torvalds' main workstation has led to work on the 3.12 Linux kernel being temporarily suspended. Torvalds has not been able to recover anything from the drive. Subsystem maintainers who have outstanding pull requests may need to re-submit their requests in the coming days. If the SSD isn't recoverable, he will finish out the Linux 3.12 merge window from a laptop."
Re:Really? (Score:5, Informative)
No backup?
http://lkml.indiana.edu/hypermail/linux/kernel/1309.1/01690.html
I long ago gave up on doing backups. I have actively moved to a model
where I use replaceable machines instead. I've got the stuff I care
about generally on a couple of different machines, and then keys etc
backed up on a separate encrypted USB key.
So it's inconvenient. Mainly from a timing standpoint. But nothing more.
Linus
Re:Welcome to how SSDs fail. (Score:4, Informative)
A hard shutdown of a high-speed SSD is death. It takes really, really good firmware to recover without reinitializing the drive.
The basic SSD "format" is susceptible to damage on power failure in a way that hard drives are not. The mapping and setup tables of an SSD are critical and constantly in flux, unlike a hard drive, where the mapping is only updated when a failure occurs.
SSDs need internal power-fail protection so they can shut down gracefully, and firmware that supports it.
Re:Really? (Score:5, Informative)
Sudden SSD failure generally isn't the kind of failure that's detectable in advance. Good SSDs expose tons of metrics through SMART, including media wear indicators that warn of impending failure long before it happens.
But when an SSD suddenly dies, it's generally because the controller's FTL tables got corrupted. For high performance drives, it's remarkably easy to do as performance is #1, not data safety. There's nothing wrong with the disk or the electronics.
The FTL (flash translation layer) is what maps a sector the OS uses to the actual flash sector itself. If it gets corrupted, the controller has no way of accessing the right sectors anymore and things go tits up. It's even worse because a lot of metrics are tied to the FTL, including media wear, so losing that data means you can't simply erase and start over - you're completely hooped as the controller cannot access anything.
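The indirection being described can be shown with a toy sketch (invented names, nothing like real firmware): the flash pages stay physically intact, but once the translation table is gone there is no way to reach them.

```python
# Toy model of a flash translation layer (FTL).
# Logical sector numbers (what the OS asks for) map to physical flash pages.
ftl = {0: 17, 1: 3, 2: 42}                      # logical sector -> physical page
flash = {17: b"boot", 3: b"data1", 42: b"data2"}  # the media itself

def read_sector(logical):
    # Every read goes through the table; there is no other path to the data.
    return flash[ftl[logical]]

assert read_sector(1) == b"data1"

ftl.clear()   # simulate table corruption (power loss mid-update, stray pointer...)

try:
    read_sector(1)          # page 3 still holds b"data1" on the media...
except KeyError:
    pass                    # ...but without the table it is unreachable
```

The point of the sketch: after `ftl.clear()`, `flash` is untouched, yet nothing can be read back, which is why a corrupted FTL looks like total drive death even though the NAND is fine.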
If you want to think of it another way, treat it like the super block on a filesystem, and the filesystem tables. Now imagine they get corrupt - the data is useless and recovery is difficult, even though the underlying media is perfectly fine. It's possible to hose it so badly that recovery is impossible.
For speed, FTL tables are cached - and modern SSDs can easily have 512MB-1GB of DDR memory just to hold the tables. Of course, you can't write-through changes since the tables themselves need to be wear-levelled on the flash media.
One of the iffiest times for this comes when an SSD is power cycled - pulling the power on an SSD can cause corruption because the tables may be in the middle of an update. But things like firmware bugs and other things can easily corrupt the table as well (think a stray pointer scribbling over the table RAM). A good SSD often has extra capacitance onboard to ensure that on sudden power failure, there is enough backup power to do an emergency commit to flash. This protects against power cycling, but firmware bugs can still destroy the data.
Of course, on SSDs without such features the firmware has to be extra careful. And even then, such precautions can leave windows in time where pulling the power is still fatal.
It's sort of reminiscent of that Seagate firmware bug where an internal log file reaching a certain size disabled the drive - the data and media were perfectly fine; it's just that the firmware crapped out.
Re:Intel? (Score:3, Informative)
"So power cycling can apparently trigger this - and the disk for some odd reason (self protection?) decides to decapitate itself and set accessible cylinders down to 16 instead of 16384."
Re:Really? (Score:5, Informative)
There was absolutely no code on his system that wasn't also on anywhere from dozens to thousands of other systems, depending on its age.
Just read TFA: "I had pushed out _most_ of my pulls today". His "pulls" are code that is *elsewhere*. He's just a conduit (and gatekeeper) between a few dozen elsewheres and a server with a fat pipe. And by the construction of the system, it really shouldn't matter how those pulls are ordered. (If there'll be a merge conflict one way round, there'll be a merge conflict in the other permutations too.)
Re:Really? (Score:5, Informative)
What makes you think you can't take FLASH devices and access them in a similar way to platters?
Because on most SSDs, the data is encrypted, and on all SSDs, the pages are in an effectively random order. If you've lost the controller, you've lost both the encryption keys and the table that enables a logical platter-style presentation of the pages. No amount of soldering is going to fix those problems.
Re:Really? (Score:2, Informative)
I thought NSA backed up all our drives.
Controller failure (Score:2, Informative)
So buy a new drive with the same rev boards and swap them out. Problem solved.
Re:Really? (Score:5, Informative)
You do that outside of a cleanroom and your data is gone forever.
False -- I've done it on a number of occasions (to drives I didn't care about), and was able to run the drives for months without their covers. I'd still be using the drives if I had need for drives as small as they were (somewhere in the 80GB range)...
Would I use a drive in this state for something critical? No, but saying you immediately lose the data if you pull a drive cover is just flat wrong.
Re:RAID (Score:5, Informative)
You guys should really look at the --backup and --backup-dir options in rsync.
I use them in conjunction with --delete to always have a "current" copy of the data, along with any old files (ie that have been updated or deleted) in a separate backup folder, named after the current day of the month.
That way you get a directory structure as follows:
01
02
03
04
...
31
Current
You can restore the up-to-date set from Current at any time, and if you want to retrieve a file you deleted or over-wrote five days ago, go look in folder 06.
Re:RAID (Score:5, Informative)
There's the operative phrase. RAID is for systems where you can't have or don't want an hour of downtime while restoring from a backup. The R in RAID stands for redundant. As in you can have a failure and keep going.
Note that this is the converse of "RAID is not a backup!" [smallnetbuilder.com] Just like RAID is not a replacement for a backup, a backup is not a replacement for RAID either. They do different things (and if you're smart, you will also backup your RAID). From your own description, you wanted a backup. RAID was never the correct solution for your needs.
Re:RAID (Score:4, Informative)
Why not do it right [rsnapshot.org]?