Five Years of Data Show That SSDs Are More Reliable Than HDDs Over the Long Haul (arstechnica.com) 82
Backup and cloud storage company Backblaze has published data comparing the long-term reliability of solid-state storage drives and traditional spinning hard drives in its data center. Based on data collected since the company began using SSDs as boot drives in late 2018, Backblaze cloud storage evangelist Andy Klein published a report yesterday showing that the company's SSDs are failing at a much lower rate than its HDDs as the drives age. ArsTechnica: Backblaze has published drive failure statistics (and related commentary) for years now; the hard drive-focused reports observe the behavior of tens of thousands of data storage and boot drives across most major manufacturers. The reports are comprehensive enough that we can draw at least some conclusions about which companies make the most (and least) reliable drives. The sample size for this SSD data is much smaller, both in the number and variety of drives tested -- they're mostly 2.5-inch drives from Crucial, Seagate, and Dell, with little representation of Western Digital/SanDisk and no data from Samsung drives at all. This makes the data less useful for comparing relative reliability between companies, but it can still be useful for comparing the overall reliability of hard drives to the reliability of SSDs doing the same work.
Backblaze uses SSDs as boot drives for its servers rather than data storage, and its data compares these drives to HDDs that were also being used as boot drives. The company says these drives handle the storage of logs, temporary files, SMART stats, and other data in addition to booting -- they're not writing terabytes of data every day, but they're not just sitting there doing nothing once the server has booted, either. Over their first four years of service, SSDs fail at a lower rate than HDDs overall, but the curve looks basically the same -- few failures in year one, a jump in year two, a small decline in year three, and another increase in year four. But once you hit year five, HDD failure rates begin going upward quickly -- jumping from a 1.83 percent failure rate in year four to 3.55 percent in year five. Backblaze's SSDs, on the other hand, continued to fail at roughly the same 1 percent rate as they did the year before.
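(For context on how per-year figures like the 1.83 and 3.55 percent above are typically derived, here is a minimal sketch of an annualized failure rate calculation. The fleet size and failure count are made-up placeholders for illustration, not Backblaze's data or necessarily its exact methodology.)

# Illustrative annualized failure rate (AFR) from drive-days and failure counts.
# The cohort below is a made-up placeholder, not Backblaze data.

def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """AFR as a percentage: failures per drive-year of observed service."""
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years

# Hypothetical cohort: 1,600 boot drives observed for a full year, 29 failures.
print(f"{annualized_failure_rate(failures=29, drive_days=1600 * 365):.2f}%")  # ~1.81%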
Yes. (Score:5, Interesting)
Re:Yes. (Score:4, Insightful)
To be very slightly fair to them, when consumer SSDs were brand new nobody was quite sure what their longevity would be. There were a lot of expectations and testing and such that said they ought to last as long or longer than spinny disks, but that level of trust wasn't there yet. I also remember people saying that when hard drives die, you often have at least some warning and can recover some data from them whereas SSDs tend to just up and disappear one day, but if you're keeping proper backups and such that's not a problem either.
I also have an SSD that's probably 8ish? years old now and still going strong even as my primary boot device for my game machine.
Re:Yes. (Score:4, Interesting)
But... how does Backblaze do a backup? Infinite loop!
Re:Yes. (Score:5, Interesting)
In fact, I think OSs these days expect SSD levels of performance for things like swapping, buffer management, etc... because they are getting way more complex and sophisticated in terms of what they swap, stage, and buffer. It makes more sense to put OSs on SSDs these days, OLTP database applications as well, and definitely web servers. Massive file caches that are rarely accessed, backup, and archival data? I am not so sure.
I would love to see SSDs put in Backblaze's backup pools and compare the write endurance. I know they are more expensive to procure, but in datacenter operations, my assumption was that the power and heat-dissipation savings more than made up for the up-front price difference in TCO. Facebook at one point explained how they mitigated rising power costs by using SSDs they could power down to hold pics that were over a month old and therefore seldom accessed. Just my $0.02
Re:Yes. (Score:4, Insightful)
My anecdotal experience suggests that they failed a LOT faster back in those days, before wear leveling was well understood and before operating systems got better at avoiding unnecessary writes.
Glad yours was better.
But no matter what: back up often, and test those backups!!!
Re: (Score:2)
That's fair. Thinking about it has slowly brought back memories, and I do recall some friends having issues. And by and large they did not have especially good backup strategies either. Mine was also a pretty expensive Samsung, considered one of the top-of-the-line models at the time, which might explain its longevity compared to some others.
Re: (Score:2)
Re:Yes. (Score:5, Interesting)
Oh, they failed a lot sooner back then, but not due to writes. Wear levelling is a fine art, but one that has been perfected so much that the patents on the earliest algorithms have expired. We knew how to wear level - the question was whether or not you wanted to license an algorithm. And we've known this for basically the 30 years since SSDs started coming out (flash memory first appeared in the 80s as an alternative to electrically erasable programmable read-only memory, or EEPROM, to enable faster and higher-density storage; Toshiba was one of the early pioneers, with Intel following shortly afterwards).
It's so well known, it's got its own term - Flash Translation Layer, or FTL. And this was the secret sauce, often heavily patented and many times also a huge trade secret - the good FTLs are highly performant, error tolerant with error recovery, do wear levelling, and handle flash media gracefully as it ages.
The cheap stuff, like cheap memory cards, often used simple algorithms that rapidly wore out the flash chips, leading to premature failure; or, because they were cheap, they often didn't handle failure gracefully, so they just up and died.
The expensive stuff used sophisticated algorithms with error correction and handled bad blocks and other things with grace, and could easily handle wearing and aging.
Linux offers two open-source filesystems for flash as well - JFFS2 (for NOR flash), and YAFFS (for NAND flash), which build in a custom filesystem and FTL and everything (it works a lot better when it's all integrated so you can handle blocks going bad far more gracefully).
No, what killed many SSDs early on was that, in the pursuit of speed, many companies cut corners. Some of the fastest controllers (remember SandForce?) were blisteringly quick, but much of that speed came from cutting corners. One corner commonly cut was power-loss protection - many of the faster controllers kept their FTL tables in RAM for ultimate speed, but if you didn't sync the RAM tables back to flash you could corrupt the disk. Cheap SSD manufacturers often omitted the necessary backup capacitance and still enabled "go fast mode", so a bad power-off could easily corrupt the flash tables. And back in the day the firmware was often not quite stable either, so you occasionally ran into massive data corruption from that too.
Some better ones indicated this with a "limp home" mode where they somehow showed up as an 8MB disk (why 8MB? I don't know). We had several end up like this, and they almost always were resurrected if you used the SATA SECURE_ERASE command, which basically wiped the data storage tables and restarted from scratch. But this may not always work, as sometimes the corrupted tables included the flash wear counters, and losing those often meant the firmware went crazy trying to figure out what it had.
These days, speed isn't an issue - we've maxed out SATA III interface speeds for a long time now (at least 5+ years, probably a decade) - which is why every SATA SSD is basically stuck at around 540MB/sec read and write. And people stopped cheaping out on stuff - especially if you buy name-brand drives from Samsung, Intel, and the like.
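(To make the FTL and wear-levelling discussion above concrete, here is a deliberately tiny, hypothetical sketch: logical writes get redirected to the least-worn free physical block, and a map records where each logical block currently lives. Real FTLs add garbage collection, error correction, and power-loss safety; this reflects no particular vendor's firmware.)

# Toy flash translation layer (FTL) sketch with naive wear levelling.
# Purely illustrative; real SSD firmware is far more involved.

class ToyFTL:
    def __init__(self, num_blocks: int):
        self.erase_counts = [0] * num_blocks   # wear per physical block
        self.logical_to_physical = {}          # logical block -> physical block
        self.free_blocks = set(range(num_blocks))
        self.data = {}                         # physical block -> payload

    def write(self, logical_block: int, payload: bytes) -> None:
        # Wear-levelling step: pick the least-worn free physical block.
        target = min(self.free_blocks, key=lambda b: self.erase_counts[b])
        self.free_blocks.remove(target)
        # The old physical block is erased (counted) and returned to the free pool.
        old = self.logical_to_physical.get(logical_block)
        if old is not None:
            self.erase_counts[old] += 1
            self.free_blocks.add(old)
            self.data.pop(old, None)
        self.data[target] = payload
        self.logical_to_physical[logical_block] = target

    def read(self, logical_block: int) -> bytes:
        return self.data[self.logical_to_physical[logical_block]]

ftl = ToyFTL(num_blocks=8)
for i in range(100):
    ftl.write(0, f"revision {i}".encode())   # hammer a single logical block
print(ftl.read(0), max(ftl.erase_counts))    # wear ends up spread across blocks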
Re: (Score:2)
"To be very slightly fair to them, when consumer SSDs were brand new nobody was quite sure what their longevity would be"
To be fair, a lot of them weren't very good. 3 that I bought between 2009 and 2013 failed within 5 years.
In the past 25 years I've had 5 failed HDDs out of over 40.
Re: (Score:2)
Also very true. Just goes to show the trouble with anecdotes is everybody can have different experiences.
Re: (Score:2)
Purely anecdotal, but since I installed my first SSD on one of the company computers about seven years ago, I've had precisely one SSD fail (and that may have been an onboard controller issue; the failure was pretty much total and not a lost-sectors problem), and in the same time I've probably had to swap out seven or eight hard drives.
Re: (Score:2)
To be very slightly fair to them, when consumer SSDs were brand new nobody was quite sure what their longevity would be.
Really? Because I recall quite a few people who believed they knew exactly what their longevity would be and interjected that "knowledge" into damn near any thread or conversation on the subject.
flash failure mode (Score:2)
I also remember people saying that when hard drives die, you often have at least some warning and can recover some data from them whereas SSDs tend to just up and disappear one day,
I've never witnessed the "disappear one day" situation they describe in recent years.
Personal experience:
- Yes, failing HDDs have very often given some alerts (clicks, specific read failures on some sectors, an increasing rate at which SMART needs to remap bad sectors, etc.), but once they're dead you can't recover anything from them.
(But given that my last machine with spinning rust is running on RAID6: meh... just swap the damaged disk)
- BUT, every failed flash media I've had (SSD, but also
Re: (Score:3)
In a counterpoint, my first SSD was an OCZ Vertex that I RMA'd at least 3 times before I gave up on it because it kept failing repeatedly.
Back then, Windows and Linux weren't optimized for SSDs and the drivers (and the drives themselves) were very hit-and-miss.
I heard many horror stories around that time of catastrophic SSD failures for a variety of reasons.
I waited for years before making another foray into the SSD world and I went with Samsung. I continue to use the same Samsung SSDs that I purchased year
Re: (Score:2)
Ah yes, OCZ. 4 of 4 dead. But I kind of expected that back when. OCZ was pretty shady. With Samsung now basically 9 of 9 in good shape.
Re: (Score:2)
SSDs seem to be one of the things Samsung manages to do well. Not too many of them catch fire. I have an 850 Evo SATA which is still trucking along. I have however disabled paging. I'm not wasting my writes on swap.
Re: (Score:2)
I had an OCZ Vertex *2* that ran for many years. I was very paranoid about it being a sudden complete loss, and I kept very regular backups.
Knock on wood, I have not actually seen an SSD fail, ever.
Re: (Score:2)
"Knock on wood, I have not actually seen an SSD fail, ever."
I have at least 3 you can have. Probably many more if I go through the decomm'ed drives at my employer.
Re:Yes. (Score:4, Insightful)
Just don't buy a Samsung QVO drive. You think you're saving a few currency units, but you'll end up regretting it. Search for reviews on these things.
Re: (Score:2)
Back then, Windows and Linux weren't optimized for SSDs and the drivers (and the drives themselves) were very hit-and-miss.
You are deluding yourself if you think that any OS is optimized for reading and not writing. Just look at your SSD statistics. Or open Task Manager on Windows or iotop on Linux. On an otherwise idle system you will see zero read activity and a fair amount of write activity. Linux does that less than Windows, but both are writing way more than reading. If you think that macOS is better, think again. Apple had its fair share of complaints about the low life of its SSD drives. The bad thing about a modern Mac is that you can
Re: (Score:2)
Or open task manager on windows or iotop on linux. On an otherwise idle system you will see 0 read activity and fair amount of write activity. Linux does that less than windows but both are writing way more than reading.
Just what are they supposed to be reading? Of *course* they should be writing all the time - the disk is where the "permanent" when-the-power-fails-it-comes-back storage is, and all of the state references that need to persist across problems, as well as all of the logging, need to continuously be written to disk. If you only write to disk every ten minutes, then you can only recover to a persisted state at an up to ten minute window. For any remotely serious activity that is staggeringly bad. Let your syst
Re: (Score:2)
The Windows memory system uses a page file. Every half second or so, it writes out dirty memory pages to disk. If there's pressure on system memory, it can just discard pages and immediately give them to whatever process needs them.
If you want to stop the writes, you can disable the page file. Things will work fine as long as you have plenty of RAM - but if you ever run out, Windows will have to kill processes.
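(If you want to check the claim about idle read-versus-write activity yourself, a quick sketch - assuming the third-party psutil package is installed - samples the OS-level disk counters over an idle minute. This just observes; it doesn't settle the page-file argument either way.)

# Sample system-wide disk I/O twice and report bytes read vs. written while idle.
# Requires the third-party psutil package (pip install psutil).
import time
import psutil

before = psutil.disk_io_counters()
time.sleep(60)                      # leave the machine idle for a minute
after = psutil.disk_io_counters()

read_mb = (after.read_bytes - before.read_bytes) / 1e6
written_mb = (after.write_bytes - before.write_bytes) / 1e6
print(f"idle minute: {read_mb:.1f} MB read, {written_mb:.1f} MB written")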
Re: (Score:2)
Re: (Score:2)
My first ones were fast but had really crappy reliability. Then I moved to Samsung. To be fair I basically expected the crappy reliability back when and nothing important was lost.
Re: (Score:2)
Same with my Corsair Force Series F115 Solid-State Disk (SSD) (115 GB; CSSD-F115GB2-BRKT-A) for my old Debian box since 11/24/2011!
Re: (Score:3)
Early SSD drives had terrible controller chips. Intel had very good ones back then, probably the best. OCZ had the worst.
The flash memory rarely failed, but on the bad drives, the controller chip would often fail and brick the drive. SSDs were often a gamble back then unless you did a lot of research into the reliable ones. Nowadays the kinks are worked out. And if you're ever in doubt, Samsung is always a safe choice. They've generally got the best controllers with the lowest odds of data loss if anything
Re: Yes. (Score:1)
SSD reliablility going down? (Score:3)
Re: SSD reliablility going down? (Score:2)
Note that the media failing is basically not at all what causes SSD failure. If the media wears out or has read errors, the firmware has explicit code paths for that. It can handle it by reallocating spare media or by just warning you the device can no longer be written to and making it read-only.
SSD failure is when other parts on the board fail, or when the firmware hits a bug, including in the complex behavior above. None of it can be blamed on the flash itself.
Re: (Score:2)
Not if done right. Modern encoding can make QLC as reliable as SLC. It also makes things slower. If done wrong or cheaply, sure.
Re: (Score:3)
The main issue with QLC and beyond is not that they are significantly less reliable, it's that sometimes you get performance hiccups.
The voltage in the cell decays over time. When the drive is idle it may scan for such sectors and re-write them if needs be. When the drive doesn't have much idle time available it can't do that, and the result is that when it reads a decayed sector it sees that the voltages are off and has to re-read that sector with adjusted voltage thresholds. That causes a delay.
How big of
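(Purely as a conceptual illustration of the read-retry behavior described above - the voltages, thresholds, and the "expected bit" stand-in for ECC are made-up simplifications, nothing vendor-specific: the first read uses the nominal threshold and only falls back to a shifted one when the check fails, which is where the latency hiccup comes from.)

# Conceptual model of charge decay and read-retry with an adjusted threshold.
# All numbers are arbitrary; 'expected_bit' stands in for an ECC check.

NOMINAL_THRESHOLD = 1.0   # volts separating a stored "1" from a "0"
RETRY_THRESHOLD = 0.8     # shifted threshold used only after a failed read

def read_bit(cell_voltage: float, expected_bit: int) -> tuple:
    """Return (bit, reads_needed); the second read models the retry delay."""
    bit = 1 if cell_voltage >= NOMINAL_THRESHOLD else 0
    if bit == expected_bit:                    # stand-in for ECC passing
        return bit, 1
    bit = 1 if cell_voltage >= RETRY_THRESHOLD else 0
    return bit, 2

fresh_cell = 1.2     # recently programmed "1"
decayed_cell = 0.9   # the same "1" after sitting unrefreshed for a while
print(read_bit(fresh_cell, expected_bit=1))    # (1, 1): fast path
print(read_bit(decayed_cell, expected_bit=1))  # (1, 2): retry needed, extra latency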
Re: (Score:2)
The journey from SLC to QLC and beyond surely has an impact on reliability.
Ahh that ol' chestnut. I remember when SSDs came out. Unreliable they said. They proved to be reliable.
Limited writes they said. Proved to be more than capable for even demanding workloads over a long time.
But MLC will change all that they are worse than SLC. Proved to be reliable.
But TLC will change all that they are worse than SLC. Proved to be reliable.
But QLC will change all that... -- you are here --
At this point I'm not sure what's worse. The SSD doomsayers changing their excuses or people still deny
Re: (Score:2)
A lot of the original ones that came out were pretty unreliable. The flash media itself was for the most part fine. It was the controllers that were a buggy, unreliable, hot mess. It took a few years for SSDs to mature enough that I would trust them at the same level as mechanical hard drives not to suddenly lose some or all of the data stored on them.
Re: (Score:2)
if the thing goes, it goes entirely, and no last-ditch revivals.
Citation needed.
Re: One drawback (Score:2)
That's what I've seen. I've never seen an SSD degrade, but then if the firmware reallocates storage I wouldn't notice. What I have seen is total failure: no block device at all, nothing to even pull partial data from.
They do seem reliable now though.
Re: (Score:2)
Re: One drawback (Score:2)
I've heard stories of HDDs failing with a very loud ear busting squeal, and when they were opened up, they found a huge round gash in the platters where the magnetic coating was completely stripped away by the crashed heads.
Personally, I've dealt with clicks of death, "stiction", and rapidly developing bad sectors. Can't say I miss that.
Re: (Score:2)
Depends entirely on what fails. Controller or firmware failure can have it disappear immediately with no recourse. Media or flash failure will either go unnoticed while it reallocates things, or it'll go read-only and refuse to write any more as a sign of failure which gives you time to get stuff off it.
You're right though that spinny disks tend to have a more gradual failure as they go with more warning, but I have had plenty of HDs just up and poof too.
Backblaze doesn't know what they are talking about (Score:2)
"Backblaze doesn't know what they are talking about" [insert my favorite 1-, 3-, or 20-drive, 2-year anecdote here]. Consumer grade, shucking, misleading, my favorite drive vendor, the sales guy told me...
Re: (Score:2)
[insert my favorite 1-, 3-, or 20-drive, 2-year anecdote here]. Consumer grade, shucking, misleading, my favorite drive vendor, the sales guy told me...
Or just point out that this is specific to boot drives, which change less, and not data drives, which generally see a lot of writes. This doesn't take away from their point - that an SSD, with no moving parts, can be more reliable in that situation - but it also doesn't prove that a spinning disk is always less reliable.
I use exactly the same setup myself, but I wouldn't make the same broad statement.
Re: (Score:2)
Re: (Score:2)
Tell that to Windows Update...
Windows? Oh, I thought we were talking about actual servers, not the toy variety.
Different failure mode (Score:2)
Re: (Score:2)
SSDs also are harder to recover from. If a head crashes on a HDD, one might still be able to move the platters to another head in a clean room and pull the data back. If a HDD controller dies, you can swap controllers, and if lucky, get your data.
SSDs, on the other hand, once enough electrons leave the gates that a "1" can't be distinguished from a "0"... that's that. Data is gone. SSDs also tend to fail hard, and the trick of putting the drive in the freezer that -might- just work to allow data to be pul
RE: SSD failures (Score:2)
It was my understanding that SSDs were generally using firmware that locks them into a "read only" mode when they fail? Obviously, you're going to still have your total failures due to a chip blowing out on the controller board or what-not. But that could happen on any conceivable type of mass storage you'd be able to attach to a computer.
I know some early SSDs didn't work this way. (I think Intel may have been the company to start doing it, initially?) But I've seen other SSDs fail in this manner. Allows
Re: (Score:2)
This has been my experience with SSDs over the past 3-5 years. When they fail, the drive controller detects that the flash has passed its lifetime writes or something like that and locks the drive into read-only mode. You can't dump any more data onto the drive, but you can usually recover everything that is on there already.
If the controller fails, then all bets are off, but this is a far less frequent occurrence.
Re: (Score:2)
Re: (Score:3)
My experience is quite different to this - with most SSDs I've seen fail, it's the flash that has failed, and the drive puts itself into read-only mode so that it won't damage the flash with further writes. This allows you to copy the data off the drive. Interestingly, some failed SSDs report to the OS that they are read-only, but others pretend to be a read/write device and silently discard all the writes. At least you can still get back the data that has already been written, though.
If however the contr
I have 30 year old hard drives that work just fine (Score:5, Funny)
Re: (Score:3)
That's too bad, we were thinking you'd be perfect for our new senior backup manager position - but it requires 30 years experience with SSDs.
Re: (Score:2)
but, I don't have a single working 30 year old SSD.
I don't know many 30-year-olds who want to work nowadays. Most of them just want to stay home ("work from home," they say) and view-splurge (aka "binge watch") on Netflix.
But what about for backup? (Score:4, Interesting)
How long will current SSDs retain the data without being attached to power? I've had experiences in the past of them fairly quickly just forgetting everything. And I don't know where to look to find ratings on this.
Just anecdotal .. Sammy over Crucial, any day (Score:2)
Just anecdotal from roughly what... 5 years? Maybe 7?
Samsungs, 0 fails. Crucial, at least six that I know of -- and I don't do helpdesk - those are just the ones I hear about from the helpdesk guys venting their spleens about them.
When I did work at an MSP, all we bought was Samsung Evo because Crucials were crap. That was on paper - company policy. No Crucials. Too many unsatisfied clients.
Samsung is all I have in all my personal machines. Happy with them.
As for the poster that wondered "how long" with
Re: (Score:1)
Of course, the second I submit this comment one of them will blow up, just to spite me..
Okay, so which one blew up right after you submitted the comment above?
Asking for a friend, something about an office betting pool, I didn't ask for details.
Re: (Score:2)
So far, none. But I did have a bushing break on a new aircon vent. Oh well. Not related. Nothing a dab of superglue didn't fix.
Re: (Score:2)
The only SSDs that I have had fail are:
SanDisk due to wonky power-data connectors that break very easily
Mushkin ECO2 made for Newegg that had "silent controller death", corrected by an SSD firmware update that you only learned about if you knew to search for it
My Sammy EVOs and Crucial M500s and Intel (both "consumer" and "Skull series") and Gigabyte SSDs all appear to work fine for many years now. My last few remaining SanDisks (not WDC SanDisks) are still chugging away out there.
SSD failures (Score:5, Insightful)
While SSDs are more reliable, and I love them, they fail hard when they fail. I've recovered data from many failed spinning disks, never recovered a single bit from a failed SSD. Make sure you have good backups.
I guess I got "lucky" Re:SSD failures (Score:1)
I've had hard-deaths, where I couldn't do anything with them.
I've also had "S.M.A.R.T"-imminent-failure or other "this drive is behaving odd, better clone it ASAP then retire it" behavioral-warnings where I was able to recover some or all data before it was too late.
If you've used enough of anything - SSDs included - you will have your own hard-luck stories to tell.
Re: (Score:2)
Yes, this. When I've seen SSDs go, they go completely out.
Which is why I back up semi-religiously. And test recoveries - yes, on my personal stuff too. Have, for decades. I was bitten by data loss in the 90s and won't repeat that.
The days of recovering data from broken drives with things like Zero Assumption Recovery are done
That thing was epic.. even worked on broken RAIDs, to a level that even some dedicated data recovery whose name eludes me couldn't match. Got me a nice fat christmas present once, p
Re: (Score:1)
Same experience here. Only once in 30 years have I had an HDD fail entirely without pre-indications allowing me to recover at least most if not all of the data, and that one instance was on a reboot.
Every SSD I've had fail, although I agree at a lower rate, just falls flat instantly during use with total data loss and zero warning.
I'm not sure that's 'better' even if it is less frequent.
Re: (Score:2)
I've had a number of SSDs in client systems fail, and I've had a couple of my own fail, and in nearly every case the drive has soft failed. The controller puts the drive into read-only mode, and you can still copy data back from the drive.
This can be quite strange to diagnose at first, some controllers tell the OS that they are read-only, but others pretend it's business as usual, and let you write data. They don't actually commit the data to flash, but tell the OS that they have. It can be difficult to wor
Re: (Score:2)
Why recover data? If that's what you're relying on, you're an absolute idiot. You should have good backups completely irrespective of whether you use an SSD or not.
Also SSDs have just as many soft failure modes as HDDs do resulting in the drive becoming read only. HDDs just like SSDs have hard failures too.
Re: (Score:2)
How do your continuous and never-ending backups work? Is every bit copied to another computer as it is produced?
I don't know how your backups work but mine run daily. This leaves time between the last backup and now, where things can be produced and not backed up in that time. There is always something that was not captured in the last backup due to the timing of the backup and the timing of work being done, sometimes those things are important to recover.
Re: (Score:2)
> there is always something that was not captured in the last backup due to the timing of the backup and the timing of work being done, sometimes those things are important to recover.
Yes. Drives are cheap enough that if you care about your data you should:
1) daily backup
2) mirror
3) snapshot
4) offsite
Most of us earn more than an SSD per day and occasionally come up with tremendously more value than that on a good day.
My worry window is now 15 minutes. If I screw up beyond my editor history, I can rollb
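(For anyone wanting a concrete starting point for the snapshot item in that list, here is a minimal sketch built on rsync's --link-dest hard-linking. The paths and schedule are hypothetical placeholders; adapt them and, as always, test your restores.)

# Minimal hard-linked snapshot sketch using rsync --link-dest.
# SOURCE and DEST are hypothetical placeholders; run on a schedule (cron, systemd timer).
import datetime
import pathlib
import subprocess

SOURCE = pathlib.Path.home() / "work"
DEST = pathlib.Path("/mnt/backup/snapshots")

def take_snapshot() -> pathlib.Path:
    stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H%M")
    target = DEST / stamp
    latest = DEST / "latest"
    cmd = ["rsync", "-a", "--delete"]
    if latest.exists():
        # Unchanged files become hard links to the previous snapshot, saving space.
        cmd.append(f"--link-dest={latest.resolve()}")
    cmd += [f"{SOURCE}/", str(target)]
    subprocess.run(cmd, check=True)
    if latest.is_symlink() or latest.exists():
        latest.unlink()
    latest.symlink_to(target)
    return target

if __name__ == "__main__":
    print("snapshot written to", take_snapshot())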
These are boot drives in a data center (Score:2)
First, they are boot drives with low write workloads. Flash error rates climb as the number of erase cycles climbs. A flash page with
These drives are also never power cycled. One of the more demanding parts of an SSD controller design is re-building the "block map" on power up. I would wager that these drives are
So basically, this is the failure rate of the electronics. The hard drives have a motor so they have the electronics plus some mechanical parts.
If y
Re: (Score:2)
Re: (Score:2)
If you take one of these SSDs and write aggressively to it, you can kill every one inside of a few months.
I have experience running SSDs that get completely overwritten every 10-20 days. My experience is that they don't fail before the published number of block writes is exceeded.
Monitor the health via the S.M.A.R.T. data and replace when you get close to the wear limit and they will be reliable.
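(A hedged sketch of that kind of monitoring, assuming smartmontools 7+ with JSON output and an NVMe drive that reports a "percentage_used" field; SATA drives expose wear through vendor-specific ATA attributes instead, so treat the field lookup and the 90% threshold as assumptions to verify against your own hardware.)

# Check NVMe wear via smartctl's JSON output (smartmontools 7+, typically needs root).
# Device path and the 90% replacement threshold are illustrative assumptions;
# SATA drives report wear via vendor-specific ATA attributes instead.
import json
import subprocess

DEVICE = "/dev/nvme0"
REPLACE_AT_PERCENT_USED = 90

def percentage_used(device: str) -> int:
    # smartctl's exit status is a bitmask, so parse stdout rather than relying on it.
    out = subprocess.run(
        ["smartctl", "-A", "--json", device],
        capture_output=True, text=True,
    ).stdout
    health = json.loads(out)["nvme_smart_health_information_log"]
    return health["percentage_used"]

used = percentage_used(DEVICE)
print(f"{DEVICE}: {used}% of rated endurance used")
if used >= REPLACE_AT_PERCENT_USED:
    print("approaching the wear limit -- schedule a replacement")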
Both drive have their place (Score:2)
Better yes, still not a good as an HDD (Score:2)
The most significant factor in life span is the environment:
- If you are in a mobile environment, HDDs have a much more limited lifespan. When I was spec'ing laptops before SSDs became affordable, I'd estimate 3 years for a laptop HDD because o
TBW is not that bad (at least for TLC) (Score:1)
I still would not go with QLC, but TLC drives should be fine for anybody not using it in data-center settings.
Unless they are stored without power (Score:2)
Re: (Score:2)
Re: (Score:2)
Backblaze is awesome (Score:2)
Tangential!
Backblaze provides an awesome backup solution! If you don't have offsite backup, check 'em out.
And I appreciate their periodic drive quality reports.
Not in my experience, not exactly (Score:2)
I can't say they are worse. If anything, it is the same reliability. But recoverability is significantly better on hard drives. When SSDs die they are harder to recover.
Old versus new (Score:2)
I'll guess that drives, like everything else, are getting cheapened out over time, hurting their quality and shortening their longevity.