Google-Backed SSD Endurance Research Shows MLC Flash As Reliable As SLC (hothardware.com) 62
MojoKid writes: Even for mainstream users, it's easy to feel the difference between a PC that boots from a solid state drive and one that boots from a mechanical hard drive. With SSD pricing where it is right now, it's also easy to justify including one in a new build for the speed boost, and there are obvious benefits in the enterprise and data center for both performance and durability. As you might expect, Google has chewed through a healthy pile of SSDs in its data centers over the years, and the company appears to have been one of the first to deploy SSDs in production at scale. New research results Google is sharing via a joint research project encompass six years of SSD use at one of Google's data centers, and they contain both expected and unexpected findings. One of the biggest discoveries is that SLC-based SSDs are not necessarily more reliable than MLC-based drives. This is surprising, since SLC SSDs carry a price premium with the promise of higher durability (specifically in write operations) as one of their selling points. It will come as no surprise that both SSDs and mechanical drives involve trade-offs, but ultimately the benefits SSDs offer often far outweigh those of mechanical HDDs.
Re: (Score:3)
Yep, we need critical details, like whether there is any way to tell that an SSD is about to become totally unreadable. There is some worrying stuff in TFA:
Other results point to the uselessness of the RBER value (raw bit error rate). It was found that there was absolutely no correlation between the number of these warnings and the number of uncorrectable errors that crop up.
Uncorrectable errors are not too bad; at least you can still make a copy of the drive. It's total drive death, when you have to reach for the most recent backup, that we really want to predict.
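For context, RBER is just the fraction of read bits the controller had to correct with ECC. A minimal sketch with hypothetical counter names (real drives expose the underlying numbers, if at all, via vendor-specific SMART logs):

def raw_bit_error_rate(corrected_bit_errors: int, bits_read: int) -> float:
    # RBER: fraction of read bits that needed ECC correction.
    return corrected_bit_errors / bits_read

print(raw_bit_error_rate(corrected_bit_errors=3_200, bits_read=10**12))  # 3.2e-09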
Re: (Score:1)
It's total drive death, when you have to reach for the most recent backup, that we really want to predict.
Flash wear doesn't cause the drive to completely die, only controller failure can do that. Usually it seems to be the controller firmware that fails, rather than the electronics.
Re: (Score:2)
These issues seem to have gone away since the crappy SandForce controllers are no longer in use. It was a firmware issue in the past, not wear. OCZ made defective drives which had no extra capacitors, so they would fry.
However, I did recently throw out a SanDisk Ultra Plus I bought in 2013. It experienced corruption which I thought was my imagination and something I did. I put it in a girlfriend's laptop and noticed Windows reported corruption after a reboot. This could be due to extra RAM as cache th
Re: (Score:2)
Recent Intel drives have a "feature" where once a failed-write counter reaches a certain limit they go into read-only mode, and then brick after the next power cycle. Not exactly ideal behaviour.
Re: (Score:1)
I'm curious if the Intel drives doing this are consumer level or enterprise grade drives. The Intel enterprise grade drives did quite well at a previous job where they were used to handle an insane amount of random I/O hitting them on a constant basis.
Of course, the difference between the two is the capacitors, which hold enough charge to finish the in-flight write transaction, so a hard power-off is less likely to cause the controller to lose its ability to find pages (the SSD equivalent of the thum
Re: (Score:2)
Enterprise drives. They are very much there for caches or RAID use where the loss of a drive won't be critical. It just seems like an odd decision, presumably because once the drive goes read-only the firmware can no longer write its internal metadata back, and has to keep it all in RAM. Crap design IMHO.
Re:Where is the report? (Score:5, Informative)
Okay, thanks to selectspec for posting a link to the report: http://0b4af6cdc2f0c5998459-c0... [rackcdn.com]
The bad news is that Google was using its own custom controllers, so we can't draw any conclusions about different manufacturers, controllers, or error correction techniques. All they look at is the error rates for different types of flash memory and how often their hardware could correct the errors.
For consumers this is likely meaningless. So much depends on the drive controller and the choice of error detection/recovery scheme that it doesn't really help to look at the type of flash alone.
Re: (Score:2)
The problem is not so much the schemes used for error recovery or wear levelling; those are all well understood. The problem is one of QA. Poorly written firmware, unable to handle edge cases (e.g. power failure), bugs in the code, etc. You only need to see the list of firmware updates some drive manufacturers roll out to realise that it's not the hardware causing problems, it's the firmware in the controllers.
Re: (Score:2)
The problem is one of QA. Poorly written firmware, unable to handle edge cases (e.g. power failure)
Nice post, except I would hardly call power failure an "edge case".
Re: (Score:2)
Yes "abnormal" operation could be considered a better word.
Re:Where is the report? (Score:5, Informative)
http://0b4af6cdc2f0c5998459-c0... [rackcdn.com]
Re:Where is the report? (Score:5, Interesting)
Thanks. However, what I read from this study is that SLC is indeed far more reliable, especially over time: the risk of a several-years-old MLC drive developing an uncorrectable error is an order of magnitude or more higher than for a similarly old SLC drive.
Only some less common errors are in the same ballpark, which should not be extrapolated to what the title claims.
Re: (Score:1)
And most consumer-level drives aren't even MLC... but TLC :-/. I don't think there are consumer-level SLC models anywhere. Even the Samsung Pro is MLC, with the Samsung EVO being TLC.
Re: (Score:2)
Er, TLC ("triple-level cell", 8 states) is a form of MLC. MLC is (blindingly obvious from the acronym) "multi-level cell", not "two-level cell" (4 states).
Re: (Score:2)
That makes logical sense, but that is not how the terms are used in industry: MLC in practice means two bits per cell and TLC three. QLC will be the term for four-bit cells, which I believe are not yet on the market.
Re: (Score:2)
And the odd part is that a "triple-level cell" actually has 8 levels (3 bits), so QLC will be 16 levels (4 bits).
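The naming mess is easy to see once you count states: a cell storing n bits must resolve 2^n voltage levels. A quick illustration (just arithmetic, nothing from the paper):

for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4)]:
    print(f"{name}: {bits} bit(s) per cell -> {2 ** bits} voltage states")
# SLC: 1 bit(s) per cell -> 2 voltage states
# MLC: 2 bit(s) per cell -> 4 voltage states
# TLC: 3 bit(s) per cell -> 8 voltage states
# QLC: 4 bit(s) per cell -> 16 voltage states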
Re: (Score:2)
Yup:
To quote the report:
"Revisiting Table 3, we see that this perception is correct
when it comes to SLC drives and their RBER, as they
are orders of magnitude lower than for MLC and eMLC
drives. However, Tables 2 and 5 show that SLC drives do
not perform better for those measures of reliability that
matter most in practice: SLC drives donâ(TM)t have lower repair
or replacement rates, and donâ(TM)t typically have lower
rates of non-transparent errors"
SLC drives have fewer raw errors, but not fewer of the errors that affect real-world usage.
Re: (Score:2)
Thanks. However, what I read from this study is that SLC is indeed far more reliable, especially over time: the risk of a several-years-old MLC drive developing an uncorrectable error is an order of magnitude or more higher than for a similarly old SLC drive.
Only some less common errors are in the same ballpark, which should not be extrapolated to what the title claims.
This paper shows that RBER and UE rates are correlated with P/E cycles, age, and SLC vs. MLC. But this was already known; finding the opposite would have been surprising. I would have expected that as the rated endurance is approached, retention is affected. It would have been interesting to see the RBER and UE rates for the MLC drives past their rated endurance points, to check whether the authors' claim that error rates are linear in P/E cycles still holds past (or at least close to) the rated endurance.
Despite th
Gibberish? (Score:1)
Has anyone tried just submitting a random string of characters? I am fairly certain it would end up posted.
Re:Gibberish? (Score:4, Insightful)
This site has "TechNerd 101" as a prerequisite. If you're reading this site without first completing that course, please speak with your student adviser to discuss your options.
Re: (Score:1)
Worthless.... (Score:3)
Just as worthless as their last "study" on storage reliability, as they do not name manufacturers and models. Research published by Google sucks badly.
Re: (Score:2)
as they do not name manufacturers and models
When you're studying differences from a purely technological point of view, trying to address the perception of SLC vs. MLC, what has the manufacturer got to do with it? People constantly post about the differences between SLC and MLC regardless of which manufacturer makes the drive, so when attempting to study that, naming names would only add distractions from the point.
Yes, it would be nice to know who's doing the best and the worst.
No, that was not at all the point of the study.
Re: (Score:2)
Several points:
1. Manufacturers and models are critical to repeatability and verifiability. As it is, they could have pulled those numbers from their backsides and nobody could tell.
2. There are quite a few SSDs out there that had problems in the relevant time-of-purchase span. For example, OCZ had much higher failure rates in a number of models. Without knowing whether any of those (and how many) were in the sample, you do not get a realistic picture, as you are comparing devices of different maturit
Intel sucks at SSDs (Score:3)
All I know is that even Intel can't make a decent SSD. The first SSD I bought was their Intel SSD 530 Series 120GB and I've never been able to use the damn thing. I've tried it in two computers, a Mac mini 2010 and a DIY PC with a recent motherboard, and in both of them the drive just won't boot after a warm reset. Even after all these years, Intel hasn't published a firmware update to fix the problem.
Re: (Score:2)
Are you sure that's not just a defective drive? I've put the same SSD in a MacBook Pro 13 2011 and some random Toshiba laptop (Windows 8.1) for my sisters-in-law, both with the 240GB version of the drive. They seem to work perfectly fine and have been running for a couple of years without issues.
Re: (Score:2)
That particular problem with the 530 Series has been known for years.
Intel says it's a problem with Macs.
Apple says it's a problem with the drive.
Re: (Score:2)
It's almost certainly Intel's fault. Some of their SSDs do not follow the SATA spec properly on reset, which can cause the initial probe to fail with a timeout. If you probe a second time it will succeed. I actually had to add a second probe to DragonFlyBSD's AHCI driver to work around the problem. It doesn't seem to be related to startup time; even with a long delay I'll see first-probe failures on Intel SSDs in various boxes.
Strangely enough the failures occur with Intel AHCI chipsets + Intel SSDs, but
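The real fix is C inside DragonFlyBSD's AHCI driver; purely as an illustration of the retry idea (not the actual driver code):

import time

def probe_with_retry(probe, retries=1, delay_s=0.1):
    # Work around devices that time out on the first bus probe:
    # try once more before declaring the device absent.
    for attempt in range(retries + 1):
        try:
            return probe()
        except TimeoutError:
            if attempt == retries:
                raise
            time.sleep(delay_s)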
Re: (Score:2)
The same problem also happens with nVidia chipsets, such as the MCP89 in my Mac mini 2010.
Re: (Score:2)
Nope, they haven't. This article talks about firmware DC12, but the most recent firmware is DC33 and people are still having the same problem with these drives.
Re: (Score:2)
Tell you what, I'd do a no-questions-asked exchange for an OCZ Vertex drive for you. How does that sound?
Yep, I'll take one for the team.
If Google knows this... (Score:3)
...then it would stand to reason that other storage vendors mostly know this, too.
So why aren't there more MLC-based flash arrays, especially all-flash models? For storage capacities under 24 TB raw, it would be pretty price-competitive with HDD while producing a storage device with insane I/O potential.
Re: (Score:2)
No it wouldn't/isn't. Not even close.
Re: (Score:1)
Because flash is expensive compared to spinning rust.
24TB of hard drive storage can be had for maybe $1000 or so, 4 x 8TB hard drives in a RAID5 style array.
A 1TB SSD runs around $400. 24TB of that is $96K, raw storage. Maybe you can get a bulk discount and pay $60k.
Sure, you can buy 2/4 TB SS
Re: (Score:2, Informative)
Your math is off... 400*24=$9600, not 96K.
Re: (Score:2)
First off, your math is way off -- 24 x $400 is $9600.
Secondly, nobody would build a 24 TB array with 4x8TB in RAID 5. The risk of data loss during a rebuild is too high, and it would provide so little I/O that it would be all but useless for anything but low-access archiving.
A better comparison for disks would be 1 TB 15k SAS, and these retail for $225, so the math on disk cost alone is a lot more competitive.
It becomes more competitive when you look at the performance -- 24 SSDs would give you close
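Putting the corrected numbers side by side (street prices as assumed in this thread):

ssd_cost = 24 * 400   # 24 x 1TB SSDs at ~$400 each
sas_cost = 24 * 225   # 24 x 1TB 15k SAS drives at ~$225 each
print(f"SSD array, raw: ${ssd_cost:,}")  # $9,600 -- not $96K
print(f"15k SAS, raw:   ${sas_cost:,}")  # $5,400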
Re: (Score:2)
So why aren't there more MLC based flash arrays
What companies are you referring to? I just installed an EMC VNX2 with a tier of MLC flash, which uses FAST VP. Nimble's arrays also use MLC flash [nimblestorage.com] - not eMLC, MLC:
Today's SSDs degrade when burdened with continual patterns of random writes. When SSDs receive random writes, the write activity within the SSD is greater than the actual number of writes. This write amplification dramatically increases the number of write cycles that the SSD must support. Multi-level cell (MLC) flash is typically not suitable for traditional storage systems because it can only endure 5,000 to 10,000 write cycles. Instead, traditional systems must use single-level cell (SLC) SSDs and will soon begin using enterprise multi-level cell (eMLC) SSDs. SLC and eMLC technologies can endure up to 100,000 write cycles, but cost 4 to 6 times more than traditional MLC flash.
Nimble Storage approaches the problem of write amplification differently. The CASL file system is optimized to aggregate a large number of random writes into sequential I/O stripes. It only writes to flash in multiples of full-erase block width sizes. As a result, write amplification is minimized, allowing the use of lower-cost MLC SSDs.
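This isn't Nimble's actual code, but the log-structured idea they describe reads roughly like this sketch (the class name and the 2 MiB erase-block size are illustrative assumptions):

ERASE_BLOCK = 2 * 1024 * 1024  # assume a 2 MiB flash erase block

class StripeWriter:
    # Buffer incoming random writes and flush to flash only in full
    # erase-block-sized sequential stripes, avoiding the partial-block
    # rewrites that cause write amplification.
    def __init__(self, flush_stripe):
        self.buf = bytearray()
        self.flush_stripe = flush_stripe  # callable that writes one full stripe

    def write(self, data: bytes):
        self.buf += data
        while len(self.buf) >= ERASE_BLOCK:
            self.flush_stripe(bytes(self.buf[:ERASE_BLOCK]))
            del self.buf[:ERASE_BLOCK]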
Re: (Score:2)
I know that Compellent uses MLC in their flash tiers too, but they refer to it as "read intensive", and in the certification class it was explained that it's only used for cache reads.
Re: (Score:2)
holding down the on/off button to improperly shut the computer off when it freezes
There's an improper way to shut down an unresponsive computer? What about systems without reset buttons?
SSD pile growing, HDD pile shrinking. (Score:2)
Calculate the cost of the replacement cycle too and suddenly SSDs look a lot cheaper. It's just that most people can't think beyond the end of their noses, so if the up-front cost looks expensive they stop right there.
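A toy version of that replacement-cycle math, with made-up prices and lifespans:

import math

def cost_over(horizon_years, unit_price, lifespan_years):
    # Total spend if the drive is replaced at the end of each lifespan.
    return unit_price * math.ceil(horizon_years / lifespan_years)

print(cost_over(10, unit_price=120, lifespan_years=3))  # HDD: 480 over 10 years
print(cost_over(10, unit_price=200, lifespan_years=8))  # SSD: 400 over 10 years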
I bought my last HDDs last year: two 4TB 'archival' drives for backups. My existing pile of new 1TB and 2TB HDDs (I have around a dozen 3.5" and half a dozen 2.5" left) will be dribbled out as needed, but I won't be buying any new HDDs from now on. In fact, I couldn't even foist off some of