Disk Failure Rates More Myth Than Metric 283
Lucas123 writes "Mean time between failure (MTBF) ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years, but study after study shows that those metrics are inaccurate for determining hard drive life. One study found that some disk drive replacement rates were greater than one in 10, nearly 15 times what vendors claim, and all of these studies show failure rates growing steadily with the age of the hardware. One former EMC employee turned consultant said, 'I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.'"
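For a sense of scale, here is a minimal back-of-the-envelope sketch (Python, assuming the usual constant-failure-rate model behind MTBF figures) that converts a quoted MTBF into years and into an annualized failure rate (AFR), which can then be compared against the replacement rates the studies report:

# Back-of-the-envelope: convert a quoted MTBF into years and an
# annualized failure rate (AFR), assuming a constant failure rate.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def mtbf_to_years(mtbf_hours):
    return mtbf_hours / HOURS_PER_YEAR

def mtbf_to_afr(mtbf_hours):
    # Expected fraction of drives failing per year under this model.
    return HOURS_PER_YEAR / mtbf_hours

for mtbf in (1_000_000, 1_500_000):
    print(f"MTBF {mtbf:,} h -> {mtbf_to_years(mtbf):.0f} years, "
          f"AFR {mtbf_to_afr(mtbf) * 100:.2f}%")

# A 1,000,000 h MTBF implies an AFR below 1%; the studies cited above
# observed replacement rates above 10% in some drive populations.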
Misunderstanding MTBF (Score:5, Interesting)
MTBF numbers are generated by running, say, thousands of hard drives of the same model and batch/lot and seeing how long it takes before one fails. This may be a day or so. You then figure out how many total HD running hours accumulated before that failure. If you have 1,000 HDs running and it takes 40 hours before one fails, that's a 40,000 hr MTBF. This number is not generated by running, say, 10 hard drives, waiting for all of them to fail, and averaging their lifetimes.
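A quick sketch of that fleet-test arithmetic (hypothetical numbers, just to make it concrete):

# Hypothetical fleet test: many identical drives run simultaneously, and
# "MTBF" is total accumulated drive-hours divided by failures observed.

def fleet_mtbf(num_drives, hours_run, failures):
    total_drive_hours = num_drives * hours_run
    return total_drive_hours / failures

# 1,000 drives, first failure after 40 hours -> 40,000 h "MTBF"
print(fleet_mtbf(num_drives=1000, hours_run=40, failures=1))  # 40000.0

# Note this is NOT the same as running a handful of drives to
# end-of-life and averaging their individual lifetimes.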
Thus, because of the way MTBF numbers are generated, they may or may not reflect hard drive reliability beyond a few weeks. It depends on our assumptions about hard drive stress and usage beyond the point at which the first of the 1,000 or so drives under test failed. Most likely, they say less and less about hard drive reliability beyond that initial point of failure (which is on the order of tens or hundreds of hours, not hundreds of thousands or millions of hours!).
To be sure, all else being equal, a higher MTBF is better than a lower one. But as far as I'm concerned, those numbers are more useful for predicting DOA units, duds, or quick failures, and are more useful to professionals who run large arrays of HDs. They are not particularly useful for getting a good idea of how long your HD will actually last.
HD manufacturers also publish an expected service life for their drives, but I usually put the most stock in the length of the warranty; that's what they're willing to put their money behind. Admittedly, their strategy may simply be to set the warranty shorter than the period they expect 90% of drives to survive, so they can sell them cheaper. But if you've had a drive longer than the manufacturer's published expected life, what they're telling you is that you've gotten good value, and you'll probably want a replacement on hand and your data backed up.
Temperature is the key (Score:5, Interesting)
Here is an example from my server. At 18C ambient, in a well-cooled and well-designed case with dedicated hard drive fans, the Maxtors I use for RAID1 run at 29C. My media server, which is in the loft with sub-16C ambient, runs them at 24-34C depending on their position in the case (once again, a proper high-end case with dedicated hard drive fans).
Very few hard disk enclosures can bring the temperature down to 24-25C.
SANs or high density servers usually end up running disks at 30C+ while at 18C ambient. In fact I have seen disks run at 40C or more in "enterprise hardware".
Given that, it is not surprising that they fail at a rate different from the quoted one. In fact, I would have been very surprised if they matched it.
Re:MTBF For Unused Drive? (Score:5, Interesting)
On average, a disk drive can last as long as the MTBF number. What are the chances that you have an average drive? They are slim. Each component in the drive, every resistor, every capacitor, every part has an MTBF. They also have tolerance values: that is to say, they are manufactured to a value with a given tolerance of accuracy. Each tolerance has to be accounted for, as one component out of tolerance can cause failure of complete sections of the drive itself. When you start calculating that kind of thing it becomes similar to an exercise in calculating safety on the space shuttle... damned complex in nature.
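As a rough illustration of why those component numbers pull the total down (a sketch with invented component figures, assuming a simple series system with constant failure rates):

# Sketch: in a series system with constant failure rates, component
# failure rates add, so the system MTBF is the reciprocal of the sum of
# the reciprocals of the component MTBFs. All figures below are made up.

component_mtbfs = {
    "spindle_motor": 3_000_000,   # hours (hypothetical)
    "voice_coil":    4_000_000,
    "controller_ic": 2_500_000,
    "head_assembly": 1_800_000,
    "psu_regulator": 5_000_000,
}

system_failure_rate = sum(1.0 / m for m in component_mtbfs.values())
system_mtbf = 1.0 / system_failure_rate

print(f"System MTBF: {system_mtbf:,.0f} hours")
# Every added part drags the system MTBF below the weakest component's,
# and that's before any out-of-tolerance interactions are considered.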
The tests remain valid because of a simple fact. In large data centers, where you have large quantities of the same drive spinning through the same lifecycle, you will find that a percentage of them fail within days of each other. That means there is a valid measurement of the parts in the drive and how they will stand the test of life in a data center.
Is your data center an 'average' life for a drive? The accelerated lifecycle tests cannot tell you. All the testing does is look for failures of any given part over a number of power cycles, hours of use, etc. It is quite improbable that your use of the drive will match that of the accelerated testing lifecycle.
The MTBF is a good estimation of when you can be certain of a failure of one part or another in your drive. There is ALWAYS room for it to fail prior to that number. ALWAYS.
Like any electronic device for consumers, if it doesn't fail in the first year, it's likely to last as long as you are likely to be using it. Replacement rates of consumer societies mean that manufacturers don't have to worry too much about MTBF as long as it's longer than the replacement/upgrade cycle.
If you are worried about data loss, implement a good data backup program and quit worrying about drive MTBFs.
Recycle, don't just dump it! (Score:1, Interesting)
Probably some nice magnets inside to play with too.
Re:Temperature is the key (Score:5, Interesting)
Google says that's just not what they've seen [google.com]. "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at the very high temperatures is there a slight reversal of this trend."
On the graph it's clear that 30-35C is best at three years. But up until then, 35-40C has lower failure rates, and both have much lower rates than the 15-30C range.
WD Green drive - marketing invention (Score:1, Interesting)
One difference is that the voice coil motor that pushes the head back and forth on seeks runs slower on Samsung drives, but quieter and at lower power. Samsung drives generally have a reputation for being lower power; that has been one differentiating factor between Samsung and Seagate, Fujitsu, and Western Digital. However, an even bigger difference is the number of disks in the drive. The more disks, the harder all of the motors have to work.
There are differences from model to model within vendors as well. For each new model of hard drive you have a custom designed motor, enclosure, ICs, media, etc. The technology is moving so fast it is hard to follow. The current generation is the 1TB disks.
One funny example is that right now Western Digital is pushing their so-called "Green" 5400 rpm drives. Running at 5400 rpm does indeed use less power -- but they didn't set out to make a low power drive. Engineering was simply unable to get their 1TB drive to work at the higher performance 7200 rpm. So, they marketed it as a "green" drive, and had a huge success!
Re:Never had a drive *not* fail. (Score:3, Interesting)
One drive, 24x7, approx. 12 years. Seagate. Why?
Re:warranties (Score:1, Interesting)
I see the useful life of most drives as 3-5 years, and the drive supplier is going to cover failures inside that 5-year useful life. Most folks replace obsolete gear as new hardware becomes available anyway. The idea that a drive could actually have a million-hour MTBF is just fantasy. I see lots of failures with less than 8,000 hours (about a year) on the drive. Those are certainly outside the "early life failure" category; they are just worn out. Lots of drives have defects the user doesn't even know about, because they don't pull SMART reports or don't analyze what the report says. I see drives all the time with only a few hundred hours on them that I wouldn't install in my system.
MTBF specs for hard drives are a marketing ploy.
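For what it's worth, here is a minimal sketch of pulling those SMART numbers yourself with smartmontools (it assumes smartctl is installed and that /dev/sda is the right device path for your system; attribute names vary by vendor):

# Minimal sketch: read power-on hours and reallocated sector count from
# smartctl's attribute table. Treat this as illustrative, not robust.

import subprocess

def smart_attributes(device="/dev/sda"):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    attrs = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; RAW_VALUE is column 10.
        if fields and fields[0].isdigit() and len(fields) >= 10:
            attrs[fields[1]] = fields[9]
    return attrs

attrs = smart_attributes()
print("Power-on hours:", attrs.get("Power_On_Hours"))
print("Reallocated sectors:", attrs.get("Reallocated_Sector_Ct"))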
Re:Never had a drive fail (Score:3, Interesting)
Of course, yonder is a large stack of backups, which also help increase HD longevity.