Data Storage Hardware

Disk Failure Rates More Myth Than Metric

Lucas123 writes "Mean time between failures (MTBF) ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years, but study after study shows that those metrics are inaccurate for determining hard drive life. One study found that some disk drive replacement rates were greater than one in 10, nearly 15 times what vendors claim, and all of these studies show failure rates growing steadily with the age of the hardware. One former EMC employee turned consultant said, 'I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.'"
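
The unit arithmetic behind those figures is easy to check. Here is a minimal sketch in Python (the MTBF and replacement-rate numbers are the ones quoted above; everything else is unit conversion):

    # Convert a vendor MTBF claim into years and into the annualized
    # failure rate (AFR) it implies, then compare with the observed rate.

    HOURS_PER_YEAR = 8766  # 365.25 days x 24 hours

    def mtbf_years(mtbf_hours: float) -> float:
        return mtbf_hours / HOURS_PER_YEAR

    def implied_afr(mtbf_hours: float) -> float:
        """Fraction of drives expected to fail per year at this MTBF."""
        return HOURS_PER_YEAR / mtbf_hours

    observed = 0.10  # "greater than one in 10" from the studies cited

    for mtbf in (1_000_000, 1_500_000):
        print(f"MTBF {mtbf:>9,} h = {mtbf_years(mtbf):.0f} years, "
              f"implied AFR {implied_afr(mtbf):.2%}, "
              f"observed rate is {observed / implied_afr(mtbf):.0f}x that")

    # Prints roughly 11x and 17x, which brackets the summary's
    # "nearly 15 times" figure.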
This discussion has been archived. No new comments can be posted.


  • by dh003i ( 203189 ) <dh003i@gmail. c o m> on Saturday April 05, 2008 @03:55PM (#22974654) Homepage Journal
    I think a lot of people misunderstand MTBF. An HD might have an MTBF of 100 years. This doesn't mean the company expects the vast majority of consumers to have that HD running for 100 years without problems.

    MTBF numbers are generated by running, say, thousands of hard drives of the same model and batch/lot and seeing how long it takes before one fails. This may be a day or so. You then figure out how many total HD running hours the fleet accumulated before that failure. If you have 1,000 HDs running and it takes 40 hours before one fails, that's a 40,000-hour MTBF. The number is not generated by running, say, 10 hard drives, waiting for all of them to fail, and averaging their lifetimes. (A toy version of this calculation appears at the end of this comment.)

    Thus, because of the way MTBF numbers are generated, they may or may not reflect hard drive reliability beyond a few weeks. It depends on our assumptions about hard drive stress and usage beyond the time at which the first HD of the 1,000 or so under test failed. Most likely, they say less and less about hard drive reliability beyond that initial point of failure (which is on the order of tens or hundreds of hours, not hundreds of thousands or millions of hours!).

    To be sure, all else equal, a higher MTBF is better than a lower one. But as far as I'm concerned, those numbers are most useful for predicting DOA units, duds, or quick failures, and most useful to professionals who might be employing large arrays of HDs. They are not particularly useful for getting a good idea of how long your HD will actually last.

    HD manufacturers also publish an expected life cycle for their HDs, but I usually put the most stock in the length of the warranty. That's what they're willing to put their money behind. Admittedly, their strategy may just be to set the warranty shorter than how long they expect 90% of HDs to last, so they can sell them cheaper. But if you've had an HD for longer than the manufacturer's published expected life, you've basically gotten good value out of it, and you'll probably want to have a replacement on hand and your data backed up.
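
    A toy version of the fleet calculation described above, in Python (the 1,000-drive, 40-hour figures are this comment's own example):

        # MTBF as computed from a fleet test: total accumulated
        # device-hours divided by the number of failures observed so far.

        def fleet_mtbf(num_drives: int, hours_run: float, failures: int) -> float:
            return (num_drives * hours_run) / failures

        # 1,000 drives, first failure at the 40-hour mark:
        print(fleet_mtbf(num_drives=1_000, hours_run=40, failures=1))  # 40000.0

        # Note what the number does NOT capture: nothing was observed past
        # hour 40, so a large MTBF can come out of a few days of fleet time
        # and says little about how long any one drive will keep running.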
  • by arivanov ( 12034 ) on Saturday April 05, 2008 @03:55PM (#22974660) Homepage
    Disk MTBF is quoted for 20°C.

    Here is an example from my server. At 18°C ambient, in a well-cooled and well-designed case with dedicated hard drive fans, the Maxtors I use for RAID 1 run at 29°C. My media server, which is in the loft with sub-16°C ambient, runs them at 24-34°C depending on their position in the case (once again, a proper high-end case with dedicated hard drive fans).

    Very few hard disk enclosures can bring the temperature down to 24-25°C.

    SANs and high-density servers usually end up running disks at 30°C+ even at 18°C ambient. In fact, I have seen disks run at 40°C or more in "enterprise hardware".

    Given all that, it is not surprising that they fail at a rate different from the quoted one. In fact, I would have been very surprised if they matched it.
  • by zappepcs ( 820751 ) on Saturday April 05, 2008 @03:56PM (#22974664) Journal
    The problem is that MTBF is calculated on an accelerated life-cycle test schedule. Real-world use does not actually behave like the accelerated test expanded back out to one test day per real day. It is an approximation, and prone to error because of the aggregated averages the test produces.

    On average, a disk drive can last as long as the MTBF number. But what are the chances that you have an average drive? They are slim. Each component in the drive, every resistor, every capacitor, has its own MTBF. Components also have tolerance values; that is to say, they are manufactured to a value within a given tolerance of accuracy. Each tolerance has to be accounted for, since one component out of tolerance can cause failure of complete sections of the drive. When you start calculating that kind of thing, it becomes similar to an exercise in calculating safety on the Space Shuttle: damned complex in nature. (A sketch of the component arithmetic follows this comment.)

    The tests remain valid because of a simple fact. In large data centers, where you have large quantities of the same drive spinning through the same life cycles, you will find that a percentage of them fail within days of each other. That means there is a valid measurement of the parts in the drive and of how they will stand the test of life in a data center.

    Is your data center giving a drive an "average" life? The accelerated life-cycle tests cannot tell you. All the testing does is look for failures of any given part over some number of power cycles, hours of use, and so on. It is quite improbable that your use of the drive will match the expanded test life cycle.

    The MTBF is a decent estimate of when you can expect one part or another of your drive to have failed. There is ALWAYS room for it to fail before that number. ALWAYS.

    Like any consumer electronic device, if it doesn't fail in the first year, it's likely to last as long as you'll be using it. The replacement rates of consumer societies mean that manufacturers don't have to worry much about MTBF as long as it's longer than the replacement/upgrade cycle.

    If you are worried about data loss, implement a good data backup program and quit worrying about drive MTBFs.
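
    To make the component argument above concrete: for independent parts in series (any single failure disables the drive), failure rates add, which is why a drive's MTBF sits below that of its weakest part. A sketch in Python with hypothetical component figures (the part names and numbers are illustrative, not from any datasheet):

        # Series-system reliability: with independent components where any
        # one failure takes the drive down, the failure rates (1/MTBF) sum.

        def system_mtbf(component_mtbfs: list[float]) -> float:
            return 1.0 / sum(1.0 / m for m in component_mtbfs)

        # Hypothetical per-component MTBFs in hours (illustrative only):
        parts = {
            "spindle motor":  3_000_000,
            "voice coil":     5_000_000,
            "head assembly":  2_000_000,
            "controller PCB": 4_000_000,
        }
        print(f"{system_mtbf(list(parts.values())):,.0f} hours")
        # ~779,000 hours: below the weakest single part, and every extra
        # resistor and capacitor pulls the system figure down further.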
  • by Anonymous Coward on Saturday April 05, 2008 @04:02PM (#22974702)
    He should look at the escalating price of gold, too. The older the computer component, the more gold in the connectors and the thicker the gold on the traces. Not to mention other precious metals involved in some of the components, such as platinum and palladium. Perhaps the greatest consideration should be given to the fact that it would increase the heavy-metal pollution at the dump it goes to.

    Probably some nice magnets inside to play with too. :P
  • by ABasketOfPups ( 1004562 ) on Saturday April 05, 2008 @04:10PM (#22974756)

    Google says that's just not what they've seen [google.com]. "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at the very high temperatures is there a slight reversal of this trend."

    On the graph it's clear that 30-35°C is best at three years. But up until then, 35-40°C has lower failure rates, and both have much lower rates than the 15-30°C range.
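
    For anyone wanting to run the same kind of analysis on their own fleet, here is a sketch of the binning in Python (the fleet records are made up; the temperature bands follow the ones discussed above):

        # Group drives into temperature bands and compute the observed
        # failure fraction per band, in the spirit of the Google plot.
        from collections import defaultdict

        # Hypothetical fleet records: (average drive temp in C, failed?)
        fleet = [(22, True), (24, True), (28, False), (31, False),
                 (33, False), (37, False), (38, False), (45, True)]

        BANDS = [(15, 30), (30, 35), (35, 40), (40, 50)]

        tally = defaultdict(lambda: [0, 0])  # band -> [failures, drives]
        for temp, failed in fleet:
            for band in BANDS:
                if band[0] <= temp < band[1]:
                    tally[band][0] += failed
                    tally[band][1] += 1
                    break

        for (lo, hi), (fails, total) in sorted(tally.items()):
            print(f"{lo}-{hi}C: {fails}/{total} failed ({fails / total:.0%})")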

  • by gelfling ( 6534 ) on Saturday April 05, 2008 @04:22PM (#22974834) Homepage Journal
    But since 1981 I have had exactly zero catastrophic PC drive crashes. That's not to say I haven't seen some bad/relocated sectors, but hard failures? None. Granted, that's only 20 drives. But in fact, in my experience with PCs, midranges, and mainframes over almost 30 years, I have seen zero hard drive crashes.
  • by commodoresloat ( 172735 ) * on Saturday April 05, 2008 @04:56PM (#22975046)

    If everyone knows how much a disk drive costs, and nobody can find out how long a disk drive really will last, there is no way the marketplace can reward the vendors of durable and reliable products.
    And that may be the exact reason why the vendors are providing bad data. On the flip side, however, if people knew how often drives failed, perhaps we'd buy more of them in order to always have backups.
  • by Depili ( 749436 ) on Saturday April 05, 2008 @05:09PM (#22975112)
    The Deathstars were all 80 GB PATA disks manufactured by a single plant. I had 8 of them; all failed.
  • by Rosy At Random ( 820255 ) on Saturday April 05, 2008 @05:43PM (#22975302) Homepage
    Am I the only one who wants to hear more about the drive that went ballistic?
  • by Anonymous Coward on Saturday April 05, 2008 @05:51PM (#22975350)
    Disclaimer: I work in Samsung's 3.5" HDD lab.

    One difference is that the voice coil motor that pushes the head back and forth on seeks runs slower on Samsung drives, but quieter and at lower power. Samsung drives generally have a reputation for being lower power; that has been one differentiating factor between Samsung and Seagate, Fujitsu, and Western Digital. An even bigger difference, however, is the number of disks in the drive: the more disks, the harder all of the motors have to work.

    There are differences from model to model within vendors as well. Each new model of hard drive gets a custom-designed motor, enclosure, ICs, media, etc. The technology is moving so fast it is hard to follow; the current generation is the 1TB disks.

    One funny example: right now Western Digital is pushing their so-called "Green" 5400 rpm drives. Running at 5400 rpm does indeed use less power, but they didn't set out to make a low-power drive. Engineering was simply unable to get their 1TB drive to work at the higher-performance 7200 rpm, so they marketed it as a "green" drive, and it was a huge success!
  • by dgatwood ( 11270 ) on Saturday April 05, 2008 @06:02PM (#22975404) Homepage Journal

    One drive, 24x7, approx. 12 years. Seagate. Why?

  • Re:warranties (Score:1, Interesting)

    by Anonymous Coward on Saturday April 05, 2008 @06:29PM (#22975538)
    Interesting ideas about hard drive reliability here. I spend many hours each month looking at hard drive performance as part of my work; my job is to qualify drives (and other devices) for our servers. I also have a large volume of drives in the lab and in the field to monitor.

    I see the useful life of most drives as 3 to 5 years, and the drive supplier is going to cover failures inside that 5-year useful life. Most folks are replacing the obsolete gear as new hardware becomes available anyway. The idea that a drive could actually have a million-hour MTBF is just fantasy. I see lots of failures with less than 8,000 hours (about a year) on the drive; those are certainly outside the "early life failure" category. They are just worn out. Lots of drives have defects the user doesn't even know about, because the drives don't generate SMART reports or the user doesn't analyze what the report says (a sketch of reading those counters follows below). I see drives all the time with only a few hundred hours on them that I wouldn't install in my own system.

    MTBF specs for hard drives are a marketing ploy.
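
    On the SMART point: those counters are there for the reading. A minimal sketch using smartmontools' smartctl to pull a few of the standard ATA attributes (the device path is an example; the attribute table layout varies by drive and smartctl version):

        # Read a few telling SMART attributes with smartctl (smartmontools).
        # Typically needs root; /dev/sda is an example device path.
        import subprocess

        WATCH = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
                 "Power_On_Hours"}

        output = subprocess.run(
            ["smartctl", "-A", "/dev/sda"],
            capture_output=True, text=True,
        ).stdout

        for line in output.splitlines():
            fields = line.split()
            # Attribute rows: ID# NAME FLAG VALUE ... RAW_VALUE (last col)
            if len(fields) >= 10 and fields[1] in WATCH:
                print(f"{fields[1]}: raw = {fields[-1]}")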

  • by monkaru ( 927718 ) on Saturday April 05, 2008 @07:22PM (#22975806)
    *laughs* Redundancy exists for a purpose. I ALWAYS assume hardware will fail. It does, you know. I guess that's why I still have the data from my 1966 756-byte Multics terminal account.
  • by Reziac ( 43301 ) * on Sunday April 06, 2008 @12:09AM (#22977462) Homepage Journal
    I live where the power spikes and sags constantly, so my machines are all on UPSes, and each PC has a decent-quality PSU. And if an HD runs more than "pleasantly warm" to the touch, it gets its own dedicated fan. Consequently, I firmly believe all HDs are supposed to live A Long Time... the oldest of my 24/7 HDs right now is 10 years old and has about 80,000 actual hours on it. Like yourself, I think they're supposed to be worn out before being thrown out. :)

    Of course, yonder is a large stack of backups, which also help increase HD longevity. ;)
