Data Storage Hardware

Disk Failure Rates More Myth Than Metric

Lucas123 writes "Mean time between failure (MTBF) ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years, but study after study shows that those metrics are inaccurate for determining hard drive life. One study found that some disk drive replacement rates were greater than one in 10, nearly 15 times what vendors claim, and all of these studies show failure rates grow steadily with the age of the hardware. One former EMC employee turned consultant said, 'I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.'"
This discussion has been archived. No new comments can be posted.

  • by loki969 ( 880141 ) on Saturday April 05, 2008 @03:35PM (#22974538)
    ...those that make backups and those that never had a hard drive fail.
  • by dpbsmith ( 263124 ) on Saturday April 05, 2008 @03:38PM (#22974554) Homepage
    If everyone knows how much a disk drive costs, and nobody can find out how long a disk drive really will last, there is no way the marketplace can reward the vendors of durable and reliable products.

    The inevitable result is a race to the bottom. Buyers will reason they might as well buy cheap, because they at least know they're saving money, rather than paying for quality and likely not getting it.
  • by Raineer ( 1002750 ) on Saturday April 05, 2008 @03:44PM (#22974590)
    I see it the other way... Once I start taking backups, my HDDs never fail; it's when I forget that they crash.
  • by Anonymous Coward on Saturday April 05, 2008 @03:44PM (#22974594)
    Drive failures are actually fairly common, but usually the failures are due to cooling issues. Given that most PCs aren't really set up to ensure decent hard drive cooling, it is probable that the failure ratings are inflated due to operation outside of the expected operational parameters (which are probably not conservative enough for real usage). In my opinion, if you have more than a single hard drive closely stacked in your case you should have some sort of hard drive fan.
  • warranties (Score:5, Insightful)

    by qw0ntum ( 831414 ) on Saturday April 05, 2008 @03:45PM (#22974602) Journal
    The best metric is probably going to be the length of warranty the manufacturer offers. They have financial incentive to find out the REAL mean time until failure in calculating the warranty.
  • What MTBF is for. (Score:5, Insightful)

    by sakusha ( 441986 ) on Saturday April 05, 2008 @03:51PM (#22974640)
    I remember back in the mid-1980s when I received a service management manual from DEC; it had some information that really opened my eyes about what MTBF was really intended for. It had a calculation (I have long since forgotten the details) that allowed you to estimate how many service spares you would need to keep in stock to service any installed base of hardware, based on MTBF. This was intended for internal use in calculating spares inventory levels for DEC service agents. High-MTBF products needed fewer replacement parts in inventory; low-MTBF parts needed lots of parts in stock. Presumably internal MTBF ratings were more accurate than those released to end users.

    So anyway... MTBF is not intended as an indicator of a specific unit's reliability. It is a statistical measurement used to calculate how many spares are needed to keep a large population of machines working. It cannot be applied to a single unit the way it can be applied to a large population of units.

    Perhaps the classic example is the old tube-based computers like ENIAC: if a single tube has an MTBF of 1 year but the computer has 10,000 tubes, you'd be changing tubes (on average) more than once an hour, and you'd rarely even get an hour of uptime. (I hope I got that calculation vaguely correct.)
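    A back-of-the-envelope sketch (Python) of the population-level arithmetic described in this comment. The ENIAC figures are the ones above; the fleet size and MTBF in the spares example are purely illustrative, not from any actual DEC manual:

    ```python
    # Population-level MTBF arithmetic, as described in the comment above.
    HOURS_PER_YEAR = 8766  # average year length in hours

    def expected_failures(units, mtbf_hours, interval_hours):
        """Expected failures in a population over an interval, assuming a
        constant (exponential) failure rate of 1/MTBF per unit."""
        return units * interval_hours / mtbf_hours

    # The ENIAC-style example: 10,000 tubes, each with a 1-year MTBF.
    print(expected_failures(10_000, HOURS_PER_YEAR, 1))   # ~1.14 failures per hour

    # The spares-inventory use: spares to stock for one quarter, for a
    # hypothetical fleet of 5,000 drives rated at 500,000 hours MTBF.
    print(expected_failures(5_000, 500_000, 90 * 24))     # ~21.6 replacements
    ```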
  • by WaltBusterkeys ( 1156557 ) on Saturday April 05, 2008 @03:59PM (#22974684)
    Great post above. It also depends on how you count "failure." I've had external drives fail where the disk would still spin up, but the interface was the failure point. I took the disk out of the external enclosure and it worked just fine with a direct IDE (I know, who uses that anymore?) connection.

    If I were running a data-based business I'd count that as a "failure" since I had to go deal with the drive, but the HD company probably wouldn't since no data was permanently lost.
  • by GIL_Dude ( 850471 ) on Saturday April 05, 2008 @04:09PM (#22974748) Homepage
    I'd agree with you there; I have had probably 8 or 9 hard drives fail over the years (I currently have 10 running in the house and 8 running at my desk at work, so I do have a lot of drives). I am sure I caused some of the failures by just what you're talking about - I've maxed out the cases (for example, my server has 4 drives in it but was designed for 2 - I had to make my own bracket to jam the 4th in there, and the 3rd went in place of a floppy), and I've never done anything about cooling. Although, to hear the noises coming from some of the platters when they failed, I'm sure at least a couple weren't just heat. For example, at work I have had 2 drives fail in just bog-standard HP Compaq dc7700 desktops (without cramming in extra stuff). Sometimes they just up and die; other times I must have helped them along with heat.
  • by OS24Ever ( 245667 ) * <trekkie@nomorestars.com> on Saturday April 05, 2008 @04:12PM (#22974766) Homepage Journal
    More like 'those that never owned an IBM Deskstar drive'
  • by DonChron ( 939995 ) on Saturday April 05, 2008 @04:22PM (#22974826)
    Drive manufacturers take a new hard drive model, run a hundred drives or so for some number of weeks, and measure the failure rate. Then they extrapolate that failure rate out to thousands of hours... So, let's say one in 100 drives fails in a 1000-hour test (just under six weeks). MTBF = 100,000 hours, or 11.4 years! (A quick sketch of that arithmetic follows at the end of this comment.)

    To make this sort of test work, it must be run over a much longer period of time. But in the process of designing, building, testing and refining disk drive hardware and firmware (software), there isn't that much extra time to test drive failure rates. Want to wait an extra 9 months before releasing that new drive, to get accurate MTBF numbers? Didn't think so. How many different disk controllers do they use in the MTBF tests, to approximate different real-world behaviors? Probably not that many.

    Could they run longer tests, and revise MTBF numbers after the initial release of a drive? Sure, and many of them do, but that revised MTBF would almost always be lower, making it harder to sell the drives. On the other hand, newer drives are certainly available every quarter, so it may not be a bad idea to lower the apparent value of older drive models.

    So, it's better to assume a drive will fail before you're done using it. They're mechanical devices with high-speed moving parts and very narrow tolerable ranges of operation (that drive head has to be far enough away from the platters not to hit them, but close enough to read smaller and smaller areas of data). Anyone who's worked in a data center, or even a small server room, knows that drives fail. When I had around two hundred drives of varying ages, sizes and manufacturers in a data center, I observed a failure rate of five to ten drives per year - an observed MTBF well below the rated figures for enterprise disk array drives (SCSI, FC, SAS, whatever). Drives fail. That's why we have RAID. Storage Review has a good overview of how to interpret MTBF values from drive manufacturers [storagereview.com].
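    A minimal sketch (Python) of the extrapolation described earlier in this comment, using its hypothetical figures rather than any vendor's data: the quoted MTBF is just accumulated drive-hours divided by observed failures in a short test.

    ```python
    # MTBF extrapolated from a short qualification test, as sketched above.
    def mtbf_from_test(drives, test_hours, failures):
        """Naive MTBF estimate: total accumulated drive-hours per observed failure."""
        return drives * test_hours / failures

    mtbf = mtbf_from_test(drives=100, test_hours=1000, failures=1)
    print(f"{mtbf:,.0f} hours (~{mtbf / 8766:.1f} years)")
    # -> 100,000 hours, ~11.4 years, from under six weeks of testing
    ```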

  • by Kupfernigk ( 1190345 ) on Saturday April 05, 2008 @04:39PM (#22974942)
    MTBF is often confused with MTTF (mean time to failure), which is what's actually relevant in predicting the life of equipment. It needs to be stated clearly that MTBF applies to populations: if I have 1000 hard drives with an MTBF of 1 million hours, I would on average expect one failure every thousand hours (a quick sketch of that arithmetic is at the end of this comment). These are failures rather than wearouts, which are a completely different phenomenon.

    Anecdotal reports of failures also need to consider the operating environment. If I have a server rack, and most servers in the rack have a drive failure in the first year, is it the drive design or the server design? Given the relative effort that usually goes into HDD design and box design, it's more likely to be due to poor thermal management in the drive enclosure. Back in the day when Apple made computers (yes, they did once, before they outsourced it) their thermal management was notoriously better than that of many of the vanilla PC boxes, and properly designed PC-format servers like the HP Kayaks were just as expensive as Macs. The same, of course, went for Sun, and that was one reason why elderly Mac and Sparc boxes would often keep chugging along as mail servers until there were just too many people sending big attachments.

    One possibly related oddity that does interest me is laptop prices. The very cheap laptops are often advertised with optional 3 year warranties that cost as much as the laptop. Upmarket ones may have three year warranties for very little. I find myself wondering if the difference in price really does reflect better standards of manufacture so that the chance of a claim is much less, whether cheap laptops get abused and are so much more likely to fail, or whether the warranty cost is just built into the price of the more expensive models because most failures in fact occur in the first year.
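    A quick sketch (Python) of what an MTBF figure actually predicts for a population, assuming a constant failure rate (i.e. no wear-out); the fleet size and MTBF are the hypothetical ones from the first paragraph of this comment:

    ```python
    # What an MTBF figure predicts for a fleet, assuming a constant failure rate.
    HOURS_PER_YEAR = 8766

    def fleet_failure_interval(mtbf_hours, drives):
        """Average hours between failures somewhere in the fleet."""
        return mtbf_hours / drives

    def annualized_failure_rate(mtbf_hours):
        """Fraction of drives expected to fail per year."""
        return HOURS_PER_YEAR / mtbf_hours

    print(fleet_failure_interval(1_000_000, 1000))      # 1000.0 hours between failures
    print(f"{annualized_failure_rate(1_000_000):.2%}")  # ~0.88% of drives per year
    ```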

  • by crunchy_one ( 1047426 ) on Saturday April 05, 2008 @04:47PM (#22974982)
    Hard drives have been becoming less and less reliable as densities increase. Seagate, WD, Hitachi, Maxtor, Toshiba, heck, they all die, often sooner than their warranties are up. They're mechanical devices, for crying out loud. So here's a bit of good advice: If you really care about your data, use a RAID array with redundancy (RAID 1 or 5). It will cost a bit more, but you'll sleep better at night. Thank you all for your kind attention. That is all.
  • Comment removed (Score:3, Insightful)

    by account_deleted ( 4530225 ) on Saturday April 05, 2008 @04:52PM (#22975018)
    Comment removed based on user account deletion
  • by BSAtHome ( 455370 ) on Saturday April 05, 2008 @05:15PM (#22975142)
    There is another failure rate that you have to take into account: the unrecoverable bit-read error rate. This shows up as an error on the upstream connection, which can cause the controller to fail the drive. An unrecoverable read is one that fails the ECC mechanism, and in some circumstances it can be recovered by re-reading the sector.

    The error rate is on the order of one unrecoverable error per 10^14 bits read. On a busy system reading 1 MByte/s, that works out to roughly 10^7 seconds per unrecoverable read failure, or about two to three occurrences per year. So forget MTBF on busy systems and hope that your controller is able to do re-reads on a disk. Otherwise, your busy system/array is not going to last very long.
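    A quick sketch (Python) of the bit-error arithmetic above, using this comment's figures (one unrecoverable error per 10^14 bits read, a sustained 1 MByte/s read rate); real drives quote their own spec, so treat these numbers as illustrative:

    ```python
    # Unrecoverable-read-error arithmetic from the comment above.
    BITS_PER_ERROR = 1e14              # assumed spec: one URE per 1e14 bits read
    READ_RATE_BITS_PER_SEC = 1e6 * 8   # 1 MByte/s expressed in bits/s
    SECONDS_PER_YEAR = 3.156e7

    seconds_per_error = BITS_PER_ERROR / READ_RATE_BITS_PER_SEC
    errors_per_year = SECONDS_PER_YEAR / seconds_per_error

    print(f"{seconds_per_error:.2e} s between unrecoverable reads")  # ~1.3e7 s
    print(f"{errors_per_year:.1f} unrecoverable reads per year")     # ~2.5
    ```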
  • by neumayr ( 819083 ) on Saturday April 05, 2008 @05:18PM (#22975158)
    *blink*

    Okay, when I think of backup, it's data backup.
    I wouldn't back up applications or operating systems, just their configuration files.
    Anyway, what I'd try doing is diff(1)ing all those backed-up system files against the originals.

    Or am I missing something completely, and it's some weird rootkit that's embedded in some wm* media file?
  • by drsmithy ( 35869 ) <drsmithy@nOSPAm.gmail.com> on Saturday April 05, 2008 @05:20PM (#22975166)

    However, Google's data doesn't appear to have a lot of points when temperatures get over 45 degrees or so (as to be expected, since most of their drives are in a climate controlled machine room).

    The average drive temperature in the typical home PC would be *at least* 40 degrees, if not higher. While it's been some time since I checked, I seem to recall the drive in my mum's G5 iMac was around 50 degrees when the machine was _idle_.

    Google's data is useful for server room environments, but I'd be hesitant to extrapolate it to drives that aren't kept in a server room with ~20 degrees C ambient temperature and active cooling.

  • by oren ( 78897 ) on Saturday April 05, 2008 @05:24PM (#22975198)
    Disk reliability metrics are much more science than myth. Like all science, this means you actually need to put some minimal effort into understanding them. Unlike myths :-)

    Disks have two separate reliability metrics. The first is their expected lifetime. In general, disk failure follows a "bathtub distribution": drives are much more likely to fail in the first few weeks of operation. If they make it past this phase, they become very reliable - for a while anyway. Once their expected lifetime is reached, their failure rate starts climbing steeply.

    The often-quoted MTBF numbers express the disk reliability during the "safe" part of this probability distribution. Therefore, a disk with an expected lifetime of, say, 4 years can have an MTBF of 100 years. This sounds theoretical until you consider that if you have 200 such disks, you can expect on average about two of them to fail each year (a quick sketch of the arithmetic is at the end of this comment).

    People running large data warehouses are painfully aware of these two separate numbers. They need to replace all "expired" disks, and also have enough redundancy to survive disk failures in the duration.

    The article goes so far as to state this:

    "When the vendor specs a 300,000-hour MTBF -- which is common for consumer-level SATA drives -- they're saying that for a large population of drives, half will fail in the first 300,000 hours of operation," he says on his blog. "MTBF, therefore, says nothing about how long any particular drive will last."

    However, this obviously flew over the head of the author:

    The study also found that replacement rates grew constantly with age, which counters the usual common understanding that drive degradation sets in after a nominal lifetime of five years, Schroeder says.

    Common understanding is that 5 years is a bloody long life expectancy for a hard disk! It would take divine intervention to stop failures from rising after such a long time!
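    A small sketch (Python) of the two separate numbers described at the top of this comment, using its hypothetical figures: MTBF only describes the flat bottom of the bathtub curve, while the expected lifetime is a different figure entirely.

    ```python
    # Expected annual failures during the constant-failure-rate ("safe") phase.
    def expected_failures_per_year(drives, mtbf_years):
        return drives / mtbf_years

    print(expected_failures_per_year(drives=200, mtbf_years=100))  # -> 2.0
    # None of this says anything about the ~4-year expected lifetime, after
    # which the failure rate climbs and the MTBF figure stops applying.
    ```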
  • by KillerBob ( 217953 ) on Saturday April 05, 2008 @05:24PM (#22975200)
    I still have a working 10MB hard drive from an IBM 8088... >.> (and yes, that system still works too, complete with the Hercules monochrome graphics and orange scale CRT)
  • Re:warranties (Score:3, Insightful)

    by ooloogi ( 313154 ) on Saturday April 05, 2008 @05:33PM (#22975256)
    Warranties beyond about two years become largely meaningless for this purpose, because by the time a drive is that old, people often won't bother claiming warranty on what is by then such a small drive. The cost of shipping/transport is likely to be more than the marginal $/GB on a new drive.

    So in this way a manufacturer can get away with a long warranty, without necessarily incurring a cost for unreliability.
  • by mobby_6kl ( 668092 ) on Saturday April 05, 2008 @06:51PM (#22975660)

    They claim an MTBF in the ballpark of 50 years, but that's just a number pulled out of their rectal cavity.

    If you take a large number of drives and perform scientifically valid MTBF testing, you would certainly come up with a number less than half of that, and perhaps as low as 10% of that.

    Where did you pull these numbers from?
  • An MTBF is only meaningful when combined with the operating lifespan over which it was measured, after which customers needing high reliability are advised to replace their drives.

    Also, the manufacturer needs to specify the conditions of the test - temperature, humidity, etc. - and customers requiring reliability need to ensure they run near those conditions.

    If you do a 1000-hour test and all your drives have a design fault that causes a large proportion of them to fail after about 5000 hours of usage, you probably won't notice the fault, but 7 months down the line customers who run the drive 24/7 will (a toy simulation of this scenario follows below).

    The problem, of course, is that by the time you have done proper testing (i.e. running the drives for their expected lifespan under realistic operating conditions and seeing what proportion fail during that time, and when), a device with an expected lifetime measured in years is already obsolete.
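    A toy simulation (Python) of the scenario described above: a qualification test that stops at 1000 hours can completely miss a wear-out fault that kills drives around the 5000-hour mark. The Weibull parameters are made up purely for illustration.

    ```python
    import random

    random.seed(42)

    def drive_lifetime_hours():
        # Hypothetical wear-out-dominated fault: Weibull with shape 4 and
        # scale ~5500 h, so most failures land in the 4000-7000 hour range.
        return random.weibullvariate(5500, 4)

    lifetimes = [drive_lifetime_hours() for _ in range(10_000)]

    failed_in_test = sum(t <= 1000 for t in lifetimes)             # what the vendor's test sees
    failed_by_7_months = sum(t <= 7 * 30 * 24 for t in lifetimes)  # what 24/7 users see

    print(f"Failures within the 1000-hour test:    {failed_in_test}")      # ~10 of 10,000
    print(f"Failures within ~7 months of 24/7 use: {failed_by_7_months}")  # ~5,000 of 10,000
    ```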

  • by squidinkcalligraphy ( 558677 ) on Saturday April 05, 2008 @11:16PM (#22977202)
    "Backups are for wimps. Real men upload their data to an FTP site and
    have everyone else mirror it." -Linus Torvalds
  • Except... (Score:3, Insightful)

    by absurdist ( 758409 ) on Sunday April 06, 2008 @12:00AM (#22977412)
    ...that by the time the drive fails that far into its warranty, the vendor more likely than not won't have any drives that small in stock. So they'll replace it with whatever's on the shelf, which is usually an order of magnitude larger, at the very least.
  • by fluffy99 ( 870997 ) on Sunday April 06, 2008 @02:54AM (#22978074)
    To the guys who claim they've never lost a drive: you've had what, maybe 3 or 4? I deal with several large RAIDs, encompassing a few hundred drives running 24/7. The power and cooling are very tightly controlled. Looking at our statistics, we have about a 5% failure rate for drives within the first year, and about 10% over four years. SCSI drives seem to last longer than SATA drives, but they are also much more expensive. The MTBF numbers from the manufacturers are total BS. The best number to go by is the warranty, because that's what matters to the manufacturer. Depending on the expected failure rate of a particular model and the profit margin, they set the warranty period to minimize the number of replacements and still be able to make a profit. Some models might have a 5% or even 10% warranty replacement rate.
