Data Storage Hardware

Disk Failure Rates More Myth Than Metric

Lucas123 writes "Mean time between failure ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years, but study after study shows that those metrics are inaccurate for determining hard drive life. One study found that some disk drive replacement rates were greater than one in 10, nearly 15 times what vendors claim, and all of these studies show failure rates growing steadily with the age of the hardware. One former EMC employee turned consultant said, 'I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.'"
  • by Murphy Murph ( 833008 ) <sealab.murphy@gmail.com> on Saturday April 05, 2008 @03:40PM (#22974564) Journal

    I've gone through many over the years, replacing them as they became too small - still using some small ones many years old for minor tasks, etc. - and the only drive I've ever had partially fail is the one I accidentally launched across a room.

    My anecdotal converse is that I have never had a hard drive not fail. I am a bit on the cheap side of the spectrum, I'll admit, but having lost my last 40GB drives this winter, I now claim a pair of 120s as my smallest.
    I always seem to have a use for a drive, so I run them until failure.

  • by ABasketOfPups ( 1004562 ) on Saturday April 05, 2008 @03:57PM (#22974670)

    Warranty periods for 750 gig and 1 terabyte drives from Western Digital [zipzoomfly.com], Samsung [zipzoomfly.com], and Hitachi [zipzoomfly.com] are 3 to 5 years, according to the info on zipzoomfly.com.

    A one-year warranty doesn't seem that common. External drives seem to have one-year warranties, but even SATA drives at Best Buy mostly have 3 years.

  • Re:What MTBF is for. (Score:4, Informative)

    by sakusha ( 441986 ) on Saturday April 05, 2008 @04:09PM (#22974740)
    Thanks. I read your comment and got to thinking about it a bit more. I vaguely recall that in those olden days, MTBF was not an estimate; it was calculated from the service reports of failed parts. The calculations were released in monthly reports so we could increase our spares inventory to cover parts that were proving to be less reliable than estimated. But then, those were the days when every installed CPU was serviced by authorized agents, so data gathering was 100% accurate.
  • by hedwards ( 940851 ) on Saturday April 05, 2008 @04:09PM (#22974742)
    I think cooling issues are somewhat less common than most people think, but they are definitely significant. And I wouldn't care to suggest that people neglect to handle heat dissipation on general principle.

    Dirty, spiky power is a much larger problem. A few years back I had 3 or 4 nearly identical WD 80GB drives die within a couple of months of each other. They were replaced with identical drives that are still chugging along fine all this time later. The only major difference is that I gave each system a cheapo UPS.

    Being somewhat cheap, I tend to use disks until they wear out completely. After a few years I shift the disks to storing things which are permanently archived elsewhere, or to swap. It seems to work out fine; the only problem is what happens if the swap goes bad while I'm using it.
  • by Jugalator ( 259273 ) on Saturday April 05, 2008 @04:13PM (#22974772) Journal
    I agree. I had a Maxtor disk that ran at something like 50-60 C and wondered when it was going to fail; I never really treated it as my safest drive. And lo and behold, after ~3-4 years the first warnings about bad sectors started cropping up, and a year later Windows panicked and told me to immediately back it up if I hadn't already, I guess because the number of SMART errors was building up.

    On the other hand, I had a Samsung disk that ran at 40 C tops, in a worse drive bay too! The Maxtor one had free air passage in the middle bay (no drives nearby), where the Samsung was side-by-side with the metal casing.

    So I'm thinking there can be some measurable differences between drive brands, and a study of this, along with perhaps its relationship to brand failure rates, would be most interesting!
  • by kesuki ( 321456 ) on Saturday April 05, 2008 @04:13PM (#22974776) Journal
    And I had 5 fail this year; welcome to the law of averages. Note that I own about 15 hard drives, including the 5 that failed.
  • by omnirealm ( 244599 ) on Saturday April 05, 2008 @04:14PM (#22974786) Homepage
    While we are on the topic of failing drives, I think it would be appropriate to include a warning about USB drives and warranties.

    I purchased a 500GB Western Digital My Book about a year and a half ago. I figured that a pre-fab USB enclosed drive would somehow be more reliable than building one myself with a regular 3.5" internal drive and my own separately purchased USB enclosure (you may dock me points for irrational thinking there). Of course, I started getting the click-of-death about a month ago, and I was unpleasantly surprised to discover that the warranty on the drive was only for 1 year, rather than the 3-year warranty that I would have gotten for a regular 3.5" 500GB Western Digital drive at the time. Meanwhile, my 750GB Seagate drive in an AMS VENUS enclosure has been chugging along just fine, and if it fails sometime in the next four years, I will still be able to exchange it under warranty.

    The moral of the story is that, when there is a difference in the warranty periods (e.g., 1 year vs. 5 years), it makes a lot more sense to build your own USB enclosed drive than to order a pre-fab one.
  • Re:What MTBF is for. (Score:4, Informative)

    by davelee ( 134151 ) on Saturday April 05, 2008 @04:22PM (#22974830)
    MTBFs are designed to specify a RATE of failure, not the expected lifetime. This is because disk manufacturers don't test MTBF by running 100 drives until they die, but rather by running, say, 10,000 drives and counting the number that fail over a period of a few months. As drives age, the failure rate will clearly increase and thus the "MTBF" will shrink.

    Long story short: a 3-year-old drive will not have the same MTBF as a brand new drive, and an MTBF of 1 million hours doesn't mean that the median drive will live to 1 million hours.
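
    A sketch of how a rate-style MTBF falls out of that kind of test (Python; the pool size, test duration, and failure count below are made up for illustration):

        # Hypothetical qualification test: run a large pool of drives for a few
        # months, count failures, and divide accumulated drive-hours by failures.
        test_drives = 10_000
        test_hours = 3 * 30 * 24           # roughly three months powered on
        failures = 21                      # observed failures (made up)

        mtbf_hours = test_drives * test_hours / failures
        print(f"{mtbf_hours:,.0f} hours")           # ~1,030,000 hours
        print(f"{mtbf_hours / 8760:,.0f} 'years'")  # ~117: a rate, not a lifetime
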
  • by Kjella ( 173770 ) on Saturday April 05, 2008 @04:23PM (#22974846) Homepage
    1.6GB drive: failed
    3.8GB drive: failed
    45GB drive: failed
    2x500GB drive: failed

    Still working:
    9GB
    27GB
    100GB
    120GB
    2x160GB
    2x250GB
    3x500GB
    2x750GB
    3x500GB external

    However, in every case the failures have been the worst possible ones. The 45GB drive was my primary drive at the time, with all my recent stuff on it. The 2x500GB were in a RAID5; you know what happens in a RAID5 when two drives fail? Yep. Right now I'm running 3xRAID1 for the important stuff (+ backup), JBOD on everything else.
  • Re:What MTBF is for. (Score:3, Informative)

    by flyingfsck ( 986395 ) on Saturday April 05, 2008 @04:34PM (#22974912)
    That is an urban legend. Colossus and ENIAC were far more reliable than that. The old tube-based computers seldom failed, because the tubes were run at very low power levels and tubes degrade slowly; they don't pop like a light bulb (which is run at a very high power level to make a little visible light). Colossus, for example, was built largely from Plessey telephone exchange registers and telex machines. Those registers were in use in phone exchanges for decades after the war. I saw some tube-based exchanges in the early 80s that were still going strong.
  • by afidel ( 530433 ) on Saturday April 05, 2008 @05:01PM (#22975070)
    I would tend to agree with that. I run a datacenter that's cooled to 74 degrees and has good clean power from the online UPSes, and I've had 6 drive failures out of about 500 drives over the last 22 months. Three were from older servers that weren't always properly cooled (the company had a crappy AC unit in their old data closet). The other three all died in their first month or two after installation. So properly treated server-class drives are dying at a rate of about 0.5% per year for me; I'd say that jibes with the manufacturer MTBF.
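
    For comparison, the observed rate versus the rate a 1-million-hour MTBF would imply; a rough sketch in Python, with the figures taken loosely from the comment above:

        # Observed annualized failure rate vs. the rate implied by a
        # 1,000,000-hour MTBF (figures are approximate).
        failures, drives, months = 6, 500, 22

        observed_afr = failures / drives / (months / 12)
        implied_afr = 8760 / 1_000_000                   # hours per year / MTBF hours

        print(f"observed: {observed_afr:.2%} per year")  # ~0.65% (all six failures)
        print(f"implied:  {implied_afr:.2%} per year")   # ~0.88%
        # Counting only the three failures in properly cooled racks gives ~0.33%/yr.
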
  • by Depili ( 749436 ) on Saturday April 05, 2008 @05:15PM (#22975136)

    Excess heat can cause the lubricant of a hard drive to go bad, which causes weird noises; logic board failures and head-positioning failures also cause quite a racket.

    In my experience most drives fail without any indication from SMART tests (i.e., logic board failures); bad sectors are quite rare nowadays.

  • by drsmithy ( 35869 ) <drsmithy&gmail,com> on Saturday April 05, 2008 @05:21PM (#22975176)

    On the other hand, I had a Samsung disk that ran at 40 C tops, in a worse drive bay too! The Maxtor one had free air passage in the middle bay (no drives nearby), where the Samsung was side-by-side with the metal casing.

    Air is a much better insulator than metal.

  • by AySz88 ( 1151141 ) on Saturday April 05, 2008 @05:32PM (#22975254)
    MTBF is only valid during the "lifetime" of a drive. (For example, "lifetime" might mean the five years during which a drive is under warranty.) The MTBF is thus the mean time to failure you would see if you replaced the drive every five years with other drives of identical MTBF. So the 100-some-year MTBF doesn't mean that an individual drive will last 100+ years; it means that your scheme of replacing the drive every 5 years will keep working for an average of 100+ years.
    Of course, I think this is another deceptive definition from the hard drive industry... To me, the drive's lifetime ends when it fails, not "5 years".
    Source: http://www.rpi.edu/~sofkam/fileserverdisks.html [rpi.edu]
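
    A toy simulation of that reading, under the idealized constant (memoryless) failure rate that an MTBF figure encodes; the numbers below are illustrative:

        # Toy model: drives fail at a constant rate of 1/MTBF, and any drive
        # that survives its 5-year service life is swapped for a fresh one.
        # With a memoryless failure model, the mean time until the first
        # failure works out to roughly the MTBF itself (~114 years here).
        import random

        MTBF_HOURS = 1_000_000
        SERVICE_LIFE_HOURS = 5 * 8760

        def hours_until_first_failure() -> float:
            elapsed = 0.0
            while True:
                life = random.expovariate(1 / MTBF_HOURS)
                if life < SERVICE_LIFE_HOURS:
                    return elapsed + life         # this drive failed in service
                elapsed += SERVICE_LIFE_HOURS     # retired healthy, replaced

        runs = [hours_until_first_failure() for _ in range(20_000)]
        print(sum(runs) / len(runs) / 8760, "years on average")
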
  • by ooloogi ( 313154 ) on Saturday April 05, 2008 @05:42PM (#22975300)
    From the Google study, it would appear that there was a brand of hard drive that ran cool and was unreliable. If there's a correlation between brand/model/design and temperature (which there will be), then the temperature study may just be showing that up.

    To get a meaningful result, it would require taking a population of the same drive and comparing the effects of temperature on it.
  • by mollymoo ( 202721 ) * on Saturday April 05, 2008 @05:47PM (#22975328) Journal

    Maybe they mean the MTBF for drives that are just on, but not being used. I've never put any stock into those numbers, because I've had too many drives fail to believe that they're supposed to be lasting 100 years.

    If you think an MTBF of 100 years means the disk will last 100 years you're bound to be disappointed, because that's not what it means. MTBF is calculated in different ways by different companies, but generally there are at least two numbers you need to look at: the MTBF and the design or expected lifetime. A disk with an MTBF of 200,000 hours and a lifetime of 20,000 hours means that 1 in 10 is expected to fail during its lifetime, or that with 200,000 disks one will fail every hour. It does not mean the average drive will last 200,000 hours. After the lifetime is over, all bets are off.

    In short, the MTBF is a statistical measure of the expected failure rate during the expected lifetime of a device; it is not a measure of the expected lifetime of a device.
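
    Putting numbers on that distinction; a small Python sketch using the figures from the comment above:

        # MTBF vs. design lifetime, using the example figures above.
        MTBF_HOURS = 200_000
        LIFETIME_HOURS = 20_000
        FLEET_SIZE = 200_000

        # Expected failures per drive over its design lifetime:
        print(LIFETIME_HOURS / MTBF_HOURS)   # 0.1, i.e. about 1 in 10 fail

        # Expected failures per hour across the whole fleet:
        print(FLEET_SIZE / MTBF_HOURS)       # 1.0, i.e. roughly one per hour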

  • by SuperQ ( 431 ) * on Saturday April 05, 2008 @06:14PM (#22975458) Homepage
    MTBF is NOT a figure for a single drive. MTBF is an average that applies to a pool of drives of ANY size.

    If you have 10,000 drives and the failure rate is 1 per 1,000,000 drive-hours, you will have a failure every 100 hours on average.

    Here's a good document on disk failure information:
    http://research.google.com/archive/disk_failures.pdf [google.com]
  • by KillerBob ( 217953 ) on Saturday April 05, 2008 @06:35PM (#22975562)
    Admittedly, it's a different environment entirely than what you're running, but let me see if I can shed some light on it for you....

    I administer a small server, which runs its services in virtual sandboxes. One physical box, but through KVM the Apache/PHP/MySQL stack is in one sandbox, the SMTP/IMAP server is in another, etc. Each VM image is about 20GB, give or take, and the machine has two physical hard drives. My backup is periodic, and incremental. And the backup alternates between the drives... at any given time each hard drive will have two copies of every VM, not counting the one that's actually running.

    Now... here's where the full system backup comes in: because it's a virtual machine, it's only a single 20GB file. Backing it up is as easy as shutting down the VM and copying the file. Recovering from a backup is where it gets even easier... all I have to do is copy that one file back, and start it up. Poof. *everything* is back the way it was at the time of the backup. Total time to recover? Less than a minute.

    And the host OS is easy to rebuild, too, because there are no configuration files to worry about. SSH and KVM are the only services the host is running, and for the most part an out-of-the-box configuration for most Linux distributions will handle it quite nicely.

    So... I guess to answer your question... in my case a complete system backup makes administering, and recovering from "oh shit" moments, a hell of a lot easier. :) If you have the hard drive storage space available, I'd definitely suggest going that route.
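
    A minimal sketch of that kind of cold backup, assuming the guests are managed through libvirt (virsh); the domain names and image paths here are hypothetical:

        # Illustrative cold backup for KVM guests managed via libvirt: shut each
        # guest down, copy its single disk image to the alternate drive, restart.
        import shutil
        import subprocess
        from datetime import date

        GUESTS = {                       # domain name -> disk image (made up)
            "web":  "/vmstore/disk1/web.img",
            "mail": "/vmstore/disk1/mail.img",
        }
        BACKUP_ROOTS = ["/vmstore/disk1/backups", "/vmstore/disk2/backups"]

        def backup(domain: str, image: str, dest_root: str) -> None:
            subprocess.run(["virsh", "shutdown", domain], check=True)
            # (in practice, poll until the guest has actually powered off)
            shutil.copy2(image, f"{dest_root}/{domain}-{date.today()}.img")
            subprocess.run(["virsh", "start", domain], check=True)

        # Alternate the destination drive by week so each drive keeps copies.
        dest_root = BACKUP_ROOTS[date.today().isocalendar()[1] % 2]
        for domain, image in GUESTS.items():
            backup(domain, image, dest_root)
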
  • by ClioCJS ( 264898 ) <cliocjs+slashdot AT gmail DOT com> on Saturday April 05, 2008 @06:43PM (#22975600) Homepage Journal
    Last I checked.
  • Re:What MTBF is for. (Score:3, Informative)

    by Bacon Bits ( 926911 ) on Saturday April 05, 2008 @07:46PM (#22976000)
    Exactly, it's a basic misunderstanding of what MTBF means.

    Let's say you buy quality SAS drives for your servers and SAN. They're Enterprise grade, so they have an MTBF of 1 million hours. Your servers and SAN have a total of 500 disks between them. How many drives should you expect to fail each year?

    IIRC, this is the calculation:

    1 year = 365 days x 24 hours = 8760 hours per year
    500 disks * 8760 hours per year = 4,380,000 disk-hours per year
    4,380,000 disk-hours per year / 1,000,000 hours per disk failure = 4.38 disk failures per year

    So a 500 disk server farm should expect 4-5 disk failures annually.
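
    The same back-of-the-envelope calculation, wrapped up in Python so the pool size and MTBF can be varied (illustrative only):

        # Expected annual failures for a pool of drives, assuming the quoted
        # MTBF holds as a constant failure rate over the service life.
        HOURS_PER_YEAR = 365 * 24   # 8760

        def expected_annual_failures(num_disks: int, mtbf_hours: float) -> float:
            return num_disks * HOURS_PER_YEAR / mtbf_hours

        print(expected_annual_failures(500, 1_000_000))   # ~4.38 failures/year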
  • by putaro ( 235078 ) on Saturday April 05, 2008 @09:06PM (#22976492) Journal
    No, the key is a small sample size. Disks in data centers, running in a nice, fully air-conditioned room off nicely filtered power, will fail. All disks will fail eventually - they have little spinny things in them and bearings and such that will eventually give out. But your mileage will vary: disks *are* reliable, and it's easy to have a small sample set that works well.
