Disk Failure Rates More Myth Than Metric
Lucas123 writes "Mean time between failure (MTBF) ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years, but study after study shows that those metrics are inaccurate for determining hard drive life. One study found that some disk drive replacement rates were greater than one in 10, nearly 15 times what vendors claim, and all of these studies show failure rates growing steadily with the age of the hardware. One former EMC employee turned consultant said, 'I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.'"
Comment removed (Score:5, Interesting)
Never had a drive *not* fail. (Score:5, Informative)
My anecdotal converse is that I have never had a hard drive not fail. I am a bit on the cheap side of the spectrum, I'll admit, but having lost my last 40GB drives this winter, I now claim a pair of 120s as my smallest.
I always seem to have a use for a drive, so I run them until failure.
Re: (Score:2)
If that were the case, I would seriously consider looking for a problem that's not directly related to the hard drives themselves. Around 80% of HDD failures are controller board failures; I wonder if maybe your setup is experi
Re: (Score:2)
Re: (Score:3, Interesting)
One drive, 24x7, approx. 12 years. Seagate. Why?
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Insightful)
Okay, when I think of backup, it's data backup.
I wouldn't back up applications or operating systems, just their configuration files.
Anyway, what I'd try doing is diff(1)ing all those backed up system files with the originals.
Or am I missing something completely, and it's some weird rootkit that's embedded in some wm* media file?
Re:Never had a drive *not* fail. (Score:5, Informative)
I administer a small server, which runs its services in virtual sandboxes. One physical box, but through KVM the Apache/PHP/MySQL stack is in one sandbox, the SMTP/IMAP is in another, etc. Each VM image is about 20GB, give or take, and the machine has two physical hard drives. My backup is periodic and incremental, and it alternates between the drives... at any given time each hard drive will have two copies of every VM, not counting the one that's actually running.
Now... here's where the full system backup comes in: because it's a virtual machine, it's only a single 20GB file. Backing it up is as easy as shutting down the VM and copying the file. Recovering from a backup is where it gets even easier... all I have to do is copy that one file back, and start it up. Poof. *everything* is back the way it was at the time of the backup. Total time to recover? Less than a minute.
And the host OS is easy to rebuild, too, because there are no configuration files to worry about. SSH and KVM are the only services the host is running, and for the most part an out-of-the-box configuration for most Linux distributions will handle it quite nicely.
So... I guess to answer your question... in my case a complete system backup makes administering, and recovering from "oh shit" moments a hell of a lot easier.
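For what it's worth, a minimal sketch of that "shut the VM down, copy the one file, start it again" routine, assuming libvirt/virsh is managing the KVM guests (the parent doesn't say what tooling they actually use, and the domain name and paths here are made up):

import shutil
import subprocess
import time

def wait_until_off(domain, timeout=300):
    # poll libvirt until the guest reports "shut off" (assumes virsh is available)
    for _ in range(timeout):
        state = subprocess.run(["virsh", "domstate", domain],
                               capture_output=True, text=True).stdout.strip()
        if state == "shut off":
            return
        time.sleep(1)
    raise TimeoutError(f"{domain} did not shut down in time")

def backup_vm(domain, image_path, backup_path):
    subprocess.run(["virsh", "shutdown", domain], check=True)  # ask the guest to power off cleanly
    wait_until_off(domain)
    shutil.copy2(image_path, backup_path)                      # the whole VM is a single file
    subprocess.run(["virsh", "start", domain], check=True)     # bring the service back up

# hypothetical names/paths, for illustration only
backup_vm("web", "/var/lib/libvirt/images/web.img", "/backup/web.img")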
Re: (Score:3, Informative)
Re: (Score:2)
Re: (Score:3, Interesting)
Re: (Score:3, Insightful)
Re:Never had a drive fail (Score:5, Funny)
hah. captcha word: largest
Re:Never had a drive fail (Score:4, Funny)
Re: (Score:2)
Re: (Score:3, Insightful)
Re:Never had a drive fail (Score:4, Informative)
Dirty, spiky power is a much larger problem. A few years back I had 3 or 4 nearly identical WD 80GB drives die within a couple of months of each other. They were replaced with identical drives that are still chugging along fine all this time later. The only major difference is that I gave each system a cheapo UPS.
Being somewhat cheap, I tend to use disks until they wear out completely. After a few years I shift the disks to storing things which are permanently archived elsewhere, or to swap. Seems to work out fine; the only problem is what happens if the swap goes bad while I'm using it.
Re:Never had a drive fail (Score:5, Informative)
Re: (Score:3, Interesting)
Of course, yonder is a large stack of backups, whic
Re:Never had a drive fail (Score:4, Insightful)
Re: (Score:3, Informative)
Excess heat can cause the lubricant of a hard drive to break down, which causes weird noises; logic board failures and head positioning failures also make quite a racket.
In my experience most drives fail without any indication from SMART tests, i.e. logic board failures; bad sectors are quite rare nowadays.
Re: (Score:2)
When they fail within minutes, in an open box, with extra fans blowing across them (4 out of 4 from one batch, 2 out of 4 with a replacement batch - and yes, they were also individually checked in another machine afterwards, but let's face it, when they're making grinding or zip-zip-zip noises, they're defective), there's a problem with quality control. Specifically, China.
Also, do NOT use those hard drive fans that mount under the HD - I tried that with a RAID 4 years ago. The fans become unbalanced aft
Re:Never had a drive fail (Score:5, Funny)
<Indiana Jones> IT BELONGS IN A MUSEUM!</Indiana Jones>
It belongs in a museum (Score:2)
Re:Never had a drive fail (Score:4, Informative)
Re: (Score:3, Informative)
3.8GB drive: failed
45GB drive: failed
2x500GB drive: failed
Still working:
9GB
27GB
100GB
120GB
2x160GB
2x250GB
3x500GB
2x750GB
3x500GB external
However, in every case the failure came at the worst possible time. The 45GB drive was my primary drive at the time, with all my recent stuff on it. The 2x500GB were in a RAID5; you know what happens in a RAID5 when two drives fail? Yep. Right now I'm running 3xRAID1 for the important stuff (+ backup), JBOD on everything else.
Re:Never had a drive fail (Score:5, Funny)
Re: (Score:2)
During this period, I learned not to buy WD drives in Australia again - whereas Seagate and Samsung handle warranty returns locally, and each took about 3 days to get a new drive to me, WD wanted me to send the drive to Singapore, and estimated a 4-week turnaround. Fortunately, I was able to convince the retailer t
Re: (Score:2)
Never had one fail yet. Very impressed.
Re: (Score:2)
Something about moving around and hard drives just don't mix. (Can't wait for SSD.)
Re: (Score:2, Interesting)
Re: (Score:2)
The only time I had a hard drive die was at work... which is probably one of the worst places for it to happen.
And our tech people couldn't recover data; I had to ask for the broken drive and recover it myself.
And I was pretty much out of luck, because we get just one big partition, so the fragmentation over my important documents was extremely high.
That's why:
1. always partition everything
2. never use Maxtor drives
3. never buy Dell
Re: (Score:2)
That's not a very epic fail.
There are only two kind of peeps... (Score:5, Insightful)
Re:There are only two kind of peeps... (Score:5, Insightful)
Re: (Score:3, Funny)
Re:There are only two kind of peeps... (Score:5, Insightful)
have everyone else mirror it." -Linus Torvalds
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Insightful)
Re: (Score:2)
Re: (Score:2)
I actually had a few "Deathstars"; in fact, I still have one running fine in one of my machines. Not a single one of them has crashed on me; one made the infamous "click of death" but then just kept on running...
I always love mentioning that when people say IBM hardware is of poor quality (generally gamers and similar people who will never again touch any product by a company if they so much as hear a rumour about one of their products failing a bit too often, yet they'll gladly buy the cheapest pos
Marketplace can't function without good data (Score:5, Insightful)
The inevitable result is a race to the bottom. Buyers will reason they might as well buy cheap, because they at least know they're saving money, rather than paying for quality and likely not getting it.
Re: (Score:2)
Re:Marketplace can't function without good data (Score:4, Interesting)
Re: (Score:2)
Re: (Score:2)
The way I understand MTBF, one in ten drives failing might translate to an MTBF of 30 years, assuming the drive is replaced at the end of a 3-year service life, or 50 years assuming a 5-year service life.
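A quick back-of-the-envelope version of that arithmetic (Python; the one-in-ten figure and service lives are the comment's own, and the constant-failure-rate assumption is mine):

def implied_mtbf_years(fraction_replaced, service_life_years):
    # failures per drive-year, inverted to give a mean time between failures
    return service_life_years / fraction_replaced

print(implied_mtbf_years(0.10, 3))  # 30.0 years
print(implied_mtbf_years(0.10, 5))  # 50.0 years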
Re:Marketplace can't function without good data (Score:4, Insightful)
Also, the manufacturer needs to specify the conditions of the test (temperature, humidity, etc.), and customers requiring reliability need to ensure they run near those conditions.
If you do a 1000-hour test and all your drives have a design fault that causes a large proportion of them to fail after about 5000 hours of usage, you probably won't notice the fault, but 7 months down the line customers who run the drive 24/7 will.
The problem, of course, is that by the time you have done proper testing (= running the drives for their expected lifespan under realistic operating conditions and seeing what proportion fail during that time, and when) for a device with an expected lifetime in years, the device is obsolete.
MTBF For Unused Drive? (Score:2)
Re:MTBF For Unused Drive? (Score:5, Interesting)
On average, a disk drive can last as long as the MTBF number. What are the chances that you have an average drive? They are slim. Each component in the drive, every resistor, every capacitor, every part has an MTBF. They also have tolerance values: that is to say they are manufactured to a value with a given tolerance of accuracy. Each tolerance has to be calculated as one component out of tolerance could cause failure of complete sections of the drive itself. When you start calculating that kind of thing it becomes similar to an exercise in calculating safety on the space shuttle... damned complex in nature.
The tests remain valid because of a simple fact. In large data centers where you have large quantities of the same drive spinning in the same lifecycles, you will find that a percentage of them fail within days of each other. That means that there is a valid measurement of the parts in the drive, and how they will stand the test of life in a data center.
Is your data center an 'average' life for a drive? The accelerated lifecycle tests cannot tell you. All the testing does is look for failures of any given part over a number of power cycles, hours of use, etc. It is quite improbable that your use of the drive will match that of the accelerated testing life cycle.
The MTBF is a good estimation of when you can be certain of a failure of one part or another in your drive. There is ALWAYS room for it to fail prior to that number. ALWAYS.
Like any electronic device for consumers, if it doesn't fail in the first year, it's likely to last as long as you are likely to be using it. Replacement rates of consumer societies mean that manufacturers don't have to worry too much about MTBF as long as it's longer than the replacement/upgrade cycle.
If you are worried about data loss, implement a good data backup program and quit worrying about drive MTBFs.
Re: (Score:3, Insightful)
If I were running a data-based business I'd count that as a "failure" since I had to go deal with the drive, but the HD company probably wouldn't since no data was permanently lost.
Re: (Score:3, Insightful)
The error rate is on the order of one unrecoverable error per 10^14 bits. Calculating this for a busy system reading 1 MByte/s gives you approx. 10^7 seconds for each unrecoverable read failure. Or, that mea
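The arithmetic behind that (truncated) estimate, as I read it; the 10^14 figure is the comment's, and treating the error rate as a simple per-bit average is an assumption:

BITS_PER_UNRECOVERABLE_ERROR = 1e14   # roughly the spec'd unrecoverable read error rate
READ_RATE_BITS_PER_SEC = 1e6 * 8      # 1 MByte/s expressed in bits per second

seconds_per_error = BITS_PER_UNRECOVERABLE_ERROR / READ_RATE_BITS_PER_SEC
print(seconds_per_error)              # 1.25e7 seconds, i.e. on the order of 10^7
print(seconds_per_error / 86400)      # ~145 days of continuous reading per bad read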
Re:MTBF For Unused Drive? (Score:4, Informative)
If you have 10,000 drives, and the failure is 1 in 1,000,000 hours, you will have a failure every 100 hours.
Here's a good document on disk failure information:
http://research.google.com/archive/disk_failures.pdf [google.com]
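The fleet arithmetic in one line (assuming, as MTBF figures do, a constant failure rate across the population):

mtbf_hours = 1_000_000
drives = 10_000
print(mtbf_hours / drives)  # 100.0 hours between failures somewhere in the fleet, on average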
Re:MTBF For Unused Drive? (Score:5, Informative)
If you think an MTBF of 100 years means the disk will last 100 years you're bound to be disappointed, because that's not what it means. MTBF is calculated in different ways by different companies, but generally there are at least two numbers you need to look at: MTBF and the design or expected lifetime. A disk with an MTBF of 200 000 hours and a lifetime of 20 000 hours means that 1 in 10 are expected to fail during their lifetime, or that with 200 000 disks one will fail every hour. It does not mean the average drive will last 200 000 hours. After the lifetime is over, all bets are off.
In short, the MTBF is a statistical measure of the expected failure rate during the expected lifetime of a device; it is not a measure of the expected lifetime of a device.
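A small sketch of that distinction, using the parent's example numbers and assuming a constant failure rate during the design lifetime:

mtbf_hours = 200_000      # quoted MTBF
lifetime_hours = 20_000   # design / expected lifetime
disks = 200_000

print(lifetime_hours / mtbf_hours)   # 0.1 -> about 1 drive in 10 fails within its lifetime
print(disks / mtbf_hours)            # 1.0 -> with 200,000 disks, about one failure per hour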
warranties (Score:5, Insightful)
Re: (Score:2)
Re: (Score:2)
I'm not saying that MTBF is a completely unreliable number. I'd imagine there is a correlation between higher M
Re: (Score:2)
Re: (Score:2)
ASSuming anything approaching a significant fraction of the drives that fail during the warranty period are actually claimed. Otherwise a warranty is nothing more than advertising.
I strongly suspect this is not the case and you are simply replacing one false metric with another.
Re: (Score:3, Insightful)
So in this way a manufacturer can get away with a long warranty, without necessarily incurring a cost for unreliability.
Except... (Score:3, Insightful)
Easy to get the quoted figures ... (Score:2)
What MTBF is for. (Score:5, Insightful)
So anyway.. MTBF is not intended as an indicator of a specific unit's reliability. It is a statistical measurement to calculate how many spares are needed to keep a large population of machines working. It cannot be applied to a single unit in the way it can be applied to a large population of units.
Perhaps the classic example is the old tube-based computers like ENIAC: if a single tube has an MTBF of 1 year, but the computer has 10,000 tubes, you'd be changing tubes (on average) more than once an hour; you'd rarely even get an hour of uptime. (I hope I got that calculation vaguely correct.)
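The tube math, under the usual assumption of independent components with constant failure rates (a series system's MTBF is the per-part MTBF divided by the part count):

tube_mtbf_hours = 365 * 24   # one year per tube
tubes = 10_000

system_mtbf_hours = tube_mtbf_hours / tubes
print(system_mtbf_hours)     # ~0.876 hours -> a tube change more than once an hour, on average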
Re: (Score:2)
Re:What MTBF is for. (Score:4, Informative)
Re:What MTBF is for. (Score:4, Informative)
long story short -- a 3-year-old drive will not have the same MTBF as a brand new drive. And an MTBF of 1 million hours doesn't mean that the median drive will live to 1 million hours.
Re: (Score:3, Informative)
Re: (Score:2)
This is the case with any statistic. They are very useful for predicting trends in a large enough population, but completely useless for predicting individuals' behaviour.
Re: (Score:3, Informative)
Let's say you buy quality SAS drives for your servers and SAN. They're Enterprise grade, so they have an MTBF of 1 million hours. Your servers and SAN have a total of 500 disks between them. How many drives should you expect to fail each year?
IIRC, this is the calculation:
1 year = 365 days x 24 hours = 8760 hours per year
500 disks * 8760 hours per year = 4,380,000 disk-hours per year
4,380,000 disk-hours per year / 1,000,000 disk-hours per failure = 4.38 expected failures per year
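Call it roughly four or five a year. A quick sketch of the year-to-year spread around that expectation, assuming independent failures at a constant rate (which is exactly what the MTBF figure assumes):

import numpy as np

mtbf_hours = 1_000_000
disks = 500
expected_per_year = disks * 8760 / mtbf_hours   # ~4.38

rng = np.random.default_rng(1)
print(rng.poisson(expected_per_year, size=5))   # simulated yearly failure counts scatter around ~4.4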
Misunderstanding MTBF (Score:5, Interesting)
MTBF numbers are generated by running, say, thousands of hard drives of the same model and batch/lot and seeing how long it takes before one fails. This may be a day or so. You then figure out how many total HD running hours it took before that failure. If you have 1,000 HDs running and it takes 40 hours before one fails, that's a 40,000-hour MTBF. The number isn't generated by running, say, 10 hard drives, waiting for all of them to fail, and averaging.
Thus, because of the way MTBF numbers are generated, they may or may not reflect hard-drive reliability beyond a few weeks. It depends on the assumptions made about hard-drive stress and usage beyond the length of time before the first HD of the 1,000 or so under test failed. Most likely, the number says less and less about hard-drive reliability beyond that initial point of failure (which is on the order of tens or hundreds of hours, not hundreds of thousands or millions of hours!).
To be sure, all else equal, a higher MTBF is better than a lower one. But as far as I'm concerned, those numbers are more useful for predicting DOA units, duds, or quick failures, and more useful to professionals running large arrays of HDs. They are not particularly useful for getting a good idea of how long your HD will actually last.
HD manufacturers also publish an expected life cycle for their drives, but I usually put the most stock in the length of the warranty; that's what they're willing to put their money behind. Granted, it's possible their strategy is just to warranty less than how long they expect 90% of drives to last, so they can sell them cheaper. But if you've had a drive for longer than the manufacturer's published expected life, what they're telling you is that you've basically gotten good value, and that you'll probably want to have a replacement on hand and be backed up.
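A toy simulation of the batch-test procedure described above (the Weibull lifetime distribution and its parameters are entirely made up, just to illustrate that drive-hours-to-first-failure extrapolates poorly to typical lifetime):

import numpy as np

rng = np.random.default_rng(0)
drives = 1_000
lifetimes_hours = rng.weibull(1.5, size=drives) * 30_000  # arbitrary wear-out-ish distribution

# the "run a big batch until the first one dies" style of estimate
batch_mtbf_estimate = drives * lifetimes_hours.min()
print(batch_mtbf_estimate)        # typically a few hundred thousand drive-hours
print(lifetimes_hours.mean())     # while the average drive here only lasts ~27,000 hours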
Re: (Score:2)
Re: (Score:3, Insightful)
Temperature is the key (Score:5, Interesting)
Here is an example from my server. At 18C ambient, in a well-cooled and well-designed case with dedicated hard drive fans, the Maxtors I use for RAID1 run at 29C. My media server, which is in the loft with sub-16C ambient, runs them at 24-34C depending on the position in the case (once again, a proper high-end case with dedicated hard drive fans).
Very few hard disk enclosures can bring the temperature down to 24-25C.
SANs or high density servers usually end up running disks at 30C+ while at 18C ambient. In fact I have seen disks run at 40C or more in "enterprise hardware".
From there on, it is not amazing that they fail at a rate different from the quoted one. In fact, I would have been very surprised if they matched it.
Re: (Score:2)
Re:Temperature is the key (Score:5, Interesting)
Google says that's just not what they've seen [google.com]. "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at the very high temperatures is there a slight reversal of this trend."
On the graph it's clear that 30-35C is best at three years. But up until then, 35-40C has lower failure rates, and both have much lower rates than the 15-30C range.
Re: (Score:3, Insightful)
However, Google's data doesn't appear to have a lot of points when temperatures get over 45 degrees or so (as to be expected, since most of their drives are in a climate controlled machine room).
The average drive temperature in the typical home PC would be *at least* 40 degrees, if not higher. While it's been some time since I checked, I seem to recall the drive in my mum's G5 iMac was around 50 degrees when the machine was _idle_.
Google's data is useful for server room environments, but I'd be hesitant
Re: (Score:2, Informative)
To get a meaningful result, it would require taking a population of the same drive and comparing the effects of temperature on it.
Re:Temperature is the key (Score:4, Informative)
On the other hand, I had a Samsung disk that ran at 40 C tops, in a worse drive bay too! The Maxtor one had free air passage in the middle bay (no drives nearby), where the Samsung was side-by-side with the metal casing.
So I'm thinking there can be some measurable differences between drive brands, and a study of this, along with perhaps its relationship to brand failure rates, would be most interesting!
Re: (Score:3, Informative)
On the other hand, I had a Samsung disk that ran at 40 C tops, in a worse drive bay too! The Maxtor one had free air passage in the middle bay (no drives nearby), where the Samsung was side-by-side with the metal casing.
Air is a much better insulator than metal.
Re: (Score:2)
I've had over a dozen Maxtors in the last decade and none have failed. Of course I replace them every 3-4 years because by then they've got too damn small.
The only drive I've had that did fail was an IBM, and even then I had plenty of advance warning so I could replace it before it was unusable.
Re: (Score:2)
What about google? (Score:2)
Add value & Interpreting (Score:2)
They are obviously interpreting the numbers.
How the hell they can be adding value is way beyond me.
Adding price, maybe, but VALUE????
Re: (Score:2)
How the hell they can be adding value is way beyond me.
By having larger amounts of data and more skill in interpreting it.
Build your own USB drives (Score:4, Informative)
I purchased a 500GB Western Digital My Book about a year and a half ago. I figured that a pre-fab USB enclosed drive would somehow be more reliable than building one myself with a regular 3.5" internal drive and my own separately purchased USB enclosure (you may dock me points for irrational thinking there). Of course, I started getting the click-of-death about a month ago, and I was unpleasantly surprised to discover that the warranty on the drive was only for 1 year, rather than the 3-year warranty that I would have gotten for a regular 3.5" 500GB Western Digital drive at the time. Meanwhile, my 750GB Seagate drive in an AMS VENUS enclosure has been chugging along just fine, and if it fails sometime in the next four years, I will still be able to exchange it under warranty.
The moral of the story is that, when there is a difference in the warranty periods (i.e., 1 year vs. 5 years), it makes a lot more sense to build your own USB enclosed drive rather than order a pre-fab USB enclosed drive.
Re: (Score:2)
Needless to say, it's my last WD drive. Their service sucks.
MTBF rate calculation method is flawed (Score:2, Insightful)
To make this sort of test work, it must be run over a much longer period of time. But in the process of designing, building, testing and refining disk drive hardware and firmware (software), the
I don't know what you people do to your drives (Score:3, Interesting)
MTBF is a useful statistical measure (Score:4, Insightful)
Anecdotal reports of failures also need to consider the operating environment. If I have a server rack, and most servers in the rack have a drive failure in the first year, is it the drive design or the server design? Given the relative effort that usually goes into HDD design and box design, it's more likely to be due to poor thermal management in the drive enclosure. Back in the day when Apple made computers (yes, they did once, before they outsourced it) their thermal management was notoriously better than that of many of the vanilla PC boxes, and properly designed PC-format servers like the HP Kayaks were just as expensive as Macs. The same, of course, went for Sun, and that was one reason why elderly Mac and Sparc boxes would often keep chugging along as mail servers until there were just too many people sending big attachments.
One possibly related oddity that does interest me is laptop prices. The very cheap laptops are often advertised with optional 3 year warranties that cost as much as the laptop. Upmarket ones may have three year warranties for very little. I find myself wondering if the difference in price really does reflect better standards of manufacture so that the chance of a claim is much less, whether cheap laptops get abused and are so much more likely to fail, or whether the warranty cost is just built into the price of the more expensive models because most failures in fact occur in the first year.
Re: (Score:2)
RAID, If You Really Care (Score:2, Insightful)
Typical misleading title (and bad article) (Score:3, Insightful)
Disks have two separate reliability metrics. The first is their expected lifetime. In general, disk failure follows a "bathtub distribution": they are much more likely to fail in the first few weeks of operation. If they make it past this phase, they become very reliable - for a while anyway. Once their expected lifetime is reached, their failure rate starts climbing steeply.
The often-quoted MTBF numbers express the disk's reliability during the "safe" part of this probability distribution. Therefore, a disk with an expected lifetime of, say, 4 years can have an MTBF of 100 years. This sounds theoretical until you consider that if you have 200 such disks, you can expect that on average about two of them will fail each year.
People running large data warehouses are painfully aware of these two separate numbers. They need to replace all "expired" disks, and also have enough redundancy to survive disk failures in the interim.
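A hedged sketch of that bathtub shape, with made-up hazard-rate terms for each phase (infant mortality, the flat "MTBF" floor, and wear-out); the numbers are illustrative only:

import math

def bathtub_hazard(t_hours):
    # failures per drive-hour; all three terms are invented for illustration
    infant = 0.01 * math.exp(-t_hours / 500.0)     # early-life failures tail off quickly
    floor = 5e-6                                   # flat useful-life rate (~200,000 h MTBF)
    wearout = 5e-6 * (t_hours / 30_000.0) ** 6     # climbs steeply past the design life
    return infant + floor + wearout

for t in (100, 1_000, 5_000, 20_000, 50_000):
    print(t, bathtub_hazard(t))   # rate dips, flattens, then climbs again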
The article goes so far as to state this:
"When the vendor specs a 300,000-hour MTBF -- which is common for consumer-level SATA drives -- they're saying that for a large population of drives, half will fail in the first 300,000 hours of operation," he says on his blog. "MTBF, therefore, says nothing about how long any particular drive will last."
However, this obviously flew over the head of the author:
The study also found that replacement rates grew steadily with age, which counters the common understanding that drive degradation sets in after a nominal lifetime of five years, Schroeder says.
Common understanding is that 5 years is a bloody long life expectancy for a hard disk! It would take divine intervention to stop failures from rising after such a long time!
Re: (Score:3, Insightful)
Where did you pull these numbers from?
MTBF assumes drives are replaced every few years (Score:3, Informative)
Of course, I think this is another deceptive definition from the hard drive industry... To me, the drive's lifetime ends when it fails, not "5 years".
Source: http://www.rpi.edu/~sofkam/fileserverdisks.html [rpi.edu]
Wow... (Score:2)
To quote Scott McNealy (Score:2)
Enough with the little sample sizes (Score:3, Insightful)
Re:Failure rates ! warranty period. (Score:5, Informative)
Warranty periods for 750 gig and 1 terabyte drives from Western Digital [zipzoomfly.com], Samsung [zipzoomfly.com], and Hitachi [zipzoomfly.com], are 3 years to 5 years according to the info on zipzoomfly.com.
A one-year warranty doesn't seem that common. External drives seem to have one-year warranties, but even SATA drives at Best Buy mostly have 3 years.
Re: (Score:2)
3 years is pretty much the industry standard on hard drives. Likewise for monitors, btw... so if your HP or Dell starts having problems with the monitor, you should check the warranty on the monitor because it'll usually be longer than the warranty on the desktop. :)
But yes... external peripherals usually only have a 1 year warranty. My 1TB external drive is the only thing I've ever bought the extended warr