

Disk Drive Failures 15 Times What Vendors Say
jcatcw writes "A Carnegie Mellon University study indicates that customers are replacing disk drives far more frequently than vendor estimates of mean time to failure (MTTF) would suggest. The study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC and SATA drives. The data sheets for the drives indicated MTTF between 1 and 1.5 million hours, which should mean annual failure rates of at most 0.88%; the observed annual replacement rates were between 2% and 4%. The study also shows no evidence that Fibre Channel drives are any more reliable than SATA drives."
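The arithmetic behind those figures is easy to check. Here's a minimal sketch in Python (my own back-of-the-envelope code, not anything from the study), using the usual approximation AFR ≈ 8760 / MTTF:

HOURS_PER_YEAR = 24 * 365  # 8760

def annual_failure_rate(mttf_hours):
    # Annualized failure rate implied by a datasheet MTTF (valid when MTTF >> 1 year).
    return HOURS_PER_YEAR / mttf_hours

for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} h -> AFR {annual_failure_rate(mttf):.2%}")
# MTTF 1,000,000 h -> AFR 0.88%
# MTTF 1,500,000 h -> AFR 0.58%

# Observed annual replacement rates of 2-4% (up to 13% on some systems)
# imply much shorter effective MTTFs:
for arr in (0.02, 0.04, 0.13):
    print(f"ARR {arr:.0%} -> implied MTTF ~{HOURS_PER_YEAR / arr:,.0f} h")

The "15 times" headline is simply the worst observed rate divided by the datasheet rate: 13% / 0.88% ≈ 15.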
Re:Repeat? (Score:4, Informative)
In other news... (Score:5, Informative)
I believe it... (Score:3, Informative)
Re:Repeat? (Score:5, Informative)
"If they told me it was 100,000 hours, I'd still protect it the same way. If they told me if was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."
Just common sense.
When a vendor tells you to expect a 0.88% failure rate, but it's really 2-4%, that's a HUGE shift in the impact to your organization.
When you just have one or a handful of disks in your server at home, that's a very different situation from a datacenter full of systems with all kinds of disk needs.
Having read the paper and seen the talk... (Score:2, Informative)
RAID = Redundant Articles of Identical Discourse (Score:3, Informative)
Slashdot has a high rate of RAID, which is a bad thing. It has been a whole 9 days. Slashdot needs a story moderation system so dupe articles can get modded out of existence. Ditto for the Slashdot editors who do the duping!
Can we get redundant posting on the story about google's paper [slashdot.org]?
Re:Interface matters why? (Score:5, Informative)
Fibre Channel drives, like SCSI drives, are assumed to be "enterprise" drives and therefore better built than "consumer" SATA and PATA drives. It's nothing inherent to the interface, but a consequence of the environment in which that interface is expected to be used. At least, that's the idea.
Re:Personally I am SHOCKED (Score:5, Informative)
The same companies that lie about the capacity on EVERY SINGLE DRIVE they make? You don't think that they're a bunch of lying fucking weasels? (We're both using sarcasm here.)
I don't care how you spin it. 1024 is the multiple. NOT 1000!
Failure doesn't get fixed because making a drive more reliable means it costs more. If it costs more, it's not going to get purchased.
Re:In other news... (Score:4, Informative)
http://www.usenix.org/events/fast07/tech/schroede
Re:Interface matters why? (Score:3, Informative)
Re:Even better ... (Score:5, Informative)
Unfortunately there is no big "spike"; the average replacement rate just grows and grows with time.
just assume 3 years (Score:5, Informative)
Faster, cheaper, more reliable (Score:3, Informative)
I've noticed this personally. Now, anecdotal evidence doesn't count for a lot, and it may be that we are pushing our drives harder. But back in the day of 40MB hard drives that cost a fortune, they used to last forever. The only drives I ever had fail on me in the old days were the SyQuest removable HD cartridges, for obvious reasons. But even they didn't fail that often, considering the extra wear-and-tear of having a removable platter with separate heads in the drive.
But these days, with our high-capacity ATA drives, I see hard drives failing every month. Sure, the drives are cheap and huge, but they don't seem to make them like they used to. I guess it's just a consequence of pushing the storage and speed to such high levels, and cheap mass-production. Although the drives are cheap, if somebody doesn't back up their data, the costs are incalculable if the data is valuable.
Re:Personally I am SHOCKED (Score:3, Informative)
Differentiating between "k" (=1000) and "Ki" (=1024) is a sign that the computer industry is finally maturing. It's called progress.
Off-Topic: SI Units (Score:5, Informative)
Not that this is actually relevant or anything, but there's been a long-standing schism between the computing community and the scientific community concerning the meaning of the SI prefixes Kilo, Mega, and Giga. Until computers showed up, Kilo, Mega, and Giga referred exclusively to multipliers of exactly 1,000, 1,000,000, and 1,000,000,000, respectively. Then, when computers showed up and people had to start speaking of large storage sizes, the computing guys overloaded the prefixes to mean powers of two which were "close enough." Thus, when one speaks of computer storage, Kilo, Mega, and Giga refer to 2**10, 2**20, and 2**30 bytes, respectively. Kilo, Mega, and Giga, when used in this way, are properly slang, but they've gained traction in the mainstream, causing confusion among members of differing disciplines.
As such, there has been a decree [nist.gov] to give the powers of two their own prefix names. The following have been established: Kibi (Ki) for 2**10, Mebi (Mi) for 2**20, and Gibi (Gi) for 2**30.
These new prefixes are gaining traction in some circles. If you have a recent release of Linux handy, type /sbin/ifconfig and look at the RX and TX byte counts. It uses the new prefixes.
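To put numbers on the gap (example figures are mine): a drive sold as "500 GB" uses the decimal meaning, while most OS tools report binary units, so the same capacity shows up as roughly 466 GiB.

advertised_bytes = 500 * 10**9   # "500 GB" as the vendor counts it (decimal)
gib = advertised_bytes / 2**30   # the same capacity in gibibytes (binary)
print(f"{advertised_bytes:,} bytes = {gib:.1f} GiB")
# 500,000,000,000 bytes = 465.7 GiB -- the "missing" ~7% is just a units mismatch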
Schwab
Re:Not So Fuzzy math (Score:4, Informative)
0.0088 * 15 = 0.132 (13%)
13% you say? The excerpt says 2%-4%, but RTA and you'll see that they report rates of up to 13% on some systems.
Re:Odd numbers for memory failure? (Score:1, Informative)
We see everything eventually die - power supplies, fans, motherboards, RAM, CPUs, drives. Nothing is immune from "wearing out" except maybe the boxes themselves.
This is only news... (Score:2, Informative)
I expect most desktop drives to last 5 years max. MAX. No manufacturer has an edge. It's just the way it is. MTBF is fiction.
For an always-on server, I expect failures about every 3-4 years. For my clients who cared enough to pay for the very best, I replaced the drives in the 3rd year without waiting for a failure. Zero failures just costs a bit more.
My experience is that Seagate and Fujitsu are my best server drives. IBM was also on the list, but I'm watching Hitachi. No decision.
The losers: Quantum (thankfully gone), Samsung (until recently), Maxtor. Not my opinion, my experience.
Now, in fairness, these are some of my historical losers:
Seagate: Early IDE drives and the 'stiction' problem. Remember banging drives to get them started?
Quantum 'Bigfoot' drives: popular in Compaq machines, these were built in a 5.25" form factor.
Seagate SCSI drives: Many different types had a bad habit of going off-line for no apparent reason. Your Novell server would log a 'device deactivated due to a non-media defect' error. Sometimes just restarting the bus controller would wake them up; sometimes it took repowering the drives. It would happen every few months, usually when I was elsewhere...
And then there was Miniscribe.
But MTBF numbers are universally fiction. Imagine trying to sell the idea of a wave bearing lasting 16 years to an engineer with real-world experience. I figure MTBF numbers come out of the marketing department.
-rick
Re:Personally I am SHOCKED (Score:3, Informative)
To clear up the confusion, a notation for binary multiples, as in 2^20 bytes, was developed. That would be a mebibyte.
http://en.wikipedia.org/wiki/Mebibyte [wikipedia.org]
Re:Repeat? (Score:4, Informative)
Believe me, they aren't determining an 11 year MTBF empirically.
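Right - nobody runs a single drive for a decade before printing the spec sheet. A vendor MTTF figure is typically extrapolated from a short qualification run on a large population: total accumulated drive-hours divided by the failures observed. A toy Python example (numbers made up) of how a million-hour MTTF can fall out of a few months of testing:

drives_on_test = 1000
test_hours = 2000          # a bit under three months per drive
failures_seen = 2

mttf_estimate = drives_on_test * test_hours / failures_seen
print(f"Estimated MTTF: {mttf_estimate:,.0f} hours "
      f"(~{mttf_estimate / (24 * 365):.0f} years)")
# Estimated MTTF: 1,000,000 hours (~114 years)

Whether that extrapolation says anything about year four or five of a drive's life is another question entirely.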
Re:Check SMART Info (Score:3, Informative)
The conclusions are roughly the following: a) if there are SMART errors, the disk will fail soon, b) if there are no SMART errors, the disk is still likely to fail. They saw no SMART errors on 36% of their failed disks.
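If you want to poll SMART yourself on a Linux box, something like the following sketch works (it assumes smartmontools is installed and that /dev/sda is the drive you care about):

import subprocess

def smart_health(device="/dev/sda"):
    # Ask smartctl for the drive's overall health self-assessment.
    out = subprocess.run(["smartctl", "-H", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "overall-health" in line or "SMART Health Status" in line:
            return line.strip()
    return "no health line found (is SMART enabled on this device?)"

print(smart_health())

Per the parent's point, though, a PASSED here is weak evidence: over a third of the failed drives in that study never logged a SMART error, so treat SMART as an early warning, not a clean bill of health.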
Re:Actually, one useful feature of Vista... (Score:4, Informative)
Ideal conditions vs. Real world (Score:3, Informative)
I have had two computers with power supply units that were "acting up." They ended up killing my hard drives on multiple occasions - Seagates, WDs, Maxtors, etc. It didn't matter what type of drive you put in these systems; the drive would die after anywhere from a week to two years. I later discovered that the power supplies were the problem, replaced them with brand new ones, and replaced the drives one last time. That was quite some time ago (years), and those drives, although small, still work, and have been transferred into newer computer systems since then. The PSUs were killing the drives; the drives weren't inherently bad and didn't have a manufacturing defect.

A friend of mine who lives in an apartment building constructed circa 1930 experienced similar problems with his drives. After just a few months, it seemed like his drives would spontaneously fail. When I tested his grounding plug, I found that it was carrying a voltage of about 30V (a hot ground - how wonderful). Since he moved out of that building and replaced his computer's PSU, no drive failures.
The same type of thing is true in automobile mileage testing. Car manufacturers must subject their cars to tests based on rules and procedures dictated by state and federal government agencies. These tests are almost never real world - driving on hilly terrain, through winds, with the headlights and window wipers on, plus the AC for defrost. They're based on a certain protocol developed in a laboratory to level the playing field and ensure that the ratings, for the most part, are similar. It simply means when you buy a new car, you can expect that under ideal conditions and at the beginning of the vehicle's life, it should BE ABLE to get the gas mileage listed on the window (based on an average sampling of the performance of many vehicles).
My point is that there really isn't a decent way to ensure that an estimated statistic is valid for individual situations. By modifying the environmental conditions, the "rules of the game" change. A data center with exceptional environmental control and voltage regulation systems, and top-quality server components (PSUs, voltage regulators, etc.), should expect to experience fewer drive failures per year than an old chicken-shack data center set up in some hillbilly's back yard out in the middle of nowhere where quality is the last thing on the IT team's mind.

It's impractical to expect that EVERY data center will be ideal - and since it's very, very difficult to do better than the "ideal" testing conditions used in the MTTF tests, real-life performance can only move towards more frequent and earlier failures. Using the car example above: since almost nobody is going to be using their vehicle in conditions BETTER than the ideal dictated by the protocols set forth by the government, and almost EVERYONE will be using their vehicles under worse conditions, the population average and median have nowhere to go but down. That doesn't mean the number is wrong; it just means that it's what the vehicle is capable of - but almost never demonstrates in practice - since ideal conditions in the real world are SO rare.