Slashdot
Disk Drive Failures 15 Times What Vendors Say
Posted by
Zonk
on Fri Mar 02, 2007 05:15 PM
from the cough-sputter-wheeze-choke dept.
jcatcw writes "A Carnegie Mellon University study indicates that customers are replacing disk drives more frequently than vendor estimates of mean time to failure (MTTF) would suggest. The study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC and SATA drives. The data sheets for the drives indicated MTTF between 1 and 1.5 million hours. That should mean annual failure rates of at most 0.88%, but annual replacement rates were between 2% and 4%. The study also shows no evidence that Fibre Channel drives are any more reliable than SATA drives."
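The 0.88% figure follows from the datasheet MTTF. A minimal sketch of that arithmetic, assuming a drive runs all 8,760 hours of the year and fails at a constant rate (the function names are illustrative and not taken from the study):

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation (assumption)

def afr_linear(mttf_hours):
    """Approximate annual failure rate as hours-per-year divided by MTTF."""
    return HOURS_PER_YEAR / mttf_hours

def afr_exponential(mttf_hours):
    """Annual failure probability under a constant-hazard (exponential) model."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)

for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} h: linear {afr_linear(mttf):.2%}, "
          f"exponential {afr_exponential(mttf):.2%}")
```

The linear form gives roughly 0.88% for a 1,000,000-hour MTTF and a lower figure for 1,500,000 hours, so 0.88% is the upper end of what the data sheets imply.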
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
it's relative. (Score:5, Funny)
Yeah, but I bet they didn't say what planet those hours are on.
Re: (Score:3, Funny)
Re:Personally I am SHOCKED (Score:5, Informative)
The same companies that lie about the capacity on EVERY SINGLE DRIVE they make? You don't think that they're a bunch of lying fucking weasels? (We're both using sarcasm here.)
I don't care how you spin it. 1024 is the multiple. NOT 1000!
Failure doesn't get fixed because making a drive more reliable means it costs more. If it costs more, it's not going to get purchased.
Parent
Re:Personally I am SHOCKED (Score:5, Funny)
The trick is to purchase your HD in pennies.
"100,000 pennies! why that's 1024 dollars!!"
Parent
Re:Personally I am SHOCKED (Score:4, Funny)
Why do you think FORTRAN is one of the oldest computing languages in existence?
Because it was invented before most other computer languages? Is this a trick question ;-)
Parent
In other news... (Score:5, Informative)
Even better ... (Score:4, Interesting)
Start with 100 drives. Continuous usage.
How many fail in the first 6 months? 12 months? 18 months?
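As a rough baseline for that question, here is a sketch of the expected number of failures among 100 drives, assuming a constant failure rate over the whole period. That assumption is a simplification made here for illustration; the rates plugged in are the 0.88% datasheet figure and the 2% and 4% replacement rates from the summary.

```python
import math

def expected_failures(n_drives, afr, months):
    """Expected cumulative failures under a constant-hazard model.

    afr is the annual failure rate as a fraction (0.03 means 3%).
    """
    hazard_per_year = -math.log(1.0 - afr)  # rate implied by the AFR
    years = months / 12.0
    return n_drives * (1.0 - math.exp(-hazard_per_year * years))

for afr in (0.0088, 0.02, 0.04):
    counts = [expected_failures(100, afr, m) for m in (6, 12, 18)]
    print(f"AFR {afr:.2%}: " + ", ".join(f"{c:.1f}" for c in counts)
          + " drives after 6, 12, 18 months")
```

If the real rate rises as drives age, the later figures would be higher than this baseline.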
Parent
Re:Even better ... (Score:5, Informative)
Unfortunately there is no big "spike"; the average replacement rate just grows and grows with time.
Parent
Re:In other news... (Score:4, Informative)
http://www.usenix.org/events/fast07/tech/schroede
Parent
I have thought the MTTF is bullshit for a while (Score:5, Interesting)
I don't consider myself a fluke because I know quite a few other people who have had similar problems. What's the deal?
Also, does anyone else find this quote interesting?:
"and may have failed for any reason, such as a harsh environment at the customer site and intensive, random read/write operations that cause premature wear to the mechanical components in the drive."
It's a f$#*ing hard drive! Jesus H Tapdancing Christ, how can they call that premature wear? Do they calculate the MTTF by just letting the drive sit idle and never reading or writing to it? That actually wouldn't surprise me.
Check SMART Info (Score:4, Interesting)
To view the SMART info for a drive, do:
smartctl -a
To do a full disk read check (can take hours) do:
smartctl -t long
Sadly, I just found read errors on a 375-hour-old drive (manufacturer's software claimed that repair succeeded). Fortunately, they were on the Windows partition.
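smartctl expects a device path after its options. A small sketch of wrapping the two invocations above from a script, assuming smartmontools is installed and using /dev/sda purely as a placeholder device (the helper names are made up for this example):

```python
import subprocess

def smartctl_command(device, long_test=False):
    """Build the smartctl argument list for a device path.

    -a prints all SMART information; -t long starts an extended self-test.
    """
    action = ["-t", "long"] if long_test else ["-a"]
    return ["smartctl", *action, device]

def run_smartctl(device, long_test=False):
    """Run smartctl and return its text output (usually needs root)."""
    # check=False: smartctl uses non-zero exit codes to flag drive problems,
    # so a non-zero status is not necessarily a failure to run.
    result = subprocess.run(
        smartctl_command(device, long_test),
        capture_output=True, text=True, check=False,
    )
    return result.stdout

if __name__ == "__main__":
    # /dev/sda is a placeholder; substitute the drive you want to inspect.
    print(" ".join(smartctl_command("/dev/sda")))
    print(" ".join(smartctl_command("/dev/sda", long_test=True)))
```

Run as written, the script only prints the command lines; call run_smartctl yourself once the device path is right.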
Seagate (Score:4, Insightful)
Actually, one useful feature of Vista... (Score:5, Interesting)
...is that it detects SMART disk errors in normal use (i.e. you don't have to be watching the BIOS screens when your PC boots).
When I was trying the Vista RC, it told me that my drive was close to failing. I, of course, didn't believe it at first, but I ran the Seagate test floppy and it agreed. So I sent it back to Seagate for a free replacement.
About the only feature that impressed me in Vista, sadly. (And I'm not sure it should have impressed me, tbh. I'm assuming XP never did this as I've never seen/heard of such a feature.)
Re:Actually, one useful feature of Vista... (Score:4, Informative)
Parent
Re:Repeat? (Score:4, Informative)
Parent
Redundancy (Score:4, Funny)
Parent
Re:Redundancy (Score:5, Funny)
Parent
Re:Repeat? (Score:5, Informative)
"If they told me it was 100,000 hours, I'd still protect it the same way. If they told me it was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."
Just common sense.
When a vendor tells you to expect a 0.2% failure rate, but it's really 2-4%, that's a HUGE shift in the impact to your organization.
When you just have one or a handful of disks in your server at home, that's a very different situation from a datacenter full of systems with all kinds of disk needs.
Parent
just assume 3 years (Score:5, Informative)
Parent
Re:Repeat? (Score:4, Informative)
Believe me, they aren't determining an 11 year MTBF empirically.
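For illustration, one textbook way to arrive at a figure like that without waiting 11 years is to divide accumulated drive-hours by the failures seen in a short test of many drives. The sketch below uses made-up numbers and is not any vendor's confirmed procedure:

```python
HOURS_PER_YEAR = 24 * 365

def mtbf_from_test(n_drives, test_hours, failures):
    """Estimate MTBF as accumulated drive-hours divided by observed failures."""
    if failures == 0:
        raise ValueError("no failures observed; this test gives no finite estimate")
    return n_drives * test_hours / failures

# Hypothetical test: 1,000 drives run for 1,000 hours (about six weeks), 10 fail.
mtbf = mtbf_from_test(1_000, 1_000, 10)
print(f"MTBF ~ {mtbf:,.0f} hours ~ {mtbf / HOURS_PER_YEAR:.1f} years")
```

An estimate built this way says nothing about how drives behave once they are actually several years old.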
Parent
Re:Interface matters why? (Score:5, Informative)
Fibre Channel drives, like SCSI drives, are assumed to be "enterprise" drives and therefore better built than "consumer" SATA and PATA drives. It's nothing inherent to the interface, but a consequence of the environment in which that interface is expected to be used. At least, that's the idea.
Parent
Re:This study is useless. (Score:4, Interesting)
I don't really care to know exactly what is wrong with the drive. If I replace it and the problem goes away, I would consider that a bad drive, even if you could still read and write to it. I just did one this morning that showed no symptoms other than Windows taking what I considered a long time to boot. All the user complained about was sluggish performance, and there were no errors or drive noises to speak of. Problem fixed, user happy, drive bad.
As I already posted, a good rule of thumb is that most drives go bad 3 years from the date of manufacture.
Parent
Off-Topic: SI Units (Score:5, Informative)
Not that this is actually relevant or anything, but there's been a long-standing schism between the computing community and the scientific community concerning the meaning of the SI prefixes Kilo, Mega, and Giga. Until computers showed up, Kilo, Mega, and Giga referred exclusively to multipliers of exactly 1,000, 1,000,000, and 1,000,000,000, respectively. Then, when computers showed up and people had to start speaking of large storage sizes, the computing guys overloaded the prefixes to mean powers of two which were "close enough." Thus, when one speaks of computer storage, Kilo, Mega, and Giga refer to 2**10, 2**20, and 2**30 bytes, respectively. Kilo, Mega, and Giga, when used in this way, are properly slang, but they've gained traction in the mainstream, causing confusion among members of differing disciplines.
As such, there has been a decree [nist.gov] to give the powers of two their own prefix names. The following have been established:
Kibi (Ki): 2**10 = 1,024
Mebi (Mi): 2**20 = 1,048,576
Gibi (Gi): 2**30 = 1,073,741,824
These new prefixes are gaining traction in some circles. If you have a recent release of Linux handy, type /sbin/ifconfig and look at the RX and TX byte counts. It uses the new prefixes.
Schwab
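To make the gap between the two conventions concrete, here is a short sketch that converts a byte count both ways. The 500 GB drive is a hypothetical example, and the unit tables cover only the three prefixes discussed above.

```python
DECIMAL = {"kB": 10**3, "MB": 10**6, "GB": 10**9}    # SI: powers of 1,000
BINARY = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}  # binary: powers of 1,024

def convert(n_bytes, unit):
    """Express a byte count in the given decimal or binary unit."""
    factor = DECIMAL.get(unit) or BINARY[unit]
    return n_bytes / factor

advertised = 500 * DECIMAL["GB"]  # hypothetical drive sold as "500 GB"
print(f"{convert(advertised, 'GB'):.2f} GB = {convert(advertised, 'GiB'):.2f} GiB")
print(f"A GiB is {BINARY['GiB'] / DECIMAL['GB'] - 1:.1%} larger than a GB")
```

The difference grows with each prefix step, which is why it is more noticeable on gigabyte-scale drives than it was on kilobyte-scale media.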
Parent
Re:Not So Fuzzy math (Score:4, Informative)
0.0088 * 15 = 0.132 (13%)
13% you say? The excerpt says 2%-4%. RTA, though, and you'll see they report up to 13% on some systems.
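The same check can be run in the other direction, dividing each reported rate by the datasheet figure. A quick sketch, taking 0.88% as the datasheet rate (the function name is illustrative):

```python
DATASHEET_AFR = 0.0088  # 0.88%, implied by a 1,000,000-hour MTTF

def times_datasheet(observed_afr, datasheet_afr=DATASHEET_AFR):
    """How many times the datasheet rate an observed replacement rate is."""
    return observed_afr / datasheet_afr

for observed in (0.02, 0.04, 0.13):
    print(f"{observed:.0%} observed is {times_datasheet(observed):.1f}x the datasheet rate")
```

Only the 13% figure comes out near the 15x in the headline; the 2% to 4% range is a much smaller multiple.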
Parent
Re:Masters of estimates (Score:4, Insightful)
Well, the hard-drive makers are correct on the size thing - a Gigabyte is 1000 Megabytes, and the OS and software makers are wrong.
Yeah, they coined the term and have been using it for 40 years, but they're wrong.
Gigabytes are actually displayed as Gigabytes, or that the listing is changed to correctly display Gibibytes as the value? (or Kibibytes, Mebibytes, whatever)
Listen, just because someone comes up with a standard doesn't obligate everyone to use it, especially when they already have a perfectly workable system. Claiming that NIST can impose an unwanted standard on the world is like saying that it isn't a word until the OED lists it.
Parent