Disk Drive Failures 15 Times What Vendors Say
jcatcw writes "A Carnegie Mellon University study indicates that customers are replacing disk drives more frequently than vendor estimates of mean time to failure (MTTF) would suggest. The study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC and SATA drives. The data sheets for the drives indicated MTTF between 1 and 1.5 million hours. That should mean annual failure rates of at most 0.88%, yet observed annual replacement rates were between 2% and 4%. The study also shows no evidence that Fibre Channel drives are any more reliable than SATA drives."
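For reference, the 0.88% figure is just the datasheet MTTF converted to an annual rate. A minimal sketch of that conversion (it assumes 24x7 operation and a constant failure rate, which is exactly the assumption the study questions):

    # Rough conversion from datasheet MTTF to an expected annual failure rate (AFR).
    HOURS_PER_YEAR = 24 * 365  # 8760

    def annual_failure_rate(mttf_hours: float) -> float:
        return HOURS_PER_YEAR / mttf_hours

    print(f"{annual_failure_rate(1_000_000):.2%}")   # ~0.88%
    print(f"{annual_failure_rate(1_500_000):.2%}")   # ~0.58%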
Repeat? (Score:2, Insightful)
Re:Repeat? (Score:4, Informative)
Redundancy (Score:4, Funny)
Re:Redundancy (Score:5, Funny)
Re: (Score:2)
Redundant Array of Imitating Duplicates
Re: (Score:2, Interesting)
The best part about the entire thing is the very last quote:
"If they told me it was 100,000 hours, I'd still protect it the same way. If they told me if was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."
Just common sense.
Re:Repeat? (Score:5, Informative)
"If they told me it was 100,000 hours, I'd still protect it the same way. If they told me if was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."
Just common sense.
When a vendor tells you to expect a failure rate well under 1%, but it's really 2-4%, that's a HUGE shift in the impact on your organization.
When you just have one or a handful of disks in your server at home, that's a very different situation from a datacenter full of systems with all kinds of disk needs.
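To put illustrative numbers on that (the fleet size here is hypothetical, not from the study):

    # Illustrative only: expected annual replacements for a hypothetical fleet,
    # comparing the datasheet-implied rate with the rates observed in the study.
    fleet_size = 10_000  # hypothetical number of drives

    for label, afr in [("datasheet (~0.88%)", 0.0088),
                       ("observed low (2%)", 0.02),
                       ("observed high (4%)", 0.04)]:
        print(f"{label}: ~{fleet_size * afr:.0f} drive replacements per year")

At datacenter scale, the difference between ~90 and ~400 replacements a year is real money and real labor.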
just assume 3 years (Score:5, Informative)
Re: (Score:2)
Then again, considering the assembly-line efficiency and relative consistency with which devices and components are made these days, maybe it isn't
Re:Repeat? (Score:4, Informative)
Believe me, they aren't determining an 11 year MTBF empirically.
Re: (Score:3, Funny)
Re: (Score:3, Funny)
Re: (Score:2)
it's relative. (Score:5, Funny)
Yeah, but I bet they didn't say what planet those hours are on.
Re: (Score:3, Funny)
It's not relative. (Score:2)
Re: (Score:2)
Personally I am SHOCKED (Score:2, Insightful)
I propose a new term for the heinous practice---"marketing".
Re:Personally I am SHOCKED (Score:5, Informative)
The same companies that lie about the capacity on EVERY SINGLE DRIVE they make? You don't think that they're a bunch of lying fucking weasels? (We're both using sarcasm here.)
I don't care how you spin it. 1024 is the multiple. NOT 1000!
Failure doesn't get fixed because making a drive more reliable means it costs more. If it costs more, it's not going to get purchased.
Re: (Score:3, Informative)
Differentiating between "k" (=1000) and "ki" (=1024) is a sign that the computer industry is finally maturing. It's called progress.
Re: (Score:3)
Re:Personally I am SHOCKED (Score:4, Funny)
Why do you think FORTRAN is one of the oldest computing languages in existence?
Because it was invented before most other computer languages? Is this a trick question ;-)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Informative)
To clear up the confusion, a separate notation for binary multiples was developed; 2^20 bytes, for example, would be a Mebibyte.
http://en.wikipedia.org/wiki/Mebibyte [wikipedia.org]
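A quick illustration of why the two conventions diverge (the "500 GB" drive here is just an example):

    # Decimal (vendor) vs. binary (traditional OS) interpretation of "500 GB".
    advertised_bytes = 500 * 10**9        # what the label means: 500,000,000,000 bytes
    gib = advertised_bytes / 2**30        # what an OS using binary units reports
    print(f"{gib:.1f} GiB")               # ~465.7 GiB -- same bytes, different units

The gap grows with each prefix step: about 2.4% at the kilo level, 4.9% at mega, 7.4% at giga.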
Re:Personally I am SHOCKED (Score:5, Funny)
The trick is to purchase your HD in pennies.
"100,000 pennies! why that's 1024 dollars!!"
And that's a really wide range (Score:2, Funny)
Re: (Score:2)
Well, they don't call it "Best Borrow" for no reason.
Re: (Score:2)
unless they warranty this, which none do, the spec is meaningless, and they might
In other news... (Score:5, Informative)
Even better ... (Score:4, Interesting)
Start with 100 drives. Continuous usage.
How many fail in the first 6 months? 12 months? 18 months?
Re: (Score:2)
Re: (Score:3, Funny)
Re:Even better ... (Score:5, Informative)
Unfortunately there is no big "spike"; the average replacement rate just grows and grows with time.
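One way to picture a replacement rate that "grows and grows" is an increasing hazard function. A toy model only (the shape and scale parameters below are made up for illustration, not fitted to the study's data):

    # Toy model: an increasing Weibull hazard vs. the constant rate a datasheet implies.
    def weibull_hazard(t_years: float, shape: float = 1.5, scale: float = 30.0) -> float:
        # h(t) = (k/lambda) * (t/lambda)^(k-1); rises with t whenever shape > 1
        return (shape / scale) * (t_years / scale) ** (shape - 1)

    for year in range(1, 6):
        print(f"year {year}: hazard ~{weibull_hazard(year):.3f} vs constant 0.009")

With a shape parameter above 1 the yearly failure probability keeps climbing with drive age instead of sitting on the flat bottom of the "bathtub curve" the datasheets assume.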
Re: (Score:2)
Re:In other news... (Score:4, Informative)
http://www.usenix.org/events/fast07/tech/schroede
I believe it... (Score:3, Informative)
Re: (Score:2, Insightful)
Re: (Score:2, Insightful)
Re: (Score:2)
or BEFORE... (Score:2)
As Schwartz [sun.com] put it recently, there are two kinds of disk: Those that have failed, and those that are going to.
Before that (Score:2)
Now in some cases manufacturers with longer warranties are stating that they have more faith in their product, and certainly the sudden drop in warranty length (from 2-3 years down to one for many) indicates a lack of faith in their products.
Basically, a w
Fuzzy math (Score:2, Insightful)
0.88 * 15 = 4?
Re:Not So Fuzzy math (Score:4, Informative)
0.0088 * 15 = 0.132 (13%)
13% you say? The excerpt says 2%-4%. RTFA and you'll see that they report up to 13% on some systems.
This study is useless. (Score:3, Interesting)
This study is not news. All it says is that people *think* their hard drives fail more often than the mean time to failure would suggest.
Re: (Score:2)
Re:This study is useless. (Score:4, Interesting)
I don't really care to know exactly what is wrong with the drive. If I replace it, and the problem goes away, I would consider that a bad drive, even if you could still read and write to it. I just did one this morning that showed no symptoms other than Windows taking what I considered a long time to boot. All the user complained about was sluggish performance, and there were no errors or drive noises to speak of. Problem fixed, user happy, drive bad.
As I already posted, a good rule of thumb is that most drives go bad about 3 years from the date of manufacture.
Re: (Score:2)
Interface matters why? (Score:3, Interesting)
Re: (Score:3, Insightful)
That statement is based on the long-held assumption that hard drive manufacturers put better materials and engineering into enterprise-targeted drives [Fibre] than they put into consumer-level drives [SATA].
Guess not...
Re: (Score:3, Informative)
Re: (Score:2)
FTA:
"the things that can go wrong with a drive are mechanical -- moving parts, motors, spindles, read-write heads," and these components are usually the same"
The only effect I can see it having would be if really shitty parts were used for one interface compared to the other.
Re: (Score:2)
Re:Interface matters why? (Score:5, Informative)
Fibre Channel drives, like SCSI drives, are assumed to be "enterprise" drives and therefore better built than "consumer" SATA and PATA drives. It's nothing inherent to the interface, but a consequence of the environment in which that interface is expected to be used. At least, that's the idea.
Re: (Score:2)
Because drive manufacturers claim [usenix.org] they use different hardware for the drive based on the interface. For example, a SCSI drive supposedly contains a disk designed for heavier use than an ATA drive, they aren't just the same disk with different interfaces.
Re: (Score:2)
I have thought the MTTF is bullshit for a while (Score:5, Interesting)
I don't consider myself a fluke because I know quite a few other people who have had similar problems. What's the deal?
Also, does anyone else find this quote interesting?:
"and may have failed for any reason, such as a harsh environment at the customer site and intensive, random read/write operations that cause premature wear to the mechanical components in the drive."
It's a f$#*ing hard drive! Jesus H Tapdancing Christ, how can they call that premature wear? Do they calculate the MTTF by just letting the drive sit idle and never reading or writing to it? That actually wouldn't surprise me.
Re: (Score:2)
Re: (Score:2)
MTTF, no? MTBF would indicate a fixable system.
Yeah, but there has to be a plateau to the heat curve at some point. It's not as if the heat just keeps going up and up.. I would think that the constant on/off each day, causing expansion and contraction of the parts as they heat and cool, woul
Re: (Score:2)
Acronyms schmackronyms... anyway, I found at least one paper that I read in the past that states the 8 hours/day thing I was referring to: http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf [seagate.com]
The 8 hours/day is referring to personal storage (as opposed to enterprise storage systems,) and this discussion is supposed to be about enterprise storage, so I'm off topic anyway. (BTW, the whitepaper I linked to does specify it as MTBF, for what it's worth)
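Rough arithmetic on why the duty-cycle assumption matters (my numbers, purely illustrative, not the whitepaper's):

    # Illustrative: how a power-on-hours assumption changes calendar-time expectations.
    # If a desktop drive's reliability figures assume ~8 hours/day of use, running it
    # 24x7 accumulates power-on hours roughly 3x faster.
    desktop_hours_per_year = 8 * 365      # ~2,920
    continuous_hours_per_year = 24 * 365  # 8,760
    print(continuous_hours_per_year / desktop_hours_per_year)  # 3.0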
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Sure, they will have reached "thermal equilibrium" after a short period of time. See Figure 9, "Reliability reduction with increased power on hours, ranging from a few hours per day to 24 x 7 operation," in this paper [seagate.com] to see why I'm not sure that merely being hot is the problem.
Re: (Score:2)
The graph mostly seems to indicate that drives wear out when they are spinning. It's not all that far from a straight line (if you ignore the very low hours), which you would expect if wear was a significant component in the risk
Re: (Score:2)
Perhaps you misinterpreted the label on the y-axis of that figure. It is not in percent, it is a multiplier. So 0.5 means 50%.
Quoting the paper, emphasis mine:
Re: (Score:2)
Oops, indeed I did. It only scales my interpretation, rather than contradicting it, though. It still indicates that wear is highly significant (which I expected, but previously erroneously assumed was accounted for by MTBFs applying to power-on time rather than calendar time).
Re: (Score:2)
Re: (Score:2)
Point, counterpoint...
I've never had a single one of my own hard drives fail. Not a single one, ever. I've had a dozen or so that I can remember, from the 20MiB drive in my Amiga to the 250GiB that now hangs off my NSLU2. They are all either still functioning or became obsolete before failing. Many of them have been run 24/7 for significant chunks of their lives and I don't replace them unl
Re: (Score:3, Insightful)
RAM has no significant inductive load.
I am shocked! (Score:2, Insightful)
Off-Topic: SI Units (Score:5, Informative)
Not that this is actually relevant or anything, but there's been a long-standing schism between the computing community and the scientific community concerning the meaning of the SI prefixes Kilo, Mega, and Giga. Until computers showed up, Kilo, Mega, and Giga referred exclusively to multipliers of exactly 1,000, 1,000,000, and 1,000,000,000, respectively. Then, when computers showed up and people had to start speaking of large storage sizes, the computing guys overloaded the prefixes to mean powers of two which were "close enough." Thus, when one speaks of computer storage, Kilo, Mega, and Giga refer to 2**10, 2**20, and 2**30 bytes, respectively. Kilo, Mega, and Giga, when used in this way, are properly slang, but they've gained traction in the mainstream, causing confusion among members of differing disciplines.
As such, there has been a decree [nist.gov] to give the powers of two their own prefix names. The following have been established: kibi (Ki) = 2^10, mebi (Mi) = 2^20, gibi (Gi) = 2^30, tebi (Ti) = 2^40, pebi (Pi) = 2^50, and exbi (Ei) = 2^60.
These new prefixes are gaining traction in some circles. If you have a recent release of Linux handy, type /sbin/ifconfig and look at the RX and TX byte counts. It uses the new prefixes.
Schwab
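The kind of formatting ifconfig does is trivial to reproduce, by the way. A quick sketch (my own code, not ifconfig's):

    # Sketch of formatting a byte count with IEC binary prefixes, as newer tools do.
    def format_binary(n_bytes: float) -> str:
        for prefix in ("", "Ki", "Mi", "Gi", "Ti"):
            if n_bytes < 1024:
                return f"{n_bytes:.1f} {prefix}B"
            n_bytes /= 1024
        return f"{n_bytes:.1f} PiB"

    print(format_binary(123_456_789))  # "117.7 MiB"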
Having read the paper and seen the talk... (Score:2, Informative)
Corporations misrepresent products, news at 11:00! (Score:2)
Is there anyone out there that actually believed the published MTBF figures, even BEFORE these articles came out?
It's hard to take someone seriously when they claim that their drives have a 100+ year MTBF, especially since precious few are still functional after 1/10th of that much use. To make it better, many drives are NOT rated for continuous use, but only a certain number of hours per day. I didn't know that anyone EVER believed the MTBF B.S.
Re:Corporations misrepresent products, news at 11: (Score:2)
You're misinterpreting MTBF. A 100 year MTBF does not mean the drive will last 100 years; it means that 1 in 100 drives will fail each year. There will be another spec somewhere which specifies the design lifetime. For the Fujitsu MHT2060AT [fujitsu.com] drive which was in my laptop, the MTBF is 300,000 hours, but the component life is a crappy 20,000 hours.
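Running the parent's numbers with the same MTTF-to-AFR conversion as above (illustrative; the laptop drive obviously isn't run 24x7, so the design-life figure is a worst case):

    # Illustrative: quoted MTBF vs. the component (design) life for that drive.
    HOURS_PER_YEAR = 24 * 365
    mtbf_hours = 300_000
    component_life_hours = 20_000

    print(f"implied AFR: {HOURS_PER_YEAR / mtbf_hours:.1%}")                          # ~2.9%
    print(f"design life at 24x7: {component_life_hours / HOURS_PER_YEAR:.1f} years")  # ~2.3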
Check SMART Info (Score:4, Interesting)
To view the SMART info for a drive (the device name here is an example; substitute your own), do:
smartctl -a /dev/sda
To do a full disk read check (can take hours), do:
smartctl -t long /dev/sda
Sadly, I just found read errors on a 375-hour-old drive (manufacturer's software claimed that repair succeeded). Fortunately, they were on the Windows partition
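If you'd rather not remember to run it by hand, a small wrapper around smartctl works. A sketch only: it assumes smartmontools is installed, usually needs root, and just pulls out the overall health line; the device path is an example.

    # Sketch: poll smartctl's overall health assessment for a drive.
    import subprocess

    def smart_health(device: str = "/dev/sda") -> str:
        out = subprocess.run(["smartctl", "-H", device],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if "overall-health" in line:
                return line.strip()
        return "no health line found (is SMART enabled?)"

    print(smart_health())

Drop something like that into cron and mail yourself the output.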
Re: (Score:2)
The last survey that popped up here said that if SMART says your drive will fail, it probably will, but if SMART doesn't say it will fail, it doesn't mean much.
Suffice to say that you should never trust any piece of hardware that thinks it's SMARTer than you are.
Re: (Score:2)
Yes, that was the Google study [slashdot.org]. So, if SMART says there is a problem, you should pay attention to it. If SMART doesn't find a problem, that doesn't mean you are out of the woods.
Re: (Score:2)
hda has had 356 errors in its short life (I've had it about a year; 200GB Seagate IDE)
hdc has had 4,560 errors in its life (after nearly 3 years of service; 80GB Maxtor IDE)
That doesn't sound good to me.
I got the Seagate because my previous drive had failed fsck a few times and had some dodgy-looking data on it.
These figures suggest about 1 error/day for the Seagate, and 4 errors/day for the Maxtor.
I don't li
Re: (Score:3, Informative)
The conclusions are roughly the following: a) if there are SMART errors, the disk will fail soon, b) if there are no SMART errors, the disk is still likely to fail. They saw no SMART errors on 36% of their failed d
RAID = Redundant Articles of Identical Discourse (Score:3, Informative)
Slashdot has a high rate of RAID, which is a bad thing. Which is a bad thing. It has been a whole 9 days. Slashdot needs a story moderation system so dupe articles can get modded out of existence. Ditto for slashdot editors who do the duping!
Can we get redundant posting on the story about google's paper [slashdot.org]?
Re:RAID = Redundant Articles of Identical Discours (Score:2)
Firehose (Score:2)
I gave this story both a thumbs down and dupe feedback; however, so many other people moderated the story up that it was at the highest (visible) ranking by the time it got posted. Apparently a bunch of people missed the
Unfortunately (Score:2)
Re: (Score:2)
Odd numbers for memory failure? (Score:2)
Re: (Score:3, Interesting)
Which is why I use Samsung (Score:2)
No way (Score:2, Funny)
Seagate (Score:4, Insightful)
Re: (Score:3, Funny)
Faster, cheaper, more reliable (Score:3, Informative)
I've noticed this personally. Now, anecdotal evidence doesn't count for a lot, and it may be the case that we are pushing our drives more. But back in the day of 40MB hard drives that cost a fortune, they used to last forever. The only drives I ever had fail on me in the old days were the SyQuest removable HD cartridges, for obvious reasons. But even they didn't fail that often, considering the extra wear-and-tear of having a removable platter with separate heads in the drive.
But these days, with our high-capacity ATA drives, I see hard drives failing every month. Sure, the drives are cheap and huge, but they don't seem to make them like they used to. I guess it's just a consequence of pushing the storage and speed to such high levels, and cheap mass-production. Although the drives are cheap, if somebody doesn't back up their data, the costs are incalculable if the data is valuable.
A Story (Score:2)
Besides nostalgia, there wasn't a lot I could do with a giant, noisy 486 anymore, so I ended up just pulling the SC
Re: (Score:2)
Actually, one useful feature of Vista... (Score:5, Interesting)
...is that it detects SMART disk errors in normal use (i.e. you don't have to be watching the BIOS screens when your PC boots).
When I was trying the Vista RC, it told me that my drive was close to failing. I, of course, didn't believe it at first, but I ran the Seagate test floppy and it agreed. So I sent it back to Seagate for a free replacement.
About the only feature that impressed me in Vista, sadly. (And I'm not sure it should have impressed me, tbh. I'm assuming XP never did this as I've never seen/heard of such a feature.)
Re:Actually, one useful feature of Vista... (Score:4, Informative)
Ideal conditions vs. Real world (Score:3, Informative)
I have had two computers with power supply units that were "acting up." They ended up killing my hard drives on multiple occasions - Seagates, WDs, Maxtors, etc. It didn't matter what type of drive you put in these systems; the drive would die after anywhere from a week to two years. I later discovered that the power supplies were the problem, replaced them with brand new ones, and replaced the drives one last time. That was quite some time ago (years), and those drives, although small, still work, and have been transferred into newer computer systems since that time. The PSU was killing the drives; they weren't inherently bad and didn't have a manufacturing defect.
A friend of mine who lives in an apartment building constructed circa 1930 experienced similar problems with his drives. After just a few months, it seemed like his drives would spontaneously fail. When I tested his grounding plug, I found that it was carrying a voltage of about 30V (a hot ground - how wonderful). Since he moved out of that building and replaced his computer's PSU, no drive failures.
The same type of thing is true in automobile mileage testing. Car manufacturers must subject their cars to tests based on rules and procedures dictated by state and federal government agencies. These tests are almost never real world - driving on hilly terrain, through winds, with the headlights and window wipers on, plus the AC for defrost. They're based on a certain protocol developed in a laboratory to level the playing field and ensure that the ratings, for the most part, are similar. It simply means when you buy a new car, you can expect that under ideal conditions and at the beginning of the vehicle's life, it should BE ABLE to get the gas mileage listed on the window (based on an average sampling of the performance of many vehicles).
My point is that there really isn't a decent way to go about ensuring that an estimated statistic is valid for individual situations. By modifying the environmental conditions, the "rules of the game" change. A data-center with exceptional environmental control and voltage regulation systems, and top-quality server components (PSU's, voltage regulators, etc.) should expect to experience fewer drive failures per year than the drives found in an old chicken-shack data center set up in some hillbilly's back yard out in the middle of nowhere where quality is the last thing on the IT team's mind. It's impractical to expect that EVERY data center will be ideal - and since it's very very difficult to have better than the "ideal" testing conditions used in the MTTF tests - the real-life performance can only move towards more frequent and early failures. Using the car example above, since almost nobody is going to be using their vehicle in conditions BETTER than the ideal dictated by the protocols set forth by the government, and almost EVERYONE will be using their vehicles under worse conditions, the population average and median have nowhere to go but down. That doesn't mean the number is wrong, it just means that it's what the vehicle is capable of - but almost never demonstrates in terms of its performance - since ideal conditions in the real world are SO rare.
Re: (Score:2)
In fact, my TRS-80 is still functional too...the tape drive is a little wonky, but what are ya gonna do?
Re: (Score:2)
Re:Masters of estimates (Score:4, Insightful)
Well, the hard-drive makers are correct on the size thing - a Gigabyte is 1000 Megabytes, and the OS and software makers are wrong.
Yeah, they coined the term and have been using it for 40 years, but they're wrong.
So would you rather that Gigabytes are actually displayed as Gigabytes, or that the listing is changed to correctly display Gibibytes as the value? (Or Kibibytes, Mebibytes, whatever.)
Listen, just because someone comes up with a standard doesn't obligate everyone to use it, especially when they already have a perfectly workable system. Claiming that NIST can impose an unwanted standard on the world is like saying that it isn't a word until the OED lists it.
Re: (Score:2)