Google Releases Paper on Disk Reliability
oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"
Great (Score:5, Funny)
Re: (Score:3, Funny)
Translation: Run hot, have high failure rate. (Score:3, Insightful)
This seems to be the most relevant sentence: "What stands out are the 3 and 4- year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced." (Page 5, Section 3.4, 4th paragraph)
Often poor communication in research pages is intended to hide the fact that the results are not very useful. The above sentence can be translated to: "If you run hard drives hot, afte
Google being stupid: 2 approximately equal #'s... (Score:3, Insightful)
Translation: The number of hours the drives are powered is the same as the age of the drives, since the drives are always powered.
When two numbers are close to equ
Re:Proprietary reporting (Score:5, Insightful)
You really didn't read the article, did you? On page 3 (Section 2.2 Deployment Details), the authors state: "More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units were put into production in or after 2001. [...] The data used for this study were collected between December 2005 and August 2006."
What are you waiting for Google to tell you? Are you really accusing them of being evil because they did a study, described their methodology, detailed their results, presented their analyses, and published it all for anyone who is interested?
You describe their conclusions as:
But there is no contradiction at all if you are smart enough to understand. They are telling you that if SMART identifies a problem with a drive then it is very likely that drive will fail within 60 days. But in a sample of 100,000 drives, many drives will also fail that have not returned errors on SMART scans. Thus SMART is a reliable indicator of impending failure but is not a silver bullet that can recognize and predict all failures before they happen.
Next time you have access to 100,000 hard drives, can analyze patterns of failure among them, can use those failures as a benchmark against which to measure analysis tools, and can come up with better recommendations for predicting failure than this study, then by all means let us know. But if you're looking for Microsoft or Western Digital or Seagate or Yahoo to perform and publish this kind of study for free, I think you may be waiting a good long while.
So SMART is specific, but not sensitive. (Score:4, Insightful)
Sensitivity/specificity always presents a balancing act of testing, and they are usually in a push/pull relationship. If you make a test too sensitive, then you get too many false positives, and wind up overtreating something (i.e. the test says it might fail so you replace the drive even though it's not going to - a false alert).
If you make the test too specific, then usually you wind up decreasing its sensitivity, or ability to detect something. Now you get false negatives, so when the test fires, you can be sure that it's accurate, but it doesn't always detect the problem.
What you want to know is the Positive Predictive Value PPV, which is determined by the formula PPV=TP/(TP+FP), where TP = true positives, FP = false positives.
Also useful is the Negative Predictive Value NPV, or this formula NPV=TN/(FN+TN) where TN = true negative, FN = false negative.
What information these give is as follows. If a test is positive (i.e. the drive temperature is >80 C), then it will accurately predict that the drive will fail. If the test is negative (drive temp 40 C), then it accurately predicts that the drive is ok.
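To make the two formulas above concrete, here is a minimal Python sketch. The counts are hypothetical, chosen only for illustration, and are not taken from the Google paper; the function name is likewise made up for this example.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV, sensitivity, specificity) from confusion-matrix counts."""
    ppv = tp / (tp + fp)          # of the drives flagged, how many really failed
    npv = tn / (tn + fn)          # of the drives not flagged, how many stayed healthy
    sensitivity = tp / (tp + fn)  # of the drives that failed, how many were flagged
    specificity = tn / (tn + fp)  # of the healthy drives, how many were not flagged
    return ppv, npv, sensitivity, specificity


# Hypothetical counts for 1,000 drives (assumption, not data from the paper).
ppv, npv, sens, spec = predictive_values(tp=30, fp=10, tn=920, fn=40)
print(f"PPV={ppv:.2f} NPV={npv:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

With these made-up counts the test is specific but not sensitive: most healthy drives are never flagged, yet more than half of the failures arrive without a warning.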
Re: (Score:3, Informative)
To me it's useful - if I get a SMART warning, then I'm definitely backing up my drive and will replace it before it croaks. ...
What information these give is as follows. If a test is positive (i.e. the drive temperature is >80 C), then it will accurately predict that the drive will fail. If the test is negative (drive temp 40 C), then it accurately predicts that the drive is ok.
But according to the paper none of the SMART parameters was very useful in this regard. Over 50% of drive failures were not predicted by SMART errors, so the "negative test" can't give much confidence that the drive is ok. Conversely, while some types of SMART error (e.g. scan errors) indicated a much higher probability of impending failure, they still weren't all that indicative. 70% of drives that reported a scan error were still functioning normally after 8 months. So the "positive test" isn't all that con
Re: (Score:3, Informative)
What he/she/it is looking for (Score:3, Interesting)
It is also interesting to note the magnificent jump in failure rates once the drives get outside the three year warranty period. No coincidence there.
Re:Proprietary reporting (Score:4, Interesting)
The amount of positive press they get from these types of releases easily justifies the effort to polish internal reports up to a publication standard. By releasing these types of papers, others may change their buying habits, which in turn will change the products sold. Google may believe that these types of papers would cause shame, not for individual manufacturers, but for the industry as a whole, and thus cause better products to be produced.
Re: (Score:3, Informative)
Re: (Score:3, Interesting)
Hmm (Score:2, Interesting)
Re:Hmm (Score:5, Funny)
Didn't read the summary? (Check)
Congratulations, you're not officially a slashdot regular!
Re:Hmm (Score:4, Funny)
Congratulations, you're now officially a slashdot regular! - Pug
Re: (Score:3, Funny)
Didn't hit the 'Preview' button first? (Check)
Congratulations, you are too!
Re:Hmm (Score:4, Informative)
So, if you have errors in those highly correlated categories your drives are probably going to fail, but if you do not have errors in these categories your drives can still fail.
Re: (Score:3, Interesting)
It isn't even that good. Many of the failure flags indicate between 70% and 90% survivability to 8 months. This is much worse than the ~2%/year baseline failure rate, but not as strong of a predictor as you might like. It would be nice to see data on this out to 2 or 3 years, so you could calculate the integrated chance of failure
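For a rough sense of scale, the 8-month survival figures can be converted to a yearly failure rate and compared with the ~2%/year baseline. This is only a sketch: it assumes a constant failure rate over time (a simplification the paper does not claim), and the function names are made up for this example.

```python
def annualized_failure_rate(survival, months):
    """Convert survival over `months` into a yearly failure rate,
    assuming a constant hazard (an assumption, not a result from the paper)."""
    return 1.0 - survival ** (12.0 / months)


def cumulative_failure(afr, years):
    """Integrated chance of failure after `years`, under the same assumption."""
    return 1.0 - (1.0 - afr) ** years


baseline = 0.02  # the ~2%/year figure quoted in the comment
for survival in (0.70, 0.90):
    afr = annualized_failure_rate(survival, months=8)
    print(f"{survival:.0%} survival at 8 months: "
          f"AFR ~{afr:.1%}, {afr / baseline:.0f}x baseline, "
          f"~{cumulative_failure(afr, 3):.0%} failed after 3 years")
```

Under that assumption the flagged drives land somewhere around 15% to 41% per year, so well above baseline but still far from a certain failure.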
Re:Samsung! (Score:4, Insightful)
In summary: Your statistical analysis on a sample size of one showed a 100% failure rate, so Samsung are crap. You found some other people also had failed Samsung drives, so Samsung are crap.
Search the net and you will find people ranting about Seagate drive failures, Western Digital drive failures, IBM drive failures, Maxtor drive failures and failures of drives made by companies neither of us has even heard of. You won't find many, if any, reports of recent failures with 8" floppy drives though, so I suggest you use one of those. They must be more reliable, right?
Yes it does (Score:2)
Re:Hmm (Score:4, Interesting)
The article states that, in about half of the failures, there were no SMART warnings at all. Okay, but what was the breakdown in the kinds of failures of these unpredicted ones? If they were all spindle motor and head traversal failures, then you can't blame SMART for that. If it turns out that SMART gave warnings for 95% of all failures that were media-degradation related (like bad sectors, etc... where the drive still talks to your machine properly, and just can't get the data you want), then I'd say SMART is pretty darn useful.
But, alas, I didn't see any breakdown for failure type....
Did they ever name the brands? (Score:4, Insightful)
That would be corporate dynamite (Score:5, Insightful)
Re:That would be corporate dynamite (Score:4, Interesting)
Re:That would be corporate dynamite (Score:5, Insightful)
Re: (Score:2)
ps... they cost about $300 then.
And you call yourself an antique:-)
They do say that "vintage" matters (Score:5, Interesting)
Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.
Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic I'm sure there's some useful information in there (eg. a company to avoid like the plague).
Re:That would be corporate dynamite (Score:4, Insightful)
Re: (Score:3, Insightful)
Re:That would be corporate dynamite (Score:5, Insightful)
Old Google Motto: Don't do anything evil.
New Google Motto: Don't get into trouble.
Re: (Score:3, Funny)
"Don't get caught doing anything evil."
You can get IDE/SATA drives FAILURE RATES Here (Score:5, Informative)
http://pro.sunrise.ru/articletext.asp?reg=30&id=2
http://pro.sunrise.ru/docs/30/image001.gif [sunrise.ru] - IDE/SATA (3.5" formfactor)
http://pro.sunrise.ru/docs/30/image002.gif [sunrise.ru] - HDD (2.5" notebook formfactor)
In short, most returns are for Maxtor brand. Lowest - IBM/Hitachi.
Toshiba is worst in 2.5", and Seagate is best.
The chance of a drive dying ranges from 1/20 (Maxtor) to 1/70 (Hitachi).
Re:That would be corporate dynamite (Score:4, Informative)
On the other hand, hard drives change so much that this year's model will be totally different in design and mechanics from next year's, so blaming (say) IBM for its crappy Deskstar range should not be a reason to blame their (ok, Hitachi's) current line.
If you do want to know more about which drives are best - check out storagereview [storagereview.com] and enter details of your drives to their reliability database.
Re:Did they ever name the brands? (Score:5, Interesting)
breakdown of drives per manufacturer, model, or vintage
due to the proprietary nature of these data.
But, of course.
Thanks, missed that... (Score:2)
Re: (Score:3, Insightful)
The right to stay silent on something is just as important a freedom as the right to have your say.
Censorship has nothing whatsoever to do with it.
Re: (Score:3, Insightful)
Re:Did they ever name the brands? (Score:4, Insightful)
Re: (Score:2, Interesting)
Re:Did they ever name the brands? (Score:5, Funny)
Re: (Score:3, Insightful)
They should have done brand analysis (without naming the brand) and also rpm analysis.
From the article..
Re: (Score:3, Informative)
Re:Did they ever name the brands? (Score:4, Insightful)
Very true! (Score:2)
Re: (Score:2)
You forgot one metric of comparison: the warranty. As far as I'm concerned, this number alone is the most important in determining the reliability of the hard drive. If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product. When buying hard drives,
Re: (Score:2, Interesting)
Or maybe the manufacturer just realized that 5 years down the road, a replacement for your then 5 year old HD will cost them peanuts. According to the graph at http://en.wikipedia.org/wiki/Hard_drives#Capacity [wikipedia.org], HD capacity seems to be increasing by roughly ten times every five years.
It's like the CD-R manufacturers stamping all the packaging with
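The "ten times every five years" figure from the capacity graph can be turned into a quick back-of-the-envelope calculation. This sketch assumes smooth exponential growth and a roughly constant price per drive, neither of which is stated in the comment.

```python
# Assumption: capacity grows smoothly by 10x every 5 years.
growth_per_five_years = 10.0
annual_growth = growth_per_five_years ** (1.0 / 5.0)


def relative_replacement_cost(years):
    """Cost of replacing a drive of the original capacity after `years`,
    relative to its purchase price, assuming the price per drive stays flat."""
    return 1.0 / annual_growth ** years


print(f"annual capacity growth factor: {annual_growth:.3f}")
for years in (1, 3, 5):
    print(f"after {years} years: ~{relative_replacement_cost(years):.0%} of the original cost")
```

On those assumptions a warranty replacement in year five costs the manufacturer about a tenth of what the drive sold for.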
Re: (Score:3, Insightful)
Re: (Score:3, Interesting)
What? So the part about which variables are correlated with drive failures (which is what the report was about) wasn't interesting to you? Too bad.
Re: (Score:3, Insightful)
The pertinence of the SMART data (pretty much always pertinent) and how often it popped up (about half the time) before a failure was a very interesting bit of information.
A good drive is one that lasts a long time without developing too many bad blocks. A bad drive is one that
Re: (Score:3, Insightful)
Translation (Score:4, Funny)
Ideally, they would have formatted the text to spell out the names of the brands if you take the first letter of every Nth word, or some specific column of text. (Or maybe they have...)
Re:Translation (Score:5, Insightful)
Re:Translation (Score:5, Insightful)
Re:Translation (Score:5, Insightful)
We're not so bloody stupid as to believe that our competitors are standing in the aisle of Circuit City and scratching their heads over whether to buy a Seagate or WD drive.
We know that our competitors all have their own metrics and their own relationships with manufacturers and frankly, we don't care. We know our competitors also measure these things, and we're not telling them anything they don't already know.
We aren't particularly worried about saying that some drives fail, because everyone who cares already knows that some drives fail. Everyone whose job it is to know which drives fail first already knows that as well.
But we're not going to tell you which brand fails at a higher rate than normal because we don't need a lawsuit that would cost us a lot of money but in the end would only confirm what the people who need to know these things already know.
We will, on the other hand, describe the tests we ran, our methodology, our results, and our analyses. We do this just for kicks and we hope you can learn something from the results.
And we hope you have a nice day.
Re: (Score:2)
It is clearly not proprietary to the drive manufacturers, because it came from Google's study. This means they regard it as proprietary to themselves.
How do you know that their competitors have done equally good studies? Given the large population (100,000) and the fact that people are surprised even by some of the published conclusions, it is very l
Re: (Score:3, Insightful)
Makes sense. Killing the weaker infants makes the adult population healthier.
Re: (Score:3, Interesting)
You need backups anyway, that's not the point. But it makes a difference for your maintenance-costs if you experience 1% of your disc-drives dying in an anvera
Re: (Score:3, Funny)
Re: (Score:2)
Re: (Score:2)
Sigh. That's the most misinformed post I've ever seen on Slashdot. Demand, by itself, says absolutely nothing about the price of something.
Re: (Score:2)
More likely: "We buy millions of dollars worth of drives each year, and our buying decisions are driven in part by the reliability data that we collect. If we told everyone what kind of drives work best, more people would buy those drives, driving up the price that we pay."
You tard. Demand and price in a free market are reversely proportional. Go back to high school economics! Not only that, but the great drive company mentioned would probably get more press and money leading to more R&D and even better drives.
I wish Google released the data they found because it would force the crappy drive companies to improve their products.
Re: (Score:2)
Demand and price in a free market are reversely proportional.
One way to spot someone who doesn't really understand economics is how quickly they make statements like that. You would need to know a lot more about the thing in question before being able to make a generalization like that. Sometimes, they're directly proportional, sometimes, they're reversely proportional, and sometimes they're neither. It depends on a lot of other things which relationship holds true, if any.
Re:Did they ever name the brands? (Score:5, Funny)
Re: (Score:2)
I think that's understandable given the litigious nature of business today...
Makes it a little less useful from a practical standpoint though...
What do you want to bet (Score:2)
Re: (Score:2)
Re: (Score:2)
However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.
and then add to it with:
Interestingly, this does not change our conclusions. In contrast to age-related results, we note that all results shown in the rest of the paper are not affected significantly by the population mix.
Proprietary? Wrong use of the word there. What they really mean is we do not want to make specific compa
Google had this paper ready a year ago (Score:3, Funny)
Conclusion (Score:4, Informative)
"In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.
One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.
Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."
Similar paper (Score:4, Informative)
and in the meanwhile... (Score:4, Informative)
C'mon, slashdot. There were about twenty other papers presented at FAST this year. Let's not focus only on the one with Google authors...
Re:and in the meanwhile... (Score:4, Insightful)
While at a glance, it may seem like this is simply "the latest thing google did," and... let's be honest, given the editor in question... this was most likely the reason it made the front page. But while Bianca Schroeder's report, for instance, uses statistics from various unnamed sources and for various unnamed uses, the Google report is interesting because we know exactly where it's coming from and what it's being used for.
Of course, a truly insightful story would have taken this opportunity to compare Google's findings with the others and report on that.
Temperature conclusion (Score:5, Interesting)
Re: (Score:3, Insightful)
Re:Temperature conclusion (Score:4, Interesting)
While this would require a more laboratory-like environment, a dozen drives of each type and manufacture could have been sampled at known temperatures, and a data curve could have been established to calibrate the temperature sensors.
There are lots of studies out there where drives were intentionally heated, and higher degrees of failure were indeed reported (this is mentioned in the google report too). So the correlation is probably still valid, just not well-proven.
Re: (Score:3, Insightful)
This is why, if you want your engine to last, you should let your car warm up before driving it hard.
Lower temp == higher failure rates (Score:5, Interesting)
Re: (Score:2)
Re: (Score:2, Insightful)
Re: (Score:3, Insightful)
Proprietary makes sense here (Score:4, Insightful)
Litigation avoidance may be a consideration here but why not take Google at their word? Google is a search company that buys lots of hard drives. Based on their own internal research, they have developed information about which hard disk models and/or manufacturers are shite.
Yahoo is also a search company that buys lots of hard drives. Why should Google give that hard drive reliability information to you, me and Yahoo for free? Let Yahoo/Excite/MSN and the competitors figure it out for themselves.
Yeah, sure I'd like to have access to Google's data the next time I'm in the market for a hard drive but I won't hold a grudge against them if they don't do my consumer research for me. On the other hand, whereinafuck is the data from Tom's Hardware Guide, Anandtech, Consumer Reports and all the other reviewer and consumer sites? If someone doesn't have a handy link to their results, I'll see if I can google something up:
http://www.google.com/search?hl=en&safe=off&clien
This speaks volumes. (Score:5, Funny)
power supplies (Score:2)
Run smartd and look for scan errors (Score:2)
The GDRIVE (Score:2, Interesting)
I read the abstract and the conclusion (Score:2)
It's interesting, and I tend to trust their results, but these conclusions may not be relevant to single-drive situations. That is, if two customers purchase 1 drive each, and neither drive is defective, then this study doesn't explain why one drive would fail befo
How many drives really (Score:5, Insightful)
The paper claims "more than 100 thousand drives". But the nice thing is that you can derive the actual number from the error bars, for example those in figure 4. The data should be governed by Poisson statistics, which means that the standard deviation in the counts is equal to the square root of the count. However, their error bars seem to be about a factor of 2 larger than the standard deviation, because normally around 68% of the data points should lie within one standard deviation from the "smooth curve". Let's assume the error bars are 95% confidence intervals, i.e. 2 standard deviations.
Look at the data for 20 to 21 C. It tells you that it represents a fraction 0.0135 of their total drive population, with an average failure rate of 7 +- 0.5 %. Following the reasoning above, this 7% should represent 784+-28 drives. Since these represent 7% of 1.35% of the total number of drives, we can derive that the total number of drives is 784/0.07/0.0135 = 830,000 drives. Trying the same thing for 30 to 31 C gives 826,000 drives, which seems fairly consistent.
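The arithmetic above can be written out step by step. This is a sketch of the commenter's reasoning only: it takes the values read off figure 4 for the 20 to 21 C bin at face value and keeps the assumption that the error bars are 2 standard deviations. The function name and parameters are made up for this example.

```python
def estimate_population(rate, error_bar, fraction, sigmas=2.0):
    """Estimate failure count and total drive count from one histogram bin.

    rate      -- failure rate in the bin (0.07 for 7%)
    error_bar -- half-width of the plotted error bar (0.005 for +-0.5%)
    fraction  -- share of all drives that fall in this bin (0.0135)
    sigmas    -- how many standard deviations the error bar spans (assumed 2)
    """
    sigma = error_bar / sigmas
    # Poisson counting: relative error = 1/sqrt(n), so n = (rate / sigma)^2.
    failures = (rate / sigma) ** 2
    drives_in_bin = failures / rate
    total_drives = drives_in_bin / fraction
    return failures, total_drives


failures, total = estimate_population(rate=0.07, error_bar=0.005, fraction=0.0135)
print(f"failures in bin: ~{failures:.0f}, estimated total drives: ~{total:,.0f}")
```

The result is sensitive to the error-bar assumption: treating the bars as 1 standard deviation instead of 2 would cut the estimate by a factor of four.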
So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?
Re: (Score:3, Insightful)
Do you really think that they don't store every cookie and search pattern of everyone who uses their search engine? Cross-reference all this data, alter their ranks, follow your interests, use those to make money and target you with ads?
There is a ton of money for this information, and with enough stored data and having the facility to mine it, filter it, and sort it to location level for vario
Bell Labs (Score:3, Insightful)
Temperatures (Score:3, Interesting)
I have been previously led to believe that it's not so much the average temperature of a hard drive that causes failure, but temperature fluctuations. This makes sense, since repeated expansion and contraction of the disk platters is likely to cause warpage before too long. This, I guess, is where glass platters like what IBM toyed with would come in useful. In the meantime I guess we still need our HVAC units to keep a constant temperature, just not too low anymore.
This also has implications for data centers that spend a considerable amount of energy pumping heat out of the server room. If we can raise the industry-accepted temperature ceiling from 22C to say 30C then a lot of energy can be saved over time. Perhaps not quite enough to dip below 1% of US-wide power use but every bit helps.
Re: (Score:3, Informative)
I think you are partly right in this assumption, but for the wrong reasons. Some failure modes are a function of temperature and other failure modes are a function of temperature variation. A long time ago platter expansion and contraction was a major cause of problems when drives used stepper motor positioning; since they switched to servo positioning, the drive automatically tracks the expansion and contraction of the platters and that is pretty much a non-issue as long as the coating on the platters
One of TWO best papers at FAST (Score:3, Informative)
You might be interested in the other best paper award winner (in the shameless self-promotion department): TFS: A Transparent File System for Contributory Storage [usenix.org], by Jim Cipar [umass.edu], Mark Corner [umass.edu], and Emery Berger [slashdot.org] (Dept. of Computer Science, University of Massachusetts Amherst [umass.edu]). Briefly, it describes how you can make all the empty space on your disk available for others to use, without affecting your own use of the disk (no performance impact, and you can still use the space if you need it).
Enjoy!
--
Emery Berger
Dept. of Computer Science
University of Massachusetts Amherst
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Informative)
From my experience, Western Digitals are (relatively) reliable. They unfortunately do not have the same power connector orientation as any other consumer drive on the planet, so if you want to use IDE RAID you have to get the type that either (1) fits any consumer IDE drive or (2) fits a Western Digital drive. (grr)
Had some good experiences with Maxtor. A couple of years ago (OK - maybe 6 or 8) we had batches of super reliable Maxtors - 10GB.
Some Samsungs are good, some are evil - the SP0411N was a partic
Re: (Score:2, Informative)
The DeskStars were nicknamed DeathStars due to their high failure rate.
Maxtor has a terrible reputation in the channel.
Seagate has a fantastic reputation in the channel.
And as far as the WD power connectors.. I have 4 Western Digitals, a Samsung, a Maxtor, and a Seagate on my desk right now.. and they all have the same layout (left to right: 40 pin, jumpers, molex).
Re: (Score:2)
Seagate also does NOT offer advance drive replacement in Canada, which means I'll never buy another of their products until this policy changes.
Had good luck with more recent Western Digital drives. Put 5 x 500GB in
Re:OS X SMART tool? (Score:4, Informative)
Not exactly point & click but it'll do.
Re: (Score:3, Informative)
I had a disk reporting a SMART failure once. The result was that the disk was red in the list in Disk Utility, but there were no other warnings. So you might want to check Disk Utility once in a while.
Re: (Score:2, Informative)