Google Releases Paper on Disk Reliability
oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years, they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"
Conclusion (Score:4, Informative)
"In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.
One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.
Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."
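The attributes the paper singles out (reallocations, offline reallocations, probational counts) are the same ones smartmontools exposes. A minimal sketch of flagging them from `smartctl -A` output — the attribute IDs and the tabular format are assumptions based on common drive firmware, not anything from the paper's own tooling:

```python
# Sketch: flag the SMART attributes the paper found most predictive
# (reallocations, offline reallocations, probational/pending sectors).
# Assumes the usual tabular `smartctl -A` output and the de-facto
# standard attribute IDs; real drives vary.

PREDICTIVE_IDS = {
    5:   "Reallocated_Sector_Ct",
    196: "Reallocated_Event_Count",
    197: "Current_Pending_Sector",   # the "probational" count
    198: "Offline_Uncorrectable",
}

def risky_attributes(smartctl_output: str) -> dict:
    """Return the predictive attributes whose raw value is nonzero."""
    risky = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is
        # the tenth column in smartctl's table.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in PREDICTIVE_IDS and fields[9].isdigit():
                raw = int(fields[9])
                if raw > 0:
                    risky[PREDICTIVE_IDS[attr_id]] = raw
    return risky
```

Per the paper's point, an empty result from something like this is weak evidence of health — a large fraction of their failed drives never showed any of these signals.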
Similar paper (Score:4, Informative)
Re:Did they ever name the brands? (Score:3, Informative)
and in the meanwhile... (Score:4, Informative)
C'mon, slashdot. There were about twenty other papers presented at FAST this year. Let's not focus only on the one with Google authors...
Re:Hmm (Score:4, Informative)
So, if you have errors in those highly correlated categories, your drives are probably going to fail; but even if you do not have errors in those categories, your drives can still fail.
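In other words, the strong SMART signals are specific but not sensitive. A toy calculation (all numbers made up for illustration, not from the paper) shows how both statements can hold at once:

```python
# Toy numbers (hypothetical, NOT from the paper) showing how a SMART
# signal can be specific (when it fires, the drive usually fails soon)
# yet insensitive (most drives that fail never showed the signal).

population     = 100_000   # assumed fleet size
will_fail      = 3_000     # assumed failures in the window
signal_in_fail = 1_200     # failing drives that showed the signal first
signal_in_ok   = 100       # healthy drives that showed it anyway

sensitivity = signal_in_fail / will_fail                      # fraction of failures caught
precision   = signal_in_fail / (signal_in_fail + signal_in_ok)

print(f"sensitivity = {sensitivity:.2f}")   # 0.40: misses most failures
print(f"precision   = {precision:.2f}")     # 0.92: but a firing signal is trustworthy
```

With numbers like these you should absolutely replace a drive that throws a scan error, while still keeping backups and redundancy for the majority of failures that arrive unannounced.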
Re:So (Score:2, Informative)
From my experience, Western Digitals are (relatively) reliable. Unfortunately, they do not have the same power connector orientation as any other consumer drive on the planet, so if you want to use IDE RAID you have to get the kind that either (1) fits any consumer IDE drive or (2) fits a Western Digital drive. (grr)
Had some good experiences with Maxtor. A couple of years ago (OK, maybe 6 or 8) we had batches of super-reliable 10GB Maxtors.
Some Samsungs are good, some are evil: the SP0411N was a particularly reliable model, while the SP0802N sucked — out of a batch of 20, 15 of them died within a year, all with reallocated-sector errors beyond the threshold.
Seagates are a mixed bag too. I've been having a nice experience with the 160GB and 120GB SATA models (can't remember their model numbers off the top of my head), but the older Seagates I spent a fair amount of time replacing.
IBM DeskStars, as far as I know, have been quite good, though for some reason we didn't use too many.
Re:So (Score:2, Informative)
The DeskStars were nicknamed DeathStars due to their high failure rate.
Maxtor has a terrible reputation in the channel.
Seagate has a fantastic reputation in the channel.
And as far as the WD power connectors go: I have 4 Western Digitals, a Samsung, a Maxtor, and a Seagate on my desk right now, and they all have the same layout (left to right: 40-pin, jumpers, Molex).
Re:OS X SMART tool? (Score:4, Informative)
Not exactly point & click but it'll do.
Re:I'm obviously behind the times, but... (Score:2, Informative)
Re:Proprietary reporting (Score:3, Informative)
Re:OS X SMART tool? (Score:3, Informative)
I had a disk reporting a SMART failure once. The result was that the disk was red in the list in Disk Utility, but there were no other warnings. So you might want to check Disk Utility once in a while.
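For the command-line inclined, `diskutil info` exposes the same status Disk Utility colors red. A minimal sketch, macOS only — the `parse_smart_status` helper name is mine, and it assumes `diskutil`'s usual "SMART Status:" line:

```python
# Sketch (macOS): read the SMART verdict that Disk Utility shows,
# without opening the GUI. `diskutil info diskN` prints a line such as
# "SMART Status: Verified"; anything else deserves a closer look.
import re
import subprocess

def parse_smart_status(diskutil_output: str) -> str:
    """Pull the SMART Status value out of `diskutil info` output."""
    m = re.search(r"SMART Status:\s*(.+)", diskutil_output)
    return m.group(1).strip() if m else "Unknown"

def smart_status(disk: str = "disk0") -> str:
    out = subprocess.run(["diskutil", "info", disk],
                         capture_output=True, text=True, check=True).stdout
    return parse_smart_status(out)
```

Dropping `smart_status()` into a cron job beats remembering to open Disk Utility once in a while.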
You can get IDE/SATA drive failure rates here (Score:5, Informative)
http://pro.sunrise.ru/articletext.asp?reg=30&id=2
http://pro.sunrise.ru/docs/30/image001.gif [sunrise.ru] - IDE/SATA (3.5" formfactor)
http://pro.sunrise.ru/docs/30/image002.gif [sunrise.ru] - HDD (2.5" notebook formfactor)
In short, most returns are for the Maxtor brand; the fewest are IBM/Hitachi.
Toshiba is the worst in 2.5", and Seagate is the best.
The chance of failure ranges from 1/20 (Maxtor) to 1/70 (Hitachi).
Re:That would be corporate dynamite (Score:4, Informative)
On the other hand, hard drives change so much that this year's model will be a totally different design, mechanically and otherwise, from next year's, so blaming (say) IBM for its crappy DeskStar range is no reason to blame their (OK, Hitachi's) current line.
If you do want to know more about which drives are best, check out StorageReview [storagereview.com] and enter the details of your drives into their reliability database.
Actually this is a profoundly important conclusion (Score:2, Informative)
This is easily the most important thing a sysadmin needs to know about hard drives. Much as I love SpinRite, when drives start to fail, they continue to fail.
This story reminds me of the runaround I got from Dell [India] when my one-and-only-Dell I'm-not-stupid-enough-to-buy-their-crap-again started to have seek errors [just-think-it.com].
Re:So SMART is specific, but not sensitive. (Score:2, Informative)
Re:So SMART is specific, but not sensitive. (Score:3, Informative)
One of TWO best papers at FAST (Score:3, Informative)
You might be interested in the other best paper award winner (in the shameless self-promotion department): TFS: A Transparent File System for Contributory Storage [usenix.org], by Jim Cipar [umass.edu], Mark Corner [umass.edu], and Emery Berger [slashdot.org] (Dept. of Computer Science, University of Massachusetts Amherst [umass.edu]). Briefly, it describes how you can make all the empty space on your disk available for others to use, without affecting your own use of the disk (no performance impact, and you can still use the space if you need it).
Enjoy!
--
Emery Berger
Dept. of Computer Science
University of Massachusetts Amherst
Re:So SMART is specific, but not sensitive. (Score:3, Informative)
Re:Temperatures (Score:3, Informative)
I think you are partly right in this assumption, but for the wrong reasons. Some failure modes are a function of temperature, and other failure modes are a function of temperature variation. A long time ago, platter expansion and contraction was a major cause of problems, back when drives used stepper-motor positioning; since the switch to servo positioning, the drive automatically tracks the expansion and contraction of the platters, and that is pretty much a non-issue as long as the coating on the platters is not affected.
This report reads like it was done by statisticians, not engineers. Handling of temperature, in particular, reveals this. As someone who has designed electronic circuits, been involved in reliability analysis, and repaired broken computers and other equipment at both the board level and chip level, I get the impression that the writers have not done any of those things.
Also, the conditions in Google's RAID arrays are likely very different from those encountered in many other settings, such as office and desktop PCs. In the RAID arrays, drives are not powered down daily, and you can also expect a better cooling design.
The higher failure rate of lower-average-temperature drives is a definite eyebrow-raiser — not because it disproves the common wisdom (which still applies in the expected range) but because it is probably a clue that some important data was overlooked. If you extrapolate the right side of the graph, you see that failure does increase dramatically with temperature over the range of temperatures that would be experienced in normal cooling situations, and particularly in cooling failures.
Google has drives that are running at room temperature? This could point to serious temperature fluctuation, to measurement error, or to extremely aggressive local cooling (chilled water or Freon A/C) or a server room that is chilled like a walk-in freezer. In that case, those drive failures are probably caused by moisture. At normal operating temperatures, a drive will drive off moisture; at cooler temperatures, there may be condensation issues on the drive itself or on cooling components near it.
The reason we don't see high-temperature failures is that the sampled temperatures are abnormally low. The most common temperature-related failures occur when you have a cooling failure or poor cooling. Good cooling does improve the lifetime of the drive; that does not mean, however, that cooling to extremes is a good idea.
In a typical PC, the drive is going to run at somewhere around 40 degrees C. The drive in this computer, right now, which is mounted in a typical mid-tower case in a slightly chilly room (it is winter here, and the room would be a lot more chilly without three computers heating it), is running at 39 degrees. That temperature corresponds to the crest of the failure-vs.-temperature curve in Google's graphs.
What temperature do you think drive manufacturers would optimize their designs for? A typical commercial-grade chip is rated 0 to 70 degrees C, so the thresholds would be expected to be optimized for 35 degrees C, and drive manufacturers would expect the normal operating temperature to be around 40 degrees C. The paper says they use consumer-grade drives. The datasheet for a WD 250GB hard drive [zdnet.com] gives an operating range (ambient, not drive temperature) of 5 degrees C (41F) to 55 degrees C (131F). I noticed while doing a Google search that some drives specified a minimum storage temperature of -13C.
Also, if the average temperature is low, that may be an indication that the drives in that particular population are spun down or even powered down much of the time, perhaps because the particular datasets they serve are infrequently used, or because the data is entirely cached in RAM.
Also, they talked about average temperature over the life