Google Releases Paper on Disk Reliability

oski4410 writes "Google engineers have just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years, they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"
  • Conclusion (Score:4, Informative)

    by llZENll ( 545605 ) on Sunday February 18, 2007 @12:39AM (#18057152)
    This is awesome, but the conclusion of such an interesting study leaves a lot to be desired. FTA...

    "In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.

    One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

    Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."
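
    The conclusion above names four strongly failure-correlated SMART counters. As a rough illustration of the kind of rule those correlations suggest (a sketch only; the attribute names and record format are assumptions for this example, not the paper's actual pipeline):

        # Flag a drive once any of the SMART counters the paper found
        # strongly correlated with failure goes nonzero. Field names
        # here are illustrative assumptions, not the paper's schema.
        STRONG_SIGNALS = ("scan_errors", "reallocation_count",
                          "offline_reallocation_count", "probational_count")

        def at_risk(smart_record):
            # Per the paper, a first scan error makes a drive 39x more
            # likely to fail within 60 days, so any nonzero count is a
            # reasonable (if coarse) trigger for scheduling a swap.
            return any(smart_record.get(name, 0) > 0 for name in STRONG_SIGNALS)

        print(at_risk({"scan_errors": 1}))         # True: schedule replacement
        print(at_risk({"reallocation_count": 0}))  # False: no strong signal yet

    Keep the paper's caveat in mind, though: a rule like this still misses the large fraction of failed drives that never showed any of these signals.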
  • Similar paper (Score:4, Informative)

    by reset_button ( 903303 ) on Sunday February 18, 2007 @12:42AM (#18057164)
    I was at the talk, and it was very interesting. CMU also had a paper (PDF) [cmu.edu] about disk failures at the same conference (in fact, the two were presented back to back).
  • by drmerope ( 771119 ) on Sunday February 18, 2007 @12:47AM (#18057192)
    No. They explicitly said they would not disclose that... which is a shame, because that is probably the only interesting bit of information. The question that really needs to be studied is what distinguishes good drives from bad. This would probably involve disassembling drives of various 'vintages, models, manufacturers' and trying to pin down the relevant details. That way, when new hard drives are released, reviewers can pull them apart and judge them on something other than read/write performance, heat, and acoustics...
  • by pedantic bore ( 740196 ) on Sunday February 18, 2007 @12:50AM (#18057220)
    ... at the same conference, Bianca Schroeder presented a paper [cmu.edu] on disk reliability that developed sophisticated statistical models for disk failures, building on earlier work by Qin Xin [ucsc.edu] and a dozen papers by John Elerath... [google.com]

    C'mon, slashdot. There were about twenty other papers presented at FAST this year. Let's not focus only on the one with Google authors...

  • Re:Hmm (Score:4, Informative)

    by Anonymous Coward on Sunday February 18, 2007 @01:03AM (#18057284)
    There are several SMART signals that are highly correlated with drive failure, but the authors note that 56% of the failed drives had no occurrences of these highly correlated errors. Even considering all SMART signals, 36% of failed drives still had no SMART signals reported.

    So, if you have errors in those highly correlated categories, your drives are probably going to fail; but if you do not have errors in these categories, your drives can still fail.
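
    Those two percentages put a hard ceiling on the recall of any SMART-only predictor, which a couple of lines of arithmetic make concrete:

        # Upper bound on the fraction of failures a SMART-based predictor
        # could ever catch, using the figures quoted above.
        failed_without_strong_signal = 0.56  # no scan/reallocation-type error
        failed_without_any_signal = 0.36     # no SMART signal of any kind

        print(1 - failed_without_strong_signal)  # 0.44: best recall from the strong signals alone
        print(1 - failed_without_any_signal)     # 0.64: best recall from all SMART signals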
  • Re:So (Score:2, Informative)

    by mightyQuin ( 1021045 ) on Sunday February 18, 2007 @01:37AM (#18057486)

    From my experience, Western Digitals are (relatively) reliable. Unfortunately, they do not have the same power connector orientation as any other consumer drive on the planet, so if you want to use IDE RAID you have to get the type that either (1) fits any consumer IDE drive or (2) fits a Western Digital drive. (grr)

    I've had some good experiences with Maxtor. A couple of years ago (OK, maybe 6 or 8) we had batches of super-reliable 10GB Maxtors.

    Some Samsungs are good, some are evil: the SP0411N was a particularly reliable model, while the SP0802N sucked. Out of a batch of 20, 15 died within a year, all with reallocated sector errors beyond the threshold.

    Seagates are a mixed bag too. I've been having a nice experience with the 160GB and 120GB SATA models (can't remember their model numbers off the top of my head), but I spent a fair amount of time replacing the older Seagates.

    IBM DeskStars, as far as I know, have been quite good, though for some reason we didn't use too many.

  • Re:So (Score:2, Informative)

    by nevesis ( 970522 ) on Sunday February 18, 2007 @02:31AM (#18057738)
    Interesting, but I disagree with your analysis.

    The DeskStars were nicknamed DeathStars due to their high failure rate.

    Maxtor has a terrible reputation in the channel.

    Seagate has a fantastic reputation in the channel.

    And as far as the WD power connectors go: I have 4 Western Digitals, a Samsung, a Maxtor, and a Seagate on my desk right now, and they all have the same layout (left to right: 40-pin, jumpers, Molex).
  • Re:OS X SMART tool? (Score:4, Informative)

    by kimvette ( 919543 ) on Sunday February 18, 2007 @02:48AM (#18057834) Homepage Journal
    http://sourceforge.net/projects/smartmontools [sourceforge.net]

    Not exactly point-and-click, but it'll do.
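
    For scripting rather than pointing and clicking, smartctl's output is easy to scrape. A minimal sketch, assuming smartmontools is installed and /dev/sda is the drive of interest (the parsing is illustrative, not exhaustive):

        # Query SMART attributes by shelling out to smartctl and pulling
        # out the raw values. Needs root; adjust the device path.
        import subprocess

        def smart_attributes(device="/dev/sda"):
            out = subprocess.run(["smartctl", "-A", device],
                                 capture_output=True, text=True).stdout
            attrs = {}
            for line in out.splitlines():
                fields = line.split()
                # Attribute rows begin with a numeric ID, e.g.
                # "  5 Reallocated_Sector_Ct ... RAW_VALUE"
                if fields and fields[0].isdigit():
                    attrs[fields[1]] = fields[-1]  # name -> raw value
            return attrs

        print(smart_attributes().get("Reallocated_Sector_Ct"))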
  • That sounds like a great idea; however, flash memory has a habit of failing with no warning whatsoever as well.
  • by Toba82 ( 871257 ) on Sunday February 18, 2007 @03:03AM (#18057916) Homepage
    It is well known that Google uses commodity hardware. SCSI is not commodity, although I'm sure at least some of their servers are high-end.
  • Re:OS X SMART tool? (Score:3, Informative)

    by am 2k ( 217885 ) on Sunday February 18, 2007 @06:07AM (#18058608) Homepage

    So what tool on Mac OS X will provide all the SMART data?

    I had a disk reporting a SMART failure once. The result was that the disk was red in the list in Disk Utility, but there were no other warnings. So you might want to check Disk Utility once in a while.
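
    For a scriptable check that doesn't require opening Disk Utility, the same pass/fail verdict can be read from diskutil (a sketch; "disk0" is assumed to be the internal drive):

        # Print the one-line SMART verdict that Disk Utility surfaces,
        # e.g. "SMART Status: Verified" or "SMART Status: Failing".
        import subprocess

        out = subprocess.run(["diskutil", "info", "disk0"],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if "SMART Status" in line:
                print(line.strip())

    Note that this is only the overall verdict; for the full attribute table you still need something like smartmontools.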

  • by Augur ( 62912 ) on Sunday February 18, 2007 @09:00AM (#18059050) Homepage
    One of the largest retailers in Russia (and maybe in Europe; more than 300 terminals for in-person orders at its ex-factory building, busy 24/7), "Pro Sunrise", released information on the failure rates of major components (CPUs, video cards, motherboards, IDE/SATA drives, etc.) of the PCs they sold in Q1-Q2 of 2005.

    http://pro.sunrise.ru/articletext.asp?reg=30&id=283 [sunrise.ru] - the article (in Russian, but the diagrams are self-explanatory).

    http://pro.sunrise.ru/docs/30/image001.gif [sunrise.ru] - IDE/SATA (3.5" form factor)

    http://pro.sunrise.ru/docs/30/image002.gif [sunrise.ru] - HDD (2.5" notebook form factor)

    In short, most returns were for the Maxtor brand; the fewest were for IBM/Hitachi.

    Toshiba was the worst in 2.5", and Seagate the best.

    The chance of a drive failing ranged from 1/20 (Maxtor) to 1/70 (Hitachi).

  • by gbjbaanb ( 229885 ) on Sunday February 18, 2007 @01:10PM (#18060402)
    When a friend's car broke down, she asked the breakdown man who came which cars were the most reliable. He said he wasn't allowed to comment, but that "he carried no Honda parts". I guess the same thing applies here: Google won't say; they'd get sued.

    On the other hand, hard drives change so much that this year's model will have a totally different design and mechanics than next year's, so blaming (say) IBM for its crappy DeskStar range is no reason to blame their (OK, Hitachi's) current line.

    If you do want to know more about which drives are best, check out StorageReview [storagereview.com] and enter the details of your drives into their reliability database.
  • after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors

    This is easily the most important thing a sysadmin needs to know about hard drives. Much as I love SpinRite, when drives start to fail, they continue to fail.

    This story reminds me of the runaround I got from Dell [India] when my one-and-only-Dell I'm-not-stupid-enough-to buy-their-crap-again started to have seek errors [just-think-it.com].
  • by vakuona ( 788200 ) on Sunday February 18, 2007 @03:57PM (#18061490)
    What Google is saying is that you cannot rely on SMART to warn you of all, or even most, hard drive failures. So whilst you do reduce the chance of losing data, they are saying you are still very likely to lose data anyway.
  • by RedWizzard ( 192002 ) on Sunday February 18, 2007 @04:57PM (#18061906)

    To me it's useful - if I get a SMART warning, then I'm definitely backing up my drive and will replace it before it croaks. ... What these tests give you is as follows: if a test is positive (i.e. the drive temperature is >80 C), then it accurately predicts that the drive will fail; if the test is negative (drive temp <40 C), then it accurately predicts that the drive is OK.
    But according to the paper, none of the SMART parameters was very useful in this regard. Over 50% of drive failures were not predicted by SMART errors, so the "negative test" can't give much confidence that the drive is OK. Conversely, while some types of SMART error (e.g. scan errors) indicated a much higher probability of impending failure, they still weren't all that indicative: 70% of drives that reported a scan error were still functioning normally after 8 months. So the "positive test" isn't all that convincing either. This is why the paper came to the conclusion that SMART was not useful in building a predictive model for drive failure.
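
    A quick back-of-envelope shows why even a 39x relative risk makes a weak test for an individual drive. The base rate below is an assumed illustrative figure, not a number from the paper:

        # If only ~0.5% of error-free drives fail in a given 60-day window,
        # a 39x multiplier still leaves most flagged drives alive.
        base_60day_failure = 0.005   # assumed base rate, for illustration only
        relative_risk = 39           # from the paper: first scan error

        p_fail = base_60day_failure * relative_risk
        print(f"P(fail within 60 days | scan error) ~ {p_fail:.0%}")  # roughly 20%

    That squares with the observation that roughly 70% of scan-error drives were still running after 8 months: the flag raises the risk enormously in relative terms while leaving the absolute probability well short of certainty.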
  • by Ristretto ( 79399 ) <emery@@@cs...umass...edu> on Sunday February 18, 2007 @07:07PM (#18062682) Homepage
    This Google paper [usenix.org] just appeared at the 5th USENIX Conference on File and Storage Technologies [usenix.org] (a.k.a. FAST), the premier conference on file systems and storage. It won one of the best paper awards.

    You might be interested in the other best paper award winner (in the shameless self-promotion department): TFS: A Transparent File System for Contributory Storage [usenix.org], by Jim Cipar [umass.edu], Mark Corner [umass.edu], and Emery Berger [slashdot.org] (Dept. of Computer Science, University of Massachusetts Amherst [umass.edu]). Briefly, it describes how you can make all the empty space on your disk available for others to use, without affecting your own use of the disk (no performance impact, and you can still use the space if you need it).

    Enjoy!

    --
    Emery Berger
    Dept. of Computer Science
    University of Massachusetts Amherst

  • by chriso11 ( 254041 ) on Sunday February 18, 2007 @07:35PM (#18062822) Journal
    No, actually around 36% of drive failures had no SMART indications at all. Around 49% could be predicted based on the 4 or so key parameters.
  • Re:Temperatures (Score:3, Informative)

    by whitis ( 310873 ) on Sunday February 18, 2007 @10:11PM (#18063644) Homepage

    I think you are partly right in this assumption, but for the wrong reasons. Some failure modes are a function of temperature, and other failure modes are a function of temperature variation. A long time ago, platter expansion and contraction were a major cause of problems, when drives used stepper-motor positioning; since they switched to servo positioning, the drive automatically tracks the expansion and contraction of the platters, and that is pretty much a non-issue as long as the coating on the platters is not affected.

    This report reads like it was done by statisticians, not engineers. Handling of temperature, in particular, reveals this. As someone who has designed electronic circuits, been involved in reliability analysis, and repaired broken computers and other equipment at both the board level and chip level, I get the impression that the writers have not done any of those things.

    Also, the conditions in Google's RAID arrays are likely very different from those encountered in many other settings, such as office and desktop PCs. In the RAID arrays, drives are not powered down daily, and you would also expect better cooling design.

    The higher failure rate of lower-average-temperature drives is a definite eyebrow-raiser, not because it disproves the common wisdom (which still applies in the expected range) but because it is probably a clue that some important data was overlooked. If you extrapolate the right side of the graph, you see that failure does increase dramatically with temperature over the range of temperatures that would be experienced in normal cooling situations, and particularly in cooling failures.

    Google has drives that are running at room temperature? This could point to serious temperature fluctuation, to measurement error, or to extremely aggressive local cooling (chilled water or Freon A/C) or a server room that is chilled like a walk-in freezer. In that case, those drive failures are probably caused by moisture. At normal operating temperatures, a drive will drive off moisture; at cooler temperatures, there may be condensation issues on the drive itself or on cooling components near the drive.

    The reason we don't see high-temperature failures here is that the sampled temperatures are abnormally low. The most common temperature-related failures occur when you have a cooling failure or simply poor cooling. Good cooling does improve the lifetime of the drive; that does not mean, however, that cooling to extremes is a good idea. In a typical PC, the drive is going to run at somewhere around 40 degrees C. The drive in this computer right now, mounted in a typical mid-tower case in a slightly chilly room (it is winter here) that would be a lot chillier without three computers heating it, is running at 39 degrees C. That temperature corresponds to the crest of the failure-vs-temperature curve in Google's graphs.

    What temperature do you think drive manufacturers would optimize their designs for? A typical commercial-grade chip is rated 0 to 70 degrees C, so the thresholds would be expected to be optimized for 35 degrees C, and drive manufacturers would expect the normal operating temperature to be around 40 degrees C. The paper says they use consumer-grade drives; the datasheet for a WD 250GB hard drive [zdnet.com] gives an operating range (ambient, not drive temperature) of 5 degrees C (41F) to 55 degrees C (131F). I noticed while doing a Google search that some drives specified a minimum storage temperature of -13C.

    Also, if the average temperature is low, that may be an indication that the drives in that particular population are spun down or even powered down much of the time, perhaps because the particular datasets they are serving are infrequently used or because their data is entirely cached in RAM.

    Also, they talked about average temperature over the life...

"And remember: Evil will always prevail, because Good is dumb." -- Spaceballs

Working...