Google Releases Paper on Disk Reliability

Posted by Zonk on Sun Feb 18, 2007 12:18 AM
from the fun-saturday-night-reading dept.
oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"

  • Great (Score:5, Funny)

    by true_hacker (969330) on Sunday February 18 2007, @12:26AM (#18057090)
    Excellent, i have been looking forward to thi *%)%*# DISK FAILURE
      • Re:Proprietary reporting (Score:5, Insightful)

        by spisska (796395) on Sunday February 18 2007, @03:02AM (#18057912)

        ps.. all their farm is ata/ide?

        You really didn't read the article, did you? On page 3 (Section 2.2 Deployment Details), the authors state: "More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units were put into production in or after 2001. [...] The data used for this study were collected between December 2005 and August 2006."

        What are you waiting for Google to tell you? Are you really accusing them of being evil because they did a study, described their methodology, detailed their results, presented their analyses, and published it all for anyone who is interested?

        You describe their conclusions as:

        Useless

        But there is no contradiction at all if you are smart enough to understand. They are telling you that if SMART identifies a problem with a drive then it is very likely that drive will fail within 60 days. But in a sample of 100,000 drives, many drives will also fail that have not returned errors on SMART scans. Thus SMART is a reliable indicator of impending failure but is not a silver bullet that can recognize and predict all failures before they happen.

        Next time you have access to 100,000 hard drives, can analyze patterns of failure among them, can use those failures as a benchmark against which to measure analysis tools, and can come up with better recommendations for predicting failure than this study, then by all means let us know. But if you're looking for Microsoft or Western Digital or Seagate or Yahoo to perform and publish this kind of study for free, I think you may be waiting a good long while.

        [ Parent ]
  • Did they ever name the brands? (Score:4, Insightful)

    by SuperKendall (25149) on Sunday February 18 2007, @12:32AM (#18057114)
    They stated at one point in the document that some brands did have higher failure rates than others - yet I somehow missed any mention or ranking of brands. Did anyone else find that data?
    • That would be corporate dynamite (Score:5, Insightful)

      by Traf-O-Data-Hater (858971) on Sunday February 18 2007, @12:41AM (#18057158)
      I noticed this too. If a Google-sanctioned report had charts of which brands were more reliable, this would do serious damage to the brands that didn't perform so well. No wonder they sidestepped the whole issue!
      [ Parent ]
      • Re:That would be corporate dynamite (Score:4, Interesting)

        by MrZaius (321037) on Sunday February 18 2007, @12:44AM (#18057176) Homepage
        It's no wonder that Google sidestepped the issue, but, if you assume they purchase primarily from the manufacturers that are more reliable, perhaps those manufacturers will begin to gloat and publish numbers about their Google contracts, if this study gains traction.
        [ Parent ]
        • Re:That would be corporate dynamite (Score:5, Insightful)

          by Antique Geekmeister (740220) on Sunday February 18 2007, @01:26AM (#18057414)
          I'm confident that Google is fairly drive agnostic: you just can't run distributed networks that large and stay locked into a single vendor. And given that even reliable vendors have disasters like the IBM Deskstar drives some years ago, and given the remarkable growth of drive sizes over time, there's just not much point for them in buying the extremely stable but vastly more expensive hardware. They've doubtless learned that hardware flexibility provides valuable software flexibility.
          [ Parent ]
          • They do say that "vintage" matters (Score:5, Interesting)

            by Joce640k (829181) on Sunday February 18 2007, @06:04AM (#18058594)
            The report does say that "vintage" matters, i.e. that "Past performance is not a reliable indicator of future development".

            Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.

            Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic, I'm sure there's some useful information in there (e.g. a company to avoid like the plague).

            [ Parent ]
      • by Augur (62912) on Sunday February 18 2007, @09:00AM (#18059050) Homepage
        One of the largest retailers in Russia (and maybe in Europe: more than 300 terminals for in-person orders in an ex-factory building, busy 24/7), "Pro Sunrise", released information on the failure rates of the major PC components (CPUs, video cards, motherboards, IDE/SATA drives, etc.) it sold in Q1-Q2 of 2005.

        http://pro.sunrise.ru/articletext.asp?reg=30&id=28 3 [sunrise.ru] - the article (in Russian, but the diagrams are self-explanatory).

        http://pro.sunrise.ru/docs/30/image001.gif [sunrise.ru] - IDE/SATA (3.5" formfactor)

        http://pro.sunrise.ru/docs/30/image002.gif [sunrise.ru] - HDD (2.5" notebook formfactor)

        In short, Maxtor has the most returns and IBM/Hitachi the fewest.

        Toshiba is worst in 2.5", and Seagate is best.

        The chance of a failure ranges from 1/20 (Maxtor) to 1/70 (Hitachi).

        [ Parent ]
    • Re:Did they ever name the brands? (Score:5, Interesting)

      by iminplaya (723125) on Sunday February 18 2007, @12:45AM (#18057178) Journal
      FTA: "However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."


      But, of course.
      [ Parent ]
    • by Anonymous Coward on Sunday February 18 2007, @01:32AM (#18057458)
      They would have released that data, but it was saved on a Maxtor.
      [ Parent ]
        • Re:Translation (Score:5, Insightful)

          by David Price (1200) * on Sunday February 18 2007, @01:13AM (#18057330)
          More likely: "We buy millions of dollars worth of drives each year, and our buying decisions are driven in part by the reliability data that we collect. If we told everyone what kind of drives work best, more people would buy those drives, driving up the price that we pay."
          [ Parent ]
          • Re:Translation (Score:5, Insightful)

            by the_womble (580291) on Sunday February 18 2007, @01:21AM (#18057378) Homepage Journal
            Another translation: Our competitors buy millions of dollars worth of drives as well. We are not going to help them avoid the duff ones.
            [ Parent ]
            • Re:Translation (Score:5, Insightful)

              by spisska (796395) on Sunday February 18 2007, @03:31AM (#18058052)
              Another translation:

              We're not so bloody stupid to believe that our competitors are standing in the aisle of Circuit City and scratching their head over whether to buy a Seagate or WD drive.

              We know that our competitors all have their own metrics and their own relationships with manufacturers and frankly, we don't care. We know our competitors also measure these things, and we're not telling them anything they don't already know.

              We aren't particularly worried about saying that some drives fail, because everyone who cares already knows that some drives fail. Everyone whose job it is to know which drives fail first already knows that as well.

              But we're not going to tell you which brand fails at a higher rate than normal because we don't need a lawsuit that would cost us a lot of money but in the end would only confirm what the people who need to know these things already know.

              We will, on the other hand, describe the tests we ran, our methodology, our results, and our analyses. We do this just for kicks and we hope you can learn something from the results.

              And we hope you have a nice day.
              [ Parent ]
  • Conclusion (Score:4, Informative)

    by llZENll (545605) on Sunday February 18 2007, @12:39AM (#18057152)
    This is awesome, but the conclusion of such an interesting study leaves a lot to be desired. FTA...

    "In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.

    One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

    Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."
  • Similar paper (Score:4, Informative)

    by reset_button (903303) on Sunday February 18 2007, @12:42AM (#18057164)
    I was at the talk, and it was very interesting. CMU also had a paper (PDF) [cmu.edu] about disk failures in the same conference (in fact, they presented one after the other).
  • and in the meanwhile... (Score:4, Informative)

    by pedantic bore (740196) on Sunday February 18 2007, @12:50AM (#18057220)
    ... at the same conference, Bianca Schroeder presented a paper [cmu.edu] on disk reliability that developed sophisticated statistical models for disk failures, building on earlier work by Qin Xin [ucsc.edu] and a dozen papers by John Elerath... [google.com]

    C'mon, slashdot. There were about twenty other papers presented at FAST this year. Let's not focus only on the one with Google authors...

  • Temperature conclusion (Score:5, Interesting)

    by phasm42 (588479) on Sunday February 18 2007, @01:08AM (#18057310)
    Their statistics on temperature seem very unusual. I'm surprised they didn't explore this more. For example, is the high failure rate associated with low temperatures because the drives were more likely to be inactive due to failure?
  • Lower temp == higher failure rates (Score:5, Interesting)

    by flyingfsck (986395) on Sunday February 18 2007, @01:15AM (#18057350)
    To my mind the most significant piece of info: "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."
  • This speaks volumes. (Score:5, Funny)

    by greenguy (162630) <steveh&greens,org> on Sunday February 18 2007, @01:25AM (#18057398) Homepage Journal
    Google releases a paper on disk reliability.
  • How many drives really (Score:5, Insightful)

    by hankwang (413283) * on Sunday February 18 2007, @05:36AM (#18058500) Homepage

    The paper claims "more than 100 thousand drives". But the nice thing is that you can derive the actual number from the error bars, for example those in figure 4. The data should be governed by Poisson statistics, which means that the standard deviation in the counts is equal to the square root of the count. However, their error bars seem to be about a factor 2 larger than the standard deviation, because normally around 68% of the data points should lie within one standard deviation from the "smooth curve". Let's assume the error bars are 95% confidence intervals, i.e. 2 standard deviations.

    Look at the data for 20 to 21 C. It tells you that it represents a fraction 0.0135 of their total drive population, with an average failure rate of 7 +- 0.5 %. Following the reasoning above, this 7% should represent 784+-28 drives. Since these represent 7% of 1.35% of the total number of drives, we can derive that the total number of drives is 784/0.07/0.0135 = 830,000 drives. Trying the same thing for 30 to 31 C gives 826,000 drives, which seems fairly consistent.
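
    A minimal sketch of the arithmetic above, assuming (as this comment does) Poisson-distributed failure counts and error bars that span 2 standard deviations. The function name and its parameters are made up for illustration; the inputs are the values read off figure 4 as quoted here.

```python
import math


def estimate_total_drives(failure_rate, error_bar, population_fraction, n_sigma=2.0):
    """Back out drive counts from a failure rate and its error bar.

    Assumes Poisson counting statistics: a count of N failures has a
    standard deviation of sqrt(N), so the relative error is 1/sqrt(N).
    """
    sigma = error_bar / n_sigma                # one standard deviation, in rate units
    relative_error = sigma / failure_rate      # equals 1/sqrt(N) under Poisson
    failures = 1.0 / relative_error ** 2       # N, the implied number of failed drives
    drives_in_bin = failures / failure_rate    # all drives in this temperature bin
    total_drives = drives_in_bin / population_fraction
    return failures, math.sqrt(failures), total_drives


# Values for the 20 to 21 C bin: 7 +- 0.5 % failure rate, 1.35% of all drives.
failures, failures_sigma, total = estimate_total_drives(0.07, 0.005, 0.0135)
print(f"failed drives: {failures:.0f} +- {failures_sigma:.0f}")
print(f"estimated total: {total:,.0f}")
```

    Passing n_sigma=1.0 instead shows how sensitive the estimate is to the error-bar assumption: the implied total shrinks by a factor of 4.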

    So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?

    • Re:Hmm (Score:5, Funny)

      by Anonymous Coward on Sunday February 18 2007, @12:35AM (#18057132)
      Didn't read the article? (Check)
      Didn't read the summary? (Check)

      Congratulations, you're now officially a slashdot regular!
      [ Parent ]
    • Re:Hmm (Score:4, Informative)

      by Anonymous Coward on Sunday February 18 2007, @01:03AM (#18057284)
      There are several SMART signals which are highly correlated with drive errors, but the authors note that 56% of the failed drives had no occurrences of these highly correlated errors. Even considering all SMART signals, 36% of failed drives still had no SMART signals reported.

      So, if you have errors in those highly correlated categories your drives are probably going to fail, but if you do not have errors in these categories your drives can still fail.
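
      To put numbers on this, here is a rough sketch of the best detection rate any SMART-only predictor could reach, using just the two percentages quoted above. The function name is hypothetical, and this is only an upper bound on the share of failures caught; it says nothing about false alarms.

```python
def max_detectable_fraction(frac_failed_without_signal):
    """Upper bound on the share of failures a signal-based predictor can catch.

    A drive that fails without ever raising the signal cannot be flagged
    by a predictor that looks only at that signal.
    """
    return 1.0 - frac_failed_without_signal


# Percentages quoted above from the paper.
strong_signals_only = max_detectable_fraction(0.56)  # highly correlated signals only
any_smart_signal = max_detectable_fraction(0.36)     # every SMART signal considered

print(f"strong signals: at most {strong_signals_only:.0%} of failures")
print(f"any SMART signal: at most {any_smart_signal:.0%} of failures")
```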
      [ Parent ]