
Google Releases Paper on Disk Reliability 267

oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"

  • Great (Score:5, Funny)

    by true_hacker ( 969330 ) on Sunday February 18, 2007 @12:26AM (#18057090)
    Excellent, i have been looking forward to thi *%)%*# DISK FAILURE
    • Re: (Score:3, Funny)

      by Compholio ( 770966 )

      Excellent, i have been looking forward to thi *%)%*# DISK FAILURE
      That's what you get for logging into slashdot from Antarctica...
    • The research results are VERY poorly communicated, as research results often are.

      This seems to be the most relevant sentence: "What stands out are the 3 and 4- year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced." (Page 5, Section 3.4, 4th paragraph)

      Often poor communication in research pages is intended to hide the fact that the results are not very useful. The above sentence can be translated to: "If you run hard drives hot, afte
      • Here's a quote from the Google paper: "Power-on hours -- Although we do not dispute that power-on hours might have an effect on drive lifetime, it happens that in our deployment the age of the drive is an excellent approximation for that parameter, given that our drives remain powered on for most of their life time." (Page 10, 4th paragraph)

        Translation: The number of hours the drives are powered is the same as the age of the drives, since the drives are always powered.

        When two numbers are close to equ
  • Hmm (Score:2, Interesting)

    by chanrobi ( 944359 )
    So if the article summary is correct does it even matter if the consumer desktop pc has SMART enabled or not?
    • Re:Hmm (Score:5, Funny)

      by Anonymous Coward on Sunday February 18, 2007 @12:35AM (#18057132)
      Didn't read the article? (Check)
      Didn't read the summary? (Check)

      Congratulations, you're now officially a Slashdot regular!
    • Re:Hmm (Score:4, Informative)

      by Anonymous Coward on Sunday February 18, 2007 @01:03AM (#18057284)
      There are several SMART signals which are highly correlated with drive errors, but the authors note that 56% of the failed drives had no occurrences of these highly correlated errors. Even considering all SMART signals, 36% of failed drives still had no SMART signals reported.

      So, if you have errors in those highly correlated categories your drives are probably going to fail, but if you do not have errors in these categories your drives can still fail.
      • Re: (Score:3, Interesting)

        by norton_I ( 64015 )

        So, if you have errors in those highly correlated categories your drives are probably going to fail, but if you do not have errors in these categories your drives can still fail.

        It isn't even that good. Many of the failure flags indicate between 70% and 90% survivability to 8 months. This is much worse than the ~2%/year baseline failure rate, but not as strong a predictor as you might like. It would be nice to see data on this out to 2 or 3 years, so you could calculate the integrated chance of failure
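
To put the 8-month survival figures on the same footing as the ~2%/year baseline, here is a rough sketch that converts them to a 12-month failure probability. It assumes a constant failure rate over the year, which is my simplification and not something the paper claims; the function name is made up for illustration.

```python
def annualized_failure(survival, months):
    """Convert a survival fraction over `months` into a 12-month failure
    probability, assuming a constant hazard rate (an illustrative assumption)."""
    return 1.0 - survival ** (12.0 / months)


# The 70% and 90% survival-to-8-months figures quoted above.
for s in (0.70, 0.90):
    print(f"{s:.0%} survival at 8 months -> "
          f"{annualized_failure(s, 8):.1%} failure per year")
```

Under that assumption the flagged drives come out at very roughly 15% to 40% failure per year, against the ~2%/year baseline.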

    • There are apparently several SMART parameters that are correlated to eventual disk failure. If a disk starts throwing SMART errors in these categories then your best bet is to replace the disk ASAP. While it may be true that most disks fail without warning that doesn't mean it isn't a good idea to look for early warning signs of failure.
    • Re:Hmm (Score:4, Interesting)

      by jemenake ( 595948 ) on Sunday February 18, 2007 @10:44AM (#18059432)

      So if the article summary is correct does it even matter if the consumer desktop pc has SMART enabled or not?
      Well, I was a little disappointed by the article. They looked at a lot of different SMART categories and they looked at the different ages of the drives, but they didn't delve into the different types of failures. I get about 1 "I think my drive crashed and I was hoping you could recover it" call per month and I see a variety of failure types. Probably the most common ones I see now are ones where something has gone wrong with the control circuits/mechanism and not the media itself. For example, something can go wrong with the motor that spins the platters, or you can seize the bearings for the head traversal, etc. I've even seen some where a chip on the controller board literally popped when it got too hot. These aren't going to be detected by SMART... I don't know what would predict failures like that.

      The article states that, in about half of the failures, there were no SMART warnings at all. Okay, but what was the breakdown in the kinds of failures of these unpredicted ones? If they were all spindle motor and head traversal failures, then you can't blame SMART for that. If it turns out that SMART gave warnings for 95% of all failures that were media-degradation related (like bad sectors, etc... where the drive still talks to your machine properly, and just can't get the data you want), then I'd say SMART is pretty darn useful.

      But, alas, I didn't see any breakdown for failure type....
  • by SuperKendall ( 25149 ) on Sunday February 18, 2007 @12:32AM (#18057114)
    They stated at one point in the document that some brands did have higher failure rates than others - yet I somehow missed any mention or ranking of brands. Did anyone else find that data?
    • by Traf-O-Data-Hater ( 858971 ) on Sunday February 18, 2007 @12:41AM (#18057158)
      I noticed this too. If a Google-sanctioned report had charts of which brands were more reliable, this would do serious damage to the brands that didn't perform so well. No wonder they sidestepped the whole issue!
      • by MrZaius ( 321037 ) on Sunday February 18, 2007 @12:44AM (#18057176) Homepage
        It's no wonder that Google sidestepped the issue, but, if you assume they purchase primarily from the manufacturers that are more reliable, perhaps those manufacturers will begin to gloat and publish numbers about their Google contracts, if this study gains traction.
        • by Antique Geekmeister ( 740220 ) on Sunday February 18, 2007 @01:26AM (#18057414)
          I'm confident that Google is fairly drive agnostic: you just can't run distributed networks that large and stay locked into a single vendor. And given that even reliable vendors have disasters like the IBM Deskstar drives some years ago, and given the remarkable growth of drive sizes over time, there's just not much point for them in buying the extremely stable but vastly more expensive hardware. They've doubtless learned that hardware flexibility provides valuable software flexibility.
          • by fred911 ( 83970 )
            Didn't Seagate have a disaster with stiction on their RLL drives? ... Seems I remember taking apart some 10 MB RLL drives and cleaning them with Windex. Worked every time.

            ps... they cost about $300 then.

              And you call yourself an antique:-)
          • by Joce640k ( 829181 ) on Sunday February 18, 2007 @06:04AM (#18058594) Homepage
            The report does say that "vintage" matters, i.e. that "Past performance is not a reliable indicator of future development".

            Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.

            Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic I'm sure there's some useful information in there (e.g. a company to avoid like the plague).

      • by EonBlueTooL ( 974478 ) on Sunday February 18, 2007 @01:04AM (#18057292)
        Google: Organizing all the world's information and making it universally accessible and useful (unless it could be troublesome)
      • by Augur ( 62912 ) on Sunday February 18, 2007 @09:00AM (#18059050) Homepage
        One of the largest retailers in Russia (and maybe in Europe: more than 300 terminals for in-person orders in an ex-factory building, busy 24/7), "Pro Sunrise", released information on failure rates of the major components (CPUs, video cards, motherboards, IDE/SATA drives, etc.) of the PCs it sold in Q1-Q2 of 2005.

        http://pro.sunrise.ru/articletext.asp?reg=30&id=283 [sunrise.ru] - the article (in Russian, but the diagrams are self-explanatory).

        http://pro.sunrise.ru/docs/30/image001.gif [sunrise.ru] - IDE/SATA (3.5" form factor)

        http://pro.sunrise.ru/docs/30/image002.gif [sunrise.ru] - HDD (2.5" notebook form factor)

        In short, the most returns are for the Maxtor brand. The fewest: IBM/Hitachi.

        Toshiba is the worst in 2.5", and Seagate is the best.

        The chance of a drive failing ranges from 1/20 (Maxtor) to 1/70 (Hitachi).

      • by gbjbaanb ( 229885 ) on Sunday February 18, 2007 @01:10PM (#18060402)
        When a friend broke down, she asked the breakdown man who came what the most reliable cars were. He said he wasn't allowed to comment but that "he carried no Honda parts". I guess the same thing applies here - Google won't say, they'd get sued.

        On the other hand, hard drives change so much that this year's model will have a totally different design and mechanics from next year's, so IBM's crappy Deskstar range should not be a reason to blame their (ok, Hitachi's) current line.

        If you do want to know more about which drives are best - check out StorageReview [storagereview.com] and enter details of your drives into their reliability database.
    • by iminplaya ( 723125 ) on Sunday February 18, 2007 @12:45AM (#18057178) Journal
      FTA: However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.


      But, of course.
      • It appears that sentence was right after the part I read about how some makers had better results than others. So of course I scan the whole document looking for said data immediately after reading the first part, but did not return to that exact point thinking I had read it already...
    • Re: (Score:3, Insightful)

      by Xross_Ied ( 224893 )
      They didn't include any data at all about brands.

      They should have done brand analysis (without naming the brand) and also rpm analysis.

      From the article..

      3.2 Manufacturers, Models, and Vintages
      Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18]. Our results do not contradict this fact. For example, Figure 2 changes significantly when we normalize failure rates per each drive model. Most age-related results are impacted by drive vintages. However, in this paper, we do

    • Re: (Score:3, Informative)

      by drmerope ( 771119 )
      No. They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information. The question that really needs to be studied is what distinguishes good drives from bad. This would probably involve disassembling drives of various 'vintages, models, manufacturers' and trying to pin down the relevant details. That way when new hard-drives get released, reviewers can pull them apart and judge them on something other than read/write performance, hea
      • by Prof.Phreak ( 584152 ) on Sunday February 18, 2007 @12:53AM (#18057230) Homepage
        At the very least, they could've named brands X, Y, Z, etc., and provided the numbers for those. Would be interesting if the differences are more than marginal.
        • That would have been the perfect way to divulge this data without causing direct harm to any maker - I would really have liked to see if there was a large variance between brands, which might even lead me to purchase brand Y more, even if it's not at the top of the reliability chart - just so long as it was cheaper.
      • That way when new hard-drives get released, reviewers can pull them apart and judge them on something other than read/write performance, heat, and acoustics...

        You forgot one metric of comparison: the warranty. As far as I'm concerned, this number alone is the most important in determining the reliability of the hard drive. If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product. When buying hard drives,

        • Re: (Score:2, Interesting)

          by LunarCrisis ( 966179 )

          If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product.

          Or maybe the manufacturer just realized that 5 years down the road, a replacement for your then 5 year old HD will cost them peanuts. According to the graph at http://en.wikipedia.org/wiki/Hard_drives#Capacity [wikipedia.org], HD capacity seems to be increasing by roughly ten times every five years.

          It's like the CD-R manufacturers stamping all the packaging with

        • Re: (Score:3, Insightful)

          by Simon Brooke ( 45012 ) *

          You forgot one metric of comparison: the warranty. As far as I'm concerned, this number alone is the most important in determining the reliability of the hard drive. If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product. When buying hard drives, I actively seek out drives with at least a 3 (preferably 5) year warranty (some Hitachis and Seagates IIRC) and explicitly avoid those with only a 1 year warrant

      • Re: (Score:3, Interesting)

        They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information.

        What? So the part about which variables are correlated with drive failures (which is what the report was about) wasn't interesting to you? Too bad.

      • Re: (Score:3, Insightful)

        by Fred_A ( 10934 )

        No. They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information.

        The pertinence of the SMART data (pretty much always pertinent) and how often it popped up (about half the time) before a failure was a very interesting bit of information.

        The question that really needs to be studied is what distinguishes good drives from bad.

        A good drive is one that lasts a long time without developing too many bad blocks. A bad drive is one that

    • Re: (Score:3, Insightful)

      by repvik ( 96666 )
      "However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data." (From TFA)
      • Translation (Score:4, Funny)

        by jd ( 1658 ) <imipak@@@yahoo...com> on Sunday February 18, 2007 @01:05AM (#18057304) Homepage Journal
        "We don't want to be sued to within an inch of our lives by certain very wealthy brands, due to US law allowing manufacturers to prohibit unfavourable reviews."

        Ideally, they would have formatted the text to spell out the names of the brands if you take the first letter of every Nth word, or some specific column of text. (Or maybe they have...)

        • Re:Translation (Score:5, Insightful)

          by David Price ( 1200 ) * on Sunday February 18, 2007 @01:13AM (#18057330)
          More likely: "We buy millions of dollars worth of drives each year, and our buying decisions are driven in part by the reliability data that we collect. If we told everyone what kind of drives work best, more people would buy those drives, driving up the price that we pay."
          • Re:Translation (Score:5, Insightful)

            by the_womble ( 580291 ) on Sunday February 18, 2007 @01:21AM (#18057378) Homepage Journal
            Another translation: Our competitors buy millions of dollars worth of drives as well. We are not going to help them avoid the duff ones.
            • Re:Translation (Score:5, Insightful)

              by spisska ( 796395 ) on Sunday February 18, 2007 @03:31AM (#18058052)
              Another translation:

              We're not so bloody stupid to believe that our competitors are standing in the aisle of Circuit City and scratching their head over whether to buy a Seagate or WD drive.

              We know that our competitors all have their own metrics and their own relationships with manufacturers and frankly, we don't care. We know our competitors also measure these things, and we're not telling them anything they don't already know.

              We aren't particularly worried about saying that some drives fail, because everyone who cares already knows that some drives fail. Everyone whose job it is to know which drives fail first already knows that as well.

              But we're not going to tell you which brand fails at a higher rate than normal because we don't need a lawsuit that would cost us a lot of money but in the end would only confirm what the people who need to know these things already know.

              We will, on the other hand, describe the tests we ran, our methodology, our results, and our analyses. We do this just for kicks and we hope you can learn something from the results.

              And we hope you have a nice day.
              • RTFA. It says: However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.

                It is clearly not proprietary to the drive manufacturers, because it came from Google's study. This means they regard it as proprietary to themselves.

                How do you know that their competitors have done equally good studies? Given the large population (100,000) and the fact that people are surprised even by some of the published conclusions, it is very l

                • Re: (Score:3, Insightful)

                  by Eivind ( 15695 )
                  It's not that surprising. The only mildly interesting thing I see is that high load seems to *not* increase failure rates much, other than in the first few months. They hypothesize that this may be because some drives don't handle high load -- and die early -- however those drives that survive the first ~6 months with high load are the more robust ones, and those hold up well.

                  Makes sense. Killing the weaker infants makes the adult population healthier.

          • Re: (Score:3, Funny)

            by bendodge ( 998616 )
            How did that get modded insightful? When there is more demand the price goes down, not up!
            • How did that get modded informative? That's not informative. This [wikipedia.org] is informative.
            • How did that get modded insightful? When there is more demand the price goes down, not up!

              Sigh. That's the most misinformed post I've ever seen on Slashdot. Demand, by itself, says absolutely nothing about the price of something.

          • by Jahz ( 831343 )

            More likely: "We buy millions of dollars worth of drives each year, and our buying decisions are driven in part by the reliability data that we collect. If we told everyone what kind of drives work best, more people would buy those drives, driving up the price that we pay."

            You tard. Demand and price in a free market are reversely proportional. Go back to high school economics! Not only that, but the great drive company mentioned would probably get more press and money, leading to more R&D and even better drives.

            I wish Google released the data they found because it would force the crappy drive companies to improve their products.

            • by osu-neko ( 2604 )

              Demand and price in a free market are reversely proportional.

              One way to spot someone who doesn't really understand economics is how quickly they make statements like that. You would need to know a lot more about the thing in question before being able to make a generalization like that. Sometimes, they're directly proportional, sometimes, they're reversely proportional, and sometimes they're neither. It depends on a lot of other things which relationship holds true, if any.

    • by Anonymous Coward on Sunday February 18, 2007 @01:32AM (#18057458)
      They would have released that data, but it was saved on a Maxtor.
    • by MadMorf ( 118601 )
      They specifically stated they would not be revealing the brands or models.

      I think that's understandable given the litigious nature of business today...

      Makes it a little less useful from a practical standpoint though...
    • that it changes more from year to year and model to model than from one manufacturer to another?
    • I was disappointed that they didn't offer this information in the report - but not really surprised.
    • by nolife ( 233813 )
      You did not miss anything. The report states:

      However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.


      and then adds to it with:

      Interestingly, this does not change our conclusions. In contrast to age-related results, we note that all results shown in the rest of the paper are not affected significantly by the population mix.


      Proprietary? Wrong use of the word there. What they really mean is we do not want to make specific compa
  • by Anonymous Coward on Sunday February 18, 2007 @12:36AM (#18057138)
    But the disk it was on failed.
  • Conclusion (Score:4, Informative)

    by llZENll ( 545605 ) on Sunday February 18, 2007 @12:39AM (#18057152)
    This is awesome, but the conclusion of such an interesting study leaves a lot to be desired. FTA...

    "In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.

    One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

    Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."
  • Similar paper (Score:4, Informative)

    by reset_button ( 903303 ) on Sunday February 18, 2007 @12:42AM (#18057164)
    I was at the talk, and it was very interesting. CMU also had a paper (PDF) [cmu.edu] about disk failures in the same conference (in fact, they presented one after the other).
  • by pedantic bore ( 740196 ) on Sunday February 18, 2007 @12:50AM (#18057220)
    ... at the same conference, Bianca Schroeder presented a paper [cmu.edu] on disk reliability that developed sophisticated statistical models for disk failures, building on earlier work by Qin Xin [ucsc.edu] and a dozen papers by John Elerath... [google.com]

    C'mon, slashdot. There were about twenty other papers presented at FAST this year. Let's not focus only on the one with Google authors...

    • by oGMo ( 379 ) on Sunday February 18, 2007 @01:26AM (#18057410)

      While at a glance, it may seem like this is simply "the latest thing google did," and... let's be honest, given the editor in question... this was most likely the reason it made the front page. But while Bianca Schroeder's report, for instance, uses statistics from various unnamed sources and for various unnamed uses, the Google report is interesting because we know exactly where it's coming from and what it's being used for.

      Of course, a truly insightful story would have taken this opportunity to compare Google's findings with the others and report on that.

  • by phasm42 ( 588479 ) on Sunday February 18, 2007 @01:08AM (#18057310)
    Their statistics on temperature seem very unusual. I'm surprised they didn't explore this more. For example, is the high failure rate associated with low temperatures because the drives were more likely to be inactive due to failure?
    • Re: (Score:3, Insightful)

      by Chalex ( 71702 )
      The chart implies that the "optimal" operating drive temperature is 35-45 Celsius. Drive temperatures below room temperature (below 22 Celsius) are probably not a scenario that drive manufacturers optimise for.
    • by gnu-sucks ( 561404 ) on Sunday February 18, 2007 @01:54AM (#18057568) Journal
      My guess is this graph on temperature distribution is more or less a graph of temperature sensor accuracy. I can't imagine that drives at 50C had the lowest failure rate.

      While this would require a more laboratory-like environment, a dozen drives of each type and manufacturer could have been sampled at known temperatures, and a data curve could have been established to calibrate the temperature sensors.

      There are lots of studies out there where drives were intentionally heated, and higher degrees of failure were indeed reported (this is mentioned in the Google report too). So the correlation is probably still valid, just not well-proven.
    • Re: (Score:3, Insightful)

      by bouis ( 198138 )
      If hard drives are anything like car engines [especially those made with iron and aluminum], the designers have taken the standard operating temperature into account in the design. The parts of varying composition fit together best at the right temperature, and temperatures higher or lower result in damage or accelerated wear.

      This is why, if you want your engine to last, you should let your car warm up before driving it hard.
  • by flyingfsck ( 986395 ) on Sunday February 18, 2007 @01:15AM (#18057350)
    To my mind the most significant piece of info: "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."
    • by beavis88 ( 25983 )
      But did the lower temperature actually cause the failures? Such a counterintuitive conclusion seems like it'd be worth some further examination...I can turn off some fans in my cases and get the drives back up into the 40-45C range pretty quickly if need be!
      • Re: (Score:2, Insightful)

        by Anonymous Coward
        perhaps there is some correlation between lower temperature and higher forces, ie. a drive that starts and stops frequently may have a lower temperature, but would undergo more acceleration and stress
    • Re: (Score:3, Insightful)

      Yes, the low temperature finding is most interesting. I have an hypothesis as to what might be going on. I suspect that absolute temperatures, within certain limits, are not important to drive reliability, but that temperature variation is. Drives that, because of their location and pattern of use, tend to fluctuate in temperature between, say, 20 and 35 degrees centigrade are being stressed more than those at a steady 40 degrees.
  • by Mammothrept ( 588717 ) on Sunday February 18, 2007 @01:25AM (#18057396) Journal
    "...we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."

    Litigation avoidance may be a consideration here but why not take Google at their word? Google is a search company that buys lots of hard drives. Based on their own internal research, they have developed information about which hard disk models and/or manufacturers are shite.

    Yahoo is also a search company that buys lots of hard drives. Why should Google give that hard drive reliability information to you, me and Yahoo for free? Let Yahoo/Excite/MSN and the competitors figure it out for themselves.

    Yeah, sure I'd like to have access to Google's data the next time I'm in the market for a hard drive but I won't hold a grudge against them if they don't do my consumer research for me. On the other hand, whereinafuck is the data from Tom's Hardware Guide, Anandtech, Consumer Reports and all the other reviewer and consumer sites? If someone doesn't have a handy link to their results, I'll see if I can google something up:

    http://www.google.com/search?hl=en&safe=off&client=firefox-a&rls=com.ubuntu%3Aen-US%3Aofficial&hs=tqy&q=hard+drive+reliability+research+brands++manufacturers+models&btnG=Search [google.com]
  • by greenguy ( 162630 ) <estebandido AT gmail DOT com> on Sunday February 18, 2007 @01:25AM (#18057398) Homepage Journal
    Google releases a paper on disk reliability.
  • This is completely anecdotal, unscientific... Since building out two servers a couple of years ago, each with approximately 800G of drive space, I've had to replace an average of one drive every 8 weeks. In my lab there are about twenty drives across 8 machines, so that number is not too bad. Or so I thought. After replacing all my power supplies my drive failures have gone way down. The only drive I've lost recently is one in an older machine with an ancient 300W power supply.
  • Well, the article's conclusion looks pretty clear to me. Watch for scan errors in smartd reports. When they start happening, migrate your data off that disk and replace it.
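
A minimal sketch of that kind of check, for anyone who wants to script it. It parses text in the attribute-table layout printed by `smartctl -A` and reports watched attributes whose raw value is above zero. The watch-list is my own assumption about which attributes roughly correspond to the reallocation and pending-sector counts the paper discusses (the paper's "scan errors" may not map to a single attribute), and the sample text is invented for illustration.

```python
# Attribute names to watch. This mapping from the paper's terms to smartctl
# attribute names is an assumption for illustration, not from the paper.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def nonzero_watched(smartctl_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    hits = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header row and any blank lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            hits[name] = int(raw)
    return hits


# Invented sample in the `smartctl -A` table layout (not real drive data).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8123
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

print(nonzero_watched(SAMPLE))
```

Keep in mind the paper's other finding: a clean result here does not mean the drive is safe, since a large fraction of failed drives showed no SMART signals at all.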
  • The GDRIVE (Score:2, Interesting)

    by Shohat ( 959481 )
    About a year and a half ago, a presentation by Google concerning a massive online storage service called GDrive was leaked. It was pretty much confirmed that it is on some level operational. The study might have something to do with it, maybe even some kind of clever PR. Just my 2c.
  • Their conclusion (and a glance at their results) indicates that drives fail because of product defects. However, home-use parameters such as brown power (low voltage on the line) are probably not taken into account in their server environment.

    It's interesting, and I tend to trust their results, but these conclusions may not be relevant to single-drive situations. That is, if two customers purchase 1 drive each, and neither drive is defective, then this study doesn't explain why one drive would fail befo
  • by hankwang ( 413283 ) * on Sunday February 18, 2007 @05:36AM (#18058500) Homepage

    The paper claims "more than 100 thousand drives". But the nice thing is that you can derive the actual number from the error bars, for example those in figure 4. The data should be governed by Poisson statistics, which means that the standard deviation in the counts is equal to the square root of the count. However, their error bars seem to be about a factor of 2 larger than the standard deviation, because normally only around 68% of the data points should lie within one standard deviation of the "smooth curve". Let's assume the error bars are 95% confidence intervals, i.e. 2 standard deviations.

    Look at the data for 20 to 21 C. It tells you that it represents a fraction 0.0135 of their total drive population, with an average failure rate of 7 +- 0.5 %. Following the reasoning above, this 7% should represent 784+-28 drives. Since these represent 7% of 1.35% of the total number of drives, we can derive that the total number of drives is 784/0.07/0.0135 = 830,000 drives. Trying the same thing for 30 to 31 C gives 826,000 drives, which seems fairly consistent.
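    For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the estimate above. The function and parameter names are made up for illustration, and it rests on the same assumptions as the post: Poisson counting statistics, and error bars that span 2 standard deviations.

```python
def drives_from_error_bar(rate_pct, bar_pct, population_fraction, sigmas_per_bar=2.0):
    """Estimate failure count and total population from one bin's error bar.

    rate_pct            -- failure rate read off the plot, in percent
    bar_pct             -- half-width of the error bar, in percent
    population_fraction -- share of all drives that fall in this bin
    sigmas_per_bar      -- ASSUMPTION: the error bar spans this many sigma
    """
    # One standard deviation of the failure rate, in percent.
    sigma_pct = bar_pct / sigmas_per_bar
    # Poisson: sigma = sqrt(N), so relative error = 1 / sqrt(N).
    relative_error = sigma_pct / rate_pct
    failures = 1.0 / relative_error ** 2
    # Failed drives are rate_pct percent of the drives in this bin.
    drives_in_bin = failures / (rate_pct / 100.0)
    # The bin is population_fraction of the whole population.
    total_drives = drives_in_bin / population_fraction
    return failures, total_drives


# Values read off the 20 to 21 C bin, as quoted above.
failures, total_drives = drives_from_error_bar(7.0, 0.5, 0.0135)
print(round(failures), round(total_drives))
```

    With the 20 to 21 C numbers this gives about 784 failed drives and a total of roughly 830,000. Setting sigmas_per_bar to 1 instead cuts both figures by a factor of 4, so the result depends heavily on what the error bars really mean.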

    So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?

    • Re: (Score:3, Insightful)

      by CRC'99 ( 96526 )

      So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?

      Do you really think that they don't store every cookie and search pattern of everyone who uses their search engine? Cross-reference all this data, alter their ranks, follow your interests, use those to make money and target you with ads?

      There is a ton of money for this information, and with enough stored data and having the facility to mine it, filter it, and sort it to location level for vario

  • Bell Labs (Score:3, Insightful)

    by gustgr ( 695173 ) <gustgr.gmail@com> on Sunday February 18, 2007 @05:46AM (#18058536)
    Google Labs, still in its youth, certainly reminds me of the golden years of Bell Labs.
  • Temperatures (Score:3, Interesting)

    by Trogre ( 513942 ) on Sunday February 18, 2007 @05:40PM (#18062174) Homepage
    An interesting document, and I found the data on temperatures particularly interesting.

    I have been previously led to believe that it's not so much the average temperature of a hard drive that causes failure, but temperature fluctuations. This makes sense, since repeated expansion and contraction of the disk platters is likely to cause warpage before too long. This, I guess, is where glass platters like what IBM toyed with would come in useful. In the meantime I guess we still need our HVAC units to keep a constant temperature, just not too low anymore.

    This also has implications for data centers that spend a considerable amount of energy pumping heat out of the server room. If we can raise the industry-accepted temperature ceiling from 22C to say 30C then a lot of energy can be saved over time. Perhaps not quite enough to dip below 1% of US-wide power use but every bit helps.

    • Re: (Score:3, Informative)

      by whitis ( 310873 )

      I think you are partly right in this assumption, but for the wrong reasons. Some failure modes are a function of temperature and other failure modes are a function of temperature variation. A long time ago platter expansion and contraction was a major cause of problems when drives used stepper motor positioning; since they switched to servo positioning, the drive automatically tracks the expansion and contraction of the platters and that is pretty much a non-issue as long as the coating on the platters

  • by Ristretto ( 79399 ) <emery@@@cs...umass...edu> on Sunday February 18, 2007 @07:07PM (#18062682) Homepage
    This Google paper [usenix.org] just appeared at the 5th USENIX Conference on File and Storage Technologies [usenix.org] (a.k.a. FAST), the premier conference on file systems and storage. It won one of the best paper awards.

    You might be interested in the other best paper award winner (in the shameless self-promotion department): TFS: A Transparent File System for Contributory Storage [usenix.org], by Jim Cipar [umass.edu], Mark Corner [umass.edu], and Emery Berger [slashdot.org] (Dept. of Computer Science, University of Massachusetts Amherst [umass.edu]). Briefly, it describes how you can make all the empty space on your disk available for others to use, without affecting your own use of the disk (no performance impact, and you can still use the space if you need it).

    Enjoy!

    --
    Emery Berger
    Dept. of Computer Science
    University of Massachusetts Amherst

