
Google Releases Paper on Disk Reliability

oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"

  • by SuperKendall ( 25149 ) on Sunday February 18, 2007 @12:32AM (#18057114)
    They stated at one point in the document that some brands did have higher failure rates than others - yet I somehow missed any mention or ranking of brands. Did anyone else find that data?
  • by Traf-O-Data-Hater ( 858971 ) on Sunday February 18, 2007 @12:41AM (#18057158)
    I noticed this too. If a Google-sanctioned report had charts of which brands were more reliable, this would do serious damage to the brands that didn't perform so well. No wonder they sidestepped the whole issue!
  • by Anonymous Coward on Sunday February 18, 2007 @12:46AM (#18057186)
    Google's studies are like their search engine: you get a bunch of results, but you have to sift through them yourself to get anything specific, and you'll probably end up reading the section most closely related to boobies.
  • by Xross_Ied ( 224893 ) on Sunday February 18, 2007 @12:47AM (#18057188) Homepage
    They didn't include any data at all about brands.

    They should have done brand analysis (without naming the brand) and also rpm analysis.

    From the article..

    3.2 Manufacturers, Models, and Vintages
    Failure rates are known to be highly correlated with drive
    models, manufacturers and vintages [18]. Our results do
    not contradict this fact. For example, Figure 2 changes
    significantly when we normalize failure rates per each
    drive model. Most age-related results are impacted by
    drive vintages. However, in this paper, we do not show a
    breakdown of drives per manufacturer, model, or vintage
    due to the proprietary nature of these data.

  • by repvik ( 96666 ) on Sunday February 18, 2007 @12:51AM (#18057222)
    "However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data." (From TFA)
  • by Prof.Phreak ( 584152 ) on Sunday February 18, 2007 @12:53AM (#18057230) Homepage
    At the very least, they could've named brands X, Y, Z, etc., and provided the numbers for those. Would be interesting if the differences are more than marginal.
  • by ryturner ( 87582 ) on Sunday February 18, 2007 @01:03AM (#18057286)
    It would be useful to you and me. But it is not useful to google to release that information.

  • by EonBlueTooL ( 974478 ) on Sunday February 18, 2007 @01:04AM (#18057292)
    Google: Organizing all the world's information and making it universally accessible and useful (unless it could be troublesome)
  • Re:Translation (Score:5, Insightful)

    by David Price ( 1200 ) * on Sunday February 18, 2007 @01:13AM (#18057330)
    More likely: "We buy millions of dollars worth of drives each year, and our buying decisions are driven in part by the reliability data that we collect. If we told everyone what kind of drives work best, more people would buy those drives, driving up the price that we pay."
  • by AmigaBen ( 629594 ) on Sunday February 18, 2007 @01:20AM (#18057374)
    How was it useful to Google to publish the report at all?

    I don't see the point in pretending to provide information while obfuscating the most meaningful bits of it, unless it's a sales attempt to garner attention for a paid-for version of the report. Obviously, Google has concerns in the process different than what our concerns are, but again, I don't really see the point in the report without the brands.

  • Re:Translation (Score:5, Insightful)

    by the_womble ( 580291 ) on Sunday February 18, 2007 @01:21AM (#18057378) Homepage Journal
    Another translation: Our competitors buy millions of dollars worth of drives as well. We are not going to help them avoid the duff ones.
  • by Mammothrept ( 588717 ) on Sunday February 18, 2007 @01:25AM (#18057396) Journal
    "...we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."

    Litigation avoidance may be a consideration here but why not take Google at their word? Google is a search company that buys lots of hard drives. Based on their own internal research, they have developed information about which hard disk models and/or manufacturers are shite.

    Yahoo is also a search company that buys lots of hard drives. Why should Google give that hard drive reliability information to you, me and Yahoo for free? Let Yahoo/Excite/MSN and the competitors figure it out for themselves.

    Yeah, sure I'd like to have access to Google's data the next time I'm in the market for a hard drive but I won't hold a grudge against them if they don't do my consumer research for me. On the other hand, whereinafuck is the data from Tom's Hardware Guide, Anandtech, Consumer Reports and all the other reviewer and consumer sites? If someone doesn't have a handy link to their results, I'll see if I can google something up:

    http://www.google.com/search?hl=en&safe=off&client=firefox-a&rls=com.ubuntu%3Aen-US%3Aofficial&hs=tqy&q=hard+drive+reliability+research+brands++manufacturers+models&btnG=Search [google.com]
  • by oGMo ( 379 ) on Sunday February 18, 2007 @01:26AM (#18057410)

    While at a glance, it may seem like this is simply "the latest thing google did," and... let's be honest, given the editor in question... this was most likely the reason it made the front page. But while Bianca Schroeder's report, for instance, uses statistics from various unnamed sources and for various unnamed uses, the Google report is interesting because we know exactly where it's coming from and what it's being used for.

    Of course, a truly insightful story would have taken this opportunity to compare Google's findings with the others and report on that.

  • by Antique Geekmeister ( 740220 ) on Sunday February 18, 2007 @01:26AM (#18057414)
    I'm confident that Google is fairly drive agnostic: you just can't run distributed networks that large and stay locked into a single vendor. And given that even reliable vendors have disasters like the IBM Deskstar drives some years ago, and given the remarkable growth of drive sizes over time, there's just not much point for them in buying the extremely stable but vastly more expensive hardware. They've doubtless learned that hardware flexibility provides valuable software flexibility.
  • by devilspgd ( 652955 ) * on Sunday February 18, 2007 @01:40AM (#18057498) Homepage
    Organizing and making accessible information which is already available is one thing, producing information is completely different.
  • by Chalex ( 71702 ) on Sunday February 18, 2007 @01:51AM (#18057548) Homepage
    The chart implies that the "optimal" operating drive temperature is 35-45 Celsius. Drive temperatures below room temperature (below 22 Celsius) are probably not a scenario that drive manufacturers optimise for.
  • by spisska ( 796395 ) on Sunday February 18, 2007 @03:02AM (#18057912)

    ps.. all their farm is ata/ide?

    You really didn't read the article, did you? On page 3 (Section 2.2 Deployment Details), the authors state: "More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units were put into production in or after 2001. [...] The data used for this study were collected between December 2005 and August 2006."

    What are you waiting for Google to tell you? Are you really accusing them of being evil because they did a study, described their methodology, detailed their results, presented their analyses, and published it all for anyone who is interested?

    You describe their conclusions as:

    Useless

    But there is no contradiction at all if you are smart enough to understand. They are telling you that if SMART identifies a problem with a drive then it is very likely that drive will fail within 60 days. But in a sample of 100,000 drives, many drives will also fail that have not returned errors on SMART scans. Thus SMART is a reliable indicator of impending failure but is not a silver bullet that can recognize and predict all failures before they happen.

    Next time you have access to 100,000 hard drives, can analyze patterns of failure among them, can use those failures as a benchmark against which to measure analysis tools, and can come up with better recommendations for predicting failure than this study, then by all means let us know. But if you're looking for Microsoft or Western Digital or Seagate or Yahoo to perform and publish this kind of study for free, I think you may be waiting a good long while.

  • by HUADPE ( 903765 ) on Sunday February 18, 2007 @03:15AM (#18057980) Homepage
    There are several good reasons to not release the brand names. First, while the sample size is huge, the sample size for a particular model of a particular brand might not be. If they only happened to have 10 of one particular model, and one failed within a month, then 10% fail within a month, but it could just be a fluke. Second, liability. This wasn't a controlled test, it was done live within the Google servers (presumably). Whoever is on the bottom of the list could very well sue Google for libel. Without merit? Probably, but they might eke a few million in a settlement out of them. Google can't appear to be doing evil after all.
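    To put a rough number on the "could just be a fluke" point, here is a minimal sketch. It assumes a 95% Wilson score interval is a reasonable way to express the uncertainty; the function name and the choice of interval are illustrative assumptions, not anything taken from the Google paper.

```python
import math

def wilson_interval(failures, drives, z=1.96):
    """Approximate 95% Wilson score interval for a failure proportion.

    Hypothetical helper for illustration only; z=1.96 corresponds to ~95%.
    """
    if drives <= 0:
        raise ValueError("drives must be positive")
    p = failures / drives
    denom = 1 + z * z / drives
    center = (p + z * z / (2 * drives)) / denom
    half = z * math.sqrt(p * (1 - p) / drives + z * z / (4 * drives * drives)) / denom
    return center - half, center + half

# The 1-failure-in-10-drives case described above.
low, high = wilson_interval(1, 10)
print(f"observed 10%, plausible range {low:.1%} to {high:.1%}")
```

    With only 10 drives the plausible range runs from under 2% to around 40%, so a per-model ranking built on a sample that small would say very little.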
  • by Jah-Wren Ryel ( 80510 ) on Sunday February 18, 2007 @03:20AM (#18058002)
    Google: Organizing all the world's information and making it universally accessible and useful (unless it could be troublesome)

    Old Google Motto: Don't do anything evil.
    New Google Motto: Don't get into trouble.
  • by Anonymous Coward on Sunday February 18, 2007 @03:28AM (#18058032)
    perhaps there is some correlation between lower temperature and higher forces, ie. a drive that starts and stops frequently may have a lower temperature, but would undergo more acceleration and stress
  • Re:Translation (Score:5, Insightful)

    by spisska ( 796395 ) on Sunday February 18, 2007 @03:31AM (#18058052)
    Another translation:

    We're not so bloody stupid to believe that our competitors are standing in the aisle of Circuit City and scratching their head over whether to buy a Seagate or WD drive.

    We know that our competitors all have their own metrics and their own relationships with manufacturers and frankly, we don't care. We know our competitors also measure these things, and we're not telling them anything they don't already know.

    We aren't particularly worried about saying that some drives fail, because everyone who cares already knows that some drives fail. Everyone whose job it is to know which drives fail first already knows that as well.

    But we're not going to tell you which brand fails at a higher rate than normal because we don't need a lawsuit that would cost us a lot of money but in the end would only confirm what the people who need to know these things already know.

    We will, on the other hand, describe the tests we ran, our methodology, our results, and our analyses. We do this just for kicks and we hope you can learn something from the results.

    And we hope you have a nice day.
  • by bouis ( 198138 ) on Sunday February 18, 2007 @04:45AM (#18058366)
    If hard drives are anything like car engines [especially those made with iron and aluminum], the designers have taken the standard operating temperature into account in the design. The parts of varying composition fit together best at the right temperature, and temperatures higher or lower result in damage or accelerated wear.

    This is why, if you want your engine to last, you should let your car warm up before driving it hard.
  • by Mostly a lurker ( 634878 ) on Sunday February 18, 2007 @05:01AM (#18058412)
    Yes, the low temperature finding is most interesting. I have an hypothesis as to what might be going on. I suspect that absolute temperatures, within certain limits, are not important to drive reliability, but that temperature variation is. Drives that, because of their location and pattern of use, tend to fluctuate in temperature between, say, 20 and 35 degrees centigrade are being stressed more than those at a steady 40 degrees.
  • Re:Translation (Score:3, Insightful)

    by Eivind ( 15695 ) <eivindorama@gmail.com> on Sunday February 18, 2007 @05:06AM (#18058426) Homepage
    It's not that surprising. The only mildly interesting thing I see is that high load seems to *not* increase failure-rates much, other than the first few months. They hypothesize that this may be because some drives don't handle high load -- and die early -- however those drives that survive the first ~6 months with high load are the more robust ones, and those hold up well.

    Makes sense. Killing the weaker infants makes the adult population healthier.

  • by hankwang ( 413283 ) * on Sunday February 18, 2007 @05:36AM (#18058500) Homepage

    The paper claims "more than 100 thousand drives". But the nice thing is that you can derive the actual number from the error bars, for example those in figure 4. The data should be governed by Poisson statistics, which means that the standard deviation in the counts is equal to the square root of the count. However, their error bars seem to be about a factor 2 larger than the standard deviation, because normally around 68% of the data points should lie within one standard deviation from the "smooth curve". Let's assume the error bars are 95% confidence intervals, i.e. 2 standard deviations.

    Look at the data for 20 to 21 C. It tells you that it represents a fraction 0.0135 of their total drive population, with an average failure rate of 7 +- 0.5 %. Following the reasoning above, this 7% should represent 784+-28 drives. Since these represent 7% of 1.35% of the total number of drives, we can derive that the total number of drives is 784/0.07/0.0135 = 830,000 drives. Trying the same thing for 30 to 31 C gives 826,000 drives, which seems fairly consistent.

    So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?
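    The arithmetic above can be sketched in a few lines. This is only an illustration of the comment's own reasoning: it assumes the failure counts are Poisson-distributed and that the error bars are 2 standard deviations wide, exactly as guessed above. The function and parameter names are made up for the example.

```python
def estimate_total_drives(failure_rate, error_bar, population_fraction,
                          sigmas_per_bar=2.0):
    """Back out the population size from a failure rate and its error bar.

    Assumes Poisson counting statistics: sigma_count = sqrt(count), so the
    relative error sigma/rate equals 1/sqrt(count).
    """
    sigma = error_bar / sigmas_per_bar
    failed = (failure_rate / sigma) ** 2       # number of failed drives in the bin
    drives_in_bin = failed / failure_rate      # all drives in this temperature bin
    total = drives_in_bin / population_fraction
    return failed, total

# Values read off figure 4 for the 20 to 21 C bin, as quoted above.
failed, total = estimate_total_drives(0.07, 0.005, 0.0135)
print(f"about {failed:.0f} failures, about {total:,.0f} drives in total")
```

    The result is only as good as the 2-sigma assumption: if the error bars were 1 standard deviation instead, the estimate would shrink by a factor of four.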

  • by mabinogi ( 74033 ) on Sunday February 18, 2007 @05:38AM (#18058510) Homepage
    Being able to choose freely to not say something is freedom of speech.
    The right to stay silent on something is just as important a freedom as the right to have your say.

    Censorship has nothing whatsoever to do with it.
  • Bell Labs (Score:3, Insightful)

    by gustgr ( 695173 ) <gustgrNO@SPAMgmail.com> on Sunday February 18, 2007 @05:46AM (#18058536)
    Google Labs, still in its youth, certainly reminds me of the golden years of Bell Labs.
  • Re:Translation (Score:1, Insightful)

    by Anonymous Coward on Sunday February 18, 2007 @06:52AM (#18058716)


    Demand and price in a free market are inversely proportional.

    One way to spot someone who doesn't really understand economics is how quickly they make statements like that. You would need to know a lot more about the thing in question before being able to make a generalization like that. Sometimes, they're directly proportional, sometimes, they're inversely proportional, and sometimes they're neither. It depends on a lot of other things which relationship holds true, if any.


    It could probably have been better stated, "demand and price in a free market are inversely proportional in the long term, assuming that there are no barriers to entry." Bearing in mind that it'd cost a few billion dollars to set up an enterprise hard disk company, it doesn't really apply here.

    And realistically, even if Google did say "Hitachi disks are 10 times more reliable than everyone else's", who apart from a few thousand geeks would even know enough to make buying decisions based upon it?

    Alex
  • by Simon Brooke ( 45012 ) * <stillyet@googlemail.com> on Sunday February 18, 2007 @09:28AM (#18059136) Homepage Journal

    You forgot one metric of comparison: the warranty. As far as I'm concerned, this number alone is the most important in determining the reliability of the hard drive. If the manufacturer is willing to say "This drive will last for X years or we replace it free," it speaks volumes about their confidence behind their product. When buying hard drives, I actively seek out drives with at least a 3 (preferably 5) year warranty (some Hitachis and Seagates IIRC) and explicitly avoid those with only a 1 year warranty period (I'm looking at you WD).

    You know, I don't give a monkey's. What you lose when a disk goes down (if you haven't done your backups properly) is typically far more valuable than the disk mechanism itself. Any manufacturer can put a five-year warranty on a disk mechanism as a gimmick. Most users won't remember the warranty when the disk goes down, and, even if they have to replace 10% of the units 'free', it doesn't take much on the retail price to cover that.

    20 years ago we had a spate of failures on Western Digital drives on machines which were out with customers. That really hurt - giving our customers free drives would not have cheered them up. 10 years ago we had a spate of failures of Samsung drives in a server farm. That was more under control, but it was still a bloody nuisance. I don't want a drive which fails, but when it fails I get a new one free. I want a drive that doesn't fail. The warranty has absolutely nothing to do with it.

  • by Fred_A ( 10934 ) <fred@f r e d s h o m e . o rg> on Sunday February 18, 2007 @09:31AM (#18059150) Homepage

    No. They explicitly said they would not disclose that... which is a shame because that is probably the only interesting bit of information.
    The pertinence of the SMART data (pretty much always pertinent) and how often it popped up (about half the time) before a failure was a very interesting bit of information.

    The question that really needs to be studied is what distinguishes good drives from bad.
    A good drive is one that lasts a long time without developing too many bad blocks. A bad drive is one that fails within a couple years. In both cases you only know it after the fact or because a whole series happens to be poorly designed (like it happens to every manufacturer every now and then). Unless that model is already widely deployed and known to be bad, or already widely deployed and likely no longer sold, there's no way to tell.

    And thus on the third day the FSM created backups and saw it was good.
  • by CRC'99 ( 96526 ) on Sunday February 18, 2007 @10:02AM (#18059248) Homepage

    So can we assume that Google has deployed 830,000 hard disk drives since 2001? How many servers do they have now?


    Do you really think that they don't store every cookie and search pattern of everyone who uses their search engine? Cross-reference all this data, alter their ranks, follow your interests, use those to make money and target you with ads?

    There is a ton of money in this information, given enough stored data and the facility to mine it, filter it, and sort it down to location level for various advertising categories.

    Google has been very smart in the way they do business - they make money off studying your habits and selling the result (in the form of stats and/or ads).
  • by spineboy ( 22918 ) on Sunday February 18, 2007 @12:00PM (#18059924) Journal
    To me it's useful - if I get a SMART warning, then I'm definitely backing up my drive and will replace it before it croaks.

    Sensitivity/specificity always presents a balancing act of testing, and they are usually in a push/pull relationship. If you make a test too sensitive, then you get too many false positives, and wind up over treating something (i.e. the test says it might fail so you replace the drive even though it's not going to - a false alert)

    If you make the test too specific, then usually you wind up decreasing its sensitivity, or ability to detect something. Now you get false negatives, so when the test is positive, you can be sure that it's accurate, but it doesn't always detect the problem.

    What you want to know is the Positive Predictive Value PPV, which is determined by the formula PPV=TP/(TP+FP). TP = true positives, FP = false positives
    Also useful is the Negative Predictive Value NPV, or this formula NPV=TN/(FN+TN) where TN = true negative, FN = false negative.

    What these values tell you is this: if a test is positive (e.g. the drive temperature is >80 C), the PPV is how likely it is that the drive really will fail. If the test is negative (e.g. drive temp 40 C), the NPV is how likely it is that the drive really is OK.
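    The two formulas above are easy to try out. The sketch below uses made-up counts purely for illustration; they are not from the Google paper, and the function name is an assumption of this example.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts.

    PPV = TP / (TP + FP), NPV = TN / (FN + TN), as in the formulas above.
    """
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("need at least one positive and one negative test result")
    ppv = tp / (tp + fp)
    npv = tn / (fn + tn)
    return ppv, npv

# Hypothetical counts: drives flagged by a test vs. drives that actually failed.
ppv, npv = predictive_values(tp=40, fp=10, tn=900, fn=50)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

    In this invented example a warning is right 80% of the time, yet more than half of the failing drives (50 of 90) never raised a warning at all, which is the sensitivity side of the trade-off.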
  • Re:Samsung! (Score:4, Insightful)

    by mollymoo ( 202721 ) on Sunday February 18, 2007 @01:15PM (#18060442) Journal

    In summary: Your statistical analysis on a sample size of one showed a 100% failure rate, so Samsung are crap. You found some other people also had failed Samsung drives, so Samsung are crap.

    Search the net and you will find people ranting about Seagate drive failures, Western Digital drive failures, IBM drive failures, Maxtor drive failures and failures of drives made by companies neither of us have even heard of. You won't find many, if any, reports of recent failures with 8" floppy drives though, so I suggest you use one of those. They must be more reliable, right?

  • by Futurepower(R) ( 558542 ) on Sunday February 18, 2007 @03:34PM (#18061362) Homepage
    The research results are VERY poorly communicated, as research results often are.

    This seems to be the most relevant sentence: "What stands out are the 3 and 4- year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced." (Page 5, Section 3.4, 4th paragraph)

    Often poor communication in research papers is intended to hide the fact that the results are not very useful. The above sentence can be translated to: "If you run hard drives hot, after 3 or 4 years you will have a high failure rate."

    All of our drives have their own vibration-isolated fans. Google, I recommend you do that too, based on your research results.

    --
    Is U.S. government violence a good in the world, or does violence just cause more violence?
  • by Futurepower(R) ( 558542 ) on Sunday February 18, 2007 @04:01PM (#18061512) Homepage
    Here's a quote from the Google paper: "Power-on hours -- Although we do not dispute that power-on hours might have an effect on drive lifetime, it happens that in our deployment the age of the drive is an excellent approximation for that parameter, given that our drives remain powered on for most of their life time." (Page 10, 4th paragraph)

    Translation: The number of hours the drives are powered is the same as the age of the drives, since the drives are always powered.

    When two numbers are close to equal, they are approximations for each other. LOL. Is there a social breakdown at Google? Are the people who don't like to think taking power at Google?

