Google Releases Paper on Disk Reliability
oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"
Hmm (Score:2, Interesting)
Re:That would be corporate dynamite (Score:4, Interesting)
Re:Did they ever name the brands? (Score:5, Interesting)
breakdown of drives per manufacturer, model, or vintage
due to the proprietary nature of these data.
But, of course.
Temperature conclusion (Score:5, Interesting)
Lower temp == higher failure rates (Score:5, Interesting)
Re:Proprietary makes sense here (Score:0, Interesting)
I forget: It's always "fuck people", and "fuck trying to make this world a better place", and "Where's my goddamn profit I'm entitled to?!", and "Get back to work slaves..."
Yeah it makes sense to lock everything up as proprietary. Nothing to spur progress and prevent waste like having multiple efforts duplicated and hiding the results so nobody is sure what is the best way, and taxing and profiting any which way. I can't wait until they figure out a way to charge us to breathe. Can I get my verichip tracking device embedded in my skull please? Open Source is treason. Sieg Heil, Herr Bush & Blair and Halliburton and Google.
Re:Temperature conclusion (Score:4, Interesting)
While this would require a more laboratory-like environment, a dozen drives of each type and manufacture could have been sampled at known temperatures, and a data curve could have been established to calibrate the temperature sensors.
There are lots of studies out there where drives were intentionally heated, and higher failure rates were indeed reported (this is mentioned in the Google report too). So the correlation is probably still valid, just not well-proven.
Re:Did they ever name the brands? (Score:2, Interesting)
Or maybe the manufacturer just realized that 5 years down the road, a replacement for your then 5-year-old HD will cost them peanuts. According to the graph at http://en.wikipedia.org/wiki/Hard_drives#Capacity, HD capacity seems to be increasing by roughly ten times every five years.
It's like the CD-R manufacturers stamping all the packaging with 100-year guarantees. They don't really have any good way of telling that the discs will actually last that long, but a replacement costs nearly nothing, and thus is paid for by the marketing benefits.
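To put a number on the "peanuts" argument, here is a rough sketch of the arithmetic. It takes the tenfold-in-five-years capacity figure above at face value and assumes (my assumption, not something the graph states) that the cost of a fixed amount of capacity falls at the same pace. The function names are made up for illustration.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Per-year growth factor implied by total_factor growth over `years`."""
    return total_factor ** (1.0 / years)


def relative_replacement_cost(years_elapsed: float,
                              total_factor: float = 10.0,
                              period_years: float = 5.0) -> float:
    """Cost of a same-capacity replacement relative to the original price.

    Assumption: price per byte falls exactly as fast as capacity grows.
    """
    return total_factor ** (-years_elapsed / period_years)


rate = annual_growth_factor(10.0, 5.0)
print(f"implied capacity growth per year: x{rate:.3f}")
for year in (1, 3, 5):
    share = relative_replacement_cost(year)
    print(f"year {year}: replacement costs about {share:.1%} of the original")
```

Under those assumptions a warranty replacement at the five-year mark costs the manufacturer roughly a tenth of what the drive sold for.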
Re:Did they ever name the brands? (Score:3, Interesting)
What? So the part about which variables are correlated with drive failures (which is what the report was about) wasn't interesting to you? Too bad.
The GDRIVE (Score:2, Interesting)
Re:Did they ever name the brands? (Score:2, Interesting)
They do say that "vintage" matters (Score:5, Interesting)
Manufacturers have good years and bad years. The writers don't want to damn a company because it had a couple of bad years during this time period.
Still, it's a bummer that the single most important factor goes unpublished. Even if it could cause a panic, I'm sure there's some useful information in there (e.g. a company to avoid like the plague).
Re:Hmm (Score:4, Interesting)
The article states that, in about half of the failures, there were no SMART warnings at all. Okay, but what was the breakdown in the kinds of failures of these unpredicted ones? If they were all spindle motor and head traversal failures, then you can't blame SMART for that. If it turns out that SMART gave warnings for 95% of all failures that were media-degradation related (like bad sectors, etc... where the drive still talks to your machine properly, and just can't get the data you want), then I'd say SMART is pretty darn useful.
But, alas, I didn't see any breakdown for failure type....
Re:Translation (Score:3, Interesting)
You need backups anyway, that's not the point. But it makes a difference to your maintenance costs whether 1% or 5% of your disk drives die in an average year.
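As a back-of-the-envelope illustration of that difference, the sketch below multiplies a fleet size by an annual failure rate. The fleet size of 100,000 is borrowed from the study size in the summary; the function name is made up, and treating the rate as a flat expected value is a simplification.

```python
def expected_replacements(fleet_size: int, annual_failure_rate: float) -> float:
    """Expected number of drives to replace per year (simple expected value)."""
    return fleet_size * annual_failure_rate


FLEET_SIZE = 100_000  # borrowed from the study size quoted in the summary

low = expected_replacements(FLEET_SIZE, 0.01)
high = expected_replacements(FLEET_SIZE, 0.05)
print(f"1% per year: about {low:.0f} replacements")
print(f"5% per year: about {high:.0f} replacements")
print(f"difference:  about {high - low:.0f} extra drive swaps per year")
```

Whatever one swap costs in parts and labor, the 5% case pays it five times as often.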
Re:Hmm (Score:3, Interesting)
It isn't even that good. Many of the failure flags indicate between 70% and 90% survivability to 8 months. This is much worse than the ~2%/year baseline failure rate, but not as strong a predictor as you might like. It would be nice to see data on this out to 2 or 3 years, so you could calculate the integrated chance of failure over the service lifetime, but by eye it looks like the trends were leveling off by 8 months.
So, if you want to avoid replacing too many good drives, you probably have to move to a multiple-error model, which probably reduces your detection likelihood well below the already low 44% reported.
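For a rough sense of scale, the sketch below converts the ~2%/year baseline into an 8-month survival figure and compares it with the 70% to 90% range quoted above. It assumes a constant failure rate over time, which is my simplification and not something the paper claims; the names are made up.

```python
def survival(annual_failure_rate: float, years: float) -> float:
    """Probability a drive survives `years`, assuming a constant annual rate."""
    return (1.0 - annual_failure_rate) ** years


EIGHT_MONTHS = 8 / 12

baseline_survival = survival(0.02, EIGHT_MONTHS)
baseline_failure = 1.0 - baseline_survival
print(f"baseline survival to 8 months: {baseline_survival:.2%}")

# Survival range for flagged drives, as read off the paper's plots above.
for flagged_survival in (0.90, 0.70):
    flagged_failure = 1.0 - flagged_survival
    ratio = flagged_failure / baseline_failure
    print(f"flagged survival {flagged_survival:.0%}: "
          f"failure risk about {ratio:.0f}x baseline")

# Integrated chance of failure over a multi-year service life (same assumption).
for years in (3, 5):
    print(f"baseline chance of failure within {years} years: "
          f"{1.0 - survival(0.02, years):.1%}")
```

So a flag raises the 8-month risk by roughly an order of magnitude, yet most flagged drives are still running at 8 months, which is why pulling drives on a single flag replaces a lot of good hardware.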
Re:Proprietary reporting (Score:4, Interesting)
The amount of positive press they get from these types of releases easily justifies the effort to polish internal reports up to a publication standard. By releasing these types of papers, they may lead others to change their buying habits, which in turn will change the products sold. Google may believe that these types of papers will shame not individual manufacturers but the industry as a whole, and thus cause better products to be produced.
What he/she/it is looking for (Score:3, Interesting)
It is also interesting to note the magnificent jump in failure rates once the drives get outside the three-year warranty period. No coincidence there.
Temperatures (Score:3, Interesting)
I have been previously led to believe that it's not so much the average temperature of a hard drive that causes failure, but temperature fluctuations. This makes sense, since repeated expansion and contraction of the disk platters is likely to cause warpage before too long. This, I guess, is where glass platters like the ones IBM toyed with would come in useful. In the meantime I guess we still need our HVAC units to keep a constant temperature, just not too low anymore.
This also has implications for data centers that spend a considerable amount of energy pumping heat out of the server room. If we can raise the industry-accepted temperature ceiling from 22C to, say, 30C, then a lot of energy can be saved over time. Perhaps not quite enough to dip below 1% of US-wide power use, but every bit helps.
SpinRite Disk Error Problem Detection (Score:2, Interesting)
The program sounds pretty amazing from their web site.
Are many companies using it for preventative maintenance to avoid data loss on their servers?
Re:Great (Score:3, Interesting)
What the report really shows is that SMART doesn't accurately indicate the life of the drive... if anything Google drives their hardware harder than normal users, so it should be a good testbed for predictive tools.... Google would be directly interested and would probably pay a lot of money to somebody that implemented the changes this engineer described... chasing around 20k+ hard drives is an EXPENSIVE task... I'd bet Google pays a MILLION dollars a year in salary just to have somebody available to run out and replace unscheduled drive failures. That's a big process improvement that they would like to see hard drive manufacturers answer.