Data Storage Stats Hardware IT

Data Center Study Reveals Top 5 SMART Stats That Correlate To Drive Failures

Lucas123 writes: Backblaze, which has taken to publishing data on hard drive failure rates in its data center, has just released data from a new study of nearly 40,000 spindles revealing what it said are the top 5 SMART (Self-Monitoring, Analysis and Reporting Technology) values that correlate most closely with impending drive failures. The study also revealed that many SMART values that one would innately consider related to drive failures actually don't relate to it at all. Gleb Budman, CEO of Backblaze, said the problem is that the industry has created vendor-specific values, so that a stat related to one drive and manufacturer may not relate to another. "SMART 1 might seem correlated to drive failure rates, but actually it's more of an indication that different drive vendors are using it themselves for different things," Budman said. "Seagate wants to track something, but only they know what that is. Western Digital uses SMART for something else; neither will tell you what it is."
  • by Anonymous Coward on Wednesday November 12, 2014 @04:39PM (#48373007)

    https://www.backblaze.com/blog/hard-drive-smart-stats/

    Goes into a lot more detail too.

  • by russotto ( 537200 ) on Wednesday November 12, 2014 @04:49PM (#48373129) Journal

    Uncorrected reads do not indicate a drive will fail. They indicate the drive has _already_ failed.

    The number one predictor is probably power-on time, they go into that in an earlier post.

    • Re:Uncorrected reads (Score:5, Interesting)

      by ls671 ( 1122017 ) on Wednesday November 12, 2014 @06:54PM (#48374205) Homepage

      I have had drives fail. I took them offline and wrote 0 and 1 to them with dd until Reallocated_Sector_Ct stopped rising and Current_Pending_Sector went to zero, then ran e2fsck -c -c on them 2 or 3 times, then I put them back online!!!

      Most people would say this is crazy but in my opinion, the surface of a drive often has bad spots while the rest is perfectly OK. Some of those drives are still online without reporting any new errors after more than 5 years, some almost 10 years. Those are server drives with very low Start_Stop_Count, Power_Cycle_Count and Power-Off_Retract_Count. All lower than 250 after 10 years. Those drives are spinning all the time.

      Newer drives will relocate bad sectors to free reserved space they keep for that purpose. As long as you don't run out of free spare space, IMHO, it is worth a try.
       

      • Newer drives will relocate bad sectors to free reserved space they keep for that purpose.

        IBM Mainframe drives did that back in the 1960s.

        From what I've seen of hard drives, they're a lot like silicon wafers. Rarely perfect, but as long as they're "good enough", the controller maps around the bad spots that they came with as well as a certain number of ones that form over the operating life.

      • Newer drives will relocate bad sectors to free reserved space they keep for that purpose. As long as you don't run out of free spare space, IMHO, it is worth a try.

        HDDs don't rely on user addressable free space to remap LBAs; they now have their own non-user accessible spare space that gets allocated for the remapping purpose automatically. Effectively, it happens on-the-fly at the hardware layer. It's why you rarely, if ever, will have bad clusters at the file system level; it's oblivious to what's really

        • by ls671 ( 1122017 )

          I know that. I run e2fsck -c -c (write+read test) to generate random pattern writes on the drives then read the data to make sure it is the same. If I put the drive back on line, e2fsck -c -c will always report 0 bad blocks and no timeouts will have occurred. I also check for timeouts in the logs.

          Failed reads on a drive that is part of a RAID array will usually cause the drive to be kicked out of the array after a timeout, slowing down the machine. The strategy I suggested allows the drive hardware to indeed rel

          • Error recovery control [wikipedia.org]. Also known as TLER, ERC, or CCTL.

            You shouldn't have to script any of this if you're using drives that support error recovery. Western Digital desktop drives do not have TLER. As such, the slightest hesitation can kick a drive out of a RAID array. Sucks balls, but don't use generic desktop drives (or any drive for that matter) that don't support this in hardware.

            • Just keep your RAID arrays small in number of drives if you use desktop drives, and/or spring the extra money to buy WD Reds, which do have TLER IIRC.
            • by ls671 ( 1122017 )

              I suggest you do a little more research. If a sector was successfully written to and then 2 months later the drive hardware can't read from it, there is no way for the drive hardware to automagically correct the error and recover the data. The drive hardware then just increments the Current_Pending_Sector count. You could start by reading your own link but then again, you seem to have problems reading my own posts so your mileage may vary ;-)

              • Damn, you're full of yourself!

                If you're running RAID 1, 5, 6, 10, etc, it's a moot point as data will be rebuilt from remaining parity information. Secondly, if a drive drops out of an array from an extended error recovery timeout, chances are you can't trust the reliability of the drive anyways. That's regardless if it trips SMART or not.

                My point to you is this: why do you go through convoluted motions to micromanage your hardware when this is a solved problem? Solutions exist! Run the cost/risk assessment a

                • by ls671 ( 1122017 )

                  Run the cost/risk assessment and apply accordingly.

                  Exactly, use ZFS that does just that if you want to afford the extra memory. Use a fancy hardware RAID controller that does that if you wish. I just use cheap drives and Linux MD. Do your research before commenting on a setup you don't seem to know about. You don't have to brag about your hardware here and try to convince others to do as you do.

                  Didn't I mention in my first post: "Most people would say this is crazy but in my opinion,..."?

                  I do not see what was your point in replying to my posts anyway other th

                  • I don't work for myself, I work for others. That is to say, when I'm having to administer over 100+TB of data on 50+ servers, I won't be rolling my own software-based solution. I'm not saying it can't be done, but there are just too many variables and permutations to deal with; more so when an update rolls around and potentially throws a wrinkle in the mix. And to be perfectly honest, going with Dell or HP provides next-day warranty replacement of drives. That, and the level of R&D put into a hardware base

                    • by ls671 ( 1122017 )

                      Who says I don't ALSO work for others and I don't know about more expensive solutions? I just don't brag about it mister Shaman ;-)

                      I know enough to know about people covering their arses, it is pretty common you know...

                      Yet, I never lost any data on the cheaper setup I run on the side.

                      Take care man!

                    • by ls671 ( 1122017 )

                      more so when an update rolls around and potentially throws a wrinkle in the mix.

                      You are right about this. Once, a Linux kernel update (or was it mdtools?) was screwed. You would add a new partition to a Linux MD RAID array and it wouldn't sync the partition before putting it online ;-) This is where a good backup strategy comes into play.

                      Anyways, toying around with linux MD and cheap solutions makes you more creative in the long run IMHO.

                      Just keep your mind open please. There are plenty of approaches and trade-offs available and just as you said:

                      Run the cost/risk assessment and apply accordingly.

                      Furthermore, it depends on SLAs and su

      • by AmiMoJo ( 196126 ) *

        The problem is you have no idea how many free reallocated sectors are available. It isn't even consistent between drives, as some will have been used at the factory before the count was reset to zero.

        Your strategy is reasonable if the drives are part of a redundant array or just used for backed up data, but for most people once the reallocated sector count starts to rise, it's best to just return the drive as a SMART failure and get it replaced.

      • So you overwrite your drive with 0 (/dev/zero) and 1's (/dev/one???) but still you were able to e2fsck it afterwards ?
      • by gweihir ( 88907 )

        Still a valid approach today for surface defects. And if you had run regular full surface scans, you would probably not have had to do anything yourself.

    • by gweihir ( 88907 )

      Wrong. Uncorrected reads indicate surface defects. The rest of the surface may be entirely fine. All disks have surface defects and not all are obvious on manufacturer testing.

      They also indicate faulty drive care. Usually, data goes bad over a longer time. If you run your long SMART selftests every 7-14 days, you are very unlikely to be hit by this and will get reallocated sectors with no data-loss instead. Not doing these tests is like never pumping your bicycle tires and complaining when they eventually g

  • by Immerman ( 2627577 ) on Wednesday November 12, 2014 @04:53PM (#48373169)

    For those who are only passingly curious and don't want to read the article:
      SMART 5 - Reallocated_Sector_Count
      SMART 187 - Reported_Uncorrectable_Errors
      SMART 188 - Command_Timeout
      SMART 197 - Current_Pending_Sector_Count
      SMART 198 - Offline_Uncorrectable

    • by SpaceManFlip ( 2720507 ) on Wednesday November 12, 2014 @05:05PM (#48373269)
      I read the article to find those "5 Top SMART Stats" they refer to, but I'm replying here because it's the relevant place.

      Those 5 SMART stats match up exactly with what I habitually look at on the job monitoring lots of RAID arrays' drives. Those are the stats that tell you if the drive is going bad most often in my experience.

      • by jedidiah ( 1196 )

        Yes. This article isn't exactly news as it pretty much confirms what the global peanut gallery has already said about this stuff.

      • by afidel ( 530433 )

        Those 5 SMART stats match up exactly with what I habitually look at on the job monitoring lots of RAID arrays' drives

        Really? At my job I get notified that the array is ejecting a drive based on whatever parameters the OEM uses, it's already started the rebuild to spare space on the remaining drives, and a ticket has been dispatched to have a technician bring a replacement drive. If it's a predictive fail it generally doesn't notify until the rebuild has completed as it can generally use the "failing" drive

        • by 0123456 ( 636235 )

          We just look at the flashing lights every once in a while. Though we've got drives the RAID controller has been telling us are failing for the best part of a year now, and haven't got around to replacing them.

    • by AmiMoJo ( 196126 ) *

      I tend to think a drive has failed once it has any uncorrectable errors... I lost some data, it couldn't be read back. Drive gets returned to the manufacturer under warranty. Don't wait around for it to fail further.

      I agree with the reallocated sector count though. The moment that starts to rise I usually make sure the data is fully backed up and then do a full surface scan. The full scan almost always causes the drive to find more failed sectors and die, so it gets sent back under warranty too.

    • by rduke15 ( 721841 )

      And to list these for your own drive:

      $ sudo smartctl -A /dev/sda | egrep '^\s*(ID|5|1[89][78])'
      ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
      5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
      187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
      188 Command_Timeout 0x0032 100 253 000 Old_age Always - 0
      197 Current_Pending_Sector 0x0012 100 1
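For anyone who would rather script this than eyeball the egrep output, here is a minimal Python sketch that pulls the same five attributes out of the text `smartctl -A` prints. It assumes the ten-column table layout shown above (ID in the first column, raw value in the tenth). The function name `parse_smart_table` and the sample text are made up for illustration, and rows whose raw value is not a plain integer are skipped rather than guessed at.

```python
# Sketch only: assumes the ten-column `smartctl -A` table layout shown above.
WATCHED_IDS = {5, 187, 188, 197, 198}  # the five attributes from the Backblaze post


def parse_smart_table(text):
    """Return {attribute_id: (name, raw_value)} for the watched attributes."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED_IDS:
            continue
        raw = fields[9]
        # Skip raw values that are not plain integers rather than guess.
        if raw.isdigit():
            found[attr_id] = (fields[1], int(raw))
    return found


# Hypothetical sample, shaped like the output quoted above (values invented).
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
  9 Power_On_Hours 0x0032 090 090 000 Old_age Always - 9123
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 8
"""

if __name__ == "__main__":
    for attr_id, (name, raw) in sorted(parse_smart_table(SAMPLE).items()):
        print(attr_id, name, raw)
```

On a real machine the text would come from running `smartctl -A /dev/sda` as root (for example via `subprocess.run(..., capture_output=True, text=True).stdout`) instead of the sample string.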

    • by omnichad ( 1198475 ) on Wednesday November 12, 2014 @05:23PM (#48373453) Homepage

      And I can confirm. Reallocated Sector Count rarely goes above zero when the drive is fine. It's possible to have a few sectors go bad and get reallocated, but it's usually part of a bigger problem when it happens (this number is reset to zero at the factory, after all initially bad sectors have been remapped). If the Current Pending Sector Count is non-zero, it's likely over.

      I always clone a drive immediately with ddrescue when it gets to this point, while the drive is still working.

      • If it comes to you having to clone the drive, it's too late. That's going to bite you in the ass sooner or later.

        • At the first sign of trouble? How much earlier should I do it? I'm not saying in place of a backup. Just as a quicker way to get a new drive up and running.

    • Re: (Score:3, Informative)

      by koinu ( 472851 )
      Reallocated_Sector_Count
      sectors that the drive successfully replaced
      Reported_Uncorrectable_Errors
      errors that could not be recovered by ECC
      Command_Timeout
      controller hanging and had to be reset
      Current_Pending_Sector_Count
      sectors to be replaced by the next write access
      Offline_Uncorrectable
      sectors that the drive tried to repair, but failed (try offline test, maybe it is not dead yet)
      • Reallocated_Sector_Count
        sectors that the drive successfully replaced
        Reported_Uncorrectable_Errors
        errors that could not be recovered by ECC
        Command_Timeout
        controller hanging and had to be reset
        Current_Pending_Sector_Count
        sectors to be replaced by the next write access
        Offline_Uncorrectable
        sectors that the drive tried to repair, but failed (try offline test, maybe it is not dead yet)

        Did some idiot mod you DOWN?

        This is information that bears frequent repetition.

    • In other words: nothing new and people have been tracking these values for decades anyway.

    • They needed a study to arrive at that conclusion?

      Reallocated, uncorrectable and pending sectors are all obvious indicators of closing drive failure.

      Command Timeouts, depending on definition, could be timeouts after failing a read, so nothing unusual there.

    • by dargaud ( 518470 )
      Is there a tool that will parse a smartctl output and tell you 'good' or 'no good' ?
    • Huh, what about 196 Reallocated_Event_Count? Nothing we didn't know already, but is there any data out for SSDs? That's what I would like to know.
      • I don't know. I believe though that, unlike hard drives, SSDs are designed on the presumption that cells will gradually fail as part of normal operations, and hence any such statistics would mean something very different than they would for a hard drive.

        • Exactly, but what I would like to know is what are the critical SMART values to watch for in SSDs? Any results on that yet? Should we expect them, or will they vary even more by manufacturer/model, making a "top 5" list impossible?
    • by gweihir ( 88907 )

      Well, these are exactly the ones every knowledgeable person was watching anyways. 188 can also be controller or cable problems though.

  • Ever find it odd that most PC manufacturers (at least the variety I've seen over the years) disable S.M.A.R.T. in BIOS by default? Never understood the reasoning behind that...

    • by fnj ( 64210 )

      I could never imagine why it is even POSSIBLE to disable it. If you don't want to read it, just freakin don't read it.

      • I could never imagine why it is even POSSIBLE to disable it. If you don't want to read it, just freakin don't read it.

        I think there's some routine testing going on that adds overhead unless you disable it.

    • by Rashkae ( 59673 )

      If the PC has less than optimal cooling, it's possible, even likely, the drive temperature will exceed operating specs at some point. Even if there is no ill effect or any long term problem, the BIOS will forever more report "Imminent Drive Failure" on every boot if BIOS SMART is enabled.

    • I've seen that as well. But I have had drives that report a SMART error at boot for years and still never failed (nothing important on that drive, that's why I didn't care). Maybe they would just rather the end user surprisingly loses all their data one day, rather than be troubled by a message at boot-up when a problem is suspected.

      I would like to see SMART tools built into Windows and other OS's (maybe there are some I don't know about). Especially since some of my computers are up for 6 months or more a
      • I would like to see SMART tools built into Windows and other OS's (maybe there are some I don't know about). Especially since some of my computers are up for 6 months or more at a time, a drive could be fine 4 or 5 months ago when it was last booted, but I won't get a SMART message until next reboot, maybe a month or two from now, after it's too late.

        Linux smartmontools package has smartd, the "SMART Disk Monitoring Daemon", which will monitor SMART-capable drives and will log problems and send email alerts. Can be handy. Don't know about Windows.

      • I generally use HD Tune (www.hdtune.com) which is free unless you want to buy the Pro version with a bunch of features that are irrelevant if all you want is SMART reporting. If I was going to spend actual money on a checker though, I would tend toward the LSoft Hard Disk Monitor (www.lsoft.net).

    • Less warranty replacement.

    • The meaning of that BIOS option may vary by system.

      I have used utilities to view the SMART info on drives where this BIOS option is disabled, can't recall any systems where it flat-out didn't work. I won't say that this information couldn't be blocked in some cases, but I believe that this option is for whether the BIOS checks SMART status during POST. It has made the difference between a system merrily proceeding to boot with a SMART failure versus reporting that the drive's SMART indicates failure and

      • Correct. The SMART status in BIOS is for whether or not the HDD SMART status gets reported at POST. For example on Dell systems, it will warn the user with an option to press the space bar to continue booting into the OS (assuming the drive is still functional). With it turned off in BIOS, you can still poll SMART status with any number of HDD utilities available to whatever OS you're running.

  • As someone who is suspicious of a couple of hard drives, this data will help me to determine just how concerned I should be. I don't know what Backblaze gets out of making this information public (except publicity), but it is refreshing to see a company release information such as this rather than guard it as a trade secret or sell it.
    • by fnj ( 64210 )

      The list of parameters that are closely correlated with failure is pretty bloody obvious.

      • Perhaps they are obvious to a System Administrator but to someone who is not an admin, everything in SMART probably looks like an error. In addition to that, the article describes common errors that sound indicative of a drive failure but are actually relatively benign. So there is definitely value in this information.
      • And yet they aren't, even by Backblaze's admission. SMART values they expected to be an indication of drive wear showed no correlation with failure.

        • Disclaimer: I work at Backblaze.

          > SMART values they expected to be an indication of drive wear showed no correlation with failure

          Exactly. Also, some people care about more than "approximately correlates" and want to see the actual data on exactly how correlated it is.
  • I've used Crystal Disk Info and while it reports SMART info, I can't make much out of the info.

    Many values for Samsung spinning rust just have values of Current and Worst of 100 and either a raw value of 0 or some insanely huge number.

    • A few of them aren't accounted for very well (and some of Samsung's stats are not cumulative stats). Crystal Disk Info makes it idiot-proof. If the square is blue, the drive is fine, yellow and the drive is probably failing soon, and red is a definite failure.

      Raw value of zero is good. If Current Pending Sector Count or Reallocated Sector Count go above zero, you're likely dealing with a failing drive.

      Most of the numbers are not important.

      • I work at a school and see plenty of failing laptop drives - mostly from kids not sleeping their laptops while walking around.

        We use (currently) PartedMagic Linux distribution on a boot USB. The "Disk Health" tool happily reports on failing drives and gives reasons.

        Added bonus is that Linux is better than Windows at allowing data to be copied from a failing drive (and doesn't care about the NTFS file permissions).

        • On Linux, I just use smartmontools. Gives the same grid of data (mostly) as Crystal Disk Info. But when copying a failing drive, always use ddrescue. It will allow you to unplug the drive (to do some mysterious temporary fix like putting it back in the freezer) and plug it back in and restart from where you left off. Unless you only need a small amount of data (I prefer to just clone the entire system to a new drive to boot from).

  • I never take a look at SMART values or do disk benchmarks. They just make me more stressful and paranoid. If it should occur, I'll let the drive die a mighty death and restore the latest backup to a new disk.
  • The biggest sign that correlates to drive failure is: it's a brick and all your data is gone.

    Let's be real here. You almost never get advance warning from SMART. Maybe one in twenty. Almost without fail you'll go from a drive running properly to a drive that won't rotate the spindle or the heads smash against the casing or you've suddenly got so many bad sectors that it's effectively unusable. Failure prediction is almost (but not quite) valueless compared to the reality of how drives fail.
    • Let's be real here. You almost never get advance warning from SMART. Maybe one in twenty. Almost without fail you'll go from a drive running properly to a drive that won't rotate the spindle or the heads smash against the casing or you've suddenly got so many bad sectors that it's effectively unusable. Failure prediction is almost (but not quite) valueless compared to the reality of how drives fail.

      Yeah, I did mention smartd in an earlier post, and I said it "can be handy" but I suppose I must agree with you based on my own life as it's been lived until now. We never put a server into service without at least software RAID, usually with just two disks with some exceptions. A lot of our equipment is tiny Supermicro 1Us that can only hold two. But after many years we have yet to have two go at once (knock on wood) so the warning of a RAID out of sync has saved us.

    • If you go by Google's definition of failing (the raw value of any of Reallocated_Sector_Ct, Current_Pending_Sector, or Offline_Uncorrectable goes non-zero) rather than the SMART definition of failing (any scaled value goes below the "failure threshold" value defined in the drive's firmware), about 40% of drive failures can be predicted with an acceptably low false-positive rate. You're correct, though, that the "SMART health assessment" is useless as a predictor of failure.

      They did a study on this [google.com] a few ye
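The difference between the two definitions of "failing" described above can be sketched in a few lines of Python. This is an illustration, not code from the Google study: the `SmartAttr` record and both function names are hypothetical, and the threshold rule (scaled value at or below the firmware threshold) is an assumption based on how smartctl documents attribute failure.

```python
from dataclasses import dataclass


@dataclass
class SmartAttr:
    attr_id: int
    value: int   # scaled (normalized) value reported by the drive
    thresh: int  # failure threshold defined in the drive's firmware
    raw: int     # raw counter


# Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable
RAW_WATCHED = {5, 197, 198}


def failing_by_threshold(attrs):
    """SMART-style check: some scaled value is at or below its threshold."""
    return any(a.value <= a.thresh for a in attrs)


def failing_by_raw(attrs):
    """Raw-count check: any watched attribute has a non-zero raw value."""
    return any(a.attr_id in RAW_WATCHED and a.raw > 0 for a in attrs)


# Hypothetical drive: 12 reallocated sectors, scaled value still at 100.
drive = [SmartAttr(5, 100, 36, 12), SmartAttr(197, 100, 0, 0)]

if __name__ == "__main__":
    print("threshold says failing:", failing_by_threshold(drive))
    print("raw count says failing:", failing_by_raw(drive))
```

The sample drive passes the threshold check while the raw-count check flags it, which is the gap the comment is pointing at.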

    • I'd disagree. As an MSP we see occasional SMART errors and they're logged and tickets created.
      So far we've cloned / backed up / moved everything of note off all 27 of them, but the three we left in and just spinning have all died within a month or so.

      Sure, it's not scientifically representative, but I'll not take that chance with clients' data...

      • I'd disagree. As an MSP we see occasional SMART errors and they're logged and tickets created. So far we've cloned / backed up / moved everything of note off all 27 of them, but the three we left in and just spinning have all died within a month or so.

        Sure, it's not scientifically representative, but I'll not take that chance with clients' data...

        Yeah, I won't dispute your experience because it happened. On the other hand, the only SMART warnings I've seen in our fleet of... four-digits worth of spindles... have ended up false-positives. As in, I contact DELL / IBM / HP / Lenovo and report the issue, they instruct me to flash some controller firmwares, reboot, and go away. If those drives ever fail, it's years later, well beyond any correlation with the SMART events.

        • As an MSP, false-positives are not always a negative. There, I said it... and most MSPs will agree begrudgingly when off the record.

          That said, our support prices alter when the device is no longer under warranty, so the device usually gets moved to a location covered under a different support structure like only 8x5 or have a longer response time to compensate.

  • Take all the drives that have signs of failure, put them in a testing environment where you can read and write them all day but don't care about any of the data on them and see how long it takes for them to really fail. That will give you an indication of how reliable the SMART stats are at predicting real disk failure.
    • by brianwski ( 2401184 ) on Wednesday November 12, 2014 @09:56PM (#48375169) Homepage
      Disclaimer: I work at Backblaze. Essentially this is what we did. We don't care at all if one drive dies, so we left it in an environment where we can read and write them all day (the storage pods with live customer data) and when they failed we calmly replaced them with zero customer data loss and produced this blog post. :-)
    • Google did this [google.com] about seven years ago. Of the stats, a drive with a non-zero scan error count has a 70% chance of surviving eight months, one with a non-zero reallocated sector count has an 85% chance of survival, and one with a non-zero pending sector count has a 75% chance of survival. For comparison, a drive with no error indications has a better than 99% chance of surviving eight months.

      Overall, 44% of failures can be predicted with a low false-positive rate, while 64% can be predicted with an unaccept
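To put the survival figures quoted above side by side, here is a small Python sketch that turns them into failure risk relative to a drive with no error indications. The numbers are the ones from the comment. The function name is made up, and because the baseline is "better than 99%", the ratios should be read as lower bounds.

```python
# Eight-month survival figures as quoted in the comment above.
SURVIVAL = {
    "scan errors": 0.70,
    "reallocated sectors": 0.85,
    "pending sectors": 0.75,
}
BASELINE_SURVIVAL = 0.99  # quoted as "better than 99%", so a conservative baseline


def relative_failure_risk(survival, baseline=BASELINE_SURVIVAL):
    """Failure probability relative to a drive with no error indications."""
    return (1 - survival) / (1 - baseline)


if __name__ == "__main__":
    for indicator, survival in SURVIVAL.items():
        ratio = relative_failure_risk(survival)
        print(f"{indicator}: at least {ratio:.0f}x the baseline failure risk")
```

So even the mildest indicator here, a non-zero reallocated sector count, corresponds to roughly fifteen times the baseline chance of failing within eight months.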
