Data Storage Stats

Ask Slashdot: Do You Test Your New Hard Drives? 348

Posted by timothy
from the just-bite-the-corner-a-little dept.
An anonymous reader writes "Any Slashdot thread about drive failure is loaded with good advice about EOL — but what about the beginning? Do you normally test your new purchases as thoroughly as you test old, suspect drives? Has your testing followed the proverbial 'bathtub' curve of a lot of early failures, but with those that survive the first month surviving for years? And have you had any return problems with new failed drives, because you re-partitioned it, or 'ran Linux,' or used stress-test apps?"
This discussion has been archived. No new comments can be posted.

  • by X0563511 (793323) on Sunday December 23, 2012 @01:30PM (#42375807) Homepage Journal

    If dban can write out every sector and not have smartctl show any pending sectors after the fact (and the average speed of the dban wipe was normal) then you've got good chances the drive will be fine.
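
The pending-sector check described above can be scripted. This is a minimal sketch, assuming the usual `smartctl -A` attribute table in which the raw value is the tenth column; the function names `smart_raw` and `no_pending_sectors` are hypothetical, and some drives report attributes under different names.

```shell
# Print the raw value (10th column) of one SMART attribute, reading
# `smartctl -A` output on stdin. Prints nothing if the attribute is absent.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Succeed only if Current_Pending_Sector is reported and its raw value is 0.
no_pending_sectors() {
    pending=$(smart_raw Current_Pending_Sector)
    [ "$pending" = "0" ]
}
```

After the wipe, something like `smartctl -A /dev/sdX | no_pending_sectors && echo "no pending sectors"` would run the check against a real drive.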

  • by Anonymous Coward on Sunday December 23, 2012 @01:42PM (#42375901)

    I run some ZFS systems at work. With the current version of the filesystem, you can expand the zpools but you can't shrink them, so adding a bad drive causes immediate problems.

    I've found that some drives are completely functional but write at extremely slow rates: maybe 10% of normal. With typical consumer drives, maybe 1/20 is like this. To ensure I don't put a slow drive into a production zpool array of disks, I always make a small test zpool consisting of just the new batch of drives and stress-test them.

    This catches not only obviously bad drives, but also the slow or otherwise odd ones.
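
One way to spot the slow drives in a batch is to compare each drive's measured write rate against the fastest one. The sketch below assumes you have already collected one "device MB/s" pair per line (for example from the rate dd reports at the end of a run); the function name `flag_slow_drives` and the 0.5 default cutoff are hypothetical choices, not something from the comment above.

```shell
# Read "device MB_per_s" lines on stdin and print the devices whose rate is
# below the given fraction (default 0.5) of the fastest drive in the batch.
flag_slow_drives() {
    awk -v frac="${1:-0.5}" '
        { dev[NR] = $1; rate[NR] = $2 + 0; if (rate[NR] > max) max = rate[NR] }
        END { for (i = 1; i <= NR; i++) if (rate[i] < frac * max) print dev[i] }
    '
}
```

A drive writing at roughly 10% of normal speed falls well under the cutoff, while ordinary unit-to-unit variation does not.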

  • by bill_mcgonigle (4333) * on Sunday December 23, 2012 @01:45PM (#42375913) Homepage Journal

    Yes, this. I do it online:

    dd if=/dev/zero of=/dev/sdX bs=8M

    and then check smartctl. If I'm making a really big zpool, I fill them up and let ZFS fail out the turkeys:

    dd if=/dev/zero of=/tank/zeros.dd bs=8M
    zpool scrub tank

    If I'm building a 30-drive storage server for a client I'll often see 1-2 fail out. Better to catch them now than when they're deployed (especially with the crap warranties on spinning rust these days). I need to order in staggered lots anyway, so having 10% overhead helps keep things moving along.

  • SMART + badblocks (Score:5, Interesting)

    by SuperBanana (662181) on Sunday December 23, 2012 @02:23PM (#42376087)

    I run smartctl and capture the registers, then run badblocks, and compare smartctl's output to the pre-bad-blocks check.

    If there are any remapped blocks, the drive goes back, as the factory should have remapped the initial defects already, and that means new failed blocks in the first few hours of operation.
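
The before/after comparison can be automated once both `smartctl -A` captures are saved to files. This is a rough sketch under assumptions: the function name `smart_unchanged` is hypothetical, the three watched attribute names are common but not universal, and the raw value is taken from the tenth column of the attribute table.

```shell
# Compare two saved `smartctl -A` captures. Print every watched attribute
# whose raw value grew, and return non-zero if any did.
smart_unchanged() {
    before=$1 after=$2 status=0
    for attr in Reallocated_Sector_Ct Current_Pending_Sector Offline_Uncorrectable; do
        old=$(awk -v n="$attr" '$2 == n { print $10 }' "$before")
        new=$(awk -v n="$attr" '$2 == n { print $10 }' "$after")
        # A missing attribute is treated as 0.
        if [ "${new:-0}" -gt "${old:-0}" ]; then
            echo "$attr: ${old:-0} -> $new"
            status=1
        fi
    done
    return $status
}
```

In practice you would capture `smartctl -A /dev/sdX > before.txt`, run badblocks, capture `after.txt`, and then call `smart_unchanged before.txt after.txt`.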

  • by PlusFiveTroll (754249) on Sunday December 23, 2012 @03:05PM (#42376361) Homepage

    Two DOA of the same part isn't out of the question. A good amount of the time the same part number is from the same batch, which may suffer from the same manufacturing defects. I see things like that pretty often in batches of disks that fall out of RAIDs.

  • Re:Heh (Score:5, Interesting)

    by hairyfeet (841228) <bassbeast1968@NOsPAM.gmail.com> on Sunday December 23, 2012 @03:13PM (#42376407) Journal

    The problem is the best damned tool ever made for testing drives hasn't been updated in years and now won't work on drives bigger than 500GB. I am of course talking about Spinrite. With Spinrite on lvl 2 you just bypass the firmware and write patterns of zeroes and ones and then read back what it reports; if it's spitting errors right off the bat then you know to send it back. Problem is Gibson hasn't updated the thing since '06, so it can't handle drives bigger than 500GB, which makes it all but useless today.

    So if anybody has found something that works similar to Spinrite but works on large drives I too would like to know. I get drives coming in from all over the place at the shop with ZERO history, so I don't know if they've been barely used or thoroughly abused, and having a tool I can run on them would be a big help.

  • Re:SSDs (Score:5, Interesting)

    by cpghost (719344) on Sunday December 23, 2012 @03:14PM (#42376411) Homepage
    Actually, our only current use for SSDs is ZILs (ZFS intent logs), and we're evaluating whether to put PostgreSQL transaction logs on an SSD, but that's another story. Our main storage farm is still HDD-based.
  • Re:Heh (Score:5, Interesting)

    by greg1104 (461138) <gsmith@gregsmith.com> on Sunday December 23, 2012 @03:54PM (#42376627) Homepage

    Spinrite hasn't been useful for years. There's a good analysis of why at "Does SpinRite do what it claims to do?" [serverfault.com]. Everything the program does can be done more efficiently with a simpler program run from a Linux boot CD. And the fact that it takes so long is a problem--you want to get data off a dying drive as quickly as possible. Here's what I wrote on that question years ago, and the rise of SSDs makes this even more true now:

    SpinRite was a great program in the era it was written, a long time ago. Back then, it would do black magic to recover drives that were seemingly toast, by being more persistent than the drive firmware itself was.

    But here in 2009, it's worthless. Modern drives do complicated sector mapping and testing on their own, and SpinRite is way too old to know how to trigger those correctly on all the drives out there. What you should do instead is learn how to use smartmontools, probably via a Linux boot CD (since the main time you need them is when the drive is already toast).

    My usual routine when a drive starts to go bad is to back its data up using dd, run smartmontools to see what errors it's reporting, trigger a self-test and check the errors again, and then launch into the manufacturer's recovery software to see if the problem can be corrected by it. The idea that SpinRite knows more about the drive than the interface provided by SMART and the manufacturer tools is at least ten years obsolete. Also, getting the information into the SMART logs helps if you need to RMA the drive as defective, something SpinRite doesn't help you with.

    Note that the occasional reports you see that SpinRite "fixes" problems are coincidence. If you access a sector on a modern drive that is bad, the drive will often remap it for you from the spares kept around for that purpose. All SpinRite did was access the bad sector, it didn't actually repair anything. This is why you still get these anecdotal "it worked for me" reports related to it--the same thing would have been much better accomplished with a SMART scan.
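
The dd and smartmontools steps of the routine in this comment might look roughly like the sketch below. It is an illustration under assumptions, not the commenter's exact commands: the helper names `run`, `triage_start` and `triage_results` are hypothetical, and `conv=noerror,sync` is an added choice so that dd keeps going past read errors. The manufacturer's recovery software is vendor-specific and left out.

```shell
# Print each command; also execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# Step 1: image the drive, record its SMART state, start an extended self-test.
# conv=noerror,sync (an assumption) pads unreadable blocks with zeros
# instead of stopping at the first read error.
triage_start() {
    dev=$1 image=$2
    run dd if="$dev" of="$image" bs=1M conv=noerror,sync
    run smartctl -a "$dev"
    run smartctl -t long "$dev"
}

# Step 2: after the self-test has finished, read its log and the errors again.
triage_results() {
    run smartctl -l selftest "$1"
    run smartctl -a "$1"
}
```

Setting `DRY_RUN=1` first prints the commands without touching the drive, which is a cheap way to double-check the device name before running anything for real.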

  • Re:Heh (Score:5, Interesting)

    by SuperTechnoNerd (964528) on Sunday December 23, 2012 @03:56PM (#42376647)
    You have to interpret the data correctly. Looking at seek error rate and raw read errors tells you if the heads are positioning accurately. Run the drive hard (read/write patterns) and watch the temperature. And of course if you start seeing non-zero pending and reallocated sector counts, you know the end is near. Also watch the spin-up time, which increases as a drive gets older. (I rarely shut the RAID server down, so this is less important.) I have smartd email and text me any time things start to get out of a happy place. I do nightly quick tests and weekly extended tests. SMART is useful, if you're smart about it...
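
A schedule like the one described (nightly short tests, weekly extended tests, email on trouble) is typically expressed as one line per drive in smartd.conf. The sketch below only prints such a line; the function name `smartd_line`, the 02:00 and Saturday 03:00 times, and the address are assumptions for illustration, and text-message delivery would depend on whatever mail-to-SMS arrangement is in place.

```shell
# Print a smartd.conf entry for one device:
#   -a  monitor the default set of SMART properties
#   -s  short self-test daily at 02:00, long self-test Saturdays at 03:00
#   -m  address to mail when something looks wrong
smartd_line() {
    printf '%s -a -s (S/../.././02|L/../../6/03) -m %s\n' "$1" "$2"
}
```

Calling `smartd_line /dev/sda admin@example.com` produces a line that could be appended to smartd.conf, after which smartd needs to be restarted or reloaded to pick it up.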
  • Re:Heh (Score:5, Interesting)

    by greg1104 (461138) <gsmith@gregsmith.com> on Sunday December 23, 2012 @04:10PM (#42376727) Homepage

    SMART is a part of the modern drive's firmware. You can't bypass it. Anyone who tells you otherwise--such as the makers of Spinrite--is lying to you in order to sell a product.

    The quality of SMART implementation varies significantly based on the manufacturer. Anecdotally, I have 3 failed Western Digital drives here that flat out lie about the drive's errors. Running the tool needed to generate an RMA does a full SMART scan of the drive, remaps some bad sectors, and then says everything is good. But it's not--each drive is still broken, in a way the firmware seems downright evasive about. Try to use it again, it doesn't take long until another failure. It does seem like the sole purpose of SMART and its associated utilities on WD drives is to keep people from returning a bad drive, by providing a gatekeeper in that process that never says there's a problem.

    Most of my serious installations avoid WD drives like the plague for this reason. I think that Seagate's drives are probably less reliable overall than WD nowadays. Regardless I prefer them, simply because the firmware is more honest about the errors that do happen. Drives fail and I plan for that. What I can't deal with is drives that fail but don't admit it.

    The reason "RAID edition" firmware is available is to provide a drive that isn't supposed to be as evasive about errors. Some WD RAID edition models might not have the problem I'm describing. I soured on them as a brand before those became mainstream.

  • Re:SSDs (Score:5, Interesting)

    by hairyfeet (841228) <bassbeast1968@NOsPAM.gmail.com> on Sunday December 23, 2012 @04:18PM (#42376777) Journal

    Unless you are using SLC, which is getting harder to find and more expensive every day, you are really pushing your luck. The problem is the hot/crazy scale [codinghorror.com] when it comes to these drives, specifically the fact that nobody has figured out how to lick the controller issue. For those that haven't run into it yet (lucky bastards), the controller issue will cause a drive to suddenly fail without ANY warning. SSDs are always bragged on to "fail safe" into a read-only mode, but what actually happens is that when the controller fails the whole drive is completely dead; it won't even show up in BIOS/UEFI.

    So until somebody figures out how to lick the controller problem (and when they do, the money they make will truly be insane), or picks up the idea I have been advocating for years of putting a second, cheaper ARM controller on the board designed to take over as a read-only backup while you get your data out, I'd be seriously leery of trusting any data I cared about to an SSD, not without spinning rust backups at the very least. The controller bug seems to bite every OEM on the ass; I have seen it from Intel to OCZ and it's always the same. Push the button and poof! Data all gone with the drive. And of course since you can't get your data off or even wipe it, you have to hope they don't send it to some third world country for refurb where they help themselves to your data. Because of this I don't think my customers have even used 10% of their warranties for fear of the data falling into the wrong hands. Great for the OEMs, which rarely have to make good on warranties; not so good for the customer.

  • Re:Heh (Score:2, Interesting)

    by Anonymous Coward on Sunday December 23, 2012 @08:44PM (#42378373)

    Agreed. I just recovered a very messed up 120GB drive with GNU ddrescue. It took over 7 days to read, but only lost 300MB of data. Very happy with the results.

  • Re:Heh (Score:4, Interesting)

    by Pentium100 (1240090) on Sunday December 23, 2012 @11:35PM (#42379163)

    MHDD works best for me for testing the drive. Spinrite (and ddrescue) is good for data recovery, but not that good for testing. I had one drive that had a lot of sectors that were good, except that the drive took 10-30 seconds to read them, making the PC extremely slow (Windows would drop to PIO mode and be slow even when reading the good sectors). Chkdsk didn't detect anything, Spinrite didn't detect anything; only MHDD showed lots of slow sectors. (I later made a list and manually marked them as bad. Getting a 2.5" IDE drive is not that easy or fast, so it will have to do until then.)
