Data Storage Hardware

Recovering a Wrecked RAID

Dr. Eggman writes "Tom's Hardware recently posted an article specifying how the professionals at Kroll Ontrack recover data from a RAID array that has suffered a hard drive failure, allowing for recovery of even RAID 5 arrays suffering two failures. The article is quick to warn this is costly, however, and points out the different types of hard drive failures that occur, only some of which are repairable. Ultimately the article concludes that consistent backups and other good practices are the best solution. Still, it provides an interesting look into the world of data after death."
This discussion has been archived. No new comments can be posted.

  • by QuietLagoon ( 813062 ) on Friday February 23, 2007 @01:26PM (#18124590)
    It takes far too many pages to say what could actually fit in a page or two.
  • by canUbeleiveIT ( 787307 ) on Friday February 23, 2007 @01:27PM (#18124606)
    Never put all of your eggs in one little basket (RAID or otherwise)! For the love of God, if your data is critical, you need a backup *and* an offsite backup. At least one of each. There are no exceptions to this rule.
    • by eln ( 21727 ) on Friday February 23, 2007 @01:31PM (#18124668)
      That's true, but the most common cause of data loss on a RAID system that I've seen is when a disk fails, and people leave it there for days or even weeks without bothering to replace it.

      When a disk fails in a RAID, it needs to be replaced IMMEDIATELY. A RAID system with a failed disk is a disaster waiting to happen. I've been in smaller shops that don't even have spare disks around. When a disk failed, they would order a disk at that point and have it shipped.

      You should always have plenty of spare disks around, and you should replace disks as soon as they fail. A double disk failure is rare, but the longer you put off replacing a failed disk, the more likely it becomes.
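
      To put a rough number on that last point, here is a minimal sketch. It assumes disk failures are independent and happen at a constant annual failure rate (AFR), which is optimistic if failures are correlated. The function name, the 5% AFR, and the 4 surviving disks are illustrative assumptions, not measured values.

```sh
# second_failure_risk SURVIVING_DISKS AFR DAYS_DEGRADED
# Prints the probability that at least one surviving disk fails while the
# array runs degraded. Assumes independent failures at a constant rate.
second_failure_risk() {
    awk -v n="$1" -v afr="$2" -v days="$3" 'BEGIN {
        p_one = 1 - exp(-afr * days / 365)   # one given disk fails in the window
        printf "%.4f\n", 1 - (1 - p_one) ^ n # at least one of n disks fails
    }'
}

# Illustrative comparison: replace the disk after 1 day versus after 30 days.
second_failure_risk 4 0.05 1
second_failure_risk 4 0.05 30
```

      Under these assumptions the risk grows roughly in proportion to the time spent degraded, so a disk swapped after a month carries about thirty times the exposure of one swapped the next day.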
      • When a disk fails in a RAID, it becomes an AID.

        Or in the case of RAID-1, it becomes just an ID.
      • you should replace disks as soon as they fail. A double disk failure is rare, but the longer you put off replacing a failed disk, the more likely it becomes.

        That's what you have (cold or warm) spare drives for. The RAID can start rebuilding on the standby disk immediately if it has a spare.
        Given the recent results showing that disk failure rate is higher than it was thought, that the bathtub curve doesn't apply, and that failures are correlated, this sounds more and more like you really have to have it.

    • Totally. I come home with a few GB of data from a photo shoot, and it goes onto the main drive on my computer. Then I back it up to TWO SEPARATE drives, not a RAID of two drives, two independent drives. Eventually, when the drive on my computer fills up, I transfer the files from there onto a DVD and bring it off site. Not only does this method protect me from drive failure, it protects me from user error. For example, if I am working on a photo, convert it to black and white and accidentally "save" instead
  • Software RAID (Score:4, Insightful)

    by Kludge ( 13653 ) on Friday February 23, 2007 @01:30PM (#18124638)
    People often pooh-pooh software RAID (it is more of a pain to manage). But when it comes to recovery, it's what you want: you know the disk format and you have the tools. Of course, you really shouldn't have to recover; you should keep good backups or another mirror if it's that important.
    • by mnmn ( 145599 )
      Ok I'm getting sick.

      Backups are not a 'solution'. They are a 'backup solution' to the 'main solution'.

      Of course one should keep backups, but I'm sick of it being called a solution to drive crashes.

      I had a drive crash this morning (on a server that is fully backed up daily), and I've had to get another server started and serving DHCP and DNS simply because I needed the thing up and running FAST. The RAID system crashed. If a drive crashed and it was in a RAID system, the server will keep running. Now that's w
      • by Sancho ( 17056 ) *
        You're throwing around the word 'solution' without defining the real problem.

        RAID is to keep the system running (except for that absurd RAID0 crap).
        Backups are to mitigate data loss.

        Now let's look at what the article was about. The title is "Recovering a Wrecked RAID". Why might you need to do this? To keep the system running? Not with what they're talking about. No, they're talking about recovering from a data loss where RAID is involved. Responding to this with, "Well, you should have kept backups.
    • by darrylo ( 97569 )

      Agreed (for home use), and ZFS's raidz is the easiest. ;-)

      Unfortunately, Solaris's IDE controller support sux. :-( If they only supported PCI-based IDE controllers, it would be soooo easy to create and maintain a RAID array using old hardware.

      • by TCM ( 130219 )
        I find NetBSD's RAIDframe [gw.com] to be very reliable and hassle-free. I'm using RAID-5 with 5 disks and get 110MB/s reads and 70MB/s writes. It also never gave me _any_ headache whatsoever. It just works.

        I think software RAIDs are better than hardware RAIDs (for home use) due to their flexibility. You can mix different disk interfaces (IDE, SATA, SCSI, ...) and sizes. If one of my 320G disks were to fail and a new disk was more expensive than the next bigger size, I could just use the bigger disk. It's a stupid ex
  • by sjbe ( 173966 ) on Friday February 23, 2007 @01:30PM (#18124650)
    Could these articles be any more annoying to read?

    They painstakingly

    NEXT PAGE

    pull data

    NEXT PAGE

    off the

    NEXT PAGE

    damaged drive
    • Printer Friendly (Score:4, Insightful)

      by TubeSteak ( 669689 ) on Friday February 23, 2007 @01:40PM (#18124784) Journal
      http://www.tomshardware.com/2007/02/14/raid_recovery/print.html [tomshardware.com]

      I don't know why TH has printer friendly pages that they don't ever link to.
    • IntelliTXT too (Score:4, Insightful)

      by Skadet ( 528657 ) on Friday February 23, 2007 @01:45PM (#18124848) Homepage
      Yeah, between that and IntelliTXT, I pretty much gave up.

      What if your hard drive [slashdot.org] decides to enter the Elysian Fields [slashdot.org] in this very moment? [slashdot.org] Sure, you could simply get a new hard drive [slashdot.org] to substitute for the defective [slashdot.org] one with a quick run to your favorite hardware store. And with last night's backup [slashdot.org] you might even reconstruct [slashdot.org] your installation quickly. But what if you don't have a backup? We have experienced [slashdot.org] the truth to be more like this: many users don't even have a backup, or it simply is too old and thus useless for recovering any useful files at all. In case of real hard drive damage [slashdot.org], only a professional data recovery specialist can help you - say bye-bye [slashdot.org] to your vacation savings [slashdot.org]!
      Anyone remember when Tom's Hardware was good?
      • Re: (Score:3, Insightful)

        by operagost ( 62405 )
        Besides dedicating only about 10% of the page to actual content, the grammar is actually even worse than it used to be. Don't they have any native English-speaking editors?
      • Anyone remember when Tom's Hardware was good?
        No, actually, I don't. I remember when they said 30 FPS is enough for everybody and professed it as the ultimate authoritative truth. I hope by now it is clear to everyone how false that is. Like claims about WMDs. I'm still waiting for a public retraction on both fronts.
        • by dpilot ( 134227 )
          Well once upon a time, the Doom engine was capped at 35fps. They left it so you could get higher fps numbers for benchmarketing, but they only put images onto the screen at a max rate of 35fps.

          I won't say whether that's giving THW too much credit.
      • Re: (Score:3, Informative)

        by PitaBred ( 632671 )
        *.intellitxt.com is blocked in my adblock list. Makes hundreds of sites more readable.
    • by fossa ( 212602 )

      I never understood the "next page" obsession that various websites have. I assume it's a way to fit more advertising in a given article, but why not, instead of splitting articles over multiple pages, simply insert more advertising on a single page? Are publishers afraid multiple ads will not load immediately? Surely loading an entire new page is worse than one more flash box? Do contracts require a given ad to have its own page? I'm curious.

      • I assume it's a way to fit more advertising in a given article, but why not, instead of splitting articles over multiple pages, simply insert more advertising on a single page?

        Let's see...10 ads per page spread out over 10 pages? Or 50 ads per page spread out over 2 pages? Both are very annoying, but if I can't even find the article text in a sea of ads, I'll never visit the site again. Then again, Tom's crossed that threshold for me a long time ago...
    • by Duncan3 ( 10537 )
      Hey, if advertisers are stupid enough to pay Google 10 times to show an ad to the same person, they deserve what they get...

      Which is to have to sell stuff at such a high markup no one buys them. HA HA!

  • And tell them to cause more damage next time or I will tell everyone they secretly admire Steve Ballmer's commitment to developers.
  • OK, this is for the very extreme (and rare) cases where the disk is physically very damaged. Most of the time, you'll find that available tools are enough. See http://en.wikipedia.org/wiki/SpinRite [wikipedia.org], for example. Has worked for me, but 1. Copy the entire disk contents first. 'Low-level' disk-to-disk dup utilities (Seagate...) can work fine here. 2. Be prepared to wait. Of course, if your disk is on its way out, the intensive reading, (and writing, in the case of SpinRite) may accelerate its demise. Kee
    • Gibson the Hack (Score:4, Insightful)

      by spun ( 1352 ) <loverevolutionary&yahoo,com> on Friday February 23, 2007 @01:52PM (#18124922) Journal
      SpinRite is a Steve Gibson product. Steve Gibson is a pompous blowhard with few real skills [wikipedia.org]. There are plenty of other ways to do a low level copy of a disk.
      • by SEMW ( 967629 )
        The criticisms on the Wikipedia article you linked to were all regarding internet security. That says nothing whatsoever about how good or bad Spinrite is. I happen to write terrible poetry; that doesn't say anything about how good any mathematical papers I may produce would be.
        • by spun ( 1352 )
          True. But Spinrite follows the Gibson pattern: do something that everyone else is doing, but because of ego, do not pay attention to what anyone else is doing and reinvent the wheel over and over again. When people point out the stupid mistakes you make because you think everyone else in the world is inferior to you, attack them.

          Spinrite isn't bad, per se. It's just not in any way revolutionary or important. There are many better tools out there for doing low level copies.
      • That may be so, but I gotta say that he did write a damn fine disk repair/recovery utility...
      • Your argument is what we'd call "poisoning the well". Many have successfully recovered data using Spinrite, so however much of a "hack" or "blowhard" Steve Gibson is, he seems to have done well with Spinrite.
    • Re: (Score:2, Interesting)

      by goarilla ( 908067 )
      What's wrong with popping in a live CD like sysresccd http://www.sysresccd.org/Main_Page/ [sysresccd.org]
      and using dd to take an image of the disk, or Ghost (but IIRC Ghost uses dd)?
      I have been able to successfully recover 99% of a crashed, broken, badly partitioned hard drive that way numerous times.
      Of course I do not claim to have the expertise of Ontrack, but seeing as I've done this for quite
      a few friends, and since not everybody can pay what they ask for their service, I can understand
      why they get drives that have
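
      A minimal sketch of the dd approach described above, run against a scratch file rather than a real device so it is safe to try. The `image_disk` name and the file paths are illustrative assumptions; on real hardware the source would be the failing disk's device node, and the image should be written to a different, healthy disk.

```sh
# image_disk SOURCE IMAGE
# Copy SOURCE to IMAGE in 64 KiB blocks. conv=noerror keeps going after a
# read error; conv=sync pads each short or failed block with zeros so later
# data stays at the same offset in the image.
image_disk() {
    dd if="$1" of="$2" bs=64k conv=noerror,sync 2>/dev/null
}

# Demo on a 256 KiB scratch file standing in for the damaged drive.
work=$(mktemp -d)
head -c 262144 /dev/urandom > "$work/disk.bin"
image_disk "$work/disk.bin" "$work/disk.img"
cmp -s "$work/disk.bin" "$work/disk.img" && echo "image matches source"
```

      Note that conv=sync also pads the final block, so an image of a source whose size is not a multiple of the block size comes out slightly larger than the source.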
    • Spinrite was a sometimes useful utility in the few years following its earlier releases; I know I saved a drive or two with it back in the day, when it was the only useful tool around for home recovery work. We moved past that period quite some time ago. For many years now, hard drive electronics have corrected errors like bad sectors at a level well below where it's possible for Spinrite to operate. Gibson and company should have withdrawn Spinrite from the market some time ago. Do not be fooled
  • by greg1104 ( 461138 ) <gsmith@gregsmith.com> on Friday February 23, 2007 @01:43PM (#18124816) Homepage
    I have a concern with the recommendations given in the introduction:

    We assume that all hard drives will be handled with care, so they should be installed in suitable drive bays. If you use multiple drives, we recommend removable drive frame solutions, which help reduce vibration transfer onto the computer chassis and even back to individual hard drives. Make sure that your system has sufficient ventilation, so high speed hard drives won't overheat.

    I've found the removable drive frames available for cheap consumer hardware to be total crap. The metal enclosure keeps heat close to the drive, and the tiny fans used don't move nearly as much air past the drive as when it's inside the case, being cooled by the airflow of the case fans. The drive temperature is therefore higher even under the best conditions. In addition, the smaller fans fill with gunk quickly and as a result wear out faster than larger ones, leading regularly to a drive trapped in an uncooled box.

    I've used enclosures from Promise, Enermax, and several other companies whose products were so bad I tried to forget their names; all had fans that instantly became the least reliable part of the entire system once I installed the drive frame, and I wasn't happy with the drive's temperature from day one.

    I don't think the person making this comment at Tom's ever keeps systems running long enough to realize the long-term issues that come with anything cheaper than server-grade drive enclosures for hard drives. I'd welcome suggestions for a better quality product in this category. It's a hard subject to cover, because by the time you've had several units set up for a year or two to gather useful data on how rugged they are, the product is obsolete; not something any review site I'm aware of is set up to cover.
    • A Sun D1000 loaded with latest-generation 300GB disk drives? Not a bad solution, slow, and not the cheapest.

      Apple X-serve RAID? Cheapest - does it work reliably with Linux or Solaris? Word on the street is that it does, but I have not seen a demo yet.

      We're actually going with recycling our ancient D-1000s and A-1000s with no-name 300 GB SCSI drives. Pretty old school, but reliable.
  • RAID 11? Or, more to the point, how would I implement a mirror, but with 3 drives? Does Linux 'md' do this? How about any controllers?

    After all, we're supposed to replicate data 3 times, right?
    • how would I implement a mirror, but with 3 drives? Does linux 'md' do this? How about any controllers?


      Linux md RAID-1 allows you to replicate to n number of drives, PLUS set m more drives as spares that will be automatically substituted for failed drives without intervention. You can spread the drives among as many controllers as you want.

      Of course you need off site backups too (fire, theft, lightning, human error).
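
      As a sketch of what that looks like with mdadm: the helper below only prints the command line instead of running it, because creating an array overwrites the member devices. The helper name `md_mirror_cmd` and the device names are hypothetical; `--level`, `--raid-devices`, and `--spare-devices` are standard mdadm options.

```sh
# md_mirror_cmd ARRAY SPARES DEVICE...
# Print the mdadm command for a RAID-1 in which all listed devices are
# active mirrors except the last SPARES of them, which become hot spares.
md_mirror_cmd() {
    array=$1
    spares=$2
    shift 2
    active=$(( $# - spares ))
    printf 'mdadm --create %s --level=1 --raid-devices=%d --spare-devices=%d %s\n' \
        "$array" "$active" "$spares" "$*"
}

# Three-way mirror plus one hot spare (device names are placeholders).
md_mirror_cmd /dev/md0 1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
```

      Review the printed command, then run it as root; `cat /proc/mdstat` shows the sync progress afterwards.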
      • A very handy feature. After all, if you're going to have a RAID1 + hot-spare setup, there are some advantages to just making it a 3-way active mirror instead. When one of the primary drives dies, you don't have a recovery window while the hot-spare syncs up with the array.

        The only downside would be that the hot-spare drive gets used and may suffer from wear-and-tear more than a hot-spare drive that spends its time powered down.

        On the flip side... you don't have to sit and wonder if that hot-spare drive
    • Solaris Volume Manager (aka ODS and SDS) does this out of the box. I wouldn't be surprised if other LVMs do too.

      With SVM, create three stripes, create a mirror with one stripe, newfs it, when the newfs is complete, add a stripe to the mirror, (and then add another stripe to the mirror).

      The stuff in the brackets is the only difference between creating RAID 1+0. Oh, and you don't want to use the GUI...

      FWIW, my standard rollout these days is software-triple-mirrored hardware-RAID5 enclosures with an independa
    • by Spirilis ( 3338 )
      I know Linux software RAID1 lets you do this. In addition, don't you get a little extra benefit from this in the form of further improvements in multithreaded reads? (ability to service up to 3 read requests simultaneously, although it doesn't help you at all for writes...)
  • by AmiMoJo ( 196126 ) on Friday February 23, 2007 @02:11PM (#18125166) Homepage Journal
    With recent articles on HDDs not being very good for redundancy (because they often fail at the same time if they are from the same batch, or fail because of things like electrical spikes which affect all drives in an array) it is clear that HDDs are not an ideal backup medium. I use an external 2.5" HDD which is totally disconnected from the PC and everything else when not in use (to avoid power surges etc), but only for critical data as my machine has 1.2TB of HDD storage.

    Optical discs are a joke - 4.3GB is just not enough. Larger formats exist but are relatively expensive. Tape is expensive per MB and slow, plus it isn't random access and not suited to anything but slow full backups. MO is too small and expensive.

    It seems like the best bet is something like a Century Tower - basically a USB enclosure that can take up to 4/8 drives. Keep it totally disconnected when not in use, and use RAID 0 mirroring with drives from different manufacturers.
    • by operagost ( 62405 ) on Friday February 23, 2007 @02:39PM (#18125590) Homepage Journal

      Optical discs are a joke - 4.3GB is just not enough. Larger formats exist but are relatively expensive. Tape is expensive per MB and slow, plus it isn't random access and not suited to anything but slow full backups.
      Your knowledge is out of date. For example, a SuperDLT 640 backs up at 32 MB/s with compression. Slower than a disk, but not "slow". Sequential access: well that's a given. Only suited for full backups? That's news to my company. Even daily incrementals and differentials are usually hundreds of megabytes or a few GB, which negates the small spool-up time of the tape. Besides, most modern tapes now store metadata on an internal chip so that an on-tape index does not need to be searched.

      use RAID 0 mirroring
      RAID 0 is striping. You probably mean RAID 10 or RAID 0+1.
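
      For a sense of scale, a back-of-the-envelope sketch using the 32 MB/s figure above and the 1.2 TB from the parent post. The 300 GB cartridge capacity is an assumption for illustration, as are the helper names; decimal units (1 GB = 1000 MB) are used throughout.

```sh
# tapes_needed DATA_GB TAPE_GB: whole cartridges required, rounded up.
tapes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# backup_hours DATA_GB MB_PER_SEC: hours to stream the data at a steady rate.
backup_hours() {
    awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / rate / 3600 }'
}

tapes_needed 1200 300   # assumed 300 GB cartridges
backup_hours 1200 32    # 32 MB/s as quoted above
```

      At a sustained 32 MB/s a full 1.2 TB backup takes a little over ten hours, which fits in an overnight window; real throughput depends on how compressible the data is and whether the source can keep the drive streaming.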
    • by Sketch ( 2817 )

      It seems like the best bet is something like a Century Tower - basically a USB enclosure that can take up to 4/8 drives. Keep it totally disconnected when not in use, and use RAID 0 mirroring with drives from different manufacturers.
      ...and hope you don't have a fire.
      • by AmiMoJo ( 196126 )
        In terms of risk, I get power surges all the time, and so do most places I have ever checked out. Lights dimming etc. On the other hand, I have never actually seen a house on fire. I saw one burned out once... but, looking at it, I think a fire is probably not something I'm that worried about. My life would be screwed anyway.
    • Hard? Why?

      mt -f /dev/nst0 eod
      tar cvbf 512 /dev/nst0 / --exclude /proc --exclude /mnt --exclude /sys --exclude /media --exclude /dev

      Every now and then you could do an image backup with
      dd if=/dev/hda of=/dev/nst0 bs=64k
      so you're able to restore your drive quickly. Works fine even on a live filesystem if it's the journaling type.

      And if you're on Windows, install cygwin first, or boot Knoppix for an image backup.

      Oh, of course you'll need a tape drive. Yes, you can do the same with optical disks but I don't trus
      • by AmiMoJo ( 196126 )
        I think you managed to totally miss the point there. Quite spectacular really, congrats.

        If I was going to only need 40GB tapes, I wouldn't bother. I'd use a USB HDD. Much easier to work with. It's not like I need to archive my backups for years, I just need one that works when my main HDD breaks.

        I was talking about a system with 1.2TB of data. That would need over 30 tapes to back up, not even counting incrementals. Not DVD rips either, renders and associated data which cannot be easily replaced except by d
      • by kasperd ( 592156 )

        Works fine even on a live filesystem if it's the journaling type.

        You shouldn't rely on that. You are not reading it atomically. Data can change while you are reading the drive. The image you end up with could be in an inconsistent state. If you take the image and write it back to disk, I would expect the file system driver to replay the journal on the first mount and mark the file system clean. But just because the file system driver flips a flag to say the file system is good doesn't mean it has fixed the

  • Oooh, and this just happened to me a few weeks ago. Well, not quite, but close enough.

    I had an LVM container that sat on a RAID-1 volume go bad.

    The LVM tools couldn't reconstruct the container, so I effectively 'lost' my partitions.

    There wasn't any program I could find which would scan the RAID volume for the data partitions,
    so I ended up cobbling one together on my own, out of the sources in the ext2-tools distro.

    And yes, I did get my data back, and no, I'm no longer using LVM containers.
  • Cheap Solution (Score:2, Informative)

    I'm a big fan of the hard drive->freezer method. It has been alleged that putting a broken hard drive into a freezer can sometimes make the data readable again for a short period of time.
    • by Jjeff1 ( 636051 )
      It's true, but only for specific problems. We used to call it stiction [wikipedia.org] at the shop I worked at. I don't think we really knew what we were doing, but if the drive didn't spin up, putting it in the freezer could fix it. It usually fixed it enough to spin up and get your data back, but that was good enough.
    • by Intron ( 870560 )
      My wife's laptop drive (30 GB Travelstar) went bad and I did this. After trying everything else, I put it in the freezer for a couple of hours. It brought it back to life long enough to get most of the data off, then died for good.
  • by hurfy ( 735314 ) on Friday February 23, 2007 @03:04PM (#18126022)
    Besides having a backup not connected to the system, I found simply having a spare disk to steal the circuit board off of to be a life saver :)

    I miss the old Bigfoot drives we had. Everyone said they had problems with them, but it was always (in our case) the board that died, NOT the disk. I saved a couple of those by swapping in a board for a 1 hour recovery.

    If you buy several HDs for RAID or whatever, buy one more and stick it on a shelf for a rainy day. Along with a few utilities you can do 3/4 of what they do for $100 instead of $1000+

  • Lunch (Score:4, Interesting)

    by Seraphim_72 ( 622457 ) on Friday February 23, 2007 @03:28PM (#18126372)

    I attended a small conference where the Kroll VP of Data Recovery was speaking. He came in, his assistant set up his PowerPoint stuff, made sure the projector was right etc. He then gave a very interesting talk about what Kroll could pull off of a drive, despite what had been done to it. By way of example he showed a slide of a burnt and bent hard drive - that came out of the sky when the shuttle broke up. They recovered 99% of the data on that drive. He also mentioned that they do the data recovery for all of the spook organizations in D.C.

    When we broke for lunch I got to sit at his table and we got to ask him all sorts of questions about their processes. He mentioned they have things they use that they have never patented because it would be too much of a leg up for both the competition and those that seek to destroy data. We tried to get him to tell us what we would have to do to a drive to make it unreadable. Mostly his answers to our "Surely this would make the data unreadable" queries were "You would think that would work wouldn't you?" Someone referenced his assistant who was sitting next to him and the VP said:

    "Him? No, no, no. (laughs) He is not my assistant, in fact he doesn't work for me at all. He is a lawyer for the company and is here to make sure I don't say anything I am not supposed to." The assistant then gave us one of those 'I could eat you alive' lawyer smiles.

    I walked out secure in the knowledge that short of melting the platters down the data can *always* be recovered.

    Sera

    • by rthille ( 8526 )

      I've got those thermite packs against the drives in my server for a reason damn it!

      Actually, I suppose if I was really paranoid I could use the welding torch in the garage to melt the drives down, but I don't think I'd get as much for them on eBay...
    • I walked out secure in the knowledge that short of melting the platters down the data can *always* be recovered.

      Encrypt. I guarantee that even if the NSA can break AES, they won't do it for anything short of top secret cases that will never see the light of day. Breaking random drives encrypted with AES or any other modern cipher would disclose their ability to break that cipher and no one would use it anymore, removing their advantage.
  • As long as you know how the RAID config was set up (stripe size), most disk recovery programs will do the job just fine. GetDataBack NTFS is a functional and simple tool to use as long as you know how the disks were set up. Including RAID5... I've rebuilt 3 RAID5's and a shitload of 0's, 1's, and 01's. You should see the look on some of these people's faces after you're done (with all 18+ hrs of it...). The problem I usually find is that if you recovered the data then the customer is usually under the impression that
  • by cyanics ( 168644 ) on Friday February 23, 2007 @04:40PM (#18127324) Homepage Journal
    Last week, I did a data recovery for a client that had multiple disk head crashes from a power outage, or a kick or something. The drives were resulting in a click-seek, which for the most part is unrecoverable.

    Popped in a Helix disk, and checked what the MFT was doing. Lo and behold, no MFT, no boot sector, and a huge list of bad sectors. Basically, the crash had resulted in a bad sector in the bad sector table, and all over the first portion of the disk.

    These were 200GB disks, but eventually I was able to get a sector repair program to read through and do a non-destructive repair. Data was safe, but was now corrupt. Next step was to repair the data, and I was finally able to just use chkdsk to repair.

    Eventually, it was back to real data, and was able to push the data over to a new replacement hard drive.

    Told the client to invest in RAID 1, but seriously doubt they would be willing to spend that $100 for the RAID. Instead, they prefer to pay $1000 for a repair.

    BACKUPS. Make lots of BACKUPS. RAID your stuff, and get those backups offsite. Do them regularly. Seriously, it would save your ass if something happens. For example, I have a LAN HD that is parked out in a shed in my backyard. Total cost $200, and has already saved my ass 2x.
  • by swordgeek ( 112599 ) on Friday February 23, 2007 @07:03PM (#18129238) Journal
    As much as this stuff is cool, it's going to be insanely expensive to restore data from these guys.

    Data integrity and uptime are served by RAID5. If it's not good enough, then it should be backed with mirroring (RAID 5+1) or some form of dual-parity RAID (RAID-DP from NetApp, etc.).

    But data gets lost or corrupted, even without disk failures. Backups are the place where data recovery is done. DO YOUR BACKUPS!
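
    For anyone wondering why a RAID 5 array shrugs off one dead disk but needs heroics after two, a toy sketch of the parity arithmetic: the parity block is the XOR of the data blocks, so any single missing block can be recomputed from the rest. Single byte values stand in for whole blocks here, and the helper name and values are made up for illustration.

```sh
# xor_all BYTE...: XOR together a list of integer byte values.
xor_all() {
    acc=0
    for b in "$@"; do
        acc=$(( acc ^ b ))
    done
    echo "$acc"
}

# One stripe across three data disks plus one parity disk.
d0=72
d1=105
d2=200
parity=$(xor_all "$d0" "$d1" "$d2")

# Disk 1 dies: XOR the survivors with the parity to get its block back.
rebuilt=$(xor_all "$d0" "$d2" "$parity")
echo "lost $d1, rebuilt $rebuilt"
```

    Lose two blocks from the same stripe and there is one equation with two unknowns, which is why a second failure pushes you to a recovery lab or to your backups.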
  • ZFS makes some types of recovery simple:
    http://docs.sun.com/app/docs/doc/819-5461/6n7ht6qt0?a=view [sun.com]

    For example:

    Once you have determined that a device can be replaced, use the zpool replace command to replace the device. If you are replacing the damaged device with another different device, use the following command:

    # zpool replace tank c1t0d0 c2t0d0
    # zpool status tank
    pool: tank
    state: DEGRADED
    reason: One or more devices is being resilvered.
    action: Wait for the resilvering process to c

  • And people who treat it as such are ignorant. The RAID drives are still on the same server, thus vulnerable to electrical surges, fire, malicious action, etc. Far better to back up to tapes or external drives that are isolated from the machine when not backing up and keep the media in a fireproof safe or deposit box somewhere off-site.

    -b.
