Data Storage

Ask Slashdot: Simple Way To Back Up 24TB of Data Onto USB HDDs?

Posted by samzenpus
from the save-often dept.
An anonymous reader writes "Hi there! I'm looking for a simple solution to back up a big data set consisting of files between 3MB and 20GB, 24TB in total, onto multiple hard drives (USB, FireWire, whatever). I am aware of many backup tools that split the backup onto multiple DVDs with the infamous 'insert disc N and press continue', but I haven't come across one that can do it with external hard drives ('insert next USB device...'). OS not relevant, but Linux (console) or Mac OS (GUI) preferred. Did I miss something, or is there no such thing already done and am I doomed to code it myself?"
This discussion has been archived. No new comments can be posted.

  • by gagol (583737) on Friday August 10, 2012 @05:30AM (#40943529)
    If you can achieve a sustained write speed of 50 megabytes per second, you are in for 140 hours of data transfer. I hope it is not a daily backup!
  • by bernywork (57298) <bstapleton.gmail@com> on Friday August 10, 2012 @05:32AM (#40943543) Journal

    http://www.bacula.org/en/ [bacula.org]

    There's even a howto here:

    http://wiki.bacula.org/doku.php?id=removable_disk [bacula.org]

  • by Anonymous Coward on Friday August 10, 2012 @05:34AM (#40943549)

    I'm guessing you don't have enough space to split a backup on the original storage medium and then mirror the splits onto each drive?

    Given the size requirements, it seems that might be prohibitive, but it would make things easier for you:

    How to Create a Multi Part Tar File with Linux [simplehelp.net]
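    The approach described above can be sketched with plain tar piped through split; a minimal, self-contained illustration (file names and the chunk size are made up, and shrunk far below real drive sizes):

```shell
#!/bin/sh
set -e

# Sample data standing in for the real 24TB set; paths are illustrative.
mkdir -p data restore
head -c 1000000 /dev/urandom > data/file1.bin

# Stream the archive through split: one fixed-size chunk per destination
# drive (a real run would use a multi-terabyte chunk size).
tar -cf - data | split -b 300000 - backup.tar.part.

# Restore by concatenating the chunks back into a single tar stream.
cat backup.tar.part.* | tar -xf - -C restore

cmp data/file1.bin restore/data/file1.bin && echo "restore OK"
```

    Each chunk only has to be copied to (and later read back from) its drive in order; no single drive ever needs to hold the whole archive.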

  • RAID (Score:5, Informative)

    by Anonymous Coward on Friday August 10, 2012 @05:34AM (#40943553)

    For that much data you want a RAID, since drives tend to fail if left sitting on the shelf, and they also tend to fail (for different reasons) while spinning.

    Basically: buy a RAID enclosure, insert drives so it looks like one giant drive, then copy files.

    For 24TB you can use eight 4TB drives for a 6+2 RAID-6 setup. Then if any two of the drives fail you can still recover the data.
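    The enclosure route above can also be done in software with Linux's mdadm; a rough sketch of the 6+2 RAID-6 layout (device names and mount points are hypothetical, and mdadm --create is destructive, so verify targets before running anything):

```shell
# Hypothetical device names: adjust /dev/sd[b-i] to your eight 4TB drives.
mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]

# One big filesystem across the array, then mount it and copy the data.
mkfs.ext4 /dev/md0
mkdir -p /mnt/backup
mount /dev/md0 /mnt/backup
cp -a /path/to/data /mnt/backup/
```

    This is a device-configuration sketch, not something to paste verbatim; the array will also need a mdadm.conf entry and a resync period before it is fully redundant.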

  • git-annex (Score:4, Informative)

    by Anonymous Coward on Friday August 10, 2012 @05:40AM (#40943585)

    You might want to look into git-annex:
    http://git-annex.branchable.com/ [branchable.com]

    I've not tried it, but it sounds like an ideal solution for your request, especially if your data is already compressed.
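    For the record, a minimal git-annex workflow along these lines might look as follows (repository names and mount points are invented, and like the poster I have not verified this end to end):

```shell
# Hypothetical layout: dataset at /path/to/dataset, drive at /mnt/usb1.
cd /path/to/dataset
git init
git annex init "main"

# git-annex tracks large file content outside git's object store.
git annex add .
git commit -m "add dataset"

# Clone onto the drive and link the two repositories as remotes.
git clone /path/to/dataset /mnt/usb1/dataset
(cd /mnt/usb1/dataset && git annex init "usb1")
git remote add usb1 /mnt/usb1/dataset

# Copy file content to the drive; repeat with usb2, usb3, ... per drive.
git annex copy --to usb1 .
```

    git-annex then remembers which drive holds which files, which is exactly the bookkeeping a multi-drive backup needs.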

  • by Anonymous Coward on Friday August 10, 2012 @05:50AM (#40943643)

    Yes, Bacula is the only real solution out there that isn't going to cost you an arm and a leg, and that allows you to switch easily between any backup medium. As long as your MySQL catalog is intact, restoration is a cinch...

    Did I mention it also supports backup archiving, if you want duplicate copies on tapes being shipped off-site...

  • by cyocum (793488) on Friday August 10, 2012 @05:56AM (#40943669) Homepage
    Have a look at tar and its "multi-volume" [gnu.org] option.
  • by Anonymous Coward on Friday August 10, 2012 @05:56AM (#40943671)

    Here's a Linuxquestions thread [linuxquestions.org] outlining multi-disk backup strategies.

    The gist of the discussion is to use DAR [linux.free.fr].

  • Bash.... (Score:5, Informative)

    by djsmiley (752149) <djsmiley2k@gmail.com> on Friday August 10, 2012 @06:08AM (#40943733) Homepage Journal

    First bash script to grab the size of the "current" storage;

    compress the files up until that size;

    Move compressed file onto storage;

    request new storage, start again.

    ----------

    Or, if you've got all the storage already connected: for x in $(seq 0 $n); do cp "$archive$x" "/mount/$x/"; done :D
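    The loop described above might look something like this sketch, with local directories standing in for mounted drives and a made-up per-drive size budget:

```shell
#!/bin/sh
set -e

BUDGET=200000   # bytes per "drive"; a real run would use each drive's free space

# Sample files standing in for the data set.
mkdir -p src
for i in 1 2 3 4 5; do head -c 60000 /dev/urandom > "src/f$i.bin"; done

vol=1 used=0
mkdir -p "drive$vol"
for f in src/*; do
    sz=$(wc -c < "$f")
    if [ $((used + sz)) -gt "$BUDGET" ]; then
        # In real use, this is where you would prompt for the next drive.
        vol=$((vol + 1)) used=0
        mkdir -p "drive$vol"
    fi
    cp "$f" "drive$vol/"
    used=$((used + sz))
done
echo "used $vol drives"
```

    Compression would slot in before the size check (compress each file, then measure the result).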

  • by leuk_he (194174) on Friday August 10, 2012 @06:09AM (#40943739) Homepage Journal

    Multi-volume tar [gnu.org]. Just mount a new USB disk whenever it is full.

    However, to get a reasonable retrieval rate (going through 24TB of data will take some days over USB2), you had better split the dataset into multiple smaller sets. That also has the advantage that if one disk crashes (and consumer-grade USB disks will crash!), you don't lose your entire dataset.

    For that reason (disk failure), do not use a Linux disk-spanning feature: the file system is lost when any one of the disks it spans is lost, unless you use something that can handle lost disks (RAID/RAID-Z).

    And last but not least: test your backup. I have seen cheap USB interfaces fail to write the data to disk without any good error message. All looks OK until you retrieve the data and some files are corrupted.

  • Use DAR or KDAR (Score:3, Informative)

    by pegasustonans (589396) on Friday August 10, 2012 @06:18AM (#40943771)

    If you don't want to invest in new hardware, you could use DAR [ubuntugeek.com] or KDAR [sourceforge.net] (KDE front-end for DAR).

    With KDAR, what you want is the slicing settings [sourceforge.net].

    There's an option to pause between slices, which gives you time to mount a new disk.
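    On the dar command line, the slicing and pausing described above map to the -s (slice size) and -p (pause before each new slice) options; a sketch with made-up paths:

```shell
# Create a backup of /data in 2GiB slices, pausing before each new slice
# so the next drive can be mounted at /mnt/usb.
dar -c /mnt/usb/backup -R /data -s 2G -p

# Restore later from the same slice series.
dar -x /mnt/usb/backup -R /restore
```

    This is a configuration sketch rather than a tested command line; check the dar man page for the exact slice-size syntax your version accepts.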

  • Re:solution (Score:5, Informative)

    by aglider (2435074) on Friday August 10, 2012 @06:23AM (#40943797) Homepage

    3.samba

    Uh? Why?
    cp -a is all you need once you put the HDD inside the target machine.
    And if you put it into another machine on the same network, then rsync is the answer.
    Forget about the buggy and slow SAMBA.

  • Re:No. (Score:2, Informative)

    by ledow (319597) on Friday August 10, 2012 @06:34AM (#40943851) Homepage

    USB 2.0 provides 480Mbps of (theoretical) bandwidth. So unless you go Gigabit all over your network (not unreasonable), you won't beat it with a NAS. Even then, Gigabit is only about twice as fast as USB 2.0 working flat-out (the difference being that with multiple USB buses you can have multiple drives working at once). And USB 3.0 would beat it again. And 10GbE between the client and a server is still an expensive network to deploy.

    Granted, eSATA would probably be faster but there's nothing wrong with USB for such tasks if you *don't* want to provide Gigabit connections everywhere and (presumably) greater-than-gigabit backbones.

  • Re:Tape? (Score:5, Informative)

    by Anonymous Coward on Friday August 10, 2012 @06:57AM (#40943955)

    No kidding. For $2400, you get 24 1TB HDs and a bookkeeping nightmare if you ever actually resort to the "backup." For $3k, you get a network-ready tape autoloader with 50-100TB capacity and easy access through any number of highly refined backup and recovery systems.

    Now, if the USB requirement is because that's the only way to access the files you want to steal from an employer or government agency, then the time required to transfer across the USB will almost guarantee you get caught. Even over the weekend. You should come up with a different method for extracting the data.

  • PAR (Score:4, Informative)

    by fa2k (881632) <pmbjornstad@@@gmail...com> on Friday August 10, 2012 @06:59AM (#40943967)

    I have seen "PAR" mentioned a couple of times here on Slashdot; I haven't used it, but it seems great for this: http://en.wikipedia.org/wiki/Parchive [wikipedia.org] . You need enough redundancy to allow one USB drive to fail. And I would rather get a SATA bay and use "internal" drives than deal with external USB drives. Get "green" drives; they are slow but cheap.

  • by arth1 (260657) on Friday August 10, 2012 @07:17AM (#40944049) Homepage Journal

    Yes, Bacula is the only real solution out there that isn't going to cost you an arm and a leg, and that allows you to switch easily between any backup medium.

    Except for good old tar, which is present on all systems.

    Most people are probably not aware that tar can create split archives. Add the following options to tar:
    -M -L <max-size-in-KiB-per-volume> -F myscript.sh ... where myscript.sh writes the name to use for the next tar file in the series (to the file descriptor tar passes in $TAR_FD). It can be as simple as a loop that checks which volume files already exist and returns the next hooked-up volume where one doesn't.
    Or it could even unmount the current volume and automount the next one for you, or display a dialogue telling you to replace the drive.

    One advantage is that you can easily extract from just one of the tar files; you don't need all of them or the first-and-last like with most backup systems. Each tar file is a valid one, and at most you need two tar files to extract any file, and most of them just one.

    One caveat: GNU tar will not combine multi-volume mode with its built-in compression (-z/-j), so compress the individual files beforehand or compress each finished volume separately.
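    Put together, a self-contained sketch of the multi-volume mechanism (with a tiny 100KiB volume size for illustration; a real run would use each drive's capacity in KiB):

```shell
#!/bin/sh
set -e

mkdir -p src out
head -c 300000 /dev/urandom > src/big.bin

# New-volume script: GNU tar calls this when a volume fills up and reads
# the next volume's file name from file descriptor $TAR_FD.
cat > next-vol.sh <<'EOF'
#!/bin/sh
echo "vol-$TAR_VOLUME.tar" >&$TAR_FD
EOF
chmod +x next-vol.sh

# -M multi-volume, -L 100 => 100 KiB per volume, -F names each next volume.
tar -c -M -L 100 -F ./next-vol.sh -f vol-1.tar src

# Extraction uses the same -M/-F pair to walk the volume series.
tar -x -M -F ./next-vol.sh -f vol-1.tar -C out

cmp src/big.bin out/src/big.bin && echo "multi-volume OK"
```

    In the real scenario, next-vol.sh is where you would unmount the full drive and prompt for the next one instead of just emitting a file name. This relies on GNU tar; BSD tar (macOS default) lacks -M/-F.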

  • Re:solution (Score:1, Informative)

    by myowntrueself (607117) on Friday August 10, 2012 @07:35AM (#40944119)

    3.samba

    Uh? Why?
    cp -a is all you need once you put the HDD inside the target machine.
    And if you put it into another machine on the same network, then rsync is the answer.
    Forget about the buggy and slow SAMBA.

    cp copies file by file.

    A more efficient way is something like:

    tar -cf - . | (cd /somewhere/ && tar -xf -)

    tar treats the directory contents as a single data stream. It's much faster for large numbers of files.

  • by Anonymous Coward on Friday August 10, 2012 @08:09AM (#40944319)

    It's "nudge-nudge", not "notch-notch".

    Also, you left out "wink-wink".

    Yes, I know, I should get a life..

  • by v1 (525388) on Friday August 10, 2012 @08:31AM (#40944455) Homepage Journal

    I have a setup here where the server's video media is about 8TB. It backs up via rsync to the backup server in another room, which contains a large number of internal and external drives, none over 2TB in capacity. The main drive has its data separated into subfolders, and the rsync jobs back up specific folders to specific drives.

    A few times I've had to do some rearranging of data on the main and backup drives when a volume filled up. So it helps to plan ahead to save time down the road. But it works well for me here.

    The only thing with rsync you need to worry about is users moving large trees or renaming root folders in large trees. This tends to cause rsync to want to delete a few TB of data and then turn around and copy it all over again on the backup drive. It doesn't follow files and folders by inode, it just goes by exact location and name.

    I help mitigate this by hiding the root folders from the users. The share points are a couple levels deeper so they can't cause TOO big of a problem if someone decides to "tidy up". If they REALLY need something at a lower level moved or renamed, I do it myself, on both the source and the backup drives at the same time.

    Another alternative is to get something like a Drobo where you can have a fairly inexpensive large pool of backup storage space that can match your primary storage. This prevents the problem of smaller backup volumes filling up and requiring data shuffling, but does nothing for the issue of users mucking with the lower levels of the tree.

  • Re:RAID (Score:4, Informative)

    by Sarten-X (1102295) on Friday August 10, 2012 @08:58AM (#40944647) Homepage

    As mentioned already, RAID is not a backup solution. While it will likely work fine for a while, the risk [datamation.com] of a catastrophic failure rises as drive capacity increases. From the linked article:

    With a twelve-terabyte array the chances of complete data loss during a resilver operation begin to approach one hundred percent - meaning that RAID 5 has no functionality whatsoever in that case. There is always a chance of survival, but it is very low.

    Granted, this is talking about RAID 5, so let's naively assume that doubling the parity disks for RAID 6 will halve the risk... but then since we're trying to duplicate 24 terabytes instead of twelve, we can also assume the risk doubles again, and we're back to being practically guaranteed a failure.

    Bottom line is that 24 terabytes is still a huge amount of data. There is no reliable solution I can think of for backing it all up that will be cheap. At that point, you're looking at file-level redundancy managed by a backup manager like Backup Exec (or whatever you prefer) with the data split across a dozen drives. As also mentioned already, the problem becomes much easier if you're able to reduce that volume of data somewhat.

  • Re:solution (Score:5, Informative)

    by fnj (64210) on Friday August 10, 2012 @08:58AM (#40944651)

    No. It's slower. Informative, my ass.

  • by milgr (726027) on Friday August 10, 2012 @09:17AM (#40944841)

    The LHC generates a petabyte per second [slashdot.org].

  • Re:DaisyChain (Score:5, Informative)

    by Painted (1343347) on Friday August 10, 2012 @10:35AM (#40945779) Homepage
    DON'T DO THIS.

    We did this exact thing using WD Green drives for our 18TB backup problem. We got two of them, planning to use their built-in rsync for on-site/off-site copies of the data. Unfortunately, the units never broke 1MB/s transfer, and no amount of work with Drobo yielded reliably faster performance. Both of our units are now sitting unused ($2500 each!), and we put the drives into an 8-bay USB3 RAID-50 enclosure. The new unit runs about 150x faster and cost $400 (prices are for enclosures only; drives were additional).

    Most disappointing was Drobo's support: they just seemed to shrug a lot, and were hyper-aggressive about closing trouble tickets.
  • Re:RAID (Score:4, Informative)

    by louic (1841824) on Friday August 10, 2012 @10:54AM (#40946019)

    As mentioned already, RAID is not a backup solution.

    Nevertheless, there is nothing wrong with using disks that happen to be in a RAID configuration as backup disks. In fact, it is probably a pretty good idea for large files and large amounts of data.
