Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files? 440
First time accepted submitter jamiedolan writes "I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. Lots of files of all types, from digital photos & scans to HD video files (also web site backups mixed in, which are the cause of such a high file count). In more recent times I've organized files in a reasonable folder system and have an active, automated backup system. The problem is that I know I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that later got consolidated onto a single larger drive), chewing up space. I tried running a free de-dup program, but it ran for a week straight and was still 'processing' when I finally gave up on it. I have a fast system, an i7 at 2.8 GHz with 16 GB of RAM, but currently have 4.9 TB of data in a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old sloppy filing (folder) system. I do need to keep the data; nuking it is not a viable option."
Don't waste your time. (Score:5, Insightful)
If you really want to, sort, order and index it all, but my suggestion would be different.
If you didn't need the files in the last 5 years, you'll probably never need them at all.
Maybe one or two. Make one volume called OldSh1t, index it, and forget about it again.
Really. Unless you have a very good reason to un-dupe everything, don't.
I have my share of old files and dupes. I know what you're talking about :)
Well, the sun is shining. If you need me, I'm outside.
Prioritize by file size (Score:5, Insightful)
Since the objective is to recover disk space, the smallest couple of million files are unlikely to do very much for you at all. It's the big files that are the issue in most situations.
Compile a list of all your files, sorted by size. The ones that are the same size and the same name are probably the same file. If you're paranoid about duplicate file names and sizes (entirely plausible in some situations), then crc32 or byte-wise comparison can be done for reasonable or absolute certainty. Presumably at that point, to maintain integrity of any links to these files, you'll want to replace the files with hard links (not soft links!) so that you can later manually delete any of the "copies" without hurting all the other "copies". (There won't be separate copies, just hard links to one copy.)
If you give up after a week, or even a day, at least you will have made progress on the most important stuff.
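For what it's worth, a minimal sketch of that size-then-checksum pass in PowerShell (since the poster is on Windows); D:\Archive is just a placeholder path, and Get-FileHash needs PowerShell 4.0 or later:

# Sketch only: group files by size, then hash only the groups with more than one member.
$files  = Get-ChildItem -Path 'D:\Archive' -Recurse -File
$groups = $files | Group-Object Length | Where-Object { $_.Count -gt 1 }

foreach ($g in $groups) {
    # Hash only the candidates that share a size, then group by hash.
    $g.Group |
        Get-FileHash -Algorithm SHA1 |
        Group-Object Hash |
        Where-Object { $_.Count -gt 1 } |
        ForEach-Object {
            Write-Output "Duplicate set:"
            $_.Group.Path | ForEach-Object { Write-Output "  $_" }
        }
}

Once a set checks out as identical, you can delete a duplicate and recreate its path as a hard link to the keeper with fsutil hardlink create <dupe-path> <keeper-path>, so every existing path keeps working but the data is stored once.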
Don't worry about it (Score:2, Insightful)
First, copy everything to a NAS with new drives in it, in RAID 5. Store the old drives someplace safe (they may stop working if left off for too long, but it's better to have them around if something does go wrong with the NAS, right?).
Then, copy everything current to your new backup drives on your computer, and automate the backup so that it only keeps two or three versions of files so you don't end up with this problem again. Keep track of things you want to archive and archive them separately.
An ounce of prevention is better than a pound of cure. We all get into backup and duplicate problems eventually. I have found that keeping my core work in Dropbox and making a backup of it occasionally provides enough of a backup for me, but the information I generate in the lab doesn't take up that much space.
Re:CRC (Score:4, Insightful)
s/crc32/sha1 or md5/; you won't be CPU-bound anyway.
Whatever you use it's going to be SLOW on 5TB of data. You can probably eliminate 90% of the work just by:
a) Looking at file sizes, then
b) Looking at the first few bytes of files with the same size.
After THAT you can start with the checksums.
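A rough sketch of that "first few bytes" pre-filter, again in PowerShell and again only a sketch; the 4 KB window is an arbitrary choice, and the same trick works on the tail of the file too:

# Sketch: read only the first 4 KB of a file and hash that, as a cheap pre-filter
# before any full checksum. The 4096-byte window is arbitrary.
function Get-HeadHash {
    param([string]$Path, [int]$Bytes = 4096)
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        $buffer = New-Object byte[] $Bytes
        $read   = $stream.Read($buffer, 0, $Bytes)
        $sha1   = [System.Security.Cryptography.SHA1]::Create()
        [BitConverter]::ToString($sha1.ComputeHash($buffer, 0, $read)) -replace '-', ''
    }
    finally { $stream.Dispose() }
}

Only files that share both a size and a head hash need a full checksum or a byte-for-byte compare.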
Re:CRC (Score:5, Insightful)
Part 2 of your method will quickly bog down if you run into many files that are the same size. Takes (n choose 2) comparisons, for a problem that can be done in n time. If you have 100 files all of one size, you'll have to do 4950 comparisons. Much faster to compute and sort 100 checksums.
Also, you don't have to read the whole file to make use of checksums, CRCs, hashes and the like. Just check a few pieces likely to be different if the files are different, such as the first and last 2000 bytes. Then for those files with matching parts, check the full files.
5TB only why dedupe? (Score:4, Insightful)
You say the data is important enough that you don't want to nuke it. Wouldn't it also be true to say that the data you've taken the trouble to copy more than once is likely to be important? So keep those dupes.
To me, not being able to find stuff (including being aware of stuff in the first place) would be a bigger problem.
Just hash first 4K of each file, avoid 2nd pass (Score:2, Insightful)
Only hash the first 4K of each file and just do them all. The size check only saves you a hash for files with unique sizes, and I doubt there are many of those among 4.2M media files averaging ~1MB. Meanwhile, the second near-full directory scan a size pass requires isn't all that cheap.
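Something like the following is the one-pass version of that idea, sketched in PowerShell; the path and the 4 KB window are placeholders:

# Sketch of the one-pass idea: hash only the first 4 KB of every file, bucket by
# that hash alone, and only look closer at buckets holding more than one path.
$sha1    = [System.Security.Cryptography.SHA1]::Create()
$buckets = @{}
Get-ChildItem -Path 'D:\Archive' -Recurse -File | ForEach-Object {
    $stream = [System.IO.File]::OpenRead($_.FullName)
    try {
        $buffer = New-Object byte[] 4096
        $read   = $stream.Read($buffer, 0, 4096)
        $key    = [BitConverter]::ToString($sha1.ComputeHash($buffer, 0, $read))
    }
    finally { $stream.Dispose() }
    if (-not $buckets.ContainsKey($key)) { $buckets[$key] = @() }
    $buckets[$key] += $_.FullName
}
foreach ($entry in $buckets.GetEnumerator()) {
    if ($entry.Value.Count -gt 1) { $entry.Value }   # one candidate duplicate set per bucket
}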
By the time you have sorted this out... (Score:4, Insightful)
...it will have cost you far more than simply buying another drive or two, if all you are really concerned about is space...
Re:CRC (Score:5, Insightful)
Someone whose technical expertise is in areas other than writing script files. There are technical jobs other than being a sysop, you know.
Re:Wait it out (Score:4, Insightful)
I will go out on a limb, risk my geek card and propose another alternative:
Windows Server 2012 has a deduplication feature which works on top of NTFS (not ReFS). Unlike "real" deduplication at the LVM level, which you get with your EMC, the files are written to the filesystem fully "hydrated", and as time passes, a background task [1] sifts through the blocks, finds ones that are the same, then adds reparse points.
The reason I'm suggesting this is that if one already has a Windows file server, it might be good to slap on 2012 when it is available, configure deduplication on a dedicated storage volume, and let it do the dirty work on the block level for you.
Of course, ZFS is the most elegant solution, but it may not be the best in the application.
[1]: Fire up PowerShell and type in:
Start-DedupJob E: -Type Optimization
if you want to run it in the foreground after setting it up, e.g. after a large copy when you want to dedupe it all at once.
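For anyone trying this, the rough setup sequence on a 2012 box would look something like the lines below; treat the feature and cmdlet names as from memory and check the docs before relying on them:

# Rough outline (Server 2012; verify feature/cmdlet names against the documentation):
Install-WindowsFeature -Name FS-Data-Deduplication    # add the dedup role service
Enable-DedupVolume -Volume E:                         # turn on dedup for the volume
Start-DedupJob E: -Type Optimization                  # kick off an optimization pass now
Get-DedupStatus -Volume E:                            # check the space savings afterwards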
Re:CRC (Score:5, Insightful)
I once had to write an audio file de-duplicator; one of the big problems was that you would ignore the metadata and the out-of-band data when you did the comparisons, but you always had to take this stuff into account when deciding which version of a file to keep -- you didn't want to delete two copies of a file with all the tags filled out and keep the one that was naked.
My de-duper worked like everyone here is saying -- it cracked open wav and aiff (and Sound Designer 2) files, captured their sample count and sample format into a sqlite db, did a couple of big joins and then did some SHA1 hashes of likely suspects. All of this worked great, but once I had the list I had the epiphany that the real problem of these tools is the resolution and how you make sure you're doing exactly what the user wants.
How do you decide which one to keep? You can just do hard links, but...
But let's say you can do hard links, no problem. How do you decide which instance of the file is to be kept, if you've only compared the "real" content of the file and ignored metadata? You could just give the user a big honking list of every set of files that are duplicates -- two here, three here, six here, and then let them go through and elect which one will be kept, but that's a mess and 99% of the time they're going to select a keeper on the basis of which part of the directory tree it's in. So, you need to do a rule system or a preferential ranking of parts of the directory hierarchy that tell the system "keep files you find here." Now, the files will also have metadata, so you also have to preferentially rank the files on the basis of its presence -- you might also rank files higher if your guy did the metadata tagging, because things like audio descriptions are often done with a specialized jargon that can be specific to a particular house.
Also, it'd be very common to delete a file from a directory containing an editor's personal library and replace it with a hard link to a file in the company's main library -- several people would have copies of the same commercial sound, or an editor would be the recordist of a sound that was subsequently sold to a commercial library, or whatever. Is it a good policy to replace his file with a hard link to a different one, particularly if they differ in the metadata? Directories on a volume are often controlled by different people with different policies and proprietary interests in the files -- maybe the company "owns" everything, but it can still create a lot of internal disputes if files in a division's or an individual project's library folder start getting their metadata changed on account of being replaced with a hard link to a "better" file in the central repository. We can agree not to de-dup these, but that's more rules and exceptions that have to be made.
Once you have the list of duplicates, and maybe the rules, do you just go and delete, or do you give the user a big list to review? And if, upon review, he makes one change to one duplicate instance, it'd be nice to have that change intelligently reflected in the others. The rules have to be applied to the dupe list interactively and changes have to be reflected in the same way, otherwise it becomes a miserable experience for the user to de-dupe 1M files over 7 terabytes. The resolution of duplicates is the hard part; the finding of dupes is relatively easy.
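To make the "rule system" idea concrete, here is a toy sketch of keeper ranking in PowerShell; every directory name and weight in it is invented, and real metadata detection is left out:

# Sketch of a keeper-ranking rule: prefer files under favored directories and files
# that have metadata filled in, then propose the top-scoring copy as the keeper.
$preferredRoots = @{ 'D:\MainLibrary' = 100; 'D:\Projects' = 50 }   # higher score = more likely to keep

function Get-KeeperScore {
    param([string]$Path, [bool]$HasMetadata)
    $score = 0
    foreach ($root in $preferredRoots.Keys) {
        if ($Path -like "$root*") { $score += $preferredRoots[$root] }
    }
    if ($HasMetadata) { $score += 25 }   # tagged copies outrank naked ones
    return $score
}

# Given one set of duplicates (path plus whether its tags are present), pick the keeper.
$dupeSet = @(
    @{ Path = 'D:\Projects\show1\boom.wav'; HasMetadata = $false },
    @{ Path = 'D:\MainLibrary\fx\boom.wav'; HasMetadata = $true }
)
$keeper = $dupeSet |
    Sort-Object { Get-KeeperScore $_.Path $_.HasMetadata } -Descending |
    Select-Object -First 1
$keeper.Path   # the copy the tool would propose keeping; the rest go to the review list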
Re:CRC (Score:4, Insightful)
$19.95 for a beta of something you can whip up in about an hour of shell scripting.
Hell, I wrote exactly what people are talking about here in an afternoon in college - I even did both SHA and MD5, because I ended up finding a SHA collision between one of the Quake 3 files and a Linux system file.
Re:CRC (Score:5, Insightful)
$19.95 for a beta of something you can whip up in an hour of shell scripting.
If the poster were you, they wouldn't have had to 'ask slashdot'.