Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?
First time accepted submitter jamiedolan writes "I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. Lots of files of all types, from digital photos & scans to HD video files (also website backups mixed in, which are the cause of such a high number of files). In more recent times I've organized files in a reasonable folder system and have an active, automated backup system. The problem is that I know I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that later got consolidated onto a single larger drive), chewing up space. I tried running a free de-dupe program, but it ran for a week straight and was still 'processing' when I finally gave up on it. I have a fast system, an i7 at 2.8GHz with 16GB of RAM, but currently have 4.9TB of data with a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old, sloppy filing (folder) system. I do need to keep the data; nuking it is not a viable option."
Re:CRC (Score:3, Interesting)
4. play with inner joins.
Much like there are 50 ways to do anything in Perl, there are quite a few ways to do this in SQL.
select md5hash, min(filename_and_backup_tape_number_and_stuff_like_that) as example_copy, count(*) as number_of_copies
from pile_of_junk_table
group by md5hash
having count(*) > 1
There's another strategy where you mush two tables up against each other... one is basically the DISTINCT of the other.
Triggers are widely complained about, but you can implement a trigger system (or pseudo-trigger, where you make a wrapper function in your app) where basically a table of "files" is stored with a column called "count of identical md5hash", and then your SQL looks like select * from pile where identicalcount > 1.
There are ways to play with views, too.
Do you need to run it interactively, or batch it, or just run it basically once, or...? If you're allowed to barf on bad data at input time, you can even enforce the md5 hash as a UNIQUE INDEX or UNIQUE KEY in the table definition.
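For what it's worth, here's a minimal sketch of that whole pipeline in Python using nothing but the standard library (sqlite3, hashlib, os). The database filename, the D:\old_backups starting path, and the md5_of helper are placeholders of mine; the table and column names just match the query above. If you'd rather use the UNIQUE trick, declare md5hash UNIQUE in the CREATE TABLE and let the inserts fail on duplicates.

import hashlib
import os
import sqlite3

def md5_of(path, chunk=1024 * 1024):
    # Hash in chunks so multi-GB video files don't blow up RAM.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

db = sqlite3.connect('pile_of_junk.db')
db.execute('CREATE TABLE IF NOT EXISTS pile_of_junk_table '
           '(path TEXT PRIMARY KEY, size INTEGER, md5hash TEXT)')

for root, dirs, files in os.walk(r'D:\old_backups'):    # starting directory is just an example
    for name in files:
        full = os.path.join(root, name)
        try:
            db.execute('INSERT OR REPLACE INTO pile_of_junk_table VALUES (?, ?, ?)',
                       (full, os.path.getsize(full), md5_of(full)))
        except OSError:
            pass    # unreadable file, skip it
db.commit()

# The same group-by / having query as above, run from the script:
for md5hash, copies in db.execute(
        'SELECT md5hash, COUNT(*) FROM pile_of_junk_table '
        'GROUP BY md5hash HAVING COUNT(*) > 1'):
    print(md5hash, copies)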
You'll learn a lot about how to think about high performance computing. Are you trying to minimize latency or minimize storage or minimize index size or maximize reliability/uptime or minimize processor time or minimize NAS bandwidth or minimize (initial OR maintenance) programming time or ....
The funniest thing is if you've never tried restoring data from backups (hey, it happens), and/or never had a tape failure (hey, it happens), you'll THINK you want to eliminate dupes, but trust me, those dupes will save your bacon someday, and tape is cheap compared to the cost of a programmer and the cost of lost data.... 5 TB is not much technically but is obviously worth a lot from a business standpoint...
Also, from personal experience, you're going to find people gaming the system where DOOM3.EXE and NOTEPAD.EXE happen to have the same md5hash and length, and NOTEPAD.EXE was found on a not-totally but pretty much noob's desk. Use some judgement and don't come down too hard on the newest of new learners.
Re:Good free command line tool (Score:4, Interesting)
I recently had this problem and solved it with finddupe (http://www.sentex.net/~mwandel/finddupe/). It's a free command line tool. It can create hardlinks, you can tell it which is a master directory to keep and which directories to delete, and it can create a batch file to actually do the deletion if you don't trust it or just want to see what it will do. Highly recommended. In any case, 5 TB is going to take forever, but with finddupe you can be sure your time is not wasted, unlike one of the free tools that analyzed my drive for 12 hours and then told me it would only fix ten duplicates.
I tried this vs. Clone Spy, Fast Duplicate File Finder, Easy Duplicate File Finder, and the GPL Duplicate Files Finder (crashy). (Side note: get some creativity, guys.) There's no UI, but I don't care. It doesn't keep any state between runs, so run it a few times on subdirectories first to make sure you know what it's doing, then let it rip.
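If you'd rather dry-run the hardlink idea yourself before trusting any tool with your data, the core of it is tiny. This is a sketch of mine, not finddupe: it assumes you've already produced (master, duplicate) path pairs from some hashing pass, and that everything lives on the same volume, since hardlinks can't cross volumes.

import os

def hardlink_dupes(pairs, dry_run=True):
    # pairs: iterable of (master_path, duplicate_path), both on the same NTFS volume
    for master, dupe in pairs:
        if dry_run:
            print('would replace', dupe, 'with a hardlink to', master)
            continue
        os.remove(dupe)          # drop the extra copy of the data...
        os.link(master, dupe)    # ...and point the old name at the master's blocks

Run it with dry_run=True first and read the output, same idea as finddupe's batch file.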
Re:CRC (Score:4, Interesting)
I confess, if I had a modern i5 or i7 processor and appropriate software, I'd be tempted to calculate some sort of AES-based HMAC, since I would have hardware assist to do that.
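Strictly speaking, HMAC is defined over hash functions, so there's no AES-based HMAC as such; the closest standard construction is AES-CMAC, which is what the AES-NI instructions would actually accelerate. A rough sketch using the third-party pyca/cryptography package (the fixed all-zero key is a deliberate throwaway, since here we only want a stable fingerprint, not authentication):

from cryptography.hazmat.primitives.cmac import CMAC
from cryptography.hazmat.primitives.ciphers.algorithms import AES

KEY = b'\x00' * 16   # throwaway key: we want a repeatable fingerprint, not security

def aes_cmac_of(path, chunk=1024 * 1024):
    c = CMAC(AES(KEY))
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            c.update(block)
    return c.finalize().hex()

Whether that actually beats plain MD5 or SHA-1 on a given box is worth measuring before committing; on 4.9TB the disks, not the hash, are usually the bottleneck anyway.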
If You're Like Me (Score:3, Interesting)
The problem started with a complete lack of discipline. I had numerous systems over the years and never really thought I needed to bother with any tracking or control system to manage my home data. I kept way too many minor revisions of the same file, often forking them across different systems. As time passed and I rebuilt systems, I could no longer remember where all the critical stuff was, so I'd create tar or zip archives over huge swaths of the file system just in case. I eventually decided to clean up, like you are now, when I had over 11 million files. I am down to less than half a million now. While I know there are still effective duplicates, at least the size is what I consider manageable. For the stuff from my past, I think this is all I can hope for; however, I've now learned the importance of organization, documentation and version control, so I don't have this problem again in the future.
Before even starting to de-duplicate, I recommend organizing your files in a consistent folder structure. Download MediaWiki and start a wiki documenting what you're doing with your systems. The more notes you make, the easier it will be to reconstruct work you've done as time passes. Do this for your other day-to-day work as well. Get git and start using it for all your code and scripts. Let git manage the history and set it up to automatically duplicate changes on at least one other backup system. Use rsync to do likewise for your new directory structure. Force yourself to stop making any change you consider worth keeping outside of these areas. If you take these steps, you'll likely not have this problem again, at least not at the same scale. You'll also find it a heck of a lot easier to decommission or rebuild home systems, and you won't have to worry about "saving" data if one of them craps out.
Re:CRC (Score:5, Interesting)
This was theorized by one of the RSA guys (Rivest, if I'm not mistaken). I helped support a system that identified files by CRC32, as a lot of tools did back then. As soon as we got to about 65k files (2^16), we had two files with the same CRC32, which is right at the birthday bound you'd expect for a 32-bit hash (collisions become likely around sqrt(2^32) = 2^16 files).
Let me say, CRC32 is a very good algorithm. So good, I'll tell you how good. It is only 4 bytes long, which means that in theory you can change some 4 bytes of a file and force a CRC32 collision; with an output space that small, collisions are a matter of when, not if.
I naively tried to reverse engineer a file from a known CRC32. Optimized and recursive, on a 333 MHz computer it took 10 minutes to generate the first collision, then another one every 10 minutes or so. Every 4 bytes (last 4, last 5 with the original last byte, last 6 with the original last 2 bytes, etc.) there was a collision.
Compare file sizes first, not CRC32. The 2^16 estimate is not only mathematically proven, it holds up in the big boy world too. I tried to move the community towards another hash.
CRC32 *and* filesize are a great combination. File size is not included in the 2^16 estimate. I have yet to find two files in the real world, in the same domain (essentially type of file), with the same size and CRC32.
Be smart, use the right tool for the job. First compare file size (ignoring things like MP3 ID3 tags or other headers). Then do two hashes of the contents - CRC32 and either MD5 or SHA1 (again ignoring well-known headers if possible). Then out of the results, you can do a byte-for-byte comparison, or let a human decide.
This is solely to dissuade CRC32-based identification. After all, CRC32 was designed for error detection, not identification. For a 4-byte file, my experience says the CCITT-standard CRC32 will work for identification. For 5-byte files, you can have two bytes swapped and possibly get the same result. The longer the file, the less likely its CRC32 is to be unique.
Be smart, use size and two or more hashes to identify files. And even then, verify the contents. But don't compute hashes on every file - the operating system tells you file size as you traverse the directories, so start there.
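To make that concrete, here is a rough sketch of the size-first pipeline in Python; the names and the starting directory are mine, not from any particular tool, and I've collapsed the two-hash step into a single MD5 since the final byte-for-byte comparison is the real arbiter anyway.

import filecmp
import hashlib
import os
from collections import defaultdict

def find_dupes(top):
    # Pass 1: group by size; the OS hands this to us during traversal.
    by_size = defaultdict(list)
    for root, dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass    # unreadable file, skip it

    # Pass 2: hash only the files whose size collides with something else.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            h = hashlib.md5()
            with open(path, 'rb') as f:
                for block in iter(lambda: f.read(1024 * 1024), b''):
                    h.update(block)
            by_hash[(size, h.hexdigest())].append(path)

    # Pass 3: byte-for-byte check before calling anything a duplicate.
    for (size, digest), paths in by_hash.items():
        if len(paths) < 2:
            continue
        first = paths[0]
        dupes = [p for p in paths[1:] if filecmp.cmp(first, p, shallow=False)]
        if dupes:
            yield first, dupes

for keeper, dupes in find_dupes(r'D:\old_backups'):
    print(keeper, '==', dupes)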