Ask Slashdot: How Do I De-Dupe a System With 4.2 Million Files?
Posted by samzenpus
from the copies-of-the-copies dept.
First time accepted submitter jamiedolan writes "I've managed to consolidate most of my old data from the last decade onto drives attached to my main Windows 7 PC. Lots of files of all types, from digital photos and scans to HD video files (with website backups mixed in, which are the cause of such a high number of files). More recently I've organized files in a reasonable folder system and have an active, automated backup system. The problem is that I know I have many old files that have been duplicated multiple times across my drives (many from doing quick backups of important data to an external drive that later got consolidated onto a single larger drive), chewing up space. I tried running a free de-dupe program, but it ran for a week straight and was still 'processing' when I finally gave up on it. I have a fast system, an i7 at 2.8GHz with 16GB of RAM, but currently have 4.9TB of data with a total of 4.2 million files. Manual sorting is out of the question due to the number of files and my old sloppy filing (folder) system. I do need to keep the data; nuking it is not a viable option."
Simple dedupe algorithm (Score:5, Funny)
Delete all files but one. The remaining file is guaranteed unique!
Re:CRC (Score:3, Funny)
Do a CRC32 of each file. Write to a file one per line in this order: CRC, directory, filename. Sort the file by CRC. Read the file linearly doing a full compare on any file with the same CRC (these will be adjacent in the file).
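For anyone who wants to try the recipe above, here is a rough Python sketch of it, not a tested tool for a 4.9TB archive. It assumes the list of (CRC, directory, filename) records fits in memory, so a sorted in-memory list stands in for the sorted text file. The function names (`crc32_of_file`, `index_tree`, `find_duplicates`) are made up for illustration. It only reports duplicates and never deletes anything.

```python
import filecmp
import os
import zlib
from itertools import groupby


def crc32_of_file(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so large videos never sit fully in RAM."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc


def index_tree(root):
    """Build (crc, directory, filename) records for every regular file under root."""
    records = []
    for directory, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(directory, filename)
            if os.path.islink(path):
                continue  # assumption: symlinks are not worth de-duping
            try:
                records.append((crc32_of_file(path), directory, filename))
            except OSError:
                continue  # unreadable file: skip it rather than abort the run
    return records


def find_duplicates(root):
    """Return lists of paths whose contents are byte-for-byte identical."""
    records = sorted(index_tree(root))  # equal CRCs end up adjacent
    duplicate_sets = []
    for _crc, group in groupby(records, key=lambda record: record[0]):
        paths = [os.path.join(d, f) for _c, d, f in group]
        # A matching CRC is only a hint, so confirm with a full compare.
        clusters = []
        for path in paths:
            for cluster in clusters:
                if filecmp.cmp(cluster[0], path, shallow=False):
                    cluster.append(path)
                    break
            else:
                clusters.append([path])
        duplicate_sets.extend(c for c in clusters if len(c) > 1)
    return duplicate_sets
```

Calling `find_duplicates` on a folder returns one list per set of identical files. Note that this reads every byte of every file at least once, so on millions of files the disk, not the CPU, will probably be the bottleneck.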
Would you be so kind as to write a program/script which can do that?
Payment information please, AC?
Re:CRC (Score:4, Funny)
I looked at this as I, like the subby, have terabytes of porn to sort.
But $19.95 for a beta?