Automated PDF File Integrity Checking? 40
WomensHealth writes "I have about 6500 pdfs in my 'My Paperport Documents' folder that I've created over the years. As with all valuable data, I maintain off-site backups. Occasionally, when accessing a very old folder, I'll find one or two corrupted files. I would like to incorporate into my backup routine, a way of verifying the integrity of each file, so that I can immediately identify and replace with a backed-up version, any that might become corrupted. I'm not talking about verifying the integrity of the backup as a whole, instead, I want to periodically check the integrity of each individual PDF in the collection. Any way to do this in an automated fashion? I could use either an XP or OS X solution. I could even boot a Linux distro if required."
Duplicity + S3 + log file (Score:1, Interesting)
quick script (Score:3, Interesting)
Sure there's a way (Score:3, Interesting)
Or maybe there are other cmd-line tools which issue a "failed to load" error. That's where I'd look first. Like a tool to strip content out of a PDF - script it so it outputs to
Linus to the rescue: Use Git (Score:2, Interesting)
http://git.or.cz/ [git.or.cz]
Check them all into a repository, then periodically run git-fsck. Git hashes all files in a repository with SHA-1 when they're first committed, and git-fsck recalculates the hashes.
Jon
Ghostscript (Score:3, Interesting)
Many are commenting on using checksums (MD5, SHA, ....) to validate the file hasn't changed. This is good. However, none of these can actually tell if the PDF was is good to begin with. I would suggest using Ghostscript to verify that the PDF is properly structured. Ghostscript is an opensource tool that can convert PDF and Postscript files to several other formats. If Ghostscript can interpret the PDF file without errors, then odds are the file is good too.