Automated PDF File Integrity Checking?
WomensHealth writes "I have about 6500 PDFs in my 'My Paperport Documents' folder that I've created over the years. As with all valuable data, I maintain off-site backups. Occasionally, when accessing a very old folder, I'll find one or two corrupted files. I would like to incorporate into my backup routine a way of verifying the integrity of each file, so that I can immediately identify any that become corrupted and replace them with a backed-up version. I'm not talking about verifying the integrity of the backup as a whole; instead, I want to periodically check the integrity of each individual PDF in the collection. Any way to do this in an automated fashion? I could use either an XP or OS X solution. I could even boot a Linux distro if required."
How about... (Score:5, Informative)
Re:How about... (Score:5, Informative)
Re:How about... (Score:4, Informative)
It's part of one of the resource kits.
Re:How about... (Score:5, Informative)
use ZFS (Score:3, Informative)
Multivalent (Score:2, Informative)
http://multivalent.sourceforge.net/ [sourceforge.net]
The Multivalent suite of document tools includes a command-line utility that validates PDFs. It can be run across a whole directory of files too, so should do the trick.
Written in Java, so should run anywhere.
md5? WTF? RTFQ, morans (Score:5, Informative)
MD5 generates a hash of the binary data of the PDF file. An MD5 hash will not tell you whether a PDF file is already corrupt; it only tells you whether the bytes have changed since the hash was taken, so it is only useful once the integrity of the PDF has been confirmed. After the integrity is confirmed, you can build your database of MD5 hashes to detect future corruption.
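For the hash-database half of that, here is a minimal Python sketch, assuming you have already confirmed the PDFs are good. The function names (`md5_of`, `build_manifest`, `changed_files`) are made up for illustration; only `hashlib` and `pathlib` from the standard library are used. It records one MD5 per PDF under a folder, and later reports any file whose bytes no longer match or that has gone missing.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each PDF's path (relative to root, '/'-separated) to its MD5.

    Run this only after the files have been verified as good; the hash
    says nothing about whether a PDF was valid to begin with.
    """
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix.lower() == ".pdf"
    }


def changed_files(root, manifest):
    """List manifest entries that are now missing or whose MD5 differs."""
    root = Path(root)
    suspect = []
    for rel_path, expected in manifest.items():
        candidate = root / rel_path
        if not candidate.is_file() or md5_of(candidate) != expected:
            suspect.append(rel_path)
    return suspect
```

The manifest is a plain dict, so it could be saved with `json.dump` next to the backup and reloaded before each run. Anything `changed_files` returns is a candidate for restoring from the off-site copy. Note that a deliberately edited PDF will also show up as changed.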
To test that a given file is a valid PDF, you could probably use something like pdf2ps; you don't care about the PostScript output per se, you're only testing for an error code. If pdf2ps returns an error code, set the file aside for manual verification. This should, if nothing else, whittle that 6500-PDF archive down to a much smaller subset that you can feasibly test manually using Adobe Acrobat. And if you "refry" those (print each one back to the Adobe PDF printer to re-PDF it), that will probably fix the PDF so it passes the pdf2ps test.
I will leave the actual writing of a script to recurse through your directories, feed each PDF file through pdf2ps, and test for error codes, as an exercise to the OP. Now that you have an idea of what to do, actually doing it should be pretty simple.
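For anyone who wants a starting point, here is a rough Python sketch of such a script. It assumes Ghostscript's `pdf2ps` is installed and on the PATH, and that it exits non-zero when it cannot process a file, which is worth confirming against a known-bad PDF before trusting the results. The function names (`find_pdfs`, `passes_check`, `suspect_pdfs`) and the `command` parameter are invented for this example.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def find_pdfs(root):
    """Recursively collect every *.pdf file (case-insensitive) under root."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def passes_check(pdf_path, command=("pdf2ps",), timeout=120):
    """Run `command <pdf> <scratch output>` and report whether it exited 0.

    The PostScript output lands in a temporary directory and is thrown
    away; only the exit code matters. A conversion that hangs past
    `timeout` seconds is treated as a failure.
    """
    with tempfile.TemporaryDirectory() as scratch:
        output = os.path.join(scratch, "out.ps")
        try:
            result = subprocess.run(
                [*command, str(pdf_path), output],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


def suspect_pdfs(root, command=("pdf2ps",)):
    """Return the PDFs under root that the checker rejected."""
    return [p for p in find_pdfs(root) if not passes_check(p, command)]
```

Something like `for p in suspect_pdfs("My Paperport Documents"): print(p)` would then print the files to set aside for manual checking. The `command` parameter exists only so another checker can be swapped in; if the tool is not installed, `subprocess.run` raises `FileNotFoundError` rather than silently flagging every file.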
PDF validation (Score:5, Informative)
http://multivalent.sourceforge.net/Tools/pdf/Validate.html [sourceforge.net]
There is also a tool for repairing some pdf errors:
http://multivalent.sourceforge.net/Tools/index.html [sourceforge.net]
Never used it myself, just stumbled over it when I was searching for some pdf software.
--
Regards
Re:How about... (Score:2, Informative)
XPdf (Score:2, Informative)