
Automated PDF File Integrity Checking? 40

Posted by timothy
from the one-at-a-time dept.
WomensHealth writes "I have about 6,500 PDFs in my 'My Paperport Documents' folder, created over the years. As with all valuable data, I maintain off-site backups. Occasionally, when accessing a very old folder, I'll find one or two corrupted files. I would like to add to my backup routine a way of verifying the integrity of each file, so that I can immediately identify any that become corrupted and replace them with a backed-up version. I'm not talking about verifying the integrity of the backup as a whole; rather, I want to periodically check the integrity of each individual PDF in the collection. Is there any way to do this in an automated fashion? I could use either an XP or an OS X solution. I could even boot a Linux distro if required."

Comments Filter:
  • How about... (Score:5, Informative)

    by Uncle Focker (1277658) on Thursday May 22, 2008 @02:14PM (#23509666)
    Maintaining a database of md5 checksums of the archived versions of the files and periodically checking your live versions against them?
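    A minimal sketch of that idea, assuming GNU coreutils; "archive" and "manifest.md5" are placeholder names, and the demo file stands in for the real collection:

```shell
# Create a stand-in for the real archive so the sketch runs anywhere.
mkdir -p archive
printf '%%PDF-1.4 demo contents' > archive/sample.pdf

# 1. Record a checksum for every PDF under archive/
find archive -name '*.pdf' -print0 | xargs -0 md5sum > manifest.md5

# 2. Later, re-check the live files; any line not ending in ": OK"
#    names a file that has changed (or been corrupted) since step 1.
md5sum -c manifest.md5
```

    The same manifest can be kept with the off-site backup, so either side can be checked against it.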
  • Re:How about... (Score:5, Informative)

    by ZephyrXero (750822) <zephyrxero@nOsPam.yahoo.com> on Thursday May 22, 2008 @02:20PM (#23509760) Homepage Journal
    It sounds more like what he needs is to take an md5sum of new files when they are added to the archive, and then verify that any later changes were made by a user deliberately overwriting the file rather than by the kind of software/hardware corruption he's apparently experiencing. The md5 part is easy to automate; the second part may require a human eye :/
  • Re:How about... (Score:4, Informative)

    by Ritchie70 (860516) on Thursday May 22, 2008 @02:25PM (#23509820) Journal
    For Windows, Microsoft has a free command-line tool, "FCIV.EXE" (File Checksum Integrity Verifier), that will do this (MD5 and/or SHA-1) and save it all in an XML database for you. It will also then validate the files against that database.

    It's part of one of the resource kits.
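    For reference, a typical FCIV session looks roughly like this (a sketch; the folder and database names are placeholders, and it only runs on Windows):

```
rem Build an XML database of MD5 hashes, recursing into subfolders
fciv.exe -add "C:\My Paperport Documents" -r -md5 -xml hashes.xml

rem Later: re-hash every file and compare against the database
fciv.exe -v -xml hashes.xml
```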
  • Re:How about... (Score:5, Informative)

    by Azarael (896715) on Thursday May 22, 2008 @02:38PM (#23510018) Homepage
    This is one of the features of the git revision control system:

    File integrity checking is built into the basic lookup mechanism, so that corruption will be detected automatically.
    from http://lwn.net/Articles/145194/ [lwn.net]
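    A small sketch of why that works (repository and file names here are arbitrary): git stores every object under the SHA-1 hash of its contents, and git fsck re-hashes everything, so silent corruption of the stored copies shows up as an error:

```shell
# Put the archive in a git repository; each file is stored under the
# SHA-1 of its contents.
mkdir -p pdfrepo
git -C pdfrepo init -q
printf '%%PDF-1.4 demo' > pdfrepo/doc.pdf
git -C pdfrepo add doc.pdf
git -C pdfrepo -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -m 'archive doc.pdf'

# fsck re-reads and re-hashes every object; corruption of the stored
# data would be reported here.
git -C pdfrepo fsck --full
```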
  • use ZFS (Score:3, Informative)

    by larry bagina (561269) on Thursday May 22, 2008 @02:55PM (#23510312) Journal
    It has built-in integrity checking: every block is checksummed, so corruption is detected on read or during a scrub, and repaired automatically if the pool has redundancy.
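    In outline (pool name is a placeholder; this needs a ZFS-capable system, so it's an untested sketch here):

```
# re-read and re-verify every block in the pool against its checksum
zpool scrub tank

# report any checksum errors the scrub found
zpool status tank
```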
  • Multivalent (Score:2, Informative)

    by Anonymous Coward on Thursday May 22, 2008 @03:42PM (#23510930)
    I once found this:

    http://multivalent.sourceforge.net/ [sourceforge.net]

    The Multivalent suite of document tools includes a command-line utility that validates PDFs. It can be run across a whole directory of files too, so it should do the trick.

    It's written in Java, so it should run anywhere.
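    Invocation is roughly like this (sketched from the project docs; the jar name and class path may differ by version, so treat it as a starting point):

```
java -cp Multivalent.jar tool.pdf.Validate "My Paperport Documents"/*.pdf
```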
  • by gblues (90260) on Thursday May 22, 2008 @03:47PM (#23510998)
    The OP is not asking about preventing future corruption; the OP wants an automated way to sift through 6500 PDFs to find corrupt (or at least, potentially corrupt) PDF files without having to open each one by hand.

    MD5 generates a hash of the binary data of the PDF file. An MD5 hash will not tell you whether a PDF file is corrupt; it is only useful once the integrity of the PDF has been confirmed. After that, you can build your database of MD5 hashes to detect future corruption.

    To test that a given file is a valid PDF, you could probably use something like pdf2ps; you don't care about the PostScript output per se, you'd be testing the exit code. If pdf2ps returns an error code, you set the file aside for manual verification. This should, if nothing else, whittle that 6500-PDF archive down to a much smaller subset that you can feasibly check by hand in Adobe Acrobat. And "refrying" those (printing them back to the Adobe PDF printer to re-create the PDF) will probably fix them so they pass the pdf2ps test.

    I will leave the actual writing of a script to recurse through your directories, feed each PDF file through pdf2ps, and test for error codes, as an exercise to the OP. Now that you have an idea of what to do, actually doing it should be pretty simple.
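    For anyone taking up that exercise, a minimal POSIX-shell sketch of the loop. VALIDATOR defaults to "true" here so the sketch runs anywhere; on a real system, point it at pdf2ps (which will also write a .ps output file) or another checker:

```shell
#!/bin/sh
# VALIDATOR is a stand-in: "true" accepts everything so this is runnable
# as-is; replace it with pdf2ps (or similar) on a real system.
VALIDATOR=${VALIDATOR:-true}

mkdir -p docs && printf '%%PDF-1.4 demo' > docs/ok.pdf   # demo input

: > suspect.txt                       # files that fail validation
find docs -name '*.pdf' | while read -r f; do
    if ! "$VALIDATOR" "$f" >/dev/null 2>&1; then
        echo "$f" >> suspect.txt      # set aside for manual checking
    fi
done

wc -l < suspect.txt                   # number of suspect files
```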
  • PDF validation (Score:5, Informative)

    by Peter H.S. (38077) on Thursday May 22, 2008 @04:01PM (#23511182) Homepage
    Here is a Java command-line tool designed to check the validity of thousands of PDF files:

    http://multivalent.sourceforge.net/Tools/pdf/Validate.html [sourceforge.net]

    There is also a tool for repairing some pdf errors:
    http://multivalent.sourceforge.net/Tools/index.html [sourceforge.net]

    I've never used it myself; I just stumbled across it while searching for some PDF software.

    --
    Regards
  • Re:How about... (Score:2, Informative)

    by TheRaven64 (641858) on Thursday May 22, 2008 @04:11PM (#23511320) Journal
    MD5 gives you error detection, but not correction. You'd be better off with par2 for this kind of thing. When you add a file, run the par2 utility to generate the check file. On OS X, do this with a Folder Action whenever a new file is created with a .pdf extension. Then just set up a cron job that runs every month or so and attempts to verify / repair the files. Make sure you check the output of this, since silent data corruption is usually a sign that the drive is on its way out.
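    In outline (file names are placeholders, and this needs the par2 command installed, so it's an untested sketch), that looks something like:

```
# when a file is added: create recovery data (here ~10% redundancy)
par2 create -r10 report.pdf.par2 report.pdf

# from the monthly cron job: verify, and repair if the file was damaged
par2 verify report.pdf.par2
par2 repair report.pdf.par2
```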
  • XPdf (Score:2, Informative)

    by dtrumpet (1294668) on Friday May 23, 2008 @01:36PM (#23520820)
    XPdf comes with a 'pdfinfo' command-line utility that returns a non-zero exit code if the PDF is corrupt. It should be reasonably fast and very easy to automate.
