Data Storage Software

Automated PDF File Integrity Checking? 40

Posted by timothy
from the one-at-a-time dept.
WomensHealth writes "I have about 6,500 PDFs in my 'My Paperport Documents' folder that I've created over the years. As with all valuable data, I maintain off-site backups. Occasionally, when accessing a very old folder, I'll find one or two corrupted files. I would like to incorporate into my backup routine a way of verifying the integrity of each file, so that I can immediately identify any that have become corrupted and replace them with a backed-up version. I'm not talking about verifying the integrity of the backup as a whole; instead, I want to periodically check the integrity of each individual PDF in the collection. Is there any way to do this in an automated fashion? I could use either an XP or OS X solution. I could even boot a Linux distro if required."
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Thursday May 22, 2008 @02:17PM (#23509712)
    Use a remote backup that notifies you of changed files and keeps previous versions of every file for restoration.
  • quick script (Score:3, Interesting)

    by debatem1 (1087307) on Thursday May 22, 2008 @02:17PM (#23509714)
    It wouldn't be too hard to write an inotify script that stores a backup of the file and an md5sum whenever you drop a file in. It wouldn't help you recover an already corrupt document, but it would help you catch corruption in the future. A tie-in to the actions menu would make it more usable, but that's a bit more effort, and such solutions probably already exist.
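
A minimal sketch of the checksum half of this idea, in Python. It swaps the inotify trigger for a periodic scan (inotify is Linux-only, and the poster also mentioned XP and OS X), and it only detects change; it does not store backups. The function names `md5_of`, `build_manifest`, and `compare` are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each PDF path (relative to root) to its MD5 digest.

    Assumption: files end in lowercase ".pdf"; the pattern is
    case-sensitive on most non-Windows systems.
    """
    root = Path(root)
    return {
        str(pdf.relative_to(root)): md5_of(pdf)
        for pdf in sorted(root.rglob("*.pdf"))
    }


def compare(old, new):
    """Return (changed, missing, added) relative paths between two manifests."""
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    missing = sorted(p for p in old if p not in new)
    added = sorted(p for p in new if p not in old)
    return changed, missing, added
```

One way to use it: save the result of `build_manifest` (for example as JSON) after each backup, rebuild it on the next run, and restore anything `compare` reports as changed that you did not edit yourself. A file that was legitimately re-saved will also show up as changed, so the report needs a human glance.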
  • Sure there's a way (Score:3, Interesting)

    by b4dc0d3r (1268512) on Thursday May 22, 2008 @02:29PM (#23509882)
    There are PDF libraries out there. Write a wrapper that loads a file and returns 0 ("no error") if it gets to the end without error, and a non-zero code if anything goes wrong.

    Or maybe there are other cmd-line tools which issue a "failed to load" error. That's where I'd look first. Like a tool to strip content out of a PDF - script it so it outputs to /dev/null and check the exit code. I'd be surprised if there were a ready-made solution for this somewhere.
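
The "script it and check the exit code" approach could look roughly like this in Python. This is a sketch under assumptions: the checker command is whatever tool you settle on (none is named here), it is assumed to accept the PDF path as its last argument, and it is assumed to exit non-zero when it cannot load the file. `loads_cleanly` and `find_failures` are hypothetical names.

```python
import subprocess
from pathlib import Path


def loads_cleanly(command, pdf_path):
    """Run command + [pdf_path], discard all output, report whether it exited 0."""
    result = subprocess.run(
        list(command) + [str(pdf_path)],
        stdout=subprocess.DEVNULL,  # same effect as redirecting to /dev/null
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_failures(command, root):
    """Return every PDF under root for which the checker exits non-zero."""
    return [
        pdf
        for pdf in sorted(Path(root).rglob("*.pdf"))
        if not loads_cleanly(command, pdf)
    ]
```

`command` is a list such as `["some-pdf-tool", "--some-flag"]` (placeholder values). Whether a given tool really signals a damaged file through its exit code has to be verified against a known-bad PDF before trusting the results.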
  • by stew1 (40578) on Thursday May 22, 2008 @03:44PM (#23510950) Homepage Journal
    Use git.

    http://git.or.cz/ [git.or.cz]

    Check them all into a repository, then periodically run git-fsck. Git hashes all files in a repository with SHA-1 when they're first committed, and git-fsck recalculates the hashes.
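
A rough Python wrapper around this workflow, assuming `git` is on the PATH and the folder has already been initialised with `git init`. Current git spells the command `git fsck`. Note the hedge: as far as I understand it, `git fsck` verifies the objects stored under `.git` (git's own copies), not the working files next to them, so a clean result says the stored versions are intact rather than that the PDFs in the folder are. The helper names and the placeholder committer identity are made up for illustration.

```python
import subprocess


def git(repo, *args):
    """Run a git subcommand inside repo and return the CompletedProcess."""
    return subprocess.run(
        ["git", "-C", str(repo), *args],
        capture_output=True,
        text=True,
    )


def snapshot(repo, message="snapshot"):
    """Stage everything under repo and commit it.

    The committer identity below is a placeholder so the commit works
    even where no global git identity is configured.
    """
    git(repo, "add", "-A")
    return git(
        repo,
        "-c", "user.name=backup",
        "-c", "user.email=backup@example.invalid",
        "commit", "-q", "-m", message,
    )


def object_store_ok(repo):
    """True if `git fsck --full` reports no problems with the stored objects."""
    return git(repo, "fsck", "--full").returncode == 0
```

If `object_store_ok` passes, a known-good copy of any file can be pulled back out of the repository with `git checkout -- <path>`.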

    Jon
  • Ghostscript (Score:3, Interesting)

    by Marillion (33728) <ericbardes.gmail@com> on Thursday May 22, 2008 @04:08PM (#23511272)

    Many are commenting on using checksums (MD5, SHA, ...) to validate that the file hasn't changed. This is good. However, none of these can actually tell whether the PDF was good to begin with. I would suggest using Ghostscript to verify that the PDF is properly structured. Ghostscript is an open-source tool that can convert PDF and PostScript files to several other formats. If Ghostscript can interpret the PDF file without errors, then odds are the file is good.
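
A sketch of a Ghostscript-based check in Python. It renders every page to Ghostscript's `nullpage` device, which discards the output, so nothing is written to disk. Assumptions, stated loudly: the executable is called `gs` (on Windows the console build has a different name, such as `gswin32c`), and any message printed while running in quiet mode is treated as a sign of trouble. That second assumption is a deliberately strict choice, because Ghostscript may repair a damaged file and still exit 0. The function names are made up for illustration.

```python
import subprocess


def ghostscript_command(pdf_path, executable="gs"):
    """Build a Ghostscript call that interprets the whole file and discards the output."""
    return [
        executable,
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # do not wait for a keypress between pages
        "-dBATCH",            # exit when the file has been processed
        "-sDEVICE=nullpage",  # render to a device that throws pages away
        str(pdf_path),
    ]


def looks_clean(returncode, output_text):
    """Assumption: a good file exits 0 and prints nothing in quiet mode."""
    return returncode == 0 and not output_text.strip()


def check_pdf(pdf_path, executable="gs"):
    """Run Ghostscript on one PDF and report whether it looked clean."""
    result = subprocess.run(
        ghostscript_command(pdf_path, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams; messages may land on either
        text=True,
    )
    return looks_clean(result.returncode, result.stdout)
```

Run `check_pdf` over the collection once to establish which files are good to begin with, then rely on checksums to notice later changes. It is worth trying it on a deliberately truncated PDF first to confirm that your Ghostscript version complains the way this sketch expects.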
