Automated PDF File Integrity Checking?

Posted by timothy on Thursday May 22, 2008 @03:13PM from the one-at-a-time dept.

WomensHealth writes "I have about 6500 PDFs in my 'My Paperport Documents' folder that I've created over the years. As with all valuable data, I maintain off-site backups. Occasionally, when accessing a very old folder, I'll find one or two corrupted files. I would like to incorporate into my backup routine a way of verifying the integrity of each file, so that I can immediately identify any that become corrupted and replace them with a backed-up version. I'm not talking about verifying the integrity of the backup as a whole; instead, I want to periodically check the integrity of each individual PDF in the collection. Is there any way to do this in an automated fashion? I could use either an XP or OS X solution. I could even boot a Linux distro if required."

Duplicity + S3 + log file (Score:1, Interesting)

by Anonymous Coward writes: on Thursday May 22, 2008 @03:17PM (#23509712)

That combination gives you remote backup, a notification of changed files, and versions of all previous files for restoration.

quick script (Score:3, Interesting)

by debatem1 ( 1087307 ) writes: on Thursday May 22, 2008 @03:17PM (#23509714)

It wouldn't be too hard to write an inotify script that stores a backup of the file and an md5sum whenever you drop a file in. It wouldn't help you recover an already corrupt document, but it would help you catch corruption in the future. A tie-in to the actions menu would make it more usable, but that's a bit more effort, and such solutions probably already exist.
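A rough sketch of the checksum half of that idea, in Python. It leaves out inotify and simply rescans the folder on each run, which is an assumption made to keep it portable to XP and OS X. The function names (`md5_of`, `build_manifest`, `find_changed`) are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each PDF path (relative to root) to its current MD5 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): md5_of(p)
        for p in sorted(root.rglob("*.pdf"))
    }


def find_changed(manifest, root):
    """List recorded files that are now missing or hash differently."""
    root = Path(root)
    changed = []
    for rel, recorded in manifest.items():
        target = root / rel
        if not target.is_file() or md5_of(target) != recorded:
            changed.append(rel)
    return changed
```

Build the manifest once while the files are known to be good, save it somewhere (for example with `json.dump`), and run `find_changed` as part of each backup. Note that the `*.pdf` pattern may miss files named `.PDF` on a case-sensitive filesystem, and that a changed hash only says the file differs from the recorded copy, not which copy is the good one.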

Sure there's a way (Score:3, Interesting)

by b4dc0d3r ( 1268512 ) writes: on Thursday May 22, 2008 @03:29PM (#23509882)

There are PDF libraries out there - write a wrapper that loads a file and, if it gets to the end without error, returns 0 ("no error"); any error results in a non-zero return code.

Or maybe there are other cmd-line tools which issue a "failed to load" error. That's where I'd look first. Like a tool to strip content out of a PDF - script it so it outputs to /dev/null and check the exit code. I'd be surprised if there were a ready-made solution for this somewhere.
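The exit-code approach could look something like the sketch below. It is tool-agnostic: you hand it a command template, and `"{}"` is replaced with the PDF path. The template convention and the function names are invented here; the `pdftotext` example in the comment assumes that tool is installed and that it exits non-zero when it cannot read a file.

```python
import subprocess


def tool_accepts(pdf_path, command_template):
    """Run the tool on one file, discard its output, and report whether
    it exited with status 0. "{}" in the template becomes the file path."""
    command = [
        str(pdf_path) if part == "{}" else part
        for part in command_template
    ]
    result = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,  # same idea as redirecting to /dev/null
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def failing_files(pdf_paths, command_template):
    """Return the paths for which the tool reported a non-zero exit code."""
    return [p for p in pdf_paths if not tool_accepts(p, command_template)]


# Example template (assumption: pdftotext is on the PATH; "-" sends the
# extracted text to stdout, which is thrown away above):
#   failing_files(paths, ["pdftotext", "{}", "-"])
```

A file that passes has only been shown to be readable by that one tool, so treat a pass as "probably fine" and not as proof.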

Linus to the rescue: Use Git (Score:2, Interesting)

by stew1 ( 40578 ) writes: on Thursday May 22, 2008 @04:44PM (#23510950)

Use git.

http://git.or.cz/

Check them all into a repository, then periodically run git-fsck. Git hashes all files in a repository with SHA-1 when they're first committed, and git-fsck recalculates the hashes.
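If you wanted to script that from a backup job, a minimal sketch might look like this. It assumes `git` is on the PATH and uses the modern `git fsck` spelling. As far as I understand it, fsck verifies the repository's object store and not the checked-out copies, so the second (hypothetical) helper re-hashes a working file with `git hash-object` and compares it to the blob recorded in HEAD.

```python
import subprocess


def object_store_ok(repo_dir):
    """True when `git fsck --full` exits with status 0 for the repository."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), "fsck", "--full"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def _git_output(repo_dir, *args):
    """Run a git command in repo_dir and return its trimmed stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def working_copy_matches(repo_dir, rel_path):
    """Compare the on-disk file's blob hash with the one committed in HEAD.

    rel_path is relative to the repository root, which repo_dir is
    assumed to be.
    """
    committed = _git_output(repo_dir, "rev-parse", f"HEAD:{rel_path}")
    on_disk = _git_output(repo_dir, "hash-object", "--", rel_path)
    return committed == on_disk
```

Keep in mind that this doubles the disk space for 6500 PDFs, and that the repository itself still needs an off-site copy.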

Jon

Ghostscript (Score:3, Interesting)

by Marillion ( 33728 ) writes: on Thursday May 22, 2008 @05:08PM (#23511272)

Many are commenting on using checksums (MD5, SHA, ...) to validate that the file hasn't changed. This is good. However, none of these can actually tell whether the PDF was good to begin with. I would suggest using Ghostscript to verify that the PDF is properly structured. Ghostscript is an open source tool that can convert PDF and PostScript files to several other formats. If Ghostscript can interpret the PDF file without errors, then odds are the file is good too.
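One way to script the Ghostscript check, sketched in Python. The assumptions: the executable is called `gs` (on Windows it is usually `gswin32c` or `gswin64c`), the `nullpage` device is used so nothing is written to disk, and `-dPDFSTOPONERROR` is passed because Ghostscript otherwise tends to repair what it can and carry on. Treating any output containing the word "error" as a failure is a heuristic of mine, not documented Ghostscript behaviour.

```python
import subprocess


def gs_command(pdf_path, executable="gs"):
    """Build a Ghostscript command that interprets the PDF and renders
    every page to the nullpage device, producing no output file."""
    return [
        executable,
        "-q",
        "-dNOPAUSE",
        "-dBATCH",
        "-dPDFSTOPONERROR",
        "-sDEVICE=nullpage",
        str(pdf_path),
    ]


def looks_clean(returncode, output):
    """Heuristic: exit status 0 and no mention of an error in the output."""
    return returncode == 0 and "error" not in output.lower()


def ghostscript_accepts(pdf_path, executable="gs"):
    """Run Ghostscript on one PDF and apply the heuristic above."""
    result = subprocess.run(
        gs_command(pdf_path, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold warnings into the captured output
        text=True,
        errors="replace",
    )
    return looks_clean(result.returncode, result.stdout)
```

Run this once over the whole collection to weed out files that are already bad, then rely on checksums to catch later changes. Rendering 6500 files takes a while, so it is probably not something to do on every backup.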
