Automated PDF File Integrity Checking?
WomensHealth writes "I have about 6,500 PDFs in my 'My Paperport Documents' folder that I've created over the years. As with all valuable data, I maintain off-site backups. Occasionally, when accessing a very old folder, I'll find one or two corrupted files. I would like to incorporate into my backup routine a way of verifying the integrity of each file, so that I can immediately identify any that become corrupted and replace them with a backed-up version. I'm not talking about verifying the integrity of the backup as a whole; instead, I want to periodically check the integrity of each individual PDF in the collection. Is there any way to do this in an automated fashion? I could use either an XP or OS X solution. I could even boot a Linux distro if required."
How about... (Score:5, Informative)
Re:How about... (Score:5, Informative)
Re: (Score:2)
Re:How about... (Score:5, Informative)
Re: (Score:2, Insightful)
Re:How about... (Score:4, Informative)
It's part of one of the resource kits.
Re: (Score:2, Informative)
Re: (Score:2)
There is no article to NOT read here, buddy. And PAR(2) could be the worst suggestion for this situation that I have ever heard. Parity is meant to work over a fixed, finite set of data; this guy has a collection of PDFs that keeps changing. You've just added a layer of complexity (you'd need to somehow define "sets" of PDFs).
Re: (Score:2)
I'm definitely not a programmer or a math geek.
Duplicity + S3 + log file (Score:1, Interesting)
quick script (Score:3, Interesting)
Sure there's a way (Score:3, Interesting)
Or maybe there are other cmd-line tools which issue a "failed to load" error. That's where I'd look first. Like a tool to strip content out of a PDF - script it so it outputs to
md5sum (Score:2, Insightful)
md5sum *.pdf > sums
md5sum -c sums
Not exactly automated, but I wouldn't exactly call typing 2 lines to be manual labor; and once you've got the sums you really just need the second line.
Put something like this in a shell script and you can make it automatically replace files that fail a hash check with a good backup. Use perl, python, or whatever, and you can make it work across Windows, OS X, and *nix.
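For example, a minimal sketch (assuming GNU md5sum, a "sums" file in the collection's root, and placeholder paths ~/PaperportDocuments and /mnt/backup that you'd adjust to your own layout):

    #!/bin/sh
    # Re-verify every checksum and restore any file that fails from the backup copy.
    cd ~/PaperportDocuments || exit 1
    md5sum -c sums 2>/dev/null | grep ': FAILED' | cut -d: -f1 |
    while read -r file; do
        echo "Corrupt: $file - restoring from backup"
        cp "/mnt/backup/PaperportDocuments/$file" "$file"
    done

(Filenames containing colons would confuse the cut, but for a typical document folder this gets the idea across.)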
Re: (Score:2)
That assumes that all the PDFs start out valid and will never be validly changed. What you really want is something like using Ghostscript to render each PDF to a temporary image, and then an automated check to make sure the image isn't 100% blank (or just accepting the result if Ghostscript doesn't exit with an error).
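A rough sketch of that idea, assuming Ghostscript and ImageMagick are installed (a render that is one solid colour has a pixel standard deviation of zero):

    #!/bin/sh
    # Render page 1 of each PDF to a grayscale PNG, then flag renders that come
    # out as a single solid colour, i.e. probably blank.
    for f in *.pdf; do
        if ! gs -q -dNOPAUSE -dBATCH -sDEVICE=pnggray -r72 \
                -dFirstPage=1 -dLastPage=1 -sOutputFile=/tmp/check.png "$f"; then
            echo "Render failed: $f"
            continue
        fi
        identify -format '%[standard-deviation]' /tmp/check.png |
            awk -v f="$f" '$1 == 0 { print "Blank render (possibly corrupt): " f }'
    done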
Re: (Score:1)
haha... you should see my R-script/sed/awk/paste/echo/for-loop bash one-liners I did to process some R data analysis and make it LaTeX-table-ready, along with their respective graphics =oD
Yay for Linux... that was teh k1ll3r app that made me not run windows at work
Re: (Score:2)
Mind you, if the rendering is fubared, like a font problem or something, so the page looks like crap, it may still be a valid pdf and pass through any sort of check with no problem. A corrupted image will still show up
use ZFS (Score:3, Informative)
Re: (Score:2)
Checksums (Score:2)
Just use the Linux md5sum utility:
Create checksums: md5sum file > file.md5
Test: md5sum -c file.md5
Or use a compressor: bzip2 file
Test: bzip2 -tv file.bz2
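To apply that per-file pattern across the whole tree, an untested sketch (run from the top of the collection; assumes GNU md5sum):

    # Create one .md5 file next to each PDF...
    find . -name '*.pdf' -exec sh -c 'md5sum "$1" > "$1.md5"' _ {} \;
    # ...and later re-verify them all, printing only the failures.
    find . -name '*.md5' -exec md5sum -c --quiet {} \;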
Multivalent (Score:2, Informative)
http://multivalent.sourceforge.net/ [sourceforge.net]
The Multivalent suite of document tools includes a command-line utility that validates PDFs. It can be run across a whole directory of files too, so should do the trick.
Written in Java, so should run anywhere.
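I haven't used it, but going by the project pages the validator looks to be invoked roughly like this (the jar name and the tool.pdf.Validate class are assumptions taken from the docs, so check them before relying on it):

    # Validate every PDF under the current directory with Multivalent's checker.
    find . -name '*.pdf' -exec java -cp Multivalent.jar tool.pdf.Validate {} \;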
Linus to the rescue: Use Git (Score:2, Interesting)
http://git.or.cz/ [git.or.cz]
Check them all into a repository, then periodically run git-fsck. Git hashes all files in a repository with SHA-1 when they're first committed, and git-fsck recalculates the hashes.
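Something along these lines, assuming the collection lives in ~/PaperportDocuments:

    cd ~/PaperportDocuments
    git init
    git add .
    git commit -m "baseline of PDF collection"
    # Later, from cron or by hand: recompute and verify every object's SHA-1.
    git fsck --full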
Jon
md5? WTF? RTFQ, morans (Score:5, Informative)
MD5 generates a hash of the binary data of the PDF file. An MD5 hash will not tell you whether a PDF file is corrupt; it is only useful once the integrity of the PDF has been confirmed. After that, you can build your database of MD5 hashes to detect future corruption.
To test that a given file is a valid PDF, you could probably use something like pdf2ps; you don't care about the PostScript output per se, but you'd be testing for an error code. If pdf2ps returns an error code, you set the file aside for manual verification. This should, if nothing else, whittle that 6,500-PDF archive down to a much smaller subset that you can feasibly test manually in Adobe Acrobat. And "refrying" those (printing them back to the Adobe PDF printer to re-create the PDF) will probably fix them so they pass the pdf2ps test.
I will leave the actual writing of a script to recurse through your directories, feed each PDF file through pdf2ps, and test for error codes, as an exercise to the OP. Now that you have an idea of what to do, actually doing it should be pretty simple.
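That said, an untested sketch of such a loop (assuming the archive lives under ~/PaperportDocuments and pdf2ps is on the PATH) might look like:

    #!/bin/bash
    # Walk the archive, run every PDF through pdf2ps, and log the ones that fail.
    find ~/PaperportDocuments -name '*.pdf' -print0 |
    while IFS= read -r -d '' f; do
        if ! pdf2ps "$f" /dev/null 2>>pdf-errors.log; then
            echo "$f" >> suspect-pdfs.txt
        fi
    done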
Re: (Score:2)
If that is indeed the case, and he's repeatedly encountering corrupt files, then I'd suggest he's asked the wrong question.
As for pdf2ps, I'm unfamiliar with what error codes it returns, but if it's useful as you state, then it's worth pointing out that all the utilities he'll need (including md5,
PDF validation (Score:5, Informative)
http://multivalent.sourceforge.net/Tools/pdf/Validate.html [sourceforge.net]
There is also a tool for repairing some pdf errors:
http://multivalent.sourceforge.net/Tools/index.html [sourceforge.net]
Never used it myself, just stumbled over it when I was searching for some pdf software.
--
Regards
Ghostscript (Score:3, Interesting)
Many are commenting on using checksums (MD5, SHA, ...) to validate that the file hasn't changed. This is good. However, none of these can actually tell you whether the PDF was good to begin with. I would suggest using Ghostscript to verify that the PDF is properly structured. Ghostscript is an open-source tool that can convert PDF and PostScript files to several other formats. If Ghostscript can interpret the PDF file without errors, then odds are the file is good.
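For instance (the nullpage device discards the rendered output; a non-zero exit status means Ghostscript hit an error it couldn't recover from):

    # Interpret each PDF fully without producing output; report the ones gs rejects.
    for f in *.pdf; do
        gs -q -dNOPAUSE -dBATCH -sDEVICE=nullpage "$f" >/dev/null 2>&1 \
            || echo "Ghostscript could not process: $f"
    done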
Prevention, first (Score:3, Insightful)
From a general accessibility perspective, the age of the folders shouldn't matter, nor should the age of the files contained within them: A properly operating file system will maintain the integrity of the files it tracks indefinitely, assuming the underlying media is sound and all related hardware is functioning correctly.
Certainly, for verification of critical data, checksums are a good measure, so long as they are computed at the time of file creation and after verifying that the files are good. But in light of the reported symptoms, I'd investigate the source of the problem first and correct it. Then I'd make provisions for checksumming, in addition to regular file system health checks, before backing up those files and their checksums.
Proceeding from a bottom-up point of view: for Windows-based systems, regardless of the file system in use (although I'd hope you'd be using NTFS), regular file system scans via CHKDSK are a must. The same applies to the file systems of other OSes: run whatever utilities are available to verify the integrity of the file system on each hard drive regularly.
In addition, most hard drive manufacturers have utilities that you can download for free that will non-destructively scan the media for grown defects. These are typically available as ISOs: Make a CD, boot from it, and follow the instructions carefully, preferably after making a full, verified backup. Naturally, you'll have to know the manufacturer(s) of your hard drives.
Once you've identified the cause of the corruption, and corrected it, then you can (and should) make provisions for checksums.
But, there are other things that you can, and should check as well. Make sure that the AC power to your computer is sound from an electrical perspective and that the power available is sufficient for the load being placed upon it. Buy a good UPS if you don't have one already, and if you do have one, test it.
Then, test the power supply in the computer to ensure that it is providing adequate power.
Then test the memory in your computer.
Then test the hard drives, both surface level and file system level.
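A few concrete starting points for those steps (the Windows command needs an elevated prompt; smartctl assumes the smartmontools package, which isn't the vendor tool mentioned above but covers similar ground):

    chkdsk C: /F               (Windows: fix file system errors; add /R to also scan for bad sectors)
    smartctl -a /dev/sda       (Linux, with smartmontools: read the drive's SMART health data)
    smartctl -t long /dev/sda  (start the drive's own extended self-test)
    memtest86+                 (boot it from CD or USB to exercise the RAM)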
Hope this helps.
Re: (Score:2)
I was going to post exactly this. Files randomly becoming corrupted? Maybe if I had 5,000 Chinese kids remembering numbers, but data shouldn't just change on a computer, whether it's over the wire, on disk, or in memory. Treat the disease, not the symptom.
Re: (Score:2)
filesystem or hardware issue? (Score:2)
If you've got files on your computer that you only read, never write, and those files are getting corrupted, then it sounds like you have a problem with your filesystem, or a problem with your hardware. You need to find and fix the problem with the filesystem or hardware, not apply band-aids to PDF files if the problem has nothing to do with the PDF format per se.
Another possibility would be that you're using buggy software that is supposed to open PDF files in read-only mode, but actually corrupts them.
Re: (Score:3, Insightful)
Your suggestions are of course valid; it must be considered a high priority to find out whether the system is corrupting the files or whether they were bad from the beginning.
--
Regards
Might I suggest... (Score:1)
Version Control System (Score:2)
http://en.wikipedia.org/wiki/Comparison_of_revision_control_software [wikipedia.org] seems interesting; look for systems with distributed operation, atomic commits, and signed tags (though this by itself doesn't guarantee it catches file errors right away).
I use and love Git, and though Windows support is there, I have heard it is questionable.
XPdf (Score:2, Informative)