Data Storage Software

Automated PDF File Integrity Checking?

WomensHealth writes "I have about 6500 PDFs in my 'My Paperport Documents' folder that I've created over the years. As with all valuable data, I maintain off-site backups. Occasionally, when accessing a very old folder, I'll find one or two corrupted files. I would like to incorporate into my backup routine a way of verifying the integrity of each file, so that I can immediately identify any that have become corrupted and replace them with a backed-up version. I'm not talking about verifying the integrity of the backup as a whole; instead, I want to periodically check the integrity of each individual PDF in the collection. Is there any way to do this in an automated fashion? I could use either an XP or OS X solution. I could even boot a Linux distro if required."
This discussion has been archived. No new comments can be posted.

  • md5sum (Score:2, Insightful)

    by Nozsd ( 1080965 ) on Thursday May 22, 2008 @03:38PM (#23510012)
    md5sum *.pdf > sums
    md5sum -c sums

    Not exactly automated, but I wouldn't exactly call typing 2 lines to be manual labor; and once you've got the sums you really just need the second line.

    Put something like this in a shell script and you can make it automatically replace files that fail a hash check with a good backup. Use perl, python, or whatever, and you can make it work across Windows, OS X, and *nix.
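
    A minimal Python sketch of that idea, offered as one possible approach rather than a finished tool. It writes a manifest in the same "digest, two spaces, file name" layout that md5sum produces, re-checks it later, and restores a failing file only when the backup copy still matches the recorded digest. The function names and the assumption that the backup folder holds files under the same names are illustrative, not anything from the original post.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(folder, manifest):
    """Record a digest for every *.pdf directly inside folder.

    Returns the number of files recorded.
    """
    lines = [
        f"{file_digest(pdf)}  {pdf.name}"
        for pdf in sorted(Path(folder).glob("*.pdf"))
    ]
    Path(manifest).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def check_manifest(folder, manifest, backup=None):
    """Return the names of files that are missing or fail the hash check.

    If backup is given (assumed layout: same file names in one folder),
    a failing file is overwritten with the backup copy, but only when
    that copy still matches the recorded digest.
    """
    folder = Path(folder)
    failed = []
    for line in Path(manifest).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        digest, name = line.split("  ", 1)
        target = folder / name
        if target.is_file() and file_digest(target) == digest:
            continue
        failed.append(name)
        if backup is not None:
            source = Path(backup) / name
            if source.is_file() and file_digest(source) == digest:
                shutil.copy2(source, target)
    return failed
```

    Run write_manifest once while the files are known to be good, then call check_manifest from a scheduled job. Because it only uses the standard library, the same script should run on XP, OS X, or Linux.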
  • Re:How about... (Score:2, Insightful)

    by Last_Available_Usern ( 756093 ) on Thursday May 22, 2008 @04:14PM (#23510588)
    A checksum won't help if the user replaces/saves the file with a corrupted version.
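
    One partial answer to that gap is a structural sanity check run before a file's checksum is recorded. The sketch below relies only on two facts about the format: a PDF starts with a "%PDF-" header and ends with a "%%EOF" marker. The function name and the 1024-byte search window are assumptions made for illustration. It catches truncated or grossly overwritten files; it will not notice damage in the middle of a file, so it complements a checksum rather than replacing one.

```python
def looks_like_pdf(path, window=1024):
    """Cheap sanity check: header near the start, %%EOF near the end.

    Passing this check does not prove the file is a valid PDF; failing
    it is a strong hint that the file is truncated or overwritten.
    """
    with open(path, "rb") as f:
        head = f.read(window)
        f.seek(0, 2)                      # jump to end to learn the size
        size = f.tell()
        f.seek(max(0, size - window))     # read only the last `window` bytes
        tail = f.read()
    return b"%PDF-" in head and b"%%EOF" in tail
```

    A file that fails this check should be looked at by hand before its hash goes into the list of known-good sums.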
  • Prevention, first (Score:3, Insightful)

    by Anonymous Coward on Thursday May 22, 2008 @05:26PM (#23511512)
    One of the things that strikes me about the posts thus far is that nobody has asked the first and most important question: *WHY* are the files becoming corrupted? And what is the nature of the corruption?

    From a general accessibility perspective, the age of the folders shouldn't matter, nor should the age of the files contained within them: A properly operating file system will maintain the integrity of the files it tracks indefinitely, assuming the underlying media is sound and all related hardware is functioning correctly.

    Certainly, for verification of critical data, checksums are a good measure so long as they are done at the time of file creation, after verification that the files are good, but in light of the reported symptoms, I'd investigate the source of the problem first, and correct it. Then I'd make provisions for checksumming, in addition to regular file system health checks, before backing up those files and their checksums.

    Proceeding from a "bottom-up" point of view: For Windows-based systems, regardless of the file system in use (although I'd hope you'd be using NTFS), regular file system scans via CHKDSK are a must. The same applies to the file systems of other operating systems: Run whatever utilities are available to verify the integrity of the file system on each hard drive regularly.

    In addition, most hard drive manufacturers have utilities that you can download for free that will non-destructively scan the media for grown defects. These are typically available as ISOs: Make a CD, boot from it, and follow the instructions carefully, preferably after making a full, verified backup. Naturally, you'll have to know the manufacturer(s) of your hard drives.

    Once you've identified the cause of the corruption, and corrected it, then you can (and should) make provisions for checksums.

    But, there are other things that you can, and should check as well. Make sure that the AC power to your computer is sound from an electrical perspective and that the power available is sufficient for the load being placed upon it. Buy a good UPS if you don't have one already, and if you do have one, test it.

    Then, test the power supply in the computer to ensure that it is providing adequate power.

    Then test the memory in your computer.

    Then test the hard drives, both surface level and file system level.

    Hope this helps.
  • by Peter H.S. ( 38077 ) on Thursday May 22, 2008 @08:03PM (#23512912) Homepage
    Personally, I think the PDF files were dodgy from the beginning, and that the errors only show up when using newer-generation PDF-viewing software. That would explain why only very old files seem to be corrupted; a random system error or a systematic software problem would corrupt newer files too.

    Your suggestions are of course valid. It must be considered a high priority to find out whether the system is corrupting the files, or whether they were bad from the beginning.
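
    One rough way to tell the two cases apart, assuming old backups of the affected files still exist, is to compare an unreadable file byte for byte with its oldest backup copy. The helper below is a hypothetical sketch using Python's standard filecmp module; its name and return values are made up for illustration.

```python
import filecmp
from pathlib import Path


def compare_with_backup(live, backup):
    """Classify a live file against its backup copy.

    Returns "missing" if either file is absent, "identical" if the
    contents match byte for byte, and "differs" otherwise.
    """
    live, backup = Path(live), Path(backup)
    if not live.is_file() or not backup.is_file():
        return "missing"
    # shallow=False forces a content comparison instead of trusting
    # file size and modification time.
    if filecmp.cmp(live, backup, shallow=False):
        return "identical"
    return "differs"
```

    If a file that will not open is "identical" to a backup made years ago, it was already in that state when the backup was taken, which points at the original scan or save. A result of "differs" only shows that the file changed afterwards; whether that was corruption or a legitimate edit still has to be judged separately.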

