Hardware

Digitizing Your Dead Trees?

smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"

"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"

This discussion has been archived. No new comments can be posted.

  • Re:Go To Kinko's!!!! (Score:4, Interesting)

    by Microsift ( 223381 ) on Thursday May 09, 2002 @04:53PM (#3493003)
    I seriously doubt Kinko's would do this. They are ultra-paranoid about violating copyright. I imagine if you could do it at Kinko's, you'd have to do all the work yourself in the Self-Service area. I doubt they have machines like that in self-service.
  • Essential? (Score:4, Interesting)

    by daeley ( 126313 ) on Thursday May 09, 2002 @04:53PM (#3493005) Homepage
    If they're that 'essential', how can you justify cutting them up? (A hundred pounds of tech books is, what, three or four books? ;)

    Maybe you could donate the bulk of them to a school or something, follow the other suggestions about downloading fair-use versions where possible, digitize the few remaining ones, and start using ebooks or Safari [oreilly.com] (or similar) exclusively from now on.
  • Try one of these... (Score:3, Interesting)

    by matthew.thompson ( 44814 ) <matt@acERDOStuality.co.uk minus math_god> on Thursday May 09, 2002 @04:56PM (#3493023) Journal
    Canon DR-5020 [canon.com]

    Canon's 90ppm high speed scanner - only problem with high speed scanning is that they need loose leaves. Any decent books you have and want to copy will need a Stanley knife taking to the spine.

    Please remember to make decent backups on a long-lasting medium with a high chance of recoverability. Failing that, place the loose-leaf versions with a document recovery firm and take their insurance for the full purchase value of the originals.

  • by Anonymous Coward on Thursday May 09, 2002 @05:00PM (#3493053)
    Think about it.

    People love books in dead tree format for the most part. You don't really want to curl up with a cup of coffee and a nice monitor. No, you want some good old dead tree.

    But when you're coding, you don't want to curl up with a cup of coffee. You want to sit in a chair and hammer out code while quaffing coffee as if it were, well, coffee.

    Most of the time when I look through books for reference, it's annoying. I'd rather be able to just grep for info.

    Thankfully, at least O'Reilly's catching on to this. :)
  • by veggiespam ( 5283 ) on Thursday May 09, 2002 @05:03PM (#3493075)
    Schools for the blind have been doing this for years, especially with technical books. Many of my V.I. friends would remove the binding and feed the pages through a high-speed sheet feeder to a scanner. Then the books are proofed by sighted people for OCR perfection. Contact your local school and ask if they already have some of your works in pdf/jpeg/tiff/WordPerfect (yes, lots of WordPerfect). They may be willing to give you some legal copies of your books in exchange for you converting some of the books you have that they don't into blind-readable format (which means you'd have to proof your own book for accuracy, but you're doing that anyway). Basically, you're donating your time for a good cause and benefiting yourself.
  • by ComputerSlicer23 ( 516509 ) on Thursday May 09, 2002 @05:04PM (#3493083)
    All depends. I have probably 8 C++ books that have lots of different useful information in them. Really, I probably only need 3 of them: the ISO standard (yes, I own a copy), Stroustrup's C++ Language and Josuttis's book (big black book, can't remember the title).

    I own probably 500 computer books that completely cover a 6ft by 6ft section of my wall. No, I haven't read all of them, but I have read 80% of them cover to cover, and I know the table of contents of the rest. It's generally very useful to keep lots of reference material "grey matter indexed". That is, I know which book to find it in and roughly where it is in the book. I have found on-line documentation to be of very low quality personally, and I like to peruse my books when I don't have a computer handy.

    The other consideration is it is nice to know the documentation isn't going to change, or move, or do anything weird. Of course it isn't going to get updated either so, cuts both ways.

  • PDF and OCR (Score:2, Interesting)

    by 4/3PI*R^3 ( 102276 ) on Thursday May 09, 2002 @05:08PM (#3493105)
    If you really want to go through all this effort use both PDF and OCR.
    OCR sucks royally for large documents, documents with images or diagrams, handwritten comments, etc. However scanning the pages to an image and then creating a PDF of the images does not care about any of that.
    So, scan all of your books as images that your OCR software can process. Use the OCR output to create an index of pages. If a specific word on a specific page doesn't OCR well who cares. With typed and professionally printed books your OCR software should be about 90% accurate. Take the images and create PDF files.
    Now you have your nice clean images but you still have a searchable index. BTW, when you get this done post your procedures, problems, and solutions to a web site somewhere so that you can share your experiences with the rest of the world.
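    The "index of pages" idea above could be sketched roughly like this in Python. This is only an illustration under assumptions: the OCR text is taken to be already available as one string per page, and the names build_index, lookup and sample_pages are made up for the example, not part of any OCR package.

```python
import re
from collections import defaultdict

# Words are runs of lowercase letters and digits; OCR misreads such as
# "temp1ate" are indexed exactly as they were recognized.
WORD_RE = re.compile(r"[a-z0-9]+")


def build_index(pages):
    """Map each lowercased word to the sorted page numbers that contain it.

    `pages` is a dict of {page_number: ocr_text}. How the text gets there
    (which OCR program, which file layout) is deliberately left open.
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in WORD_RE.findall(text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def lookup(index, word):
    """Return the pages whose OCR text contained `word` (empty list if none)."""
    return index.get(word.lower(), [])


# Hypothetical OCR output for three pages; page 42 has a misread word.
sample_pages = {
    41: "Templates let you write generic code.",
    42: "A temp1ate can take type parameters.",
    43: "Function templates and class templates differ.",
}
index = build_index(sample_pages)
print(lookup(index, "templates"))
```

    A misread word like the one on page 42 simply drops out of the search results, but the page image is untouched, so nothing is lost when you actually read the page.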
  • by Embedded Geek ( 532893 ) on Thursday May 09, 2002 @05:12PM (#3493135) Homepage
    My father passed on Sunday and we were going through all the family papers. We have lots of original documents from my family during the Civil War and earlier. My sister and I were thinking of donating them to a museum, so there would be no risk of their loss should my house get damaged (there's way too many documents to fit in my fire safe).

    Before doing this, though, we were thinking of scanning/copying all the documents to keep copies for ourselves. In doing so, though, we could use some advice:

    What special steps must we take in scanning 150+ year old documents, some very yellowed and fragile?

    What is the best format in which to store them (assuming we want them easily readable in 20+ years for our kids)?

    What is the best media upon which to store the data (again, hoping for readability in 20+ years)? (I'm thinking online storage to allow easy conversion to the media of the moment, but I still want something to stash in the safe deposit box)

    Does anyone have experience with digital preservation/restoration of archival documents? Should I just try cleaning them up in Photoshop or should I find a pro to help out? Maybe I can make it a term of the donation to the museum/library, for that matter.

    Thanks in advance for your advice.

  • by itsdave ( 105030 ) on Thursday May 09, 2002 @05:30PM (#3493258)
    I subscribed to the safari club shortly after they announced it and I was not pleased.

    For starters, I could only have access to three books at any given time. I decided to just choose 3 books right when I signed up, and later decided I wanted to trade one of the books in for another, which they allowed me to do just fine. However, I then decided I wanted to check out another book and it said, sorry, you can only switch a selection once per month. Oh, isn't that handy. So do you really have access to all the books no matter where you are? No, you only get access to a few. Then I thought it would be nice if I could save a local copy and put it in a nice searchable database. No way: they stopped me in my tracks for turning the pages too fast because they detected that I was a spider.

    thanks oreilly, I love your books but you can keep your safari club.
  • by Effugas ( 2378 ) on Thursday May 09, 2002 @06:35PM (#3493625) Homepage
    Run, don't walk, to http://djvu.research.att.com/home.html . DjVu is an image-based competitor to PDF that is a feat of beautiful engineering -- 300 DPI scans break down to about 10-30K a page, the viewer is about an order of magnitude faster than PDF, the format cleanly supports separate encoding of page texture/graphics vs. page text, there's a significant amount of open source for it, and more.

    It's truly a brilliant format. Go check it out.
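
    To put the 10-30K figure in perspective, here is a back-of-the-envelope calculation in Python. The page size (8.5 x 11 inches), 8-bit greyscale, the 400-page book and reading "K" as 1000 bytes are all assumptions for illustration; only the 300 DPI and the 10-30K per page come from the claim above, and that claim is taken at face value rather than measured.

```python
DPI = 300                 # from the comment above
PAGE_WIDTH_IN = 8.5       # assumed page width in inches
PAGE_HEIGHT_IN = 11       # assumed page height in inches
PAGES = 400               # assumed length of one technical book

# Uncompressed 8-bit greyscale: one byte per pixel.
raw_bytes_per_page = int(PAGE_WIDTH_IN * DPI) * int(PAGE_HEIGHT_IN * DPI)
raw_book_mb = raw_bytes_per_page * PAGES / 1e6

# Claimed DjVu size per page, reading "K" as 1000 bytes.
djvu_low_mb = 10_000 * PAGES / 1e6
djvu_high_mb = 30_000 * PAGES / 1e6

print(f"raw page:  {raw_bytes_per_page:,} bytes")
print(f"raw book:  {raw_book_mb:,.0f} MB")
print(f"DjVu book: {djvu_low_mb:.0f}-{djvu_high_mb:.0f} MB")
```

    Under those assumptions a whole book fits in roughly the space of one or two uncompressed page scans.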

    Yours Truly,

    Dan Kaminsky
    DoxPara Research
    http://www.doxpara.com
  • by savetz ( 201597 ) on Thursday May 09, 2002 @07:54PM (#3493985) Homepage
    I have scanned several books (in my case, Atari and other classic computing books) for atariarchives.org [atariarchives.org]. The process takes time, but is worth it.

    A scanner with a reliable sheet feeder is essential. This doesn't necessarily mean expensive -- I've seen a lot of reasonable-looking scanners with ADFs on ebay for less than $100.

    I cut the pages off the books using a single-edge razor blade -- non-ragged cuts are essential. Then I scan them into TIFF format at 300 DPI, greyscale. If I want searchable PDFs, I use OmniPage X on a Mac to create image-over-text PDF; it's quick and easy.

    But most of the time, these books are for Web viewing. So I use a graphics conversion program with batch capability (GraphicConverter on the Mac) to a) increase the contrast dramatically -- near 100%; b) trim the whitespace from the edge of the images; c) scale the pages as necessary; d) scale them more to create thumbnail versions. [atariarchives.org]

    There are no hard-and-fast rules for choosing the final file type. Just got to balance file size and readability, and this varies from book to book. Sometimes I go with JPEG, sometimes 8-bit GIF, and sometimes 4-bit GIF. Sometimes I'll convert every page to GIF and also to JPG, then use a little script to select the smallest one for each page.
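    The "little script" is not shown, so the following Python is only a guess at what such a script might look like. It assumes both conversions of every page already sit in one directory under names like page001.gif and page001.jpg; pick_smallest and the directory names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def pick_smallest(source_dir, dest_dir):
    """Copy the smallest GIF/JPG candidate for each page into dest_dir.

    Files are grouped by stem (page001.gif and page001.jpg are the same
    page). On a size tie the first file in sorted order is kept.
    Returns {stem: chosen file name}.
    """
    source = Path(source_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    smallest = {}
    for path in sorted(source.iterdir()):
        if path.suffix.lower() not in {".gif", ".jpg"}:
            continue
        best = smallest.get(path.stem)
        if best is None or path.stat().st_size < best.stat().st_size:
            smallest[path.stem] = path

    for path in smallest.values():
        shutil.copy2(path, dest / path.name)
    return {stem: path.name for stem, path in smallest.items()}


# Demo with dummy files standing in for real converted pages.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "converted"
    src.mkdir()
    (src / "page001.gif").write_bytes(b"x" * 300)
    (src / "page001.jpg").write_bytes(b"x" * 500)
    (src / "page002.gif").write_bytes(b"x" * 900)
    (src / "page002.jpg").write_bytes(b"x" * 400)
    chosen = pick_smallest(src, Path(tmp) / "web")
    print(chosen)
```

    File size is the only criterion here; readability still has to be judged by eye, as described above.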
  • by barfy ( 256323 ) on Thursday May 09, 2002 @08:02PM (#3494019)
    The digital representation of the "copyrighted" work as existed in a "page layout" program, using a technological means to prevent digital copying: Imaged to paper using digitally created "Plates".

    By attempting to "recreate" the digital representation by using technological means to defeat the digital copy protection of a bound book, you are criminally liable to the owner of the copyright.

    (Now if you were just copying this to another piece of paper, you may be ok under existing laws. But moving it to digital... Um, hands up scofflaw!)

  • dead trees to CDs (Score:2, Interesting)

    by Simonetta ( 207550 ) on Thursday May 09, 2002 @10:07PM (#3494402)
    I am also faced with the task of converting thousands of pages from paper to text files. I suggest looking into using a high resolution digital camera in a custom docking station above a flat surface that holds the printed material. (a photo enlarger comes to mind). Then instead of waiting for the scanner carriage to pass downward over the page, you can take a snapshot of the page.
    Send the image directly from the camera to the OCR program. I find that the Xerox TextBridge program can do OCR on a page almost as fast as I could turn the page were I not using a scanner to input the text. TextBridge is quite awkward to use and not very customizable for new types of applications such as this.
    Using a high resolution digital camera to input OCR text is also a good way to get around the question of whether or not to cut off the binding of the book.
    By the way, I assume that you're wishing to scan European-language text. Doing OCR on Japanese, Chinese, or Korean is, I would assume, much slower than recognizing ASCII. Does anyone know of an available program that will do OCR on Chinese?
    With our friends in the middle east obsessed with blowing the shit out of us, it might be time to develop an open-source program that will do OCR on Arabic and Farsi, along with a translation program companion. Would Arabic be much more difficult to OCR because all of the phonetic symbols are joined together? I sometimes wonder about these things when I'm bumming about not having a life.
  • by Lord Vipor Scorpion ( 218440 ) on Friday May 10, 2002 @12:38AM (#3494868)
    I was locked out because of their spidering filter, too. But I called up at like eight o'clock one night & someone unlocked it for me (& set it so that it wouldn't happen again).

    Safari also has a very good search engine, although it's weird that they coded it in MS ASP.

    The spidering filter seems intent on inhibiting the casual copier. I thought this was lame, but there's actually a certain logic to it. If you go to all the trouble to download & reassemble the books, then you've put enough work into it not to just throw the book out there on Gnutella.

    At its most expensive, Safari books cost $2 per month. So I'm not impeding anyone's education, and I'd like to see this service stick around. In fact, I can save people a bundle if I get them to use it the way it's meant to be used.

    The one lame thing is that O'Reilly pads their selection with multiple editions of the same book and also with books that are available for free on the openbook site--ok, that's like five books, but still... They're really starting to get a good selection now.

    In college, I used a free (as in stolen beer) html copy of a textbook for a class, and realized at the end of the year that someone had purposefully altered the book so that a lot of information was horribly incorrect. They'd basically cut out the word "not" all through the book, and inserted it after "is" in other places. Most people would not do that, but some a-hole did. Ah, college, what a hellhole.
