Hardware

Digitizing Your Dead Trees?

Posted by Cliff
from the when-you-can't-carry-them-with-you dept.
smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"

"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"
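
As a rough sketch of the "searchable index" part of the plan: once each page has been OCRed to text, a plain inverted index is enough to find which pages mention a term. The function names, the `(book, page)` keys, and the sample text below are all hypothetical, and the OCR step itself is assumed to have already happened.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")

def build_index(pages):
    """Map each word to the set of (book, page) locations containing it.

    `pages` is a dict of (book, page_number) -> OCR text for that page.
    """
    index = defaultdict(set)
    for location, text in pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(location)
    return index

def search(index, query):
    """Return the sorted locations that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

if __name__ == "__main__":
    # Hypothetical OCR output; the second page shows a badly garbled scan.
    sample = {
        ("example-book", 1): "Always unmount the hard disk before removal.",
        ("example-book", 2): "The h4rd dlsk must be fark dnf2 gib.",
    }
    idx = build_index(sample)
    print(search(idx, "hard disk"))
```

Garbled words simply fail to match, so an imperfect OCR pass still points you at the pages that did scan cleanly; the page images remain the thing you actually read.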


  • Great (Score:2, Insightful)

    by Quill_28 (553921) on Thursday May 09, 2002 @04:50PM (#3492975) Journal
    Now the booksellers will join with the entertainment industry. Next we will be seeing books that can't be scanned easily.

    Remember those passkeys for computer games in the 80's that were black on maroon paper? Or some dial thingy.

  • by fractalus (322043) on Thursday May 09, 2002 @04:52PM (#3492987) Homepage
    Most of my technical books contain vast quantities of useful information in charts, diagrams, and illustrations... which are far more of a challenge to OCR than mere printed text.

    I suspect that even were this sort of thing really possible, it's a major time investment. I have several dozen technical books I'd like to scan, each with four hundred or so pages... and I'm not sure I want to spend a week's vacation time doing it.

    And even were it done... there is just something comforting about having a nice printed book that I can set on the desk next to the computer and consult, without having to read it on the screen. Print still looks way better than monitors.
  • by alt.sex.fetish.jesus (542450) on Thursday May 09, 2002 @04:52PM (#3492991)
    I suppose this will be marked off-topic, since the poster is asking about digitization hardware. But whenever I see coworkers with tons of books on their desk shelf, I wonder to myself why they really need them. Do they actually have time to read them? Or are they more for show?

    Personally, I have about 3 books I consider _essential_, and I've read them cover to cover (mostly while in the crapper ;-) ). The rest of the time, I get what I need off the web or USENET.

    As far as I'm concerned, the most important quality in an engineer is not what you know but what search engine you use to look stuff up.
  • by cheesyfru (99893) on Thursday May 09, 2002 @04:54PM (#3493008) Homepage
    I've got 30+ O'Reilly books, Design Patterns, Stroustrup C++, etc. They're out there if you look long enough. LimeWire has also been a big help.
  • by SirWhoopass (108232) on Thursday May 09, 2002 @04:57PM (#3493030)
    Electronic manuals are great, particularly because of the ability to search them. I certainly use plenty of them.

    Personally, however, I still like printed manuals. Using an online manual means either reducing some windows or switching desktops. With a paper manual I can keep the screen exactly as it is. Higher resolution screens, or the use of multiple screens, are making online manuals much more useful (anyone remember what a pain in the ass it was to try and figure out something with only an online manual on a 640x480 screen?). Occasionally I still manage to fill two 1600x1200 screens with a bunch of stuff I want to keep visible while still reading the manual.

  • by deacon (40533) on Thursday May 09, 2002 @05:03PM (#3493073) Journal
    You are going to cut up thousands of dollars worth of your "essential" books?

    And put them into an inferior visual format you cannot read unless the computer is working and switched on?

    And you are going to spend about 100 hours to do this.. and the original books are going to be ruined.

    All this just so you don't have to make 3 trips to move your books?

    Mmmkayyy.. (backs away slowly)

    Have you ever heard of a dolly?

  • by SystemFork (578511) on Thursday May 09, 2002 @05:04PM (#3493081)
    Perhaps the original poster should subscribe to the O'Reilly books they've purchased (for a month) and then save each chapter locally. Even at Safari's upper subscription levels of $100/mo you get access to 200 books. There's no way you could get a quality scanner with a feeder and OCR software for less than $100. Re-inventing the wheel is instructive, but silly.
  • by binaryDigit (557647) on Thursday May 09, 2002 @05:05PM (#3493089)
    I think you may be underestimating the sheer enormity of your task. Getting sheets to all feed right (a little skew and you're skrewed) and in order (feeder issues: what happens when one page mis-scans or misfeeds, and can you go back and insert it into its proper location?), and handling front-to-back issues (though I would assume that decent scanning software would take care of this for you). Also, your plan to use JPEG might be problematic. OCR is finicky enough as it is; back when we were scanning documents we always used 300 DPI TIFF (using Group 3 or Group 4 lossless compression) to get the maximum accuracy rates from the OCR package we were using. And speaking of accuracy, keep in mind that OCR software that has a 97% accuracy rate will flub 3 out of every 100 words; in a book that might contain tens or hundreds of thousands of words, that is a whole lot of errors. Now it's been a few years (6-8) since I've done this kind of stuff, so who knows, maybe things are much better now?
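
    To put rough numbers on that 97% figure, here is a back-of-the-envelope sketch. It assumes accuracy is counted per word, as above (OCR accuracy is often quoted per character, which would make word-level errors more frequent). The function name and the 100,000-word, 400-page book are illustrative assumptions, not measurements.

```python
def expected_errors(word_count, accuracy):
    """Expected number of misread words at a given per-word accuracy."""
    return round(word_count * (1.0 - accuracy))

if __name__ == "__main__":
    words = 100_000   # hypothetical book length in words
    pages = 400       # hypothetical page count
    errors = expected_errors(words, 0.97)
    print(f"{errors} misread words, about {errors / pages:.1f} per page")
```

    Whether that is tolerable depends on the use: for a search index a miss just means one page does not turn up, but for text you intend to read it is a lot of proofreading.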

    I've been wanting to do something similar for years, but with technical magazines, not books. But the sheer amount of manual labor involved has turned me off considerably (not to mention the thought of destroying the original source).

    Keep in mind that this is such a common need that if it were pretty straightforward, much of it would be done already (perhaps someone out there has the time/hardware/software to have done some of this already?). Not to mention that with the web, much of the information contained in those books is now available online, which makes you wonder if it's really worth the time and effort, esp. considering that a great many technical books are obsolete two weeks before they hit the shelves.
  • by Dredd13 (14750) <dredd@megacity.org> on Thursday May 09, 2002 @05:08PM (#3493104) Homepage
    That's nice, but why would he want to pay a monthly fee to rent books he already owns?

    Because there's something very nice to having access to your 30-odd book collection from home, office, conference, at a job-site, etc. etc., without dragging along 40 pounds of books with you everywhere you go.

    It's a convenience you pay for. Considering how many ORA books many people pay for (and keep current as new editions come out), the annualized cost of simply subscribing and NOT buying the dead-tree version at all is very appealing to some folks, especially if their lifestyle has them wanting ready access to the material "from lots of different places".

  • by sphealey (2855) on Thursday May 09, 2002 @05:22PM (#3493195)
    I suppose this will be marked off-topic, since the poster is asking about digitization hardware. But whenever I see coworkers with tons of books on their desk shelf, I wonder to myself why they really need them. Do they actually have time to read them? Or are they more for show?
    Because once you have developed the skill of processing technical books/documentation, you can scan through them and pick up critical information rapidly - far faster than you could click through them as hypertext.

    Case in point: I recently took a position where I had to do some work with Oracle, which I had not used previously. After some skimming at B&N, I purchased 5 good texts. A lot of pages, but when you need to figure something out you can open 2 or 3 of them, mark multiple pages, and get the outline of what you need very quickly.

    sPh

  • by Waffle Iron (339739) on Thursday May 09, 2002 @05:44PM (#3493366)
    Do they actually have time to read them? Or are they more for show?

    Back before the Web when I was a hardware designer, books were a kind of currency that engineering salespeople used to entice you to meet with them. Each chip manufacturer printed stacks and stacks of data books covering their various product lines. They'd give these to the sales reps who would cart them in on dollies to hand out to the engineers who showed up to hear their latest pitch.

    In a way, huge bookshelves with hundreds of books were a status symbol, showing that you'd been around a while and a lot of people thought it was worthwhile to give you books. It was useful to have all of that info available, but few people actually used more than 1% of the data that was on their shelves.

    The instant the chip companies put their chip data on the web, all of those books became totally useless. Now I'm doing software, everything is online, and I can go for weeks on end without picking up a technical book.

    I do sometimes miss the office atmosphere you get from row after row of data books neatly segregated by the corporate logos and color schemes on their spines. It had an important look to it.

  • by spectecjr (31235) on Thursday May 09, 2002 @05:51PM (#3493407) Homepage
    Yeah, it really sucks having to pay for convenience, doesn't it? Everything should be free (beer) and handy and no company should ever prevent you from misusing a service they offer just because they have a right to.

    Personally, I subscribe to Safari, and I think it's great. I recognize that the 5 (maybe when you subscribed it was only 3, but now the bottom subscription level is 5) book limit and the "you can only change books once a month" provision and the anti-spidering technology was all to protect O'Reilly's considerable investment in their books and yet still allow me the convenience of reading and searching a selection of their books online.

    But yeah, it really sucks when a company tries hard to both cater to internet geeks *and* protect their investments. They should just post all their books online for free and allow me to write everything to my hard drive so I don't have to pay anymore.


    You're not paying for convenience.

    Since when did you fill your bookshelf with books that expired after a month? Or that you had to pay for continuously?

    Just sell me the E-Book version. ONCE. That's all I ask. Embed my name and address in there if you want; just let me buy the book as a file.

    Preferably, for the same price as the physical book, minus cost of printing / distribution / retailer markup.

    Simon
  • I've done this (Score:4, Insightful)

    by brad3378 (155304) on Thursday May 09, 2002 @06:01PM (#3493451)
    To do it, I purchased a used HP scanner with a 50 page Automatic Document Feeder (Search for ADF on Ebay).

    I started with the easiest books: those that could be removed from the binding. Scans go smoothly with the ADF, but it is not as easy as you might think. I find that I spend most of my time naming the files because the default naming convention is *01.jpg, *02.jpg, *03.jpg, etc.

    It is a problem for two reasons:

    1. Most of my books are double sided.
    2. My HP scanning software for Windows does not let me name files with a 2,4,6,8 or 1,3,5,7 format.

    Also, if a book contains more pages than the ADF holds, the first page scanned in each batch will still be named page 1.

    If I knew a little perl, I'd write a script to rename the files between scan batches.
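
    A renaming script along those lines might look like the sketch below, written in Python rather than Perl. It assumes each batch sits in its own directory, that the scanner's file names sort in scan order (zero-padded numbers), and that the backs are scanned by flipping the whole stack, so they come out in reverse. The function names, the `page` prefix, and the four-digit width are made up for illustration.

```python
import os

def plan_renames(filenames, first_page, step=2, reverse=False,
                 prefix="page", width=4):
    """Pair each scanned file with a page-numbered name.

    first_page: page number for the first file in (possibly reversed) order.
    step: 2 for single-sided passes over a double-sided book, 1 otherwise.
    reverse: True when the batch was scanned back to front (flipped stack).
    """
    ordered = sorted(filenames, reverse=reverse)
    plan = []
    for offset, old in enumerate(ordered):
        ext = os.path.splitext(old)[1]
        page = first_page + offset * step
        plan.append((old, f"{prefix}{page:0{width}d}{ext}"))
    return plan

def apply_renames(directory, plan):
    """Rename files on disk, refusing to overwrite an existing file."""
    for old, new in plan:
        target = os.path.join(directory, new)
        if os.path.exists(target):
            raise FileExistsError(target)
        os.rename(os.path.join(directory, old), target)

if __name__ == "__main__":
    fronts = ["scan01.jpg", "scan02.jpg", "scan03.jpg"]
    for old, new in plan_renames(fronts, first_page=1):
        print(old, "->", new)
```

    For a second batch from the same book, pass the next unused page number as `first_page`; for the flipped stack of backs, use `first_page=2, reverse=True`. Printing the plan before calling `apply_renames` is a cheap way to catch an off-by-one before any file is touched.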

    For scanning full bound textbooks, there are two main problems:

    Scanning the side of the page along the binding requires carefully holding downward pressure on the book to keep it near the scanner glass.

    You cannot scan the book using ADF, so you should expect to spend A LOT of time scanning.

    Do not even consider manually scanning hundreds of pages with a parallel port scanner. WAY WAY too slow. USB scanners are cheap now, and will usually scan as fast as the scanner mechanism can move (assuming black & white scans).

    Lastly, be realistic.
    Know how much time you'll need to invest.
    Rule of thumb: If you need to scan manually, expect to scan about 200 pages per hour at top speed. Is it worth investing six hours to scan that 1200 page book of yours? If money allows, I'd suggest purchasing a second book that you can afford to destroy. Cut the binding off with something like a jigsaw, then insert the pages into an ADF scanner. Hope this helps somebody.
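
    The rule of thumb is easy to turn into a quick estimate before committing to a whole shelf. This is only a sketch: the 200 pages per hour is the top-speed figure above, and the page counts in the example shelf are hypothetical.

```python
def scan_hours(pages, pages_per_hour=200):
    """Hours of manual scanning at a given sustained rate."""
    return pages / pages_per_hour

if __name__ == "__main__":
    print(scan_hours(1200))  # the 1200-page book from the rule of thumb

    # Hypothetical shelf: page counts for a handful of books.
    shelf = [1200, 400, 400, 650]
    print(sum(scan_hours(p) for p in shelf))
```

    Real throughput will be lower once page turning, rescans, and file naming are counted, so treat the result as a floor.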

  • And then... (Score:2, Insightful)

    by Pvt_Waldo (459439) on Thursday May 09, 2002 @08:40PM (#3494149)
    The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.


    And then 3 weeks after you chuck it, go "Damn, I can't read this page!" when you go to look up something and it says, "It is extremely important that you fark dnf2 gib oefll or else you will damage your hard disk."

    Stick with books. There's a reason why they are popular. They work really well. Besides, the trees are already dead so you're not doing them a favor. And you'll just have to kill more trees to get more books to scan more stuff.
  • You are insane (Score:4, Insightful)

    by labradore (26729) on Thursday May 09, 2002 @09:09PM (#3494236)
    Ask yourself this question:
    What is the oldest file that I have?
    and ask:
    What is the oldest useful file that I have?
    For most people their papers and books are much older than the data they keep and the paper version is always available and easy to read.

    You are much more likely to lose or corrupt your data if it is on a disk or a tape than if it is in a book. Your electronic version is going to be of much lesser quality than the books you had and you will have a lot of "adventures" getting your ebooks to be as easy to read as your paper books. What happens to your portable ebook when your reader runs out of batteries? Ebooks have failed because ... THEY SUCK. Let us all know how much time you wasted tweaking your ebook setup and worrying about how to make them sustainable. Also, please tell us when you go back to the store and buy new "dead trees" copies of the ones you destroyed.
