Digitizing Your Dead Trees?

Posted by Cliff on Thursday May 09, 2002 @04:46PM from the when-you-can't-carry-them-with-you dept.

smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"

"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.

What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"

This discussion has been archived. No new comments can be posted.

look online before you scan (Score:5, Informative)

by cheesyfru ( 99893 ) writes: on Thursday May 09, 2002 @04:48PM (#3492957) Homepage

You can find a wealth of PDF/PS/HTML/etc copies of computer texts online. Kazaa is a good place to start. Obviously, only download the books you have physical copies of. :-)

Go To Kinko's!!!! (Score:4, Informative)

by thedbp ( 443047 ) writes: on Thursday May 09, 2002 @04:49PM (#3492963)

Kinko's offers high-volume scan-to-PDF solutions ... at low volume, it is usually 10 to 25 cents per page plus the cost of the media to copy it to, but in large volume the cost can sometimes go down to 1 cent per page.

Call Kinko's. Ask for the Territory Representative. They'll help you out!!!

Safari is your friend (Score:5, Informative)

by Dredd13 ( 14750 ) writes: <dredd@megacity.org> on Thursday May 09, 2002 @04:50PM (#3492971) Homepage

If you're like me, a good chunk of your collection is ORA books... in which case, you should check out O'Reilly's Safari [oreilly.com], which is their online book offering. It also includes non-ORA books as well, actually.
Quite useful and handy.
D

Re:look online before you scan (Score:2, Informative)

by MisterBlister ( 539957 ) writes: on Thursday May 09, 2002 @04:50PM (#3492972) Homepage

Most of the stuff you find online is training stuff, like Learn Photoshop or Learn HTML in 21 days or whatever.
There's a dearth of available electronic copies of programming-type texts, except for those where the author/publisher creates their own version (like all of Bruce Eckel's books).

Talk to the project Gutenberg guy (Score:3, Informative)

by Anonymous Coward writes: on Thursday May 09, 2002 @04:53PM (#3492995)

Check out Project Gutenberg. I remember that they have a very nice how-to for scanning in texts.

Question is: Free or Not Free (Score:4, Informative)

by The Ape With No Name ( 213531 ) writes: on Thursday May 09, 2002 @04:53PM (#3492996) Homepage

You could scan it all into PDF/PS, but I am not sure about making it all into one document with free tools after that. Here is a go at a solution.
Adobe Acrobat (read: $$$$) does all of this and works well. But if you are a *nix person, you could pipe some ghostview tools together, put it all into LaTeX, then re-export it to PDF as a digital book. Scanning: look no further than an HP scanner. It doesn't even have to be HQ unless you need the diagrams to be photo quality. After that, burn it all to CD or, better, DVD.

I work in this field (Score:5, Informative)

by JeanBaptiste ( 537955 ) writes: on Thursday May 09, 2002 @04:54PM (#3493006)

My company is a document imaging systems reseller. The drawback to doing this is that it is expensive. We work with many different libraries and we sell them book scanners. They do lots of neat things, including not breaking the binding of the book during scanning, binding curve compensation, masking/centering, and so on. Most of these customers then take the TIFF images and upload them into a document imaging system, although you could easily make PDFs also.

<plug>
Let me recommend the PS7000 from Minolta (www.minolta.com); that is the book scanner we sell the most of.

If you are at all interested in document imaging, check out www.otg.com

And if you're in Minnesota, Wisconsin, or the Dakotas, check out my company's web site at www.mid-america.com
</plug>

check sane (Score:4, Informative)

by walt-sjc ( 145127 ) writes: on Thursday May 09, 2002 @04:54PM (#3493013)

Check the hardware list for SANE and then pick one of the fastest scanners you can afford. The DB on SANE's web site is your best bet. You will find that to get good scanning speed you will need SCSI, as USB is just too slow.

JPEG also sucks for this. JPEG is best for full-color images like photographs. You're better off using TIFF or PNG. Most OCR software will require TIFF. I don't know of any OCR software for Linux, although you might get some Windows app to work under WINE. TextBridge from Xerox isn't bad for the money.

We do this all the time at the office...... (Score:4, Informative)

by diorio ( 244324 ) writes: on Thursday May 09, 2002 @04:54PM (#3493014)

.....we use a Xerox DC265ST. This digital photocopier scans pages at 65 per minute and posts them to an in-house FTP server. It can scan at 300 or 600 DPI and you can apply OCR after the scans are done. The DC265 is a workhorse and there are about a million of them out there. The scan-back feature costs extra on the device, so not everyone spent the money on it....but about 1000 Kinko's have these in house, and a Kinko's with a good DTP department might actually even know how to use the feature. (Good Luck!)

Electronic format is nice for storage, but... (Score:2, Informative)

by delphin42 ( 556929 ) writes: on Thursday May 09, 2002 @04:57PM (#3493033) Homepage

if you are anything like the computer guys I know (myself included), you'd end up printing out portions of the text whenever you wanted to read them anyway!!!

already scanned (Score:2, Informative)

by Anonymous Coward writes: on Thursday May 09, 2002 @04:57PM (#3493040)

Yup. There is quite a lot already scanned. The best places to look are usenet (at alt.binaries.e-book, alt.binaries.e-book.technical, alt.binaries.e-books) and IRC at #bookwarez and #bookz on undernet, dalnet, and irc.nullus.net (and most likely other irc nets as well.)

You could try making a request in abeb, but the biggest selection in one place is irc. So as long as you are not scared by the interface, that is where I would look first.

I want both (Score:2, Informative)

by peterdaly ( 123554 ) writes: <petedaly@ix.RASPnetcom.com minus berry> on Thursday May 09, 2002 @05:02PM (#3493070)

O'Reilly has many of their Java books available on CD-ROM, although I only own the dead tree versions of the ones I have in that series.

On a regular basis, I haul 2188 pages' worth (I just added them up) of QUE's Using Java2 Standard Edition and Enterprise Edition between home and the office. (Speaking of which, go to the link in my .sig and buy some of my favorite books!) That's a lot of weight for two books, and I usually haul around a couple of smaller ones as well: O'Reilly's Perl book and their EJB 3rd edition.

Not only are all of these books heavy, but I have also yet to find an easy way to cart them around; they don't all fit right in any of my bags.

I want all of these books on CD-ROM, but not just CD-ROM. Half the books I have INCLUDED a CD-ROM; it just doesn't contain the text of the book. With O'Reilly, I'd buy the CD-ROM version, but I want the dead tree version too. I want to use the dead tree version at the office, and when I am working from home, I want to haul home the CDs. I don't think I should have to pay any more for it either. I bought the IP (in the property sense), and I am already paying the price for the wood slices, which includes a silver disk.

PUBLISHERS, GIVE ME THE BOOK ON THE CD TOO! I spend $100/month or so on tech books.

-Pete

Re:Safari is your friend (Score:5, Informative)

by Wanker ( 17907 ) writes: on Thursday May 09, 2002 @05:05PM (#3493091)

I'll second this: the O'Reilly Safari site is wonderful for anyone with a hoard of tech books.
I bet about half of your books are already online.
Also, for your compression you should NOT use JPEG. JPEG is optimized for smooth tones and will badly blur hard edges like text. On the other hand, JPEG performs relatively poorly at compressing large areas of the same color (i.e. white backgrounds.) [Note for the nit-pickers, both of these JPEG issues will be reduced/eliminated in JPEG2000.]
I scan documents to either compressed TIFF (tend to be large), PNG, or (*shudder* [unisys.com]) GIF.
From the Project Gutenberg "Making Etexts from Paper Originals" paper [promo.net]: (You can bet these guys know how to scan...)

A general rule is to store scanned images to JPEG and store computer-generated pictures (like diagrams etc.) to GIF. The exception is if you scan in grayscale, then use GIF. Never scan pictures as lineart. If acceptable from a file size perspective use the highest possible quality setting for JPEG.

I suggest never using JPEG. The quality loss for printed words is just terrible relative to the compression you get. Also, just substitute PNG for GIF and the above works.
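
As a rough illustration of why a lossless format suits mostly-white pages: PNG compresses with DEFLATE, so the sketch below runs Python's zlib over a synthetic page. The page dimensions, the black bars standing in for text, and the function names are all made up for illustration. A real scan has noise and will compress far less than this idealized page.

```python
import zlib

WIDTH, HEIGHT = 850, 1100  # hypothetical page, one byte per pixel


def synthetic_page() -> bytes:
    """White page with a black bar on 4 of every 20 rows, standing in for text."""
    white_row = bytes([255]) * WIDTH
    text_row = bytes([255]) * 100 + bytes([0]) * (WIDTH - 200) + bytes([255]) * 100
    rows = [text_row if y % 20 < 4 else white_row for y in range(HEIGHT)]
    return b"".join(rows)


def deflate_ratio(data: bytes) -> float:
    """Raw size divided by DEFLATE-compressed size."""
    return len(data) / len(zlib.compress(data, 9))


page = synthetic_page()
# Lossless: decompressing gives back exactly the bytes that went in.
assert zlib.decompress(zlib.compress(page, 9)) == page
print(f"raw {len(page)} bytes, about {deflate_ratio(page):.0f}:1 after DEFLATE")
```

The round-trip check is the point: every edge of every letter survives, which is what an OCR pass needs.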

Re:Go To Kinko's!!!! (Score:1, Informative)

by Anonymous Coward writes: on Thursday May 09, 2002 @05:09PM (#3493114)

They won't. I'm working at a K's right now, and company policy won't let us copy anything that's copyrighted without proper permission. Besides, hand-placing that many pages on a scanner bed would be horrendously time-consuming.

FAQ: Making Etexts from Paper Originals (Score:2, Informative)

by ancarett ( 221103 ) writes: on Thursday May 09, 2002 @05:10PM (#3493119)

Anders Borg [torget.se] wrote this FAQ [promo.net] from Project Gutenberg [promo.net]. Lots of field-tested advice there, such as a suggestion to scan at 300dpi or better.

Re:are you sure you want to do this? (Score:2, Informative)

by hgh ( 555266 ) writes: on Thursday May 09, 2002 @05:22PM (#3493204)

I've been wanting to do something similar for years, but with technical magazines, not books. But the sheer amount of manual labor involved has turned me off considerably (not to mention the thought of destroying the original source).

Dr. Dobb's (and I'm sure others) offers CDs full of all their articles from the past couple of years for a pretty good price (less than $100, I believe). They also offer collections of books on CD for about the cost of one original.

Just a thought,
hgh

Re:Talk to the project Gutenberg guy (Score:5, Informative)

by AyeRoxor! ( 471669 ) writes: on Thursday May 09, 2002 @05:25PM (#3493220) Journal

A quick search on Google turned up this site [promo.net], titled "Making Etexts from Paper Originals", which seems to be all you need...

If this doesn't get modded up as relevant above the heap, I'm killing myself :P

Re:are you sure you want to do this? (Score:4, Informative)

by Hallow ( 2706 ) writes: on Thursday May 09, 2002 @05:26PM (#3493236) Homepage

What he's probably looking for is something like PDF. You can leave the image on the front (i.e., it's what shows up in Acrobat Reader), and Adobe's OCR reads the document and indexes it for searches. The problem with this is, you wind up with big PDFs of poor quality.

Where I work we tried to turn a book that we no longer had an electronic copy of into PDF, keeping the images up front with OCR text behind, about 300 pages altogether. Even with max compression and the lowest acceptable DPI (300, I think), the PDF came out to 95MB. It didn't help that we scanned the book page by page and generated the PDF by hand on a slow HP general consumer model scanner, either (the initial PDF took over 120 hours to produce, with rescans and OCR'ing and everything).

We wound up taking the Acrobat OCR'd text (it was better than the off-the-shelf OCR package we had at the time) via the Adobe accessibility website, and fixing it up. It was a pretty big project.

We recently hired a document imaging company to PDF a lot of smaller historical documents for us, and that has worked out well. It's kind of pricey, but we also paid them to proof the ocr behind the images, and to hand adjust the images for appearance. It's worked out rather well.
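
For a sense of scale on that 95MB figure, here is a rough sketch of uncompressed page sizes at 300 DPI. US Letter (8.5 x 11 in) is an assumed page size, not something stated above, and the function name is just for illustration.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed page: US Letter, scanned at 300 DPI in three common modes.
for name, bpp in (("1-bit lineart", 1), ("8-bit grayscale", 8), ("24-bit color", 24)):
    size_mb = raw_page_bytes(8.5, 11, 300, bpp) / 1_000_000
    print(f"{name}: {size_mb:.1f} MB per page uncompressed")

# The PDF described above: 95 MB over about 300 pages.
print(f"reported PDF: {95 / 300:.2f} MB per page")
```

Under these assumptions the reported PDF works out to roughly a third of the raw 1-bit size per page, so the scan mode matters at least as much as the compression setting.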

Re:searchable text versus scanned images (Score:2, Informative)

by kalidasa ( 577403 ) writes: on Thursday May 09, 2002 @05:30PM (#3493256) Journal

Acrobat can do this. Just scan it in with Acrobat, then "capture text." Works well with good, clear fonts, and a straight scan (not crooked) from a good scanner, though there's like a 0.05% fail rate per character. Yes, I know that sucks, it's one error a page, but it's survivable.
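
The "one error a page" figure checks out if a page holds about 2000 characters. That page length and the 300-page book are assumptions here, not numbers from the scan itself.

```python
CHARS_PER_PAGE = 2000  # assumed typical page of a technical book
FAIL_RATE = 0.0005     # 0.05% per character, as quoted above


def expected_errors(chars: int, fail_rate: float) -> float:
    """Expected number of misrecognized characters among `chars` characters."""
    return chars * fail_rate


print(expected_errors(CHARS_PER_PAGE, FAIL_RATE))        # per page
print(expected_errors(CHARS_PER_PAGE * 300, FAIL_RATE))  # per hypothetical 300-page book
```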

Re:100 pounds? (Score:1, Informative)

by Anonymous Coward writes: on Thursday May 09, 2002 @05:31PM (#3493263)

You haven't seen fat until you see this [rm-f.net].

It's https for some reason. Like someone is going to steal the fat recipes or something...

4DigitalBooks 900 pages/hour - or do it yourself (Score:4, Informative)

by jukal ( 523582 ) writes: on Thursday May 09, 2002 @05:32PM (#3493268) Journal

I do not have any experience with their products, but the solution offered by this company [4digitalbooks.com] seems simple and functional. Their system consists of an apparatus that turns pages of your book automatically, scans, turns, scans, turns. The result you can naturally pass to OCR.

Now, if I were to digitize all my books, I would try to create the 4DigitalBooks kind of solution myself. The only tricky part is finding a cheap enough way to turn pages automatically [mit.edu]; see also Kris Mckenzie's automatic page turner [accesswave.ca]. Still, the best start is this document [uconn.edu], a proposal and overview on how to build an automatic page turner from parts; the total cost is $459.

Funny You should ask. (Score:3, Informative)

by Fapestniegd ( 34586 ) writes: <james@jameswhite . o rg> on Thursday May 09, 2002 @05:36PM (#3493298) Homepage

My current setup consists of:
4 HP scanners with ADF, ~$150 ea. (eBay)
4 Sparc LXs from a property control auction, $50
one flatbed scanner for covers and bad scans, $50 (eBay again)
Barebones system w/ SCSI from Compgeeks, $80 (NFS server)

Add an Amtren Device [amtren.com] (courtesy of the office) and away you go. I've found the best way to cut off the bindings is to use a box cutter and to use your previous cuts as a guide. I use several shell scripts to scan various types of books; it's amazing the page numbering schemes some publishers use.

With this setup I can scan approximately 2-3 college textbooks of 1000 pages each in grayscale, or 1 in color, in a 10-hour period (including checking for bad scans; SANE ain't perfect, so you'd better check 'em).

Also, JPG isn't very good for OCR. I store as PNG and convert a second set to JPG for web viewing. OCR under Linux isn't quite there yet (unless you want to pay through the nose), so I am archiving the PNGs to CD until it is. This also allows me to regenerate the JPGs if I lose a webserver disk. Add a nifty little ImageMagick web viewer and voilà! eBookshelf! Oh, and an NSM CD changer is nice to get to the CDs nearline. You can pick these up on eBay for $200-$400.

PNG vs JPEG (Score:1, Informative)

by Anonymous Coward writes: on Thursday May 09, 2002 @05:36PM (#3493299)

First, I'd use PNG (lossless) or Photoshop's format(lossless) over JPEG (lossy). PNG/PSD will be crisper and color pictures will not be degraded.

Second, I'd make them HTML/PDF instead of plain text. Mainly because then you can retain the fonts. (Of course, some of the OCR programs will do this for you if you want to save it as MS-Word file but that's another story. :-/ )

Third, a well-scanned book is just as easy to read as the book itself. Honest! ;-) But really, the problem I've run into is that the back of the pages tends to show through sometimes. You can help to alleviate this by rescanning the pages by hand. Place a blank piece of paper behind the page. This helps to make the page seem whiter to the scanner. If the paper is too bright, then use a darker colored piece of paper (like grey or black). This will help the scanner to tone down the bright white of the paper. Only trial and error can tell you what you will need on a book-by-book basis. This is because each publisher uses a different brand of paper.

Last, use an X-Acto knife to do the cutting and a good ruler with a metal edge. X-Acto + wooden ruler means lots of splinters, badly cut pages, and sore thumbs/fingers. :-) Plastic rulers can also lead to problems. Use a metal one! Save time! Save going to the doctor for stitches! Keep those hard-to-get-out red stains from appearing in your books!

Nuf Said!

Re:While you're scanning my books... (Score:3, Informative)

by toocoolforsocks ( 534962 ) writes: on Thursday May 09, 2002 @05:39PM (#3493333)

Actually, if you sign this little buls**t form they have under the counter, they can copy whatever you want. I should know, I work there.

Re:Somewhat on topic... Historical Papers (Score:2, Informative)

by ancarett ( 221103 ) writes: on Thursday May 09, 2002 @05:52PM (#3493411)

I highly suggest you consult an archivist or a librarian trained in archival management. Nineteenth-century paper products are notoriously fragile (a result of the switch from rag pulp to acidic, unstable wood pulp). If you don't have the facilities to store these properly, donating them to a local museum or archive is a wonderful idea.

The National Archives and Records Administration [nara.gov] has a FAQ [nara.gov]. Their advice on preserving family papers? --

Paper preservation requires proper storage and safe handling practices. Your family documents will last longer if they are stored in a stable environment, similar to that which we find comfortable for ourselves: 60-70 degrees F; 40-50% relative humidity (RH); with clean air and good circulation. High heat and moisture accelerate the chemical processes that result in embrittlement and discoloration to the paper. Damp environments may also result in mold growth and/or be conducive to pests that might use the documents for food or nesting material. Therefore, the central part of your home provides a safer storage environment than a hot attic or damp basement.

Light is also damaging to paper, especially that which contains high proportions of ultra violet, i.e., fluorescent and natural day light. The effects of light exposure are cumulative and irreversible; they promote chemical degradation in the paper and fade inks. It is not recommended to permanently display valuable documents for this reason. Color photocopies or photographs work well as surrogates.

Another place to look... (Score:2, Informative)

by zaren ( 204877 ) writes: <fishrocket@gmail.com> on Thursday May 09, 2002 @06:05PM (#3493472) Journal

is http://docs.rinet.ru:8080/ - I ran across this site a few years back. It almost looks like an online library for a Russian ISP's technical support staff.

They've got lots and lots of official books, all HTMLized a chapter or a section at a time. They're all a bit old or out of date, too. I know of one Perl book in particular there that was one edition behind what was being sold on the shelf at the time I saw it.

-----
Is Darwin an evolutionary OS? [cafepress.com]

Re:Somewhat on topic... Historical Papers (Score:3, Informative)

by Seanasy ( 21730 ) writes: on Thursday May 09, 2002 @06:23PM (#3493545)

If you really want to do it right, do it on film. Either pay someone or beg/borrow/steal a medium format camera and try to do it yourself. Film and archive quality prints will probably last longer than CDs and you can get good scans from the negatives if you want digital, too.

I believe libraries use uncompressed TIFF files for digital archives.

You might find some discussions of this on photo.net

Re:check sane (Score:3, Informative)

by josepha48 ( 13953 ) writes: on Thursday May 09, 2002 @06:37PM (#3493640) Journal

There is gocr (a.k.a. jocr): http://jocr.sourceforge.net/
There are also a few commercial ones. However, scan-to-text conversion needs at least 600dpi and is only going to have about 97% accuracy.
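
To see what 97% per-character accuracy could mean for a searchable index, here is a rough sketch. It assumes character errors are independent and that a word is only findable if every character in it is right. Both are simplifying assumptions, and the function name is made up for illustration.

```python
def word_intact_probability(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word is recognized correctly,
    assuming independent per-character errors (a simplification)."""
    return char_accuracy ** word_length


for length in (4, 8, 12):
    p = word_intact_probability(0.97, length)
    print(f"{length}-letter word intact: {p:.0%}")
```

Under these assumptions the odds fall off quickly with word length, and long technical terms are the ones most likely to go missing from an index.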

Re:look online before you scan (Score:3, Informative)

by jonbrewer ( 11894 ) writes: on Thursday May 09, 2002 @06:49PM (#3493702) Homepage

O'Reilly actually sells electronic editions of their books, so please buy them! You can also subscribe and read many of their books online. Also a good idea.

(I personally like my dead tree O'Reilly books, and will stick with them until I have a really hi-res lcd to read electronic versions with.)

Re:Safari is your friend (Score:1, Informative)

by Anonymous Coward writes: on Thursday May 09, 2002 @07:09PM (#3493786)

"Perhaps the original poster should subscribe to the O'Reilly books they've purchased (for a month) and then save each chapter locally."

I tried doing this... of course only to read while traveling and while I'm subscribing to that particular book. O'Reilly's 'spidering detection', although well intended, locked out my account multiple times... it took me weeks to get ahold of a rep via email. By the time I did, I was so fed up that I quit the service.

Don't get me wrong, the format of the books on Safari is great. Hyperlinked TOC and indices.... No search engine AFAIK though. Still, much better than you're going to get by OCR'ing them.

definition of "dearth" (Score:3, Informative)

by bcrowell ( 177657 ) writes: on Thursday May 09, 2002 @07:16PM (#3493827) Homepage

There are hundreds of them here [theassayer.org]. Very few are the kind of dopey software manuals you're referring to. Is that a "dearth?"

Use JBIG - not GIF (Score:3, Informative)

by mangu ( 126918 ) writes: on Thursday May 09, 2002 @09:36PM (#3494321)

For bi-level images, the standard to use is JBIG, which comes from an ISO group similar to those that created JPEG and MPEG.

It generates much smaller files than GIF for printed text, with none of the inconveniences of JPEG. Grey scale pictures come out reasonably well if done at 300 dpi, dithered.

I don't know exactly why JBIG never caught on like those other standards. There don't seem to be many JBIG programs around, but if you are handy with source code, there's jbigkit, a library for reading and writing JBIG files. I wrote my own software with that, and converted a half-ton of old magazines into a 20-pack caselogic of CD's.

Use DjVu (Score:1, Informative)

by Anonymous Coward writes: on Thursday May 09, 2002 @10:57PM (#3494531)

For scanned documents, nothing can beat DjVu [sourceforge.net]. Bitonal documents are 3 to 10 times smaller than with TIFF or PNG. Color documents are 5 to 10 times smaller than JPEG or PDF. There is a free online conversion service at Any2DjVu [djvuzone.org].

Re:You *need* to be aware of OpenDJVu (Score:2, Informative)

by Anonymous Coward writes: on Thursday May 09, 2002 @11:17PM (#3494592)

The open source implementation of DjVu is called DjVuLibre [sourceforge.net]. It includes a viewer and browser plug-in for Unix/X11 (with binaries for Linux, Irix, and Solaris).
There is a free online conversion server at Any2DjVu [djvuzone.org].
Info can be found at DjVuZone [djvuzone.org].

Re:Go To Kinko's!!!! (Score:2, Informative)

by 3Suns ( 250606 ) writes: on Friday May 10, 2002 @04:10AM (#3495371) Homepage

Don't bother.

If Kinko's does it like all the copy shops I've seen, the pdf's aren't real digitized texts, they're just the scans, in image format, on a pdf. Not exactly the best way to store a book of info.
