
Digitizing Your Dead Trees?
smart2000 asks: "I'm tired of lugging around dead trees. I've just moved offices and had to move over 100 pounds of 'essential' technical books. It is clear to me that the dead tree industry is never going to supply the books I want in electronic form, so it's time to do it myself. What hardware and software should I use?"
"The Plan: Take the binding of each book and cut it off. Feed into a scanner with duplex and cut-sheet feeder. Scan as a 300 DPI jpeg with compression. Then OCR them overnight. I don't expect the OCR to be perfect, just good enough to use as a searchable index.
What are the suitable scanner choices for Linux? Any recommendations for OCR software that will write in an open format? Has anyone done this before?"
look online before you scan (Score:5, Informative)
Re:look online before you scan (Score:2, Informative)
There's a dearth of available electronic copies of programming-type texts, except for those where the author/publisher creates their own version (like all of Bruce Eckel's books).
Re:look online before you scan (Score:2, Insightful)
Re:look online before you scan (Score:3, Informative)
(I personally like my dead tree O'Reilly books, and will stick with them until I have a really hi-res lcd to read electronic versions with.)
definition of "dearth" (Score:3, Informative)
already scanned (Score:2, Informative)
You could try making a request in abeb, but the biggest selection in one place is irc. So as long as you are not scared by the interface, that is where I would look first.
Re:look online before you scan (Score:2)
'cause Elcomsoft thought they could do the same (minus the scanning part) and they were wrong [slashdot.org]. I don't think you need to copy an electronic version to be a pirate. You can scan a paper copy and become one.
But then again, IANAL...
Another place to look... (Score:2, Informative)
They've got lots and lots of official books, all HTMLized a chapter or a section at a time. They're all a bit old or out of date, too - I know of one Perl book in particular that they have there was one edition behind what was being sold on the shelf at the time I saw it.
-----
Is Darwin an evolutionary OS? [cafepress.com]
Re:look online before you scan (Score:2)
"A wealth" of ebooks? Yeah right. If you're a total freakin' nerd. There's 1) Programming books 2) Sci Fi and Fiction (only from the most popular/oldest authors including Harry Potter) and 3) How to get laid for Dummies (No joke). And there's absolutely nothing in Spanish (which is a thing of mine since I live here in Spain and want stuff to practice on).
I've thought of doing EXACTLY what this guy is doing. I hope there's some good advice... I can't wait until ebooks are as popular on Gnutella as MP3s.
-Russ
Re:look online before you scan (Score:2)
anyone got the isbn?
An easier solution. (Score:4, Funny)
Go To Kinko's!!!! (Score:4, Informative)
Call Kinko's. Ask for the Territory Representative. They'll help you out!!!
Re:Go To Kinko's!!!! (Score:4, Interesting)
Re:Go To Kinko's!!!! (Score:2)
While you're scanning my books... (Score:2)
I just wanna be able to look at the dollar bills on my computer instead of having to carry them with me. Is that so bad?
Re:While you're scanning my books... (Score:3, Informative)
monkeys (Score:4, Funny)
Re:monkeys (Score:2, Funny)
Free the monkeys! (Score:4, Funny)
cat
Safari is your friend (Score:5, Informative)
Quite useful and handy.
D
Re:Safari is your friend (Score:2)
That being said, the $9.99/month (or so) would probably be worth it, considering all the work tearing apart and OCRing all the books would take, just to get somewhat inaccurate digital versions.
Re:Safari is your friend (Score:2)
That is, access them on their web site. You can put them on your own private webspace, on a CD, etc. It's no different than mixing your own music CDs from CDs you legally own.
But yes, O'Reilly's fees are much less than what you'll pay to scan it all yourself.
Re:Safari is your friend (Score:2, Insightful)
Re:Safari is your friend (Score:5, Informative)
I bet about half of your books are already online.
Also, for your compression you should NOT use JPEG. JPEG is optimized for smooth tones and will badly blur hard edges like text. On the other hand, JPEG performs relatively poorly at compressing large areas of the same color (i.e. white backgrounds.) [Note for the nit-pickers, both of these JPEG issues will be reduced/eliminated in JPEG2000.]
I scan documents to either compressed TIFF (tend to be large), PNG, or (*shudder* [unisys.com]) GIF.
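To get a feel for why lossless formats cope so well with mostly-white pages, here is a rough sketch using Python's zlib, which implements DEFLATE (the same compression scheme PNG uses). The page size, the 300 DPI figure, and the fake "text" pattern are all assumptions for illustration; a synthetic page like this is far more regular than a real scan, so it overstates the gain.

```python
import zlib

# Assumed page: 8.5 x 11 inches at 300 DPI, 1 bit per pixel (bilevel).
WIDTH, HEIGHT = 2550, 3300
BYTES_PER_ROW = (WIDTH + 7) // 8  # pixels packed 8 to a byte


def synthetic_page() -> bytes:
    """Build a fake bilevel page: white background with regular dark runs."""
    white_row = bytes([0xFF]) * BYTES_PER_ROW
    # A crude stand-in for a row of text: every 7th byte is black.
    text_row = bytes(0x00 if i % 7 == 0 else 0xFF for i in range(BYTES_PER_ROW))
    rows = []
    for y in range(HEIGHT):
        # Pretend each 50-row band holds 20 rows of text and 30 rows of leading.
        rows.append(text_row if (y % 50) < 20 else white_row)
    return b"".join(rows)


raw = synthetic_page()
packed = zlib.compress(raw, 9)  # lossless: decompressing gives back every bit
print(f"raw bytes: {len(raw)}, compressed bytes: {len(packed)}")
```

Since the round trip is exact, the hard edges of the text survive untouched, which is the property JPEG gives up.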
From the Project Gutenberg "Making Etexts from Paper Originals" paper [promo.net]: (You can bet these guys know how to scan...)
I suggest never using JPEG. The quality loss for printed words is just terrible relative to the compression you get. Also, just substitute PNG for GIF and the above works.
Re:Safari is your friend (Score:2)
And the tragedy is, the National Geographic Magazine collection on CD-ROM consists entirely of JPEG pictures of the pages (well, plus some (Win/Mac) indexing software). Okay, the photos are probably what attracts most people to National G, but the articles are damn hard to read.
The folks (Tinker's Guild [tinkersguild.com]) that did the complete collection of The Amateur Scientist columns from Scientific American (admittedly a less ambitious undertaking than National Geo.) converted all the articles to HTML (illustrations in GIF). And the indexing software is in Java. Kudos to them.
Use JBIG - not GIF (Score:3, Informative)
It generates much smaller files than GIF for printed text, with none of the inconveniences of JPEG. Grey scale pictures come reasonably well, if done at 300 dpi, dithered.
I don't know exactly why JBIG never caught on like those other standards. There don't seem to be many JBIG programs around, but, if you are handy with source code, there's jbigkit, a library for reading and writing JBIG files. I wrote my own software with that, and converted a half-ton of old magazines into a 20-pack caselogic of CDs.
Re:Safari is your friend (Score:2, Interesting)
for starters, I could only have access to three books at any given time. I decided to just choose 3 books right when I signed up and later decided I wanted to trade one of the books in for another, which they allowed me to do just fine. However, I then decided I wanted to check out another book and it said, sorry, you can only switch a selection once per month.. oh, isn't that handy, so
thanks oreilly, I love your books but you can keep your safari club.
Re:Safari is your friend (Score:2, Insightful)
Personally, I subscribe to Safari, and I think it's great. I recognize that the 5 (maybe when you subscribed it was only 3, but now the bottom subscription level is 5) book limit and the "you can only change books once a month" provision and the anti-spidering technology was all to protect O'Reilly's considerable investment in their books and yet still allow me the convenience of reading and searching a selection of their books online.
But yeah, it really sucks when a company tries hard to both cater to internet geeks *and* protect their investments. They should just post all their books online for free and allow me to write everything to my hard drive so I don't have to pay anymore.
You're not paying for convenience.
Since when did you fill your bookshelf with books that expired after a month. Or that you had to pay for continuously?
Just sell me the E-Book version. ONCE. That's all I ask. Embed my name and address in there if you want; just let me buy the book as a file.
Preferably, for the same price as the physical book, minus cost of printing / distribution / retailer markup.
Simon
Re:Safari is your friend (Score:4, Insightful)
Because there's something very nice to having access to your 30-odd book collection from home, office, conference, at a job-site, etc. etc., without dragging along 40 pounds of books with you everywhere you go.
It's a convenience you pay for. Considering how many ORA books many people pay for (and keep current as new editions come out), the annualized cost of simply subscribing and NOT buying the dead-tree version at all is very appealing to some folks, especially if their lifestyle has them wanting ready access to the material "from lots of different places".
Re:Safari is your friend (Score:2)
Comment removed (Score:5, Funny)
Re:As Krow always says... (Score:2)
Great (Score:2, Insightful)
Remember those passkeys for computer games in the 80's that were black on maroon paper? Or some dial thingy.
Re:Great (Score:3, Funny)
Re:Great (Score:2)
Re:Great (Score:2)
Even back then, every photocopier I ever tried it on could adjust the contrast so that they could be copied legibly.
I also remember trading copied templates of the dials that you could cut out and assemble.
100 pounds? (Score:5, Funny)
Re:100 pounds? (Score:5, Funny)
Girl? On Slashdot?
Woah!
Re:100 pounds? (Score:5, Funny)
To the best of my knowledge, Jesus was not a 12 year old girl.
Re:100 pounds? (Score:2)
You're mad, surely? (Score:2, Insightful)
I suspect that even were this sort of thing really possible, it's a major time investment. I have several dozen technical books I'd like to scan, each with four hundred or so pages... and I'm not sure I want to spend a week's vacation time doing it.
And even were it done... there is just something comforting about having a nice printed book that I can set on the desk next to the computer and consult, without having to read it on the screen. Print still looks way better than monitors.
Re:You're mad, surely? (Score:2)
Do you really need them? (Score:4, Insightful)
Personally, I have about 3 books I consider _essential_, and I've read them cover to cover (mostly while in the crapper
As far as I'm concerned, the most important quality in an engineer is not what you know but what search engine you use to look stuff up.
Re:Do you really need them? (Score:2, Interesting)
I own probably 500 computer books that completely cover a 6ft by 6ft section on my wall. No I haven't read all of them, but I have read 80% of them cover to cover, and I know the table of contents on the rest of the books. It's generally very useful to keep lots of reference material "grey matter indexed". That is, I know which book to find it in and roughly where it is in the book. I have found on-line documentation to be of very low quality personally, and I like to peruse it when I don't have a computer handy
The other consideration is it is nice to know the documentation isn't going to change, or move, or do anything weird. Of course it isn't going to get updated either so, cuts both ways.
Re:Do you really need them? (Score:4, Insightful)
Case in point: I recently took a position where I had to do some work with Oracle, which I had not used previously. After some skimming at B&N, I purchased 5 good texts. A lot of pages, but when you need to figure something out you can open 2 or 3 of them, mark multiple pages, and get the outline of what you need very quickly.
sPh
Re:Do you really need them? (Score:2)
.... at least, until you develop a comparable skill with hypertext. The manner of reading is different but not necessarily inferior. Why does everyone assume that what we've used simply due to technical limits will actually prove to be superior in a new context? You can't grep books -- that already limits them.
Re:Do you really need them? (Score:2)
So the answer is yes, I really need them. And I bet the original poster does too. And see, that's the hard part. He can scan and download and so forth all he likes, but finding a good index replacement is not going to be so easy.
Re:Do you really need them? (Score:5, Insightful)
Back before the Web when I was a hardware designer, books were a kind of currency that engineering salespeople used to entice you to meet with them. Each chip manufacturer printed stacks and stacks of data books covering their various product lines. They'd give these to the sales reps who would cart them in on dollies to hand out to the engineers who showed up to hear their latest pitch.
In a way, huge bookshelves with hundreds of books were a status symbol, showing that you'd been around a while and a lot of people thought it was worthwhile to give you books. It was useful to have all of that info available, but few people actually used more than 1% of the data that was on their shelves.
The instant the chip companies put their chip data on the web, all of those books became totally useless. Now I'm doing software, everything is online, and I can go for weeks on end without picking up a technical book.
I do sometimes miss the office atmosphere you get from row after row of data books neatly segregated by the corporate logos and color schemes on their spines. It had an important look to it.
Re:Do you really need them? (Red Rubber Ball) (Score:2)
Half of the library in my office is catalogs and equipment data sheets for components. A lot of the rest is more generalized data like stress concentration factors for various object geometries and material characteristics; these are things that CANNOT be derived from theory. Only about 4 of my books (which, admittedly, I do use a great deal) are theoretical books. Physics, Advanced Math, Design of Experiments, and a Mech. Eng. Handbook. When you work with real objects, rather than just theory and pure numbers, you tend to need a lot more detailed reference materials. And I'm sure that at least one Engineer in the red rubber ball industry has himself a Red Rubber Ball Table.
Yes. (Score:2)
Re:Do you really need them? (Score:2)
If worst comes to worst (Score:2, Funny)
Re:If worst comes to worst (Score:2)
When you need to look something up, you scan it as you go, almost like highlighting the text with a bright yellow marker.
Let's face it -- with most of these 500-page behemoth books, you often only use small chunks of them, especially when you're talking about using them as reference tools well after your first or second read. This way you wouldn't be wasting time, energy, electricity and disk space with all of the voluminous words you don't really need.
I think this could be the best advice I've seen so far.
Re:If worst comes to worst (Score:2)
Re:If worst comes to worst (Score:2)
Talk to the project Gutenberg guy (Score:3, Informative)
Re:Talk to the project Gutenberg guy (Score:5, Informative)
Paper Originals", and seems to be all you need...
If this doesn't get modded up as relevant above the heap, I'm killing myself
Question is: Free or Not Free (Score:4, Informative)
Adobe Acrobat (read $$$$) does all of this and works well. But if you are a *nix person you could pipe some ghostview tools together and put it all into LaTeX, then re-export it as a digital book in PDF. Scanning: look no further than an HP scanner. It doesn't even have to be HQ unless you need the diagrams to be photo quality. After that burn it all to CD or, better, DVD.
Essential? (Score:4, Interesting)
Maybe you could donate the bulk of them to a school or something, follow the other suggestions about downloading fair-use versions where possible, digitize the few remaining ones, and start using ebooks or Safari [oreilly.com] (or similar) exclusively from now on.
I work in this field (Score:5, Informative)
<plug>
Let me recommend the PS7000 from minolta (www.minolta.com), that is the book scanner we sell the most of.
If you are at all interested in document imaging, check out www.otg.com
and if you're in Minnesota, Wisconsin, or the Dakotas, check out my company's web site at www.mid-america.com
</plug>
check sane (Score:4, Informative)
jpeg also sucks for this. Jpeg is best for full color images like photographs. Better off using tiff or png. Most OCR software will require tiff. Don't know of any OCR software for linux although you might get some windows app to work under WINE. Textbridge from Xerox isn't bad for the money.
Re:check sane (Score:3, Informative)
Also there are a few commercial ones. However, scan-to-text conversion needs at least 600dpi and is only going to have about 97% accuracy.
We do this all the time at the office...... (Score:4, Informative)
ooh.. searchable index... (Score:2)
Try one of these... (Score:3, Interesting)
Canon's 90ppm high speed scanner - only problem with high speed scanning is that they need loose leaves. Any decent books you have and want to copy will need a Stanley knife taking to the spine.
Please remember to make decent backups on a long lasting medium with a high chance of recoverability. Failing that, place the loose leaf versions with a document recovery firm and take their insurance for the full purchase value of the originals.
Re:Try one of these... (Score:2)
Jason
searchable text versus scanned images (Score:2, Redundant)
Scanned images solve these problems, but have two problems of their own:
Perhaps a hybrid solution exists, but I suspect such a solution will require a lot of manual intervention and tweaking, something you'll want to avoid if your goal is to digitize several books.
Re:searchable text versus scanned images (Score:2)
Re:searchable text versus scanned images (Score:2, Informative)
I like my dead trees (Score:2, Insightful)
Personally, however, I still like printed manuals. Using an online manual means either reducing some windows or switching desktops. With a paper manual I can keep the screen exactly as it is. Higher resolution screens, or the use of multiple screens, are making online manuals much more useful (anyone remember what a pain in the ass it was to try and figure out something with only an online manual on a 640x480 screen?). Occasionally I still manage to fill two 1600x1200 screens with a bunch of stuff I want to keep visible while still reading the manual.
Electronic format is nice for storage, but... (Score:2, Informative)
portions of the text whenever you wanted to read them anyway!!!
printing electronic docs is for amateurs (Score:2)
Seriously, most of the hard-core computer folks I know either open their copy of the ORA book on the subject, steal their neighbor's copy and flip it open, or use some form of online docs w/o printing said docs off. The only reason I've ever known anyone to print anything resembling a doc is when someone I knew had assembled a binder full of pages on tech specs for a project.
It's just a lot easier to sit at the screen arrowing up and down on the doc than it is to print it, reach over to the printer, pull it out, shuffle through it....and then eventually have to take it out with the trash. I've seen comments about paperless offices vis a vis paperless restrooms, but the fact is that for reference there really isn't a reason to print the online doc.
I want both (Score:2, Informative)
On a regular basis, I haul 2188 pages worth, I just added them up, of QUE's Using Java2 Standard Edition, and Enterprise edition, between home and the office. (Speaking of which, go to the link in my
Not only are all of these books heavy, but I have also yet to find an easy way to cart them around; they don't all fit right in any of my bags.
I want all of these books on CD-ROM, but not just CD-ROM. Half the books I have INCLUDED a CD-ROM; it just doesn't contain the text of the book. With O'Reilly, I'd buy the CD-ROM version, but I want the dead tree version too. I want to use the dead tree version, unless I am working from home, when I want to haul home the CDs. I don't think I should have to pay any more for it either; I bought the IP (in the property sense), and I am already paying the price for the wood slices, which includes a silver disk.
PUBLISHERS, GIVE ME THE BOOK ON THE CD TOO! I spend $100/month or so on tech books.
-Pete
Let me get this straight... (Score:5, Insightful)
And put them into an inferior visual format you cannot read without the computer being working and on?
And you are going to spend about 100 hours to do this.. and the original books are going to be ruined.
All this just so you don't have to make 3 trips to move your books?
Mmmkayyy.. (backs away slowly)
Have you ever heard of a dolly?
Re:Hell yeah (Score:2)
I have outfitted my $100 Visor Handspring with a Compact Flash springboard module and now I can carry around over 100M of books in my shirt pocket. The darn thing is even backlit so that I can read in the dark. What's more I can search for keywords, and annotate the books to my heart's content.
What really settled it for me was when I started reading Structure and Interpretation of Computer Programs on my Visor and could do the example programs in LispME.
Needless to say I prefer my Visor over the dead tree version for any book that is text heavy.
contact your local school for the blind (Score:2, Interesting)
are you sure you want to do this? (Score:4, Insightful)
I've been wanting to do something similar for years, but with technical magazines, not books. But the sheer amount of manual labor involved has turned me off considerably (not to mention the thought of destroying the original source).
Keep in mind that this is such a common need that, if it were pretty straightforward, much of it would be done already (perhaps someone out there has the time/hardware/software to have done some of this already?). Not to mention that, with the web, much of the information contained in those books is now available online, which makes you wonder if it's really worth the time and effort, esp. considering that a great many technical books are obsolete two weeks before they hit the shelves.
Re:are you sure you want to do this? (Score:2, Informative)
Dr Dobbs (and I'm sure others) offers CDs full of all their articles from the past couple years for a pretty good price (less than $100, I believe). They also offer collections of books on CD for about the cost of one original.
Just a thought,
hgh
Re:are you sure you want to do this? (Score:2)
The second, and the one that many people don't really think of (and, to be honest, care about), is the ads. Both as a reference (for many old products, the ad can be the only source of information) and for entertainment value (hey, look at the 20MB MFM Seagate for $1200, not including controller). The ads always get lost when companies put their content online, sigh.
Re:are you sure you want to do this? (Score:4, Informative)
Where I work we tried to turn a book that we no longer had an electronic copy of into a PDF, keeping the images up front with OCR text behind, about 300 pages altogether. Even with max compression, and the lowest acceptable DPI (300 I think), the PDF came out to 95MB. It didn't help that we scanned the book page by page and generated the PDF by hand, on a slow HP general consumer model scanner, either (the initial PDF took over 120 hrs to produce, with rescans and OCR'ing and everything).
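For a sense of scale, some back-of-the-envelope arithmetic in Python. The 300 pages, 300 DPI and 95MB figures are from the paragraph above; the 8.5 x 11 inch page size is purely an assumption, so treat the raw sizes as ballpark numbers only.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size of one scanned page, ignoring any file-format overhead."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels * bits_per_pixel + 7) // 8  # round up to whole bytes


# Assumed letter-size page; the post does not say what the page size was.
bilevel = raw_page_bytes(8.5, 11, 300, 1)    # black and white, 1 bit per pixel
grayscale = raw_page_bytes(8.5, 11, 300, 8)  # 8-bit grayscale

# Figures reported in the post: a 95MB PDF for about 300 pages.
per_page_mb = 95 / 300

print(f"raw bilevel page:   {bilevel / 1e6:.2f} MB")
print(f"raw grayscale page: {grayscale / 1e6:.2f} MB")
print(f"reported PDF:       {per_page_mb:.2f} MB per page")
```

Under those assumptions the reported PDF works out to roughly a third of a megabyte per page, which is well under an uncompressed grayscale page but not dramatically smaller than an uncompressed bilevel one.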
We wound up taking the acrobat ocr'd text (it was better than the off the shelf ocr package we had at the time) via the adobe accessibility website, and fixing it up. It was a pretty big project.
We recently hired a document imaging company to PDF a lot of smaller historical documents for us, and that has worked out well. It's kind of pricey, but we also paid them to proof the ocr behind the images, and to hand adjust the images for appearance. It's worked out rather well.
PDF and OCR (Score:2, Interesting)
OCR sucks royally for large documents, documents with images or diagrams, handwritten comments, etc. However scanning the pages to an image and then creating a PDF of the images does not care about any of that.
So, scan all of your books as images that your OCR software can process. Use the OCR output to create an index of pages. If a specific word on a specific page doesn't OCR well who cares. With typed and professionally printed books your OCR software should be about 90% accurate. Take the images and create PDF files.
Now you have your nice clean images but you still have a searchable index. BTW, when you get this done post your procedures, problems, and solutions to a web site somewhere so that you can share your experiences with the rest of the world.
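A minimal sketch of the "OCR output as an index of pages" idea, in Python. It assumes you already have the OCR text for each page as a plain string; the sample pages and the function names are made up for illustration. A word the OCR mangles simply fails to match on that page, and the image of the page is still there to read.

```python
import re
from collections import defaultdict


def build_index(pages: dict[int, str]) -> dict[str, list[int]]:
    """Map each lowercased word to the sorted list of pages it appears on."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_no)
    return {word: sorted(nums) for word, nums in index.items()}


def lookup(index: dict[str, list[int]], word: str) -> list[int]:
    """Return the pages containing the word, or an empty list if none do."""
    return index.get(word.lower(), [])


# Hypothetical OCR output; page 2 contains a typical OCR slip ("po1nters").
sample_pages = {
    1: "Chapter 1: Pointers and arrays",
    2: "Arrays of po1nters and structs",
    3: "Pointers to functions",
}

index = build_index(sample_pages)
print(lookup(index, "pointers"))
print(lookup(index, "arrays"))
```

Searching for "pointers" misses page 2 because of the OCR slip, but still finds the other two pages, which is the "good enough" behavior the original question asked for.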
Start with google. (Score:2)
Start with google. There is a lot of technical information online, and google will find it. Not as good as those dead trees, but if you can find it and it is accurate, google is often easier than searching indexes. Best of all, dead trees are limited to the ones you own, while google is limited to whatever someone found useful to put online.
Note the last line of the above: google is limited to what someone else finds useful to put online. So if you can't find it on google, take some time to put it online for the rest of us. If/when you find yourself going back to the same few sites often, link to them from your homepage so google knows you find them useful. In other words, google is interactive, make it work for you and it will work for everyone. The internet is not a one way street.
Finally, some things are just plain easier to look up in dead tree format. I would strongly recommend you keep your books intact. Put the information you need on the web (what you can do legally), and keep the books for the rest. If you find you are not using a book anymore because all the information is on the web (including what you put there), then throw it out. My monitor is only 19 inches, not nearly enough to hold all the information I have scattered about my desk.
Blackmask.com (Score:2)
Tons and tons of e-texts. In multiple formats: text, pdf, lit, HTML.
Excellent resource!
Re:Blackmask.com (Score:2)
Why? Because the Blackmask site you refer to has few or no books of the type referred to by the original post. There does seem to be a lot of cool content there, but most of it is stuff you can find just as easily on the Project Gutenberg site or elsewhere.
So basically your post is somewhat off-topic, almost cool, but not really cool enough to merit a mod up despite the off-topicness of it. If I would have wasted a down-mod point on you someone else would have meta-modded it badly because they probably wouldn't know why I modded as I did. And, as I said, I just don't think the link is worth the mod up, despite the fact such a mod would probably survive a meta-mod.
All this points out one interesting fact about meta-modding -- it may work better than its critics give it credit for! At the very least it makes a subset of the moderators (a subset with at least one member, me) think twice before bestowing mod points either way. Note that I often lose mod points when the time runs out because I just don't find anything truly worthy of moderating.
Jack William Bell, who fully expects someone will mod this down as 'Off-topic'...
FAQ: Making Etexts from Paper Originals (Score:2, Informative)
Somewhat on topic... Historical Papers (Score:3, Interesting)
Before doing this, though, we were thinking of scanning/copying all the documents to keep copies for ourselves. In doing so, though, we could use some advice:
What special steps must we take in scanning 150+ year old documents, some very yellowed and fragile?
What is the best format in which to store them (assuming we want them easily readable in 20+ years for our kids)?
What is the best media upon which to store the data (again, hoping for readability in 20+ years)? (I'm thinking online storage to allow easy conversion to the media of the moment, but I still want something to stash in the safe deposit box)
Does anyone have experience with digital preservation/restoration of archival documents? Should I just try cleaning it up in Photoshop or should I find a pro to help out? Maybe I can make it a term of the donation to the museum/library, for that matter.
Thanks in advance for your advice.
Re:Somewhat on topic... Historical Papers (Score:2, Informative)
The National Archives and Records Administration [nara.gov] has a FAQ [nara.gov]. Their advice on preserving family papers? --
Paper preservation requires proper storage and safe handling practices. Your family documents will last longer if they are stored in a stable environment, similar to that which we find comfortable for ourselves: 60-70 degrees F; 40-50% relative humidity (RH); with clean air and good circulation. High heat and moisture accelerate the chemical processes that result in embrittlement and discoloration to the paper. Damp environments may also result in mold growth and/or be conducive to pests that might use the documents for food or nesting material. Therefore, the central part of your home provides a safer storage environment than a hot attic or damp basement.
Light is also damaging to paper, especially that which contains high proportions of ultra violet, i.e., fluorescent and natural day light. The effects of light exposure are cumulative and irreversible; they promote chemical degradation in the paper and fade inks. It is not recommended to permanently display valuable documents for this reason. Color photocopies or photographs work well as surrogates.
Re:Somewhat on topic... Historical Papers (Score:3, Informative)
If you really want to do it right, do it on film. Either pay someone or beg/borrow/steal a medium format camera and try to do it yourself. Film and archive quality prints will probably last longer than CDs and you can get good scans from the negatives if you want digital, too.
I believe libraries use uncompressed TIFF files for digital archives.
You might find some discussions of this on photo.net
Hauling Trees around (Score:2, Funny)
Call Paul Bunyan. Cause he's a lumberjack and he's okay!
Electronic versions from the publishers (Score:2)
One good, but old, example is Oracle. Back in the day my company had megs of PDFs of all of Oracle's documentation. There was a main index PDF with links to basically every other possible document. I don't recall Oracle leaving them open for download on the internet. We got them on CD. But it was easy to get since they knew we were a customer.
Aargh! Flashbacks! The pain, the pain... (Score:2)
Right then. In 1993/4, this is what I did for a living. The company I worked for [pindar.com] did quite a lot of this, and one contract in particular sticks in my mind - the digitising of all books in the French National Library.
No doubt the equipment we used has moved on in the intervening decade however. We used Bell & Howell [bhscanners.com] scanners fitted with automatic document shredders. Err...feeders. Yes, automatic document feeders. Not shredders at all. No. Honest.
You see, these were high-speed scanners, and some of the books we received were quite old. Me and the other coder on the project got really quite good at doing "pit stops", or changing the rubber wheels that drove the ADF. What I'm saying is no disrespect to the scanner company - it was the quality of the paper we had to put through it that caused the hassle. Some books, like the 18th century Academie Francais records, were so thin we had to photograph them and scan the photos.
We then scaled, OCR'd, deskewed and indexed the results on decent machines - 25Mhz 486SX, 4Mb RAM and Kofax [kofax.com] graphics cards. Everything was then tarred up to DAT.
Hardware moves on, but I'll bet the amount of work remains the same. Do not underestimate the preparation required, and also the amount of QA.
Oh, and don't use JPEG. Lossy compression on text? Use TIFF - the image processing industry standard.
Cheers,
Ian
I'd be happy with.. (Score:2)
4DigitalBooks 900 pages/hour - or do it yourself (Score:4, Informative)
Now, if I was to digitize all my books, I would try to create the 4DigitalBooks kind of solution myself. The only tricky part is to find a cheap enough way to turn pages automatically [mit.edu]; see also Kris Mckenzie's automatic page turner [accesswave.ca]. Still, the best start is this document [uconn.edu], which is a proposal and overview on how to create an automatic page turner from pieces; the total cost is $459.
Dual monitors could wean me from dead trees (Score:2)
The endless jumping between windows gets old real fast, especially if I need to copy a code snippet out of a document (like a PDF) that won't let me select & copy text.
But if I had a second monitor right there at eye level, I could just open up the reference doc there. No more switching between windows, and no more neck strain from constantly looking down at a book in my lap and then up at the screen.
Funny You should ask. (Score:3, Informative)
4 HP scanners with ADF ~$150 ea. (eBay)
4 Sparc LXs from a property control auction $50
one flatbed scanner for covers and bad scans. $50 (eBay again)
Barebones system w/ SCSI from Compgeeks $80 (NFS server)
An Amtren device [amtren.com] (courtesy of the office), and away you go. I've found the best way to cut off the bindings is to use a box cutter and to use your previous cuts as a guide. I use several shell scripts to scan various types of books; it's amazing the page numbering schemes some publishers use. With this setup I can scan approximately 2-3 college textbooks of 1000 pgs. (grayscale), or 1 in color, in a 10-hour period (including checking for bad scans; SANE ain't perfect, so you'd better check 'em).
Also, JPG isn't very good for OCR: I store as PNG and convert a second set to JPG for web viewing. OCR under Linux isn't quite there yet (unless you want to pay through the nose), so I am archiving the PNGs to CD until it is. This also allows me to regenerate the JPGs if I lose a webserver disk. Add a nifty little ImageMagick web viewer and voilà! eBookshelf! Oh, and an NSM CD changer is nice to get to the CDs nearline. You can pick these up on eBay for $200-$400.
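For anyone wanting to copy the "PNG masters, regenerable JPG copies" idea, here is a minimal sketch of the regeneration step. It is not the poster's actual shell script. It assumes ImageMagick's `convert` command is on the PATH, and the function names, directory arguments and the quality value of 85 are all made up for illustration.

```python
import subprocess
from pathlib import Path


def jpeg_commands(png_dir, jpg_dir, quality=85):
    """Build one ImageMagick `convert` command per archived PNG page.

    Nothing is executed here, so the plan can be inspected first.
    The quality default is an arbitrary illustrative value.
    """
    png_dir, jpg_dir = Path(png_dir), Path(jpg_dir)
    commands = []
    for png in sorted(png_dir.glob("*.png")):
        jpg = jpg_dir / (png.stem + ".jpg")
        commands.append(["convert", str(png), "-quality", str(quality), str(jpg)])
    return commands


def regenerate(png_dir, jpg_dir, quality=85):
    """Recreate the web-viewing JPGs from the lossless PNG masters."""
    Path(jpg_dir).mkdir(parents=True, exist_ok=True)
    for cmd in jpeg_commands(png_dir, jpg_dir, quality):
        # Assumes ImageMagick is installed; check=True stops on the first failure.
        subprocess.run(cmd, check=True)
```

Calling `regenerate("archive/mybook", "www/mybook")` (hypothetical paths) after losing a webserver disk would rebuild the JPG set without touching the masters.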
I've done this (Score:4, Insightful)
I started with the easiest books: the ones that could be removed from the binding. Scans go smoothly with the ADF, but it is not as easy as you might think. I find that I spend most of my time naming the files, because the default naming convention is *01.jpg, *02.jpg, *03.jpg, etc.
It is a problem for two reasons:
Most of my books are double sided, and my HP scanning software for Windows does not let me name files with a 2,4,6,8 or 1,3,5,7 format.
If a book contains more pages than the ADF holds, the first page scanned in each batch will still be named page 1.
If I knew a little perl, I'd write a script to rename the files between scan batches.
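A rough sketch of such a rename script, in Python rather than Perl. It assumes each batch lands in its own folder with names ending in a running number (scan01.jpg, scan02.jpg, ...), that odd pages are scanned first, and that the even sides are scanned after flipping the stack, so they come out in reverse order. The function names and the `page_` prefix are hypothetical.

```python
import re
from pathlib import Path


def scan_index(path):
    """Return the last run of digits in the file name, e.g. scan12.jpg -> 12."""
    numbers = re.findall(r"\d+", path.stem)
    return int(numbers[-1]) if numbers else 0


def plan_renames(files, first_page, step=2, reverse=False, prefix="page", digits=4):
    """Map one batch of scans to page-numbered names without touching the disk.

    first_page: page number of the lowest-numbered page in this batch.
    step: 2 for one side of double-sided sheets, 1 for single-sided pages.
    reverse: True when the stack was flipped, so the first file scanned
             is actually the highest page number in the batch.
    """
    ordered = sorted(files, key=scan_index)
    if reverse:
        ordered.reverse()
    plan = []
    for i, src in enumerate(ordered):
        page = first_page + i * step
        plan.append((src, src.with_name(f"{prefix}_{page:0{digits}d}{src.suffix}")))
    return plan


def apply_renames(plan):
    """Perform the renames, refusing to overwrite an existing file."""
    for src, dst in plan:
        if dst.exists():
            raise FileExistsError(dst)
        src.rename(dst)
```

For a 50-sheet batch starting at page 101, the fronts would be `plan_renames(front_files, first_page=101)` and the flipped backs `plan_renames(back_files, first_page=102, reverse=True)`. Printing the plan before calling `apply_renames` is a cheap sanity check.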
For scanning full bound textbooks, there are two main problems:
Scanning the side of the page along the binding requires carefully holding downward pressure on the book to keep it near the scanner glass.
You cannot scan the book using ADF, so you should expect to spend A LOT of time scanning.
Do not even consider manually scanning hundreds of pages with a parallel port scanner. WAY WAY too slow. USB scanners are cheap now, and will usually scan as fast as the scanner mechanism can move (assuming black & white scans).
Lastly, be realistic.
Know how much time you'll need to invest.
Rule of thumb: If you need to scan manually, expect to scan about 200 pages per hour at top speed. Is it worth investing six hours to scan that 1200 page book of yours? If money allows, I'd suggest purchasing a second book that you can afford to destroy. Cut the binding off with something like a jigsaw, then insert the pages into an ADF scanner. Hope this helps somebody.
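The rule of thumb above is simple enough to put in a few lines if you want to total it up for a whole shelf. This is only a sketch: the 200 pages per hour default is the poster's top-speed estimate for manual scanning, and the function name is made up.

```python
def manual_scan_hours(pages, pages_per_hour=200):
    """Estimate hours of manual flatbed scanning for a given page count.

    The default rate is the thread's rule of thumb, not a measured value.
    """
    if pages < 0 or pages_per_hour <= 0:
        raise ValueError("pages must be >= 0 and pages_per_hour must be > 0")
    return pages / pages_per_hour


# Total for a shelf of books, given each book's page count (example numbers).
shelf = [1200, 450, 800]
total_hours = sum(manual_scan_hours(p) for p in shelf)
```

Real throughput will likely be lower than the top-speed figure once you count rescans and file naming.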
You *need* to be aware of OpenDJVu (Score:5, Interesting)
It's truly a brilliant format. Go check it out.
Yours Truly,
Dan Kaminsky
DoxPara Research
http://www.doxpara.com
You are insane (Score:4, Insightful)
What is the oldest file that I have?
and ask:
What is the oldest useful file that I have?
For most people their papers and books are much older than the data they keep and the paper version is always available and easy to read.
You are much more likely to lose or corrupt your data if it is on a disk or a tape than if it is in a book. Your electronic version is going to be of much lesser quality than the books you had and you will have a lot of "adventures" getting your ebooks to be as easy to read as your paper books. What happens to your portable ebook when your reader runs out of batteries? Ebooks have failed because ... THEY SUCK. Let us all know how much time you wasted tweaking your ebook setup and worrying about how to make them sustainable. Also, please tell us when you go back to the store and buy new "dead trees" copies of the ones you destroyed.
Re:Don't use JPEG. (Score:2)
Furthermore, if you want animations, you are overlooking the new, cool computer technology called MNG [libpng.org].
Re:Tech books shouldn't be dead tree only. (Score:2)
Why the hell not? Isn't that what we all do while working?