Scanning with the Plustek OpticBook 3600
- From: "Nick Hodson" <nicholashodson@[redacted]>
- Subject: Scanning with the Plustek OpticBook 3600
- Date: Mon, 3 Jan 2005 11:58:30 -0000
I received my Plustek OpticBook 3600 scanner at the beginning of
December, and have now scanned well over a dozen books with it, some of
which have been OCRed and edited and are now on my website at
www.athelstane.co.uk
The advantage of this device is that it will in most cases scan a page
flat and without shadow, achieved by having the glass of the scanner
come right up to the edge, so that the page being scanned sits on the
scanner, while the other half of the book hangs over the side of the
scanner. Thus the book is never opened out and pressed down on the
scanner glass. This theoretically results in no damage to the book.
This is excellent for most books, and quickly provides a good scan of
every page. I find that at 300 dpi I can scan a page every eleven
seconds. The exceptions are (i) when the gutter (the distance between
the edge of the text and the fold of the book) is less than 1
centimetre; and (ii) when the book is large, heavy, and already a little
dilapidated, in which case it will be hard to hold and will shed small
bits that may get into the scan image.
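To put the eleven-seconds figure in perspective, here is a small Python sketch of the arithmetic. The function name and the 300-page book are illustrative assumptions, and the result counts scanning only, not handling or re-scans.

```python
def scanning_minutes(pages: int, seconds_per_page: float = 11.0) -> float:
    """Estimate scanning time in minutes at a fixed rate per page."""
    return pages * seconds_per_page / 60.0


# A hypothetical 300-page book at the quoted 11 seconds per page.
print(f"{scanning_minutes(300):.0f} minutes")  # prints "55 minutes"
```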
I use an image-to-pdf program to create a pdf of the book, which enables
me to see if any pages need to be scanned or re-scanned, or deleted. It
takes somewhere between one and two minutes to generate the book's pdf
from the existing scans. I normally need to create the pdf a couple of
times before I am satisfied, doing a little more scanning work if
necessary.
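The check for pages that still need scanning can also be made before the pdf is built. What follows is only a sketch of one way to do it, not the program I actually use. It assumes the scans are saved under a hypothetical naming scheme such as page001.tif, page002.tif, and so on.

```python
import re

# Hypothetical naming scheme: "page" + digits + ".tif".
PAGE_NAME = re.compile(r"^page(\d+)\.tif$", re.IGNORECASE)


def missing_pages(names, first, last):
    """Return the page numbers in first..last that have no scan file."""
    found = set()
    for name in names:
        match = PAGE_NAME.match(name)
        if match:
            found.add(int(match.group(1)))
    return [n for n in range(first, last + 1) if n not in found]


# File names as they might appear in the scan folder.
names = ["page001.tif", "page002.tif", "page004.tif", "notes.txt"]
print(missing_pages(names, 1, 5))  # prints [3, 5]
```

In practice the list of names would come from a directory listing of the scan folder.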
Of course these scans are not perfectly straight, and are not nicely
centred with a small neat margin, so they represent a fairly authentic
view of the book.
Without wishing to enter the lists regarding recent Book People
correspondence, I agree that scans are different each time you do them,
and that a 300 dpi scan is different from a 600 dpi one. I can't agree
that straightening should be avoided, because a good OCR program such as
ABBYY FineReader requires it. I do agree that despeckling should be
avoided because it can cause other legibility problems, such as losing
the dot of an "i" or a "j", or worse, a full-stop (period) following a
"t".
Most of the books I scan are nineteenth century children's books, some
of which appear to have been read by a child with a heavy cold. In such
cases I have needed to winch the brightness up as far as 20, which gets
rid of most of the "blobs" but which may affect the "t"-period
situation. But I have software that finds these cases.
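As an illustration of how such cases might be found, here is a minimal Python sketch. It is an assumption about the general approach, not the software referred to above. It flags a word ending in "t" that is followed directly by a capitalised word with no punctuation between them. Proper names after such a word will give false positives, so the results are candidates for review rather than corrections.

```python
import re

# A word ending in "t", then whitespace, then a capitalised word, with no
# punctuation between them: a possible lost full stop.
LOST_STOP = re.compile(r"\b(\w*t)\s+([A-Z]\w*)")


def suspect_lost_stops(text):
    """Return (word, next_word) pairs where a full stop may be missing."""
    return [(m.group(1), m.group(2)) for m in LOST_STOP.finditer(text)]


sample = "He went out The rain had stopped. Then it began again."
print(suspect_lost_stops(sample))  # prints [('out', 'The')]
```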
I use ABBYY FineReader 7 to perform the OCR. I choose to have the output
in html, which allows me to retain markup such as italics. When I used
TextBridge in the past I sent the output to rtf files, but html is more
convenient. FineReader straightens the scan of each page before doing
OCR on it, and you can get it to write these straightened scans to
another folder. I then process these scans to generate neatly presented
pages. These are then reduced slightly in size and used to generate
screen-sized images of each page that can be used to aid in the editing.
Both the OpticBook 3600 and ABBYY FineReader have slight glitches in the
way they work. I would not call these bugs, but they are certainly
nuisances that the manufacturers could cure easily. I have written
software that overcomes these glitches, and that speeds up the processes
described above. See the last paragraph below.
Having now created a set of html files, one for each page, I have
developed software that assembles these into chapters, and that then
checks for a number of common problems, which it cures in each case. So
you now have a nearly clean text version of each chapter, with markup
for italics, etc, and for the start of each page.
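The assembly step can be pictured with the following Python sketch. It is a simplified illustration under stated assumptions, not the software itself. It assumes that each chapter begins at the top of a page with a heading starting with the word "CHAPTER", and it uses an html comment as a hypothetical marker for the start of each page.

```python
import re

# Hypothetical rule: a page opens a new chapter if, after any leading tags,
# its text begins with the word "CHAPTER" (in any case).
CHAPTER_HEADING = re.compile(r"^\s*(?:<[^>]+>\s*)*CHAPTER\b", re.IGNORECASE)


def assemble_chapters(pages):
    """Join (page_number, html_fragment) pairs, in reading order, into chapters.

    Each page is preceded by a marker recording where it starts.
    """
    chapters = []
    current = []
    for number, fragment in pages:
        # Close the chapter in progress when a new heading is met.
        if CHAPTER_HEADING.match(fragment) and current:
            chapters.append("\n".join(current))
            current = []
        current.append(f"<!-- page {number} -->")
        current.append(fragment.strip())
    if current:
        chapters.append("\n".join(current))
    return chapters


pages = [
    (1, "<p>CHAPTER I</p><p>It was a dark night.</p>"),
    (2, "<p>The wind rose.</p>"),
    (3, "<h2>Chapter II</h2><p>Morning came.</p>"),
]
for chapter in assemble_chapters(pages):
    print(chapter)
    print("-----")
```

A chapter that begins part-way down a page would need the page to be split at the heading, which this sketch does not attempt.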
These versions of the chapters are so good that I have the software
automatically create speech text files from them, inserting markup for
each word that Fonix ISpeak does not speak well without intervention. I
have a lookup file with about 15,000 of these, which is fairly comprehensive.
Finally a play-list is created, and you can start to listen to a pretty
good version of the book, even before you start to edit it.
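The lookup idea can be sketched in Python as follows. This is an illustration only. The two entries and their respellings are invented examples, and a plain respelling stands in for whatever markup the speech program actually expects, which is not described here.

```python
import re

WORD = re.compile(r"[A-Za-z']+")


def to_speech_text(text, lookup):
    """Replace each word found in lookup (matched in lower case) with its
    substitute, leaving all other words and punctuation untouched."""
    def replace(match):
        word = match.group(0)
        return lookup.get(word.lower(), word)

    return WORD.sub(replace, text)


# Invented entries; the real lookup file holds about 15,000 words.
lookup = {"boatswain": "bosun", "forecastle": "foke-sul"}
print(to_speech_text("The boatswain went to the forecastle.", lookup))
# prints "The bosun went to the foke-sul."
```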
My editor has a number of automatic stages, of which obviously a
spelling checker is one, at the end of which the book will be very
nearly correct. There are then some checks provided for the book as a
whole, as opposed to doing it chapter by chapter. While all this has
been going on the book has been playing aloud, and you will hear many of
the residual problems, particularly for chapters that have been edited.
When you have run all the editing and whole-book analyses you will enjoy
reading the book. You will find only about one glitch per ten pages, I
hope, and then only if you look hard.
If you will be kind enough to look at
http://www.athelstane.co.uk/review04.htm you will see at the end of that
essay a more detailed list of the processes required to produce scans of
books in the manner described above. There is also reference to a
document that shows you how to use my software, and that refers to a
pack of such software. I wrote this document over Christmas, and have
verified it and improved it somewhat since then. So it is not
vapour-ware, and should be available in the next few days. Warning: the
software is all designed to work from the command line in a DOS window.
I run it under both Windows 98 and XP.