Book People Archive

Scanning with the Plustek OpticBook 3600

From: "Nick Hodson" <nicholashodson@[redacted]>
Subject: Scanning with the Plustek OpticBook 3600
Date: Mon, 3 Jan 2005 11:58:30 -0000

I received my Plustek OpticBook 3600 scanner at the beginning of 
December, and have now scanned well over a dozen books with it, some of 
which have been OCRed and edited and are now on my website at 
www.athelstane.co.uk

The advantage of this device is that it will in most cases scan a page 
flat and without shadow, achieved by having the glass of the scanner 
come right up to the edge, so that the page being scanned sits on the 
scanner, while the other half of the book hangs over the side of the 
scanner. Thus the book is never opened out and pressed down on the 
scanner glass. This theoretically results in no damage to the book.

This is excellent for most books, and quickly provides a good scan of 
every page. I find that at 300 dpi I can scan a page every eleven 
seconds. There are two exceptions: (i) when the gutter (the distance 
between the edge of the text and the fold of the book) is less than 1 
centimetre; and (ii) when the book is large, heavy, and already a little 
dilapidated, in which case it is hard to hold and sheds small bits that 
may get into the scan image.
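
As a rough illustration of what eleven seconds a page means for a whole
book, here is a minimal Python sketch. The function name and the
300-page example are assumptions made for illustration, and the estimate
ignores re-scans and time spent handling the book.

```python
SECONDS_PER_PAGE = 11  # figure quoted above for scanning at 300 dpi


def scanning_minutes(page_count, seconds_per_page=SECONDS_PER_PAGE):
    """Return the estimated scanning time in minutes, ignoring re-scans."""
    return page_count * seconds_per_page / 60


# A hypothetical 300-page book: 300 * 11 s = 3300 s = 55.0 minutes.
print(scanning_minutes(300))
```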

I use an image-to-pdf program to create a pdf of the book, which enables 
me to see if any pages need to be scanned or re-scanned, or deleted. It 
takes somewhere between one and two minutes to generate the book's pdf 
from the existing scans. I normally need to create the pdf a couple of 
times before I am satisfied, doing a little more scanning work if 
necessary.
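
The check for pages that still need scanning can also be partly
automated before the pdf is built. The sketch below is not the
image-to-pdf program mentioned above; it only looks for gaps in the
numbering of the scan files. The file-name pattern (a page number just
before the extension, as in page_0001.tif) is an assumption.

```python
import re

# Assumption: each scan file name ends with a page number and an
# extension, for example "page_0001.tif".
PAGE_NUMBER = re.compile(r"(\d+)\.\w+$")


def missing_pages(filenames):
    """Return the page numbers absent between the lowest and highest found."""
    found = set()
    for name in filenames:
        match = PAGE_NUMBER.search(name)
        if match:
            found.add(int(match.group(1)))
    if not found:
        return []
    return [n for n in range(min(found), max(found) + 1) if n not in found]


scans = ["page_0001.tif", "page_0002.tif", "page_0005.tif"]
print(missing_pages(scans))  # pages 3 and 4 have no scan file
```

A check like this only catches pages that were never scanned; skewed or
blurred pages still have to be spotted by eye in the pdf.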

Of course these scans are not perfectly straight, and are not nicely 
centred with a small neat margin, so they represent a fairly authentic 
view of the book.

Without wishing to enter the lists regarding recent Book People 
correspondence, I agree that scans are different each time you do them, 
and that a 300 dpi scan is different from a 600 dpi one. I can't agree 
that straightening should be avoided, because a good OCR program such as 
ABBYY FineReader requires it. I do agree that despeckling should be 
avoided because it can cause other legibility problems, such as losing 
the dot of an "i" or a "j", or worse, a full-stop (period) following a 
"t".

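To show why despeckling can lose the dot of an "i", here is a minimal
Python sketch of one common despeckling approach: deleting every small
connected group of ink pixels. It is an assumption that a given scanner
or OCR package despeckles this way, and the tiny bitmap is invented for
illustration ("#" is ink, "." is paper).

```python
from collections import deque


def despeckle(bitmap, min_size):
    """Erase 4-connected groups of '#' pixels smaller than min_size."""
    rows = [list(row) for row in bitmap]
    height = len(rows)
    width = len(rows[0]) if rows else 0
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if rows[y][x] != "#" or seen[y][x]:
                continue
            # Breadth-first search collects one connected group of ink.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and rows[ny][nx] == "#" and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_size:
                for cy, cx in component:
                    rows[cy][cx] = "."
    return ["".join(row) for row in rows]


# A letter "i" (dot above a three-pixel stem) and one stray speck.
PAGE = [
    ".#....",
    "......",
    ".#...#",
    ".#....",
    ".#....",
]

for line in despeckle(PAGE, min_size=2):
    print(line)
```

With min_size=2 the stray speck disappears, but so does the dot of the
"i", because the filter cannot tell the two apart by size alone.
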
Most of the books I scan are nineteenth century children's books, some 
of which appear to have been read by a child with a heavy cold. In such 
cases I have needed to winch the brightness up as far as 20, which gets 
rid of most of the "blobs" but which may affect the "t"-period 
situation. But I have software that finds these cases.
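
The software mentioned here is not described in this message, so the
following is only a guess at how such a check might work. It flags a
word ending in "t" that is followed directly by a capitalised word,
which is what a lost full stop would leave behind. The pattern and the
function name are assumptions.

```python
import re

# Assumption: a lost full stop shows up as a word ending in "t", then
# white space, then a capitalised word, with no punctuation in between.
SUSPECT = re.compile(r"\b(\w*t)\s+([A-Z]\w*)")


def find_suspect_stops(text):
    """Return each 't Capital' word pair that may hide a missing full stop."""
    return [match.group(0) for match in SUSPECT.finditer(text)]


sample = "He put on his hat The rain fell. It met Tom at last."
print(find_suspect_stops(sample))
```

On the sample this reports both "hat The", a real lost full stop, and
"met Tom", a false alarm caused by a proper name, so the results are
candidates for a human to review rather than automatic corrections.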

I use ABBYY FineReader 7 to perform the OCR. I choose html as the output 
format, which allows me to retain markup such as italics. When I used 
TextBridge in the past I sent the output to rtf files, but html is more 
convenient. FineReader straightens the scan of each page before doing 
OCR on it, and you can get it to write these straightened scans to 
another folder. I then process these scans to generate neatly presented 
pages. These are then reduced slightly in size and used to generate 
screen-sized images of each page that can be used to aid in the editing.

Both the OpticBook 3600 and ABBYY FineReader have slight glitches in the 
way they work. I would not call these bugs, but they are certainly 
nuisances that the manufacturers could cure easily. I have written 
software that overcomes these glitches, and that speeds up the processes 
described above. See the last paragraph below.

Having now created a set of html files, one for each page, I have 
developed software that assembles these into chapters, and that then 
checks for a number of common problems, which it cures in each case. So 
you now have a nearly clean text version of each chapter, with markup 
for italics, etc, and for the start of each page.
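
The assembly step might look something like the Python sketch below. It
is not the software referred to above. It assumes that a chapter always
begins at the top of a page with a heading containing the word CHAPTER,
and it invents an html comment as the start-of-page marker; the real
conventions are not given in this message.

```python
import re

# Assumption: a new chapter starts on a page whose text, after any
# opening tags, begins with the word CHAPTER (in any letter case).
CHAPTER_HEADING = re.compile(r"^\s*(?:<[^>]+>\s*)*CHAPTER\b", re.IGNORECASE)


def assemble_chapters(pages):
    """Group (page_number, html_fragment) pairs into chapter strings."""
    chapters = []
    for number, fragment in pages:
        if not chapters or CHAPTER_HEADING.match(fragment):
            chapters.append([])
        # Hypothetical page marker so the editor can find the page scan.
        chapters[-1].append(f"<!-- page {number} -->\n{fragment}")
    return ["\n".join(parts) for parts in chapters]


pages = [
    (1, "<h2>Chapter I</h2><p>Once upon a time</p>"),
    (2, "<p>there was a <i>very</i> old book.</p>"),
    (3, "<h2>CHAPTER II</h2><p>Then it was scanned.</p>"),
]
for chapter in assemble_chapters(pages):
    print(chapter)
    print("-----")
```

Keeping a marker for the start of each page means any doubtful word in
a chapter can be traced back to the page image it came from.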

These versions of the chapters are good enough that I automatically 
create speech text files from them, inserting markup for each word 
that Fonix ISpeak does not speak well without intervention. I have a 
lookup file with about 15,000 such words, which is fairly comprehensive. 
Finally a play-list is created, and you can start to listen to a pretty 
good version of the book, even before you start to edit it.
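
A lookup-driven pass of this kind can be sketched in Python as below.
The markup that Fonix ISpeak actually expects is not shown in this
message, so the sketch simply substitutes a plain respelling for each
listed word; the two lookup entries and the play-list file names are
invented for illustration.

```python
import re

# Hypothetical lookup entries: word as printed -> respelling for speech.
LOOKUP = {
    "boatswain": "bosun",
    "victuals": "vittles",
}


def to_speech_text(text, lookup):
    """Replace each word found in the lookup with its speech respelling."""
    def replace(match):
        word = match.group(0)
        # Matching is case-insensitive; the respelling is used as stored.
        return lookup.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", replace, text)


def make_playlist(chapter_count):
    """Return hypothetical speech file names, one per chapter, in order."""
    return [f"chapter{n:02d}.txt" for n in range(1, chapter_count + 1)]


print(to_speech_text("The boatswain ate his victuals.", LOOKUP))
print(make_playlist(3))
```

Because the lookup is an ordinary table, adding a newly found problem
word is a one-line change and needs no alteration to the code.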

My editor has a number of automatic stages, a spelling checker obviously 
being one of them, and at the end of these the book will be very 
nearly correct. There are then some checks provided for the book as a 
whole, as opposed to doing it chapter by chapter. While all this has 
been going on the book has been playing aloud, and you will hear many of 
the residual problems, particularly for chapters that have been edited. 
When you have run all the editing and whole-book analyses you will enjoy 
reading the book. If you look hard you will find, I hope, no more than 
about one glitch per ten pages.

If you will be kind enough to look at
http://www.athelstane.co.uk/review04.htm you will see at the end of that 
essay a more detailed list of the processes required to produce scans of 
books in the manner described above. There is also reference to a 
document that shows you how to use my software, and that refers to a 
pack of such software. I wrote this document over Christmas, and have 
verified it and improved it somewhat since then. So it is not 
vapour-ware, and should be available in the next few days. Warning: the 
software is all designed to work from the command line in a DOS window. 
I run it under both Windows 98 and XP.