Book scanning (or magazine scanning) is the process of converting
physical books and magazines into digital media such as images,
electronic text, or electronic books (e-books) by using an image scanner.
Digital books can be easily distributed, reproduced, and
read on-screen. Common file formats are DjVu, Portable Document Format
(PDF), and Tagged Image File Format (TIFF). To convert the raw images,
optical character recognition (OCR) is used to turn book pages into
a digital text format such as ASCII or a similar format, which reduces
the file size and allows the text to be reformatted, searched, or
processed by other applications.
Image scanners may be manual or automated. In an ordinary commercial
image scanner, the book is placed on a flat glass plate (or platen), and
a light and optical array moves across the book underneath the glass. In
manual book scanners, the glass plate extends to the edge of the
scanner, making it easier to line up the book's spine. Other book
scanners place the book face up in a v-shaped frame, and photograph the
pages from above. Pages may be turned by hand or by automated paper
transport devices. Glass or plastic sheets are usually pressed against
the page to flatten it.
After scanning, software adjusts the document images by aligning,
cropping, and editing them, and then converts them to text and the final
e-book form. Human proofreaders usually check the output for errors.
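
The cropping step mentioned above can be illustrated with a minimal sketch. It assumes a grayscale page stored as a list of rows of 0-255 values, with a light background; the function names and the threshold of 200 are illustrative choices, not taken from any particular scanning package.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def content_bounding_box(pixels: List[List[int]],
                         background_threshold: int = 200) -> Optional[Box]:
    """Find the smallest box containing every pixel darker than the threshold.

    Returns None when the whole page counts as background.
    """
    # Rows that contain at least one "ink" pixel.
    rows = [r for r, row in enumerate(pixels)
            if any(v < background_threshold for v in row)]
    if not rows:
        return None
    # Columns that contain at least one "ink" pixel.
    cols = [c for c in range(len(pixels[0]))
            if any(row[c] < background_threshold for row in pixels)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(pixels: List[List[int]], box: Box) -> List[List[int]]:
    """Return only the part of the page inside the box."""
    top, left, bottom, right = box
    return [row[left:right] for row in pixels[top:bottom]]


# A 5 x 6 white page with two dark pixels.
page = [[255] * 6 for _ in range(5)]
page[1][2] = 0
page[3][4] = 10
box = content_bounding_box(page)
cropped = crop(page, box)
```

Real post-processing software would also deskew the page and tolerate noise such as dust specks, which this sketch does not attempt.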
Scanning at 118 dots/centimeter (300 dpi) is adequate for conversion to
digital text output, but for archival reproduction of rare, elaborate, or
illustrated books, much higher resolution is used. High-end scanners
capable of thousands of pages per hour can cost thousands of dollars, but
do-it-yourself (DIY) manual book scanners capable of 1,200 pages per
hour have been built for US$300.
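
The resolution figure above can be checked with a short calculation. The sketch below converts dots per inch to dots per centimeter and estimates the uncompressed size of one scanned page. The 6 x 9 inch page size and the 8-bit grayscale depth are assumptions chosen for illustration.

```python
from typing import Tuple

CM_PER_INCH = 2.54


def dpi_to_dots_per_cm(dpi: float) -> float:
    """Convert a resolution in dots per inch to dots per centimeter."""
    return dpi / CM_PER_INCH


def page_pixels(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px: int, height_px: int,
                       bytes_per_pixel: int = 1) -> int:
    """Raw image size; 1 byte per pixel corresponds to 8-bit grayscale."""
    return width_px * height_px * bytes_per_pixel


dots_per_cm = dpi_to_dots_per_cm(300)        # about 118.1
width_px, height_px = page_pixels(6, 9, 300)  # assumed 6 x 9 inch page
raw_size = uncompressed_bytes(width_px, height_px)
```

Under these assumptions one page is 1800 x 2700 pixels, or about 4.9 MB before compression. A page of plain text is on the order of a few kilobytes (assuming a few thousand characters at one byte each), which gives a sense of why OCR output is so much smaller than the raw images.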
Commercial book scanners
Sketch of a V-shaped book scanner from Atiz
Sketch of a typical manual book scanner
Commercial book scanners differ from ordinary scanners: they usually
consist of a high-quality digital camera, with light sources on either
side, mounted on a frame that gives a person or machine easy access to
turn the pages of the book. Some models include V-shaped book cradles,
which support the book's spine and automatically center the book.
The advantage of this type of scanner is that it is much faster than an
overhead scanner. Compared with traditional overhead scanners, whose
prices normally start at US$10,000, this type of digital camera-based
book scanner is also much more cost-effective.
Book scanning by organizations on a large scale
Projects like Project Gutenberg, the Million Book Project, Google Books,
and the Open Content Alliance scan books on a large scale.
One of the main challenges is the sheer volume of books that must be
scanned. In 2010 the total number of works appearing as books in human
history was estimated to be around 130 million.[1] All of these must be
scanned and then made searchable online for the public to use as a
universal library. Currently, large organizations rely on three main
approaches: outsourcing, scanning in-house using commercial book
scanners, and scanning in-house using robotic scanning solutions.
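
To give a sense of the scale involved, the following sketch estimates the scanner time needed for the 130 million works cited above. Only the book count and the 1,200 pages per hour rate come from this article; the average of 300 pages per book and the 2,000 operating hours per scanner per year are assumptions made purely for illustration.

```python
def scanner_years(books: int, pages_per_book: int,
                  pages_per_hour: int, hours_per_year: int) -> float:
    """Years of work for a single scanner, ignoring setup and downtime."""
    total_pages = books * pages_per_book
    total_hours = total_pages / pages_per_hour
    return total_hours / hours_per_year


estimate = scanner_years(
    books=130_000_000,     # estimate cited in the text
    pages_per_book=300,    # assumed average
    pages_per_hour=1200,   # manual DIY scanner rate quoted earlier
    hours_per_year=2000,   # assumed operating hours per scanner
)
```

With these inputs the result is 16,250 scanner-years, so a project would need thousands of scanners running in parallel to finish within a few years. The figure is very sensitive to the assumed page count and operating hours.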
When outsourcing, books are often shipped to low-cost providers in
India or China. Alternatively, because of convenience, safety, and
technological improvements, many organizations choose to scan in-house,
using either overhead scanners, which are time-consuming, or digital
camera-based scanning solutions, which are substantially faster and are
employed by the Internet Archive as well as Google. Traditional methods
have included cutting off the book's spine and scanning the pages in a
scanner with automatic page-feeding capability, then rebinding the
loose pages afterwards.
Once a page is scanned, its text is either entered manually or
extracted via OCR, which is another major cost of book scanning
projects.
Due to copyright issues, most scanned books are those that are out of
copyright; however, Google Book Search is known to scan books still
protected under copyright unless the publisher specifically excludes
them.
Destructive scanning
For book scanning on a low budget, the least expensive method is to cut
off the binding. This converts the book or magazine into a sheaf of
loose-leaf pages, which can then be loaded into a standard automatic
document feeder and scanned using inexpensive and common scanning
technology. While this is not a desirable solution for very old and
uncommon books, it is a useful tool for book and magazine scanning where
the book is not an expensive collector's item and a replacement copy is
easy to obtain. There are two technical difficulties with this process:
first the cutting, and second the scanning.
Cutting
One method of cutting a stack of 500 to 1000 pages in one pass uses a
guillotine paper cutter. This is a large steel table with a paper vise
that screws down onto the stack and firmly secures it before cutting.
The cut is made with a large sharpened steel blade that moves straight
down and cuts the entire length of each sheet at once. A lever on the
blade permits several hundred pounds of force to be applied for a quick
one-pass cut.
A clean cut through a thick stack of paper cannot be made with a
traditional inexpensive sickle-shaped hinged paper cutter. These cutters
are intended for only a few sheets, with up to ten sheets being the
practical cutting limit. A large stack of paper applies torsional forces
to the hinge, pulling the blade away from the cutting edge on the table.
As the cut moves away from the hinge, it becomes less accurate and the
force required to hold the blade against the cutting edge increases.
The guillotine cutting process dulls the blade over time, requiring
that it be resharpened.
Coated paper such as slick magazine paper dulls the blade more
quickly than plain book paper, due to the kaolinite clay coating.
Additionally, removing the binding of an entire hardcover book causes
excessive wear because the blade must cut through the cover's stiff
backing material. Instead, the outer cover can be removed so that only
the interior pages need to be cut.
Scanning
Once the paper is liberated from the spine, it can be scanned one
sheet at a time using a traditional flatbed scanner or automatic
document feeder (ADF).
Pages with a decorative riffled edging or curving in an arc due to a
non-flat binding can be difficult to scan using an ADF. An ADF is
designed to scan pages of uniform shape and size, and variably sized or
shaped pages can lead to improper scanning. The riffled edges or curved
edge can be guillotined off to render the outer edges flat and smooth
before the binding is cut.
The coated paper of magazines and bound textbooks can make them
difficult for the rollers in an ADF to pick up and guide along the paper
path. An ADF that uses a series of rollers and channels to flip sheets
over may jam or misfeed when fed coated paper. Generally there are fewer
problems when the paper path is as straight as possible, with few bends
and curves. The clay can also rub off the paper over time and coat the
sticky pickup rollers, causing them to lose their grip on the paper. The
ADF rollers may need periodic cleaning to prevent this slipping.
Magazines can pose a bulk-scanning challenge because of small,
nonuniform sheets of paper in the stack, such as subscription cards and
fold-out pages. These need to be removed before the bulk scan begins,
and are either scanned separately if they include worthwhile content or
simply left out of the scan process.
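
The sorting step described above can be sketched as a simple partition by sheet size. This is only an illustration: it assumes each sheet has already been measured in millimeters (a hypothetical input), treats the most common size as the reference, and uses an arbitrary tolerance of 2 mm.

```python
from collections import Counter
from typing import List, Tuple

Sheet = Tuple[int, int]  # (width_mm, height_mm), hypothetical measurements


def split_nonuniform(sheets: List[Sheet],
                     tolerance_mm: int = 2) -> Tuple[List[Sheet], List[Sheet]]:
    """Split sheets into those matching the most common size and the rest."""
    if not sheets:
        return [], []
    # The most frequent size is taken as the regular page size.
    (ref_w, ref_h), _ = Counter(sheets).most_common(1)[0]
    uniform: List[Sheet] = []
    odd: List[Sheet] = []
    for w, h in sheets:
        if abs(w - ref_w) <= tolerance_mm and abs(h - ref_h) <= tolerance_mm:
            uniform.append((w, h))
        else:
            odd.append((w, h))
    return uniform, odd


stack = [(210, 275), (210, 275), (211, 275),
         (100, 150),   # e.g. a subscription card
         (420, 275),   # e.g. a fold-out page
         (210, 275)]
bulk, separate = split_nonuniform(stack)
```

The `bulk` list would go through the ADF, while the sheets in `separate` would be scanned individually or left out.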
A Test Case: PGP
In 1995, Phil Zimmermann published PGP Source Code and Internals as a
$60 hardbound book, which under the First Amendment could legally be
shipped abroad. The buyer could either display it in a library or
destructively scan it so that the source code could be compiled, using
freely available GNU software, into the Pretty Good Privacy (PGP)
cryptosystem, which the U.S. government regarded as a restricted
munition. Zimmermann was under criminal investigation for distributing
PGP software and wanted to test the law in the courts. It was not
directly tested, but export restrictions have eased: it is legal to
export PGP anywhere but the seven countries and specified groups and
individuals to which nothing can be exported from the U.S.
Non-destructive scanning
An example of a DIY non-destructive book scanner/digitizer with a book-downwards design, which allows gravity to flatten the pages.
In recent years, software-driven machines and robots have been
developed to scan books without disbinding them, in order both to
preserve the contents of the document and to create a digital image
archive of its current state. This trend is due in part to
ever-improving imaging technologies that allow a high-quality digital
archive image to be captured, with little or no damage to a rare or
fragile book, in a reasonably short period of time.
Some high-end scanning systems employ vacuum, air, and static charges
to turn pages while imaging is performed automatically, usually by a
high-resolution camera located over an adjustable V-shaped cradle.
Images are then passed from the imaging device into various editing
suites, which can further process them into either an archival-quality
file such as TIFF or JPEG 2000, or a web-friendly output such as JPEG or
PDF.
Google's patent 7508978 describes an infrared camera technology that
detects the three-dimensional shape of the page and automatically
adjusts for it.[2]
Researchers from the University of Tokyo have an experimental
non-destructive book scanner[3] that includes a 3D surface scanner,
allowing images of a curved page to be straightened in software. The
book or magazine can thus be scanned as quickly as the operator can flip
through the pages, at about 200 pages per minute.