Book scanning

Book scanning (or magazine scanning) is the process of converting physical books and magazines into digital media such as images, electronic text, or electronic books (e-books) by using an image scanner.

Digital books can be easily distributed, reproduced, and read on-screen. Common file formats are DjVu, Portable Document Format (PDF), and Tagged Image File Format (TIFF). Optical character recognition (OCR) is used to convert the raw page images into a digital text format such as ASCII, which reduces the file size and allows the text to be reformatted, searched, or processed by other applications.

Image scanners may be manual or automated. In an ordinary commercial image scanner, the book is placed on a flat glass plate (or platen), and a light and optical array moves across the book underneath the glass. In manual book scanners, the glass plate extends to the edge of the scanner, making it easier to line up the book's spine. Other book scanners place the book face up in a v-shaped frame, and photograph the pages from above. Pages may be turned by hand or by automated paper transport devices. Glass or plastic sheets are usually pressed against the page to flatten it.

After scanning, software adjusts the document images by aligning, cropping, and retouching them, then converts them to text and to the final e-book form. Human proofreaders usually check the output for errors.
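
As an illustration of the cropping step, the following is a minimal sketch, not the method of any particular scanning product. It assumes a page is available as a grayscale grid of pixel values (0 for black, 255 for white); the names `content_bounds` and `crop_to_content`, the threshold of 200, and the default margin are hypothetical choices made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows: 0 = black, 255 = white


def content_bounds(image: Image, threshold: int = 200) -> Tuple[int, int, int, int]:
    """Return (top, bottom, left, right) of the pixels darker than threshold.

    bottom and right are exclusive. Raises ValueError for a blank page.
    """
    rows = [y for y, row in enumerate(image) if any(p < threshold for p in row)]
    if not rows:
        raise ValueError("page is blank")
    width = len(image[0])
    cols = [x for x in range(width) if any(row[x] < threshold for row in image)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop_to_content(image: Image, margin: int = 1, threshold: int = 200) -> Image:
    """Crop the page to its inked area, keeping a small margin around it."""
    top, bottom, left, right = content_bounds(image, threshold)
    height, width = len(image), len(image[0])
    top = max(0, top - margin)
    left = max(0, left - margin)
    bottom = min(height, bottom + margin)
    right = min(width, right + margin)
    return [row[left:right] for row in image[top:bottom]]


if __name__ == "__main__":
    # A 6x6 white page with two dark pixels near the middle.
    page = [[255] * 6 for _ in range(6)]
    page[2][3] = 0
    page[3][2] = 10
    for row in crop_to_content(page, margin=0):
        print(row)
```

Real post-processing software would also deskew the image and handle noise such as dust specks, which a single fixed threshold does not address.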

Scanning at 118 dots per centimeter (300 dpi) is adequate for conversion to digital text, but archival reproduction of rare, elaborate, or illustrated books uses much higher resolutions. High-end scanners capable of thousands of pages per hour can cost thousands of dollars, but do-it-yourself (DIY) manual book scanners capable of 1,200 pages per hour have been built for US$300.
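
The figures above can be checked with a little arithmetic. The sketch below converts dpi to dots per centimeter, estimates the uncompressed size of one page image, and estimates scanning time at a given throughput. The 6 by 9 inch page, the 8-bit grayscale depth, and the 300-page book are assumptions chosen for illustration, and the function names are hypothetical.

```python
CM_PER_INCH = 2.54


def dpi_to_dots_per_cm(dpi: float) -> float:
    """Convert a resolution in dots per inch to dots per centimeter."""
    return dpi / CM_PER_INCH


def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bytes_per_pixel: int = 1) -> int:
    """Uncompressed size of one scanned page image, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def hours_to_scan(pages: int, pages_per_hour: float) -> float:
    """Time needed to scan a number of pages at a steady throughput."""
    return pages / pages_per_hour


if __name__ == "__main__":
    print(f"{dpi_to_dots_per_cm(300):.1f} dots/cm")          # 300 dpi in metric
    print(f"{raw_page_bytes(6, 9, 300):,} bytes per page")   # assumed 6x9 in page
    print(f"{hours_to_scan(300, 1200):.2f} hours per book")  # assumed 300 pages
```

Under these assumptions a single grayscale page occupies several megabytes before compression, whereas the plain text of the same page is typically only a few kilobytes, which is why OCR output is so much smaller than the page images.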

Commercial book scanners

Sketch of a V-shaped book scanner from Atiz

Sketch of a typical manual book scanner

Commercial book scanners differ from ordinary scanners: they usually consist of a high-quality digital camera with light sources on either side, mounted on a frame that gives a person or machine easy access to turn the pages. Some models use V-shaped book cradles, which support the spine and center the book automatically.

The advantage of this type of scanner is that it is very fast compared with overhead scanners. It is also much more cost-effective than traditional overhead scanners, whose prices normally start at US$10,000.

Book scanning by organizations on a large scale

Projects such as Project Gutenberg, the Million Book Project, Google Book Search, and the Open Content Alliance scan books on a large scale.

One of the main challenges is the sheer volume of books to be scanned. In 2010, the total number of works published as books in human history was estimated at around 130 million.^[1] A universal library would require all of these to be scanned and made searchable online for the public. Large organizations currently rely on three main approaches: outsourcing, scanning in-house using commercial book scanners, and scanning in-house using robotic scanning solutions.

When scanning is outsourced, books are often shipped to low-cost providers in India or China. Alternatively, for reasons of convenience, safety, and improved technology, many organizations choose to scan in-house, using either overhead scanners, which are time-consuming, or digital camera-based scanning solutions, which are substantially faster and are used by the Internet Archive as well as Google. Traditional methods have included cutting off the book's spine, scanning the pages in a scanner with an automatic page feeder, and rebinding the loose pages afterwards.

Once a page is scanned, its text is either entered manually or extracted via OCR, which is another major cost of book scanning projects.

Due to copyright issues, most scanned books are those that are out of copyright; however, Google Book Search is known to scan books still protected under copyright unless the publisher specifically excludes them.

Destructive scanning

For book scanning on a low budget, the least expensive method to scan a book or magazine is to cut off the binding. This converts the book or magazine into a sheaf of looseleaf papers, which can then be loaded into a standard automatic document feeder and scanned using inexpensive and common scanning technology. While this is definitely not a desirable solution for very old and uncommon books, it is a useful tool for book and magazine scanning where the book is not an expensive collector's item and replacement of the scanned content is easy. There are two technical difficulties with this process, first with the cutting and second with the scanning.

Cutting

A stack of 500 to 1000 pages can be cut in one pass with a guillotine paper cutter. This is a large steel table with a paper vise that screws down onto the stack and firmly secures it before cutting. The cut is made by a large sharpened steel blade, which moves straight down and cuts the entire length of each sheet at once. A lever on the blade permits several hundred pounds of force to be applied for a quick one-pass cut.

A clean cut through a thick stack of paper cannot be made with a traditional inexpensive sickle-shaped hinged paper cutter. These cutters are only intended for a few sheets, with up to ten sheets being the practical cutting limit. A large stack of paper applies torsional forces on the hinge, pulling the blade away from the cutting edge on the table. The cut becomes more inaccurate as the cut moves away from the hinge, and the force required to hold the blade against the cutting edge increases as the cut moves away from the hinge.

The guillotine cutting process dulls the blade over time, so it must be resharpened. Coated paper such as slick magazine paper dulls the blade more quickly than plain book paper because of its kaolinite clay coating. Additionally, cutting the binding off an entire hardcover book causes excessive wear, because the blade must cut through the cover's stiff backing material. Instead, the outer cover can be removed so that only the interior pages need to be cut.

Scanning

Once the paper is liberated from the spine, it can be scanned one sheet at a time using a traditional flatbed scanner or automatic document feeder (ADF).

Pages with a decorative riffled edging or curving in an arc due to a non-flat binding can be difficult to scan using an ADF. An ADF is designed to scan pages of uniform shape and size, and variably sized or shaped pages can lead to improper scanning. The riffled edges or curved edge can be guillotined off to render the outer edges flat and smooth before the binding is cut.

The coated paper of magazines and bound textbooks can be difficult for the rollers in an ADF to pick up and guide along the paper path. An ADF that uses a series of rollers and channels to flip sheets over may jam or misfeed when fed coated paper. Generally, a paper path that is as straight as possible, with few bends and curves, causes fewer problems. The clay can also rub off the paper over time and coat the sticky pickup rollers, loosening their grip on the paper. The ADF rollers may need periodic cleaning to prevent this slipping.

Magazines can pose a bulk-scanning challenge because of small, nonuniform sheets in the stack, such as subscription cards and fold-out pages. These need to be removed before the bulk scan begins, and are either scanned separately if they include worthwhile content or simply left out of the scan.

A Test Case: PGP

In 1995, Phil Zimmermann published PGP Source Code and Internals as a $60 hardbound book, which under the First Amendment could legally be shipped abroad. The buyer could either display it in a library or destructively scan it so that the source code could be compiled with freely available GNU software into the Pretty Good Privacy (PGP) cryptosystem, which the U.S. government regarded as a restricted munition. Zimmermann was under criminal investigation for distributing PGP software and wanted to test the law in the courts. It was not directly tested, but export restrictions have since eased: it is legal to export PGP anywhere except the seven countries and the specified groups and individuals to which nothing can be exported from the U.S.

Non-destructive scanning

An example of a DIY non-destructive book scanner/digitizer with a book-face-down design, which lets gravity flatten the pages.

In recent years, software-driven machines and robots have been developed to scan books without disbinding them, preserving the document while creating a digital image archive of its current state. This trend is due in part to ever-improving imaging technologies, which allow a high-quality archival image to be captured in a reasonably short time with little or no damage to a rare or fragile book.

Some high-end scanning systems use vacuum, air, and static charges to turn pages while imaging is performed automatically, usually by a high-resolution camera located above an adjustable V-shaped cradle. Images are then passed from the imaging device to editing software, which can process them further into either an archival-quality file such as TIFF or JPEG 2000, or a web-friendly output such as JPEG or PDF.

Google's patent 7508978 describes an infrared camera technology that detects the three-dimensional shape of the page and automatically corrects for it.^[2] Researchers from the University of Tokyo have an experimental non-destructive book scanner^[3] that includes a 3D surface scanner, allowing images of a curved page to be straightened in software. The book or magazine can thus be scanned as quickly as the operator can flip through the pages, about 200 pages per minute.
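
To give a rough idea of how a measured page shape can be used to straighten an image, here is a simplified one-dimensional sketch. It is not the method of the Google patent or of the University of Tokyo scanner; it assumes only that, for one image row, a 3D sensor supplies the page height at each pixel. The row is then resampled at uniform steps along the page surface (its arc length) instead of uniform steps along the camera's horizontal axis. The function names and the linear interpolation are illustrative choices.

```python
import math
from typing import List, Sequence


def arc_lengths(heights: Sequence[float], dx: float = 1.0) -> List[float]:
    """Cumulative distance along the page surface at each height sample."""
    lengths = [0.0]
    for z0, z1 in zip(heights, heights[1:]):
        lengths.append(lengths[-1] + math.hypot(dx, z1 - z0))
    return lengths


def flatten_row(pixels: Sequence[float], heights: Sequence[float],
                dx: float = 1.0) -> List[float]:
    """Resample one image row at uniform steps of dx along the page surface."""
    if len(pixels) != len(heights) or len(pixels) < 2:
        raise ValueError("need matching profiles with at least two samples")
    s = arc_lengths(heights, dx)
    n_out = int(round(s[-1] / dx)) + 1
    out = []
    j = 0  # index of the surface segment containing the current target
    for i in range(n_out):
        target = min(i * dx, s[-1])
        while j < len(s) - 2 and s[j + 1] < target:
            j += 1
        # Linear interpolation between the two neighbouring samples.
        t = (target - s[j]) / (s[j + 1] - s[j])
        out.append(pixels[j] * (1 - t) + pixels[j + 1] * t)
    return out


if __name__ == "__main__":
    # A page rising steadily away from the camera axis: each 3-unit
    # horizontal step covers 5 units of paper (a 3-4-5 triangle).
    print(arc_lengths([0, 4, 8], dx=3))
    print(flatten_row([0, 50, 100], [0, 4, 8], dx=3))
```

The flattened row has more samples than the input because the curved surface is longer than its projection onto the camera plane; a flat page (all heights equal) is returned unchanged. A full dewarping system would work on the two-dimensional surface and also correct for perspective.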
