All text is available under the terms of the GNU Free Documentation License.

Optical character recognition

From Wikipedia, the free encyclopedia


Optical character recognition, usually abbreviated to OCR, is computer software designed to translate images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text, or to translate pictures of characters into a standard encoding scheme representing them (e.g. ASCII or Unicode). OCR began as a field of research in pattern recognition, artificial intelligence and machine vision. Though academic research in the field continues, the focus on OCR has shifted to implementation of proven techniques.

Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields. Because very few applications survive that use true optical techniques, the term optical character recognition has been broadened to cover digital character recognition as well.

Early systems required "training" (essentially, the provision of known samples of each character) to read a specific font. "Intelligent" systems that can recognize most fonts with a high degree of accuracy are now common. Some systems can even reproduce formatted output that closely approximates the original scanned page, including images, columns and other non-textual components.


In 1929, G. Tauschek obtained a patent on OCR in Germany, followed by Handel, who obtained a US patent on OCR in 1933 (U.S. Patent 1,915,993). In 1935 Tauschek was also granted a US patent on his method (U.S. Patent 2,026,329).

Tauschek's machine was a mechanical device that used templates. A photodetector was placed so that when the template and the character to be recognised were lined up for an exact match and a light was directed towards them, no light would reach the photodetector.
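
The masking principle described above can be sketched in software. This is only an illustration, not a reconstruction of the patented mechanism: the 3x3 binary glyphs, the two templates, and the function names are assumptions invented for the example. A template opening lets light through unless the character's ink covers it, and a match is declared when no light leaks.

```python
# Hypothetical 3x3 glyphs: in a template, 1 marks an opening;
# in a scanned character, 1 marks ink. These shapes are illustrative only.
TEMPLATES = {
    "L": [(1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)],
    "T": [(1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)],
}


def leaked_light(template, glyph):
    """Count template openings that the glyph's ink fails to cover."""
    return sum(
        1
        for t_row, g_row in zip(template, glyph)
        for t, g in zip(t_row, g_row)
        if t == 1 and g == 0
    )


def recognise(glyph):
    """Return the labels of all templates through which no light leaks."""
    return [label for label, template in TEMPLATES.items()
            if leaked_light(template, glyph) == 0]


scanned = [(1, 0, 0),
           (1, 0, 0),
           (1, 1, 1)]
print(recognise(scanned))
```

One limitation of this simplified model is that a glyph with extra ink also blocks every opening of a smaller template, so it can match more than one label.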

In 1950, David Shepard, a cryptanalyst at the Armed Forces Security Agency in the United States, was asked by Frank Rowlett, who had broken the Japanese PURPLE diplomatic code, to work with Dr. Louis Tordella to recommend data automation procedures for the Agency. This included the problem of converting printed messages into machine language for computer processing. Shepard decided it must be possible to build a machine to do this, and, with the help of Harvey Cook, a friend, built "Gismo" in his attic during evenings and weekends. This was reported in the Washington Daily News on April 27, 1951 and in the New York Times on December 26, 1953 after his U.S. Patent Number 2,663,758 was issued. Shepard then founded Intelligent Machines Research Corporation (IMR), which went on to deliver the world's first several OCR systems used in commercial operation. While both Gismo and the later IMR systems used image analysis, as opposed to character matching, and could accept some font variation, Gismo was limited to reasonably close vertical registration, whereas the following commercial IMR scanners analyzed characters anywhere in the scanned field, a practical necessity on real world documents.

The first commercial system was installed at Reader's Digest in 1955. Many years later, Reader's Digest donated it to the Smithsonian, where it was put on display. The second system was sold to the Standard Oil Company of California for reading credit card imprints for billing purposes, with many more systems sold to other oil companies. Other systems sold by IMR during the late 1950s included a bill stub reader for the Ohio Bell Telephone Company and a page scanner for the United States Air Force, which read typewritten messages and transmitted them by teletype. IBM and others later licensed Shepard's OCR patents.

The United States Postal Service has been using OCR machines to sort mail since 1965, based on technology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in Europe was by the British General Post Office, or GPO. In 1965 it began planning an entire banking system, the National Giro, using OCR technology, a process that revolutionized bill payment systems in the UK. Canada Post has been using OCR systems since 1971. OCR systems read the name and address of the addressee at the first mechanized sorting center, and print a routing bar code on the envelope based on the postal code. After that, later centers can sort the letters with less expensive sorters that need only read the bar code. To avoid interference with the human-readable address field, which can be located anywhere on the letter, the bar code is printed in a special ink that is clearly visible under ultraviolet light and looks orange in normal lighting conditions. Envelopes marked with the machine-readable bar code may then be processed.

Current state of OCR technology

Typewritten OCR

The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem.

Recognition of hand printing, cursive handwriting, and even the printed, typewritten versions of some other scripts (especially those with a very large number of characters) is still the subject of active research.

Hand print OCR

Systems for recognizing hand-printed text on the fly have enjoyed commercial success in recent years. Among these are the input devices for personal digital assistants such as those running Palm OS. The Apple Newton pioneered this technology. The algorithms used in these devices take advantage of the fact that the order, speed, and direction of individual line segments at input are known. Also, the user can be retrained to use only specific letter shapes. These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem. Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited contexts. This variety of OCR is now commonly known in the industry as "ICR" (intelligent character recognition).
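
The link between a per-character accuracy rate and the number of errors on a page is simple arithmetic. The sketch below assumes a page of 400 hand-printed characters, a figure chosen only for illustration and not taken from the article; the function name is likewise hypothetical.

```python
def expected_errors(characters_per_page, accuracy):
    """Expected number of misread characters on one page."""
    return characters_per_page * (1.0 - accuracy)


# Assumed page size: 400 hand-printed characters (illustrative only).
for accuracy in (0.80, 0.90):
    errors = expected_errors(400, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors:.0f} errors per page")
```

Under that assumption, even the upper end of the quoted accuracy range leaves tens of characters per page to be corrected by hand.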

Cursive OCR

Recognition of cursive text is an active area of research, with recognition rates even lower than those of hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, for example, allowing greater accuracy. The shapes of individual cursive characters themselves simply do not contain enough information to recognize all handwritten cursive script accurately (at a rate greater than 98%).
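
The benefit of a small dictionary can be illustrated with a short sketch. It assumes a character-level recognizer has already produced noisy words, and it snaps each one to the closest entry in a restricted lexicon using Python's standard `difflib` module. The lexicon, the sample misreadings, and the function name are hypothetical and invented for the example; this is not the method of any particular cheque reader.

```python
import difflib

# Small illustrative lexicon for a cheque's written amount line
# (an assumption for this sketch; a real lexicon would be larger).
AMOUNT_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "twenty", "thirty", "forty", "fifty", "hundred",
    "thousand", "and", "dollars",
]


def snap_to_lexicon(raw_word, lexicon=AMOUNT_WORDS, cutoff=0.6):
    """Return the lexicon word most similar to raw_word, or None."""
    matches = difflib.get_close_matches(
        raw_word.lower(), lexicon, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Hypothetical character-level output containing misread letters.
raw = ["thrce", "hundrcd", "and", "fcrty", "dollars"]
print([snap_to_lexicon(word) for word in raw])
```

Because only a few words are admissible, a misread such as "thrce" has a single plausible correction; with a full dictionary many more candidates would compete.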

Music OCR

Early research into recognition of printed sheet music was performed in the mid-1970s at MIT and other institutions. Successive efforts were made to localize and remove musical staff lines, leaving symbols to be recognized and parsed. The first proprietary music-scanning program, MIDISCAN, was released in 1991. Three proprietary products are now available, but music OCR software does not recognize handwritten scores.


One area where the accuracy and speed of computer input of character information exceed those of humans is magnetic ink character recognition, where error rates are around one read error for every 20,000 to 30,000 checks.

Other research areas

A particularly difficult problem for computers and humans is that of old church baptismal and marriage records containing mostly names. The pages may be damaged by age, water or fire, and the names may be obsolete or contain rare spellings. Another research area is cooperative approaches, where computers assist humans and vice versa. Computer image processing techniques can assist humans in reading extremely difficult texts such as the Archimedes Palimpsest or the Dead Sea Scrolls.

For more complex recognition problems, neural networks are commonly used, as they can generally be made indifferent to both affine and non-linear transformations.[1]

A related area is raster to vector conversion, converting bitmap images (for example, maps including drawings, text, and map symbols) into vector graphics that are easier to work with.

Optical Character Recognition in Unicode

In Unicode, the Optical Character Recognition symbol characters are placed in the hexadecimal range 0x2440 to 0x245F (see also Unicode Symbols).
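
The characters in this range can be inspected with Python's standard `unicodedata` module. The sketch below lists every code point in the block that has an assigned name. Which code points are assigned depends on the Unicode version bundled with the interpreter, and the variable names are illustrative.

```python
import unicodedata

# The Optical Character Recognition block: U+2440 through U+245F inclusive.
OCR_BLOCK = range(0x2440, 0x2460)

assigned = {}
for code_point in OCR_BLOCK:
    # name() returns the supplied default (None) for unassigned code points.
    name = unicodedata.name(chr(code_point), None)
    if name is not None:
        assigned[code_point] = name

for code_point, name in assigned.items():
    print(f"U+{code_point:04X}  {chr(code_point)}  {name}")
```

Code points in the block that have no name in the interpreter's Unicode database are skipped rather than printed.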


Proprietary software

  • Abbyy FineReader - growing in the market. In recent years it has been the default OCR software bundled with many scanner brands.
  • Cuneiform - well known and regarded by many as the most accurate OCR algorithm.
  • Intelliant OCR - a command-line OCR utility, based on Tiger OCR.
  • OCR Document Readers - high-performance readers from Adaptive Recognition Hungary.
  • OmniPage - for years the most recognized OCR software suite and the market leader. Received the PC Magazine Editor's Choice award in 2003.
  • Readiris - reads European languages, Arabic, Hebrew, and Asian languages.
  • RecoStar - a high-performance OCR engine.
  • SimpleOCR - a relatively simple freeware program (supports English, French and Dutch language recognition).
  • SmartZone OCR - offers developers the ability to perform zonal OCR.
  • TeleForm - for capturing data from handwritten forms.
  • TextBridge - bundled with many scanners; simpler and with fewer resources than its sister product OmniPage.

Free and open source software

  • Simple OCR Royalty Free
  • GOCR - included in Debian and other distributions
  • ISRI Software - some experimental OCR tools
  • GNU Ocrad "is an OCR [...] program based on a feature extraction method".
  • OCRchie - dormant since 1996
  • OOCR - an OCR program still in development, under the GPL.
  • phpOCR - a base implementation for an OCR tool in PHP
  • Tesseract - an open source OCR engine, initially developed by HP and released under the Apache License, Version 2.0. It can be compiled using MSVC 6.0 or GCC.

See also

  • Automatic number plate recognition
  • Barcode and barcode scanners
  • Captcha
  • Computer vision
  • Digital image processing
  • ICR
  • Machine learning
  • Machine vision
  • Magnetic ink character recognition (MICR)
  • Mapping of Unicode characters
  • Optical mark recognition (OMR)
  • Pattern recognition
  • Raymond Kurzweil
  • Speech recognition
  • SmartScore

External links

  • ICDAR - one of the most comprehensive conferences on all aspects of document recognition, including OCR; it is held every two years.
  • DRR - SPIE DRR is an annual conference on OCR and document retrieval.
  • Reference OCR Engine - an open-source OCR project.
  • GOCR - an OCR program developed under the GPL.