- Great Painters
- Accounting
- Fundamentals of Law
- Marketing
- Shorthand
- Concept Cars
- Videogames
- The World of Sports

- Blogs
- Free Software
- Google
- My Computer

- PHP Language and Applications
- Wikipedia
- Windows Vista

- Education
- Masterpieces of English Literature
- American English

- English Dictionaries
- The English Language

- Medical Emergencies
- The Theory of Memory
- The Beatles
- Dances
- Microphones
- Musical Notation
- Music Instruments
- Batteries
- Nanotechnology
- Cosmetics
- Diets
- Vegetarianism and Veganism
- Christmas Traditions
- Animals

- Fruits And Vegetables


  1. Adobe Reader
  2. Adware
  3. Altavista
  4. AOL
  5. Apple Macintosh
  6. Application software
  7. Arrow key
  8. Artificial Intelligence
  9. ASCII
  10. Assembly language
  11. Automatic translation
  12. Avatar
  13. Babylon
  14. Bandwidth
  15. Bit
  16. BitTorrent
  17. Black hat
  18. Blog
  19. Bluetooth
  20. Bulletin board system
  21. Byte
  22. Cache memory
  23. Celeron
  24. Central processing unit
  25. Chat room
  26. Client
  27. Command line interface
  28. Compiler
  29. Computer
  30. Computer bus
  31. Computer card
  32. Computer display
  33. Computer file
  34. Computer games
  35. Computer graphics
  36. Computer hardware
  37. Computer keyboard
  38. Computer networking
  39. Computer printer
  40. Computer program
  41. Computer programmer
  42. Computer science
  43. Computer security
  44. Computer software
  45. Computer storage
  46. Computer system
  47. Computer terminal
  48. Computer virus
  49. Computing
  50. Conference call
  51. Context menu
  52. Creative commons
  53. Creative Commons License
  54. Creative Technology
  55. Cursor
  56. Data
  57. Database
  58. Data storage device
  59. Debuggers
  60. Demo
  61. Desktop computer
  62. Digital divide
  63. Discussion groups
  64. DNS server
  65. Domain name
  66. DOS
  67. Download
  68. Download manager
  69. DVD-ROM
  70. DVD-RW
  71. E-mail
  72. E-mail spam
  73. File Transfer Protocol
  74. Firewall
  75. Firmware
  76. Flash memory
  77. Floppy disk drive
  78. GNU
  79. GNU General Public License
  80. GNU Project
  81. Google
  82. Google AdWords
  83. Google bomb
  84. Graphics
  85. Graphics card
  86. Hacker
  87. Hacker culture
  88. Hard disk
  89. High-level programming language
  90. Home computer
  91. HTML
  92. Hyperlink
  93. IBM
  94. Image processing
  95. Image scanner
  96. Instant messaging
  97. Instruction
  98. Intel
  99. Intel Core 2
  100. Interface
  101. Internet
  102. Internet bot
  103. Internet Explorer
  104. Internet protocols
  105. Internet service provider
  106. Interoperability
  107. IP addresses
  108. IPod
  109. Joystick
  110. JPEG
  111. Keyword
  112. Laptop computer
  113. Linux
  114. Linux kernel
  115. Liquid crystal display
  116. List of file formats
  117. List of Google products
  118. Local area network
  119. Logitech
  120. Machine language
  121. Mac OS X
  122. Macromedia Flash
  123. Mainframe computer
  124. Malware
  125. Media center
  126. Media player
  127. Megabyte
  128. Microsoft
  129. Microsoft Windows
  130. Microsoft Word
  131. Mirror site
  132. Modem
  133. Motherboard
  134. Mouse
  135. Mouse pad
  136. Mozilla Firefox
  137. Mp3
  138. MPEG
  139. MPEG-4
  140. Multimedia
  141. Musical Instrument Digital Interface
  142. Netscape
  143. Network card
  144. News ticker
  145. Office suite
  146. Online auction
  147. Online chat
  148. Open Directory Project
  149. Open source
  150. Open source software
  151. Opera
  152. Operating system
  153. Optical character recognition
  154. Optical disc
  155. output
  156. PageRank
  157. Password
  158. Pay-per-click
  159. PC speaker
  160. Peer-to-peer
  161. Pentium
  162. Peripheral
  163. Personal computer
  164. Personal digital assistant
  165. Phishing
  166. Pirated software
  167. Podcasting
  168. Pointing device
  169. POP3
  170. Programming language
  171. QuickTime
  172. Random access memory
  173. Routers
  174. Safari
  175. Scalability
  176. Scrollbar
  177. Scrolling
  178. Scroll wheel
  179. Search engine
  180. Security cracking
  181. Server
  182. Simple Mail Transfer Protocol
  183. Skype
  184. Social software
  185. Software bug
  186. Software cracker
  187. Software library
  188. Software utility
  189. Solaris Operating Environment
  190. Sound Blaster
  191. Soundcard
  192. Spam
  193. Spamdexing
  194. Spam in blogs
  195. Speech recognition
  196. Spoofing attack
  197. Spreadsheet
  198. Spyware
  199. Streaming media
  200. Supercomputer
  201. Tablet computer
  202. Telecommunications
  203. Text messaging
  204. Trackball
  205. Trojan horse
  206. TV card
  207. Unicode
  208. Uniform Resource Identifier
  209. Unix
  210. URL redirection
  211. USB flash drive
  212. USB port
  213. User interface
  214. Vlog
  215. Voice over IP
  216. Warez
  217. Wearable computer
  218. Web application
  219. Web banner
  220. Web browser
  221. Web crawler
  222. Web directories
  223. Web indexing
  224. Webmail
  225. Web page
  226. Website
  227. Wiki
  228. Wikipedia
  229. WIMP
  230. Windows CE
  231. Windows key
  232. Windows Media Player
  233. Windows Vista
  234. Word processor
  235. World Wide Web
  236. Worm
  237. XML
  238. X Window System
  239. Yahoo
  240. Zombie computer

This article is from:

All text is available under the terms of the GNU Free Documentation License: 

Speech recognition

From Wikipedia, the free encyclopedia


Speech recognition (in many contexts also known as 'automatic speech recognition', computer speech recognition or erroneously as Voice Recognition) is the process of converting a speech signal to a sequence of words, by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged over the last years include voice dialing (e.g., Call home), call routing (e.g., I would like to make a collect call), simple data entry (e.g., entering a credit card number), and preparation of structured documents (e.g., a radiology report).

Voice Verification or speaker recognition is a related process that attempts to identify the person speaking, as opposed to what is being said.

Speech recognition technology

In terms of technology, most of the technical text books nowadays emphasize the use of Hidden Markov Model as the underlying technology. The dynamic programming approach, the neural network-based approach and the knowledge-based learning approach have been studied intensively in the 1980s and 1990s.

Performance of speech recognition systems

The performance of a speech recognition systems is usually specified in terms of accuracy and speed. Accuracy is measured with the word error rate, whereas speed is measured with the real time factor.

Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. Part of the confusion mainly comes from the mixed usage of the term speech recognition and dictation.

Speaker-dependent dictation systems requiring a short period of training can capture continuous speech with a large vocabulary at normal pace with a very high accuracy. Most commercial companies claim that recognition software can achieve between 98% to 99% accuracy (getting one to two words out of one hundred wrong) if operated under optimal conditions. These optimal conditions usually means the test subjects have 1) matching speaker characteristics with the training data, 2) proper speaker adaptation, and 3) clean environment (e.g. office space). (This explains why some users, especially accented, might actually find that the recognition rate could be perceptually much lower than the expected 98% to 99%).

Other, limited vocabulary, systems requiring no training can recognize a small number of words (for instance, the ten digits) from most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.

Noisy channel formulation of statistical speech recognition

Many modern approaches such as HMM-based and ANN-based speech recognition are based on noisy channel formulation (See also Alternative formulation of speech recognition). In that view, the task of a speech recognition system is to search for the most likely word sequence given the acoustic signal. In other words, the system is searching for the most likely word sequence \tilde{W} among all possible word sequences W * from the acoustic signal A (what some will call the observation sequence according to the Hidden Markov Model terminology).

\tilde{W} = arg max_{W \in W^*} \Pr(W | A)

Based on Bayes' rule, the above formulation could be rewritten as

\tilde{W} = arg max_{W \in W^*} \frac{\Pr(A |W) \Pr(W)}{\Pr(A)}

Because the acoustic signal is common regardless of which word sequence chosen, the above could be usually simplified to

\tilde{W} = arg max_{W \in W^*} \Pr(A |W) \Pr(W)

The term \Pr(A|W) is generally called acoustic model. The term \Pr(W) is generally known as language model.

Both acoustic modeling and language modeling are important studies in modern statistical speech recognition. In this entry, we will focus on explaining the use of hidden Markov model (HMM) because notably it is very widely used in many systems. (Language modeling has many other applications such as smart keyboard and document classification; please refer to the corresponding entries.)

Approaches of statistical speech recognition

Hidden Markov model (HMM)-based speech recognition

Modern general-purpose speech recognition systems are generally based on hidden Markov models (HMMs). This is a statistical model which outputs a sequence of symbols or quantities.

One possible reason why HMMs are used in speech recognition is that a speech signal could be viewed as a piece-wise stationary signal or a short-time stationary signal. That is, one could assume in a short-time in the range of 10 milliseconds, speech could be approximated as a stationary process. Speech could thus be thought as a Markov model for many stochastic processes (known as states).

Another reason why HMMs are popular is because they can be trained automatically and are simple and computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden Markov model would output a sequence of n-dimensional real-valued vectors with n around, say, 13, outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal covariance Gaussians which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems), each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.

The above is a very brief introduction to some of the more central aspects of speech recognition. Modern speech recognition systems use a host of standard techniques which it would be too time consuming to properly explain, but just to give a flavor, a typical large-vocabulary continuous system would probably have the following parts. It would need context dependency for the phones (so phones with different left and right context have different realizations); to handle unseen contexts it would need tree clustering of the contexts; it would of course use cepstral normalization to normalize for different recording conditions and depending on the length of time that the system had to adapt on different speakers and conditions it might use cepstral mean and variance normalization for channel differences, vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use LDA followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform (MLLT)). A serious company with a large amount of training data would probably want to consider discriminative training techniques like maximum mutual information (MMI), MPE, or (for short utterances) MCE, and if a large amount of speaker-specific enrollment data was available a more wholesale speaker adaptation could be done using MAP or, at least, tree-based maximum likelihood linear regression. Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, but there is a choice between dynamically creating combination hidden Markov models which includes both the acoustic and language model information, or combining it statically beforehand (the AT&T approach, for which their FSM toolkit might be useful). Those who value their sanity might consider the AT&T approach, but be warned that it is memory hungry.

Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. Rather than being used in general-purpose speech recognition applications they can handle low quality, noisy data and speaker independence. Such systems can achieve greater accuracy than HMM based systems, as long as there is training data and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, but generally the results are better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden markov model part for language modeling.

Dynamic time warping (DTW)-based speech recognition

Main article: Dynamic time warping



Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics -- indeed, any data which can be turned into a linear representation can be analysized with DTW.

A well known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.

Knowledge-based speech recognition

This method uses a stored data base of commands that compares simple words with ones in the data base.

For further information

Popular speech recognition conferences held each year or two include ICASSP, Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU. Conferences in the field of Natural Language Processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (now named IEEE Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech Communication. Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not be fully up to date (1993). Another good source can be "Statistical Methods for Speech Recognition" by Frederick Jelinek which is a more up to date book (1998). Keep an eye on government sponsored competitions such as those organised by DARPA (the telephone speech evaluation was most recently known as Rich Transcription). In terms of freely available resources, the HTK book (and the accompanying HTK toolkit) is one place to start to both learn about speech recognition and to start experimenting (if you are very brave). You could also search for Carnegie Mellon University's SPHINX toolkit.

Applications of speech recognition

  • Command recognition - Voice user interface with the computer
  • Dictation
  • Interactive Voice Response
  • Automotive speech recognition
  • Medical Transcription [1]
  • Pronunciation Teaching in computer-aided language learning applications
  • Automatic Translation

See also

  • Guided Speech IVR
  • Speech processing
  • Audio visual speech recognition
  • Speech verification
  • Speaker identification
  • Speech synthesis
  • Speech Analytics
  • Keyword spotting
  • VoiceXML
  • Macfarlane's Law - the conflict between typing and reading speed anticipated the importance of speech recognition


  • "Survey of the State of the Art in Human Language Technology (1997) by Ron Cole et all"


  • Multilingual Speech Processing, Edited by Tanja Schultz and Katrin Kirchhoff, April 2006--Researchers and developers in industry and academia with different backgrounds but a common interest in multilingual speech processing will find an excellent overview of research problems and solutions detailed from theoretical and practical perspectives.---CH 1: Introduction / CH 2: Language Characteristics / CH 3: Linguistic Data Resources / CH 4: Multilingual Acoustic Modeling / CH 5: Multilingual Dictionaries / CH 6: Multilingual Language Modeling / CH 7: Multilingual Speech Synthesis / CH 8: Automatic Language Identification / CH 9: Other Challenges /

External links

  • NIST Speech Group
  • Sphinx Open Source Speech Recognition Engine
  • Entropic/Cambridge Hidden Markov Model Toolkit
  • Julius Open Source Speech Recognition Engine
  • The SPRACHcore software package
  • Open CV library, especially the multi-stream speech and vision combination programs
  • Xvoice: Speech control of X applications
  • LT-World: Portal to information and resources on the internet
  • LDC The Linguistic Data Consortium
  • Evaluations and Language resources Distribution Agency
  • OLAC Open Language Archives Community
  • BAS Bavarian Archive for Speech Signals
  • VoxForge - Free GPL Speech Corpus and Acoustic Model repository
Retrieved from ""