Speech Recognition

This learning resource is about automatic conversion of spoken language into text, that can be stored as documents or processed as commands to control devices e.g. for handicapped people or elderly people or in a commercial setting allows to order goods and services by audio commands. The learning resource is based on the Open Community Approach so the used tools are Open Source to assure that learner have access to the tools.

Learning Tasks

From analog audio via mircophone audio interface to digital audio - step 1 before speech analysis

(Applications of Speech Recognition) Analyse the possible applications of speech recognition and identify challenges of the application!
(Human Speech Recognition) Compare human comprehension of speech with the algorithmic speech recognition approach. What are the similarities and differences of human and algorithmic speech recognition?
(Speech and Detection of Emotions) Speech contains more information than the encoded text. Is it possible to detect emotions in the speech with methods developed in computer science?
- What are similarities and difference between text and emotion recognition in speech analysis?
- What are possible application areas in digital assitants for both speech recognition and emotion recognition?
- Analyze the different types of information systems and identify different areas of application of speech recognition and include mobile devices in your consideration!
(History) Analyse the history of speech recognition and compare the steps of development with current applications. Identify the major steps that are required for the current applications of speech recognition!
(Risk Literacy) Identify possible areas of risks and possible risk mitigation strategies if speech recognition is implemented in mobile devices, or with voice control for Internet of Things in general? What are required capacity building measures for business, research and development!
(Commercial Data Harvesting) Apply the concept of speech recognition to commercial data harvesting. What are potential benefits for generation of tailored advertisments for the users according to their generated profile? How is speech recognition contributing to user profile? What is the difference between offline and online speech recognition systems due to submission of recognized text or audio files submitted to remote servers for speech recognition?
(Context Awareness of Speech Recognition) The word "Fire" with a candle in your hand and with burning house in the background creates a different context and different expectations of people listening to what someone is going to tell you. Exlain why context awareness can be helpful to optimize the recognition correctness? How can a speech recognition system detect a context to the speech recognition. I.e. detecting the context without a user setting that switches to a dictation mode e.g. for medical report for X-Ray images.
(Audio-Video-Compression) Go to the learning resource about Audio-Video-Compression and explain how Speech Recognition can be used in conjunction with Speech Synthesis to reduce the consumption of bandwidth for Video conferencing.
(Performance) Explain why the performance of speech recognition and accurancy is relevant in many applications. Discuss application in cars or in general in vehicles. Which voice commands can be applied in a traffic situation and which command (not accurately recognized) could cause trouble or even an accident for the driver. Order the theortical application of speech recognition (e.g. "turn right at crossing", "switch on/off music",...) in terms of required performance and accuracy resp. to current available technologies to perform the command in an acceptable way.
(HTML5 Speech Recognition) Analyze the source code of the OpenSource web application demo with PocketSphinx (use browser Firefox/Chromium or Chrome).
- Explain how the recognized words are encoded for speech recognition in the demo application (digits, cities, operating systems).
- Explain how the concept of speech recognition can support handicapped people^[1] with navigating in a WebApp or offline AppLSAC for digital learning environments.
(Size of Vocabulary) Explain how the size of the recognized vocabulary determines the precision of recognition.
(People with Disabilities)^[2] Explore the available frameworks Open Source offline infrastructure for speech recognition without sending audio streams to a remote server for processing. Identify options to control robots or in the context of Ambient Assisted Living with voice recognition^[3].
(Version Control) Explore the concept of Version Control and apply that specifically to the Open Community Approach:
- Collaborative development of the Open Source code base of the speech recognition infrastructure,
- Application on the collaborative development of a domain specific vocabulary for speech recognition for specific application scenarios.
- Application on Open Educational Resources that support learners in using speech recognition and Open Source developers in integrating Open Source frameworks into learning environments.

Definition

Speech recognition is the interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enables the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the linguistics, computer science, and electrical engineering fields.

Training of Speech Recognition Algorithms

Some speech recognition systems require "training" (also called "enrollment") where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker independent"^[4] systems. Systems that use training are called "speaker dependent".

Applications

Speech recognition applications include voice user interfaces such as voice dialing (e.g. "call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search (e.g. find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g. a radiology report), determining speaker characteristics,^[5] speech-to-text processing (e.g., word processors emails, and generating a string-searchable transcript from an audio track), and aircraft (usually termed direct voice input).

The term voice recognition^[6]^[7]^[8] or speaker identification^[9]^[10] refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice or it can be used to authenticate or verify the identity of a speaker as part of a security process.

From the technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. The advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems.

Models, methods, and algorithms

Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling is also used in many other natural language processing applications such as document classification or statistical machine translation.

Neural Networks

End-to-End Automated Speech Recognition

Learning Task: Applications

The following learning tasks focus on different applications of Speech Recognition. Explore the different applications.

Usage in education and daily life

For language learning, speech recognition can be useful for learning a second language. It can teach proper pronunciation, in addition to helping a person develop fluency with their speaking skills.^[11]

Students who are blind (see Blindness and education) or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding with their voice, instead of having to look at the screen and keyboard.^[12]

Students who are physically disabled or suffer from Repetitive strain injury/other injuries to the upper extremities can be relieved from having to worry about handwriting, typing, or working with scribe on school assignments by using speech-to-text programs. They can also utilize speech recognition technology to freely enjoy searching the Internet or using a computer at home without having to physically operate a mouse and keyboard.^[12]

Speech recognition can allow students with learning disabilities to become better writers. By saying the words aloud, they can increase the fluidity of their writing, and be alleviated of concerns regarding spelling, punctuation, and other mechanics of writing.^[13] Also, see Learning disability.

Use of voice recognition software, in conjunction with a digital audio recorder and a personal computer running word-processing software has proven to be positive for restoring damaged short-term-memory capacity, in stroke and craniotomy individuals.

Further applications

Aerospace (e.g. space exploration, spacecraft, etc.) NASA's Mars Polar Lander used speech recognition technology from Sensory, Inc. in the Mars Microphone on the Lander^[14]
Automatic subtitling with speech recognition
Automatic emotion recognition^[15]
Automatic translation
Court reporting (Real time Speech Writing)
eDiscovery (Legal discovery)
Hands-free computing: Speech recognition computer user interface
Home automation
Interactive voice response
Mobile telephony, including mobile email
Multimodal interaction
Pronunciation evaluation in computer-aided language learning applications
Real Time Captioning[citation needed]
Robotics
Speech to text (transcription of speech into text, real time video captioning, Court reporting )
Telematics (e.g. vehicle Navigation Systems)
Transcription (digital speech-to-text)
Video games, with Tom Clancy's EndWar and Lifeline as working examples
Virtual assistant (e.g. Apple's Siri)

Further information

Conferences and journals

Popular speech recognition conferences held each year or two include SpeechTEK and SpeechTEK Europe, ICASSP, Interspeech/Eurospeech, and the IEEE ASRU. Conferences in the field of natural language processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (later renamed IEEE Transactions on Audio, Speech and Language Processing and since Sept 2014 renamed IEEE/ACM Transactions on Audio, Speech and Language Processing—after merging with an ACM publication), Computer Speech and Language, and Speech Communication.

Books

Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not be fully up to date (1993). Another good source can be "Statistical Methods for Speech Recognition" by Frederick Jelinek and "Spoken Language Processing (2001)" by Xuedong Huang etc. More up to date are "Computer Speech", by Manfred R. Schroeder, second edition published in 2004, and "Speech Processing: A Dynamic and Optimization-Oriented Approach" published in 2003 by Li Deng and Doug O'Shaughnessey. The recently updated textbook Speech and Language Processing (2008) by Jurafsky and Martin presents the basics and the state of the art for ASR. Speaker recognition also uses the same features, most of the same front-end processing, and classification techniques as is done in speech recognition. A most recent comprehensive textbook, "Fundamentals of Speaker Recognition" is an in depth source for up to date details on the theory and practice.^[16] A good insight into the techniques used in the best modern systems can be gained by paying attention to government sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).

A good and accessible introduction to speech recognition technology and its history is provided by the general audience book "The Voice in the Machine. Building Computers That Understand Speech" by Roberto Pieraccini (2012).

The most recent book on speech recognition is Automatic Speech Recognition: A Deep Learning Approach (Publisher: Springer) written by D. Yu and L. Deng and published near the end of 2014, with highly mathematically oriented technical detail on how deep learning methods are derived and implemented in modern speech recognition systems based on DNNs and related deep learning methods.^[17] A related book, published earlier in 2014, "Deep Learning: Methods and Applications" by L. Deng and D. Yu provides a less technical but more methodology-focused overview of DNN-based speech recognition during 2009–2014, placed within the more general context of deep learning applications including not only speech recognition but also image recognition, natural language processing, information retrieval, multimodal processing, and multitask learning.^[18]

Software

In terms of freely available resources, Carnegie Mellon University's Sphinx toolkit is one place to start to both learn about speech recognition and to start experimenting. Another resource (free but copyrighted) is the HTK book (and the accompanying HTK toolkit). For more recent and state-of-the-art techniques, Kaldi toolkit can be used^[19]. In 2017 Mozilla launched the open source project called Common Voice^[20] to gather big database of voices that would help build free speech recognition project DeepSpeech (available free at GitHub)^[21] using Google open source platform TensorFlow^[22].

A demonstration of an on-line speech recognizer is available on Cobalt's webpage.^[23]

For more software resources, see List of speech recognition software.

References

↑ Pacheco-Tallaj, Natalia M., and Claudio-Palacios, Andrea P. "Development of a Vocabulary and Grammar for an Open-Source Speech-driven Programming Platform to Assist People with Limited Hand Mobility". Research report submitted to Keyla Soto, UHS Science Professor.
↑ Stodden, Robert A., and Kelly D. Roberts. "The Use Of Voice Recognition Software As A Compensatory Strategy For Postsecondary Education Students Receiving Services Under The Category Of Learning Disabled." Journal Of Vocational Rehabilitation 22.1 (2005): 49--64. Academic Search Complete. Web. 1 Mar. 2015.
↑ Zaman, S., & Slany, W. (2014). Smartphone-Based Online and Offline Speech Recognition System for ROS-Based Robots. Information Technology and Control, 43(4), 371-380.
↑ "Speaker Independent Connected Speech Recognition- Fifth Generation Computer Corporation". Fifthgen.com. Archived from the original on 11 November 2013. Retrieved 15 June 2013.
↑ P. Nguyen (2010). "Automatic classification of speaker characteristics".
↑ "British English definition of voice recognition". Macmillan Publishers Limited. Archived from the original on 16 September 2011. Retrieved 21 February 2012.
↑ "voice recognition, definition of". WebFinance, Inc. Archived from the original on 3 December 2011. Retrieved 21 February 2012.
↑ "The Mailbag LG #114". Linuxgazette.net. Archived from the original on 19 February 2013. Retrieved 15 June 2013.
↑ Reynolds, Douglas; Rose, Richard (January 1995). "Robust text-independent speaker identification using Gaussian mixture speaker models". IEEE Transactions on Speech and Audio Processing 3 (1): 72–83. doi:10.1109/89.365379. ISSN 1063-6676. OCLC 26108901. Archived from the original on 8 March 2014. https://web.archive.org/web/20140308001101/http://www.cs.toronto.edu/~frank/csc401/readings/ReynoldsRose.pdf. Retrieved 21 February 2014.
↑ "Speaker Identification (WhisperID)". Microsoft Research. Microsoft. Archived from the original on 25 February 2014. Retrieved 21 February 2014. When you speak to someone, they don't just recognize what you say: they recognize who you are. WhisperID will let computers do that, too, figuring out who you are by the way you sound.
↑ Cerf, Vinton; Wrubel, Rob; Sherwood, Susan. "Can speech-recognition software break down educational language barriers?". Curiosity.com. Discovery Communications. Archived from the original on 7 April 2014. Retrieved 26 March 2014.
↑ ^12.0 ^12.1 "Speech Recognition for Learning". National Center for Technology Innovation. 2010. Archived from the original on 13 April 2014. Retrieved 26 March 2014.
↑ Follensbee, Bob; McCloskey-Dale, Susan (2000). "Speech recognition in schools: An update from the field". Technology And Persons With Disabilities Conference 2000. Archived from the original on 21 August 2006. Retrieved 26 March 2014.
↑ "Projects: Planetary Microphones". The Planetary Society. Archived from the original on 27 January 2012.
↑ Caridakis, George; Castellano, Ginevra; Kessous, Loic; Raouzaiou, Amaryllis; Malatesta, Lori; Asteriadis, Stelios; Karpouzis, Kostas (19 September 2007). Multimodal emotion recognition from expressive faces, body gestures and speech (in en). 247. Springer US. 375–388. doi:10.1007/978-0-387-74161-1_41. ISBN 978-0-387-74160-4.
↑ Beigi, Homayoon (2011). Fundamentals of Speaker Recognition. New York: Springer. ISBN 978-0-387-77591-3. Archived from the original on 31 January 2018. https://web.archive.org/web/20180131140911/http://www.fundamentalsofspeakerrecognition.org/.
↑ Yu, D.; Deng, L. (2014). Automatic Speech Recognition: A Deep Learning Approach (Publisher: Springer).
↑ Deng, Li; Yu, Dong (2014). "Deep Learning: Methods and Applications". Foundations and Trends in Signal Processing 7 (3–4): 197–387. doi:10.1561/2000000039. Archived from the original on 22 October 2014. https://web.archive.org/web/20141022161017/http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf.
↑ Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding (No. CONF). IEEE Signal Processing Society.
↑ https://voice.mozilla.org
↑ https://github.com/mozilla/DeepSpeech
↑ https://www.tensorflow.org/tutorials/sequences/audio_recognition
↑ https://demo-cubic.cobaltspeech.com/

External links

Signer, Beat and Hoste, Lode: SpeeG2: A Speech- and Gesture-based Interface for Efficient Controller-free Text Entry, In Proceedings of ICMI 2013, 15th International Conference on Multimodal Interaction, Sydney, Australia, December 2013
Speech Technology at the Open Directory Project

Page Information

This page was based on the following wikipedia-source page:

[1] Pacheco-Tallaj, Natalia M., and Claudio-Palacios, Andrea P. "Development of a Vocabulary and Grammar for an Open-Source Speech-driven Programming Platform to Assist People with Limited Hand Mobility". Research report submitted to Keyla Soto, UHS Science Professor.

[2] Stodden, Robert A., and Kelly D. Roberts. "The Use Of Voice Recognition Software As A Compensatory Strategy For Postsecondary Education Students Receiving Services Under The Category Of Learning Disabled." Journal Of Vocational Rehabilitation 22.1 (2005): 49--64. Academic Search Complete. Web. 1 Mar. 2015.

[3] Zaman, S., & Slany, W. (2014). Smartphone-Based Online and Offline Speech Recognition System for ROS-Based Robots. Information Technology and Control, 43(4), 371-380.

[4] "Speaker Independent Connected Speech Recognition- Fifth Generation Computer Corporation". Fifthgen.com. Archived from the original on 11 November 2013. Retrieved 15 June 2013.

[5] P. Nguyen (2010). "Automatic classification of speaker characteristics".

[Macmillan_Brit._def_of_voice_recognition-6] "British English definition of voice recognition". Macmillan Publishers Limited. Archived from the original on 16 September 2011. Retrieved 21 February 2012.

[Voice_rec,_definition-7] "voice recognition, definition of". WebFinance, Inc. Archived from the original on 3 December 2011. Retrieved 21 February 2012.

[mail_bag,_gazette-8] "The Mailbag LG #114". Linuxgazette.net. Archived from the original on 19 February 2013. Retrieved 15 June 2013.

[9] Reynolds, Douglas; Rose, Richard (January 1995). "Robust text-independent speaker identification using Gaussian mixture speaker models". IEEE Transactions on Speech and Audio Processing 3 (1): 72–83. doi:10.1109/89.365379. ISSN 1063-6676. OCLC 26108901. Archived from the original on 8 March 2014. https://web.archive.org/web/20140308001101/http://www.cs.toronto.edu/~frank/csc401/readings/ReynoldsRose.pdf. Retrieved 21 February 2014.

[10] "Speaker Identification (WhisperID)". Microsoft Research. Microsoft. Archived from the original on 25 February 2014. Retrieved 21 February 2014. When you speak to someone, they don't just recognize what you say: they recognize who you are. WhisperID will let computers do that, too, figuring out who you are by the way you sound.

[11] Cerf, Vinton; Wrubel, Rob; Sherwood, Susan. "Can speech-recognition software break down educational language barriers?". Curiosity.com. Discovery Communications. Archived from the original on 7 April 2014. Retrieved 26 March 2014.

[brainline-12] 12.0 ^12.1 "Speech Recognition for Learning". National Center for Technology Innovation. 2010. Archived from the original on 13 April 2014. Retrieved 26 March 2014.

[13] Follensbee, Bob; McCloskey-Dale, Susan (2000). "Speech recognition in schools: An update from the field". Technology And Persons With Disabilities Conference 2000. Archived from the original on 21 August 2006. Retrieved 26 March 2014.

[Planetary_Society_article-14] "Projects: Planetary Microphones". The Planetary Society. Archived from the original on 27 January 2012.

[15] Caridakis, George; Castellano, Ginevra; Kessous, Loic; Raouzaiou, Amaryllis; Malatesta, Lori; Asteriadis, Stelios; Karpouzis, Kostas (19 September 2007). Multimodal emotion recognition from expressive faces, body gestures and speech (in en). 247. Springer US. 375–388. doi:10.1007/978-0-387-74161-1_41. ISBN 978-0-387-74160-4.

[auto-16] Beigi, Homayoon (2011). Fundamentals of Speaker Recognition. New York: Springer. ISBN 978-0-387-77591-3. Archived from the original on 31 January 2018. https://web.archive.org/web/20180131140911/http://www.fundamentalsofspeakerrecognition.org/.

[ReferenceA-17] Yu, D.; Deng, L. (2014). Automatic Speech Recognition: A Deep Learning Approach (Publisher: Springer).

[BOOK2014-18] Deng, Li; Yu, Dong (2014). "Deep Learning: Methods and Applications". Foundations and Trends in Signal Processing 7 (3–4): 197–387. doi:10.1561/2000000039. Archived from the original on 22 October 2014. https://web.archive.org/web/20141022161017/http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf.

[19] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., ... & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding (No. CONF). IEEE Signal Processing Society.

[20] ttps://voice.mozilla.org

[21] ttps://github.com/mozilla/DeepSpeech

[22] ttps://www.tensorflow.org/tutorials/sequences/audio_recognition

[23] ttps://demo-cubic.cobaltspeech.com/

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]