The Holy Grail of Speech Recognition

A Microsoft Research team has been working toward a breakthrough in real-time, speaker-independent, automatic speech recognition. Dong Yu, researcher at Microsoft Research Redmond, and Frank Seide, senior researcher and research manager with Microsoft Research Asia, have been spearheading this work.

Speech recognition has been an active research area for more than five decades. Commercially available voice-to-text technology typically achieves accuracy by having the user train the software during setup and by adapting more closely to the user’s speech patterns over time. However, automated voice services that interact with multiple speakers cannot rely on individual speaker training, because they must be usable instantly by any caller. As a result, they either handle only a small vocabulary or strongly restrict the words and patterns users can say. This research instead applies artificial neural networks to large-vocabulary speech recognition, aiming for speaker-independent recognition that works out of the box.

Artificial neural networks (ANNs) are mathematical models of the low-level circuits in the human brain. The notion of using ANNs to improve speech-recognition performance has been around since the 1980s, and a model known as the ANN-Hidden Markov Model (ANN-HMM) showed promise for large-vocabulary speech recognition. However, performance issues hindered commercial adoption.

A speech recognizer is essentially a model of the fragments of sound that make up speech. State-of-the-art recognizers model thousands of short fragments called senones. Dong Yu proposed modeling these thousands of senones directly with deep neural networks (DNNs), yielding a significant leap in accuracy and state-of-the-art performance. Training neural networks at this scale is made feasible by harnessing the computational power of modern graphics cards.
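To make the idea concrete, the sketch below shows what "modeling senones directly with a DNN" means in its simplest form: a feedforward network takes one frame of acoustic features and outputs a probability distribution over senones. This is a toy illustration, not Microsoft's implementation; the layer sizes, feature dimension, and random weights are all hypothetical stand-ins (real systems use thousands of senone outputs and trained parameters).

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 39   # hypothetical acoustic feature size (e.g. MFCCs with deltas)
N_HIDDEN = 64     # toy hidden-layer width; real DNNs use many more units
N_SENONES = 10    # toy value; state-of-the-art systems model thousands

# Randomly initialized weights stand in for trained parameters.
W1 = rng.standard_normal((N_HIDDEN, N_FEATURES)) * 0.1
b1 = np.zeros(N_HIDDEN)
W2 = rng.standard_normal((N_SENONES, N_HIDDEN)) * 0.1
b2 = np.zeros(N_SENONES)

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def senone_posteriors(features):
    """One forward pass: sigmoid hidden layer, then softmax over senones."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ features + b1)))
    return softmax(W2 @ h + b2)

frame = rng.standard_normal(N_FEATURES)   # one frame of acoustic features
posteriors = senone_posteriors(frame)
print(posteriors.shape, round(posteriors.sum(), 6))
```

In the hybrid DNN-HMM design the paper describes, posteriors like these replace the Gaussian mixture models traditionally used to score each HMM state, while the HMM still handles the sequencing of sounds over time.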

By employing artificial neural networks for speaker-independent speech recognition in this novel way, Microsoft Research has brought fluent speech-to-speech applications much closer to reality. The research paper, "Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition", describes the first hybrid context-dependent DNN-HMM model applied successfully to large-vocabulary speech-recognition problems.

Published by

Abhishek Baxi

Abhishek Baxi is an independent technology columnist for several international publications and a digital consultant. He speaks incessantly on Twitter (@baxiabhishek) and dons the role of Editor-in-Chief here at Techie Buzz.