Speech Recognition

Overview of a Typical Speech Recognition System

Automatic speech recognition (ASR) is the process of automatically converting an acoustic signal, captured by a microphone or telephone, into a set of words. The recognized words can be the final result or can serve as the input to further linguistic processing in order to achieve speech understanding. Speech recognition has found numerous applications in diverse areas such as command and control, data entry, and document preparation.

Speech Recognition Stages

A typical speech recognition system consists of two stages. The first stage is pre-processing, or feature extraction; the second stage is post-processing, which is divided into acoustic, lexical, and language modeling.

Feature Extraction
In the feature extraction step, the speech waveform, sampled at a rate between 6.6 and 20 kHz, is processed to produce a new representation: a sequence of vectors containing the values of features or parameters. The vectors typically comprise 10 to 39 parameters and are usually computed every 10 or 20 ms.
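
As a concrete sketch of this step, the following Python function (assuming only NumPy) frames a waveform into overlapping 25 ms windows every 10 ms and computes a truncated log power spectrum per frame, a simplified stand-in for the mel-frequency cepstral coefficients most systems use; the frame and hop sizes are common choices, not values fixed above.

    import numpy as np

    def log_power_features(waveform, sample_rate=16000,
                           frame_ms=25, hop_ms=10, n_coeffs=13):
        """Frame the signal every hop_ms and return one feature vector
        per frame (a log power spectrum truncated to n_coeffs values).
        A simple stand-in for MFCC extraction."""
        frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
        hop_len = int(sample_rate * hop_ms / 1000)       # samples between frames
        window = np.hamming(frame_len)                   # taper frame edges

        n_frames = 1 + (len(waveform) - frame_len) // hop_len
        features = []
        for i in range(n_frames):
            frame = waveform[i * hop_len : i * hop_len + frame_len] * window
            power = np.abs(np.fft.rfft(frame)) ** 2     # power spectrum
            features.append(np.log(power[:n_coeffs] + 1e-10))
        return np.array(features)                       # (n_frames, n_coeffs)

    # One second of (random) 16 kHz audio -> roughly one vector per 10 ms.
    feats = log_power_features(np.random.randn(16000))
    print(feats.shape)   # (98, 13)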

Acoustic Models
The parameter values extracted from the raw speech are used to build acoustic models, which approximate the probability that the portion of waveform just analyzed corresponds to a particular phonetic event in the phone-sized or whole-word reference unit being postulated.
A number of procedures have been developed for acoustic modeling, including Dynamic Time Warping (DTW), Hidden Markov Models (HMMs), Vector Quantization, and Neural Networks. The HMM is the dominant recognition paradigm, in which variations in speech are modeled statistically. Hybrid ANN/HMM systems have also been used for the recognition task: neural networks estimate frame-based phone probabilities, which are then converted into the most probable word string using an HMM.
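
As an illustration of the HMM paradigm, the sketch below scores a feature sequence against a small left-to-right phone model with diagonal-Gaussian emissions using the forward algorithm; all model parameters are toy values invented for the example, and SciPy's logsumexp is used for numerical stability.

    import numpy as np
    from scipy.special import logsumexp

    def forward_log_likelihood(features, means, variances, log_trans):
        """Log-likelihood of a feature sequence under a left-to-right HMM
        with diagonal-Gaussian emissions, via the forward algorithm.
        means, variances: (n_states, n_dims); log_trans: (n_states, n_states)."""
        n_states = means.shape[0]

        def log_emission(x):
            # log N(x; mean_s, diag(var_s)) evaluated for every state s
            return -0.5 * np.sum(np.log(2 * np.pi * variances)
                                 + (x - means) ** 2 / variances, axis=1)

        log_alpha = np.full(n_states, -np.inf)
        log_alpha[0] = log_emission(features[0])[0]   # start in the first state

        for x in features[1:]:
            # alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(x_t), in log space
            log_alpha = (logsumexp(log_alpha[:, None] + log_trans, axis=0)
                         + log_emission(x))
        return logsumexp(log_alpha)

    # Toy 3-state phone model over 2-dimensional features (invented numbers).
    means = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
    variances = np.ones((3, 2))
    log_trans = np.log(np.array([[0.6, 0.4, 0.0],     # left-to-right topology
                                 [0.0, 0.6, 0.4],
                                 [0.0, 0.0, 1.0]]) + 1e-12)
    frames = np.random.randn(20, 2)
    print(forward_log_likelihood(frames, means, variances, log_trans))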

Lexical and Language Models
Lexical models define the vocabulary for ASR systems, whereas language models capture the properties of a language and predict the next word in a speech sequence. The sequence of phone probabilities obtained from acoustic modeling is used to search for the most likely word sequence under the constraints imposed by the lexical and language models. When speech is produced as a sequence of words, language models or artificial grammars restrict the allowable combinations of words.
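
A minimal sketch of how these two models constrain the search: the lexicon defines the legal vocabulary (here with invented phone spellings), and a bigram language model scores how likely each word is to follow the previous one. All words and probabilities below are made up for illustration.

    import math

    # Lexicon: the legal vocabulary, mapping each word to a phone sequence.
    lexicon = {
        "could": ["k", "uh", "d"],
        "you":   ["y", "uw"],
        "do":    ["d", "uw"],
    }

    # Bigram language model: P(word | previous word), invented probabilities.
    bigram = {
        ("<s>", "could"): 0.5,
        ("could", "you"): 0.7,
        ("could", "do"):  0.3,
        ("you", "do"):    0.4,
    }

    def lm_log_prob(words, floor=1e-6):
        """Log probability of a word sequence under the bigram model.
        Unseen bigrams fall back to a small floor probability."""
        score = 0.0
        for prev, word in zip(["<s>"] + words, words):
            if word not in lexicon:
                raise ValueError(f"{word!r} is not in the lexicon")
            score += math.log(bigram.get((prev, word), floor))
        return score

    print(lm_log_prob(["could", "you", "do"]))   # well-predicted sequence
    print(lm_log_prob(["do", "you", "could"]))   # scores far lower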

The Performance of Speech Recognition Systems

The performance of a speech recognition system is generally measured in terms of word error rate (WER): the number of substituted, deleted, and inserted words divided by the total number of words in the reference transcription. ASR research has focused on driving the recognition error toward zero in real time, independent of vocabulary size, noise, speaker characteristics, and accent. Despite several decades of research in ASR, the error rate of commercial systems is around 10% when the recognition task is constrained in some way.
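
Concretely, WER is obtained by aligning the recognized words against the reference transcription with an edit-distance computation, as in this short sketch using only the Python standard library:

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / reference length,
        computed via Levenshtein distance over words."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j]: edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                       # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                       # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # match/substitution
        return d[len(ref)][len(hyp)] / len(ref)

    # One substitution ("jou") and one deletion ("it") over four words.
    print(word_error_rate("could you do it", "could jou do"))   # 0.5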

The performance of ASR systems can be characterized by many parameters, among them vocabulary size, speech type, and speaker dependency.

Vocabulary Size

Speech recognition is generally easy when the vocabulary is small, but the word error rate increases as the vocabulary grows. For example, in the digit recognition task the vocabulary consists of the ten digits “zero” through “nine”, which can be recognized nearly perfectly, whereas for large-vocabulary ASR systems with more than 100,000 words the WER rises to somewhere between 5% and 10%.

Speech Type

An ASR system can be designed to recognize either isolated words or connected speech. An isolated-word recognition system requires the speaker to pause briefly between words, whereas a continuous speech recognition system does not. Isolated-word recognition is relatively easy because word boundaries are detectable and the words tend to be clearly pronounced. In continuous speech, by contrast, word boundaries are blurred by co-articulation, the slurring of adjacent speech sounds, which can cause a phrase like “could you” to sound like “could jou”.

Speaker Dependency

The recognition task can be either speaker-dependent or speaker-independent. Speaker-dependent systems require a speaker to provide a sample of his or her speech before using the system, whereas speaker-independent systems recognize the speech of a variety of speakers. Speaker-independent recognition is more difficult to implement than speaker-dependent recognition, because the internal representation of speech must be global enough to cover all types of voices and all ways of pronouncing words, yet still able to discriminate between the various words of the vocabulary. Intermediate between speaker-dependent and speaker-independent systems are multi-speaker systems, intended for use by a small group of people. Speaker-adaptive systems tune themselves to any speaker by taking a small amount of that speaker's speech as enrollment data.
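
One simple way such enrollment can work, sketched below, is maximum a posteriori (MAP) adaptation of the acoustic model's Gaussian means: each speaker-independent mean is interpolated toward the mean of the enrollment frames assigned to it, with a relevance factor controlling how quickly the new speaker's data takes over. This is just one standard technique, and the toy data and relevance value here are invented.

    import numpy as np

    def map_adapt_mean(prior_mean, enrollment_frames, relevance=16.0):
        """MAP adaptation of a single Gaussian mean: interpolate between
        the speaker-independent prior mean and the enrollment sample mean,
        weighted by the amount of enrollment data."""
        n = len(enrollment_frames)
        sample_mean = enrollment_frames.mean(axis=0)
        weight = n / (n + relevance)   # more data -> trust the speaker more
        return weight * sample_mean + (1 - weight) * prior_mean

    # Toy example: a speaker-independent mean and 50 enrollment frames
    # drawn around the new speaker's true mean (invented numbers).
    prior = np.zeros(2)
    frames = np.random.randn(50, 2) + np.array([1.0, -0.5])
    print(map_adapt_mean(prior, frames))   # pulled toward roughly (1.0, -0.5)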