VoxForge collects transcribed speech for use with Free and Open Source Speech Recognition Engines.

http://voxforge.org/

Some of these concepts are explained very clearly there, so I have gathered them together here.

An acoustic model is a file that contains statistical representations of each of the distinct sounds that make up a word. Each of these statistical representations is assigned a label called a phoneme. The English language has about 40 distinct sounds that are useful for speech recognition, and thus we have 40 different phonemes.

An acoustic model is created by taking a large database of speech (called a speech corpus) and using special training algorithms to create statistical representations for each phoneme in a language.  These statistical representations are called Hidden Markov Models ("HMM"s).  Each phoneme has its own HMM.
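To make the "one HMM per phoneme" idea concrete, here is a minimal Python sketch of how such a table might look. It is purely illustrative: the three-state topology, the Gaussian parameters, and the class names are assumptions, not the format of any real acoustic model.

```python
# A minimal sketch (not VoxForge code) of the idea "one HMM per phoneme".
# Each phoneme gets a small left-to-right HMM; the 3-state topology and the
# Gaussian parameters below are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class GaussianState:
    mean: list          # mean of the acoustic feature vector (e.g. MFCCs)
    variance: list      # diagonal covariance of the feature vector

@dataclass
class PhonemeHMM:
    phoneme: str                                      # label such as "hh", "aw", "s"
    states: list = field(default_factory=list)        # typically 3 emitting states
    transitions: dict = field(default_factory=dict)   # (from_state, to_state) -> probability

# The acoustic model is then just a table mapping each phoneme label
# to its trained HMM, estimated from a large speech corpus.
acoustic_model = {
    p: PhonemeHMM(phoneme=p, states=[GaussianState([0.0], [1.0])] * 3)
    for p in ["hh", "aw", "s"]        # a full English model has ~40 phonemes
}
```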

For example, if the system is set up with a simple grammar file to recognize the word "house" (whose phonemes are: "hh aw s"), here are the (simplified) steps that the speech recognition engine might take:

  • The speech decoder listens for the distinct sounds spoken by a user and then looks for a matching HMM in the Acoustic Model. In our example, each of the phonemes in the word "house" has its own HMM:
    •  hh 
    •  aw
    •  s 
  • When it finds a matching HMM in the acoustic model, the decoder takes note of the phoneme. The decoder keeps track of the matching phonemes until it reaches a pause in the user's speech.
  • When a pause is reached, the decoder looks up the matching series of phonemes it heard (i.e. "hh aw s") in its Pronunciation Dictionary to determine which word was spoken.  In our example, one of the entries in the pronunciation dictionary is HOUSE: 
    • ...
    • HOUSAND         [HOUSAND]       hh aw s ax n d
    • HOUSDEN         [HOUSDEN]       hh aw s d ax n
    • HOUSE           [HOUSE]         hh aw s
    • HOUSE'S         [HOUSE'S]       hh aw s ix z
    • HOUSEAL         [HOUSEAL]       hh aw s ax l
    • HOUSEBOAT       [HOUSEBOAT]     hh aw s b ow t
    • ...
  • The decoder then looks in the Grammar file for a matching word or phrase.  Since our grammar in this example only contains one word ("HOUSE"), it returns the word "HOUSE" to the calling program.

This gets a little more complicated when you start using Language Models (which contain the probabilities of a large number of different word sequences), but the basic approach is the same.
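As a rough illustration of the steps above, here is a toy Python sketch. It is not how a real decoder works (real engines search HMM state lattices frame by frame); the dictionary and grammar contents are taken from the example, and everything else is assumed for illustration.

```python
# A toy illustration of the decoding steps above, not a real decoder.
# We pretend the acoustic matching has already produced a phoneme string
# and only show the dictionary and grammar lookups.

PRONUNCIATION_DICT = {
    ("hh", "aw", "s"): "HOUSE",
    ("hh", "aw", "s", "b", "ow", "t"): "HOUSEBOAT",
}

GRAMMAR = {"HOUSE"}          # the grammar in this example contains one word

def decode(phonemes_until_pause):
    """Map the phoneme sequence heard before a pause to a word in the grammar."""
    word = PRONUNCIATION_DICT.get(tuple(phonemes_until_pause))
    if word is not None and word in GRAMMAR:
        return word          # returned to the calling program
    return None              # no match: out-of-grammar or misrecognized

print(decode(["hh", "aw", "s"]))   # -> HOUSE
```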

   

Speech Recognition Engines ("SRE"s) are made up of the following components:

  • Language Model or Grammar - Language Models contain a very large list of words and their probability of occurrence in a given sequence. They are used in dictation applications. Grammars are much smaller files containing sets of predefined combinations of words. Grammars are used in IVR or desktop Command and Control applications. Each word in a Language Model or Grammar has an associated list of phonemes (which correspond to the distinct sounds that make up the word).
  • Acoustic Model - Contains a statistical representation of the distinct sounds that make up each word in the Language Model or Grammar.  Each distinct sound corresponds to a phoneme.
  • Decoder - Software program that takes the sounds spoken by a user and searches the Acoustic Model for the equivalent sounds. When a match is made, the Decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user's speech. It then searches the Language Model or Grammar file for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program.

A Speech Recognition System ('SRS') on a desktop computer does what a typical user of speech recognition would expect it to do: you speak a command into your microphone and the computer does something, or you dictate something to the computer and it types out the corresponding text on your screen.

Grammar

A recognition Grammar essentially defines constraints on what the SRE can expect as input. It is a list of words and/or phrases that the SRE listens for. When one of these predefined words or phrases is heard, the SRE returns the word or phrase to the calling program - usually a Dialog Manager (but it could also be a script written in Perl, Python, etc.). The Dialog Manager then does some processing based on this word or phrase.

The example in the HTK book is that of a voice-operated interface for phone dialling. If the SRE hears the sequence of words 'Call Steve Young', it returns the textual representation of this phrase to the Dialog Manager, which then looks up Steve's telephone number and dials it.
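A Dialog Manager in this setup can be as simple as a script that reacts to the phrase the SRE returns. The following is a hypothetical Python sketch of that flow; the phone book, the dial function, and the phrase format are all assumptions, not part of HTK or VoxForge.

```python
# Hypothetical dialog manager for the phone-dialling example.
# The SRE hands us the recognized phrase as text; we act on it.

phone_book = {"STEVE YOUNG": "+44 1223 000000"}   # made-up entry

def dial(number):
    print(f"dialling {number} ...")               # stand-in for real telephony

def on_recognized(phrase):
    """Called by the SRE (or a wrapper script) with the recognized phrase."""
    if phrase.startswith("CALL "):
        name = phrase[len("CALL "):]
        number = phone_book.get(name)
        if number:
            dial(number)
        else:
            print(f"no number for {name}")

on_recognized("CALL STEVE YOUNG")   # -> dialling +44 1223 000000 ...
```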

It is very important to understand that the words that you can use in your Grammar are limited to the words that you have 'trained' in your Acoustic Model.  The two are tied very closely together.

Acoustic Model

An Acoustic Model is a file that contains a statistical representation of each distinct sound that makes up a spoken word. It must contain the sounds for each word used in your grammar. The words in your grammar give the SRE the sequence of sounds it must listen for. The SRE then listens for the sequence of sounds that make up a particular word and, when it finds that sequence, returns the textual representation of the word to the calling program (usually a Dialog Manager). Thus, when an SRE is listening for words, it is actually listening for the sequence of sounds that make up one of the words you defined in your Grammar. The Grammar and the Acoustic Model work together.

Therefore, when you train your Acoustic Model to recognize the phrase 'call Steve Young', the SRE is actually listening for the phoneme sequence "k", "ao", "l", "s", "t", "iy", "v", "y", "ah" and "ng".  If you say each of these phonemes aloud in sequence, it will give you an idea of what the SRE is looking for. 

Commercial SREs use large databases of speech audio to create their Acoustic Models.  Because of this, most common words that might be used in a Grammar are already included in their Acoustic Model.  

When creating your own Acoustic Models and Grammars, you need to make sure that all the phonemes that make up the words in your Grammar are included in your Acoustic Model.
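One practical way to check this is to expand every grammar word through the pronunciation dictionary and compare the resulting phoneme set against the phonemes your acoustic model was trained on. The sketch below is only an illustration; the dictionary entries and the phoneme inventory are assumptions.

```python
# Sanity check: every phoneme needed by the grammar must exist in the
# acoustic model. Dictionary entries and phoneme set here are illustrative.

pronunciations = {
    "CALL": ["k", "ao", "l"],
    "STEVE": ["s", "t", "iy", "v"],
    "YOUNG": ["y", "ah", "ng"],
}

acoustic_model_phonemes = {"k", "ao", "l", "s", "t", "iy", "v", "y", "ah", "ng"}

def missing_phonemes(grammar_words):
    """Return the phonemes the grammar needs but the acoustic model lacks."""
    needed = {p for w in grammar_words for p in pronunciations[w]}
    return needed - acoustic_model_phonemes

print(missing_phonemes(["CALL", "STEVE", "YOUNG"]))   # -> set() if fully covered
```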

A Statistical Language Model is a file used by a Speech Recognition Engine to recognize speech.  It contains a large list of words and their probability of occurrence.   It is used in dictation applications.

Language models are used to constrain search in a decoder by limiting the number of possible words that need to be considered at any one point in the search. The consequence is faster execution and higher accuracy.

Language models constrain search either absolutely (by enumerating some small subset of possible expansions) or probabilistically (by computing a likelihood for each possible successor word). The former will usually have an associated grammar that is compiled down into a graph; the latter will be trained from a corpus.
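As a rough picture of the first case, a small grammar can be thought of as a word graph in which the decoder only considers the words on the outgoing arcs of its current state. The sketch below is an assumed toy representation, not HTK's actual compiled lattice format.

```python
# Illustrative word graph for a tiny "call <first name> <surname>" grammar.
# Each state lists the words the decoder is allowed to consider next.

word_graph = {
    "START":   [("CALL", "NAME")],                        # (word, next state)
    "NAME":    [("STEVE", "SURNAME"), ("JULIAN", "SURNAME")],
    "SURNAME": [("YOUNG", "END"), ("ODELL", "END")],
    "END":     [],
}

def allowed_next_words(state):
    """Words the search needs to consider from this state."""
    return [word for word, _ in word_graph[state]]

print(allowed_next_words("NAME"))   # -> ['STEVE', 'JULIAN']
```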

Statistical language models (SLMs) are good for free-form input, such as dictation or spontaneous speech, where it is not practical or possible to specify a priori all possible legal word sequences.

Trigram SLMs are probably the most common ones used in ASR and represent a good balance between complexity and robust estimation. A trigram model encodes the probability of a word (w3) given its immediate two-word history, i.e. p(w3 | w1 w2). In practice, trigram models can be "backed off" to bigram and unigram models, allowing the decoder to emit any possible word sequence (provided that the acoustic and lexical evidence is there).
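A minimal sketch of the trigram idea, assuming a toy corpus and a naive "stupid backoff" scheme rather than a properly smoothed model:

```python
# Toy trigram model with naive backoff; the corpus and backoff weight are
# illustrative, not a properly smoothed (e.g. Katz or Kneser-Ney) model.

from collections import Counter

corpus = "call steve young please call steve now".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def p(w3, w1, w2, alpha=0.4):
    """p(w3 | w1 w2), backing off to bigram then unigram estimates."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return alpha * bigrams[(w2, w3)] / unigrams[w2]
    return alpha * alpha * unigrams[w3] / total

print(p("young", "call", "steve"))   # -> 0.5 ("young" follows "call steve" once out of twice)
```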

Original post: https://www.cnblogs.com/androidme/p/2383058.html