Speech recognition: background

Note

Before explaining exactly how you can create new scenarios with Simon, this section introduces some fundamentals of speech recognition in general.

Speech recognition systems take voice input (often from a microphone) and try to translate it into written text. To do that, they rely on statistical representations of the human voice. In simple terms: the computer learns how words, or more precisely the sounds that make up those words, sound.

A speech model consists of two distinct parts:

  • Language Model

  • Acoustic Model

Language Model

The language model defines the vocabulary and the grammar you want to use.

Vocabulary

The vocabulary defines what words the recognition process should recognize. Every word you want to be able to use with Simon should be contained in your vocabulary.

One entry in the vocabulary defines exactly one word. In contrast to the common use of the term "word", in Simon a word means one unique combination of the following:

  • Wordname

    (The written word itself)

  • Category

    (Grammatical category; for example: Noun, Verb, etc.)

  • Pronunciation

    (How the word is pronounced; Simon accepts any kind of phonetic notation as long as it does not use special characters or numbers)

That means that plurals or even different cases are different words to Simon. This is an important design decision to allow more control when using a sophisticated grammar.
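The definition above can be sketched in a few lines of Python. This is only an illustration of the idea, not Simon's internal data structure: the Word class, its field names, the plural entry and its pronunciation are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One vocabulary entry: a unique (name, category, pronunciation) triple."""
    name: str           # the written word itself
    category: str       # grammatical category, e.g. "Noun"
    pronunciation: str  # phonemes separated by spaces


entries = [
    Word("Computer", "Noun", "k ax m p y uw t er"),
    Word("Computer", "Noun", "k ax m p y uw t er"),     # exact duplicate of the entry above
    Word("Computers", "Noun", "k ax m p y uw t er z"),  # plural (assumed pronunciation)
    Word("Computer", "Trigger", "k ax m p y uw t er"),  # same spelling, different category
]

# A set keeps one copy of each unique combination.
vocabulary = set(entries)
print(len(vocabulary))
```

Only the exact duplicate collapses; the plural and the entry with a different category each count as a separate word.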

In general, it is advisable to keep your vocabulary as lean as possible. The more words it contains, the higher the chance that Simon might misunderstand you.

Example vocabulary (please note that the categories here are deliberately set to Noun / Verb to aid understanding; please refer to the grammar section to see why this might not be the best idea):

Table 4.1. Sample Vocabulary

Word      Category  Pronunciation
Computer  Noun      k ax m p y uw t er
Internet  Noun      ih n t er n eh t
Mail      Noun      m ey l
close     Verb      k l ow s


Active Dictionary

The vocabulary used for the recognition is referred to as the active dictionary or active vocabulary.

Shadow Dictionary

As said above, you should keep your vocabulary / dictionary as lean as possible. However, because every word in your vocabulary also needs information about its pronunciation, it is useful to have a large dictionary where you can look up the pronunciation and other characteristics of words.

Simon provides this functionality. We refer to this large reference dictionary as the shadow dictionary. The shadow dictionary is not created by the user but can be imported from various sources.

As Simon is a multi-language solution we do not ship shadow dictionaries with Simon. However, it is very easy to import them yourself using the import dictionary wizard. This is described in the Import Dictionary section.

Language profile

In addition to a shadow dictionary, Simon can use a language profile to help with transcribing words.

A language profile consists of rules describing how words are pronounced in the target language. It can be likened to the way that humans can often pronounce a word they have never heard, just because they know some implicit "pronunciation rules" of the language.

Just as with humans, this process is not perfect but can provide a solid starting ground.

This automatic deduction of a phoneme transcription from a written word is called "grapheme-to-phoneme conversion".

Simon requires the Sequitur G2P grapheme-to-phoneme converter to be installed and set up for language profiles to work.

If you have selected a pre-built language profile or built your own, Simon will automatically transcribe new words with it when they are not found in your shadow dictionary.
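The lookup order described above can be sketched as follows. This is a minimal illustration, not Simon's implementation and not the Sequitur G2P interface: transcribe, toy_g2p and TOY_RULES are hypothetical names, and the rule table is a deliberately tiny stand-in for a real language profile.

```python
from typing import Callable, Dict, Optional

# Hypothetical, deliberately tiny "language profile": letter groups -> phonemes.
TOY_RULES = {"ai": "ey", "m": "m", "l": "l", "s": "s", "n": "n"}


def toy_g2p(word: str) -> str:
    """Greedy longest-match conversion using TOY_RULES (stand-in for a real G2P tool)."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in TOY_RULES:
                phonemes.append(TOY_RULES[chunk])
                i += len(chunk)
                break
        else:
            i += 1  # skip letters the toy rules do not cover
    return " ".join(phonemes)


def transcribe(word: str,
               shadow_dictionary: Dict[str, str],
               g2p: Optional[Callable[[str], str]] = None) -> Optional[str]:
    """Prefer the shadow dictionary; fall back to the language profile if present."""
    pronunciation = shadow_dictionary.get(word)
    if pronunciation is not None:
        return pronunciation
    if g2p is not None:
        return g2p(word)
    return None  # no source available: the user has to transcribe manually


shadow = {"Mail": "m ey l"}
print(transcribe("Mail", shadow, toy_g2p))   # found in the shadow dictionary
print(transcribe("Snail", shadow, toy_g2p))  # deduced by the toy rules
```

As with a real language profile, the deduced transcription is only a starting point and may need manual correction.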

Grammar

The grammar defines which combinations of words are correct.

Let's look at an example: You want to use Simon to launch programs and close those windows when you are done. You would like to use the following commands:

  • Computer, Internet

    To open a browser

  • Computer, Mail

    To open a mail client

  • Computer, close

    To close the current window

Following English grammar, your vocabulary would contain the following:

Table 4.2. Sample Vocabulary

Word      Category
Computer  Noun
Internet  Noun
Mail      Noun
close     Verb


To allow the sentences defined above, Simon would need the following grammar:

  • Noun Noun for sentences like Computer Internet

  • Noun Verb for sentences like Computer close

While this would work, it would also allow the combinations Computer Computer, Internet Computer, Internet Internet, etc. which are obviously bogus. To improve the recognition accuracy, we can try to create a grammar that better reflects what we are trying to do with Simon.

It is important to remember that you define your own language when using Simon. That means that you are not bound to the grammar rules of whatever language you want to use Simon with. For a simple command and control use case, for example, it is advisable to invent new grammatical categories so that commands are not distinguished by grammatical information that is irrelevant to the task.

In the example above, it is not relevant that close is a verb or that Computer and Internet are nouns. Instead, why not define them as something that better reflects what we want them to be:

Table 4.3. Improved Sample Vocabulary

Word      Category
Computer  Trigger
Internet  Command
Mail      Command
close     Command


Now we change the grammar to the following:

  • Trigger Command

This still allows all three commands listed above, but it also limits the possibilities to exactly those three sentences. Especially in larger models, a well thought-out grammar and vocabulary can make a huge difference in recognition results.
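To make the difference between the two grammars concrete, the following sketch enumerates every sentence each one accepts. It only illustrates the idea of a grammar as a sequence of categories; the sentences function and the data layout are assumptions for this example and say nothing about how Simon stores or evaluates grammars internally.

```python
from itertools import product


def sentences(vocabulary, grammar):
    """List every word sequence that matches one of the grammar's category sequences."""
    by_category = {}
    for word, category in vocabulary:
        by_category.setdefault(category, []).append(word)

    result = []
    for rule in grammar:
        # One list of candidate words per category slot in the rule.
        slots = [by_category.get(category, []) for category in rule.split()]
        result.extend(" ".join(words) for words in product(*slots))
    return result


english = [("Computer", "Noun"), ("Internet", "Noun"),
           ("Mail", "Noun"), ("close", "Verb")]
improved = [("Computer", "Trigger"), ("Internet", "Command"),
            ("Mail", "Command"), ("close", "Command")]

loose = sentences(english, ["Noun Noun", "Noun Verb"])
strict = sentences(improved, ["Trigger Command"])

print(len(loose), len(strict))
print(strict)
```

With the English categories, the two structures accept twelve sentences, nine of which (for example Internet Internet or Mail close) are not commands we want. With the Trigger / Command categories, a single structure accepts exactly the three intended commands.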

Acoustic Model

The acoustic model represents your pronunciation in a machine readable format.

Let's look at the following sample vocabulary:

Table 4.4. Sample Vocabulary

Word      Category  Pronunciation
Computer  Noun      k ax m p y uw t er
Internet  Noun      ih n t er n eh t
Mail      Noun      m ey l
close     Verb      k l ow s


The pronunciation of each word is composed of individual sounds which are separated by spaces. For example, the word close consists of the following sounds:

  • k

  • l

  • ow

  • s

The acoustic model uses the fact that spoken words are composed of sounds much like written words are composed of letters. Using this knowledge, we can segment words into sounds (represented by the pronunciation) and assemble them back when recognizing. These building blocks are called phonemes.

Because the acoustic model actually represents how you speak the phonemes of the words, training material is shared among all words that use the same phonemes.

That means that if you add the word clothes to the language model, your acoustic model already has an idea of how the clo part is going to sound, because both words share the same phonemes (k, l, ow) at the beginning.
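The sharing of phonemes between words can be illustrated with a short sketch. The pronunciations of the four sample words are taken from the table above; the transcription used for clothes is an assumption made for this example, and the helper names are hypothetical.

```python
def shared_prefix(pronunciation_a, pronunciation_b):
    """Return the phonemes two pronunciations have in common at the beginning."""
    shared = []
    for a, b in zip(pronunciation_a.split(), pronunciation_b.split()):
        if a != b:
            break
        shared.append(a)
    return shared


vocabulary = {
    "Computer": "k ax m p y uw t er",
    "Internet": "ih n t er n eh t",
    "Mail": "m ey l",
    "close": "k l ow s",
}
clothes = "k l ow dh z"  # assumed transcription, not from the sample vocabulary

print(shared_prefix(vocabulary["close"], clothes))

# Which words contribute training material to each phoneme?
words_by_phoneme = {}
for word, pronunciation in vocabulary.items():
    for phoneme in pronunciation.split():
        words_by_phoneme.setdefault(phoneme, set()).add(word)

print(sorted(words_by_phoneme["k"]))
```

Every recording of Computer or close contributes to the model of the phoneme k, which is why a newly added word that starts with k benefits from earlier training.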

To train the acoustic model (in other words, to tell it how you pronounce the phonemes) you have to train words from your language model. That means that Simon displays a word which you read out loud. Because the word is listed in your vocabulary, Simon already knows which phonemes it contains and can thus learn from your pronunciation of the word.