Required Resources for a Working Simon Setup

Note

For background information about speech models, please refer to the Speech Recognition: Background section.

To get Simon to recognize speech and react to it, you need to set up a speech model.

Speech models describe how your voice sounds, what words exist, how they sound, and what word combinations (sentences or structures) exist.

A speech model basically consists of two parts:

  • Language model: Describes all existing words and what sentences are grammatically correct

  • Acoustic model: Describes how words sound

You need both of these components for Simon to recognize your voice.

In Simon, the language model is created from your active scenarios, and the acoustic model is built either solely from your voice recordings (training) or with the help of a base model.

Scenarios

One scenario makes up one complete use case of Simon. To control Firefox, for example, the user just installs the Firefox scenario.

In other words, scenarios tell Simon what words and phrases to listen for and what to do when they are recognized.

Because scenarios do not contain information about how these words and phrases actually sound, they can be shared and exchanged between different Simon users without problems. To support this community-based repository pool, a category for Simon scenarios has been created on the KDE Store, where scenarios, which are just simple text files (XML format), can be exchanged easily.

In most cases, scenarios are tailored to work best with a specific base model to avoid issues with the phoneme set.

For information on how to use scenarios in Simon, please refer to the Scenario section in the Use Simon chapter.

Acoustic model

As mentioned above, you need an acoustic model to activate Simon.

You can either create your own acoustic model or use, and even adapt, a base model. Base models are pre-generated, usually speaker-independent acoustic models that can be used with Simon.

The following table shows what is required, depending on your Simon configuration:

Table 2.1. Ways to an acoustic model

                      Training required   Base model required   Model creation backend required
Static base model     No                  Yes                   No
Adapted base model    Yes                 Yes                   Yes
User-generated model  Yes                 No                    Yes
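Table 2.1 can also be read as a simple decision rule. The following Python sketch encodes the table and lists which model types are possible given what you have available. The dictionary keys, the `Requirements` type and the function name `possible_model_types` are illustrative only; they are not part of Simon's configuration or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Requirements:
    training: bool
    base_model: bool
    model_creation_backend: bool


# Encodes Table 2.1. The keys are illustrative labels, not Simon settings.
WAYS_TO_ACOUSTIC_MODEL: Dict[str, Requirements] = {
    "static base model": Requirements(training=False, base_model=True, model_creation_backend=False),
    "adapted base model": Requirements(training=True, base_model=True, model_creation_backend=True),
    "user-generated model": Requirements(training=True, base_model=False, model_creation_backend=True),
}


def possible_model_types(have_training_data: bool, have_base_model: bool,
                         have_backend: bool) -> List[str]:
    """Return every model type whose requirements from Table 2.1 are met."""
    available = []
    for name, req in WAYS_TO_ACOUSTIC_MODEL.items():
        if req.training and not have_training_data:
            continue  # needs your own voice recordings
        if req.base_model and not have_base_model:
            continue  # needs a pre-compiled base model
        if req.model_creation_backend and not have_backend:
            continue  # needs HTK or SphinxTrain to build or adapt the model
        available.append(name)
    return available


# Only a base model, no training data, no model creation backend:
print(possible_model_types(False, True, False))  # ['static base model']
```

With everything available, all three model types are possible; the choice between them then comes down to whether you want your training data to influence the model.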


Backends

Simon uses external software to build acoustic models and to recognize speech.

Usually, these backends can be split into two distinct components: the "model compiler" or "model generation" backend, used to create or adapt acoustic models, and the "recognizer", used to recognize speech with the help of these models.

Not all of Simon's operation modes require a model compiler backend. Please refer to the next section for details on when this is the case.

Two different backends are supported:

  • Julius / HTK

    Models will be created with the HTK. Julius will be used as recognizer.

    To use this backend, please make sure that you have up-to-date versions of both tools installed.

  • CMU SPHINX

    This backend, also often simply referred to as "SPHINX backend", uses the PocketSphinx recognizer and the SphinxTrain model generation backend. Please refer to the CMU SPHINX website for more details.

    The CMU SPHINX backend requires that Simon is built with the optional SPHINX support. If you have not compiled Simon from source, please refer to your distribution for more information.

If you are using base models, Simon will automatically select the appropriate backend for you. However, if you want to build your own models from scratch (user-generated model, see below) and prefer a particular backend, please refer to the Simond configuration for more information.

Base models created for one backend are not compatible with any other backend. Please refer to the compatibility matrix for details.

Types of base models

There are three types of base models:

  • Static base model

  • Adapted base model

  • User-generated model

For information on how to use base models in Simon, please refer to the Base Models section in the Use Simon chapter.

Static base model

Static base models simply use a pre-compiled acoustic model without modifying it.

Any training data collected through Simon will not be used to improve the recognition accuracy.

This type of model does not require the model creation backend to be installed.

Adapted base model

Adapting a pre-compiled acoustic model to your voice can improve its recognition accuracy.

Collected training data will be compiled into an adaptation matrix, which will then be applied to the selected base model.

This type of model does require the model creation backend to be installed.

User-generated model

When using user-generated models, the user is responsible for training their own model. No base model will be used.

The training data will be used to compile your own acoustic model allowing you to create a system which directly reflects your voice.

This type of model does require the model creation backend to be installed.

Requirements

To build, adapt or use acoustic models of different types, certain software needs to be installed.

Table 2.2. Base model requirements

                      CMU SPHINX                 Julius / HTK
Static base model     PocketSphinx               Julius
Adapted base model    SphinxTrain, PocketSphinx  HTK, Julius
User-generated model  SphinxTrain, PocketSphinx  HTK, Julius
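Table 2.2 can be checked the same way. The hypothetical helper `missing_tools` below looks up the tools listed for a backend and model type and reports which of them are not yet installed. It only mirrors the table; detecting what is actually installed on a system is deliberately left out, because that depends on your platform and distribution.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Encodes Table 2.2: (backend, model type) -> tools that must be installed.
# The string labels are illustrative, not Simon configuration keys.
REQUIRED_TOOLS: Dict[Tuple[str, str], Set[str]] = {
    ("CMU SPHINX", "static base model"): {"PocketSphinx"},
    ("CMU SPHINX", "adapted base model"): {"SphinxTrain", "PocketSphinx"},
    ("CMU SPHINX", "user-generated model"): {"SphinxTrain", "PocketSphinx"},
    ("Julius / HTK", "static base model"): {"Julius"},
    ("Julius / HTK", "adapted base model"): {"HTK", "Julius"},
    ("Julius / HTK", "user-generated model"): {"HTK", "Julius"},
}


def missing_tools(backend: str, model_type: str, installed: Iterable[str]) -> List[str]:
    """Return, sorted, the tools Table 2.2 requires that are not in `installed`."""
    required = REQUIRED_TOOLS[(backend, model_type)]
    return sorted(required - set(installed))


# An installation with Julius, PocketSphinx and SphinxTrain but without the HTK:
installed = {"Julius", "PocketSphinx", "SphinxTrain"}
print(missing_tools("Julius / HTK", "static base model", installed))   # []
print(missing_tools("Julius / HTK", "adapted base model", installed))  # ['HTK']
print(missing_tools("CMU SPHINX", "adapted base model", installed))    # []
```

Keying the table on the (backend, model type) pair reflects that the requirements differ per backend, just as base models themselves are not interchangeable between backends.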


All four tools, HTK, Julius, PocketSphinx and SphinxTrain, can safely be installed at the same time.

SPHINX support in Simon must be enabled at compile time and might not be available on your platform. Please refer to your distribution for details.

Note

The Simon Windows installer includes Julius, PocketSphinx and SphinxTrain, but not the HTK. Please refer to the installation section for information on how to install the HTK if you need it.

Where to get base models

Simon base models are packaged as .sbm files. If you happen to have raw model files for your backend, you can package them into a compatible SBM container within Simon. Please refer to the speech model configuration for details.

Some SBM models may not work for you. Please refer to the model backends section for details.

For an up-to-date list of available base models, please refer to our online wiki.

Phoneme set issues

In order for base models to work, both your scenarios and your base model need to use the same set of phonemes.

In practice, this often just means that you need to match scenarios to your base model. The names of Simon base models usually start with a tag like "[EN/VF/JHTK]". Try to download scenarios whose names start with the same tag.
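Matching scenarios to a base model by tag can be sketched as follows. This Python sketch assumes the tag is always a bracketed prefix of the name, as in the "[EN/VF/JHTK]" example above; Simon itself does not necessarily parse names this way, and the function names and example names are hypothetical.

```python
import re
from typing import List, Optional

# Assumption: the tag is a bracketed prefix such as "[EN/VF/JHTK]".
TAG_PATTERN = re.compile(r"^\s*(\[[^\]]+\])")


def name_tag(name: str) -> Optional[str]:
    """Return the leading bracketed tag of a model or scenario name, if any."""
    match = TAG_PATTERN.match(name)
    return match.group(1) if match else None


def matching_scenarios(base_model_name: str, scenario_names: List[str]) -> List[str]:
    """Keep only the scenarios whose tag equals the base model's tag."""
    tag = name_tag(base_model_name)
    if tag is None:
        return []  # no tag to compare against
    return [name for name in scenario_names if name_tag(name) == tag]


# Hypothetical names, for illustration only:
print(matching_scenarios(
    "[EN/VF/JHTK] Example base model",
    ["[EN/VF/JHTK] Firefox", "[DE/XX/JHTK] Firefox", "Untagged scenario"],
))  # ['[EN/VF/JHTK] Firefox']
```

The whole tag is compared as one string, so a scenario only matches when every part of the tag agrees with the base model's tag.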

You cannot use scenarios designed for a different phoneme set (that is, for a different base model). If Simon detects this problem, it will try to disable the affected words by removing them from the created speech model. These words are marked with a red background in the scenario's vocabulary. To re-enable them, transcribe them with the correct phoneme set or use a user-generated model.
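The check described above can be approximated like this. The sketch assumes a word is disabled whenever its transcription uses at least one phoneme that is not in the base model's phoneme set; it is a simplification for illustration, not Simon's actual implementation, and the `Word` type, the function name and the phoneme symbols are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class Word:
    text: str
    phonemes: List[str]  # transcription of the word


def split_by_phoneme_set(vocabulary: Iterable[Word],
                         model_phonemes: Set[str]) -> Tuple[List[Word], List[Word]]:
    """Return (usable, disabled) words for a base model's phoneme set."""
    usable, disabled = [], []
    for word in vocabulary:
        # A word is only usable if every phoneme in its transcription
        # is known to the base model.
        if set(word.phonemes) <= model_phonemes:
            usable.append(word)
        else:
            disabled.append(word)
    return usable, disabled


# Hypothetical phoneme symbols, for illustration only:
model_phonemes = {"hh", "ah", "l", "ow"}
vocabulary = [Word("hello", ["hh", "ah", "l", "ow"]), Word("hallo", ["h", "a", "l", "o:"])]
usable, disabled = split_by_phoneme_set(vocabulary, model_phonemes)
print([w.text for w in usable], [w.text for w in disabled])  # ['hello'] ['hallo']
```

In this sketch, the words in the second list roughly correspond to the ones Simon highlights in red; re-transcribing them with symbols from the model's phoneme set moves them back into the first list.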

Hint

If you design a new scenario, it is therefore a good idea to use the dictionary that was used to create the base model as the shadow dictionary. This way, Simon will automatically suggest the correct phonemes when you add words.