Simon was designed with high configurability in mind. Because of this, there are plentiful parameters that can be fine-tuned to your specific requirements.
You can access Simon's configuration dialog through the application's main menu: → .
The general configuration page lists some basic settings.
If you want to show the first run assistant again, deselect .
Please note that the option to start Simon at login works both on Microsoft Windows and under KDE on Linux. Other desktop environments such as GNOME or Xfce might require adding Simon to the session autostart manually (please refer to the manual of your desktop environment).
When the option to start Simon minimized is selected, Simon will minimize to the system tray immediately after starting.
Deselecting the option to warn when there are problems with samples deactivates the sample quality assurance.
Simon uses fairly sophisticated internal sound processing to enable complex multi-device setups.
The sound device configuration allows you to choose which sound device(s) to use, configure them and define additional recording parameters.
Use the button if you have plugged in additional sound devices since you started Simon.
Most of the time you will want to use 1 channel at 16kHz (which is also the default): the recognition only works on mono input and works best at 16kHz, with 8kHz and 22kHz being other viable options. Some low-cost sound cards might not support this particular mode, in which case you can enable automatic resampling in the device's advanced configuration.
Only change the channel count and the samplerate if you really know what you are doing; otherwise the recognition will most likely not work.
You can use Simon with more than one sound device at the same time. Use to add a new device to the configuration and to remove it from your configuration. The first device in your sound setup cannot be removed.
For each device you can choose what it should be used for: training or recognition (the latter only applies to input devices).
If you use more than one device for training, you will create multiple sound files for each utterance. When using multiple devices for recognition each one feeds a separate sound input stream to the server resulting in recognition results for each stream.
If you use multiple output devices the playback of the training samples will play on all configured audio devices.
When using different sample rates for your input devices, the output will only play on matching output devices. If you for example have one input device configured to use 16kHz and the other to use 48kHz, the playback of samples generated by the first one will only play on 16kHz outputs, the other one only on 48kHz devices.
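The matching rule above can be sketched as follows. This is an illustrative approximation, not Simon's actual implementation; the device names are made up.

```python
# Hypothetical sketch of the sample-rate matching rule described above:
# a sample recorded at a given rate is only played back on output
# devices configured for that same rate.

def matching_outputs(sample_rate, output_devices):
    """Return the names of output devices whose configured rate matches."""
    return [name for name, rate in output_devices if rate == sample_rate]

outputs = [("speakers", 16000), ("headset", 48000), ("hdmi", 16000)]
matching_outputs(16000, outputs)  # → ["speakers", "hdmi"]
```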
If you set up this device to be used for recognition and any of its activation requirements are not met, the device will not record. This can be used to augment or even replace the traditional voice activity detection with context information.
For example, add a face detection condition to the recording device's activation requirements to only enable the recognition when you are looking at the webcam.
The recognition is done on the Simond server. See the architecture section for more details.
The sound stream is not continuous but is segmented by the Simon client. This is done by something called “voice activity detection”.
Here you can configure this segmentation through the following parameters:
Everything below this level is considered “silence” (background noise).
The input level has to stay above the cutoff level for at least as long as the head margin before Simon considers the recording a real sample.
After the input level has dropped below the cutoff level, Simon waits for as long as the tail margin before considering the current recording a finished sample.
Skip samples shorter than
Samples that are shorter than this value are not considered for recognition (coughs, clicks, etc.).
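The interplay of these parameters can be illustrated with a simplified sketch. This is not Simon's actual implementation: real margins are durations rather than frame counts, and the head margin handling is approximated here by a minimum length requirement.

```python
# Simplified sketch of the voice activity detection parameters described
# above, operating on per-frame loudness values. Cutoff, margins and
# minimum length are given in frames for illustration only.

def segment(levels, cutoff, head_margin, tail_margin, min_length):
    """Split a stream of per-frame loudness values into finished samples."""
    samples, current, silence = [], [], 0
    for level in levels:
        if level > cutoff:
            current.append(level)
            silence = 0
        elif current:
            silence += 1
            current.append(level)
            if silence >= tail_margin:
                sample = current[:-silence]
                # discard short noises (coughs etc.); require at least the
                # head margin's worth of loud frames to count at all
                if len(sample) >= max(min_length, head_margin):
                    samples.append(sample)
                current, silence = [], 0
    return samples
```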
When the option is selected, Simon will, during training, automatically start and stop the recording when displaying and hiding the recording prompt, respectively. This option only sets the default value; the user can change it at any time before beginning a training session.
The configurable font here refers to the text that is recorded to train the acoustic model (through explicit training or when adding a word).
This option was introduced after we worked with a few clients with spastic disabilities. While we used the mouse to control Simon during training, they had to read what was on the screen. At first this was very problematic, as the regular font size is relatively small and they had trouble making out what to read. This is why the font and the font size of the recording prompt were made configurable.
Here you can also define the signal-to-noise ratio required for Simon to consider a training sample correct. See the Sample Quality Assurance section for more details.
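One common way to express such a check is an RMS-based signal-to-noise ratio in decibels, sketched below. This is illustrative only: the formula and the threshold value are assumptions, not Simon's actual quality assurance code or default.

```python
import math

# Illustrative sketch (not Simon's implementation): RMS-based SNR in dB,
# used to accept or reject a training sample against a threshold.

def rms(frames):
    return math.sqrt(sum(f * f for f in frames) / len(frames))

def snr_db(speech_frames, noise_frames):
    return 20 * math.log10(rms(speech_frames) / rms(noise_frames))

def sample_ok(speech_frames, noise_frames, required_snr_db=10.0):
    # required_snr_db is a made-up example value
    return snr_db(speech_frames, noise_frames) >= required_snr_db
```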
On this configuration page you can also set the parameters for the volume calibration.
It can be deactivated for both the add word dialog and the training wizard by unchecking the group box itself.
The calibration itself uses the voice activity detection to score your sound configuration.
The prompted text can be configured by entering text in the input field below. If the edit is empty a default text will be used.
All recorded (training) and imported (through the import training data) samples can be processed using a series of postprocessing commands. Postprocessing chains are an advanced feature and shouldn't be needed by the average user.
The postprocessing commands can be seen as a chain of filters through which the recordings pass. Using these “filters” one could define commands to suppress background noise in the training data or normalize the recordings.
Given a program process_audio which takes the input and output files as its arguments (e.g.: process_audio in.wav out.wav), the postprocessing command would be:
process_audio %1 %2. The two placeholders %1 and %2 will be replaced by the input filename and the output filename, respectively.
The switch to “apply filters to recordings recorded with Simon” enables the postprocessing chains for samples recorded during the training (including the initial training while adding the word). If you don't select this switch the postprocessing commands are only applied to imported samples (through the import training data wizard).
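Conceptually, such a chain could be applied as in the sketch below. This is a hypothetical illustration, not Simon's code; process_audio is the made-up filter from the example above.

```python
import os
import shlex
import subprocess
import tempfile

# Hypothetical sketch of applying a postprocessing chain: each command is
# run in turn, with %1 replaced by its input file and %2 by its output
# file, so the recording passes through every "filter".

def run_chain(commands, infile, outfile):
    current = infile
    for i, command in enumerate(commands):
        is_last = (i == len(commands) - 1)
        target = outfile if is_last else tempfile.mktemp(suffix=".wav")
        args = [arg.replace("%1", current).replace("%2", target)
                for arg in shlex.split(command)]
        subprocess.run(args, check=True)
        current = target

# e.g. run_chain(["process_audio %1 %2"], "in.wav", "out.wav")
```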
Every sample recorded with Simon is assigned a sample group.
When creating the acoustic model from the training samples Simon can take the current situation into account to only use a subset of all gathered training data.
For example, in a system where multiple, very different speakers share one setup, context conditions can be set up to automatically build a separate model for each user depending on the current situation.
The above screenshot, for example, shows a setup where, given that all samples of "peter" were tagged "peters_samples" and all samples of "mathias" were tagged "mathias_samples" (refer to the device configuration for more information on how to set up sample groups), the active acoustic model will only contain the current user's own samples as long as the file
/home/bedahr/.username contains either "peter" or "mathias".
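The selection logic of that example can be sketched as follows. This is purely illustrative (Simon evaluates context conditions internally); the file path and group names are the ones from the example above.

```python
# Hypothetical sketch of the context condition described above: activate
# the sample group belonging to whichever user is named in a file.

def active_sample_groups(username_file, groups_by_user):
    try:
        with open(username_file) as f:
            user = f.read().strip()
    except OSError:
        return []  # condition not met: no groups activated
    return groups_by_user.get(user, [])

groups = {"peter": ["peters_samples"], "mathias": ["mathias_samples"]}
# e.g. active_sample_groups("/home/bedahr/.username", groups)
```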
Another example use-case would be to switch to a more noise-resistant acoustic model when the user starts playing music.
Here you can adjust the parameters of the speech model.
You can optionally use base models to limit / circumvent the training or to avoid installing a model creation backend. Please refer to the general base model section for more details about base models.
Simon base models are packaged in
To add base models to the selection, you can either import local models ( → ), download them from an online repository ( → ) or create new ones from raw files ( → ).
If you have raw model files produced by either supported model creation backend, you can package them into an SBM container for use with Simon.
This section allows you to configure the training samples.
The samplerate set here is the target samplerate of the acoustic model. It has nothing to do with the recording samplerate and it is the responsibility of the user to ensure that the samples are actually made available in that format (usually by recording in that exact samplerate or by defining postprocessing commands that resample the files; see the sound configuration section for more details).
Usually either 16kHz or 8kHz models are built and used. 16kHz models will have higher accuracy than 8kHz models. Going higher than 16kHz is not recommended as it is very CPU-intensive and in practice probably won't result in higher recognition rates.
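Since matching the target samplerate is the user's responsibility, a quick check of the actual samples can be useful. The helper below is not part of Simon; it is a small illustration using Python's standard wave module.

```python
import wave

# Illustrative helper (not part of Simon): verify that a training sample's
# actual samplerate matches the acoustic model's target samplerate.

def matches_model_rate(path, target_rate=16000):
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == target_rate
```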
Moreover, the path to the training samples can be adjusted. However, be sure that the previously gathered training samples are also moved to the new location. If you use automatic synchronization, Simond would alternatively provide Simon with the missing samples, but copying them manually is still recommended for performance reasons.
In the language profile section you can select a previously built or downloaded language profile to aid with the transcription of new words.
Here you can configure the base URL that is going to be used for the automatic BOMP import. The default points to the copy on the Simon listens server.
Here you can configure the recognition and model synchronization with the Simond server.
Using the server configuration you can set parameters of the connection to Simond.
The Simon main application connects to the Simond server (see the architecture section for more information).
To identify individual users of the system (one Simond server can of course serve multiple Simon clients), Simon and Simond use user accounts. Every user has their own speech model. The username / password combination given here is used to log in to Simond. If Simond does not know the username or the password is incorrect, the connection will fail. See the Simond manual on how to set up users for Simond.
The recognition itself - which is done by the server - might not be available at all times. For example, the recognition cannot be started as long as the user does not yet have a compiled acoustic and language model; this is created during synchronization, once all the ingredients (vocabulary, grammar, training) are present. Using the option to start the recognition automatically once it is available, Simon will request to start the recognition as soon as it receives the information that all required components are available.
Using the Connect to server on startup option, Simon will automatically start the connection to the configured Simond servers after it has finished loading the user interface.
Simon connects to Simond using TCP/IP.
As of Simon 0.4, encryption is not yet supported.
The timeout setting specifies how long Simon will wait for a first reply when contacting the hosts. If you are on a very slow network and/or use “connect on start” on a very slow machine, you may want to increase this value, especially if you keep getting timeout errors that resolve when you simply try again.
Simon can be configured to use more than one Simond server. This is very useful if, for example, you use Simon on a laptop that connects to a different server depending on where you are: you could add both the server you use at home and the one you use at work. When connecting, Simon will try each of the servers (in order) until it finds one that accepts the connection.
To add a server, just enter the host name or IP address and the port (separated by “:”) or use the dialog that appears when you select the blue arrow next to the input field.
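The fallback behaviour can be sketched as follows. This is an illustration of the idea, not Simon's networking code; the host names are placeholders and only plain TCP reachability is checked.

```python
import socket

# Sketch of the server fallback described above: parse "host:port"
# entries and try each server in order until one accepts a TCP
# connection. Returns the first reachable entry, or None.

def first_reachable(servers, timeout=3.0):
    for entry in servers:
        host, _, port = entry.rpartition(":")
        try:
            with socket.create_connection((host, int(port)), timeout=timeout):
                return entry
        except OSError:
            continue
    return None

# e.g. first_reachable(["home-server:4444", "work-server:4444"])
```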
Here you can configure the model synchronization and restore older versions of your speech model.
Simon creates the speech input files which are then compiled and used by the Simond server (see the section architecture for more details).
The process of sending the speech input files, compiling them and receiving the compiled versions is called “synchronization”. Only after the speech model has been synchronized do the changes take effect and a new restore point is set. This is why, per default, Simon will always synchronize the model with the server when it changes. This is called and is the recommended setting.
However, if you want more control you can instruct Simon to ask you before starting the synchronization after the model has changed, or to rely on manual synchronization altogether. When selecting manual synchronization you have to use the → menu item of the Simon main window every time you want to compile the speech model.
The Simon server will maintain a copy of the last five iterations of the model files. This only includes the “source files” (the vocabulary, grammar, etc.), not the compiled model; the compiled model will, however, be regenerated automatically from the restored source files.
After you have connected to the server, you can select one of the available models and restore it by clicking on .
In the actions configuration you can configure the reactions to recognition results.
Simon's recognition computes not only the single most likely result but the top ten results.
Each result is assigned a confidence score between 0 and 1 (where 1 means 100% certain).
Using the Minimum confidence you can set a minimum confidence score for recognition results to be considered valid.
If more than one recognition result is rated higher than the minimum confidence score, Simon will show a popup listing the most likely options for you to choose from.
This popup can be disabled using the check box.
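The result handling can be sketched as below. This is an illustration of the described behaviour, not Simon's code; the result texts and scores are invented.

```python
# Sketch of the result handling described above: keep only results at or
# above the minimum confidence. A single survivor is executed directly;
# several survivors would be listed in the disambiguation popup.

def filter_results(results, minimum_confidence):
    """results: list of (text, score) pairs, score between 0 and 1."""
    accepted = [(t, s) for t, s in results if s >= minimum_confidence]
    return sorted(accepted, key=lambda pair: pair[1], reverse=True)

results = [("open file", 0.92), ("open fire", 0.61), ("hope file", 0.15)]
filter_results(results, 0.5)
# → [("open file", 0.92), ("open fire", 0.61)]: popup with two choices
```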
Many plugins of Simon have a graphical user interface.
The fonts of these interfaces can be configured here centrally, independently of the system's font settings.
Here you can find the global list element configuration. This serves as a template for new scenarios but is also directly used for the popup for ambiguous recognition results.
Some parts of Simon, most notably the dialog command plugin, employ text-to-speech (or "TTS") to read text aloud.
Multiple external TTS solutions can be used to allow Simon to talk. Multiple backends can be enabled at the same time and will be queried in the configured order until one is found that can synthesize the requested message.
The following backends are available:
Instead of an engine to convert arbitrary text into speech, text-snippets can be pre-recorded and will be simply played back.
Uses the Jovie TTS system. This requires a valid Jovie set-up.
The webservice backend can be used to talk to any TTS engine that has a web front-end that returns a .wav file.
Instead of using an external TTS engine, you can also record yourself or other speakers reading the texts aloud. Simon can then play back these pre-recorded snippets when they are requested from its text-to-speech engine.
These recorded sound bites are organized into "sets" of different speakers which can also be imported and exported to share them with other Simon users.
Through the webservice backend, Simon can use web-based TTS engines like MARY.
You can provide any URL. Simon will replace any instance of "%1" within the configured URL with the text to synthesize. The backend expects the queried webservice to return a
.wav file that will be streamed and output through Simon's sound layer, respecting the sound device configuration.
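The substitution can be sketched as below. The MARY-style URL is a made-up placeholder, not a real endpoint guaranteed to exist; only the "%1" replacement illustrates the documented behaviour.

```python
from urllib.parse import quote

# Sketch of the URL substitution described above: "%1" in the configured
# URL is replaced with the (URL-encoded) text to synthesize. The queried
# service is then expected to return a .wav file.

def build_tts_url(url_template, text):
    return url_template.replace("%1", quote(text))

# Placeholder MARY-style template, for illustration only:
template = "http://localhost:59125/process?INPUT_TEXT=%1&AUDIO=WAVE"
build_tts_url(template, "hello world")
# → "http://localhost:59125/process?INPUT_TEXT=hello%20world&AUDIO=WAVE"
```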
For this we use KDE's social desktop facilities and our own category for Simon scenarios on kde-files.org.
If you already have an account on opendesktop.org you can input the credentials there. If you don't, you can register directly in the configuration module.
The registration is of course free of charge.
In the webcam configuration, you can set the frames per second (fps) and select the webcam to use when multiple webcams are connected to your system.
Frames per second is the rate at which the webcam produces unique consecutive images, called frames. For proper performance, an fps value between 5 and 15 is optimal.
Simon is targeted towards end-users. Its interface is designed to allow even users without any background in speech technology to design their own language and acoustic models by providing reasonable default values for simple uses.
In special cases (severe speech impairments for example), special configuration might be needed. This is why the raw configuration files for the recognition are also respected by Simon and can of course be modified to suit your needs.
There are basically two parts of the Julius configuration that can be adjusted:
This configures the sound stream the Simon client sends to Simond. The file is read directly by the adinstreamer.
Simon ships with a default adin.jconf without any special parameters. Changing this system-wide configuration will affect all user accounts on your machine that use Simon. To change the configuration for just one of those users, copy the file to the user path (see below) and edit the copy.
This is a configuration of the Simond server and directly influences the recognition. This file is parsed by libjulius and libsent directly.
Simond ships with a default julius.jconf. Whenever a new user is added to the Simond database, Simond will automatically copy this system-wide configuration to the new user. After that, the user is free to change it without affecting other users. This way the “template” (the system-wide configuration) can itself be changed without affecting existing users.
The path to the Julius configuration files will depend on your platform:
Table 3.1. Julius Configuration Files
| File | Microsoft Windows | Linux |
|---|---|---|
| adin.jconf (system) | (installation path)\share\apps\simon\adin.jconf | `kde4-config --prefix`/share/apps/simon/adin.jconf |
| julius.jconf (template) | (installation path)\share\apps\simond\default.jconf | `kde4-config --prefix`/share/apps/simond/default.jconf |