Sphinx-4 Frequently Asked Questions
General
Who created Sphinx-4?
Sphinx-4 was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).
I have a question about Sphinx-4. How can I get it answered?
First, check this FAQ; many questions are answered here. If your question is not in the FAQ, you can post it to the Sphinx4 Open Discussion Forum on SourceForge. Many of the Sphinx-4 developers monitor this forum and answer technical questions.
How can I contact the Sphinx-4 team?
You can contact the Sphinx-4 team by sending email to cmusphinx-contacts at sourceforge dot net.
How well does Sphinx-4 perform compared to other speech recognizers?
Comparing speech recognizers is often difficult, since speed and accuracy data for commercial recognizers is not typically available. We have compared Sphinx-4 with the Sphinx 3.3 recognizer. Results of this comparison are here: Sphinx-4 Performance Comparison.
Isn't the Java Platform too slow to be used for speech recognition?
No, rumors of the poor performance of the Java platform are unfounded. Sphinx-4 runs faster than Sphinx 3.3 (CMU's fast recognizer) on many tests. For a good discussion of Java platform performance in speech engines, see FreeTTS - A Performance Case Study, a technical paper that compares the performance of a speech synthesis engine written in the Java programming language to its native-C counterpart.
Which Sphinx-4 distribution should I use?
Download the binary distribution if:
- You just want to check out Sphinx-4 by running the demos.
- You want to build applications using Sphinx-4, but you don't want to touch the source code of Sphinx-4.
Download the source distribution if you want to do everything above, plus:
- You want to get all the source code of Sphinx-4, so that you can understand how Sphinx-4 works, and do your experimentation with Sphinx-4.
- You want to build Sphinx-4 from the ground up.
- You want to run the unit tests.
- You want to run the regression tests.
Does Sphinx-4 support the Java Speech API (JSAPI)?
Currently, Sphinx-4 does not support the full Java Speech API. Instead, Sphinx-4 uses a lower-level API. However, Sphinx-4 does support Java Speech Grammar Format (JSGF) grammars.
Where can I learn more about the Java Speech Grammar Format (JSGF)?
A complete description of the JSGF can be found in the JSGF Grammar Format Specification.
Can I use Sphinx-4 in a J2ME device such as a phone or a PDA?
Probably not. Sphinx-4 requires version 1.4 of the Java platform, which is typically not available on smaller devices. Sphinx-4 also requires more memory than is typically available on a J2ME device; even simple digit recognition requires a 16 MB heap. In addition, Sphinx-4 makes extensive use of floating-point math, and most J2ME devices do not have adequate floating-point performance for Sphinx-4.
Why can't I use Java versions prior to 1.4?
Sphinx-4 uses many language and API features of version 1.4 of the Java platform, including the logging API, the regular expressions API, the XML parsing APIs, and the assert facility.
I am having microphone troubles under Linux. What can I do?
There seems to be a significant difference in how different versions of the JDK determine which audio resources are available on Linux, and this difference affects different machines in different ways. We are working with the Java Sound folks to get to the root cause of the problem. In the meantime, if you are having trouble getting the demos to work on your Linux box, try the following:
- Try a native sound recording application (such as gnome-sound-recorder) to ensure that you can actually capture audio on your system.
- Try the AudioTool demo to see if you can record audio from a Java application.
- Check to see if any sound daemons like esd, gstreamer or artsd are running. These daemons may have exclusive access to the sound device. If any of these are running, kill them and try running again.
- Try switching to another version of the JDK. If JDK 1.4 doesn't work, try 1.5 and vice versa.
How do I select a different microphone (e.g., a USB headset) on my machine?
By default, Sphinx-4 uses the getLine method of the Java Sound AudioSystem class to obtain a TargetDataLine (i.e., the object used to interface to your microphone). This method grabs a line from any of the available Mixers known to the AudioSystem. As such, when using the AudioSystem to obtain the TargetDataLine, you have little control over which line is chosen if more than one line matches the requirements of the front end. For example, if you plug a USB headset into a Macintosh PowerBook, the getLine method of the AudioSystem class will typically never select a line from the USB device.
This behavior can be frustrating, especially when you have a nice USB microphone you'd like to use.
To override the default behavior, you can set the selectMixer property of the Microphone class. In Java Sound, a Mixer is an audio device with one or more lines. In practice, a Mixer tends to be mapped to a particular system audio device. For example, on the Mac, there's a Mixer associated with the built-in audio hardware. Furthermore, when you plug in a USB headset, a new Mixer will appear for that headset. The selectMixer property allows you to specify which specific Mixer Sphinx-4 will use to select the TargetDataLine.
The value of the selectMixer property can be:
- "default": let the AudioSystem decide which line to use from all the available Mixers;
- "last": select the last Mixer supported by the AudioSystem (USB headsets tend to be associated with the last Mixer); or
- an integer value representing the index of the Mixer.Info returned by AudioSystem.getMixerInfo().
To get the list of Mixer.Info objects available on your system, along with their integer index values, run the AudioTool application with a command line argument of "-dumpMixers".
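If you prefer to list the available Mixers programmatically rather than through AudioTool, a small sketch using only the standard javax.sound.sampled API (this class is not part of Sphinx-4) looks like this:

import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Mixer;

public class ListMixers {
    public static void main(String[] args) {
        // Each index printed here corresponds to the integer value that
        // can be given to the Microphone's selectMixer property.
        Mixer.Info[] mixerInfos = AudioSystem.getMixerInfo();
        for (int i = 0; i < mixerInfos.length; i++) {
            System.out.println(i + ": " + mixerInfos[i].getName()
                    + " - " + mixerInfos[i].getDescription());
        }
    }
}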
To set the selectMixer property of the Microphone, you need to have a component in your config file that defines the microphone. In the examples in Sphinx-4, this component is aptly named "microphone." In the configuration for the microphone component, you can set the selectMixer property in the config file for the application. For example:
<property name="selectMixer" value="last"/>

You can also set the selectMixer property from the command line. For example:
java -Dmicrophone[selectMixer]=last -jar bin/AudioTool.jar

In both of these examples, the last Mixer discovered by the Java Sound AudioSystem class will be used to select the TargetDataLine for the microphone.
Where can I find a speech synthesizer for the Java platform?
The Speech Integration group of Sun Labs has released FreeTTS, a speech synthesis system written in the Java programming language.
I want to add speech recognition to my application. Where do I start?
First, look at the source code for the Sphinx-4 demos to get a feel for how to write a Sphinx-4 application. After that, read the Sphinx-4 Application Programmer's Guide for a description of how to write a Sphinx-4 application.
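For orientation, a minimal Sphinx-4 application typically follows the pattern sketched below. This is modeled loosely on the HelloWorld demo; the configuration file name ("helloworld.config.xml") and the component names ("recognizer", "microphone") are placeholders for whatever your own configuration defines, and details may vary between Sphinx-4 releases.

import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class SimpleRecognizer {
    public static void main(String[] args) throws Exception {
        // Load the XML configuration that defines the recognizer, front end, etc.
        ConfigurationManager cm = new ConfigurationManager(
                SimpleRecognizer.class.getResource("helloworld.config.xml"));

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // Start capturing audio from the microphone defined in the config.
        Microphone microphone = (Microphone) cm.lookup("microphone");
        if (!microphone.startRecording()) {
            System.err.println("Cannot start the microphone.");
            recognizer.deallocate();
            return;
        }

        // Each call to recognize() returns one recognized utterance.
        Result result = recognizer.recognize();
        if (result != null) {
            System.out.println("You said: " + result.getBestFinalResultNoFiller());
        }

        recognizer.deallocate();
    }
}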
How can I decode/transcribe .wav files?
Take a look at the Hello Wave Demo program, a command-line application that transcribes audio in a '.wav' file. Additionally, the Transcriber Demo demonstrates how Sphinx-4 can be used to transcribe a continuous audio file with multiple utterances.
How can I get the recognizer to return partial results while a recognition is in process?
It is possible to configure Sphinx-4 to generate partial results, that is, to inform you periodically as to what it thinks is the best possible hypothesis so far, even before the user has stopped speaking.
To get this information, add a result listener to the recognizer. Your listener will receive a result (which may or may not be a final result), and the hypothesis text can be extracted from that result.
There is a good example of this in sphinx4/tests/live/Live.java
You can control how often the result listener is fired by setting the configuration variable 'featureBlockSize' in the decoder. The default setting of 50 indicates that the listener will be called after every 50 frames. Since each frame represents 10 ms of speech, the listener is called every 500 ms.
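A rough sketch of such a listener is shown below. It is only an illustration: in releases where ResultListener extends Configurable the listener must also implement newProperties(), and the interface details may differ between Sphinx-4 versions, so treat sphinx4/tests/live/Live.java as the authoritative example.

import edu.cmu.sphinx.decoder.ResultListener;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.PropertyException;
import edu.cmu.sphinx.util.props.PropertySheet;

public class PartialResultPrinter implements ResultListener {

    // Called by the decoder every 'featureBlockSize' frames with the
    // current best hypothesis; result.isFinal() tells you whether the
    // utterance has ended.
    public void newResult(Result result) {
        String hypothesis = result.getBestResultNoFiller();
        if (!result.isFinal()) {
            System.out.println("partial: " + hypothesis);
        } else {
            System.out.println("final:   " + hypothesis);
        }
    }

    // Required because ResultListener extends Configurable in this
    // Sphinx-4 version; nothing to configure here.
    public void newProperties(PropertySheet ps) throws PropertyException {
    }
}

The listener is attached with recognizer.addResultListener(new PartialResultPrinter()).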
How can I get the N-Best list?
The method Result.getResultTokens() returns a list of all the tokens associated with paths that have reached the end-of-sentence state.
This list is not a traditional N-best list of results, and some good results are not represented in it. We also support full word lattices that can provide full N-best lists. We currently do not have any user documentation for this; however, we will be providing some shortly.
See also: How can I obtain confidence scores for the recognition result?
How can I detect and ignore out-of-grammar utterances?
An out-of-grammar utterance occurs when a speaker says something that is not represented by the speech grammar. Usually, the recognizer will try to force a match between what was said and the grammar. Many applications need to detect when the user has spoken something unexpected. This is called out-of-grammar detection.
The FlatLinguist and the DynamicFlatLinguist can be configured to detect out-of-grammar utterances. To do so, set the following properties of either linguist:
- addOutOfGrammarBranch property to true
- outOfGrammarProbability to a small value (e.g. 1E-20); a smaller probability makes it less likely that an utterance will be recognized as out-of-grammar
- phoneInsertionProbability to a small value (e.g. 1E-10)
- phoneLoopAcousticModel to the acoustic model you are using, typically the Wall Street Journal (WSJ) model. WSJ has a wide enough range of phones to ensure that rejection works well.
When configured this way, the search will look for out-of-grammar utterances. If an out-of-grammar utterance is detected, Sphinx-4 will return a result that contains a single <unk> word. Moreover, if you want to know the exact sequence of phones that make up the unknown word, you can call the method:
Result.getBestToken().getWordUnitPath()
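For example, an application might check the returned hypothesis for <unk> and, if present, retrieve the phone sequence. This is only a sketch built from the methods mentioned above; whether <unk> appears in the no-filler hypothesis text can depend on your dictionary's filler definitions.

import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;

public class OutOfGrammarCheck {
    // Recognize one utterance and report whether it matched the grammar.
    // Assumes the linguist was configured with the properties listed above.
    public static void recognizeOnce(Recognizer recognizer) {
        Result result = recognizer.recognize();
        if (result == null) {
            return;
        }
        String hypothesis = result.getBestFinalResultNoFiller();
        if (hypothesis.indexOf("<unk>") >= 0) {
            // Out-of-grammar: inspect the phone loop path.
            System.out.println("Out-of-grammar utterance, phones: "
                    + result.getBestToken().getWordUnitPath());
        } else {
            System.out.println("In-grammar utterance: " + hypothesis);
        }
    }
}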
How can I change my language models or grammars at runtime?
The JSGFGrammar class provides methods that allow you to swap in a new JSGF grammar or modify the currently active RuleGrammar used by a given Recognizer. The JSGFDemo gives an example of how to do this. To handle more complex cases, such as switching between N-Gram language models, you can configure more than one Recognizer (one grammar per Recognizer) and switch between those Recognizers. The Dialog Demo provides an example of how to do this.
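As a rough illustration (the package and method names below follow the JSGFDemo-era API and may differ between Sphinx-4 releases, so check the demo source that ships with your version), swapping grammars might look like this:

import edu.cmu.sphinx.jsapi.JSGFGrammar;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class GrammarSwitcher {
    // Sketch: load a different .gram file at runtime. Assumes the
    // configuration defines a JSGFGrammar component named "jsgfGrammar"
    // whose grammarLocation already points at the grammar directory.
    public static void switchGrammar(ConfigurationManager cm, String grammarName)
            throws Exception {
        JSGFGrammar jsgfGrammar = (JSGFGrammar) cm.lookup("jsgfGrammar");
        // Loads <grammarName>.gram and makes it the active RuleGrammar.
        jsgfGrammar.loadJSGF(grammarName);
    }
}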
How can I perform word-spotting?
There is no support for word-spotting right now.
Can I use Sphinx-4 to recognize telephone audio?
The issue with telephone audio is that it has a limited frequency range. Unlike an ordinary microphone recording, which contains frequencies up to about 8000 Hz, telephone audio is passed through frequency filters and contains only frequencies from roughly 200 Hz to 3500 Hz. That makes it impossible to recognize telephone audio with an ordinary microphone acoustic model; you need to use specialized models instead.
There are a few commonly distributed telephone models you can use, most notably the Communicator models, the WSJ_8k model from Sphinx-4, and the Voxforge English model.
To configure Sphinx-4 with an 8 kHz model, change two things: the mel filter parameters and the model itself:
<component name="melFilterBank" type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
    <property name="numberFilters" value="31"/>
    <property name="minimumFrequency" value="200"/>
    <property name="maximumFrequency" value="3500"/>
</component>

<component name="sphinx3Loader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
    <property name="logMath" value="logMath"/>
    <property name="unitManager" value="unitManager"/>
    <property name="location" value="the path to the model folder"/>
</component>

<component name="wsjLoader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
    <property name="logMath" value="logMath"/>
    <property name="unitManager" value="unitManager"/>
    <property name="location" value="resource:/WSJ_8gau_13dCep_8kHz_31mel_200Hz_3500Hz"/>
    <property name="modelDefinition" value="etc/WSJ_8gau_13dCep_8kHz_31mel_200Hz_3500Hz.4000.mdef"/>
    <property name="dataLocation" value="cd_continuous_8gau/"/>
</component>
Where can I get the audio data used in the regression tests?
Much of the audio data used in the regression tests is obtained from the Linguistic Data Consortium.
How do I use the Result object?
A search result typically consists of a number of hypotheses. Each hypothesis is represented by a path through the search space, and each path is represented by a single token corresponding to the end point of the path. Using the token.getPredecessor() method, an application can trace back through the entire path to the beginning of the utterance.
Each token along the path contains numerous interesting data that can be used by the application, including:
- the total path score up to this point (retrieved by getScore())
- a frame number indicating which input frame this token is associated with
- a pointer to a state in the search graph corresponding to this token (a token may correspond to a word, unit, HMM, HMM state, or other things); this pointer allows the application to retrieve the word, unit, or HMM information associated with the token
The method getScore() returns the path score for the path represented by a particular token. This is the total score, which includes the language, acoustic, and insertion components. getAcousticScore() returns the acoustic score for a token; this score represents how well the associated search state matches the input feature for the frame associated with the token, and is typically present only for 'emitting' states. getLanguageScore() returns the language component of the score, and getInsertionProbability() returns the insertion component of the score. So getScore() returns (all values are in the log domain):
entryScore + getAcousticScore() + getLanguageScore() + getInsertionProbability()
(where entryScore is token.getPredecessor().getScore())
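For example, a sketch that walks the best path back to the start of the utterance and prints the score components of each token, using only the accessor methods described above, might look like this:

import edu.cmu.sphinx.decoder.search.Token;
import edu.cmu.sphinx.result.Result;

public class PathDumper {
    // Print the score components of every token on the best path,
    // from the end of the utterance back to its beginning.
    public static void dumpBestPath(Result result) {
        Token token = result.getBestToken();
        while (token != null) {
            System.out.println("score=" + token.getScore()
                    + " acoustic=" + token.getAcousticScore()
                    + " language=" + token.getLanguageScore()
                    + " insertion=" + token.getInsertionProbability());
            token = token.getPredecessor();
        }
    }
}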
Does Sphinx-4 support speaker identification?
Sphinx-4 currently does not support speaker identification, which is the process of identifying who is speaking. However, the architecture of Sphinx-4 is flexible enough for someone to add such capabilities. To learn more about speaker identification, see: http://www.speech.cs.cmu.edu/comp.speech/Section6/Q6.6.html
For those interested in implementing it, it makes sense to first investigate software specifically targeted at speaker identification, such as MISTRAL.
How can I obtain confidence scores for the recognition result?
Some experimental work has been done to support confidence scores. As this work is still experimental, please use it with caution. Please refer to the Confidence Score Demo for example code of how to do this. Note that currently this only works for configurations using the LexTreeLinguist and the WordPruningBreadthFirstSearchManager.
How can I decode without a predefined vocabulary?
To achieve a reasonable error rate, the decoder still needs some form of vocabulary. To decode arbitrary words, subword-based decoding is used. The vocabulary might be phone-based, which is easy to construct, or built from automatically selected subword units of larger size (typically sequences of 4-6 phones). In both cases you need to build a subword dictionary and a subword language model. With phone-based decoding the phone error rate will be significant (40-60%), so it is mainly useful for research purposes. Subword-based decoding with larger subword units is often used in practice. For more details, see articles such as:
Kenney Ng, Victor W. Zue. Subword-based Approaches for Spoken Document Retrieval (1999).
How can I train my own acoustic models?
Sphinx-4 loads Sphinx-3 acoustic models. These can be trained with the Sphinx-3 trainer, SphinxTrain.
How do I use models trained by SphinxTrain in Sphinx-4?
Please refer to the document Using SphinxTrain Models in Sphinx-4.
Does the Sphinx-4 front end generate the same features as the SphinxTrain wave2feat program?
The features that SphinxTrain generates are cepstra, which are usually 13-dimensional. The features that Sphinx-4 generates are more than cepstra: they are 39-dimensional, consisting of the cepstrum, the delta of the cepstrum, and the double delta of the cepstrum (thus 3x the size). To make Sphinx-4 generate the same cepstra as SphinxTrain's wave2feat, remove the last two steps in the front end, so that it looks like:

<component name="mfcFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>streamDataSource</item>
        <item>premphasizer</item>
        <item>windower</item>
        <item>fft</item>
        <item>melFilterBank</item>
        <item>dct</item>
    </propertylist>
</component>
How can I create my own language models?
N-Gram language models can be created with the CMU Statistical Language Modeling (SLM) Toolkit. For more information, see this Example of building a Language Model.
How can I create my own dictionary?
Sphinx-4 currently supports dictionaries in the CMU dictionary format, which is described in the FullDictionary javadocs. Each line of the dictionary specifies the word, followed by spaces or a tab, followed by the pronunciation of the word (given as a list of phones). Each word can have more than one pronunciation. For example, a digits dictionary will look like:
ONE      HH W AH N
ONE(2)   W AH N
TWO      T UW
THREE    TH R IY
FOUR     F AO R
FIVE     F AY V
SIX      S IH K S
SEVEN    S EH V AH N
EIGHT    EY T
NINE     N AY N
ZERO     Z IH R OW
ZERO(2)  Z IY R OW
OH       OW

In the above example, the words "one" and "zero" have two pronunciations each.
Some more details on the format of the dictionary can be found at the CMU Pronouncing Dictionary page.
Note that the phones used to define the pronunciation for a word can be arbitrary strings. It is important, however, that they match the units in the acoustic model. If you unpack an acoustic model, you will find among the many files a file with the suffix ".mdef". This file contains a mapping of units to senones (tied Gaussian mixtures). The first column in this file represents the unit names (phones) used by the acoustic model.
Your dictionary should use these units to define the pronunciation for a word.
I've created my own language model. How do I create the binary (DMP) form?
Use the sphinx_lm_convert tool from the sphinxbase package. This tool allows you to convert from the ARPA language model format to the DMP format and back. Please note that some LM tools create non-standard language models that require additional preprocessing; for example, use the sphinx_lm_sort tool from sphinxbase to convert an SRILM model into a proper ARPA model that can be handled by the CMUSphinx tools and libraries. Note that the format of the output of the CMU/CU SLM Toolkit program idngram2lm (using the -binary option) is different from the DMP format and therefore cannot be read by Sphinx-4.
How can I get support for recognition rate problems?
The CMUSphinx team is always willing to help you with your problems using CMUSphinx, but please understand that without a detailed description it is hard to help, since there are too many unknowns. You might have made a minor mistake somewhere, and that minor typo can break the whole system.
If you are going to ask on the forum about accuracy issues, especially with your own trained model, please understand that we cannot guess what you have done. You need to provide a detailed description of your work, provide the data you are using, and describe your expectations.
The easiest way to provide the data is to pack everything into an archive and upload it to a public hosting service like RapidShare, Mediafire, or some other resource, then post a link to the archive on the forum. Pack everything you are using into the archive; if even a single file is missing, support becomes much more complicated. If you are training an acoustic database, pack the whole training folder. If you are running a modified Sphinx-4 demo, pack the demo jar, the Sphinx-4 jar, the demo source code with your modifications, and the audio recordings into the archive.
The more information you provide, the better the chance you will get a helpful response.