Overview of Text To Speech Subsystem
Traditional IVR systems use Text To Speech subsystems of two types.
Type 1. Simple sequence of prompts.
This is a very simple and moderately flexible approach.
The main drawback of this approach is that the phrases are hardcoded into the application code. That is not a problem for a single-language IVR application, but if the IVR has to speak several languages, parts of the application usually have to be rewritten for each language. Maintaining such an application quickly becomes a serious headache as the application and the number of supported languages grow.
Type 2. Speech synthesis systems.
This is the simplest approach from the IVR developer's point of view, as the programmer only has to supply phrases to the speech synthesizer and get back the prompts.
However, this is currently the least viable solution, as the approach has several big problems:
- very low speech quality. There are synthesizers of acceptable quality, but they are commercial.
- speech synthesizers are available for a very small set of languages, and creating a new synthesizer is a non-trivial and very expensive task.
VApp's approach to TTS
To address the problems mentioned above, a new approach was designed that combines the best ideas of the first two. It makes it easy to add new languages and to change complex phrases almost without touching the source code (only string constants have to be changed, not the application logic). The basic ideas are as follows:
1. Clear text.
Every phrase to be pronounced must appear in explicit written form. This idea is borrowed from the speech synthesizer approach. An IVR program therefore never looks like this:
play("prompt1")
playNumber(num)
play("prompt2")
That was quite puzzling, wasn't it? Now it is much more readable:
say(_tts("There are %n messages currently in your mailbox"), [num])
2. Plural forms and localization.
Localization is a well-established problem, and several toolsets address it. gettext was chosen as the most popular open source localization package.
As you may have noticed, the previous example is incomplete. If num equals 1, the phrase should be "There is one message currently in your mailbox". Luckily, gettext addresses this problem:
say(_ntts("There is %n message currently in your mailbox",
          "There are %n messages currently in your mailbox", num), [num])
Simple, isn't it? Especially if you imagine what the code for the simple sequence approach would look like:
if num == 1:
    play("there_is")
else:
    play("there_are")
playNumber(num)
if num == 1:
    play("message")
else:
    play("messages")
play("prompt3")
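The choice between the singular and plural phrase can be illustrated with Python's standard gettext module. This is only a minimal sketch, assuming that _ntts behaves like ngettext; the source does not show how _ntts is implemented, and pick_phrase is a hypothetical helper name. With no catalog loaded, NullTranslations falls back to the English rule (singular only when the count is 1).

```python
import gettext

# NullTranslations is the stdlib fallback used when no message catalog is loaded.
translations = gettext.NullTranslations()

def pick_phrase(num):
    """Return the untranslated phrase template that fits the count (hypothetical helper)."""
    return translations.ngettext(
        "There is %n message currently in your mailbox",
        "There are %n messages currently in your mailbox",
        num,
    )

print(pick_phrase(1))
print(pick_phrase(5))
```

With a real catalog for another language, the same call would return the translated form selected by that language's plural rule, so the application logic stays unchanged.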
gettext helps to solve one more problem. The word order of the translated text may be completely different from English. For example, the Russian translation may look like this:
"%n сообщений сейчас в вашем ящике"
As you can see, the number is now at the start of the phrase. With the simple sequence approach, this would lead to code like this:
if lang == RUSSIAN:
    playNumber(num)
    play("ru/prompt2")
elif lang == ENGLISH:
    play("prompt1")
    playNumber(num)
    play("prompt2")
And this is even without plural forms! (Not to mention that Russian has 3 plural forms, not 2 as in English.)
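To make the three Russian plural forms concrete, here is a small sketch of the plural rule that the gettext manual lists for Russian, written as a Python function. The function name and the sample word forms are illustrative only; in a real catalog the rule lives in the Plural-Forms header of the translation file, not in application code.

```python
def russian_plural_index(n):
    """Return 0, 1 or 2: the index of the Russian plural form for the count n."""
    if n % 10 == 1 and n % 100 != 11:
        return 0  # 1, 21, 31, ... (but not 11)
    if 2 <= n % 10 <= 4 and not (10 <= n % 100 < 20):
        return 1  # 2-4, 22-24, ... (but not 12-14)
    return 2      # everything else: 0, 5-20, 25-30, ...

# Illustrative forms of the word "message" for the three indexes.
FORMS = ["сообщение", "сообщения", "сообщений"]

for n in (1, 2, 5, 11, 21, 22, 112):
    print(n, FORMS[russian_plural_index(n)])
```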
3. Numbers, dates, etc.
Let's consider the previous example one more time:
say(_tts("There are %n messages currently in your mailbox"), [num])
Note the '%n' tag. This is how the new IVR application platform handles variable numbers, dates and so on. It works similarly to the well-known printf() function, except that '%n' is replaced with the verbal representation of the number num:
num = 2: "There are two messages currently in your mailbox"
num = 123: "There are one hundred twenty three messages currently in your mailbox"
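The expansion of '%n' into words can be sketched as follows. This is not the VApp implementation, just a minimal illustration limited to English and to the range 0..999; number_to_words and render are hypothetical names.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(num):
    """Spell out an integer in English (this sketch handles 0..999 only)."""
    if not 0 <= num <= 999:
        raise ValueError("this sketch only handles 0..999")
    if num < 20:
        return ONES[num]
    if num < 100:
        tens, ones = divmod(num, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    hundreds, rest = divmod(num, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")

def render(template, num):
    """Replace every plain '%n' tag with the spelled-out number."""
    return template.replace("%n", number_to_words(num))

print(render("There are %n messages currently in your mailbox", 123))
```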
Currently there are four supported conversions:
%n - numbers and digits
%d - durations
%D - datetime
%s - simple string substitution
The full form of a conversion is '%(ident)[flags]n'.
The 'ident' binds the conversion to a named value. For example:
say(_tts("There are %(new)n new and %(old)n old messages"), kw = { 'old':num_old, 'new':num_new })
The 'flags' are additional parameters that can be supplied to a conversion.
- %n flags:
- N - say number, not digits (default)
- D - say digits, not number
- H - use ordinal form
Examples:
say(_tts("%[D]n"), [ 123 ]) will say "one two three"
say(_tts("%[H]n message"), [ 1 ]) will say "first message"
- %d flags
- H - don't say hours
- M - don't say minutes
- %D flags
- D - don't say date
- T - don't say time
- S - don't say seconds
Flags can also have language-specific values. For example, in Russian you can specify the grammatical case (падеж) of the numerals. For more information on language-specific flags, please consult the documentation in the sources of the appropriate TextSynth language module.
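The conversion syntax described in this section can be recognized with a single regular expression. The sketch below is an assumption about how such tags could be parsed, not the parser VApp actually uses; CONVERSION and parse_conversions are hypothetical names, and both the '(ident)' and '[flags]' parts are treated as optional.

```python
import re

# %  (ident)?  [flags]?  conversion letter (n, d, D or s)
CONVERSION = re.compile(
    r"%(?:\((?P<ident>\w+)\))?(?:\[(?P<flags>[^\]]*)\])?(?P<conv>[ndDs])"
)

def parse_conversions(template):
    """Return an (ident, flags, conversion) tuple for every tag in the template."""
    return [
        (m.group("ident"), m.group("flags") or "", m.group("conv"))
        for m in CONVERSION.finditer(template)
    ]

print(parse_conversions("There are %(new)n new and %(old)n old messages"))
print(parse_conversions("%[D]n"))
```

A tag without an ident yields None in the first position, which a caller could treat as "take the next positional value".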
4. Speech Synthesis.
OK, now we have clear text phrases for our messages. But how do we play them?
For this purpose there is a mapping system that converts phrases into sequences of prompts. This is what the 'vapp.SpeechSynth.Chunked' class is for. Prompt-to-phrase mappings are located in 'prompt_map.txt' files. Each line of these files has a simple format:
<prompt path>|<phrase>
The vapp.SpeechSynth.Chunked class matches the longest possible parts of the phrases. The VApp software also contains the vapp_prompt_util.py utility, which helps to check whether all the necessary prompts are present, which phrases should be recorded and which prompt files are missing, and has several other useful features.
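The longest-match idea can be sketched in a few lines. This is an illustration only, not the code of vapp.SpeechSynth.Chunked: it assumes that matching is done on whole words, and the function names and prompt paths are made up for the example.

```python
def load_prompt_map(lines):
    """Parse '<prompt path>|<phrase>' lines into a {word tuple: prompt path} dict."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        path, phrase = line.split("|", 1)
        mapping[tuple(phrase.split())] = path
    return mapping

def chunk_greedy(phrase, mapping):
    """Split a phrase into prompts, always taking the longest matching run of words."""
    words = phrase.split()
    prompts = []
    pos = 0
    while pos < len(words):
        # Try the longest candidate first, then shrink it word by word.
        for end in range(len(words), pos, -1):
            path = mapping.get(tuple(words[pos:end]))
            if path is not None:
                prompts.append(path)
                pos = end
                break
        else:
            raise LookupError("no prompt for: " + " ".join(words[pos:]))
    return prompts

# Hypothetical prompt map contents.
mapping = load_prompt_map([
    "greeting/welcome|Welcome to",
    "greeting/system|the voicemail system",
    "greeting/full|Welcome to the voicemail system",
])
print(chunk_greedy("Welcome to the voicemail system", mapping))
print(chunk_greedy("Welcome to", mapping))
```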
Please note that the implementation is very simple for now and will not cope with a situation like this:
say("Press one for English")
1|Press
2|one
3|for English
4|Press one for
The application will fail because it matches the prompt for 'Press one for' and then cannot find a prompt for the remaining phrase 'English'.
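One possible way around this limitation is to backtrack: still try the longest match first, but fall back to a shorter one when the remainder cannot be covered. The sketch below is a suggestion, not something VApp is stated to implement; it assumes word-level matching, and chunk_backtracking is a hypothetical name. The prompt map is the one from the example above.

```python
PROMPT_MAP = {
    ("Press",): "1",
    ("one",): "2",
    ("for", "English"): "3",
    ("Press", "one", "for"): "4",
}

def chunk_backtracking(words, mapping):
    """Return prompt paths covering all words, or None if no full cover exists.

    Longer matches are tried first; a shorter match is used only when the
    longer one leaves a remainder that cannot be covered.
    """
    if not words:
        return []
    for end in range(len(words), 0, -1):
        path = mapping.get(tuple(words[:end]))
        if path is None:
            continue
        rest = chunk_backtracking(words[end:], mapping)
        if rest is not None:
            return [path] + rest
    return None

print(chunk_backtracking("Press one for English".split(), PROMPT_MAP))
```

For this map the function gives up on prompt 4 and returns prompts 1, 2 and 3. The price is extra work on long phrases with many overlapping prompts, which could be reduced by caching results per position.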
