Build a talking clock for your constructed language. The purpose of this exercise is to flesh out enough of your language to define a phonetic inventory, some words (including counting words), and enough syntax to make semantically useful sentences. You also get to build the first speech synthesizer ever in your language.
You will need
We will use the Festival Speech Synthesis System's new-language templates to build this; a few things beyond the information you defined in the first part are still required. We will also depend on some cross-lingual bootstrapping techniques (unless you already have access to an hour of phonetically balanced recordings and prompts in your new language).
In order to illustrate this process we will use a new constructed language called "Eth", spoken by northern European fishermen over 5000 years ago; that it is completely understandable to modern Japanese speakers is purely coincidental.
The complete packaged voice is available for download as eth_clock.tar.gz.
You must install the FestVox voice building tools. Follow the installation instructions for the CMU Wilderness dataset: clone the repository, check the prerequisites, and run ./bin/do_found make_dependencies. If this works, it will generate a file festvox_env_settings. You need to add that to your .bashrc, or source this file each time you want to build/run a voice.
You will need a computer to run this on: Linux, Mac OSX, or Windows 10 (where you should use the Windows Subsystem for Linux). You will also need to record audio. Recording tools are included, but they probably won't work on an arbitrary machine, so use Audacity instead; you'll need to record your prompts as 16KHz, mono, Microsoft RIFF format files with a small amount of silence (around 200ms) at the beginning and end of each file.
Festvox voices are built in a new directory, with all the files generated kept in various subdirectories. All the commands expect you to be in the main voice directory (not sub-directories) when run. But you will want to look at, play and edit some of the files in the subdirectories.
First make a new directory where you want to build the voice and change directory into it
mkdir eth_clock
cd eth_clock

We now want to copy in a number of basic template files, and make the directory structure.
$FESTVOXDIR/src/unitsel/setup_clunits cmu eth awb

The three arguments at the end are the institution making the voice (use upitt if appropriate), the language name (if you already have an ISO two or three letter code for your language use that, otherwise use something meaningful), and the name of the person who was recorded.
You now need to modify the templated language files for your language. First add your prompt list by putting the file in etc/txt.done.data. It must be of the format, one utterance per line and
( FILEID_0001 "text in your language ..." )
( FILEID_0002 "more text in your language ..." )
...

You can use anything for your fileids (I use "time_0001"), but don't use spaces in the file name, and make the filename length uniform across your prompts. You should start each line with a '(' followed by a space, then the file id; likewise have a space before the final ')' on the line. There should be no blank lines in the file. My example is eth_clock/etc/txt.done.data
The text within double quotes can be in any encoding, though you are encouraged to use ascii or utf8/unicode. There are ways to add support for the whitespace characters of your language, but for simplicity you should just use standard ascii spaces between words. If you need to have a double quote or a backslash within this quoted string, it must be escaped with a backslash (cf. Python and C strings).
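The file format above is easy to get slightly wrong by hand. Here is a small illustrative sketch (prompt_line and fileid are my own hypothetical helpers, not part of FestVox) that emits well-formed txt.done.data lines with uniform-width file ids and the required backslash escaping:

```python
# Hypothetical helpers (not part of FestVox) for writing txt.done.data lines.
def prompt_line(fileid, text):
    # Escape embedded backslashes and double quotes, as the format requires.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '( %s "%s" )' % (fileid, escaped)

def fileid(n):
    # Uniform-width ids: time_0001, time_0002, ...
    return "time_%04d" % n

print(prompt_line(fileid(1), "ichi ji desu"))
# ( time_0001 "ichi ji desu" )
```

Writing the whole prompt list this way guarantees there are no blank lines and that every line has the '( ' and ' )' spacing the tools expect.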
Next you have to define your phoneme set. Edit the file 'festvox/cmu_eth_awb_phoneset.scm' to add your phoneme list. That file has a deliberate error in it to identify where you have to change things: at line 43, delete the '(error ...' command (and the following line). Then go to the line
( pau - - 0 0 0 0 0 0 - - - )

and add your phoneme definitions after this silence 'pau' definition. The phonetic features are described above it (in order). The features clst, asp and nuk should always be '-'; the others are described. You might want to look at the 'radio_phones.scm' file in the Festival distribution for inspiration about what features to use (see $ESTDIR/../festival/lib/radio_phones.scm).
My phones for Eth happen to all have direct equivalents in English (this is not necessary for your language), though I deliberately use upper case to show they are different; you don't need to do that. Your phoneme names should be short symbols, starting with alphabetic letters, and should not have digits in their last position. You should avoid using cute characters here; it'll just make your life harder than you want it to be. If you stick to the (base 26) upper and lower case letters, everything is more likely to work.
Here are my additional phone definitions
( A  - + l 3 3 - 0 0 0 - - ) ;; aa
( AI - + d 3 2 - 0 0 0 - - ) ;; ay
( CH - - 0 0 0 0 a p - - - ) ;; ch
( E  - + l 2 1 0 0 0 0 - - ) ;; eh
( D  - - 0 0 0 0 s a + - - ) ;; d
( G  - - 0 0 0 0 s v + - - ) ;; g
( H  - - 0 0 0 0 f g - - - ) ;; hh
( I  - - l 1 1 - 0 0 0 - - ) ;; iy
( JH - - 0 0 0 0 a p + - - ) ;; jh
( K  - - 0 0 0 0 s v - - - ) ;; k
( M  - - 0 0 0 0 n l + - - ) ;; m
( N  - - 0 0 0 0 n a + - - ) ;; n
( O  - + l 2 3 + 0 0 0 - - ) ;; ow
( P  - - 0 0 0 0 s l + - - ) ;; p
( R  - - 0 0 0 0 r a + - - ) ;; r
( S  - - 0 0 0 0 f a + - - ) ;; s
( T  - - 0 0 0 0 s a - - - ) ;; t
( U  - + l 1 3 + 0 0 0 - - ) ;; uw
( Y  - - 0 0 0 0 r p + - - ) ;; y
( Z  - - 0 0 0 0 f a - - - ) ;; z

My updated phoneset definition file is in eth_clock/festvox/cmu_eth_awb_phoneset.scm
In comments at the end of each phone definition I add the equivalent (radio) phoneme name, we will come to that below.
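The naming rules above (short symbols, alphabetic first character, no digit in the last position, sticking to base-26 letters) are easy to violate accidentally. Here is a hypothetical sanity check of my own, not part of the FestVox tools, that encodes those rules:

```python
import re

# Hypothetical check of the phone-naming rules described above:
# starts with a letter, contains only ASCII letters/digits,
# and does not end in a digit.
def ok_phone_name(name):
    return (bool(re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", name))
            and not name[-1].isdigit())

assert ok_phone_name("AI")       # fine
assert not ok_phone_name("K2")   # digit in last position
assert not ok_phone_name("2K")   # starts with a digit
```

Running every phone name through a check like this before the build can save a confusing failure later.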
Lexical entries in Festival consist of the lexical head word, a part of speech tag, and a phonetic pronunciation which includes syllabification and a stress/tone marker. There is no default morphological analyser built-in to Festival so for this exercise you should list each word individually. There is a non-trivial lexical subsystem for building very large lexicons (with crowdsourcing) and efficient ways to hold these lexicons including statistical machine learning methods for letter to sound rules, but we're not going to use that here.
For example in Eth the word for 7 is 'nana'.
(lex.add.entry '( "nana" nil (((N A) 0) ((N A) 0))))

We can add all the lexical entries (I have 18 words) to the file 'festvox/cmu_eth_awb_lexicon.scm'. We do this after the default punctuation lexical entries at line 69: add your entries immediately before the single ')' on line 70.
My updated lexical entries are in eth_clock/festvox/cmu_eth_awb_lexicon.scm
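The lex.add.entry syntax is fiddly, with each syllable as a phone list plus a stress/tone marker. If you are generating many entries, something like the following hypothetical Python helper (my own, not part of FestVox) can write them for you:

```python
# Hypothetical generator for Festival lex.add.entry lines.
def lex_entry(word, syllables, pos="nil"):
    # syllables: list of (phone_list, stress) pairs
    syls = " ".join("((%s) %d)" % (" ".join(ph), st) for ph, st in syllables)
    return "(lex.add.entry '( \"%s\" %s (%s)))" % (word, pos, syls)

# The Eth word for 7, as in the example above:
print(lex_entry("nana", [(["N", "A"], 0), (["N", "A"], 0)]))
# (lex.add.entry '( "nana" nil (((N A) 0) ((N A) 0))))
```

Generating the entries keeps the parenthesis nesting consistent, which is the usual source of errors when editing the lexicon file by hand.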
In order to build a voice we need to know the start and end times of all the phonemes in the recordings you made. It is possible to do this by hand, but it's tedious (and it would take too long to install the appropriate software to help you do it). So we are going to use a cross-lingual alignment technique, which should work. We will generate approximations of your spoken forms in English and use them to help find the phone boundaries in your language. "But my language is X, so you can't do that." Well, actually, that doesn't matter: even if you have phones that aren't in English we can still do this, assuming your language has some notion of vowels and consonants. The fact that we are doing alignment, not recognition, means that this almost always works. What you need to do is give a mapping from each of your phonemes to an approximate English equivalent. For example, if you have aspirated K, unaspirated K and retroflex K, you can map all of these to English K. As your recordings are phonetically correct, all we want to do is align an English equivalent so we can find the phone boundaries for your different types of K.
We have done this for many languages. Normally, when we have sufficient recordings (30 minutes of speech), we can do the alignment directly in the language itself without any cross-lingual bootstrapping, but for a talking clock's small prompt set the cross-lingual method should be sufficient.
Again in the file 'eth_clock/festvox/cmu_eth_awb_lexicon.scm' you have to add the mapping, a function to do the mapping, and a call to that mapping function. Until you have built a voice the system (wrongly) defaults to using the English diphone synthesizer which is exactly what we want here.
My mapping, and mapping function look like this
(set! eng_map
  '(
    ( A aa )
    ( AI ay )
    ( CH ch )
    ( E eh )
    ( D d )
    ( G g )
    ( H hh )
    ( I iy )
    ( JH jh )
    ( K k )
    ( M m )
    ( N n )
    ( O ow )
    ( P p )
    ( R r )
    ( S s )
    ( T t )
    ( U uw )
    ( Y y )
    ( Z z )
    ( pau pau )
  ))

(define (do_eng_map utt)
  (mapcar
   (lambda (s)
     (set! x (assoc_string (item.name s) eng_map))
     (if x
         (item.set_feat s "us_diphone" (cadr x))
         (format t "no eng_map for %s\n" (item.name s))))
   (utt.relation.items utt 'Segment))
  utt)

I added this immediately after the single ')' line that follows my lexical entries. I also have to register this mapping function as a post-lexical rule, where it maps each lexical phone into one that the diphone synthesizer can use.
In the function cmu_eth_awb::select_lexicon, I changed
;; Post lexical rules
(set! postlex_rules_hooks (list cmu_eth::postlex_rule1))

to
;; Post lexical rules
(set! postlex_rules_hooks (list do_eng_map cmu_eth::postlex_rule1))
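Before building the prompts it is worth checking that every phone in your phoneset has an English mapping; an unmapped phone triggers the "no eng_map" warning at synthesis time. Here is a quick coverage check, a Python mirror of my Scheme table (my own sketch, not part of the build tools):

```python
# Python mirror of the Scheme eng_map above, for sanity-checking coverage.
eng_map = {
    "A": "aa", "AI": "ay", "CH": "ch", "E": "eh", "D": "d", "G": "g",
    "H": "hh", "I": "iy", "JH": "jh", "K": "k", "M": "m", "N": "n",
    "O": "ow", "P": "p", "R": "r", "S": "s", "T": "t", "U": "uw",
    "Y": "y", "Z": "z", "pau": "pau",
}

def unmapped(phoneset):
    """Phones that would produce 'no eng_map for ...' warnings."""
    return [p for p in phoneset if p not in eng_map]

assert unmapped(["A", "K", "pau"]) == []
# KH here is a hypothetical phone with no mapping:
assert unmapped(["A", "KH"]) == ["KH"]
```

Keeping this table in sync with your phoneset means the alignment step never silently skips a phone.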
Now we can start building the voice. First we want to generate the (fake) synthesized waveform files of your prompts, labeled with your phones.
./bin/do_build build_prompts_waves

This should generate waveform files in prompt-wav/*.wav with a bad English rendition of your intended prompts. You should listen to a few to ensure they are correct. Also look at the label files in prompt-lab/*.lab; these identify where the phones end in the prompt-wav/*.wav files. Again, check to see if the phone lists there are correct.
These are not going to be directly used in your talking clock, but will be used to align the phone strings to your acoustic recordings.
If everything is set up wonderfully you can record your prompts with the command
./bin/prompt_them etc/txt.done.data

It will display the text, play the fake synthesized prompt, estimate how long it will take you to speak it, and record for that time. However, recording on arbitrary machines seems to be orders of magnitude harder than synthesizing natural-sounding speech, so this might not work. Therefore, use some other recording tool; I recommend Audacity. Record each prompt in a separate file, and export it with a small amount of silence at the beginning and end (around 250ms) to a file with the right name. You should export it as 16KHz, mono, RIFF format (sometimes called .wav). Note there are multiple incompatible RIFF formats that might make your life harder.
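Getting the export format right is a common stumbling block. Assuming your exports are plain PCM .wav files, the Python standard-library wave module can verify them; check_wav is my own hypothetical helper, not part of FestVox:

```python
import wave

# Hypothetical checker: is this a 16KHz, mono, 16-bit PCM RIFF file?
def check_wav(path):
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)
```

Usefully, wave.open raises wave.Error on non-PCM RIFF variants, which is itself a signal that the export settings are wrong.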
If you successfully used prompt_them, your files will already be in the right format in the right directory; if you used Audacity, put your exported waveform files in the 'recording/' directory. Then use the command
./bin/get_wavs recording/*.wav

This should put a power-normalized version, in the right format, in the 'wav/' directory. Play some of these to ensure it worked.
./bin/make_labs prompt-wav/*.wav

This should produce phone label files in lab/*.lab. You should check these to ensure they look plausible. When alignment fails, often all of the phones are labeled at the end of the file instead of being spaced neatly through it.
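You can eyeball the label files, or script the check. Festival .lab files are in xlabel format: header lines up to a lone '#', then one "end_time color phone" line per segment. The sketch below (my own, hedged on that format assumption) flags files where most boundaries pile up at the end, the typical failure signature described above:

```python
# Hypothetical alignment sanity check for xlabel-format .lab files.
def read_lab(text):
    lines = text.splitlines()
    body = lines[lines.index("#") + 1:]                # skip the header
    segs = []
    for line in body:
        parts = line.split()
        if len(parts) >= 2:
            segs.append((float(parts[0]), parts[-1]))  # (end_time, phone)
    return segs

def looks_bunched(segs, tail=0.1):
    """True if over half the boundaries sit in the last 10% of the file."""
    if len(segs) < 2:
        return False
    total = segs[-1][0]
    late = sum(1 for t, _ in segs if t > (1.0 - tail) * total)
    return late > len(segs) / 2
```

Running this over lab/*.lab gives you a short list of files to re-record or re-check rather than listening to everything.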
Next we have to take the phone labels the DTW algorithm found and integrate them into an utterance structure that identifies the words, syllables and phones, from which we can extract information for our statistical voice build.
In order to build a unit selection speech synthesizer we have to analyse the speech. We first find the pitch periods, that is, when your glottis opened and closed while you spoke. Then we find the spectral properties of the speech at each pitch period, caused by the vocal tract shape.
./bin/do_build do_pm
./bin/do_build do_mcep

There are many options to optimize these, but for short databases the defaults should do.
There is one more modification we have to make to the default setup for our language. One of the files that describes the features used in building the clusters has to know about your phone set; you need to replace the list of English phones with a list of your phones. Edit the file festival/clunits/all.desc and replace the lines
( p.name 0 "aa" "ae" "ah" "ao" "aw" "ax" "axr" "ay" "b" "brth" "ch" "d" "dh" "dx" "eh" "el" "em" "en" "er" "ey" "f" "g" "h#" "hh" "hv" "ih" "iy" "jh" "k" "l" "m" "n" "ng" "nx" "ow" "oy" "p" "pau" "r" "s" "sh" "t" "th" "uh" "uw" "v" "w" "y" "z" "zh" )

and
( n.name ignore 0 "aa" "ae" "ah" "ao" "aw" "ax" "axr" "ay" "b" "brth" "ch" "d" "dh" "dx" "eh" "el" "em" "en" "er" "ey" "f" "g" "h#" "hh" "hv" "ih" "iy" "jh" "k" "l" "m" "n" "ng" "nx" "ow" "oy" "p" "pau" "r" "s" "sh" "t" "th" "uh" "uw" "v" "w" "y" "z" "zh" )

with
( p.name 0 A AI CH E D G H I JH K M N O P R S T U Y Z pau )

and
( n.name ignore 0 A AI CH E D G H I JH K M N O P R S T U Y Z pau )

respectively.
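If you prefer not to hand-edit the long phone lists, a small script can do the substitution. This is a hypothetical convenience of my own (not a FestVox tool) and assumes the p.name and n.name entries contain no other closing parentheses before the end of each list:

```python
import re

# Hypothetical one-shot edit of festival/clunits/all.desc: swap the English
# phone lists in the p.name and n.name feature entries for your own phones.
def replace_phone_lists(desc, phones):
    plist = " ".join(list(phones) + ["pau"])
    desc = re.sub(r"\( p\.name 0 [^)]*\)",
                  "( p.name 0 %s )" % plist, desc)
    desc = re.sub(r"\( n\.name ignore 0 [^)]*\)",
                  "( n.name ignore 0 %s )" % plist, desc)
    return desc

eth_phones = "A AI CH E D G H I JH K M N O P R S T U Y Z".split()
```

Check the rewritten file by eye afterwards; the script only handles the two entries named above.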
Now we can actually build the synthesizer
./bin/do_build build_clunits

If all went well there should be a voice unit catalogue in
ls -l ./festival/clunits/cmu_eth_awb.catalogue
You can load and use your voice as
$ESTDIR/../festival/bin/festival festvox/cmu_eth_awb_clunits.scm
....
festival> (voice_cmu_eth_awb_clunits)
...
festival> (SayText "Tadaima, ni ji han gurai go go desu")
...
festival> (set! utt1 (SayText "Tadaima, ju ichi ji han gurai go zen desu"))
...
festival> (utt.save.wave utt1 "eth_clock_11_30.wav")
...

There are other optimizations we can consider, especially if you have low entropy in your phonemes; selection may then get more confused (and labelling might be worse). If so, contact awb and ask for help.
You must submit your completed talking clock by email to Dante by midnight on Friday 1st March. You must put "11-823" plus your language name in the email subject.
Include in your email