11-823: Conlanging: Building a Talking Clock

Build a talking clock for your constructed language. The purpose of this exercise is to flesh out enough of your language to define a phonetic inventory, some words (including counting words), and enough syntax to make semantically useful sentences. You also get to build the first speech synthesizer ever in your language.

Defining the Clock Language

You will need

Building a synthesizer to tell the time

We will use the Festival Speech Synthesis System's new language templates to build this; there are still a few things, beyond the information you defined in the first part, that are required. We will also depend on some cross-lingual bootstrapping techniques (unless you already have access to an hour of phonetically balanced recordings and prompts in your new language).

In order to illustrate this process we will use a new constructed language called "Eth", spoken by northern European fishermen over 5000 years ago; that it is completely understandable to modern Japanese speakers is purely coincidental.

The whole packed voice is available for download as eth_clock.tar.gz.

Preliminaries

You must install the FestVox voice building tools. Follow the installation instructions for the CMU Wilderness dataset: clone the repository, check the prerequisites, and run ./bin/do_found make_dependencies. If this works, it will generate a file festvox_env_settings. You need to add that to your .bashrc, or source this file each time you want to build/run a voice.

You will need a computer to run it on: Linux, Mac OSX, or Windows 10 (where you should use the Windows Subsystem for Linux). You will also need to record audio; recording tools are included but they probably won't work, so use Audacity instead. You'll need to record your prompts as 16kHz, mono, Microsoft RIFF format files with a small amount of silence (around 200ms) at the beginning and end of each file.

Setup

Festvox voices are built in a new directory, with all the generated files kept in various subdirectories. All the commands expect you to be in the main voice directory (not a subdirectory) when run, but you will want to look at, play, and edit some of the files in the subdirectories.

First make a new directory where you want to build the voice and change directory into it:

mkdir eth_clock
cd eth_clock
We now want to copy in a number of basic template files, and make the directory structure.
$FESTVOXDIR/src/unitsel/setup_clunits cmu eth awb
The three arguments at the end are: the institution making the voice (use upitt if appropriate); the language name (if you already have an ISO two or three letter code for your language use that, otherwise use something meaningful); and the name of the person who was recorded.
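
For example, a Pitt student building a voice for a language with the (hypothetical) code 'myl', recorded from a (hypothetical) speaker 'kim', would run

$FESTVOXDIR/src/unitsel/setup_clunits upitt myl kim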

Prompt List

You now need to modify the templated language files for your language. First add your prompt list by putting the file in etc/txt.done.data. It must be in the following format, one utterance per line:

( FILEID_0001 "text in your language ..." )
( FILEID_0002 "more text in your language ..." )
...
You can use anything for your fileids (I use "time_0001"), but don't use spaces in the file name, and make the filename length uniform over your prompts. You should start each line with a '(' followed by a space and then the file id, and likewise have a space before the final ')' on the line. There should be no blank lines in the file. My example is eth_clock/etc/txt.done.data

The text within double quotes can be in any encoding, though you are encouraged to use ascii or utf8/unicode. There are ways to add support for the white space characters in your language, but for simplicity you should just use standard ascii spaces between words. If you need to have a double quote or a backslash within this quoted string it must be escaped with a backslash (cf Python, C strings).
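
For example, using the Eth test sentences that appear in the testing section below, entries could look like

( time_0001 "Tadaima, ni ji han gurai go go desu" )
( time_0002 "Tadaima, ju ichi ji han gurai go zen desu" )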

Phone Set

Next you have to define your phoneme set. You have to edit the file 'festvox/cmu_eth_awb_phoneset.scm' to add your phoneme list. That file has a deliberate error in it to identify where you have to change things: at line 43 delete the '(error ...' command (and the following line). Then go to the line

  (    pau   -   -   0   0   0   0   0   0   -   -   -   ) 
and add your phoneme definitions after this silence 'pau' definition. The phonetic features are described (in order) above that line in the file. The features clst, asp and nuk should always be '-'; the others are described there. You might want to look at the 'radio_phones.scm' file in the Festival distribution for inspiration about what features to use (look at $ESTDIR/../festival/lib/radio_phones.scm).

My phones for Eth happen to all have direct equivalents in English (this is not necessary for your language), though I deliberately use upper case to show they are different -- but you don't need to do that. Your phoneme names should be short symbols, starting with alphabetic letters, and not have digits in their last position. You should avoid using cute characters here; it'll just make your life harder than you want it to be. If you stick to the (base 26) upper and lower case letters everything is more likely to work.

Here are my additional phone definitions

( A  - + l 3 3 - 0 0 0 - - ) ;; aa
( AI - + d 3 2 - 0 0 0 - - ) ;; ay
( CH - - 0 0 0 0 a p - - - ) ;; ch
( E  - + l 2 1 - 0 0 0 - - ) ;; eh
( D  - - 0 0 0 0 s a + - - ) ;; d
( G  - - 0 0 0 0 s v + - - ) ;; g
( H  - - 0 0 0 0 f g - - - ) ;; hh
( I  - + l 1 1 - 0 0 0 - - ) ;; iy
( JH - - 0 0 0 0 a p + - - ) ;; jh
( K  - - 0 0 0 0 s v - - - ) ;; k
( M  - - 0 0 0 0 n l + - - ) ;; m
( N  - - 0 0 0 0 n a + - - ) ;; n
( O  - + l 2 3 + 0 0 0 - - ) ;; ow 
( P  - - 0 0 0 0 s l - - - ) ;; p
( R  - - 0 0 0 0 r a + - - ) ;; r
( S  - - 0 0 0 0 f a - - - ) ;; s
( T  - - 0 0 0 0 s a - - - ) ;; t
( U  - + l 1 3 + 0 0 0 - - ) ;; uw
( Y  - - 0 0 0 0 r p + - - ) ;; y
( Z  - - 0 0 0 0 f a + - - ) ;; z
My updated phoneset definition file is in eth_clock/festvox/cmu_eth_awb_phoneset.scm

In comments at the end of each phone definition I add the equivalent (radio) English phoneme name; we will come to that below.
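
As a reading aid, the feature columns follow the same scheme as radio_phones.scm, with clst first and asp/nuk last: clst, vc (vowel/consonant), vlng (vowel length: s short, l long, d diphthong), vheight (1 high to 3 low), vfront (1 front to 3 back), vrnd (lip rounding), ctype (consonant type: s stop, f fricative, a affricate, n nasal, l liquid, r approximant), cplace (place of articulation), cvox (voicing), asp, nuk. So, for example, my 'A' line reads:

;; name clst vc vlng vheight vfront vrnd ctype cplace cvox asp nuk
;;  A    -    +   l     3       3     -    0     0      0    -   -
;; i.e. a long, low, back, unrounded vowel, with all consonant features 0.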

Lexical Definitions

Lexical entries in Festival consist of the lexical head word, a part of speech tag, and a phonetic pronunciation which includes syllabification and a stress/tone marker. There is no default morphological analyser built into Festival, so for this exercise you should list each word individually. There is a non-trivial lexical subsystem for building very large lexicons (with crowdsourcing) and efficient ways to hold these lexicons, including statistical machine learning methods for letter-to-sound rules, but we're not going to use that here.

For example, in Eth the word for 7 is 'nana':

(lex.add.entry '( "nana" nil (((N A) 0) ((N A) 0))))
We can add all the lexical entries (I have 18 words) to the file 'festvox/cmu_eth_awb_lexicon.scm'. We do this after the default punctuation lexical entries at line 69: add your entries immediately before the single ')' on line 70.
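
Each syllable in the pronunciation is a list of phones followed by a stress/tone marker (0 here). As one more sketch, an entry for the Eth word "gurai" from the test sentences below (assuming it syllabifies as gu-rai) would look like

(lex.add.entry '( "gurai" nil (((G U) 0) ((R AI) 0))))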

My updated lexical entries are in eth_clock/festvox/cmu_eth_awb_lexicon.scm

Phonetic Alignment Tricks

In order to build a voice we need to know the start and end times of all the phonemes in the recordings you made. It is possible to do this by hand, but it's tedious (and it would take too long to install the appropriate software that would help you do it). So we are going to use a cross-lingual alignment technique that should work: we generate approximations of your spoken forms in English and use them to help find the phone boundaries in your language. "But my language is X, so you can't do that" -- actually that doesn't matter; even if you have phones that aren't in English we can still do this, assuming you have some notion of vowels and consonants in your language. The fact that we are doing alignment, not recognition, means this almost always works. What you need to do is give a mapping from each of your phonemes to an approximate English equivalent. For example, if you have aspirated K, unaspirated K and retroflex K, you can map all of these to English K. As your recordings are phonetically correct, all we want to do is align an English equivalent so we can find the phone boundaries for your different types of K.

We have done this for many languages. Normally, when we have sufficient recordings (30 minutes of speech or more), we can do the alignment directly from the language itself without any cross-lingual bootstrapping, but for a talking clock's small amount of data the cross-lingual technique should be sufficient.

Again in the file 'eth_clock/festvox/cmu_eth_awb_lexicon.scm' you have to add the mapping, a function to do the mapping, and a call to that mapping function. Until you have built a voice, the system (wrongly) defaults to using the English diphone synthesizer, which is exactly what we want here.

My mapping and mapping function look like this:

(set! eng_map
'(
( A  aa )
( AI ay )
( CH ch )
( E  eh )
( D  d )
( G  g )
( H  hh )
( I  iy )
( JH jh )
( K  k )
( M  m )
( N  n )
( O  ow  )
( P  p )
( R  r )
( S  s )
( T  t )
( U  uw )
( Y  y )
( Z  z )
( pau pau)
))

(define (do_eng_map utt)
  (mapcar
   (lambda (s)
     (set! x (assoc_string (item.name s) eng_map))
     (if x 
         (item.set_feat s "us_diphone" (cadr x))
         (format t "no eng_map for %s\n" (item.name s))
         )
     )
   (utt.relation.items utt 'Segment))
  utt)
I added this immediately after the single ')' line that follows my lexical entries. I also have to add a call to this mapping function as a post-lexical rule, where it maps each lexical phone into one that the (English) diphone synthesizer can use.

In the function cmu_eth_awb::select_lexicon, I changed

  ;; Post lexical rules
  (set! postlex_rules_hooks (list cmu_eth::postlex_rule1))
to
  ;; Post lexical rules
  (set! postlex_rules_hooks (list 
                             do_eng_map
                             cmu_eth::postlex_rule1))
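
Once these files are loaded (the build scripts load them for you, or you can load the voice in festival as in the testing section below), you can sanity-check the mapping table from the festival prompt:

festival> (assoc_string "CH" eng_map)
(CH ch)

Any phone without a mapping will trigger the "no eng_map for ..." warning during prompt building.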

Building the prompts

Now we can start building the voice. First we want to generate the (fake) synthesized waveform files of your prompts, labeled with your phones.

./bin/do_build build_prompts_waves
This should generate waveform files in prompt-wav/*.wav with a bad English rendition of your intended prompts. You should listen to a few to ensure they are correct. Also look at the label files in prompt-lab/*.lab; these identify where the phones end in the prompt-wav/*.wav files. Again, check that the phone lists there are correct.
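
For example, with the Edinburgh Speech Tools in your path (and a fileid like time_0001):

$ESTDIR/bin/na_play prompt-wav/time_0001.wav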

These are not going to be directly used in your talking clock, but will be used to align the phone strings to your acoustic recordings.

Recording your prompts

If everything is set up wonderfully you can record your prompts with the command

./bin/prompt_them etc/txt.done.data
It will display the text, play the fake synthesized prompt, estimate how long it will take you to speak it, and record for that time. However, recording on arbitrary machines seems to be orders of magnitude harder than synthesizing natural sounding speech, so this might not work. Therefore use some other recording tool; I recommend Audacity. Record each prompt in a separate file, and export it, with a small amount of silence at the beginning and end (around 250ms), to a file with the right name. You should export it as 16kHz, mono, RIFF format (sometimes called .wav). Note there are multiple incompatible RIFF formats that might make your life harder.

If you successfully use prompt_them, your files will be in the right format in the right directory; if you used Audacity, put your exported waveform files in the 'recording/' directory. Then use the command

./bin/get_wavs recording/*.wav
This should put a power-normalized version, in the right format, in the 'wav/' directory. Play some of these to ensure it worked.
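
You can also verify the format with ch_wave from the Edinburgh Speech Tools (the fileid here is just my naming convention):

$ESTDIR/bin/ch_wave -info wav/time_0001.wav

It should report a sample rate of 16000 and one channel.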

Aligning the phones

We will now find the phone boundaries in your recordings with a technique called DTW (Dynamic Time Warping). We acoustically align your recordings, in your language, to the fake English synthesized ones. Because we know where the phone boundaries are in the fake English synthesized ones, we can map those boundaries to actual boundaries in your recordings.
./bin/make_labs prompt-wav/*.wav
This should produce phone label files in lab/*.lab. You should check these to ensure they look plausible. When alignment fails, often all of the phones are labeled at the end of a file instead of being spaced neatly through the file.
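
The label files are plain text in the Edinburgh Speech Tools label format, so you can just look at them: after the '#' header line, each line gives a phone's end time in seconds, a field value, and the phone name (the times below are made up for illustration):

#
0.270 125 pau
0.345 125 T
0.430 125 A
...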

Next we have to take the phone labels the DTW algorithm found and integrate them into an utterance structure that identifies the words, syllables, and phones, from which we can extract information for our statistical voice build.

./bin/do_build build_utts

Speech Parameterization

In order to build a unit selection speech synthesizer we have to analyse the speech. We first find the pitch periods, i.e. when your glottis opened and closed while you spoke. Then we find the spectral properties of the speech at each pitch period, caused by the vocal tract shape.

./bin/do_build do_pm
./bin/do_build do_mcep
There are many options to optimize these, but for short databases the defaults should do.

Building the Cluster Unit Selection Synthesizer

There is one more modification we have to make to the default setup for our language. One of the files that describes the features used in building the clusters has to know about your phone set: you need to replace the list of English phones with a list of your phones. Edit the file festival/clunits/all.desc, replacing the lines

( p.name
    0 "aa" "ae" "ah" "ao" "aw" "ax" "axr" "ay" "b" "brth" "ch" "d" "dh"
    "dx" "eh" "el" "em" "en" "er" "ey" "f" "g" "h#" "hh" "hv" "ih" "iy"
    "jh" "k" "l" "m" "n" "ng" "nx" "ow" "oy" "p" "pau" "r" "s" "sh" "t"
    "th" "uh" "uw" "v" "w" "y" "z" "zh"
)
and
( n.name ignore
    0 "aa" "ae" "ah" "ao" "aw" "ax" "axr" "ay" "b" "brth" "ch" "d" "dh"
    "dx" "eh" "el" "em" "en" "er" "ey" "f" "g" "h#" "hh" "hv" "ih" "iy"
    "jh" "k" "l" "m" "n" "ng" "nx" "ow" "oy" "p" "pau" "r" "s" "sh" "t"
    "th" "uh" "uw" "v" "w" "y" "z" "zh"
)
with
( p.name
0
A  
AI 
CH 
E  
D  
G  
H  
I  
JH 
K  
M  
N  
O  
P  
R  
S  
T  
U  
Y  
Z  
pau
)
and
( n.name ignore
0
A  
AI 
CH 
E  
D  
G  
H  
I  
JH 
K  
M  
N  
O  
P  
R  
S  
T  
U  
Y  
Z  
pau
)
respectively.

Now we can actually build the synthesizer

./bin/do_build build_clunits
If all went well there should be a voice unit catalogue in
ls -l ./festival/clunits/cmu_eth_awb.catalogue

Testing your voice

You can load and use your voice as follows:

$ESTDIR/../festival/bin/festival festvox/cmu_eth_awb_clunits.scm
....
festival> (voice_cmu_eth_awb_clunits)
...
festival> (SayText "Tadaima, ni ji han gurai go go desu")
...
festival> (set! utt1 (SayText "Tadaima, ju ichi ji han gurai go zen desu"))
...
festival> (utt.save.wave utt1 "eth_clock_11_30.wav")
...
There are other optimizations we can consider, especially if you have low entropy in your phonemes; the unit selection might then get more confused (and the labelling might be worse). If so, contact awb and ask for help.

Homework submission

You must submit your completed talking clock by midnight on Friday 1st March, by email to Dante. You must put "11-823" plus your language name in the email subject.

Include in your email

You can optionally write code that will take the actual current time and generate the text string in your language and get it synthesized.
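
For example, here is a minimal sketch of such code in Festival's Scheme, assuming the sentence pattern of the Eth test sentences above (eth_time_text is a hypothetical helper, and the hour-word table is only partly filled in):

(define (eth_time_text hour minute)
  ;; hour is 0-23, minute is 0-59.
  ;; Extend this table with all twelve Eth counting words from your lexicon.
  (set! hour_words '((2 "ni") (7 "nana") (11 "ju ichi")))
  ;; "go zen" before noon, "go go" after, as in the test sentences.
  (set! ampm (if (< hour 12) "go zen" "go go"))
  (set! hword (cadr (assoc (if (> hour 12) (- hour 12) hour) hour_words)))
  ;; Approximate to the half hour ("han gurai"), as the test sentences do.
  (if (> minute 15)
      (format nil "Tadaima, %s ji han gurai %s desu" hword ampm)
      (format nil "Tadaima, %s ji %s desu" hword ampm)))

(SayText (eth_time_text 11 30))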

Keep your voice directory around, you may need it later in the course.