Neural Text To Speech Synthesis
Text-to-speech (TTS) systems generate a speech recording for a given input text. TTS systems have been around for a few decades, and until recently, production-level TTS systems predominantly used statistical models such as the GMM-HMM. With the advancement of deep learning, researchers have developed neural text-to-speech systems whose synthesized recordings are perceptually indistinguishable from human speech. In this blog, we will briefly discuss neural TTS systems, how to make them speak expressively, and how to make them speak in multiple voices.
Multispeaker TTS
One of the key approaches to multispeaker TTS is visualized in the figure above. The speaker encoder block (in green) is the only difference with respect to the single-speaker TTS discussed in the previous section. For each input recording, the speaker encoder produces a fixed-length vector, called the speaker embedding, that is unique to the speaker in the input recording. The rest of the blocks are unchanged from the single-speaker TTS, i.e., we have the Synthesizer, which comprises a sequence-to-sequence model with attention, followed by the vocoder. For multispeaker TTS, the fixed-length speaker embedding is concatenated with the output of the Synthesizer-encoder before the attention layer.
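The concatenation step can be sketched in a few lines of NumPy. All dimensions here are illustrative, not the ones used in any particular paper:

```python
import numpy as np

# Hypothetical shapes: 60 phoneme time steps, 512-dim encoder outputs,
# and a 256-dim speaker embedding (all dimensions are illustrative).
phoneme_states = np.random.randn(60, 512)   # Synthesizer-encoder output
speaker_embedding = np.random.randn(256)    # fixed-length, one per speaker

# Broadcast the single speaker embedding across every encoder time step,
# then concatenate along the feature axis before the attention layer.
tiled = np.tile(speaker_embedding, (phoneme_states.shape[0], 1))
attention_input = np.concatenate([phoneme_states, tiled], axis=-1)

print(attention_input.shape)  # (60, 768)
```

Because the speaker embedding is fixed-length, the same vector is simply repeated at every time step before attention.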
Some popular multispeaker TTS datasets are VCTK and LibriTTS. These datasets contain recordings of varying durations from multiple speakers. Unlike single-speaker TTS, where text is the input and audio is the target, multispeaker TTS uses both the audio and its transcript as inputs during training. The transcript flows through the usual path, i.e., the encoder of the Synthesizer, whereas the audio is processed by the speaker encoder block to obtain the speaker embedding, which is then concatenated with the phonetic embedding from the Synthesizer-encoder.
Trained in the above manner, our multispeaker TTS model will only be able to synthesize the voices of speakers present in the training data. To build a multispeaker TTS that can speak in the voices of speakers not seen during training, Google proposed using a pre-trained speaker verification model as the speaker encoder. The weights of this speaker encoder are kept frozen while training the multispeaker TTS model. For the model to generate the voices of unseen speakers successfully, the speaker encoder has to be trained on many more speakers than are present in the multispeaker TTS datasets: the more speakers the speaker encoder has seen during training, the better the voice-cloning performance of the multispeaker TTS model.
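The "frozen speaker encoder" idea boils down to excluding its weights from the gradient update. A minimal framework-free sketch (parameter names and values are made up for illustration):

```python
# Sketch: only the synthesizer/vocoder weights are updated; the pre-trained
# speaker encoder stays frozen. Names and numbers are illustrative.
params = {
    "speaker_encoder.w": 1.0,   # pre-trained, frozen
    "synthesizer.w": 0.5,       # trainable
    "vocoder.w": -0.3,          # trainable
}
grads = {name: 0.1 for name in params}  # pretend gradients from one step
lr = 0.01

for name in params:
    if name.startswith("speaker_encoder"):
        continue  # frozen: skip the gradient update for this block
    params[name] -= lr * grads[name]

print(params["speaker_encoder.w"])  # unchanged: 1.0
```

In a real framework this is usually done by disabling gradients for the speaker encoder's parameters, or by simply not passing them to the optimizer.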
Expressive TTS

Expressiveness can mean multiple things - intonation, stress, rhythm, emotion, style of speech - collectively referred to as prosody. Another interpretation: "Prosody is the variation in speech signals that remains after accounting for variation due to phonetics, speaker identity, and channel effects (i.e., the recording environment)." In other words, spoken audio comprises three key components - phonetics (what is being spoken), speaker identity, and prosody (how it was spoken). If we can train a model with these three components as inputs, we should be able to build an expressive TTS that can also speak in multiple voices. However, collecting the annotations required for the prosody input can be challenging. Should we collect annotations at the recording level, and if so, what should the labels be? Alternatively, should we annotate prosodic labels as a time series? There have been multiple efforts to model prosody; Mellotron is one example that uses explicit pitch-track and rhythm labels for training.
Collecting prosodic labels can be challenging, and Google proposed to overcome this by learning them in a completely unsupervised manner. Their approach is visualized in the above figure; there are three encoders: 1) the Transcript encoder takes a phonetic sequence as input and produces the transcript/phonetic embedding; 2) the Embedding lookup is the same pre-trained speaker verification model discussed in the Multispeaker TTS section above - it takes the audio recording corresponding to the transcript as input and produces a fixed-length speaker embedding; and 3) the Reference encoder takes the same audio as the Embedding lookup and generates a fixed-length prosody embedding. The two fixed-length embeddings (prosody and speaker) are each concatenated with the transcript embedding before the attention layer. During training of the expressive TTS model, because there are explicit phonetic and speaker inputs, the Reference encoder is pushed to learn information beyond what is phonetic or speaker-related; by the interpretation above, this remaining information is precisely the prosodic component. The output of the Reference encoder is a fixed-length vector, which compresses the dimensionality of the prosodic information. If this dimensionality is too high, the model will simply copy the prosody of the input recording; restricting it pushes the model to summarize the prosodic components at an abstract level and allows it to generalize well.
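The three-way combination before attention can be sketched as below. Note the deliberately small prosody dimensionality acting as the bottleneck; all sizes are illustrative assumptions, not values from the paper:

```python
import numpy as np

T = 80  # phoneme time steps (illustrative)
transcript_states = np.random.randn(T, 512)  # transcript encoder output
speaker_embedding = np.random.randn(256)     # fixed-length, from embedding lookup
prosody_embedding = np.random.randn(128)     # reference encoder; kept small

# Each fixed-length embedding is broadcast over time and concatenated with
# the transcript states before attention. The small prosody dimensionality
# forces the reference encoder to summarize prosody abstractly rather than
# copy the reference audio.
combined = np.concatenate(
    [
        transcript_states,
        np.tile(speaker_embedding, (T, 1)),
        np.tile(prosody_embedding, (T, 1)),
    ],
    axis=-1,
)
print(combined.shape)  # (80, 896)
```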
When the above unsupervised prosodic model is trained on multispeaker, expressive datasets, the reference encoder learns the different expressions in the training data in a completely unsupervised manner (it does not require expression/prosody labels). During inference, you can simply feed a template audio recording, whose expression you want to copy, into the reference encoder. The expressive TTS model will then synthesize the given input text with the prosody of the template audio.
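The inference-time routing - which input goes to which encoder - can be sketched with stub functions. Everything here (function names, embedding values) is a hypothetical placeholder, not a real API:

```python
# Stubs standing in for the three trained components.
def reference_encoder(audio):          # stub: returns a prosody embedding
    return [0.2] * 4

def speaker_encoder(audio):            # stub: returns a speaker embedding
    return [0.7] * 4

def synthesize(text, speaker_emb, prosody_emb):  # stub synthesizer + vocoder
    return {"text": text, "speaker": speaker_emb, "prosody": prosody_emb}

template_audio = [0.0, 0.1, -0.1]      # recording whose expression we copy
target_voice = [0.3, -0.2, 0.5]        # recording of the desired speaker

# Text supplies the phonetics, one recording supplies the voice,
# and a different recording supplies the expression.
out = synthesize(
    "How are you doing today?",
    speaker_emb=speaker_encoder(target_voice),
    prosody_emb=reference_encoder(template_audio),
)
print(out["text"])
```

The point of the sketch is that the voice and the expression can come from two different recordings at inference time.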
An extension of the above work is presented in the style tokens paper, where the authors propose some architectural changes to the network. These changes allow them to control the expression of the synthesis at inference time using either of two options: a) an input template audio, or b) tuning the style tokens.
Evaluation of text-to-speech (TTS) quality
TTS synthesis quality is generally evaluated along two dimensions - naturalness and intelligibility.
Naturalness is rated subjectively. A popular approach is the mean opinion score (MOS), where multiple listeners grade a synthesized sample on a scale of 1 to 5, with 5 indicating human-like synthesis and 1 indicating noisy/robotic synthesis. While evaluating naturalness, annotators are asked to ignore any pronunciation errors and focus solely on the perceptual quality of the synthesis.
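A MOS is just the mean of the listener ratings, typically reported with a 95% confidence interval. A small sketch with made-up ratings:

```python
import math

# 1-5 naturalness ratings from 8 listeners for one sample (made-up values).
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
n = len(ratings)

mos = sum(ratings) / n                                   # mean opinion score
variance = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
ci95 = 1.96 * math.sqrt(variance / n)                    # 95% confidence interval

print(f"MOS = {mos:.2f} +/- {ci95:.2f}")  # MOS = 4.25 +/- 0.49
```

In practice each system is evaluated on many samples and many listeners, and the per-system MOS values are compared.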
Subjective ratings are time-consuming, expensive, and difficult to obtain for every training experiment. As an alternative, researchers trained a deep learning model - MOSNet - on a dataset of TTS-synthesized and human voice recordings along with their corresponding MOS scores, and showed that MOSNet ratings correlate highly with those of human annotators.
Intelligibility is solely about phonetics; it answers the question: did the TTS synthesize the input text completely and correctly? Most often it is computed using the word error rate (WER) or character error rate (CER) from a speech-to-text (ASR) system run on the synthesized audio. The assumption is that the ASR system is accurate enough for its WER/CER to be reliable. However, this may not always be the case, so intelligibility may also have to be measured subjectively. During such a subjective analysis, evaluators are asked to judge the synthesis solely on intelligibility, not on naturalness.
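WER is the word-level edit distance (substitutions + insertions + deletions) between the reference text and the ASR hypothesis, divided by the reference length. A minimal self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

# The ASR dropped one word out of six: WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

CER is computed the same way at the character level.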
Additionally, extensions of TTS systems are evaluated as follows:
Multispeaker TTS: The popular equal error rate (EER) metric from speaker verification is used to determine whether the synthesized voice sounds like the reference speaker. Alternatively, this can be evaluated with a MOS score of 1 to 5, where 5 indicates the synthesis sounds exactly like the reference speaker and 1 indicates it is nowhere close.
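EER is the operating point at which the false-acceptance rate (impostors accepted) equals the false-rejection rate (genuine trials rejected). A small sketch with made-up verification scores (higher score = "same speaker"):

```python
def eer(genuine, impostor):
    """Equal error rate: sweep thresholds, return the rate where FAR ~ FRR."""
    best_gap, best_rate = float("inf"), None
    for thr in sorted(set(genuine + impostor)):
        far = sum(s >= thr for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < thr for s in genuine) / len(genuine)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2
    return best_rate

genuine = [0.9, 0.8, 0.75, 0.6, 0.55]   # same-speaker trial scores (made up)
impostor = [0.5, 0.45, 0.65, 0.3, 0.2]  # different-speaker trial scores (made up)

print(eer(genuine, impostor))  # 0.2
```

For evaluating a multispeaker TTS, the genuine trials would pair synthesized audio with real audio of the target speaker, and a lower EER indicates better voice cloning.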
Expressive (Prosodic) TTS: Expressiveness has mainly been evaluated in a subjective manner using MOS scores.
Speech Synthesis Markup Language (SSML)
Having built a multispeaker and expressive TTS model, how do we control and interact with it? This is where SSML comes into the picture. SSML is a simple HTML-like markup language that allows finer control over the synthesis. Some example usages are given below. The preprocessing block discussed in the neural TTS section above will have to be updated to first parse the SSML and provide the relevant metadata to the TTS models.
How to pronounce? In the example below, using the <phoneme></phoneme> tag, we ask the TTS model to speak "pecan" with different phonemes depending on context.
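A sketch of such a request, following the W3C SSML conventions for the `phoneme` element (exact attribute support varies by vendor):

```xml
<speak>
  You say <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>,
  I say <phoneme alphabet="ipa" ph="ˈpiːkæn">pecan</phoneme>.
</speak>
```

Both tags wrap the same word, but the `ph` attribute forces a different pronunciation each time.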
How to express? In the following example, we control the expression of the text using the <prosody></prosody> tags.
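A sketch using the standard `rate`, `pitch`, and `volume` attributes of the SSML `prosody` element (the attribute values shown are common presets, though vendor support varies):

```xml
<speak>
  <prosody rate="slow" pitch="low">Speak this part slowly, in a low pitch.</prosody>
  <prosody rate="fast" volume="loud">And this part quickly and loudly!</prosody>
</speak>
```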
What voice to use? Finally, in a multispeaker model, we can choose the speaker's voice on the fly by picking one of the template voices using the <voice></voice> tag.
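A sketch of switching voices mid-document with the SSML `voice` element (the voice names below are hypothetical placeholders; each vendor publishes its own list):

```xml
<speak>
  <voice name="en-US-Voice-A">Hello from the first speaker.</voice>
  <voice name="en-GB-Voice-B">And hello from a second speaker.</voice>
</speak>
```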
All popular TTS vendors provide SSML support.