Neural Text To Speech Synthesis

Text-to-speech (TTS) systems generate a speech recording for a given input text. TTS systems have been around for a few decades, and until recently production-level TTS systems predominantly used statistical models such as GMM-HMM. With the advancement of deep learning methods, researchers have developed neural TTS systems whose synthesized recordings can be perceptually indistinguishable from human speech. In this blog post, we briefly discuss neural TTS systems and how to make them speak expressively and in multiple voices.

Neural Text To Speech Synthesis systems

Most neural TTS systems synthesize speech in three sequential steps:

  1. Preprocessing

    • The correct pronunciation of words may not always be obvious. For example, in English the letter c is pronounced differently in the words cat and chat. These differences can be challenging for TTS systems. Given enough examples during training, TTS models have been shown to learn the different pronunciations of letters based on context. However, no language is perfectly regular: there will always be outlier words that do not follow the usual pronunciation rules for a given context. Further, there may be new words that the model has never seen during training and whose pronunciation does not follow the generic rules (for example, new-age product and company names such as gmail). To support such fine control of pronunciation, TTS systems generally use the preprocessing layer to generate an unambiguous phoneme sequence for the input text.

    • One popular text-to-phoneme tool is phonemizer, which supports multiple input languages.

    • Advanced versions of TTS that support prosodic synthesis with different voices can take additional inputs. See the corresponding sections below to learn more.

  2. Generation of acoustic-feature representation from phoneme-sequence input

    • The phoneme-sequence output of the preprocessing block is mapped to an intermediate acoustic-feature representation by the first model. Most often this representation is the mel-spectrogram.

    • Generally, the input phoneme sequence is much shorter than the output mel-spectrogram sequence. To model input and output sequences of such different lengths, most TTS models employ a sequence-to-sequence architecture with attention layers, which is also popular for speech-to-text (ASR) modelling.

    • The sequence-to-sequence methods are auto-regressive, meaning they generate one frame of the output sequence at a time. Hence the synthesis time grows linearly with the length of the utterance. More recent methods, such as FastSpeech2, overcome this by explicitly modelling the duration of each phoneme, learned from the training data. This lets them generate the whole mel-spectrogram in one shot (non-auto-regressive).

    • Some popular models are Tacotron2, Deep Voice, and FastSpeech2. A more exhaustive set of models is supported in the ESPnet framework.

  3. Acoustic feature representation to waveform synthesis (Vocoder)

    • The mel-spectrogram features output by the previous model are mapped to the corresponding waveform signal by the vocoder model. This completes the synthesis process and produces recordings that can be heard.

    • Some of the early vocoders, such as WaveNet, were computationally expensive and slow at synthesis. More recent models, such as Parallel WaveGAN, are much cheaper to run and much faster.

    • Some popular models are WaveNet, WaveRNN, WaveGlow, LPCNet, and Parallel WaveGAN.
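
As a rough illustration of steps 1 and 2, the sketch below looks words up in a small pronunciation lexicon and then expands each phoneme by a per-phoneme duration, in the spirit of the duration-based expansion that non-auto-regressive models such as FastSpeech2 rely on. The lexicon entries, the function names, and the durations are all made up for illustration. A real system would use a trained grapheme-to-phoneme model and a learned duration predictor, and would repeat hidden vectors rather than phoneme symbols.

```python
from typing import Dict, List

# Hypothetical, hand-written lexicon (ARPAbet-style symbols), for illustration only.
LEXICON: Dict[str, List[str]] = {
    "cat": ["K", "AE", "T"],
    "chat": ["CH", "AE", "T"],
    "gmail": ["JH", "IY", "M", "EY", "L"],  # new word with an explicit override
}


def text_to_phonemes(text: str, lexicon: Dict[str, List[str]] = LEXICON) -> List[str]:
    """Map text to a phoneme sequence using lexicon lookups only."""
    phonemes: List[str] = []
    for raw_word in text.lower().split():
        word = raw_word.strip(".,!?")
        if not word:
            continue
        if word in lexicon:
            phonemes.extend(lexicon[word])
        else:
            # Placeholder: a real preprocessor would fall back to a G2P model here.
            phonemes.append(f"<unk:{word}>")
    return phonemes


def length_regulate(phonemes: List[str], durations: List[int]) -> List[str]:
    """Repeat each phoneme for its duration (in frames) to reach the output length."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


if __name__ == "__main__":
    seq = text_to_phonemes("Chat cat")
    print(seq)
    # Durations are invented; in a trained model they would be predicted.
    print(length_regulate(["K", "AE", "T"], [2, 5, 3]))
```

Because the expanded sequence already has the target number of frames, all frames can be produced in parallel instead of one at a time.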


Training any of the above-mentioned TTS models typically needs at least about 5 hours of transcribed recordings from a single speaker. Several such datasets are publicly available. The speech in the training data should be very clean, ideally recorded in a studio.

Multispeaker TTS

One of the key approaches for multispeaker TTS is visualized in the figure above. The speaker encoder block in green is the only difference with respect to the single-speaker TTS discussed in the previous section. For each input recording, the speaker encoder block produces a fixed-length vector, called the speaker embedding, that is unique to the speaker present in the input recording. The rest of the blocks are unchanged from the single-speaker TTS, i.e., we have the Synthesizer, which comprises a sequence-to-sequence model with attention, followed by the vocoder. For multispeaker TTS, the fixed-length speaker embedding is concatenated with the output of the Synthesizer-encoder before the attention layer.

Some popular multispeaker TTS datasets are VCTK and LibriTTS. These datasets have recordings from multiple speakers, with varying durations. In single-speaker TTS training, the text is the input and the audio is the output. In multispeaker TTS training, both the audio and the transcript are used as inputs. The transcript text flows through the usual path, i.e., the Encoder of the Synthesizer, whereas the audio is processed by the speaker encoder block to obtain the speaker embedding, which is then concatenated with the phonetic embedding from the Synthesizer-encoder.
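
The concatenation step described above can be sketched in a few lines. This is a minimal illustration using plain Python lists in place of tensors. The function name and the toy dimensions are assumptions, not taken from any particular implementation.

```python
from typing import List, Sequence


def concat_speaker_embedding(
    encoder_outputs: Sequence[Sequence[float]],
    speaker_embedding: Sequence[float],
) -> List[List[float]]:
    """Append the same fixed-length speaker embedding to every encoder time step.

    encoder_outputs: one vector per input phoneme (time steps x encoder_dim).
    speaker_embedding: a single vector (speaker_dim) for the whole utterance.
    Returns vectors of size encoder_dim + speaker_dim, one per time step.
    """
    return [list(step) + list(speaker_embedding) for step in encoder_outputs]


if __name__ == "__main__":
    # Toy values: 3 phoneme steps with 2-dim encoder outputs, 3-dim speaker embedding.
    phonetic = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    speaker = [1.0, -1.0, 0.5]
    for row in concat_speaker_embedding(phonetic, speaker):
        print(row)
```

The attention layer then sees the speaker identity at every time step, which is what lets a single decoder produce different voices for the same phoneme sequence.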

A multispeaker TTS model trained in the above manner can only synthesize the voices of speakers that are part of the training data. To build a multispeaker TTS that can speak in the voice of speakers outside the training data, Google proposed using a pre-trained speaker verification model as the speaker encoder. The weights of this speaker encoder were kept frozen while training the multispeaker TTS model. For the multispeaker TTS model to generate the voices of unseen speakers successfully, the speaker encoder has to be trained on a much larger number of speakers than are present in the multispeaker TTS datasets. The more speakers the speaker encoder has seen during training, the better the voice cloning performance of the multispeaker TTS model.

Expressive TTS

Expressive can mean multiple things: intonation, stress, rhythm, emotion, and style of speech. These are collectively referred to as prosody. Another interpretation is that "Prosody is the variation in speech signals that remains after accounting for variation due to phonetics, speaker identity, and channel effects (i.e. the recording environment)". In other words, spoken audio comprises three key components: phonetics (what is being spoken), speaker identity (who is speaking), and prosody (how it is spoken). If we can train a model with these three components as inputs, we should be able to build an expressive TTS that can speak in multiple voices. However, collecting the annotations required for the prosody input is challenging. Should we annotate at the recording level and, if so, with which labels? Or should we annotate prosodic labels as a time series? There have been multiple efforts to model prosody. Mellotron is one example that uses explicit pitch-track and rhythm labels for training.

Since collecting prosodic labels is difficult, Google proposed learning the prosodic representation in a completely unsupervised manner. Their approach is visualized in the figure above and uses three encoders: 1) the Transcript encoder takes a phonetic sequence as input and produces the transcript (phonetic) embedding; 2) the Embedding lookup is the same pre-trained speaker verification model discussed in the Multispeaker TTS section above, which takes the audio recording corresponding to the transcript as input and produces a fixed-length speaker embedding; and 3) the Reference encoder takes the same audio as the Embedding lookup and generates a fixed-length prosody embedding. The two fixed-length embeddings (prosody and speaker) are each concatenated with the transcript embedding before the attention layer. During training of the expressive TTS model, the phonetic and speaker information are supplied explicitly, so the Reference encoder is pushed to learn whatever information remains. By the interpretation above, this remaining information is the prosodic component. Because the output of the Reference encoder is a fixed-length vector, the prosodic information is compressed into a limited number of dimensions. If the dimensionality is too high, the model starts copying the prosody of the input recording exactly. Restricting the dimensionality pushes the model to summarize the prosodic components at an abstract level, which allows it to generalize well.

When the above unsupervised prosodic model is trained on multispeaker, expressive datasets, the reference encoder learns the different expressions in the training data in a completely unsupervised manner, without requiring any expression or prosody labels. During inference, you simply pass a template audio recording, from which you want to copy the expression, as the input to the reference encoder. The expressive TTS model then synthesizes the given input text with the prosody of the template audio.

An extension of the above work is presented in the style tokens paper, where the authors propose some architectural changes to the above network. These changes allow the expression to be controlled at inference time in either of two ways: a) by providing a template audio input, or b) by tuning the style tokens directly.
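
One way to picture option b) is to treat the style embedding as a weighted sum of a small bank of learned token vectors, with the weights set by hand instead of being derived from a reference recording. The sketch below is a simplified, single-head illustration of that idea and is not the architecture from the paper. The token values, the scores, and the function names are invented for the example.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(
    tokens: Sequence[Sequence[float]], scores: Sequence[float]
) -> List[float]:
    """Combine style token vectors into one embedding using softmax weights."""
    if len(tokens) != len(scores):
        raise ValueError("need exactly one score per style token")
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


if __name__ == "__main__":
    # Two made-up 2-dim tokens; in a trained model these would be learned.
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    print(style_embedding(tokens, [0.0, 0.0]))   # equal mix of both tokens
    print(style_embedding(tokens, [5.0, 0.0]))   # mostly the first token
```

Raising the score of one token moves the resulting embedding toward that token, which is the sense in which a style can be "tuned" without any reference audio.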

Evaluation of text-to-speech (TTS) quality

TTS synthesis quality is generally evaluated in two dimensions - naturalness and intelligibility.

Naturalness is rated subjectively. A popular approach is the mean opinion score (MOS), where multiple listeners grade a synthesized sample on a scale of 1 to 5, with 5 indicating human-like synthesis and 1 indicating noisy or robotic synthesis. While evaluating naturalness, the annotators are asked to ignore any pronunciation errors and focus solely on the perceptual quality of the synthesis.

    • Subjective ratings are time-consuming, expensive, and difficult to obtain for every training experiment. As an alternative, some researchers trained a deep learning model, MOSNet, on a dataset of TTS-synthesized and human voice recordings with their corresponding MOS scores. They showed that the MOSNet ratings were highly correlated with those of human annotators.
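
Computing a MOS from listener ratings is simple arithmetic. The sketch below returns the mean together with a 95% confidence half-width based on a normal approximation, which is a common but not universal reporting convention. The function name and the example ratings are illustrative.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_opinion_score(ratings: Sequence[int]) -> Tuple[float, float]:
    """Return (MOS, 95% confidence half-width) for ratings on a 1-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a confidence interval")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


if __name__ == "__main__":
    mos, ci = mean_opinion_score([4, 5, 4, 3, 4])
    print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of listeners the interval is wide, which is one reason MOS studies usually collect many ratings per system.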

Intelligibility is solely about phonetics. It answers the question: did the TTS synthesize the input text completely and correctly? Most often this is computed by transcribing the synthesized audio with a speech-to-text (ASR) system and measuring the word error rate (WER) or character error rate (CER) against the input text. This assumes that the ASR system is accurate enough for its WER/CER to be reliable. That is not always the case, so intelligibility may also have to be measured subjectively. During the subjective analysis, the evaluators are asked to judge the synthesis solely on intelligibility, and not on naturalness.
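
For reference, WER is the word-level edit distance between the input text and the ASR transcript of the synthesized audio, divided by the number of words in the input text. A minimal sketch, assuming simple lowercasing and whitespace tokenization (real evaluations usually apply more careful text normalization):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    input_text = "the cat sat on the mat"
    asr_transcript = "the cat sat on mat"  # made-up ASR output of the synthesized audio
    print(f"WER = {word_error_rate(input_text, asr_transcript):.3f}")
```

CER follows the same recipe with characters in place of words.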

Additionally, extensions of TTS systems have their own evaluation criteria:

  • Multispeaker TTS: The equal error rate (EER), a standard speaker verification metric, is used to check whether the synthesized voice sounds like the reference speaker. Alternatively, speaker similarity can be rated with a MOS from 1 to 5, where 5 indicates that the synthesized voice sounds exactly like the reference speaker and 1 indicates that it is nowhere close.

  • Expressive (Prosodic) TTS: Expressiveness has mainly been evaluated subjectively using MOS scores.
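
The EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below estimates it from two lists of similarity scores by sweeping the threshold over the observed scores. How the scores are produced (for example, a similarity between speaker embeddings of synthesized and real recordings) is an assumption here, and the score values are invented.

```python
from typing import Sequence


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Estimate the EER from same-speaker and different-speaker similarity scores."""
    if not genuine or not impostor:
        raise ValueError("need at least one genuine and one impostor score")
    best_gap = None
    eer = 0.0
    for threshold in sorted(set(genuine) | set(impostor)):
        # A trial is accepted when its score is at or above the threshold.
        far = sum(s >= threshold for s in impostor) / len(impostor)
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


if __name__ == "__main__":
    same_speaker = [0.9, 0.8, 0.7, 0.4]
    different_speaker = [0.1, 0.2, 0.3, 0.6]
    print(f"EER = {equal_error_rate(same_speaker, different_speaker):.2f}")
```

A lower EER means the verification system separates same-speaker and different-speaker trials more cleanly. How that number is interpreted for synthesized speech depends on how the trials are set up.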

Speech Synthesis Markup Language (SSML)

Having built a multispeaker and expressive TTS model, how do we control and interact with it? This is where SSML comes into the picture. SSML is an XML-based markup language, similar in appearance to HTML, that allows finer control over the synthesis. Some example usages are given below. To support SSML, the preprocessing block discussed in the neural TTS section above has to be updated to first parse the SSML text and pass the relevant metadata to the TTS models.

  • How to pronounce? In the example below, the <phoneme></phoneme> tag tells the TTS model to pronounce pecan with a different phoneme sequence in each sentence.


You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.

I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.


  • How to express? In the following example, the <prosody></prosody> tags control the volume, speaking rate, and pitch of the enclosed text.


Normal volume for the first sentence.

<prosody volume="x-loud">Louder volume for the second sentence</prosody>.

When I wake up, <prosody rate="x-slow">I speak quite slowly</prosody>.

I can speak with my normal pitch,

<prosody pitch="x-high"> but also with a much higher pitch </prosody>,

and also <prosody pitch="low">with a lower pitch</prosody>.


  • What voice to use? Finally, in a multispeaker model we can choose the voice of the speaker on the fly by picking one of the template voices using the <voice></voice> tag.


I want to tell you a secret.

<voice name="Kendra">I am not a real human.</voice>.

Can you believe it?
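
To give a feel for the SSML parsing step in the preprocessing block, here is a minimal sketch that uses Python's standard xml.etree.ElementTree to split SSML into text segments, each with the attributes of the enclosing tags. It assumes the input is wrapped in a root <speak> element, since an XML parser needs a single root. The function name and the "tag.attribute" key format are choices made for this example, and only the three tags shown above are handled.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Segment = Tuple[str, Dict[str, str]]
CONTROL_TAGS = ("phoneme", "prosody", "voice")


def parse_ssml(ssml: str) -> List[Segment]:
    """Split SSML into (text, metadata) segments in reading order."""
    segments: List[Segment] = []

    def walk(node: ET.Element, inherited: Dict[str, str]) -> None:
        meta = dict(inherited)
        if node.tag in CONTROL_TAGS:
            for key, value in node.attrib.items():
                meta[f"{node.tag}.{key}"] = value
        if node.text and node.text.strip():
            segments.append((node.text.strip(), dict(meta)))
        for child in node:
            walk(child, meta)
            # Text after a closing tag belongs to the parent, not the child.
            if child.tail and child.tail.strip():
                segments.append((child.tail.strip(), dict(meta)))

    walk(ET.fromstring(ssml), {})
    return segments


if __name__ == "__main__":
    example = (
        "<speak>I want to tell you a secret. "
        '<voice name="Kendra">I am not a real human.</voice> '
        "Can you believe it?</speak>"
    )
    for text, meta in parse_ssml(example):
        print(meta, "->", text)
```

Each segment can then be routed to the right part of the model: a voice attribute selects the speaker embedding, prosody attributes adjust the expressive inputs, and phoneme attributes override the default text-to-phoneme conversion.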


Most popular TTS vendors provide SSML support.