
Using Linguistic Features to Improve Prosody for Text-to-Speech

This thesis focuses on the problem of using text-to-speech (TTS) to synthesize speech with natural-sounding prosody. I propose a two-step process for approaching this problem. In the first step, I train text-based models to predict the locations of phrase boundaries and pitch accents in an utterance. Because these models use only text features, they can be used to predict the locations of prosodic events in novel utterances. In the second step, I incorporate these prosodic events into a text-to-speech pipeline in order to produce prosodically appropriate speech.

I trained models for predicting phrase boundaries and pitch accents on utterances from a corpus of radio news data. I found that the strongest models used a wide variety of features, including syntactic features, lexical features, word embeddings, and co-reference features. In particular, a rich set of syntactic features improved performance on both tasks. These models also performed well when tested on a different corpus of news data.
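As a minimal sketch of how such a text-based prosody model consumes its inputs, the snippet below assembles a per-word feature vector combining a syntactic feature (POS one-hot), a lexical/positional flag, a word embedding, and a co-reference ("given information") flag. The tag set, function name, and feature choices here are illustrative assumptions, not the thesis's actual feature inventory, which is far richer.

```python
import numpy as np

# Toy tag set for illustration; the thesis uses a much larger syntactic feature set.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "FUNC"]

def word_features(pos_tag, is_sentence_final, embedding, previously_mentioned):
    """Concatenate per-word features into one vector for a prosody classifier.

    A boundary or accent predictor would take vectors like this (typically
    with context windows over neighboring words) as input.
    """
    pos_onehot = np.zeros(len(POS_TAGS))
    pos_onehot[POS_TAGS.index(pos_tag)] = 1.0        # syntactic feature
    lexical = np.array([float(is_sentence_final)])   # lexical/positional feature
    coref = np.array([float(previously_mentioned)])  # co-reference ("given") feature
    return np.concatenate([pos_onehot, lexical, np.asarray(embedding, float), coref])
```

A standard classifier (e.g. a feed-forward network or logistic regression) can then be trained on these vectors against gold-standard boundary and accent labels.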

I then trained similar models on two conversational corpora: one a corpus of task-oriented dialogs and one a corpus of open-ended conversations. I again found that I could train strong models by using a wide variety of linguistic features, although performance dropped slightly in cross-corpus applications, and performance was very poor in cross-genre applications. For conversational speech, syntactic features continued to be helpful for both tasks. Additionally, word embedding features were particularly helpful in the conversational domain. Interestingly, while it is generally believed that given information (i.e., terms that have recently been referenced) is often de-accented, for all three corpora, I found that including co-reference features only slightly improved the pitch accent detection model.

I then trained a TTS system on the same radio news corpus using Merlin, an open source DNN-based toolkit for TTS. As Merlin includes a linguistic feature extraction step before training, I added two additional features: one for phrase boundaries (distinguishing between sentence boundaries and mid-sentence phrase boundaries) and one for pitch accents. The locations of all breaks and accents for all test and training data were determined using the text-based prosody prediction models. I found that the pipeline using these new features produced speech that slightly outperformed the baseline on objective metrics such as mel-cepstral distortion (MCD) and was greatly preferred by listeners in a subjective listening test.
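For reference, mel-cepstral distortion compares aligned mel-cepstral coefficient (MCEP) frames from synthesized and natural speech; lower is better. The sketch below implements the standard formula, (10 / ln 10) · √(2 Σ Δc²) averaged over frames, assuming the two sequences are already time-aligned. This is the conventional metric, not code from the thesis.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """Mean mel-cepstral distortion (dB) between two aligned MCEP sequences.

    Both inputs are (frames x coefficients) arrays. The 0th coefficient
    (overall energy) is conventionally excluded from the comparison.
    """
    ref = np.asarray(mcep_ref, float)[:, 1:]  # drop c0 (energy)
    syn = np.asarray(mcep_syn, float)[:, 1:]
    diff = ref - syn
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

In practice the synthesized and reference frames are first aligned with dynamic time warping before the distortion is averaged.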

Finally, I trained an end-to-end TTS system on data that included phrase boundaries. The model was trained on a corpus of read speech, with the locations of phrase boundaries predicted based on acoustic features, and tested on radio news stories, with phrase boundaries predicted using the text-based model. I found that including phrase boundaries lowered MCD between the synthesized speech and the original radio broadcast, as compared to the baseline, but the results of a listening test were inconclusive.
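One simple way to expose predicted phrase boundaries to an end-to-end TTS model is to splice a special token into the input text at each predicted boundary, so the model learns to associate it with a pause. The helper below is a minimal sketch of that idea; the token name and interface are hypothetical, not taken from the thesis.

```python
def insert_break_tokens(words, break_after, token="<brk>"):
    """Interleave a phrase-boundary token after the flagged word indices.

    `break_after` is a set of word indices where the (text-based or
    acoustic) model predicted a phrase boundary.
    """
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i in break_after:
            out.append(token)
    return " ".join(out)
```

At training time the boundary positions would come from acoustic features of the read-speech corpus; at synthesis time, from the text-based prediction model, matching the setup described above.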

Identifier: oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/tzqw-j106
Date: January 2023
Creators: Sloan, Rose
Source Sets: Columbia University
Language: English
Detected Language: English
Type: Theses
