11 |
Tone realisation for speech synthesis of Yorùbá / Daniel Rudolph van Niekerk
Van Niekerk, Daniel Rudolph. January 2014 (has links)
Speech technologies such as text-to-speech synthesis (TTS) and automatic speech recognition (ASR) have recently generated much interest in the developed world as a user-interface medium for smartphones [1, 2]. However, it is also recognised that these technologies may have a positive impact on the lives of those in the developing world, especially in Africa, by providing an important medium of access to information where illiteracy and a lack of infrastructure play a limiting role [3, 4, 5, 6]. While these technologies continue to make important advances that extend their applicability to new and under-resourced languages, one area in particular need of further development is speech synthesis of African tone languages [7, 8]. The main objective of this work is the acoustic modelling and synthesis of tone for an African tone language: Yorùbá. We present an empirical investigation to establish the acoustic properties of tone in Yorùbá and to evaluate the resulting models integrated into a hidden Markov model-based (HMM-based) TTS system. We show that in Yorùbá, which is considered a register tone language, the realisation of tone is determined not solely by pitch levels but also by inter-syllable and intra-syllable pitch dynamics. Furthermore, our experimental results indicate that utterance-wide pitch patterns are not only the result of cumulative local pitch changes (terracing) but also contain a significant gradual declination component. Lastly, models based on inter- and intra-syllable pitch dynamics using underlying linear pitch targets are shown to be relatively efficient and perceptually preferable to the current standard approach in statistical parametric speech synthesis, which employs HMM pitch models based on context-dependent phones. These findings support the applicability of the proposed models in under-resourced conditions. / PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014
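The pitch-target idea summarised above can be sketched numerically. The sketch below is illustrative only (the parameter values, the exponential approach rate and the declination slope are all assumptions, not the thesis model): each syllable carries an underlying linear pitch target, the surface F0 approaches that target from the previous syllable's offset (inter-syllable dynamics), and a gradual declination term tilts the whole utterance.

```python
import numpy as np

def synth_f0(syllables, fs=100, decl=-8.0, rate=30.0):
    # Each syllable: (intercept Hz, slope Hz/s, duration s) of its
    # underlying linear pitch target. All parameter values are assumed
    # for illustration, not taken from the thesis.
    contour = []
    t0, prev = 0.0, None
    for b, m, dur in syllables:
        t = np.arange(int(dur * fs)) / fs
        target = b + m * t + decl * (t0 + t)     # target tilted by declination
        if prev is None:
            prev = target[0]
        # Surface F0 approaches the target exponentially, starting from
        # the previous syllable's offset value (pitch continuity).
        surface = target + (prev - target[0]) * np.exp(-rate * t)
        contour.append(surface)
        prev = surface[-1]
        t0 += dur
    return np.concatenate(contour)

# High-Low-High tone sequence over 0.6 s at a 100 Hz frame rate
f0 = synth_f0([(140, 0, 0.2), (100, -50, 0.2), (140, 0, 0.2)])
```

With the declination term set to a negative slope, the final High lands lower than the first even though both have the same underlying target, which is the distinction the abstract draws between terracing and gradual declination.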
|
13 |
Prosody modelling for a Sesotho text-to-speech system using the Fujisaki model
Mohasi, Lehlohonolo. 03 1900 (has links)
Thesis (PhD)--Stellenbosch University, 2015. / ENGLISH ABSTRACT:
Please refer to full text for abstract.
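Since the abstract is deferred to the full text, a brief sketch of the Fujisaki model itself may be useful context. In the standard formulation, the log-F0 contour is a base value plus phrase components (impulse responses of a critically damped second-order system) and accent components (clipped step responses). The command timings and amplitudes below are invented for illustration; the parameter defaults are typical textbook values, not the thesis's.

```python
import numpy as np

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=2.0, beta=20.0, gamma=0.9):
    # ln F0(t) = ln Fb + sum Ap*Gp(t-T0) + sum Aa*[Ga(t-T1) - Ga(t-T2)]
    def Gp(x):  # phrase control: critically damped impulse response
        x = np.maximum(x, 0.0)
        return alpha**2 * x * np.exp(-alpha * x)
    def Ga(x):  # accent control: step response with ceiling gamma
        y = np.where(x > 0, 1.0 - (1.0 + beta * x) * np.exp(-beta * np.maximum(x, 0.0)), 0.0)
        return np.minimum(y, gamma)
    lnf0 = np.full_like(t, np.log(fb))
    for ap, t0 in phrase_cmds:
        lnf0 += ap * Gp(t - t0)
    for aa, t1, t2 in accent_cmds:
        lnf0 += aa * (Ga(t - t1) - Ga(t - t2))
    return np.exp(lnf0)

t = np.linspace(0, 2, 200)
f0 = fujisaki_f0(t, fb=90.0, phrase_cmds=[(0.4, 0.0)],
                 accent_cmds=[(0.3, 0.3, 0.7), (0.25, 1.1, 1.5)])
```

The phrase command produces the slowly decaying utterance-level contour, while each accent command contributes a local rise-and-fall, which is how the model separates global and local prosody.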
|
14 |
Mathematical modelling of some aspects of stressing a Lithuanian text / Kai kurių lietuvių kalbos teksto kirčiavimo aspektų matematinis modeliavimas
Anbinderis, Tomas. 02 July 2010 (has links)
The present dissertation deals with one of the components of a speech synthesiser, the automatic stressing of text, and with two related tasks: the disambiguation of homographs (words that can be stressed in several ways) and the search for clitics (unstressed words).
A method that uses decision trees to find sequences of letters which unambiguously define a word's stressing was applied to stress Lithuanian text. The decision trees were built from a large corpus of stressed words, and stressing rules were formulated based on letter sequences at the beginning, end and middle of a word. The proposed algorithm reaches an accuracy of about 95.5%.
The proposed homograph disambiguation algorithm is based on the usage frequencies of lexemes and morphological tags, obtained from a corpus of about one million words; such methods had not previously been applied to Lithuanian. The algorithm selects the correct stressing variant with an accuracy of 85.01%.
In addition, the author proposes four types of methods for finding clitics in Lithuanian text: methods based on recognising combinational forms, on the statistical stressed/unstressed frequency of a word, on grammar rules, and on the stress distribution of adjacent words (rhythm). It is explained how to unite all the methods into a single algorithm. On the test data, errors amounted to 4.1% of all words, and the ratio of errors to unstressed words was 18.8%. The dissertation also shows that morphological-tag frequencies are more important than lexeme frequencies for homograph disambiguation.
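The letter-sequence idea can be illustrated with a toy sketch. The mini-lexicon and stress encodings below are invented purely to show the mechanism (they are not real Lithuanian accentuation), and the dissertation induces its rules with decision trees over a large stressed corpus rather than this naive enumeration: collect word-final letter sequences that predict stress unambiguously across the lexicon, then stress an unseen word by its longest matching suffix.

```python
from collections import defaultdict

def unambiguous_suffixes(lexicon):
    # Group every word-final letter sequence by the stress values it
    # co-occurs with; keep only sequences that predict a single value.
    by_suffix = defaultdict(set)
    for word, stress in lexicon.items():
        for k in range(1, len(word) + 1):
            by_suffix[word[-k:]].add(stress)
    return {s: next(iter(p)) for s, p in by_suffix.items() if len(p) == 1}

def predict(word, rules):
    # Longest-match lookup over word-final letter sequences.
    for k in range(len(word), 0, -1):
        if word[-k:] in rules:
            return rules[word[-k:]]
    return None

# Hypothetical lexicon: word -> stressed syllable index (toy values).
lex = {"namas": 1, "vaikas": 1, "diena": 2, "ranka": 2}
rules = unambiguous_suffixes(lex)
```

Here every word ending in "s" maps to stress 1 and every word ending in "a" to stress 2, so an unseen word like "knyga" falls back to the one-letter rule; the same back-off logic extends to word-initial and word-medial sequences.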
|
16 |
Advanced natural language processing for improved prosody in text-to-speech synthesis / G. I. Schlünz
Schlünz, Georg Isaac. January 2014 (has links)
Text-to-speech synthesis enables the speech-impeded user of an augmentative and alternative communication system to partake in any conversation on any topic, because it can produce dynamic content. Current synthetic voices, however, do not sound very natural, lacking in the areas of emphasis and emotion. These qualities are important to convey meaning and intent beyond what can be achieved by the vocabulary of words alone. Put differently, speech synthesis requires a more comprehensive analysis of its text input, beyond the word level, to infer the meaning and intent that elicit emphasis and emotion. The synthesised speech then needs to imitate the effects that these textual factors have on the acoustics of human speech. This research addresses these challenges by commencing with a literature study on the state of the art in the fields of natural language processing, text-to-speech synthesis and speech prosody. It is noted that the higher linguistic levels of discourse, information structure and affect are necessary for the text analysis to shape the prosody appropriately for more natural synthesised speech. Discourse and information structure account for meaning, intent and emphasis, and affect formalises the modelling of emotion. The OCC model is shown to be a suitable point of departure for a new model of affect that can leverage the higher linguistic levels. The audiobook is presented as a text and speech resource for the modelling of discourse, information structure and affect because its narrative structure is prosodically richer than the random constitution of a traditional text-to-speech corpus. A set of audiobooks is selected and phonetically aligned for subsequent investigation. The new model of discourse, information structure and affect, called e-motif, is developed to take advantage of the audiobook text.
It is a subjective model that does not specify any particular belief system for appraising its emotions, defining only anonymous affect states. Its cognitive and social features rely heavily on coreference resolution of the text, but this process is found not to be accurate enough to produce usable feature values. The research concludes with an experimental investigation of the influence of the e-motif features on human speech and synthesised speech. The aligned audiobook speech is inspected for prosodic correlates of the cognitive and social features, revealing that some activity occurs in the intonational domain. However, when the aligned audiobook speech is used to train a synthetic voice, the e-motif effects are overshadowed by those of the structural features that come standard in the voice-building framework. / PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014
|
17 |
Unsupervised learning for text-to-speech synthesis
Watts, Oliver Samuel. January 2013 (has links)
This thesis introduces a general method for incorporating the distributional analysis of textual and linguistic objects into text-to-speech (TTS) conversion systems. Conventional TTS conversion uses intermediate layers of representation to bridge the gap between text and speech. Collecting the annotated data needed to produce these intermediate layers is a far from trivial task, possibly prohibitively so for languages in which no such resources are in existence. Distributional analysis, in contrast, proceeds in an unsupervised manner, and so enables the creation of systems using textual data that are not annotated. The method therefore aids the building of systems for languages in which conventional linguistic resources are scarce, but is not restricted to these languages. The distributional analysis proposed here places the textual objects analysed in a continuous-valued space, rather than specifying a hard categorisation of those objects. This space is then partitioned during the training of acoustic models for synthesis, so that the models generalise over objects' surface forms in a way that is acoustically relevant. The method is applied to three levels of textual analysis: to the characterisation of sub-syllabic units, word units and utterances. Entire systems for three languages (English, Finnish and Romanian) are built with no reliance on manually labelled data or language-specific expertise. Results of a subjective evaluation are presented.
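The distributional idea described above can be sketched at the lowest of the three levels, the characterisation of letters. The sketch below is illustrative only (the co-occurrence window, the log weighting and the SVD projection are generic choices, not necessarily the thesis's): units are characterised by their neighbour co-occurrence counts and projected into a small continuous space, with no hand-labelled categories anywhere in the pipeline.

```python
import numpy as np

def letter_embeddings(corpus, dim=2):
    # Count neighbouring-letter co-occurrences over unannotated text,
    # then place each letter in a low-dimensional continuous space
    # via SVD of the (log-damped) count matrix.
    letters = sorted(set("".join(corpus)))
    idx = {c: i for i, c in enumerate(letters)}
    counts = np.zeros((len(letters), len(letters)))
    for word in corpus:
        for a, b in zip(word, word[1:]):
            counts[idx[a], idx[b]] += 1
            counts[idx[b], idx[a]] += 1
    u, s, _ = np.linalg.svd(np.log1p(counts))
    return letters, u[:, :dim] * s[:dim]

letters, emb = letter_embeddings(["hello", "world", "held"])
```

During acoustic model training, decision-tree questions can then split on these continuous dimensions (for example, "is dimension 0 below some threshold?") instead of on hard categories such as vowel versus consonant, which is the sense in which the space is partitioned in an acoustically relevant way.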
|
18 |
Generating Facial Animation With Emotions In A Neural Text-To-Speech Pipeline
Igeland, Viktor. January 2019 (has links)
This thesis presents the work of incorporating facial animation with emotions into a neural text-to-speech pipeline. The project aims to allow a digital human to utter sentences given only text, removing the need for video input. Our solution consists of a neural network, placed in a neural text-to-speech pipeline, that generates blend-shape weights from speech. We build on ideas from previous work and implement a recurrent neural network using four LSTM layers, later extending this implementation to incorporate emotions. The emotions are learned by the network itself via the emotion layer and used at inference time to produce the desired emotion. While using LSTMs for speech-driven facial animation is not a new idea, it had not yet been combined with emotional states that are learned by the network itself. Previous approaches are either only two-dimensional, of complicated design, or require manual labelling of the emotional states. We therefore implement a network of simple design that takes advantage of the sequence-processing ability of LSTMs and combines it with the idea of emotional states. We trained several variations of the network on data captured using a head-mounted camera, and the results of the best-performing model were used in a subjective evaluation. During the evaluation the participants were presented several videos and asked to rate the naturalness of the face uttering the sentence. The results showed that the naturalness of the face greatly depends on which emotion vector was used, as some vectors limited the mobility of the face. However, our best-performing emotion vector was rated at the same level of naturalness as the ground truth, proving our method successful. The purpose of the thesis was fulfilled, as our implementation demonstrates one possibility of incorporating facial animation into a text-to-speech pipeline.
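The wiring of such a model can be sketched minimally. Everything below is an assumption for illustration (a single hand-rolled LSTM cell with random weights, invented dimensions, a simple concatenation of the emotion vector onto every frame); the thesis's actual network has four LSTM layers and trains its emotion vectors jointly with the rest of the weights.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    # One LSTM cell step; gate pre-activations stacked as [i, f, g, o].
    z = W @ x + U @ h + b
    n = h.size
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    i, f, g, o = z[:n], z[n:2 * n], z[2 * n:3 * n], z[3 * n:]
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return h, c

def animate(audio_feats, emotion, params, n_hidden=8):
    # Condition every audio frame on the emotion vector, run the LSTM
    # over the sequence, project hidden states to blend-shape weights.
    W, U, b, Wout = params
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    out = []
    for frame in audio_feats:
        x = np.concatenate([frame, emotion])  # emotion enters at every step
        h, c = lstm_step(x, h, c, W, U, b)
        out.append(Wout @ h)                  # per-frame blend-shape weights
    return np.stack(out)

# Invented dimensions: 13 audio features, a 4-d emotion vector,
# 8 hidden units, 51 blend shapes, 20 frames.
rng = np.random.default_rng(0)
n, d_audio, d_emo, n_blend = 8, 13, 4, 51
params = (0.1 * rng.normal(size=(4 * n, d_audio + d_emo)),
          0.1 * rng.normal(size=(4 * n, n)),
          np.zeros(4 * n),
          0.1 * rng.normal(size=(n_blend, n)))
weights = animate(rng.normal(size=(20, d_audio)), rng.normal(size=d_emo), params)
```

At inference time, swapping in a different learned emotion vector while keeping the audio features fixed changes the generated blend-shape trajectories, which is how a single network can produce the same sentence with different emotions.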
|
19 |
Understanding Synthetic Speech and Language Processing of Students with and without a Reading Disability
Cunningham, Todd. 06 January 2012 (has links)
To help circumvent the decoding difficulties associated with a reading disability (RD), text-to-speech (TTS) software can be used to present written language audibly. Although TTS software is currently used to help students with RD, there is a lack of empirically supported literature to inform developers and users of TTS software about best practices. This dissertation investigated two methods to determine whether they increase the effectiveness of TTS for students with RD and typically-developing students. The first method compared low- and high-quality TTS voices with regard to understanding. TTS voice quality was established by having 40 university students listen to and rate the quality of 10 commonly used TTS voices and 2 human voices. Three voices were chosen for the subsequent study based on the ratings: one low-quality TTS voice, one high-quality TTS voice, and one natural voice (Microsoft Mary, AT&T Crystal, and Susan, respectively). Understanding was assessed with tests of intelligibility and comprehensibility. Forty-five grade 6 to 8 students identified as having an RD were compared with same-age typically-developing peers. Results showed that the high-quality TTS and natural voices were more intelligible than the low-quality TTS voice, and that the high-quality TTS voice yielded higher comprehensibility scores than the low-quality TTS and natural voices.
The second method investigated whether a student's comprehension when using TTS can be increased by modifying the presentation style of the TTS voice. Presentation style was manipulated in two ways: by varying the speed at which the TTS presented the materials (120, 150, or 180 words per minute) and by varying the presence of pauses (no pauses, randomly inserted pauses, or 500-millisecond pauses at the ends of noun phrases). Due to a floor effect on comprehension of the texts, the expected results were not obtained. A follow-up analysis compared the participants' prosodic sensitivity skills according to whether they had a specific language impairment (SLI), a reading impairment (RI), or were typically-developing. Results suggested that students with SLI have significantly less auditory working memory than those with RI, which impacts their auditory processing. Recommendations for future research and for the use of TTS based on different learning profiles are provided.
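The two presentation-style manipulations can be sketched as markup generation. The sketch assumes SSML as the delivery mechanism, which the dissertation does not specify (and note that standard SSML expresses `rate` as a label or percentage, so a words-per-minute value in that attribute is engine-specific); the noun-phrase boundaries are assumed to come from an upstream parser.

```python
def to_ssml(chunks, wpm=150, pause_ms=500):
    # chunks: list of (text, ends_noun_phrase) pairs. Insert a fixed
    # pause after each chunk that closes a noun phrase, and set the
    # overall speaking rate.
    parts = []
    for text, ends_np in chunks:
        parts.append(text)
        if ends_np:
            parts.append(f'<break time="{pause_ms}ms"/>')
    return f'<prosody rate="{wpm}">' + " ".join(parts) + "</prosody>"

ssml = to_ssml([("The quick brown fox", True),
                ("jumps over", False),
                ("the lazy dog", True)], wpm=120)
```

Crossing the three rates with the three pause conditions reproduces the nine presentation styles of the study's design; only the noun-phrase condition places pauses at linguistically motivated boundaries.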
|