1 |
Mathematical modelling of some aspects of stressing a Lithuanian text / Kai kurių lietuvių kalbos teksto kirčiavimo aspektų matematinis modeliavimas
Anbinderis, Tomas. 02 July 2010
This dissertation deals with one component of a speech synthesizer, the automatic stressing of text, and with two related tasks: disambiguating homographs (words that are spelled identically but can be stressed in several ways) and finding clitics (unstressed words that attach to an adjacent word).
Lithuanian text was stressed with a method that uses decision trees to find letter sequences that unambiguously determine a word's stress. The decision trees were built from a large corpus of stressed words, and stressing rules were formulated from letter sequences at the beginning, in the middle and at the end of a word. The proposed algorithm reaches an accuracy of about 95.5%.
The homograph disambiguation algorithm is based on the usage frequencies of lexemes and morphological features, obtained from a corpus of about one million words. Such methods had not previously been applied to Lithuanian. The work shows that morphological feature frequencies matter more than lexeme frequencies. The proposed algorithm selects the correct stress variant with an accuracy of 85.01%.
In addition, the author proposes four types of methods for finding clitics in Lithuanian text, based on (1) recognition of combinational forms, (2) the statistical frequency with which a word is stressed or unstressed, (3) certain grammar rules and (4) the distribution of stress in adjacent words (rhythm). It is explained how to combine all the methods into a single algorithm. On the test data, the ratio of errors to all words was 4.1%, and the ratio of errors to unstressed words was 18.8%.
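The frequency-based homograph disambiguation described above can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: the function name `pick_stress_variant`, the tag and lexeme labels, and all counts are hypothetical and invented for illustration. The only idea taken from the abstract is that candidates are ranked by corpus frequencies, with morphological feature frequencies weighted ahead of lexeme frequencies.

```python
from collections import Counter


def pick_stress_variant(candidates, tag_counts, lexeme_counts):
    """Return the stressed form whose morphological tag is most frequent.

    candidates: list of (stressed_form, lexeme, morph_tag) tuples.
    Ties on tag frequency are broken by lexeme frequency.
    Unseen tags or lexemes count as zero (Counter default).
    """
    best = max(
        candidates,
        key=lambda c: (tag_counts[c[2]], lexeme_counts[c[1]]),
    )
    return best[0]


# Toy counts, invented for illustration only (not taken from any corpus).
tag_counts = Counter({"noun.sg.nom": 120, "verb.3.pres": 45})
lexeme_counts = Counter({"lexeme_a": 10, "lexeme_b": 30})

# Two hypothetical stress variants of the same written word.
candidates = [
    ("form-1", "lexeme_a", "noun.sg.nom"),
    ("form-2", "lexeme_b", "verb.3.pres"),
]
chosen = pick_stress_variant(candidates, tag_counts, lexeme_counts)
print(chosen)
```

Ranking by a tuple keeps the priority explicit: the lexeme count is consulted only when two candidates share the same tag frequency. A real system would presumably combine the two statistics in a more graded way.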
|
3 |
Advanced natural language processing for improved prosody in text-to-speech synthesis / G. I. Schlünz
Schlünz, Georg Isaac. January 2014
Text-to-speech synthesis enables the speech-impeded user of an augmentative and alternative communication system to partake in any conversation on any topic, because it can produce dynamic content. Current synthetic voices do not sound very natural, however, lacking in the areas of emphasis and emotion. These qualities are furthermore important to convey meaning and intent beyond that which can be achieved by the vocabulary of words only. Put differently, speech synthesis requires a more comprehensive analysis of its text input beyond the word level to infer the meaning and intent that elicit emphasis and emotion. The synthesised speech then needs to imitate the effects that these textual factors have on the acoustics of human speech. This research addresses these challenges by commencing with a literature study on the state of the art in the fields of natural language processing, text-to-speech synthesis and speech prosody. It is noted that the higher linguistic levels of discourse, information structure and affect are necessary for the text analysis to shape the prosody appropriately for more natural synthesised speech. Discourse and information structure account for meaning, intent and emphasis, and affect formalises the modelling of emotion. The OCC model is shown to be a suitable point of departure for a new model of affect that can leverage the higher linguistic levels. The audiobook is presented as a text and speech resource for the modelling of discourse, information structure and affect because its narrative structure is prosodically richer than the random constitution of a traditional text-to-speech corpus. A set of audiobooks are selected and phonetically aligned for subsequent investigation. The new model of discourse, information structure and affect, called e-motif, is developed to take advantage of the audiobook text. 
It is a subjective model that does not specify any particular belief system in order to appraise its emotions, but defines only anonymous affect states. Its cognitive and social features rely heavily on the coreference resolution of the text, but this process is found not to be accurate enough to produce usable feature values. The research concludes with an experimental investigation of the influence of the e-motif features on human speech and synthesised speech. The aligned audiobook speech is inspected for prosodic correlates of the cognitive and social features, revealing that some activity occurs in the intonational domain. However, when the aligned audiobook speech is used in the training of a synthetic voice, the e-motif effects are overshadowed by those of structural features that come standard in the voice building framework. / PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014
|
5 |
The effects of part-of-speech tagging on text-to-speech synthesis for resource-scarce languages / G.I. Schlünz
Schlünz, Georg Isaac. January 2010
In the world of human language technology, resource-scarce languages (RSLs) suffer from the problem of little available electronic data and linguistic expertise. The Lwazi project in South Africa is a large-scale endeavour to collect and apply such resources for all eleven of the official South African languages. One of the deliverables of the project is more natural text-to-speech (TTS) voices. Naturalness is primarily determined by prosody, and it is shown that many aspects of prosodic modelling are, in turn, dependent on part-of-speech (POS) information. Solving the POS problem is, therefore, a prudent first step towards meeting the goal of natural TTS voices.
In a resource-scarce environment, obtaining and applying the POS information are not trivial. Firstly, an automatic tagger is required to tag the text to be synthesised with POS categories, but state-of-the-art POS taggers are data-driven and thus require large amounts of labelled training data. Secondly, the subsequent processes in TTS that are used to apply the POS information towards prosodic modelling are resource-intensive themselves: some require non-trivial linguistic knowledge; others require labelled data as well.
The first problem raises the question of which available POS tagging algorithm will be the most accurate on little training data. This research sets out to answer the question by reviewing the most popular supervised data-driven algorithms. Since the literature to date consists mostly of isolated papers discussing one algorithm, the aim of the review is to consolidate the research into a single point of reference. A subsequent experimental investigation compares the tagging algorithms on small training data sets of English and Afrikaans, and it is shown that the hidden Markov model (HMM) tagger outperforms the rest when using both a comprehensive and a reduced POS tagset.
Regarding the second problem, the question arises whether it is possible to circumvent the traditional approaches to prosodic modelling by learning the latter directly from the speech data using POS information. In other words, does the addition of POS features to the HTS context labels improve the naturalness of a TTS voice? Towards answering this question, HTS voices are trained from prosodically rich English and Afrikaans speech. The voices are compared with and without POS features incorporated into the HTS context labels, analytically and perceptually. For the analytical experiments, measures of prosody to quantify the comparisons are explored. It is then also noted whether the results of the perceptual experiments correlate with their analytical counterparts. It is found that, when a minimal feature set is used for the HTS context labels, the addition of POS tags does improve the naturalness of the voice. However, the same effect can be accomplished by including segmental counting and positional information instead of the POS tags. / Thesis (M.Sc. Engineering Sciences (Electrical and Electronic Engineering))--North-West University, Potchefstroom Campus, 2011.
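To make the HMM tagger mentioned above concrete, here is a minimal Python sketch of Viterbi decoding for a bigram HMM. It is an illustration under stated assumptions, not the thesis's tagger: the tagset, the probability tables and the `floor` smoothing constant are all hypothetical toy values, and a real tagger would estimate them from labelled training data.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely tag sequence for a non-empty word list under a bigram HMM.

    Probabilities missing from a table fall back to `floor`, a crude
    stand-in for proper smoothing. Scores are kept in log space.
    """
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = (log-probability of the best path ending in tag t, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p, (p, t)), p) for p in tags
            )
            column[t] = (score + lp(emit_p, (t, word)), prev)
        best.append(column)

    # Trace back from the best final tag to the start of the sentence.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model, invented for illustration only.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.2}
emit_p = {
    ("DET", "the"): 0.9,
    ("NOUN", "dog"): 0.5,
    ("NOUN", "barks"): 0.1,
    ("VERB", "barks"): 0.6,
}

tagged = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
print(tagged)
```

With little training data, most of the work lies in how the unseen transitions and emissions are smoothed; the fixed `floor` here only keeps the sketch short.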
|
7 |
Statistical parametric speech synthesis using conversational data and phenomena
Dall, Rasmus. January 2017
Statistical parametric text-to-speech synthesis currently relies on predefined and highly controlled prompts read in a “neutral” voice. This thesis presents work on using recordings of free conversation for filled pause synthesis and as an inspiration for improved general modelling of speech for text-to-speech purposes. A corpus of both standard prompts and free conversation is presented, and the potential usefulness of conversational speech as the basis for text-to-speech voices is validated. Additionally, psycholinguistic experiments show that filled pauses can have subconscious benefits for the listener, but that current text-to-speech voices cannot replicate these effects. A method for pronunciation variant forced alignment is presented in order to obtain more accurate automatic speech segmentation, which is particularly poor for spontaneously produced speech. This alignment is used not only to create a more accurate underlying acoustic model, but also to drive more natural pronunciation prediction at synthesis time. While this improves both the standard and spontaneous voices, the naturalness of voices based on spontaneous speech still lags behind that of voices based on standard read prompts. Thus, the synthesis of filled pauses is investigated through specific phonetic modelling of filled pauses and through techniques for mixing standard prompts with spontaneous utterances, so as to retain the higher quality of voices based on standard speech while still using the spontaneous speech for filled pause modelling. A method for predicting where to insert filled pauses in the speech stream is also developed, relying on an analysis of human filled pause usage and a mix of language modelling methods. The method achieves an insertion accuracy in close agreement with human usage.
The various approaches are evaluated and their improvements documented throughout the thesis. Finally, the resulting filled pause quality is assessed through a repetition of the psycholinguistic experiments and an evaluation of all developed methods combined.
|
8 |
Évaluation expérimentale d'un système statistique de synthèse de la parole, HTS, pour la langue française / Experimental evaluation of a statistical speech synthesis system, HTS, for French
Le Maguer, Sébastien. 05 July 2013
The work presented in this thesis concerns text-to-speech synthesis and, more specifically, statistical parametric speech synthesis for French. We analyse how the linguistic descriptors (contextual factors) used to characterise a speech signal affect the modelling performed by the HTS statistical synthesis system. To this end, two objective evaluation methodologies are presented. The first models the acoustic space generated by HTS with Gaussian mixture models (GMMs). Using a constant reference set of natural speech signals, the GMMs can then be compared with one another, and with them the acoustic spaces generated by the different HTS configurations. The second methodology computes distances between paired acoustic frames of natural and synthetic speech, so that the modelling performed by HTS can be evaluated more locally; it complements the other analyses, notably by controlling the sets of data generated and evaluated. The results obtained with both methodologies, and confirmed by subjective evaluations, indicate that using a complex set of linguistic descriptors does not necessarily lead to better modelling and can be counter-productive for the quality of the synthesised signal.
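The second, frame-level evaluation can be sketched in a few lines of Python. This is only an illustration: the abstract does not say which distance measure or acoustic features the thesis uses, so the Euclidean distance, the function name `mean_frame_distance` and the two-dimensional toy frames are all assumptions. The frames are taken to be already paired in time.

```python
import math


def mean_frame_distance(natural, synthetic):
    """Average Euclidean distance between already-paired acoustic frames."""
    if not natural or len(natural) != len(synthetic):
        raise ValueError("frame sequences must be non-empty and equally long")
    total = sum(math.dist(a, b) for a, b in zip(natural, synthetic))
    return total / len(natural)


# Toy two-dimensional frames, invented for illustration only.
natural = [(0.0, 0.0), (1.0, 1.0)]
config_a = [(3.0, 4.0), (1.0, 1.0)]   # output of one hypothetical configuration
config_b = [(0.0, 0.0), (1.0, 1.0)]   # output of another

score_a = mean_frame_distance(natural, config_a)
score_b = mean_frame_distance(natural, config_b)
print(score_a, score_b)
```

A lower score means the synthetic frames lie closer to the natural reference, which gives a local way to compare configurations trained with different descriptor sets.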
|
9 |
The punctuation and intonation of parentheticals
Bodenbender, Christel. 17 May 2010
From a historical perspective, punctuation marks are often assumed to represent only some of the phonetic structure of the spoken form of a text. It has been argued recently that punctuation today is a linguistic system that represents not only some of the phonetic sentence structure but also syntactic and semantic information. One case in point is the observation that the semantic difference between differently punctuated parenthetical phrases is not reflected in the intonation contour. This study provides the acoustic evidence for this observation. Furthermore, it makes recommendations for achieving natural-sounding text-to-speech output for English parentheticals by incorporating the study's findings on parenthetical intonation.
The experiment conducted for this study involved three male and three female native speakers of Canadian English reading aloud a set of 20 sentences with parenthetical and non-parenthetical phrases. These sentences were analyzed with respect to acoustic characteristics due to differences in punctuation as well as due to differences between parenthetical and non-parenthetical phrases.
A number of conclusions were drawn based on the results of the experiment: (1) a difference in punctuation, although entailing a semantic difference, is not reflected in the intonation pattern; (2) in contrast to the general understanding that parenthetical phrases are lower-leveled and narrower in pitch range than the surrounding sentence, this study shows that it is not the parenthetical phrase itself that is implemented differently from its non-parenthetical counterpart; rather, the phrase that precedes the parenthetical exhibits a lower baseline and with that a wider pitch range than the corresponding phrase in a non-parenthetical sentence; (3) sentences with two adjacent parenthetical phrases or one embedded in the other exhibit the same pattern for the parenthetical-preceding phrase as the sentences in (2) above and a narrowed pitch range for the parenthetical phrases that are not in the final position of the sequence of parentheticals; (4) no pausing pattern could be found; (5) the characteristics found for parenthetical phrases can be implemented in synthesized speech through the use of SABLE speech markup as part of the SABLE speech synthesis system.
This is the first time that the connection between punctuation and intonation in parenthetical sentences has been investigated; it is also the first look at sentences with more than one parenthetical phrase. This study contributes to our understanding of the intonation of parenthetical phrases in English and their implementation in text-to-speech systems, by providing an analysis of their acoustic characteristics.
|
10 |
The development of accented English synthetic voices
Malatji, Promise Tshepiso. January 2019
Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2019 / A text-to-speech (TTS) synthesis system is software that receives text as input and produces speech as output. Such a system can be used by native and non-native speakers of the target language for, among other things, language learning and reading text aloud for people living with disabilities such as physical or visual impairments. Most people relate easily to a second language spoken by a non-native speaker with whom they share a native language, yet most online English TTS systems are developed using native speakers of English. This study focuses on developing accented English synthetic voices as spoken by non-native speakers in the Limpopo province of South Africa. The Modular Architecture for Research on speech sYnthesis (MARY) TTS engine is used to develop the synthetic voices, which are trained with the hidden Markov model (HMM) method. The training speech corpus is built by recording six speakers reading a secondary text corpus. The quality of the synthetic voices is measured in terms of intelligibility, similarity and naturalness using a listening test. Results are reported overall and broken down by the evaluators' occupation and gender. The subjective listening test indicates that the voices have a high level of acceptance in terms of similarity and intelligibility. Speech analysis software is used to compare the synthesised speech with the human recordings: there is no significant difference in voice pitch between the speakers and the synthetic voices, except for one synthetic voice.
|