51

The development of accented English synthetic voices

Malatji, Promise Tshepiso January 2019 (has links)
Thesis (M. Sc. (Computer Science)) -- University of Limpopo, 2019 / A text-to-speech (TTS) synthesis system is a software system that receives text as input and produces speech as output. A TTS synthesis system can be used for, amongst others, language learning and reading out text for people living with different disabilities (e.g., physically challenged or visually impaired users), by native and non-native speakers of the target language. Most people relate easily to a second language spoken by a non-native speaker with whom they share a native language, yet most online English TTS synthesis systems are developed using native speakers of English. This research study focuses on developing accented English synthetic voices as spoken by non-native speakers in the Limpopo province of South Africa. The Modular Architecture for Research on speech sYnthesis (MARY) TTS engine is used to develop the synthetic voices, and the hidden Markov model (HMM) method is used to train them. A secondary text corpus is used to build the training speech corpus by recording six speakers reading the text. The quality of the developed synthetic voices is measured in terms of intelligibility, similarity and naturalness using a listening test. Results are reported overall and classified by evaluators' occupation and gender. The subjective listening test indicates that the developed synthetic voices have a high level of acceptance in terms of similarity and intelligibility. Speech analysis software is used to compare the synthesised speech with the human recordings; there is no significant difference in voice pitch between the speakers and the synthetic voices, except for one synthetic voice.
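The pitch comparison mentioned at the end of this abstract can be sketched in a few lines. This is illustrative only, not the thesis's actual procedure: it assumes the librosa library and uses hypothetical file names.

```python
# Minimal sketch (not from the thesis): comparing the mean pitch (F0) of a
# human recording against the corresponding synthetic utterance.
# Assumes librosa is installed; file names are hypothetical placeholders.
import librosa
import numpy as np

def mean_f0(path):
    """Estimate the mean fundamental frequency (Hz) of a speech file."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    return np.nanmean(f0[voiced])  # average over voiced frames only

human = mean_f0("speaker1_recording.wav")   # hypothetical path
synth = mean_f0("speaker1_synthetic.wav")   # hypothetical path
print(f"human F0: {human:.1f} Hz, synthetic F0: {synth:.1f} Hz")
```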
52

Efektivní neuronová syntéza řeči / Efficient neural speech synthesis

Vainer, Jan January 2020 (has links)
While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesis, there has not been a system capable of fast training, fast inference and high-quality audio synthesis at the same time. In this thesis, we present a neural speech synthesis system capable of high-quality, faster-than-real-time spectrogram synthesis, with low computational-resource requirements and fast training time. Our system consists of a teacher and a student network. The teacher model is used to extract the alignment between the text to synthesize and the corresponding spectrogram. The student uses the alignments from the teacher model to efficiently synthesize mel-scale spectrograms from a phonemic representation of the input text. Both networks consist of simple convolutional layers. We train both on the English LJSpeech dataset. The quality of samples synthesized by our model was rated significantly higher than that of baseline models. Our model can be efficiently trained on a single GPU and can run in real time even on a CPU.
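The student network described above maps phonemes to mel-spectrogram frames with plain convolutions, using durations derived from the teacher's alignments. The sketch below illustrates that idea in PyTorch; layer sizes, kernel widths and the duration-based upsampling are assumptions, not the thesis's actual architecture.

```python
# Illustrative convolutional phoneme-to-mel-spectrogram model in the spirit
# of the student network described above; all hyperparameters are placeholders.
import torch
import torch.nn as nn

class ConvStudent(nn.Module):
    def __init__(self, n_phonemes=50, emb_dim=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, emb_dim)
        self.convs = nn.Sequential(
            nn.Conv1d(emb_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, phoneme_ids, durations):
        # phoneme_ids: (1, T_text); durations: frames per phoneme, e.g.
        # extracted from the teacher's alignments (batch of 1 for simplicity).
        x = self.embed(phoneme_ids)                          # (1, T_text, emb_dim)
        x = torch.repeat_interleave(x, durations[0], dim=1)  # expand to frames
        return self.convs(x.transpose(1, 2))                 # (1, n_mels, T_frames)

model = ConvStudent()
ids = torch.randint(0, 50, (1, 12))
durs = torch.randint(1, 8, (1, 12))
mel = model(ids, durs)  # predicted mel-scale spectrogram
```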
53

Vícejazyčná syntéza řeči / Multilingual speech synthesis

Nekvinda, Tomáš January 2020 (has links)
This work explores multilingual speech synthesis. We compare three models based on Tacotron that utilize various levels of parameter sharing. Two of them follow recent multilingual text-to-speech systems: the first makes use of a fully shared encoder and an adversarial classifier that removes speaker-dependent information from the encoder; the other uses language-specific encoders. We introduce a new approach that combines the best of both previous methods: it enables effective parameter sharing using a meta-learning technique, preserves the encoder's flexibility, and actively removes speaker-specific information in the encoder. We compare the three models on two tasks. The first aims at joint multilingual training on ten languages and reveals the models' knowledge-sharing abilities. The second concerns code-switching. We show that our model effectively shares information across languages and, according to a subjective evaluation test, produces more natural and accurate code-switching speech.
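A standard way to implement the adversarial classifier that strips speaker information from a shared encoder is a gradient reversal layer. The sketch below shows that general technique; it is not code from the thesis.

```python
# Gradient reversal layer (GRL): the classifier is trained to predict the
# speaker, while reversed gradients push the encoder to discard speaker cues.
# Illustrative sketch of the general technique, not the thesis implementation.
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip (and scale) the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: speaker_logits = speaker_classifier(grad_reverse(encoder_outputs))
```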
54

Parametarska sinteza ekspresivnog govora / Parametric synthesis of expressive speech

Suzić Siniša 12 July 2019 (has links)
In this thesis, methods for expressive speech synthesis using parametric approaches are presented. It is shown that better results are achieved with deep neural networks than with synthesis based on hidden Markov models. Three new methods for synthesizing expressive speech with deep neural networks are presented: style codes, model re-training, and a shared-hidden-layer architecture. It is shown that the best results are achieved using the style-code method. A new method for emotion/style transplantation based on the shared-hidden-layer architecture is also proposed; it is shown to outperform the reference method from the literature.
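To make the style-code idea concrete: a style identity is appended to the linguistic input features of the acoustic network, so one model can render several expressive styles. The sketch below is illustrative; the sizes and the one-hot encoding are placeholders, not the dissertation's configuration.

```python
# Illustrative style-code conditioning: a one-hot style vector is concatenated
# with the linguistic features fed to a feed-forward acoustic model.
import torch
import torch.nn as nn

n_styles, ling_dim, acoustic_dim = 4, 300, 187  # placeholder sizes

model = nn.Sequential(
    nn.Linear(ling_dim + n_styles, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, acoustic_dim),
)

linguistic = torch.randn(8, ling_dim)  # a batch of frame-level features
style = nn.functional.one_hot(
    torch.full((8,), 2), num_classes=n_styles  # style index 2, e.g. "happy"
).float()
acoustic = model(torch.cat([linguistic, style], dim=1))
```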
55

Investigating reading comprehension in Reading While Listening and the relevancy of The Voice Effect / Undersökning av läsförståelse och rösteffekten inom samtidig lyssning och läsning

Hedenström, Edvin, Barck-Holst, Axel January 2023 (has links)
Various forms of multimedia learning have been shown to aid learners time and time again. One form of multimedia learning that has not been thoroughly studied is reading while listening (RWL), especially with regard to its immediate impact on reading comprehension. Furthermore, recent advancements in text-to-speech (TTS) have started to challenge the established notion that recorded human speech is always preferable for learning, a notion known as the Voice Effect. This study looked at Swedish university students with English as their second language (L2) and examined their L2 reading comprehension across three groups: Reading Only (RO), Reading While Listening with recorded spoken word (RWL-SW), and Reading While Listening with text-to-speech (RWL-TTS). The RO group was compared to the two RWL groups, and the two RWL groups were also compared on test scores as well as on participants' reported enjoyment of, and perceived aid from, the narration. Our results did not exhibit any statistically significant difference in reading comprehension between the RO group and the RWL groups; in fact, the RO and RWL-TTS groups got exactly the same number of correct answers on the comprehension test. This suggests that RWL did not have any notable impact on reading comprehension. Furthermore, no statistically significant difference was found between the two RWL groups in test scores or in perceived enjoyment and aid from the narration. Notably, RWL-SW performed slightly worse than RWL-TTS on the comprehension test, and the reported enjoyment and perceived aid were very similar across the two groups. This suggests that the Voice Effect was not relevant in this test.
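The between-group comparisons reported above boil down to significance tests on comprehension scores. A minimal sketch with scipy follows; it assumes an independent-samples t-test (the study may have used a different test), and the scores are invented placeholders.

```python
# Sketch of a between-group comparison of comprehension scores.
# Assumes an independent-samples t-test; scores are invented placeholders.
from scipy import stats

ro_scores      = [7, 8, 6, 9, 7, 8]   # Reading Only
rwl_tts_scores = [8, 7, 7, 9, 6, 8]   # Reading While Listening with TTS

t, p = stats.ttest_ind(ro_scores, rwl_tts_scores)
print(f"t = {t:.2f}, p = {p:.3f}")    # p >= 0.05 -> no significant difference
```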
56

Evaluation of how text-to-speech can be adapted for the specific purpose of being an AI psychologist

Rayat, Pooya, Westergård, Hugo January 2023 (has links)
In this research, our goal was to pinpoint the crucial characteristics that make a voice suitable for an AI psychologist and, more importantly, to explore how text-to-speech (TTS) combined with conditional voice control, also known as "prompting", could be used to incorporate these traits into the voice generation process. This approach allowed us to create synthetic voices that were not just effective but also tailored to the specific needs of the AI psychologist role. We conducted an exploratory survey to identify key traits such as trustworthiness, safety, sympathy, calmness, and firmness. These traits were then used as prompts when generating AI voices with Tortoise, a state-of-the-art text-to-speech system. The generated voices were evaluated through a survey study, resulting in a mean opinion score for each category corresponding to the prompts. Our findings showed that while the AI-generated voices did not quite match the quality of a real human voice, they were still quite effective in capturing the essence of the prompts and producing the desired voice characteristics. This suggests that prompting within TTS, or the strategic design of prompts, can significantly enhance the effectiveness of AI voices. In addition, we explored the potential impact of AI on the labor market, considering factors such as job displacement and creation, changes in salaries, and the need for reskilling. Our study highlights that AI will have a significant impact on the job market, but the exact nature of this impact remains uncertain. Our findings offer valuable insights into the potential of AI in psychology and highlight the importance of tailoring voice synthesis to specific applications. They lay a solid foundation for future research in this area, fostering continued innovation at the intersection of AI, psychology, and economic viability.
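The evaluation produced a mean opinion score (MOS) per trait category. A minimal aggregation sketch follows, with invented ratings in place of the study's data.

```python
# Aggregating survey ratings into a mean opinion score per trait category.
# The ratings below are invented placeholders, not the study's data.
from collections import defaultdict
from statistics import mean

ratings = [  # (trait, 1-5 rating) pairs from survey responses
    ("trustworthiness", 4), ("calmness", 5), ("trustworthiness", 3),
    ("sympathy", 4), ("calmness", 4), ("sympathy", 5),
]

by_trait = defaultdict(list)
for trait, score in ratings:
    by_trait[trait].append(score)

for trait, scores in sorted(by_trait.items()):
    print(f"{trait}: MOS = {mean(scores):.2f}")
```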
57

Method for creating phone duration models using very large, multi-speaker, automatically annotated speech corpus / Garsų trukmių modelių kūrimo metodas, naudojant didelės apimties daugelio kalbėtojų garsyną

Norkevičius, Giedrius 01 February 2011 (has links)
Two heretofore unanalyzed aspects are addressed in this dissertation: 1. Building a model capable of predicting phone durations in Lithuanian. All existing investigations of Lithuanian phone durations were performed by linguists; these investigations are usually exploratory statistics and are limited to analyzing a single factor affecting phone duration. In this work, phone duration dependencies on contextual factors were estimated by means of a machine learning method and written in explicit form (a decision tree). 2. Constructing a language-independent method for creating phone duration models using a very large, multi-speaker, automatically annotated speech corpus. Most researchers worldwide use speech corpora that are relatively small-scale, single-speaker, and manually annotated or at least validated by experts. The usual reasons given are that multi-speaker speech corpora are inappropriate because different speakers have different pronunciation manners and speak at different rates, and that automatically annotated corpora lack accuracy. The created method for phone duration modeling enables the use of such corpora. Its main components are the reduction of noisy data in the speech corpus and the normalization of speaker-specific phone durations using phone type clustering. The performed listening tests of synthesized speech showed that the perceived naturalness is affected by the underlying phone durations; the use of contextual... [to full text]
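The two core components, per-speaker normalization of durations and a decision tree over contextual factors, can be sketched as follows. This is an illustration of the general recipe with toy data, not the dissertation's implementation.

```python
# Illustrative sketch: z-score phone durations per speaker (to factor out
# speaking-rate differences), then fit a decision tree on contextual factors.
# Data are toy placeholders, not the dissertation's corpus.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

speakers  = np.array([0, 0, 1, 1, 0, 1])               # speaker of each phone
features  = np.array([[1, 0], [0, 1], [1, 0],          # contextual factors
                      [0, 0], [1, 1], [0, 1]])
durations = np.array([80.0, 120.0, 95.0, 140.0, 90.0, 110.0])  # ms

norm = np.empty_like(durations)
for s in np.unique(speakers):
    d = durations[speakers == s]
    norm[speakers == s] = (d - d.mean()) / d.std()     # per-speaker z-score

tree = DecisionTreeRegressor(max_depth=3).fit(features, norm)
print(tree.predict([[1, 0]]))  # normalized duration predicted for a context
```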
58

Tradução grafema-fonema para a língua portuguesa baseada em autômatos adaptativos. / Grapheme-phoneme translation for portuguese based on adaptive automata.

Shibata, Danilo Picagli 25 March 2008 (has links)
This work presents a study on the use of adaptive devices for text-to-speech translation, focusing on the development of a grapheme-phoneme translation method for Portuguese based on Adaptive Automata and on the use of this method in text-to-speech translation software. The presented method mimics human behavior when handling syllable separation rules, syllable stress definition, and the influences syllables have on each other. This feature makes the method easy to use with different variations of Portuguese, since these characteristics are invariant with respect to the place and time of the chosen variety. Portuguese as spoken nowadays in São Paulo, Brazil, was chosen as the target for analysis and tests in this work. For this variety, the method yields good results, exceeding a 95% accuracy rate for grapheme-phoneme translation of words, clearing the 90% mark after resolution of ambiguous cases in which different phonetic representations are acceptable for a grapheme, and generating phonetic output intelligible to native speakers through syllable-based concatenation synthesis. As final results of this work, a model is presented for grapheme-phoneme translation of Portuguese words based on Adaptive Automata, along with a methodology for choosing the correct phonetic representation in ambiguous cases, a simulator for Adaptive Automata, and software for grapheme-phoneme translation of texts using both the translation model and the disambiguation methodology. The latter software was unified with the speech synthesizer developed by Koike et al. (2007) to create a text-to-speech translator for Portuguese. This work demonstrates the feasibility of using Adaptive Automata as the main instrument for text-to-speech translation in Portuguese.
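The grapheme-phoneme translation task itself can be illustrated with a toy longest-match rule table. The thesis's Adaptive Automata are far more powerful (they handle stress, syllabification and cross-syllable influence); the sketch below, with a made-up and incomplete rule set, only shows the input/output shape of the problem.

```python
# Toy grapheme-to-phoneme lookup for a few Portuguese graphemes, greedy
# longest match first. Made-up, incomplete rule table for illustration only;
# the thesis's method is based on Adaptive Automata, not a static table.
RULES = {
    "ch": "S", "lh": "L", "nh": "J", "ss": "s", "rr": "R",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "c": "k", "s": "s", "t": "t", "l": "l", "n": "n", "r": "r",
}

def g2p(word):
    phones, i = [], 0
    while i < len(word):
        for size in (2, 1):  # prefer digraphs over single letters
            chunk = word[i:i + size]
            if chunk in RULES:
                phones.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            i += 1  # grapheme not covered by this toy table
    return phones

print(g2p("chinelo"))  # ['S', 'i', 'n', 'e', 'l', 'o']
```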
59

Optimisation du procédé de création de voix en synthèse par sélection / Optimised voice creation for unit-selection synthesis

Cadic, Didier 10 June 2011 (has links)
This work falls within the scope of text-to-speech (TTS) technology; more precisely, it focuses on the voice creation process for unit-selection synthesis. In the standard approach, a textual script of several tens of thousands of words is read by a speaker in order to generate approximately 5 to 10 hours of usable speech. The recording time is spread over one or two weeks and is followed by the considerable task of manually revising the phonetic segmentation of all the speech. Such a costly and time-consuming process is a major obstacle to diversifying synthesized voices, so we propose to streamline it. We introduce a new unit, called a "vocalic sandwich", to optimize coverage in the recording scripts. Phonetically, this unit addresses the segmental limitations of unit-selection TTS better than state-of-the-art units (diphones, triphones, syllables, words...). Linguistically, a new set of contextual symbols focuses the coverage, allowing for more control and consideration of prosody. Practically, automating the segmentation process requires better anticipation of the phonetic and prosodic content desired in the final database; this is achieved here by increasing the readability and consistency of each sentence included in the script. As a side benefit, these properties also facilitate the reading stage. Furthermore, as an alternative to classic corpus condensation, a semi-automatic sentence-building algorithm is developed in this work, wherein sentences are built rather than selected from a reference corpus. Ultimately, sentence building gives access to much denser scripts, allowing for increases in linguistic density of between 30 and 40%. By incorporating these new approaches and tools, the voice creation process is made very efficient, as validated in this work through the preparation and subjective evaluation of numerous synthesized voices. Perceptive scores comparable to the traditional process are achieved with as little as 40 minutes of speech (a half-day recording) and without any manual post-processing. Finally, we take advantage of these results to enhance our synthesized voices with various expressive, multi-expressive and paralinguistic features.
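Script coverage optimization of this kind typically builds on a greedy set-cover loop: repeatedly pick the sentence that covers the most not-yet-covered units. The sketch below shows that baseline loop with placeholder data; the unit definition (e.g. the thesis's vocalic sandwiches) and the sentence-building algorithm are abstracted away.

```python
# Greedy corpus condensation: pick the sentence adding the most uncovered
# units until nothing new is gained. Sentences/units are placeholders; the
# thesis's contributions (vocalic sandwiches, sentence building) sit on top.
def greedy_cover(sentences):
    """sentences: list of (text, set_of_units). Returns the selected texts."""
    selected, covered = [], set()
    remaining = list(sentences)
    while remaining:
        best = max(remaining, key=lambda s: len(s[1] - covered))
        if not best[1] - covered:
            break  # no sentence adds new units
        selected.append(best[0])
        covered |= best[1]
        remaining.remove(best)
    return selected

corpus = [
    ("sentence A", {"u1", "u2", "u3"}),
    ("sentence B", {"u2", "u4"}),
    ("sentence C", {"u4", "u5", "u6"}),
]
print(greedy_cover(corpus))  # ['sentence A', 'sentence C']
```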
60

Grapheme-to-phoneme conversion and its application to transliteration

Jiampojamarn, Sittichai 06 1900 (has links)
Grapheme-to-phoneme conversion (G2P) is the task of converting a word, represented by a sequence of graphemes, to its pronunciation, represented by a sequence of phonemes. The G2P task plays a crucial role in speech synthesis systems, and is an important part of other applications, including spelling correction and speech-to-speech machine translation. G2P conversion is a complex task, for which a number of diverse solutions have been proposed. In general, the problem is challenging because the source string does not unambiguously specify the target representation. In addition, the training data include only example word pairs without the structural information of subword alignments. In this thesis, I introduce several novel approaches for G2P conversion. My contributions can be categorized into (1) new alignment models and (2) new output generation models. With respect to alignment models, I present techniques including many-to-many alignment, phonetic-based alignment, alignment by integer linear programming, and alignment-by-aggregation. Many-to-many alignment is designed to replace the one-to-one alignment that has been used almost exclusively in the past. The new many-to-many alignments are more precise and accurate in expressing grapheme-phoneme relationships. The other proposed alignment approaches attempt to advance the training method beyond the use of Expectation-Maximization (EM). With respect to generation models, I first describe a framework for integrating many-to-many alignments and language models for grapheme classification. I then propose joint processing for G2P using online discriminative training. I integrate a generative joint n-gram model into the discriminative framework. Finally, I apply the proposed G2P systems to name transliteration generation and mining tasks. Experiments show that the proposed system achieves state-of-the-art performance in both the G2P and name transliteration tasks.
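The difference between one-to-one and many-to-many alignment is easy to see in a sketch. A naive edit-distance aligner with placeholder costs, shown below, must leave one grapheme of "sh" unpaired, whereas a many-to-many alignment can map the grapheme pair "sh" to the single phoneme S. The code is illustrative, not from the thesis; real aligners learn their costs (e.g. with EM) rather than fixing them.

```python
# One-to-one grapheme-phoneme alignment via edit distance, the baseline that
# many-to-many alignment improves on. Costs are naive placeholders: pairing
# is free, leaving a symbol unpaired costs 1.
def align(g, p):
    G, P = len(g), len(p)
    dp = [[0] * (P + 1) for _ in range(G + 1)]
    for i in range(G + 1):
        for j in range(P + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j
            else:
                dp[i][j] = min(dp[i-1][j-1],     # pair g[i-1] with p[j-1]
                               dp[i-1][j] + 1,   # silent grapheme
                               dp[i][j-1] + 1)   # phoneme with no grapheme
    pairs, i, j = [], G, P   # trace back to recover the pairing
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1]:
            pairs.append((g[i-1], p[j-1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            pairs.append((g[i-1], "-")); i -= 1
        else:
            pairs.append(("-", p[j-1])); j -= 1
    return pairs[::-1]

# One-to-one cannot express "sh -> S"; graphemes are left unpaired instead:
print(align(list("shoe"), ["S", "u"]))
# -> [('s', '-'), ('h', '-'), ('o', 'S'), ('e', 'u')]
```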
