1 |
Estudo de um sistema de conversão texto-fala baseado em HMM / Study of a HMM-based text-to-speech system / Carvalho, Sarah Negreiros de, 1985- 22 August 2018
Orientador: Fábio Violaro / Dissertação (mestrado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação
Previous issue date: 2013 / Resumo: Com o contínuo desenvolvimento da tecnologia, há uma demanda crescente por sistemas de síntese de fala que sejam capazes de falar como humanos, para integrá-los nas mais diversas aplicações, seja no âmbito da automação robótica, seja para a acessibilidade de pessoas com deficiências, seja em aplicativos destinados à cultura e ao lazer. A síntese de fala baseada em modelos ocultos de Markov (HMM) mostra-se promissora em suprir esta necessidade tecnológica. A sua natureza estatística e paramétrica a torna um sistema flexível, capaz de adaptar vozes artificiais, inserir emoções no discurso e obter fala sintética de boa qualidade usando uma base de treinamento limitada. Esta dissertação apresenta o estudo realizado sobre o sistema de síntese de fala baseado em HMM (HTS), descrevendo as etapas que envolvem o treinamento dos modelos HMM e a geração do sinal de fala. São apresentados os modelos espectrais, de pitch e de duração que constituem os modelos HMM dos fonemas dependentes de contexto, considerando as diversas técnicas de estruturação deles. Alguns dos problemas encontrados no HTS, tais como a característica abafada e monótona da fala artificial, são analisados juntamente com algumas técnicas propostas para aprimorar a qualidade final do sinal de fala sintetizado / Abstract: With the continuous development of technology, there is a growing demand for text-to-speech systems that are able to speak like humans, in order to integrate them into the most diverse applications, whether in the field of automation and robotics, for the accessibility of people with disabilities, or in applications for culture and leisure. Speech synthesis based on hidden Markov models (HMM) shows promise in addressing this need. Its statistical and parametric nature makes it a flexible system, capable of adapting artificial voices, inserting emotions into speech, and producing artificial speech of good quality using a limited amount of speech data for HMM training. 
This thesis presents the study conducted on the HMM-based speech synthesis system (HTS), describing the steps involved in training the HMM models and generating the artificial speech. The spectral, pitch and duration models that make up the context-dependent HMM models are presented, along with the various techniques for structuring them. Some of the problems encountered in HTS, such as the characteristically muffled and monotonous quality of the artificial speech, are analyzed together with some of the techniques proposed to improve the final quality of the synthesized speech signal / Mestrado / Telecomunicações e Telemática / Mestra em Engenharia Elétrica
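The abstract above notes that each context-dependent phoneme HMM carries spectral, pitch (F0) and duration streams. As a loose illustration of how such models yield synthetic parameter tracks, the toy sketch below (not taken from the HTS toolkit; all model names and numbers are invented) repeats per-state Gaussian means for the predicted state durations:

```python
# Toy context-dependent phoneme models: per-state Gaussian means for one
# spectral coefficient ("mcep") and log-F0, plus mean state durations in
# frames.  All numbers are invented for illustration.
PHONE_MODELS = {
    "a": {"dur": [3, 5, 3], "lf0": [4.8, 4.9, 4.8], "mcep": [0.2, 0.5, 0.3]},
    "s": {"dur": [2, 4, 2], "lf0": [0.0, 0.0, 0.0], "mcep": [0.9, 1.1, 0.8]},
}

def generate_trajectories(phones):
    """Generate frame-level tracks by repeating each state mean for the
    mean of its duration Gaussian.  Real HTS additionally uses delta
    features so the trajectory varies smoothly between states."""
    lf0, mcep = [], []
    for p in phones:
        model = PHONE_MODELS[p]
        for state, frames in enumerate(model["dur"]):
            lf0.extend([model["lf0"][state]] * frames)
            mcep.extend([model["mcep"][state]] * frames)
    return lf0, mcep

# "s" occupies 2+4+2 = 8 frames, "a" 3+5+3 = 11 frames.
lf0, mcep = generate_trajectories(["s", "a"])
```

A real system would then drive a vocoder with these tracks; the muffling discussed in the abstract stems partly from this averaging of training statistics.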
|
2 |
Proposta de metodologia de avaliação de voz sintética com ênfase no ambiente educacional / Methodology for evaluation of synthetic speech emphasizing the educational environment / Leite, Harlei Miguel de Arruda, 1989- 06 September 2014
Orientador: Dalton Soares Arantes / Dissertação (mestrado) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação
Previous issue date: 2014 / Resumo: A principal contribuição desta dissertação é a proposta de uma metodologia de avaliação de voz sintetizada. O método consiste em um conjunto de etapas que buscam auxiliar o avaliador nas fases de planejamento, aplicação e análise dos dados coletados. O método foi originalmente desenvolvido para avaliar um conjunto de vozes sintetizadas, a fim de encontrar a voz que melhor se adapta a ambientes de educação a distância que usam avatares. Também foram estudadas as relações entre inteligibilidade, compreensibilidade e naturalidade, a fim de conhecer os fatores a serem considerados para aprimorar os sintetizadores de fala. Esta dissertação também apresenta os principais métodos de avaliação encontrados na literatura e o princípio de funcionamento dos sistemas TTS / Abstract: This thesis proposes, as its main contribution, a new methodology for evaluating synthesized voices. The method consists of a set of steps intended to assist the assessor in the stages of planning, application and analysis of the collected data. The method was originally developed to evaluate a set of synthesized voices in order to find the voice that best fits distance-education environments that use avatars. The relations between intelligibility, comprehensibility and naturalness were also studied in order to identify the factors to be considered when improving speech synthesizers. This thesis also presents the main evaluation methods found in the literature and how TTS (Text-to-Speech) systems work / Mestrado / Telecomunicações e Telemática / Mestre em Engenharia Elétrica
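Listening tests of the kind described here typically aggregate per-listener ratings into a Mean Opinion Score. As a hedged sketch only (the thesis's own statistical treatment may differ), a MOS with a normal-approximation confidence interval can be computed as:

```python
import statistics

def mos_summary(scores):
    """Mean Opinion Score with an approximate 95% confidence interval
    (normal approximation, adequate for typical listening-test group sizes)."""
    n = len(scores)
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / n ** 0.5  # standard error of the mean
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)

# Hypothetical 1-5 naturalness ratings from eight listeners.
ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mean, (lo, hi) = mos_summary(ratings)
```

Reporting the interval alongside the mean makes comparisons between candidate voices (as in the avatar study above) more defensible than the mean alone.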
|
3 |
Modélisation et synthèse de voix chantée à partir de descripteurs visuels extraits d'images échographiques et optiques des articulateurs / Singing voice modeling and synthesis using visual features extracted from ultrasound and optical images of articulators / Jaumard-Hakoun, Aurore, 05 September 2016
Le travail présenté dans cette thèse porte principalement sur le développement de méthodes permettant d'extraire des descripteurs pertinents des images acquises des articulateurs dans les chants rares : les polyphonies traditionnelles corses, sardes, la musique byzantine, ainsi que le Human Beat Box. Nous avons collecté des données, et employons des méthodes d'apprentissage statistique pour les modéliser, notamment les méthodes récentes d'apprentissage profond (Deep Learning). Nous avons étudié dans un premier temps des séquences d'images échographiques de la langue apportant des informations sur l'articulation, mais peu lisibles sans connaissance spécialisée en échographie. Nous avons développé des méthodes pour extraire de façon automatique le contour supérieur de la langue montré par les images échographiques. Nos travaux ont donné des résultats d'extraction du contour de la langue comparables à ceux obtenus dans la littérature, ce qui pourrait permettre des applications en pédagogie du chant. Ensuite, nous avons prédit l'évolution des paramètres du filtre qu'est le conduit vocal depuis des séquences d'images de langue et de lèvres, sur des bases de données constituées de voyelles isolées puis de chants traditionnels corses. L'utilisation des paramètres du filtre du conduit vocal, combinés avec le développement d'un modèle acoustique de source vocale exploitant l'enregistrement électroglottographique, permet de synthétiser des extraits de voix chantée en utilisant les images articulatoires (de la langue et des lèvres) et l'activité glottique, avec des résultats supérieurs à ceux obtenus avec les techniques existant dans la littérature. / This thesis reports newly developed methods which can be applied to extract relevant features from articulator images in rare singing: traditional Corsican and Sardinian polyphonies, Byzantine music, as well as Human Beat Box. 
We collected data and modeled these using machine learning methods, specifically recent deep learning methods. We first modeled tongue ultrasound image sequences, which carry relevant articulatory information that would otherwise be difficult to interpret without specialized skills in ultrasound imaging. We developed methods to automatically extract the upper contour of the tongue displayed in ultrasound images. Our tongue contour extraction results are comparable with those obtained in the literature, which could lead to applications in singing pedagogy. Afterwards, we predicted the evolution of the vocal tract filter parameters from sequences of tongue and lip images, first on isolated-vowel databases and then on traditional Corsican singing. Applying the predicted filter parameters, combined with the development of a vocal source acoustic model exploiting electroglottographic recordings, allowed us to synthesize singing voice excerpts using articulatory images (of tongue and lips) and glottal activity, with results superior to those obtained using existing techniques reported in the literature.
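The learned contour extractors described above have a far simpler baseline that conveys the task: pick, in each image column, the brightest pixel. This naive sketch is purely illustrative (it is not the thesis's method, and real ultrasound speckle defeats it) but shows the input/output shape of the problem:

```python
def extract_contour(image):
    """Naive upper-contour estimate: for each column of a grayscale frame
    (given as a list of rows), return the row index of the brightest
    pixel.  Real ultrasound needs the learned models of the thesis; this
    only illustrates the input/output shape of the task."""
    n_rows = len(image)
    return [
        max(range(n_rows), key=lambda r: image[r][col])
        for col in range(len(image[0]))
    ]

# Tiny synthetic "ultrasound" frame: a bright band on row 1, then row 2.
frame = [
    [0, 0, 0, 0],
    [9, 8, 0, 0],
    [0, 0, 9, 7],
]
contour = extract_contour(frame)  # one row index per column
```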
|
4 |
Synthesis and expressive transformation of singing voice / Synthèse et transformation expressive de la voix chantée / Ardaillon, Luc, 21 November 2017
Le but de cette thèse était de conduire des recherches sur la synthèse et la transformation expressive de voix chantée, en vue de pouvoir développer un synthétiseur de haute qualité capable de générer automatiquement un chant naturel et expressif à partir d'une partition et d'un texte donnés. Trois directions de recherche principales peuvent être identifiées : les méthodes de modélisation du signal afin de générer automatiquement une voix intelligible et naturelle à partir d'un texte donné ; le contrôle de la synthèse, afin de produire une interprétation d'une partition donnée tout en transmettant une certaine expressivité liée à un style de chant spécifique ; la transformation du signal vocal afin de le rendre plus naturel et plus expressif, en faisant varier le timbre en adéquation avec la hauteur, l'intensité et la qualité vocale. Cette thèse apporte diverses contributions dans chacune de ces trois directions. Tout d'abord, un système de synthèse complet a été développé, basé sur la concaténation de diphones. L'architecture modulaire de ce système permet d'intégrer et de comparer différents modèles de signaux. Ensuite, la question du contrôle est abordée, comprenant la génération automatique de la f0, de l'intensité et des durées des phonèmes. La modélisation de styles de chant spécifiques a également été abordée par l'apprentissage des variations expressives des paramètres de contrôle modélisés à partir d'enregistrements commerciaux de chanteurs célèbres. Enfin, des investigations sur des transformations expressives du timbre liées à l'intensité et à la raucité vocale ont été menées, en vue d'une intégration future dans notre synthétiseur. / This thesis aimed at conducting research on the synthesis and expressive transformations of the singing voice, towards the development of a high-quality synthesizer that can generate a natural and expressive singing voice automatically from a given score and lyrics. 
Three main research directions can be identified: the methods for modelling the voice signal to automatically generate an intelligible and natural-sounding voice according to the given lyrics; the control of the synthesis to render an adequate interpretation of a given score while conveying some expressivity related to a specific singing style; and the transformation of the voice signal to improve its naturalness and add expressivity by varying the timbre adequately according to the pitch, intensity and voice quality. This thesis provides contributions in each of those three directions. First, a fully functional synthesis system has been developed, based on diphone concatenation. The modular architecture of this system makes it possible to integrate and compare different signal modeling approaches. Then, the question of control is addressed, encompassing the automatic generation of the f0, intensity, and phoneme durations. The modeling of specific singing styles has also been addressed by learning the expressive variations of the modeled control parameters from commercial recordings of famous French singers. Finally, some investigations of expressive timbre transformations have been conducted, for future integration into our synthesizer. These mainly concern methods related to intensity transformation, considering the effects of both the glottal source and the vocal tract, and the modeling of vocal roughness.
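Diphone concatenation, named above as the basis of the synthesis system, stitches together units that each span the transition between two adjacent phonemes. A minimal sketch of the unit bookkeeping follows (the phoneme symbols, the "-" naming scheme, and the "_" silence padding are illustrative assumptions, not the thesis's inventory):

```python
def to_diphones(phonemes):
    """Split a phoneme sequence into diphone unit names, each spanning the
    transition between two adjacent phonemes, padded with silence "_".
    A real system maps these names to recorded units in a diphone
    database and joins the waveforms at the phonemes' stable centers."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]

# X-SAMPA-like phonemes for French "bonjour" (illustrative transcription).
units = to_diphones(["b", "o~", "Z", "u", "R"])
```

Cutting at phoneme centers rather than boundaries is what makes the concatenation points fall in acoustically stable regions.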
|
5 |
Chanter avec les mains : interfaces chironomiques pour les instruments de musique numériques / Singing with hands : chironomic interfaces for digital musical instruments / Perrotin, Olivier, 23 September 2015
Le travail de cette thèse porte sur l'étude du contrôle en temps réel de synthèse de voix chantée par une tablette graphique, dans le cadre de l'instrument de musique numérique Cantor Digitalis. La pertinence de l'utilisation d'une telle interface pour le contrôle de l'intonation vocale a été traitée en premier lieu, démontrant que la tablette permet un contrôle de la hauteur mélodique plus précis que la voix réelle en situation expérimentale. Pour étendre la justesse du jeu à toutes situations, une méthode de correction dynamique de l'intonation a été développée, permettant de jouer en dessous du seuil de perception de justesse et préservant en même temps l'expressivité du musicien. Des évaluations objective et perceptive ont permis de valider l'efficacité de cette méthode. L'utilisation de nouvelles interfaces pour la musique pose la question des modalités impliquées dans le jeu de l'instrument. Une troisième étude révèle une prépondérance de la perception visuelle sur la perception auditive pour le contrôle de l'intonation, due à l'introduction d'indices visuels sur la surface de la tablette. Néanmoins, celle-ci est compensée par l'important pouvoir expressif de l'interface. En effet, la maîtrise de l'écriture ou du dessin dès l'enfance permet l'acquisition rapide d'un contrôle expert de l'instrument. Pour formaliser ce contrôle, nous proposons une suite de gestes adaptés à différents effets musicaux rencontrés dans la musique vocale. Enfin, une pratique intensive de l'instrument est réalisée au sein de l'ensemble Chorus Digitalis, à des fins de test et de diffusion. Un travail de recherche artistique est conduit tant dans la mise en scène que dans le choix du répertoire musical à associer à l'instrument. De plus, un retour visuel dédié au public a été développé, afin d'aider à la compréhension du maniement de l'instrument. 
/ This thesis deals with the real-time control of singing voice synthesis by a graphic tablet, based on the digital musical instrument Cantor Digitalis. The relevance of the graphic tablet for intonation control is first considered, showing that the tablet provides more precise pitch control than the real voice under experimental conditions. To extend this accuracy of control to any situation, a dynamic pitch warping method for intonation correction is developed. It makes it possible to play below the pitch perception limens while preserving the musician's expressivity. Objective and perceptive evaluations validate the method's efficiency. The use of new interfaces for musical expression raises the question of the modalities involved in playing the instrument. A third study reveals a preponderance of the visual modality over auditory perception for intonation control, due to the introduction of visual cues on the tablet surface. Nevertheless, this is compensated for by the expressivity allowed by the interface. The writing and drawing abilities acquired since early childhood enable a quick acquisition of expert control of the instrument. An ensemble of gestures dedicated to the control of different vocal effects is suggested. Finally, intensive practice of the instrument takes place within the Chorus Digitalis ensemble, to test and promote our work. Artistic research has been conducted on the choice of the Cantor Digitalis' musical repertoire. Moreover, a visual feedback display dedicated to the audience has been developed, extending the perception of the players' pitch and articulation.
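The intonation-correction idea above, playing in tune while keeping expressive deviation, can be caricatured as pulling the played pitch only part of the way toward the nearest equal-tempered semitone. The sketch below is a rough analogue under that assumption, not the thesis's actual dynamic pitch-warping algorithm (which adapts the correction over time):

```python
import math

def correct_pitch(f0_hz, strength=0.7, ref_hz=440.0):
    """Pull a fundamental frequency part of the way toward the nearest
    equal-tempered semitone.  strength=1 snaps completely; smaller values
    keep some of the player's expressive deviation.  The 0.7 default is
    an arbitrary illustrative choice."""
    semitones = 12.0 * math.log2(f0_hz / ref_hz)
    target = round(semitones)  # nearest semitone relative to ref_hz
    corrected = semitones + strength * (target - semitones)
    return ref_hz * 2.0 ** (corrected / 12.0)

# A note played 40 cents sharp of A4 is pulled back to 12 cents sharp,
# below typical pitch-discrimination thresholds for running melody.
corrected = correct_pitch(440.0 * 2 ** (0.4 / 12))
```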
|
6 |
Proposal of a Java Front-end for voice synthesizer based on MBROLA / Proposta de um Front-end em Java para sintetizador de voz baseado no MBROLA / Nícolas de Araújo Moreira, 02 September 2015
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / It is estimated that, in Brazil, about 3.46% of the population has a severe visual impairment and 1.6% is completely blind. The lack of adequate inclusive tools imposes many restrictions on the lives of these people; in other words, non-accessible hardware and software have a negative impact on academic, professional and personal life. In this context, the present thesis aims to develop an accessible system for the digital inclusion of blind users, since existing systems have drawbacks, such as low quality or high cost, that make daily use impractical. The system consists of a multiplatform Java front-end for the MBROLA voice synthesizer and a set of accessible programs, including a text editor, a chat client and a virtual magnifier, among others. In addition, the system is free, both to reach as many users as possible and so that it can be modified and improved by the community. The developed solution was field-tested, showing a mean intelligibility rate of 79%, with naturalness rated as "reasonable" by a group of 20 users. In the end, the system proved feasible, filling an existing gap in the Brazilian software market and allowing greater inclusion of blind users in digital resources. / Estima-se que, no Brasil, cerca de 3,46% da população apresenta grande limitação de visão e 1,6% seja totalmente incapaz de enxergar. A falta de meios de inclusão adequados impõe uma série de restrições na vida destas pessoas; em outras palavras, ferramentas de hardware e software não acessíveis geram impacto negativo na vida acadêmica, pessoal e profissional. Dentro desse contexto, a presente Dissertação tem por objetivo principal desenvolver um sistema para inclusão digital de deficientes visuais. O sistema é composto por um front-end multiplataforma para o sintetizador de voz MBROLA e um conjunto de programas acessíveis, que inclui editor de texto, cliente de chat, lente de aumento virtual, entre outros, desenvolvido em Java a fim de gerar um software multiplataforma. 
Além disso, o sistema é gratuito e livre para que possa atingir o maior número de usuários possível e ser modificado e aprimorado pela comunidade. A solução desenvolvida foi testada em campo, apresentando índice de inteligibilidade médio de 79% e com naturalidade classificada como razoável em um grupo de 20 usuários. Por fim, o sistema se mostrou viável, vindo a preencher uma lacuna existente no mercado brasileiro de softwares, permitindo maior inclusão dos deficientes visuais aos meios digitais.
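A front-end for MBROLA ultimately has to emit the synthesizer's .pho input format: one line per phoneme with a duration in milliseconds, followed by optional (position %, F0 Hz) pitch targets. The sketch below shows only that final serialization step; the phoneme symbols and values are made up, and the thesis's Java front-end is of course far more complete:

```python
def to_pho(segments):
    """Render (phoneme, duration_ms, [(position_percent, f0_hz), ...])
    tuples as MBROLA .pho lines.  A real front-end derives the segments
    from text analysis and prosody rules before calling the MBROLA
    back-end on the resulting file."""
    lines = []
    for phoneme, dur, targets in segments:
        pitch = " ".join(f"{pos} {hz}" for pos, hz in targets)
        lines.append(f"{phoneme} {dur} {pitch}".rstrip())
    return "\n".join(lines)

pho = to_pho([
    ("_", 100, []),                      # initial silence
    ("o", 120, [(0, 110), (100, 130)]),  # rising pitch across the vowel
    ("i", 150, [(50, 125)]),
])
```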
|
7 |
Model-based synthesis of singing / Modellbaserad syntes av sång / Zeng, Xiaofeng, January 2023
The legacy KTH Music and Singing Synthesis Equipment (MUSSE) system, developed decades ago, is no longer compatible with contemporary computer systems. Nonetheless, the fundamental synthesis model at its core, known as the source-filter model, continues to be a valuable technology in the research field of voice synthesis. In this thesis, the author re-implemented the legacy system using the traditional source-filter model on the modern platform SuperCollider. This re-implementation brought substantial enhancements in functionality, flexibility and performance. The most noteworthy improvement introduced in the new system is the addition of notch filters, which are able to simulate anti-resonances in the human vocal tract, thereby allowing a broader range of vocal nuances to be reproduced. To demonstrate the significance of notches in vowel synthesis, a subjective auditory experiment was conducted. The results of this experiment clearly show that vowels synthesized with notches sound much more natural and closer to a real human voice. The work presented in this thesis, the new MUSSE program with notch filters, will serve as a foundation to support general acoustics research at TMH in the future. / Det äldre KTH Music and Singing Synthesis Equipment (MUSSE)-systemet, som utvecklades för decennier sedan, är inte längre kompatibelt med samtida datorsystem. Trots det fortsätter den grundläggande syntesmodellen vid dess kärna, känd som källa-filtermodellen, att vara en värdefull teknik inom forskningsområdet för röstsyntes. I den här avhandlingen har författaren återimplementerat det äldre systemet med den traditionella källa-filtermodellen och den moderna plattformen SuperCollider. Denna återimplementering ledde till betydande förbättringar i funktionalitet, flexibilitet och prestanda. 
Den mest anmärkningsvärda förbättringen som infördes i det nya systemet är tillägget av notch-filter, som kan simulera anti-resonanser i den mänskliga röstkanalen och därmed möjliggöra en bredare uppsättning vokala nyanser att återskapas. För att visa betydelsen av notch-filter i vokalsyntes utfördes en subjektiv auditiv undersökning. Resultaten av denna undersökning visar tydligt att vokaler som syntetiseras med notch-filter låter mycket mer naturliga och liknar den verkliga mänskliga rösten. Arbetet som presenteras i denna avhandling, det nya MUSSE-programmet med notch-filter, kommer att fungera som en grund för att stödja allmän akustikforskning vid TMH i framtiden.
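The source-filter architecture with notch filters described in this entry can be sketched with second-order sections: resonator biquads for formants and a zeros-on-the-unit-circle biquad for each anti-resonance. This is a generic textbook formulation assumed here for illustration, not the SuperCollider implementation from the thesis:

```python
import math

def biquad(x, b, a):
    """Direct-form I second-order filter (pure Python, for illustration)."""
    y, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for s in x:
        out = b[0] * s + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        y.append(out)
        x2, x1 = x1, s
        y2, y1 = y1, out
    return y

def resonator(freq, bw, fs):
    """Two-pole resonance (formant) at freq Hz with bandwidth bw Hz."""
    r = math.exp(-math.pi * bw / fs)
    theta = 2 * math.pi * freq / fs
    return [1 - r, 0.0, 0.0], [1.0, -2 * r * math.cos(theta), r * r]

def notch(freq, bw, fs):
    """Anti-resonance: zeros on the unit circle at freq, poles just inside,
    so the gain at freq itself is exactly zero."""
    r = math.exp(-math.pi * bw / fs)
    c = math.cos(2 * math.pi * freq / fs)
    return [1.0, -2 * c, 1.0], [1.0, -2 * r * c, r * r]

fs = 16000
pulse = [1.0] + [0.0] * 255                        # toy glottal excitation
voiced = biquad(pulse, *resonator(700, 90, fs))    # one formant
shaped = biquad(voiced, *notch(2500, 120, fs))     # one anti-resonance
```

Cascading several resonators and notches in this way approximates the pole-zero vocal tract response that the thesis argues is needed for natural-sounding vowels.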
|
8 |
(A)I want to start a podcast : En designbaserad & kvalitativ studie om AI verktyg i podcastproduktion / Grimberg, Vilhelm; Kenez, Xander, January 2024
This study investigates the application and implications of AI-generated content in podcast production. The research particularly explores the use of text-to-speech (TTS) systems and AI language models to simulate authentic-sounding conversations. This study analyzes listener responses to different AI-generated and human-edited podcast episodes through a series of prototypes and interviews with listeners. Findings suggest that listeners often perceive AI-generated conversations as less authentic and natural than human-made ones, especially due to issues like unnatural intonation and a lack of natural discourse markers. Despite these challenges, improvements were noted in later prototypes where manual editing was combined with AI-generated content. This highlights the potential for AI to complement human creativity in podcast production. The study concludes that for AI-generated content to achieve the desired level of authenticity, further involvement of human intuition is necessary. Future research should explore refining AI models to better simulate natural conversation flow and focus on enhancing the nuances of human-like speech. The findings also underline the potential of AI tools to revolutionize podcast production workflows. / Denna studie undersöker användningen och implikationerna av AI-genererat innehåll i podcastproduktion. Forskningen utforskar särskilt användningen av text-till-tal-system (TTS) och AI-språkmodeller för att simulera samtal som låter autentiska. Studien analyserar lyssnarreaktioner på olika AI-genererade och mänskligt redigerade poddavsnitt genom en serie prototyper och intervjuer med lyssnare. Resultaten visar att lyssnare ofta upplever AI-genererade samtal som mindre autentiska och naturliga än de som skapats av människor. Särskilt på grund av problem som onaturliga betoningar och brist på naturliga diskurspartiklar. 
Trots dessa utmaningar märktes förbättringar i senare prototyper där manuell redigering kombinerades med AI-genererat innehåll, vilket belyser potentialen för AI att komplettera mänsklig kreativitet i podcastproduktion. Genom forskningen dras slutsatsen att AI-genererat innehåll kräver ytterligare integration av mänsklig intuition för att uppnå önskad nivå av autenticitet. Framtida forskning bör utforska hur AI-modeller kan förfinas för att bättre simulera naturligt samtalsflöde och fokusera på att förbättra nyanserna i mänskligt tal. Resultaten understryker också potentialen hos AI-verktyg att revolutionera arbetsflödena för podcastproduktion.
|
9 |
Expressivité et contrôle de modèles d'apprentissage automatique dans un corpus d'installations audiovisuelles / Lavoie Viau, Gabriel, 12 1900
L'appropriation d'algorithmes existants, la création d'outils numériques et des recherches conceptuelles ont mené à la création de deux installations audiovisuelles interactives. La première, Deep Duo, met en scène des réseaux de neurones artificiels contrôlant des synthétiseurs modulaires. La deuxième, Morphogenèse, l'œuvre d'envergure de ce mémoire, met en relation le spectateur avec des modèles profonds génératifs et le place face à des représentations artificielles de sa voix et de son visage. Les installations et leurs fonctionnements seront décrits et, à travers des exemples de stratégies créatives et des concepts théoriques en lien avec l'interactivité et l'esthétique des comportements, des pistes pour favoriser l'utilisation d'algorithmes d'apprentissage automatique à des fins créatives seront proposées. / The appropriation of existing algorithms, the creation of digital tools and conceptual research have led to the creation of two interactive audiovisual installations. The first, Deep Duo, features artificial neural networks controlling modular synthesizers. The second, Morphogenesis, the major work of this dissertation, connects the viewer with generative deep models and places them in front of artificial representations of their voice and face. We will describe these installations and their functioning and, through examples of creative strategies and theoretical concepts related to interactivity and the aesthetics of behaviour, we will propose ways to promote the use of machine learning algorithms for creative purposes.
|