161

Voice input for the disabled

Holmes, William Paul. January 1987
Thesis (M. Eng. Sc.), University of Adelaide, 1987. Typescript. Includes a copy of a paper, co-authored by the author, presented at TADSEM '85, the Australian Seminar on Devices for Expressive Communication and Environmental Control. Includes bibliographical references (leaves 115-121).
162

Vers une interface cerveau-machine pour la restauration de la parole / Toward a brain-computer interface for speech restoration

Bocquelet, Florent 24 April 2017
Restoring natural speech in paralyzed and aphasic people could be achieved with a brain-computer interface controlling a speech synthesizer in real time. The aim of this thesis was to develop three main steps toward such a proof of concept.

First, a prerequisite was a speech synthesizer producing intelligible speech in real time from a reasonable number of control parameters. We chose to synthesize speech from movements of the speech articulators, since recent studies suggested that neural activity in the speech motor cortex contains enough information to decode speech, and especially its articulatory features (e.g., lip opening). We thus developed a synthesizer that produces intelligible speech from articulatory data. We first recorded a large corpus of synchronous articulatory and acoustic data from a single speaker, then used machine learning techniques, in particular deep neural networks, to build a model that converts articulatory data into speech. The synthesizer was built to run in real time. Finally, as a first step toward neural control of this synthesizer, we verified that several speakers could control it in real time to produce intelligible speech from their articulatory movements in a closed-loop paradigm.

Second, we investigated the feasibility of decoding speech and its articulatory features from neural activity recorded mainly in the speech motor cortex. We built a tool to localize active cortical speech areas online during awake brain surgery at the Grenoble hospital and tested it in two patients with brain cancer. The results show that the motor cortex exhibits speech-specific activity in the beta and gamma bands, including during imagined speech. The recorded data could then be analyzed to decode the subject's intention to speak (overt or imagined), voicing activity, and the trajectories of the main vocal tract articulators, all significantly above chance level.

Finally, we addressed the ethical issues that accompany the development and use of brain-computer interfaces, considering three levels of ethical reflection concerning respectively the animal, the human being, and the human species.
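The articulatory-to-acoustic mapping at the core of such a synthesizer can be illustrated with a minimal sketch, assuming articulatory input frames (e.g., EMA sensor coordinates) and acoustic target frames (e.g., mel-cepstral coefficients); the dimensions, architecture, and training loop below are invented for illustration and are not the thesis's actual configuration.

```python
# Hypothetical sketch: feed-forward DNN mapping articulatory frames to
# acoustic frames, in the spirit of articulatory speech synthesis.
# Feature dimensions and architecture are illustrative assumptions.
import torch
import torch.nn as nn

ART_DIM = 18   # e.g., 3D coordinates of 6 EMA coils (assumption)
AC_DIM = 25    # e.g., 25 mel-cepstral coefficients per frame (assumption)

model = nn.Sequential(
    nn.Linear(ART_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, AC_DIM),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Dummy stand-in for the synchronous articulatory/acoustic corpus.
art = torch.randn(10_000, ART_DIM)
ac = torch.randn(10_000, AC_DIM)

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(art), ac)
    loss.backward()
    optimizer.step()

# At run time, each incoming articulatory frame is mapped to an acoustic
# frame and passed to a vocoder, enabling closed-loop real-time synthesis.
with torch.no_grad():
    acoustic_frame = model(torch.randn(1, ART_DIM))
```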
163

Génération de parole expressive dans le cas des langues à tons / Generation of expressive speech for tonal languages

Mac, Dang Khoa 15 June 2012
Human-computer interaction is increasingly expected to approach the naturalness of human-human interaction, including expressiveness (in particular emotions and attitudes). In spoken communication, attitudes, and more generally social affects, are conveyed mainly by prosody. In tonal languages, prosody also encodes semantic information through tone variations. This thesis studies the social affects of Vietnamese, a tonal and under-resourced language, in order to apply the results to a high-quality synthesis system able to produce expressive speech for Vietnamese.

The first part of the thesis is the construction of the first audio-visual corpus of Vietnamese attitudes, covering sixteen attitudes. This corpus is then used to study the audio-visual and cross-cultural perception of Vietnamese attitudes through a series of perceptual tests with native and non-native (French) listeners. The results show that the factors influencing the perception of attitudes are the attitudinal expression itself and the presentation modality (audio, visual, or audio-visual), and they allowed us to identify social affects shared between Vietnamese and French. A further perception test on sentences with tonal variation explored the effect of the Vietnamese tone system on attitude perception; its results show that non-native listeners can process and separate the local tonal cues from the globally scoped prosodic cues of attitudes.

After presenting these studies, we describe our modelling of attitudinal prosody for Vietnamese expressive speech synthesis. Based on the superposition-of-functional-contours model, we propose a method to model and generate expressive prosody in Vietnamese. The method was applied to generate Vietnamese expressive speech and then evaluated through perceptual tests on synthetic utterances. The results validate the model's performance and confirm that the superposition-of-functional-contours approach can model a prosody as complex as that of expressive speech in a tonal language such as Vietnamese.
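The contour-superposition idea can be illustrated with a small numeric sketch: a global, attitude-level F0 contour and local, syllable-level tone contours are generated independently and summed in the log-F0 domain. This is a simplified illustration of the general principle, not the thesis's actual model; the contour shapes and values below are invented for the example.

```python
# Simplified sketch of prosodic contour superposition: an utterance-level
# attitude contour and syllable-level tone contours are summed in log-F0.
# Contour shapes and parameter values are illustrative assumptions.
import numpy as np

FRAMES_PER_SYLLABLE = 20
N_SYLLABLES = 4
n = FRAMES_PER_SYLLABLE * N_SYLLABLES
t = np.linspace(0.0, 1.0, n)

# Global functional contour for a hypothetical "declarative" attitude:
# a gentle declination over the utterance (in log-Hz).
attitude_contour = np.log(180.0) - 0.15 * t

# Local tone contours, one per syllable.
tone_shapes = {
    "rising":  np.linspace(-0.10, 0.15, FRAMES_PER_SYLLABLE),
    "falling": np.linspace(0.10, -0.15, FRAMES_PER_SYLLABLE),
    "level":   np.zeros(FRAMES_PER_SYLLABLE),
}
tones = ["rising", "falling", "level", "rising"]
tone_contour = np.concatenate([tone_shapes[name] for name in tones])

# Superposition in the log domain, then back to Hz for the synthesizer.
f0_hz = np.exp(attitude_contour + tone_contour)
print(f0_hz.round(1))
```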
164

Využití uživatelské odezvy pro zvýšení kvality řečové syntézy / Improving text-to-speech in spoken dialogue systems by employing user's feedback

Hudeček, Vojtěch January 2017
Although spoken dialogue systems have improved greatly, they still cannot handle conversations on unknown topics. One problem is that they have difficulty pronouncing unknown words. We investigate methods that improve spoken dialogue systems by correcting the pronunciation of unknown words, a crucial step for user experience, since mispronounced proper nouns, for example, are highly undesirable. Incorrect pronunciation is caused by an imperfect phonetic representation of the word. We aim to detect incorrectly pronounced words and use knowledge about pronunciation, together with user feedback, to correct the transcriptions accordingly. Furthermore, the learned phonetic transcriptions can be added to the speech recognition module's vocabulary, so extracting correct pronunciations benefits both the speech recognition and text-to-speech components of a dialogue system.
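As a rough illustration of this feedback loop, the sketch below maintains a pronunciation lexicon, records user corrections, and propagates an accepted transcription to both the TTS lexicon and the ASR vocabulary. All names and data structures are hypothetical; the thesis's actual detection and correction methods are not reproduced here.

```python
# Hypothetical sketch of a pronunciation-feedback loop shared by the
# TTS and ASR components of a dialogue system. Names are illustrative.
from collections import defaultdict

tts_lexicon = {"prague": "P R AA G"}        # grapheme -> phoneme string
asr_vocabulary = {"prague": "P R AA G"}
feedback_counts = defaultdict(lambda: defaultdict(int))

def report_mispronunciation(word: str, suggested_phones: str) -> None:
    """Record one user correction for a (word, pronunciation) pair."""
    feedback_counts[word][suggested_phones] += 1

def apply_feedback(word: str, min_votes: int = 3) -> None:
    """Adopt the most-voted pronunciation once it has enough support,
    updating both the synthesizer lexicon and recognizer vocabulary."""
    if not feedback_counts[word]:
        return
    phones, votes = max(feedback_counts[word].items(), key=lambda kv: kv[1])
    if votes >= min_votes:
        tts_lexicon[word] = phones
        asr_vocabulary[word] = phones

# Example: three users flag the same proper noun.
for _ in range(3):
    report_mispronunciation("hudecek", "HH UW D EH CH EH K")
apply_feedback("hudecek")
print(tts_lexicon["hudecek"])
```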
165

Discussion On Effective Restoration Of Oral Speech Using Voice Conversion Techniques Based On Gaussian Mixture Modeling

Alverio, Gustavo 01 January 2007
Today's world offers many ways to communicate information, and speech is one of the most effective. Unfortunately, many people lose the ability to converse, which carries a large negative psychological impact; in addition, skills such as lecturing and singing must then be restored by other means. Text-to-speech synthesis, which converts text into speech, has been a popular way of restoring oral communication. Although useful, text-to-speech systems offer only a few default voices, none of which represents the user's own. To achieve full restoration, voice conversion must be introduced: a method that adjusts a source voice to sound like a target voice. Voice conversion consists of a training process and a conversion process. Training is conducted on a speech corpus, covering a variety of speech sounds, spoken by both the source and target voices. Once training is finished, the conversion function is employed to transform the source voice into the target voice; effectively, voice conversion allows a speaker to sound like any other person, and can therefore be applied to alter the voice output of a text-to-speech system to produce the target voice. This thesis investigates how one approach, voice conversion based on Gaussian mixture modeling, can be applied to alter the voice output of a text-to-speech synthesis system; acceptable results were obtained with these methods. Because voice conversion and text-to-speech synthesis require a sample of the speaker recorded before voice loss for the training process, it is vital that such voice samples be made in advance.
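The training and conversion processes described above can be sketched with a joint-density Gaussian mixture model, the classic formulation of GMM-based voice conversion: a GMM is fitted on stacked source-target feature frames, and conversion takes the conditional expectation of the target features given a source frame. The sketch below uses random stand-in data and illustrative dimensions; it shows the general technique rather than the exact system studied in the thesis.

```python
# Hedged sketch of GMM-based voice conversion (joint-density approach):
# fit a GMM on stacked [source; target] spectral features, then convert
# a source frame via the conditional expectation E[target | source].
# Data here is random; real systems use time-aligned spectral features
# (e.g., mel-cepstra) from parallel source/target recordings.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

d = 8            # feature dimension per voice (assumption)
rng = np.random.default_rng(0)
source = rng.normal(size=(2000, d))
target = source @ rng.normal(scale=0.3, size=(d, d)) + 0.1  # fake "target voice"
joint = np.hstack([source, target])                          # frames pre-aligned

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(joint)

def convert(x: np.ndarray) -> np.ndarray:
    """Map one source frame to the expected target frame."""
    mu_x = gmm.means_[:, :d]
    mu_y = gmm.means_[:, d:]
    cov_xx = gmm.covariances_[:, :d, :d]
    cov_yx = gmm.covariances_[:, d:, :d]
    # Posterior component responsibilities given the source frame only.
    like = np.array([
        multivariate_normal.pdf(x, mean=mu_x[k], cov=cov_xx[k])
        for k in range(gmm.n_components)
    ])
    post = gmm.weights_ * like
    post /= post.sum()
    # Per-component conditional means, blended by the posteriors.
    y_k = np.array([
        mu_y[k] + cov_yx[k] @ np.linalg.solve(cov_xx[k], x - mu_x[k])
        for k in range(gmm.n_components)
    ])
    return post @ y_k

print(convert(source[0]).round(3))
```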
166

Efficient 3D Acoustic Simulation of the Vocal Tract by Combining the Multimodal Method and Finite Elements

Blandin, Rémi, Arnela, Marc, Félix, Simon, Doc, Jean-Baptiste, Birkholz, Peter 22 February 2024
Acoustic simulation of sound propagation inside the vocal tract is a key element of speech research, especially for articulatory synthesis, which relates the physics of speech production to other fields of speech science, such as speech perception. The usual methods, such as the transmission line method, have very low computational cost and perform relatively well up to 4-5 kHz, but are not satisfactory above that. Fully numerical 3D methods such as finite elements achieve the best accuracy, but at a very high computational cost. State-of-the-art semi-analytical methods perform better, but cannot describe the vocal tract geometry as accurately as fully numerical methods (e.g., they cannot account for curvature). This work proposes a new semi-analytical method that describes the three-dimensional vocal-tract geometry more accurately while keeping the computational cost substantially lower than fully numerical methods. It is a multimodal method that relies on two-dimensional finite elements to compute transverse modes and takes into account curvature and variations of cross-sectional area. Comparison with finite element simulations shows that the same degree of accuracy (about 1% difference in resonance frequencies) is achieved at a computational cost about 10 times lower.
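For context, the low-cost baseline the abstract names, the transmission line (chain matrix) method, can be sketched in a few lines: the vocal tract is approximated as concatenated lossless uniform tubes, each contributing an ABCD matrix, and resonances appear as peaks of the volume-velocity transfer function. The tube areas below are illustrative, not data from the paper.

```python
# Sketch of the transmission line (chain matrix) baseline: the vocal
# tract as concatenated lossless uniform tubes. Resonances show up as
# peaks of |U_lips / U_glottis|. Tube areas are illustrative values.
import numpy as np

RHO, C = 1.2, 350.0                              # air density, sound speed
areas = np.array([2.0, 1.0, 4.0, 6.0]) * 1e-4    # section areas (m^2), assumption
lengths = np.full(4, 0.17 / 4)                   # total tract ~17 cm

freqs = np.linspace(50, 5000, 2000)
gain = np.empty_like(freqs)
for i, f in enumerate(freqs):
    k = 2 * np.pi * f / C
    chain = np.eye(2, dtype=complex)
    for A, L in zip(areas, lengths):
        zc = RHO * C / A                         # characteristic impedance
        section = np.array([[np.cos(k * L), 1j * zc * np.sin(k * L)],
                            [1j * np.sin(k * L) / zc, np.cos(k * L)]])
        chain = chain @ section                  # glottis -> lips
    # Ideal open termination at the lips (pressure node): U_l/U_g = 1/D.
    gain[i] = 1.0 / abs(chain[1, 1])

# Resonances = local maxima of the transfer gain.
peaks = (gain[1:-1] > gain[:-2]) & (gain[1:-1] > gain[2:])
print("Resonances (Hz):", freqs[1:-1][peaks].round(0))
```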
167

Vocal Expression of Emotion: Discrete-emotions and Dimensional Accounts

Laukka, Petri January 2004
This thesis investigated whether vocal emotion expressions are conveyed as discrete emotions or as continuous dimensions.

Study I consisted of a meta-analysis of decoding accuracy of discrete emotions (anger, fear, happiness, love-tenderness, sadness) within and across cultures, together with a review of the literature on the acoustic characteristics of expressions. The results suggest that vocal expressions are universally recognized and that emotion-specific patterns of voice cues exist for discrete emotions.

In Study II, actors vocally portrayed anger, disgust, fear, happiness, and sadness with weak and strong emotion intensity. The portrayals were decoded by listeners and acoustically analyzed with respect to 20 voice cues (e.g., speech rate, voice intensity, fundamental frequency, spectral energy distribution). Both the intended emotion and the intensity of the portrayals were accurately decoded and had an impact on the voice cues, and listeners' ratings of both emotion and intensity could be predicted from a selection of the cues.

In Study III, listeners rated the portrayals from Study II on emotion dimensions (activation, valence, potency, emotion intensity). All dimensions were correlated with several voice cues, and listeners' ratings could be successfully predicted from the voice cues for all dimensions except valence.

In Study IV, continua of morphed expressions, ranging from one emotion to another in equal steps, were created using speech synthesis. Listeners identified the emotion of each expression and discriminated between pairs of expressions. The continua were perceived as two distinct sections separated by a sudden category boundary, and discrimination accuracy was generally higher for pairs of stimuli falling across category boundaries than for pairs belonging to the same category, suggesting that vocal expressions are categorically perceived.

Taken together, the results suggest that a discrete-emotions approach provides the best account of vocal expression. Previous difficulties in finding emotion-specific patterns of voice cues may be explained in terms of limitations of previous studies and of the coding of the communicative process.
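Study II's acoustic analysis rests on standard voice cues, a few of which can be extracted with off-the-shelf tools. The sketch below computes fundamental frequency, an intensity proxy, and a crude spectral energy distribution cue for one recording; it is a generic illustration, not the thesis's 20-cue protocol, and the bundled librosa example clip merely stands in for a speech portrayal.

```python
# Hedged sketch: extracting a few of the voice cues named in Study II
# (fundamental frequency, voice intensity, spectral energy distribution)
# with librosa. Generic illustration only; substitute your own recording.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))   # stand-in for a portrayal

# Fundamental frequency contour via the pYIN algorithm.
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"), sr=sr)
mean_f0 = float(np.nanmean(f0))               # NaN over unvoiced frames

# Voice intensity proxy: frame-wise RMS energy in dB.
rms = librosa.feature.rms(y=y)[0]
mean_db = float(np.mean(librosa.amplitude_to_db(rms)))

# Spectral energy distribution proxy: share of energy above 1 kHz.
S = np.abs(librosa.stft(y)) ** 2
freqs = librosa.fft_frequencies(sr=sr)
high_ratio = S[freqs > 1000].sum() / S.sum()

print(f"mean F0 {mean_f0:.1f} Hz, mean level {mean_db:.1f} dB, "
      f"energy >1 kHz {high_ratio:.2%}")
```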
169

Síntesi basada en models ocults de Markov aplicada a l'espanyol i a l'anglès, les seves aplicacions i una proposta híbrida / Hidden Markov model based synthesis applied to Spanish and English, its applications and a hybrid proposal

Gonzalvo Fructuoso, Javier 16 July 2010
Human-computer interaction (HCI) is one of the most studied disciplines today, with the goal of improving human interaction with present and future systems. More and more people use electronic devices in daily life, for two main reasons: the technology has become far more affordable, and friendlier interfaces allow easier, more intuitive use. Today's personal computers, pocket computers, and even mobile phones let inexperienced users take advantage of cutting-edge technologies. Moreover, speech technologies are becoming more common in these systems as speech recognition and synthesis have improved in performance and reliability.

The goal of speech technology is to provide systems with an interface as natural as a human one, so that their use can extend to every corner of daily life. Text-to-speech (TTS) systems are among the modules receiving the most research effort, aimed at improving their naturalness and expressiveness. The use of synthesizers has broadened recently thanks to the high quality reached in restricted-domain applications and good performance in general-purpose applications, but there is still a long way to go regarding quality in open-domain systems. Current trends in synthesis include smaller databases, flexible systems that adapt to speakers and speaking styles, and trainable systems.

This thesis presents a speech synthesizer based on the statistical framework of hidden Markov models (HMMs) that addresses the main topics under current study, such as speaker-style adaptation, trainable TTS systems, and reduced-size databases. The conventional algorithms are described, and improvements are proposed in several areas, such as expressiveness. A state-of-the-art hybrid system combining statistical models and concatenative synthesis is also presented. The results show that the proposals in this work are a step forward in synthetic speech generation with statistical models.
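To make the statistical framework concrete, the sketch below trains one small Gaussian HMM per phone on acoustic feature frames using hmmlearn; in HMM-based TTS, acoustic parameters are then generated from such models at synthesis time. The features, sizes, and data are illustrative assumptions, not the thesis's configuration, which also involves context-dependent models, duration models, and dynamic features.

```python
# Hedged sketch of the statistical core of HMM-based synthesis: one
# Gaussian HMM per phone, trained on acoustic feature frames. Sizes
# and data are illustrative assumptions.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
FEAT_DIM = 13           # e.g., mel-cepstral coefficients per frame

def train_phone_model(examples):
    """Fit a 3-state diagonal-covariance HMM on a list of feature
    sequences (one per recorded instance of the phone)."""
    X = np.vstack(examples)
    lengths = [len(e) for e in examples]
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                            n_iter=20, random_state=0)
    model.fit(X, lengths)
    return model

# Dummy stand-in for segmented training data: 5 instances of one phone.
instances = [rng.normal(size=(rng.integers(8, 15), FEAT_DIM))
             for _ in range(5)]
model_a = train_phone_model(instances)

# The per-state output means act as the acoustic targets from which a
# parameter-generation algorithm would build smooth trajectories.
print(model_a.means_.shape)   # (3, 13)
```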
170

Αυτόματος τεμαχισμός ψηφιακών σημάτων ομιλίας και εφαρμογή στη σύνθεση ομιλίας, αναγνώριση ομιλίας και αναγνώριση γλώσσας / Automatic segmentation of digital speech signals and application to speech synthesis, speech recognition and language recognition

Μπόρας, Ιωσήφ 19 October 2009
This dissertation introduces methods for the automatic segmentation of speech signals. Four new segmentation methods are presented, covering both linguistically constrained and unconstrained settings. The first method uses pitch-mark points, the signal instants corresponding to glottal openings during speech, to extract pseudo-phonetic boundaries with the dynamic time warping algorithm. The second introduces a new hybrid training method for hidden Markov models that makes them more effective at speech segmentation. The third uses regression algorithms to fuse independent segmentation engines. The fourth extends the Viterbi algorithm with multiple speech parameterization techniques for segmentation. Finally, the proposed methods are used to improve systems for speech synthesis, speech recognition, and language recognition.
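The first method's core idea, mapping known boundaries through a DTW alignment, can be sketched generically: align the feature sequence of a reference utterance with known boundaries to a target utterance, then project the boundaries through the warping path. The sketch below uses MFCC-like frames rather than pitch-mark points, so it simplifies the thesis's method; all data is a random stand-in.

```python
# Hedged sketch of boundary projection through DTW alignment: boundaries
# known on a reference feature sequence are mapped onto a target one.
# Uses frame features for simplicity; the thesis's first method operates
# on pitch-mark (glottal opening) points instead.
import numpy as np
import librosa

rng = np.random.default_rng(0)
ref = rng.normal(size=(13, 40))    # stand-in reference MFCCs (dim x frames)
tgt = rng.normal(size=(13, 55))    # stand-in target MFCCs

# Accumulated cost matrix and optimal warping path (index pairs).
D, wp = librosa.sequence.dtw(X=ref, Y=tgt)
wp = wp[::-1]                      # path is returned end -> start

def project_boundary(ref_frame: int) -> int:
    """Return the first target frame aligned to a reference boundary."""
    for i, j in wp:
        if i >= ref_frame:
            return int(j)
    return tgt.shape[1] - 1

ref_boundaries = [10, 25, 39]      # hypothetical phone boundaries (frames)
print([project_boundary(b) for b in ref_boundaries])
```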
