161

Voice input for the disabled

Holmes, William Paul. January 1987
Thesis (M. Eng. Sc.), University of Adelaide, 1987. Typescript. Includes a copy of a paper, co-authored by the author, presented at TADSEM '85, the Australian Seminar on Devices for Expressive Communication and Environmental Control. Includes bibliographical references (leaves 115-121).
162

Vers une interface cerveau-machine pour la restauration de la parole / Toward a brain-computer interface for speech restoration

Bocquelet, Florent 24 April 2017
Restoring natural speech in paralyzed and aphasic people could be achieved with a brain-computer interface controlling a speech synthesizer in real time. The aim of this thesis was to develop three main steps toward such a proof of concept.

First, a prerequisite was a speech synthesizer producing intelligible speech in real time from a reasonable number of control parameters. We chose to synthesize speech from movements of the speech articulators, since recent studies suggested that neural activity in the speech motor cortex contains enough information to decode speech, and especially its articulatory features (e.g., lip opening). We thus developed a synthesizer that produces intelligible speech from articulatory data. We first recorded a large corpus of synchronous articulatory and acoustic data from a single speaker, then used machine learning techniques, in particular deep neural networks, to build a model that converts articulatory data into speech. The synthesizer was built to run in real time. Finally, as a first step toward neural control of this synthesizer, we verified that several speakers could control it in real time to produce intelligible speech from their articulatory movements in a closed-loop paradigm.

Second, we investigated the feasibility of decoding speech and its articulatory features from neural activity recorded mainly in the speech motor cortex. We built a tool to localize active cortical speech areas online during awake brain surgery at the Grenoble hospital and tested it in two patients with brain cancer. The results show that the motor cortex exhibits speech-specific activity in the beta and gamma bands, including during imagined speech. The recorded data could then be analyzed to decode the subject's intention to speak (overt or imagined), voicing activity, and the trajectories of the main vocal tract articulators, all significantly above chance level.

Finally, we addressed the ethical issues that accompany the development and use of brain-computer interfaces, considering three levels of ethical reflection concerning respectively the animal, the human being, and the human species.
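The articulatory-to-acoustic mapping at the core of such a synthesizer can be illustrated with a minimal sketch, assuming articulatory input frames (e.g., EMA sensor coordinates) and acoustic target frames (e.g., mel-cepstral coefficients); the dimensions, architecture, and training loop below are invented for illustration and are not the thesis's actual configuration.

```python
# Hypothetical sketch: feed-forward DNN mapping articulatory frames to
# acoustic frames, in the spirit of articulatory speech synthesis.
# Feature dimensions and architecture are illustrative assumptions.
import torch
import torch.nn as nn

ART_DIM = 18   # e.g., 3D coordinates of 6 EMA coils (assumption)
AC_DIM = 25    # e.g., 25 mel-cepstral coefficients per frame (assumption)

model = nn.Sequential(
    nn.Linear(ART_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, AC_DIM),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Dummy stand-in for the synchronous articulatory/acoustic corpus.
art = torch.randn(10_000, ART_DIM)
ac = torch.randn(10_000, AC_DIM)

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(art), ac)
    loss.backward()
    optimizer.step()

# At run time, each incoming articulatory frame is mapped to an acoustic
# frame and passed to a vocoder, enabling closed-loop real-time synthesis.
with torch.no_grad():
    acoustic_frame = model(torch.randn(1, ART_DIM))
```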
163

Génération de parole expressive dans le cas des langues à tons / Generation of expressive speech for tonal languages

Mac, Dang Khoa 15 June 2012
Human-computer interaction is increasingly expected to approach the naturalness of human-human interaction, including expressiveness (in particular emotions and attitudes). In spoken communication, attitudes, and more generally social affects, are conveyed mainly by prosody. In tonal languages, prosody also encodes semantic information through tone variations. This thesis studies the social affects of Vietnamese, a tonal and under-resourced language, in order to apply the results to a high-quality synthesis system able to produce expressive speech for Vietnamese.

The first part of the thesis is the construction of the first audio-visual corpus of Vietnamese attitudes, covering sixteen attitudes. This corpus is then used to study the audio-visual and cross-cultural perception of Vietnamese attitudes through a series of perceptual tests with native and non-native (French) listeners. The results show that the factors influencing the perception of attitudes are the attitudinal expression itself and the presentation modality (audio, visual, or audio-visual), and they allowed us to identify social affects shared between Vietnamese and French. A further perception test on sentences with tonal variation explored the effect of the Vietnamese tone system on attitude perception; its results show that non-native listeners can process and separate the local tonal cues from the globally scoped prosodic cues of attitudes.

After presenting these studies, we describe our modelling of attitudinal prosody for Vietnamese expressive speech synthesis. Based on the superposition-of-functional-contours model, we propose a method to model and generate expressive prosody in Vietnamese. The method was applied to generate Vietnamese expressive speech and then evaluated through perceptual tests on synthetic utterances. The results validate the model's performance and confirm that the superposition-of-functional-contours approach can model a prosody as complex as that of expressive speech in a tonal language such as Vietnamese.
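The contour-superposition idea can be illustrated with a small numeric sketch: a global, attitude-level F0 contour and local, syllable-level tone contours are generated independently and summed in the log-F0 domain. This is a simplified illustration of the general principle, not the thesis's actual model; the contour shapes and values below are invented for the example.

```python
# Simplified sketch of prosodic contour superposition: an utterance-level
# attitude contour and syllable-level tone contours are summed in log-F0.
# Contour shapes and parameter values are illustrative assumptions.
import numpy as np

FRAMES_PER_SYLLABLE = 20
N_SYLLABLES = 4
n = FRAMES_PER_SYLLABLE * N_SYLLABLES
t = np.linspace(0.0, 1.0, n)

# Global functional contour for a hypothetical "declarative" attitude:
# a gentle declination over the utterance (in log-Hz).
attitude_contour = np.log(180.0) - 0.15 * t

# Local tone contours, one per syllable.
tone_shapes = {
    "rising":  np.linspace(-0.10, 0.15, FRAMES_PER_SYLLABLE),
    "falling": np.linspace(0.10, -0.15, FRAMES_PER_SYLLABLE),
    "level":   np.zeros(FRAMES_PER_SYLLABLE),
}
tones = ["rising", "falling", "level", "rising"]
tone_contour = np.concatenate([tone_shapes[name] for name in tones])

# Superposition in the log domain, then back to Hz for the synthesizer.
f0_hz = np.exp(attitude_contour + tone_contour)
print(f0_hz.round(1))
```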
164

Využití uživatelské odezvy pro zvýšení kvality řečové syntézy / Improving text-to-speech in spoken dialogue systems by employing user's feedback

Hudeček, Vojtěch January 2017
Although spoken dialogue systems have improved greatly, they still cannot handle conversations on unknown topics. One problem is that they have difficulty pronouncing unknown words. We investigate methods that improve spoken dialogue systems by correcting the pronunciation of unknown words, a crucial step for user experience, since mispronounced proper nouns, for example, are highly undesirable. Incorrect pronunciation is caused by an imperfect phonetic representation of the word. We aim to detect incorrectly pronounced words and use knowledge about pronunciation, together with user feedback, to correct the transcriptions accordingly. Furthermore, the learned phonetic transcriptions can be added to the speech recognition module's vocabulary, so extracting correct pronunciations benefits both the speech recognition and text-to-speech components of a dialogue system.
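As a rough illustration of this feedback loop, the sketch below maintains a pronunciation lexicon, records user corrections, and propagates an accepted transcription to both the TTS lexicon and the ASR vocabulary. All names and data structures are hypothetical; the thesis's actual detection and correction methods are not reproduced here.

```python
# Hypothetical sketch of a pronunciation-feedback loop shared by the
# TTS and ASR components of a dialogue system. Names are illustrative.
from collections import defaultdict

tts_lexicon = {"prague": "P R AA G"}        # grapheme -> phoneme string
asr_vocabulary = {"prague": "P R AA G"}
feedback_counts = defaultdict(lambda: defaultdict(int))

def report_mispronunciation(word: str, suggested_phones: str) -> None:
    """Record one user correction for a (word, pronunciation) pair."""
    feedback_counts[word][suggested_phones] += 1

def apply_feedback(word: str, min_votes: int = 3) -> None:
    """Adopt the most-voted pronunciation once it has enough support,
    updating both the synthesizer lexicon and recognizer vocabulary."""
    if not feedback_counts[word]:
        return
    phones, votes = max(feedback_counts[word].items(), key=lambda kv: kv[1])
    if votes >= min_votes:
        tts_lexicon[word] = phones
        asr_vocabulary[word] = phones

# Example: three users flag the same proper noun.
for _ in range(3):
    report_mispronunciation("hudecek", "HH UW D EH CH EH K")
apply_feedback("hudecek")
print(tts_lexicon["hudecek"])
```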
165

Discussion On Effective Restoration Of Oral Speech Using Voice Conversion Techniques Based On Gaussian Mixture Modeling

Alverio, Gustavo 01 January 2007
Today's world offers many ways to communicate information, and speech is one of the most effective. Unfortunately, many people lose the ability to converse, which carries a large negative psychological impact; in addition, skills such as lecturing and singing must then be restored by other means. Text-to-speech synthesis, which converts text into speech, has been a popular way of restoring oral communication. Although useful, text-to-speech systems offer only a few default voices, none of which represents the user's own. To achieve full restoration, voice conversion must be introduced: a method that adjusts a source voice to sound like a target voice. Voice conversion consists of a training process and a conversion process. Training is conducted on a speech corpus, covering a variety of speech sounds, spoken by both the source and target voices. Once training is finished, the conversion function is employed to transform the source voice into the target voice; effectively, voice conversion allows a speaker to sound like any other person, and can therefore be applied to alter the voice output of a text-to-speech system to produce the target voice. This thesis investigates how one approach, voice conversion based on Gaussian mixture modeling, can be applied to alter the voice output of a text-to-speech synthesis system; acceptable results were obtained with these methods. Because voice conversion and text-to-speech synthesis require a sample of the speaker recorded before voice loss for the training process, it is vital that such voice samples be made in advance.
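The training and conversion processes described above can be sketched with a joint-density Gaussian mixture model, the classic formulation of GMM-based voice conversion: a GMM is fitted on stacked source-target feature frames, and conversion takes the conditional expectation of the target features given a source frame. The sketch below uses random stand-in data and illustrative dimensions; it shows the general technique rather than the exact system studied in the thesis.

```python
# Hedged sketch of GMM-based voice conversion (joint-density approach):
# fit a GMM on stacked [source; target] spectral features, then convert
# a source frame via the conditional expectation E[target | source].
# Data here is random; real systems use time-aligned spectral features
# (e.g., mel-cepstra) from parallel source/target recordings.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

d = 8            # feature dimension per voice (assumption)
rng = np.random.default_rng(0)
source = rng.normal(size=(2000, d))
target = source @ rng.normal(scale=0.3, size=(d, d)) + 0.1  # fake "target voice"
joint = np.hstack([source, target])                          # frames pre-aligned

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(joint)

def convert(x: np.ndarray) -> np.ndarray:
    """Map one source frame to the expected target frame."""
    mu_x = gmm.means_[:, :d]
    mu_y = gmm.means_[:, d:]
    cov_xx = gmm.covariances_[:, :d, :d]
    cov_yx = gmm.covariances_[:, d:, :d]
    # Posterior component responsibilities given the source frame only.
    like = np.array([
        multivariate_normal.pdf(x, mean=mu_x[k], cov=cov_xx[k])
        for k in range(gmm.n_components)
    ])
    post = gmm.weights_ * like
    post /= post.sum()
    # Per-component conditional means, blended by the posteriors.
    y_k = np.array([
        mu_y[k] + cov_yx[k] @ np.linalg.solve(cov_xx[k], x - mu_x[k])
        for k in range(gmm.n_components)
    ])
    return post @ y_k

print(convert(source[0]).round(3))
```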
166

Efficient 3D Acoustic Simulation of the Vocal Tract by Combining the Multimodal Method and Finite Elements

Blandin, Rémi, Arnela, Marc, Félix, Simon, Doc, Jean-Baptiste, Birkholz, Peter 22 February 2024
Acoustic simulation of sound propagation inside the vocal tract is a key element of speech research, especially for articulatory synthesis, which relates the physics of speech production to other fields of speech science, such as speech perception. The usual methods, such as the transmission line method, have very low computational cost and perform relatively well up to 4-5 kHz, but are not satisfactory above that. Fully numerical 3D methods such as finite elements achieve the best accuracy, but at a very high computational cost. State-of-the-art semi-analytical methods perform better, but cannot describe the vocal tract geometry as accurately as fully numerical methods (e.g., they cannot account for curvature). This work proposes a new semi-analytical method that describes the three-dimensional vocal-tract geometry more accurately while keeping the computational cost substantially lower than fully numerical methods. It is a multimodal method that relies on two-dimensional finite elements to compute transverse modes and takes into account curvature and variations of cross-sectional area. Comparison with finite element simulations shows that the same degree of accuracy (about 1% difference in resonance frequencies) is achieved at a computational cost about 10 times lower.
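For context, the low-cost baseline the abstract names, the transmission line (chain matrix) method, can be sketched in a few lines: the vocal tract is approximated as concatenated lossless uniform tubes, each contributing an ABCD matrix, and resonances appear as peaks of the volume-velocity transfer function. The tube areas below are illustrative, not data from the paper.

```python
# Sketch of the transmission line (chain matrix) baseline: the vocal
# tract as concatenated lossless uniform tubes. Resonances show up as
# peaks of |U_lips / U_glottis|. Tube areas are illustrative values.
import numpy as np

RHO, C = 1.2, 350.0                              # air density, sound speed
areas = np.array([2.0, 1.0, 4.0, 6.0]) * 1e-4    # section areas (m^2), assumption
lengths = np.full(4, 0.17 / 4)                   # total tract ~17 cm

freqs = np.linspace(50, 5000, 2000)
gain = np.empty_like(freqs)
for i, f in enumerate(freqs):
    k = 2 * np.pi * f / C
    chain = np.eye(2, dtype=complex)
    for A, L in zip(areas, lengths):
        zc = RHO * C / A                         # characteristic impedance
        section = np.array([[np.cos(k * L), 1j * zc * np.sin(k * L)],
                            [1j * np.sin(k * L) / zc, np.cos(k * L)]])
        chain = chain @ section                  # glottis -> lips
    # Ideal open termination at the lips (pressure node): U_l/U_g = 1/D.
    gain[i] = 1.0 / abs(chain[1, 1])

# Resonances = local maxima of the transfer gain.
peaks = (gain[1:-1] > gain[:-2]) & (gain[1:-1] > gain[2:])
print("Resonances (Hz):", freqs[1:-1][peaks].round(0))
```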
167

Vocal Expression of Emotion: Discrete-emotions and Dimensional Accounts

Laukka, Petri January 2004
This thesis investigated whether vocal emotion expressions are conveyed as discrete emotions or as continuous dimensions.

Study I consisted of a meta-analysis of decoding accuracy of discrete emotions (anger, fear, happiness, love-tenderness, sadness) within and across cultures, together with a review of the literature on the acoustic characteristics of expressions. The results suggest that vocal expressions are universally recognized and that emotion-specific patterns of voice cues exist for discrete emotions.

In Study II, actors vocally portrayed anger, disgust, fear, happiness, and sadness with weak and strong emotion intensity. The portrayals were decoded by listeners and acoustically analyzed with respect to 20 voice cues (e.g., speech rate, voice intensity, fundamental frequency, spectral energy distribution). Both the intended emotion and the intensity of the portrayals were accurately decoded and had an impact on the voice cues, and listeners' ratings of both emotion and intensity could be predicted from a selection of the cues.

In Study III, listeners rated the portrayals from Study II on emotion dimensions (activation, valence, potency, emotion intensity). All dimensions were correlated with several voice cues, and listeners' ratings could be successfully predicted from the voice cues for all dimensions except valence.

In Study IV, continua of morphed expressions, ranging from one emotion to another in equal steps, were created using speech synthesis. Listeners identified the emotion of each expression and discriminated between pairs of expressions. The continua were perceived as two distinct sections separated by a sudden category boundary, and discrimination accuracy was generally higher for pairs of stimuli falling across category boundaries than for pairs belonging to the same category, suggesting that vocal expressions are categorically perceived.

Taken together, the results suggest that a discrete-emotions approach provides the best account of vocal expression. Previous difficulties in finding emotion-specific patterns of voice cues may be explained in terms of limitations of previous studies and of the coding of the communicative process.
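Study II's acoustic analysis rests on standard voice cues, a few of which can be extracted with off-the-shelf tools. The sketch below computes fundamental frequency, an intensity proxy, and a crude spectral energy distribution cue for one recording; it is a generic illustration, not the thesis's 20-cue protocol, and the bundled librosa example clip merely stands in for a speech portrayal.

```python
# Hedged sketch: extracting a few of the voice cues named in Study II
# (fundamental frequency, voice intensity, spectral energy distribution)
# with librosa. Generic illustration only; substitute your own recording.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))   # stand-in for a portrayal

# Fundamental frequency contour via the pYIN algorithm.
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"), sr=sr)
mean_f0 = float(np.nanmean(f0))               # NaN over unvoiced frames

# Voice intensity proxy: frame-wise RMS energy in dB.
rms = librosa.feature.rms(y=y)[0]
mean_db = float(np.mean(librosa.amplitude_to_db(rms)))

# Spectral energy distribution proxy: share of energy above 1 kHz.
S = np.abs(librosa.stft(y)) ** 2
freqs = librosa.fft_frequencies(sr=sr)
high_ratio = S[freqs > 1000].sum() / S.sum()

print(f"mean F0 {mean_f0:.1f} Hz, mean level {mean_db:.1f} dB, "
      f"energy >1 kHz {high_ratio:.2%}")
```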
169

Síntesi basada en models ocults de Markov aplicada a l'espanyol i a l'anglès, les seves aplicacions i una proposta híbrida / Hidden Markov model based synthesis applied to Spanish and English, its applications and a hybrid proposal

Gonzalvo Fructuoso, Javier 16 July 2010
Human-computer interaction (HCI) is one of the most studied disciplines today, with the goal of improving human interaction with present and future systems. More and more people use electronic devices in daily life, for two main reasons: the technology has become far more affordable, and friendlier interfaces allow easier, more intuitive use. Today's personal computers, pocket computers, and even mobile phones let inexperienced users take advantage of cutting-edge technologies. Moreover, speech technologies are becoming more common in these systems as speech recognition and synthesis have improved in performance and reliability.

The goal of speech technology is to provide systems with an interface as natural as a human one, so that their use can extend to every corner of daily life. Text-to-speech (TTS) systems are among the modules receiving the most research effort, aimed at improving their naturalness and expressiveness. The use of synthesizers has broadened recently thanks to the high quality reached in restricted-domain applications and good performance in general-purpose applications, but there is still a long way to go regarding quality in open-domain systems. Current trends in synthesis include smaller databases, flexible systems that adapt to speakers and speaking styles, and trainable systems.

This thesis presents a speech synthesizer based on the statistical framework of hidden Markov models (HMMs) that addresses the main topics under current study, such as speaker-style adaptation, trainable TTS systems, and reduced-size databases. The conventional algorithms are described, and improvements are proposed in several areas, such as expressiveness. A state-of-the-art hybrid system combining statistical models and concatenative synthesis is also presented. The results show that the proposals in this work are a step forward in synthetic speech generation with statistical models.
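To make the statistical framework concrete, the sketch below trains one small Gaussian HMM per phone on acoustic feature frames using hmmlearn; in HMM-based TTS, acoustic parameters are then generated from such models at synthesis time. The features, sizes, and data are illustrative assumptions, not the thesis's configuration, which also involves context-dependent models, duration models, and dynamic features.

```python
# Hedged sketch of the statistical core of HMM-based synthesis: one
# Gaussian HMM per phone, trained on acoustic feature frames. Sizes
# and data are illustrative assumptions.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
FEAT_DIM = 13           # e.g., mel-cepstral coefficients per frame

def train_phone_model(examples):
    """Fit a 3-state diagonal-covariance HMM on a list of feature
    sequences (one per recorded instance of the phone)."""
    X = np.vstack(examples)
    lengths = [len(e) for e in examples]
    model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                            n_iter=20, random_state=0)
    model.fit(X, lengths)
    return model

# Dummy stand-in for segmented training data: 5 instances of one phone.
instances = [rng.normal(size=(rng.integers(8, 15), FEAT_DIM))
             for _ in range(5)]
model_a = train_phone_model(instances)

# The per-state output means act as the acoustic targets from which a
# parameter-generation algorithm would build smooth trajectories.
print(model_a.means_.shape)   # (3, 13)
```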
170

Αυτόματος τεμαχισμός ψηφιακών σημάτων ομιλίας και εφαρμογή στη σύνθεση ομιλίας, αναγνώριση ομιλίας και αναγνώριση γλώσσας / Automatic segmentation of digital speech signals and application to speech synthesis, speech recognition and language recognition

Μπόρας, Ιωσήφ 19 October 2009
This dissertation introduces methods for the automatic segmentation of speech signals. Four new segmentation methods are presented, covering both linguistically constrained and unconstrained settings. The first method uses pitch-mark points, the signal instants corresponding to glottal openings during speech, to extract pseudo-phonetic boundaries with the dynamic time warping algorithm. The second introduces a new hybrid training method for hidden Markov models that makes them more effective at speech segmentation. The third uses regression algorithms to fuse independent segmentation engines. The fourth extends the Viterbi algorithm with multiple speech parameterization techniques for segmentation. Finally, the proposed methods are used to improve systems for speech synthesis, speech recognition, and language recognition.
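The first method's core idea, mapping known boundaries through a DTW alignment, can be sketched generically: align the feature sequence of a reference utterance with known boundaries to a target utterance, then project the boundaries through the warping path. The sketch below uses MFCC-like frames rather than pitch-mark points, so it simplifies the thesis's method; all data is a random stand-in.

```python
# Hedged sketch of boundary projection through DTW alignment: boundaries
# known on a reference feature sequence are mapped onto a target one.
# Uses frame features for simplicity; the thesis's first method operates
# on pitch-mark (glottal opening) points instead.
import numpy as np
import librosa

rng = np.random.default_rng(0)
ref = rng.normal(size=(13, 40))    # stand-in reference MFCCs (dim x frames)
tgt = rng.normal(size=(13, 55))    # stand-in target MFCCs

# Accumulated cost matrix and optimal warping path (index pairs).
D, wp = librosa.sequence.dtw(X=ref, Y=tgt)
wp = wp[::-1]                      # path is returned end -> start

def project_boundary(ref_frame: int) -> int:
    """Return the first target frame aligned to a reference boundary."""
    for i, j in wp:
        if i >= ref_frame:
            return int(j)
    return tgt.shape[1] - 1

ref_boundaries = [10, 25, 39]      # hypothetical phone boundaries (frames)
print([project_boundary(b) for b in ref_boundaries])
```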
