  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Using Latin Square Design To Evaluate Model Interpolation And Adaptation Based Emotional Speech Synthesis

Hsu, Chih-Yu 19 July 2012 (has links)
In this thesis, we implement a Chinese speech synthesis system based on hidden Markov models, which can synthesize speech of reasonable quality from a small corpus. Exploiting the flexibility of the parametric speech representation in this framework, we further synthesize emotional speech. We apply model interpolation and model adaptation to transform speech from neutral to a particular emotion without requiring emotional speech from the target speaker. In model interpolation, we use a monophone-based Mahalanobis distance to select, from a pool of speakers, the emotional models closest to the target speaker, and estimate interpolation weights to synthesize emotional speech. In model adaptation, we collect abundant data to train an average voice model for each emotion; these models are then adapted to the target speaker's emotional models with the CMLLR method. In addition, we design a Latin-square evaluation to reduce systematic offsets in the subjective tests, making the results more credible and fair. We synthesize emotional speech for happiness, anger, and sadness, and use the Latin-square design to evaluate performance in three parts: similarity, naturalness, and emotional expression. Based on the results, we present a comprehensive comparison of the two methods for emotional speech synthesis and draw conclusions.
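The Latin-square evaluation mentioned above can be sketched in a few lines: a cyclic Latin square gives each listener a different presentation order, so that no synthesis condition always occupies the same position. This is an illustrative sketch of the counterbalancing idea, not the thesis's actual test design.

```python
def latin_square(n):
    """Generate an n-by-n cyclic Latin square: each of the n
    conditions appears exactly once in every row and column."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]

# Example: counterbalance 3 synthesis conditions across 3 listeners.
# Row = listener, column = trial position, value = condition index.
square = latin_square(3)
for listener, order in enumerate(square):
    print(f"listener {listener}: presentation order {order}")
```

Because every condition appears once per row and once per column, position effects (e.g. listeners rating the first stimulus more harshly) average out across the panel.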
2

Speech segmentation into phonetic groups for speech recognition and synthesis / Τεμαχιοποίηση ομιλίας σε φωνητικές ομάδες για αναγνώριση και σύνθεση ομιλίας

Μπουρνά, Βασιλική 21 January 2009 (has links)
The continuously growing development of applications such as text-to-speech (TTS) and automatic speech recognition (ASR) systems makes it imperative to investigate characteristics of speech that are not limited to syntactic or lexical rules but are signalled by different processes, such as prosody.
The prosodic features of speech are those which, beyond the lexical content of utterances, convey other important elements concerning focus and accent, introducing in this way a secondary, subjacent channel of communication. Moreover, they are connected to a great extent with the expression of emotion in speech. It is therefore important to investigate these features both in neutral speech and in speech produced under emotional conditions. In this thesis, a database of emotional speech was segmented at the phoneme level, and the prosodic events occurring at the syllable level were intonationally annotated, in order to extract the parameters that allow the prosodic features to be studied in the presence of an emotional state, in comparison with neutral speech. The extracted data were then processed and the prosodic features studied, by comparing the characteristics observed from emotion to emotion and by building phoneme-duration prediction models; the conclusions drawn from these analyses, concerning the prosodic aspect of emotional speech, are presented.
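A first step of the duration analysis described above can be sketched as follows: given phoneme-level segmentations, per-emotion duration statistics can be compared directly. The records and field layout below are illustrative assumptions, not data from the thesis.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical phoneme-level segmentation records: (emotion, phoneme, duration_ms)
segments = [
    ("neutral", "a", 80), ("neutral", "a", 90), ("neutral", "s", 110),
    ("anger",   "a", 60), ("anger",   "a", 65), ("anger",   "s", 95),
]

def mean_durations(segments):
    """Average phoneme duration (ms) per (emotion, phoneme) pair."""
    groups = defaultdict(list)
    for emotion, phoneme, dur in segments:
        groups[(emotion, phoneme)].append(dur)
    return {key: mean(vals) for key, vals in groups.items()}

stats = mean_durations(segments)
# In this toy data, anger shortens /a/ relative to neutral speech.
print(stats[("neutral", "a")], stats[("anger", "a")])
```

Comparisons of this kind, run per phoneme class and emotion, are what feed the duration prediction models the abstract refers to.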
3

The automatic recognition of emotions in speech

Manamela, Phuti, John January 2020 (has links)
Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2020 / Speech emotion recognition (SER) refers to technology that enables machines to detect and recognise human emotions from spoken phrases. In the literature, numerous attempts have been made to develop systems that can recognise human emotions from the voice; however, not much work has been done in the context of South African indigenous languages. The aim of this study was to develop an SER system that can classify and recognise six basic human emotions (sadness, fear, anger, disgust, happiness, and neutral) from speech spoken in Sepedi (one of South Africa's official languages). One of the major challenges encountered in this study was the lack of a proper corpus of emotional speech. Therefore, three different Sepedi emotional speech corpora consisting of acted speech were developed: a Recorded-Sepedi corpus collected from nine recruited native speakers, a TV-broadcast corpus collected from professional Sepedi actors, and an Extended-Sepedi corpus combining the Recorded-Sepedi and TV-broadcast corpora. Features were extracted from the speech corpora and assembled into a data file, which was used to train four machine learning (ML) algorithms (SVM, KNN, MLP, and Auto-WEKA) using 10-fold cross-validation. Three experiments were then performed on the developed corpora and the performance of the algorithms was compared. The best results were achieved with Auto-WEKA in all the experiments. Good results might have been expected for the TV-broadcast corpus, since it was collected from professional actors; the results, however, showed otherwise. From the findings of this study, one can conclude that there is no single precise technique for developing SER systems; it is a matter of experimenting and finding the technique best suited to the study at hand.
The study also highlighted the scarcity of SER resources for South African indigenous languages, and the vital role that dataset quality plays in the performance of SER systems. / National Research Foundation (NRF) and Telkom Centre of Excellence (CoE)
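The 10-fold validation protocol mentioned above can be sketched generically: the data is split into ten disjoint folds, each fold serves once as the test set, and the scores are averaged. The toy majority-label "classifier" below is a placeholder to make the sketch runnable; the study itself used WEKA-based tooling, not this code.

```python
def k_fold_indices(n_samples, k=10):
    """Split sample indices into k roughly equal, disjoint folds."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds

def cross_validate(samples, labels, train_fn, accuracy_fn, k=10):
    """Train on k-1 folds, test on the held-out fold; return mean accuracy."""
    folds = k_fold_indices(len(samples), k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(accuracy_fn(model,
                                  [samples[i] for i in held_out],
                                  [labels[i] for i in held_out]))
    return sum(scores) / len(scores)

# Toy demonstration: a "classifier" that always predicts the majority label.
def train_majority(samples, labels):
    return max(set(labels), key=labels.count)

def accuracy(model, samples, labels):
    return sum(1 for label in labels if label == model) / len(labels)

data = list(range(20))
emotions = ["happy"] * 12 + ["sad"] * 8
print(cross_validate(data, emotions, train_majority, accuracy, k=5))
```

With scikit-learn, `cross_val_score(clf, X, y, cv=10)` performs the same protocol for a real classifier such as an SVM or MLP.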
4

Hybrid Concatenated-Formant Expressive Speech Synthesizer For Kinesensic Voices

Chandra, Nishant 05 May 2007 (has links)
Traditional and commercial speech synthesizers are incapable of synthesizing speech with proper emotion or prosody. Conveying prosody in artificially synthesized speech is difficult because of the extreme variability in human speech. An arbitrary natural-language sentence can have different meanings depending on the speaker, speaking style, context, and many other factors. Most concatenated speech synthesizers use phonemes, the phonetic units defined by the International Phonetic Alphabet (IPA). The 50 phonemes in English are standardized and unique units of sound, but not of expression. An earlier work proposed an analogy between speech and music: "speech is music, music is speech." Speech data obtained from master practitioners trained in kinesensic voice is marked on a five-level intonation scale similar to the musical scale. From this speech data, 1324 unique expressive units, called expressemes®, are identified. The expressemes consist of melody and rhythm, which in digital signal processing are analogous to the pitch, duration, and energy of the signal. The expressemes have less acoustic and phonetic variability than phonemes, so they better convey prosody. The goal is to develop a speech synthesizer which exploits the prosodic content of expressemes in order to synthesize expressive speech from a small speech database. Creating a reasonably small database that captures multiple expressions is a challenge, because a complete set of speech segments may not be available to create an emotion. Methods are suggested whereby acoustic mathematical modeling creates missing prosodic speech segments from the base prosody unit. A new concatenated-formant hybrid speech synthesizer architecture is developed for this purpose, together with a pitch-synchronous, time-varying, frequency-warped wavelet-transform-based prosody manipulation algorithm for transformation between prosodies.
A time-varying frequency-warping transform is developed to smoothly concatenate the temporal and spectral parameters of adjacent expressemes into intelligible speech. Additionally, issues specific to expressive speech synthesis using expressemes are resolved; for example, ergodic hidden Markov model based expresseme segmentation, model creation for F0 and segment duration, and target- and join-cost calculation. The performance of the hybrid synthesizer is measured against a commercially available synthesizer using objective and perceptual evaluations. Subjects consistently rated the hybrid synthesizer better in five different perceptual tests: 70% of listeners rated the hybrid synthesis as more expressive, and 72% preferred it over the commercial synthesizer. The hybrid synthesizer also obtained a comparable mean opinion score.
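Target- and join-cost based unit selection, as mentioned above, is conventionally a shortest-path (Viterbi) search over candidate units: the target cost scores how well a unit matches the desired specification, and the join cost penalizes discontinuities between neighbours. The numeric "units" and cost functions below are illustrative stand-ins, not the hybrid synthesizer's actual implementation.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search: pick one candidate unit per target position,
    minimizing summed target costs plus join costs between neighbours."""
    n = len(targets)
    # best[i][c] = (cheapest total cost ending in unit c at position i, backpointer)
    best = [{c: (target_cost(targets[0], c), None) for c in candidates[0]}]
    for i in range(1, n):
        layer = {}
        for c in candidates[i]:
            prev, cost = min(
                ((p, best[i - 1][p][0] + join_cost(p, c)) for p in candidates[i - 1]),
                key=lambda pair: pair[1])
            layer[c] = (cost + target_cost(targets[i], c), prev)
        best.append(layer)
    # Backtrack from the cheapest final unit.
    last = min(best[-1], key=lambda c: best[-1][c][0])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# Toy example: units are scalars; target cost is distance to the target,
# join cost penalizes jumps between consecutive units.
targets = [1.0, 2.0, 3.0]
candidates = [[0.5, 1.5], [1.8, 4.0], [2.9, 0.0]]
path = select_units(targets, candidates,
                    target_cost=lambda t, c: abs(t - c),
                    join_cost=lambda a, b: 0.1 * abs(a - b))
print(path)  # → [1.5, 1.8, 2.9]
```

In a real synthesizer the units would be expressemes with spectral, F0, and duration features, and both costs would be weighted sums over those features.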
5

An affective personality for an embodied conversational agent

Xiao, He January 2006 (has links)
Curtin University's Embodied Conversational Agents (ECA) combine an MPEG-4-compliant Facial Animation Engine (FAE), a Text-To-Emotional-Speech Synthesiser (TTES), and a multi-modal Dialogue Manager (DM) that accesses a Knowledge Base (KB) and outputs Virtual Human Markup Language (VHML) text, which drives the TTES and FAE. A user enters a question and an animated ECA responds with a believable, affective voice and actions. However, this response is normally marked up in VHML by the KB developer to produce the required facial gestures and emotional display, whereas a real person reacts not by fixed rules but according to personality, beliefs, previous experiences, and training. This thesis details the design, implementation, and pilot-study evaluation of an Affective Personality Model for an ECA. It discusses the Email Agent system, which informs a user when they have email. The system, built in Curtin's ECA environment, has the personality traits of Friendliness, Extraversion, and Neuroticism. A small group of participants evaluated the Email Agent system to determine the effectiveness of the implemented personality system, and an analysis of the qualitative and quantitative results from questionnaires is presented.
6

Prosody modelling using machine learning techniques for neutral and emotional speech synthesis / Μοντελοποίηση προσωδίας με χρήση τεχνικών μηχανικής μάθησης στα πλαίσια ουδέτερης και συναισθηματικής συνθετικής ομιλίας

Λαζαρίδης, Αλέξανδρος 11 August 2011 (has links)
In this doctoral dissertation, three proposed approaches were evaluated on two databases of different languages, one American-English and one Greek, and compared against state-of-the-art models in the phone-duration modelling task. The SVR model outperformed all the other individual models evaluated in this dissertation; its advantage lies mainly in coping better with high-dimensional feature spaces than the other models used in phone-duration modelling, which makes it appropriate even when the amount of training data is small relative to the size of the feature set. The proposed fusion scheme, which exploits the observation that different prediction algorithms perform better under different conditions, improved phone-duration prediction accuracy over the best individual model (SVR) when implemented with SVR (SVR-fusion), and also reduced the outliers relative to that model. Moreover, the proposed two-stage scheme, which uses individual phone-duration models as feature constructors in the first stage and feature vector extension (FVE) in the second, implemented with SVR (SVR-FVE), improved prediction accuracy over both the best individual predictor (SVR) and the SVR-fusion scheme, and further reduced the outliers relative to the other two schemes. The SVR two-stage scheme thus confirms SVR's advantage over the other algorithms in coping well with high-dimensional feature sets. The improved accuracy of phone-duration modelling contributes to better control of prosody, and thus to the quality of synthetic speech.
Furthermore, the first proposed method (SVR) was also evaluated on phone-duration modelling in emotional speech, outperforming all the state-of-the-art models in every emotional category. Finally, perceptual tests were performed to evaluate the impact of the proposed phone-duration models on synthetic speech; for both databases they confirmed the objective results, showing the improvement achieved by the proposed models in the naturalness of the synthesized speech. / This doctoral dissertation addresses problems in the field of speech technology, aiming at prosody modelling with machine learning techniques for neutral and emotional synthetic speech. Three novel prosody-modelling methods were studied and evaluated, with objective tests and with subjective speech-quality tests, for their contribution to improving the quality of synthetic speech. The first phone-duration modelling technique is based on Support Vector Regression (SVR), a method not previously used for phone-duration prediction; it was compared against, and outperformed, all the state-of-the-art methods in phone-duration modelling. The second technique models phone duration with a fusion model over multiple predictions: the duration predictions of a set of independent models are fed as input to a machine learning model that combines their outputs, achieving more accurate duration modelling and additionally reducing the large errors (outliers), i.e. the errors lying far from the mean error.
The third technique is a two-stage phone-duration modelling method with construction of new features and extension of the feature vector. In the first stage, a set of independent phone-duration prediction models, used as constructors of new features, enrich the feature vector; in the second stage, the enriched vector is used to train a duration prediction model that achieves higher performance than all the previous methods and further reduces the large errors. In addition, the first method was applied to emotional speech, where the proposed SVR model achieves the highest performance compared with all the state-of-the-art models. Finally, subjective speech-quality tests were carried out to assess the contribution of the three proposed methods to improving the quality of synthetic speech; these tests confirmed the value of the proposed methods and their contribution to synthetic speech quality.
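The fusion and feature-vector-extension (FVE) schemes described above share one mechanical step: base duration predictors are trained first, and their predictions either feed a combiner model (fusion) or are appended to the original feature vector for a second-stage model (FVE). The tiny base predictors below are illustrative placeholders standing in for the dissertation's SVR and other models.

```python
def train_mean_model(X, y):
    """Trivial base predictor: always predicts the training-set mean duration."""
    m = sum(y) / len(y)
    return lambda x: m

def train_last_feature_model(X, y):
    """Trivial base predictor: scales the last feature by a fitted ratio."""
    ratio = sum(y) / max(sum(row[-1] for row in X), 1e-9)
    return lambda x: ratio * x[-1]

def extend_features(X, base_models):
    """FVE first stage: append each base model's prediction to every vector.
    A second-stage model is then trained on these enriched vectors."""
    return [row + [model(row) for model in base_models] for row in X]

# Toy phone-duration data: feature vectors -> durations (ms).
X = [[1.0, 80.0], [1.0, 100.0], [2.0, 120.0]]
y = [82.0, 101.0, 119.0]
bases = [train_mean_model(X, y), train_last_feature_model(X, y)]
X_ext = extend_features(X, bases)
print(len(X_ext[0]))  # → 4: the original 2 features plus 2 base predictions
```

The fusion variant uses only the base predictions as the combiner's input, whereas FVE keeps the original features as well, which is why the two-stage scheme benefits from a model that handles high-dimensional feature sets.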
7

El discurso emocional como estrategia de comunicación en entrevistas en vivo. Análisis del caso del candidato presidencial Rafael López Aliaga en la primera vuelta del proceso electoral 2021 en Perú (octubre 2020 a abril 2021) / Emotional speech as a communication strategy during live interviews. Analysis of the case of the presidential candidate Rafael Lopez Aliaga during the first round of the 2021 electoral process in Peru (October 2020 - April 2021)

Ponce Campos, Geraldine Joyce 25 October 2021 (has links)
Emotions have become a weapon in political campaigns. The use of emotion in a candidate's speech provides a distinctive value proposition. A politician's exposure, whether for national, regional, or local office, gives the population the chance to learn who the candidate is, or at least the image they wish to project. The Renovación Popular candidate, Rafael López Aliaga, built his political campaign image on controversial statements, which coincided with the rise of his name in opinion polls, grounded in a range of emotions and following a strategy similar to that of Donald Trump or Jair Bolsonaro. This qualitative study makes it possible to understand the role emotional discourse played in a large-scale political campaign, the 2021 general elections in Peru.
In this way, the analysis allows us to understand the relation between the political and the emotional, how the two converge, and how the different uses of emotionality affect the communication strategy. / Tesis
