1. Emotion Recognition Using Glottal and Prosodic Features. Iliev, Alexander Iliev, 21 December 2009
Emotion conveys the psychological state of a person. It is expressed by a variety of physiological changes, such as changes in blood pressure, heart rate, and degree of sweating, and can be manifested in shaking, changes in skin coloration, facial expression, and the acoustics of speech. This research focuses on the recognition of emotion conveyed in speech. There were three main objectives of this study. One was to examine the role played by the glottal source signal in the expression of emotional speech. The second was to investigate whether it can provide improved robustness in real-world situations and in noisy environments; this was achieved through testing in clean and various noisy conditions. Finally, the performance of glottal features was compared to diverse existing and newly introduced emotional feature domains. A novel glottal symmetry feature is proposed and automatically extracted from speech. The effectiveness of several inverse filtering methods in extracting the glottal signal from speech was examined. In addition to glottal symmetry, two further feature classes were tested as emotion recognition domains: the Tones and Break Indices (ToBI) of American English intonation, and Mel Frequency Cepstral Coefficients (MFCC) of the glottal signal. Three corpora were specifically designed for the task. The first two investigated four emotions: Happy, Angry, Sad, and Neutral; the third added Fear and Surprise for a six-emotion recognition task. This work shows that the glottal signal carries valuable emotional information and that using it for emotion recognition has many advantages over other conventional methods. For clean speech in the four-emotion recognition task, classical prosodic features achieved 89.67% recognition, ToBI combined with classical features reached 84.75%, while glottal symmetry alone achieved 98.74%. For the six-emotion task these three methods achieved recognition rates of 79.62%, 90.39%, and 85.37%, respectively. Using the glottal signal also provided greater classifier robustness under noisy conditions and distortion caused by low-pass filtering. Specifically, for additive white Gaussian noise at SNR = 10 dB in the six-emotion task, the classical features and the classical features combined with ToBI both failed to provide successful results; speech MFCCs achieved a recognition rate of 41.43% and glottal symmetry reached 59.29%. This work has shown that the glottal signal, and the glottal symmetry in particular, provides high class separation for both the four- and six-emotion cases, confidently surpassing the performance of all other features included in this investigation under noisy speech conditions and in most clean-signal conditions.
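The MFCC feature domain mentioned above can be illustrated with a short sketch. This is a minimal example assuming librosa is available and that a glottal waveform has already been obtained by inverse filtering; the thesis's specific inverse-filtering and glottal-symmetry extraction steps are not shown, and the file name and parameter values are illustrative assumptions.

```python
# Minimal sketch: MFCC features from an (assumed) glottal waveform.
# "glottal.wav" and all parameter values are illustrative, not the
# thesis's actual configuration.
import librosa
import numpy as np

signal, sr = librosa.load("glottal.wav", sr=16000)   # hypothetical file

# 13 MFCCs per ~25 ms frame with a 10 ms hop.
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

# A simple utterance-level feature vector: per-coefficient mean and std,
# the kind of summary statistics typically fed to a classifier.
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(features.shape)  # (26,)
```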
2. The automatic recognition of emotions in speech. Manamela, Phuti John, January 2020
Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2020 / Speech emotion recognition (SER) refers to a technology that enables machines to detect and recognise human emotions from spoken phrases. In the literature, numerous attempts have been made to develop systems that can recognise human emotions from their voice; however, not much work has been done in the context of South African indigenous languages. The aim of this study was to develop an SER system that can classify and recognise six basic human emotions (i.e., sadness, fear, anger, disgust, happiness, and neutral) from speech spoken in Sepedi (one of South Africa's official languages). One of the major challenges encountered in this study was the lack of a proper corpus of emotional speech. Therefore, three different Sepedi emotional speech corpora consisting of acted speech data were developed. These include a Recorded-Sepedi corpus collected from recruited native speakers (9 participants), a TV-broadcast corpus collected from professional Sepedi actors, and an Extended-Sepedi corpus which is a combination of the Recorded-Sepedi and TV-broadcast emotional speech corpora. Features were extracted from the speech corpora and a data file was constructed. This file was used to train four machine learning (ML) algorithms (i.e., SVM, KNN, MLP and Auto-WEKA) using a 10-fold cross-validation method. Three experiments were then performed on the developed speech corpora and the performance of the algorithms was compared. The best results were achieved when Auto-WEKA was applied in all the experiments. Good results might have been expected for the TV-broadcast speech corpus since it was collected from professional actors; however, the results showed otherwise. From the findings of this study, one can conclude that there are no precise or exact techniques for the development of SER systems; it is a matter of experimenting and finding the best technique for the study at hand. The study has also highlighted the scarcity of SER resources for South African indigenous languages. The quality of the dataset plays a vital role in the performance of SER systems. / National Research Foundation (NRF) and Telkom Center of Excellence (CoE)
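As a rough illustration of the classifier comparison described above, the sketch below trains SVM, KNN, and MLP models with 10-fold cross-validation. It uses scikit-learn rather than the WEKA/Auto-WEKA toolchain actually used in the study, and the feature matrix X and label vector y are placeholders for the features extracted from the Sepedi corpora.

```python
# Hedged sketch: comparing classifiers with 10-fold cross-validation.
# scikit-learn stands in for the WEKA/Auto-WEKA setup used in the study;
# X (features) and y (emotion labels) are random placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(180, 26))            # placeholder feature matrix
y = rng.integers(0, 6, size=180)          # six emotion classes

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "MLP": MLPClassifier(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```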
3. Studentų emocinės būklės testavimo metu tyrimas panauduojant biometrines technologijas / Research of Students' Emotional State During Testing Using Biometric Technologies. Vlasenko, Andrej, 29 March 2012
The dissertation examines the development of a computer system capable of determining a person's psycho-emotional state from features of their voice signal. A system for measuring pupil diameter is also presented. The main objects of the research are therefore human voice features and the dynamics of pupil-size change. The main aim of the dissertation is to develop methodologies and algorithms for automatically processing and analysing voice signal features. The intended application of these algorithms is stress management system software. The work addresses several principal tasks: the possibilities of identifying a speaker's psycho-emotional state by analysing the speaker's voice, and the dynamics of pupil-size change.
The dissertation consists of an introduction, four chapters, a summary of the results, a list of references, and a list of the author's publications on the topic of the dissertation.
The introduction discusses the research problem and the relevance of the work, describes the object of research, formulates the aim and tasks of the work, and presents the research methodology, the scientific novelty of the work, the practical significance of the results, and the defended statements. The introduction ends by presenting the author's publications and conference presentations on the topic of the dissertation and the structure of the dissertation.
The first chapter presents the Recommended Biometric Stress Management System, developed on the basis of an analysis of a person's biometric and physiological features. The system can help determine the level of negative stress... [see the full text] / The dissertation investigates the issues of creating a computer system that uses voice signal features to determine a person's emotional state. In addition, a system for measuring pupil diameter is presented. The main objects of the research include emotion recognition from speech and the dynamics of eye-pupil size change. The main purpose of this dissertation is to employ suitable methodologies and algorithms to automatically process and analyse human voice parameters. The created algorithms can be used in stress management system software. The dissertation also focuses on researching the possibilities of identifying a speaker's psycho-emotional state by applying the analysis of the speaker's voice parameters and the analysis of the dynamics of eye-pupil size change.
The dissertation consists of an Introduction, four chapters, Conclusions, and References.
The introduction reveals the investigated problem, the importance of the thesis, and the object of research, and describes the purpose and tasks of the work, the research methodology, the scientific novelty, the practical significance of the results examined in the work, and the defended statements. The introduction ends by presenting the author's publications on the subject of the defended dissertation, the presentations made at conferences, and the structure of the dissertation.
Chapter 1 presents the Recommended Biometric Stress Management System, founded on speech analysis. The system can assist in determining the level of... [to full text]
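As a small illustration of the kind of automatic voice-parameter processing described above, the sketch below extracts fundamental-frequency (F0) and energy statistics from a recording with librosa. The file name, parameter ranges, and choice of features are illustrative assumptions, not the methods actually developed in the dissertation.

```python
# Hedged sketch: simple voice parameters (F0 statistics and RMS energy)
# of the sort a stress/emotion analysis pipeline might consume.
# "answer.wav" and all parameter values are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("answer.wav", sr=16000)  # hypothetical recording

f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
f0 = f0[voiced_flag]                     # keep voiced frames only
rms = librosa.feature.rms(y=y)[0]

features = {
    "f0_mean": float(np.nanmean(f0)),
    "f0_std": float(np.nanstd(f0)),
    "rms_mean": float(rms.mean()),
}
print(features)
```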
4. Optimization techniques for speech emotion recognition. Sidorova, Julia, 15 December 2009
There are three innovative aspects. First, a novel algorithm for computing the emotional content of an utterance, with a hybrid design that combines statistical learning and syntactic information. Second, an extension for feature selection that allows the weights to be adapted, thereby increasing the flexibility of the system. Third, a proposal for incorporating high-level features into the system; these features, combined with the low-level features, improve the system's performance. / The first contribution of this thesis is a speech emotion recognition system called ESEDA, capable of recognizing emotions in different languages. The second contribution is the classifier TGI+. First, objects are modeled by means of a syntactic method and then, with a statistical method, the mappings of the samples are classified, not their feature vectors. TGI+ outperforms the state-of-the-art top performer on a benchmark data set of acted emotions. The third contribution is high-level features, which are distances from a feature vector to the tree automata accepting class i, for all i in the set of class labels. The set of low-level features and the set of high-level features are concatenated and the resulting set is submitted to the feature selection procedure. The classification step is then done in the usual way. Testing on a benchmark dataset of authentic emotions showed that this classification strategy outperforms the state-of-the-art top performer.
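The low-level/high-level feature pipeline described above can be sketched roughly as: concatenate the two feature sets, run feature selection, then classify. The scikit-learn sketch below is only an approximation under stated assumptions; the thesis's actual tree-automata distance features and the TGI+ classifier are not reproduced, and the placeholder arrays stand in for real features.

```python
# Hedged sketch: concatenate low-level and high-level features, apply
# feature selection, then classify. Placeholder data; the thesis's
# tree-automata distances and TGI+ classifier are not modeled.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
low_level = rng.normal(size=(n, 40))      # e.g. acoustic descriptors
high_level = rng.normal(size=(n, 7))      # e.g. per-class distances
y = rng.integers(0, 7, size=n)            # emotion labels

X = np.hstack([low_level, high_level])    # concatenated feature set

pipeline = make_pipeline(SelectKBest(f_classif, k=20), SVC())
print(cross_val_score(pipeline, X, y, cv=5).mean())
```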
5. Decisional-Emotional Support System for a Synthetic Agent: Influence of Emotions in Decision-Making Toward the Participation of Automata in Society. Guerrero Razuri, Javier Francisco, January 2015
Emotion influences our actions, and this means that emotion has subjective decision value. The emotions of those affected by decisions, properly interpreted and understood, provide feedback on actions and, as such, serve as a basis for decisions. Accordingly, "affective computing" represents a wide range of technological opportunities for implementing emotions to improve human-computer interaction, including insights across a range of contexts in the computational sciences into how we can design computer systems that communicate with humans and recognize their emotional states. Today, emotional systems such as software-only agents and embodied robots seem to improve every day at managing large volumes of information, yet they remain emotionally incapable of reading our feelings and reacting to them. From a computational viewpoint, technology has made significant steps in determining how an emotional behavior model could be built; such a model is intended to be used for the purpose of intelligent assistance and support to humans. Human emotions are engines that allow people to generate useful responses to the current situation, taking into account the emotional states of others. Recovering the emotional cues emanating from the natural behavior of humans, such as facial expressions and bodily kinetics, could help to develop systems that allow recognition, interpretation, processing, simulation, and basing decisions on human emotions. Currently, there is a need to create emotional systems able to develop an emotional bond with users, reacting emotionally to encountered situations with the ability to help, assisting users to make their daily life easier. Handling emotions and their influence on decisions can improve human-machine communication with a wider vision. The present thesis strives to provide an emotional architecture applicable to an agent, based on a group of decision-making models influenced by external emotional information provided by humans and acquired through a group of classification techniques from machine learning. The system can form positive bonds with the people it encounters when proceeding according to their emotional behavior. The agent embodied in the emotional architecture will interact with a user, facilitating its adoption in application areas such as caregiving, to provide emotional support to the elderly. The agent's architecture uses an adversarial structure based on an Adversarial Risk Analysis framework with a decision-analytic flavor that includes models forecasting a human's behavior and its impact on the surrounding environment. The agent perceives its environment and the actions performed by an individual, which constitute the resources needed to execute the agent's decision during the interaction. The agent's decision, carried out from the adversarial structure, is also affected by information about emotional states provided by a classifier-ensemble system, giving rise to a "decision with emotional connotation" belonging to the group of affective decisions. The performance of different well-known classifiers was compared in order to select the best and build the ensemble system, based on feature selection methods introduced to predict the emotion. These methods draw on facial expressions, bodily gestures, and speech, and achieved satisfactory accuracy well before the final system.
/ At the time of the doctoral defense, the following paper was unpublished and had the following status: Paper 8: Accepted.
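A rough sketch of the classifier-ensemble idea mentioned in the abstract above: compare several well-known classifiers and combine them by voting. This uses scikit-learn with placeholder data and is only an approximation; the thesis's actual feature sets (facial expressions, gestures, speech) and ensemble design are not reproduced.

```python
# Hedged sketch: compare individual classifiers, then combine them into
# a voting ensemble. Placeholder data; not the thesis's actual design.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))           # stand-in for multimodal features
y = rng.integers(0, 6, size=300)         # six emotion classes

members = [
    ("svm", SVC(probability=True)),
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier()),
]
for name, clf in members:
    print(name, cross_val_score(clf, X, y, cv=5).mean())

ensemble = VotingClassifier(estimators=members, voting="soft")
print("ensemble", cross_val_score(ensemble, X, y, cv=5).mean())
```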
6. Speech Emotion Recognition from Raw Audio using Deep Learning / Känsloigenkänning från rå ljuddata med hjälp av djupinlärning. Rintala, Jonathan, January 2020
Traditionally, in Speech Emotion Recognition, models require a large number of manually engineered features and intermediate representations such as spectrograms for training. However, hand-engineering such features often requires both expert domain knowledge and resources. Recently, with the emerging paradigm of deep learning, end-to-end models that extract features themselves and learn directly from the raw speech signal have been explored. A previous approach has been to combine multiple parallel CNNs with different filter lengths to extract multiple temporal features from the audio signal, and then feed the resulting sequence to a recurrent block. Other recent work also reports high accuracies when utilizing local feature learning blocks (LFLBs) for reducing the dimensionality of a raw audio signal, extracting the most important information. Thus, this study combines the idea of LFLBs for feature extraction with a block of parallel CNNs with different filter lengths for capturing multi-temporal features; this is finally fed into an LSTM layer for global contextual feature learning. To the best of our knowledge, such a combined architecture has not yet been properly investigated. Further, this study investigates different configurations of such an architecture. The proposed model is then trained and evaluated on the well-known speech databases EmoDB and RAVDESS, both in a speaker-dependent and speaker-independent manner. The results indicate that the proposed architecture can produce results comparable with the state of the art, despite excluding data augmentation and advanced pre-processing. It was found that 3 parallel CNN pipes yielded the highest accuracy, together with a series of modified LFLBs that utilize average pooling and ReLU activation. This shows the power of leaving the feature learning up to the network and opens up interesting future research on time complexity and the trade-off between introducing complexity in the pre-processing or in the model architecture itself. / Traditionally, in speech-based emotion recognition, models require a large number of manually engineered attributes and intermediate representations, such as spectrograms, for training. But hand-crafting such attributes often requires both domain-specific expert knowledge and resources. Recently, deep learning's emerging end-to-end models, which extract attributes and learn directly from the raw audio signal, have been investigated. A previous approach has been to combine parallel CNNs with different filter lengths to extract several temporal attributes from the audio signal and then pass the resulting sequence into a so-called recurrent neural network. Other earlier studies have also reached high accuracy when using local feature learning blocks (LFLBs) to reduce the dimensionality of the raw audio signal, thereby extracting the most important information from the audio. Thus, this study combines the idea of using LFLBs for attribute extraction with a block of parallel CNNs with different filter lengths to capture multi-temporal attributes; this is finally fed into an LSTM layer for global learning of contextual information. To the best of our knowledge, such a combined architecture has not yet been investigated. Furthermore, this study investigates different configurations of such an architecture.
The proposed model is then trained and evaluated on the well-known speech databases EmoDB and RAVDESS, using both a speaker-dependent and a speaker-independent approach. The results indicate that the proposed architecture can produce results comparable with the state of the art, even though no data augmentation or advanced pre-processing was included. It is reported that 3 parallel CNN layers gave the highest accuracy, together with a series of modified LFLBs that use average pooling and ReLU as the activation function. This shows the advantages of leaving the learning of attributes to the network and opens up interesting future research on time complexity and the trade-off between introducing complexity in the pre-processing or in the model architecture itself.
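The LFLB + parallel-CNN + LSTM architecture described above can be sketched roughly as follows. This is a minimal PyTorch sketch under assumed layer sizes, kernel lengths, and pooling factors; the class names and all hyperparameters are illustrative and do not reproduce the thesis's exact configuration.

```python
# A minimal PyTorch sketch of the combined architecture described above:
# LFLB-style blocks (Conv1d + BatchNorm + ReLU + AvgPool) reduce the raw
# waveform, three parallel Conv1d branches with different kernel sizes
# capture multi-temporal features, and an LSTM models global context.
# Layer sizes and kernel lengths are illustrative assumptions.
import torch
import torch.nn as nn

class LFLB(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=3, pool=4):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(),
            nn.AvgPool1d(pool),
        )
    def forward(self, x):
        return self.block(x)

class RawAudioSER(nn.Module):
    def __init__(self, n_classes=7):    # e.g. 7 classes for EmoDB
        super().__init__()
        # Stacked LFLBs shrink the raw signal before the parallel branches.
        self.lflbs = nn.Sequential(LFLB(1, 32), LFLB(32, 64), LFLB(64, 64))
        # Three parallel CNN "pipes" with different filter lengths.
        self.branches = nn.ModuleList(
            [nn.Conv1d(64, 32, k, padding=k // 2) for k in (3, 9, 27)]
        )
        self.lstm = nn.LSTM(input_size=96, hidden_size=64, batch_first=True)
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, wav):              # wav: (batch, 1, samples)
        x = self.lflbs(wav)
        x = torch.cat([b(x) for b in self.branches], dim=1)  # (batch, 96, T)
        x = x.transpose(1, 2)            # LSTM expects (batch, T, features)
        _, (h, _) = self.lstm(x)
        return self.classifier(h[-1])    # logits over emotion classes

logits = RawAudioSER()(torch.randn(2, 1, 16000))  # e.g. 1 s of 16 kHz audio
```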
7. Hotspot Detection for Automatic Podcast Trailer Generation / Hotspot-detektering för automatisk generering av podcast-trailers. Zhu, Winstead Xingran, January 2021
With podcasts being a fast-growing audio-only form of media, an effective way of promoting different podcast shows becomes more and more vital to all the stakeholders concerned, including the podcast creators, the podcast streaming platforms, and the podcast listeners. This thesis investigates the relatively little-studied topic of automatic podcast trailer generation, with the purpose of enhancing the overall visibility and publicity of different podcast contents and generating more user engagement in podcast listening. This thesis takes a hotspot-based approach, by specifically defining the vague concept of "hotspot" and designing appropriate methods for hotspot detection. Different methods are analyzed and compared, and the best methods are selected. The selected methods are then used to construct an automatic podcast trailer generation system, which consists of four major components and one schema to coordinate the components. The system can take an arbitrary podcast episode audio as input and generate an approximately one-minute-long trailer for it. This thesis also proposes two human-based podcast trailer evaluation approaches, and the evaluation results show that the proposed system outperforms the baseline by a large margin and achieves promising results in terms of both aesthetics and functionality.
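As a loose illustration of the hotspot-to-trailer step described above, the sketch below greedily selects the highest-scoring audio segments until a roughly one-minute budget is filled. The scoring function, segment representation, and time budget are assumptions for illustration; the thesis's actual components and coordination schema are not reproduced.

```python
# Hedged sketch: assemble a ~60-second trailer from scored segments.
# Segment boundaries and hotspot scores are assumed to come from an
# upstream detection step that is not shown here.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    score: float   # hotspot score from some detector

def select_trailer_segments(segments, budget_s=60.0):
    """Greedily pick high-score segments until the time budget is used."""
    chosen, used = [], 0.0
    for seg in sorted(segments, key=lambda s: s.score, reverse=True):
        length = seg.end - seg.start
        if used + length <= budget_s:
            chosen.append(seg)
            used += length
    # Keep the original episode order so the trailer stays coherent.
    return sorted(chosen, key=lambda s: s.start)

episode = [Segment(10, 25, 0.9), Segment(300, 330, 0.8), Segment(50, 70, 0.4)]
print(select_trailer_segments(episode))
```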
8. Prediktion av användaromdömen om språkcafé-samtal baserat på automatisk röstanalys / Prediction of user ratings of language cafe conversations based on automatic voice analysis. Hansson Svan, Angus; Mannerstråle, Carl, January 2019
Spoken communication between humans generates information in two channels: the primary channel is linked to the syntactic-semantic part of the speech (what a person is literally saying), while the secondary channel conveys paralinguistic information (tone, emotional state, and gestures). This study examines the paralinguistic part of speech, more specifically the tone and emotional state. The study examines whether there is a correlation between human speech and the opinion of a participant in a language-café conversation. The language-café conversations are moderated by the social robot platform Furhat, created by Furhat Robotics. The report is written from two perspectives. From a data science view, emotions identified in audio files are analysed with machine learning algorithms and mathematical models. Vokaturi, an emotion recognition software package, analyses the audio files and quantifies the emotional attributes. The classification model is based upon these attributes and the answers from the language-café survey. Speech emotion recognition is also evaluated as a method for gathering customer opinions in a customer feedback loop. The results show an accuracy of 61% and indicate that some sort of prediction is possible. However, there is no clear correlation between the recorded human voice and the participant's opinion of the conversation. The discussion analyses the difficulties of creating a high-accuracy model with the current data and contains a hypothetical analysis of the model as a method for gathering customer data. / A person who speaks conveys information through a primary and a secondary channel. The primary channel is linked to the syntactic semantics of the speech (what the person is literally saying), while the secondary channel is linked to the paralinguistic part (tone, emotional state, and gestures). This study examines the paralinguistic part of speech, more specifically a person's tone and emotion. The study investigates whether there is any correlation between human speech and what the person thinks of a language-café conversation. The conversations in this study were conducted with the social robot Furhat, created by Furhat Robotics. The report is written from two perspectives. From a data-engineering perspective, emotional expressions in audio files are analysed using machine learning and mathematical models. Using Vokaturi, which provides software for emotion recognition from audio, the recorded conversations are analysed and attributes for different emotions are quantified. The classification model is then built from these attributes, the answers to questionnaires (part one), and audio files annotated by the authors themselves (part two). In addition, emotion recognition is analysed as a method for collecting user opinions from a business perspective. The results show an accuracy of about 62% and 61% for parts one and two, respectively, and indicate that some form of prediction is possible. A clear relationship between a participant's voice and their opinion of the conversation is, however, difficult to establish from these results. The analysis and conclusions discuss the difficulties of developing a functional model with the available data, as well as a hypothetical discussion of the model as part of a customer feedback loop.
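A rough sketch of the classification step described above: given per-recording emotion attributes (for example the emotion probabilities reported by a tool such as Vokaturi) and the corresponding survey answers, fit a simple classifier. The feature layout, label encoding, and model choice here are assumptions for illustration, not the study's actual pipeline, and the Vokaturi extraction itself is not shown.

```python
# Hedged sketch: predict a participant's conversation rating from
# emotion attributes. X rows would hold per-recording emotion
# probabilities extracted by an emotion recognition tool; here they are
# random placeholders, and the binary "liked the conversation" labels
# are an illustrative encoding of survey answers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=120)   # placeholder emotion probabilities
y = rng.integers(0, 2, size=120)          # placeholder survey opinions

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())
```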