1. Identifying Speaker State from Multimodal Cues. Yang, Zixiaofan. January 2021.
Automatic identification of speaker state is essential for spoken language understanding, with broad potential in real-world applications. However, most existing work has focused on recognizing a limited set of emotional states using cues from a single modality. This thesis describes my research addressing these limitations by studying a wide range of speaker states, including emotion and sentiment, humor, and charisma, using features from the speech, text, and visual modalities.
The first part of this thesis focuses on emotion and sentiment recognition in speech. Emotion and sentiment recognition is one of the most studied topics in speaker state identification and has gained increasing attention in speech research, with new emotional speech models and datasets published every year. However, most work focuses only on recognizing a set of discrete emotions in high-resource languages such as English, while in real-life conversations emotion changes continuously and exists in all spoken languages. To address this mismatch, we propose a deep neural network model that recognizes continuous emotion by combining inputs from raw waveform signals and spectrograms. Experimental results on two datasets show that the proposed model achieves state-of-the-art results by exploiting both waveforms and spectrograms as input. Because low-resource languages have far more textual sentiment models than speech-based ones, we also propose a method that bootstraps sentiment labels from text transcripts and uses these labels to train a speech sentiment classifier. By exploiting the speaker state information shared across modalities, we extend speech sentiment recognition from high-resource to low-resource languages. Moreover, using the natural verse-level alignment of audio Bibles across different languages, we also explore cross-lingual and cross-modal sentiment transfer.
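To make the dual-input idea concrete, here is a minimal sketch of such an architecture, assuming PyTorch; the layer sizes, pooling choices, and the two-dimensional (arousal, valence) output are illustrative assumptions, not the exact model used in the thesis.

```python
# Hypothetical sketch of a dual-input network for continuous emotion
# recognition: one branch over the raw waveform, one over the
# spectrogram, fused into a regression head. All sizes are illustrative.
import torch
import torch.nn as nn

class DualInputEmotionNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Branch 1: 1-D convolutions over the raw waveform.
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=16), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Branch 2: 2-D convolutions over the spectrogram.
        self.spec_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fused features regress continuous scores, e.g. (arousal, valence).
        self.head = nn.Sequential(
            nn.Linear(64 + 64, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, waveform, spectrogram):
        w = self.wave_branch(waveform).squeeze(-1)    # (B, 64)
        s = self.spec_branch(spectrogram).flatten(1)  # (B, 64)
        return self.head(torch.cat([w, s], dim=1))    # (B, 2)

# Example: a batch of 4 one-second 16 kHz clips and their spectrograms.
model = DualInputEmotionNet()
wave = torch.randn(4, 1, 16000)
spec = torch.randn(4, 1, 64, 100)
print(model(wave, spec).shape)  # torch.Size([4, 2])
```

Late fusion of the two branches, as sketched here, is one simple way to let the network exploit complementary information in the time-domain and time-frequency representations.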
In the second part of the thesis, we focus on recognizing humor, whose expression is related to emotion and sentiment but has very different characteristics. Unlike emotion and sentiment, which can be identified by crowdsourced annotators, humorous expressions are highly individualistic and culture-specific, making reliable labels hard to obtain. The resulting scarcity of humor-annotated data leads us to propose two methods for labeling humor automatically and reliably. First, we develop a framework for generating humor labels on videos by learning from extensive user-generated comments. We collect and analyze 100 videos and build multimodal humor detection models using speech, text, and visual features, which achieve an F1-score of 0.76. In addition to humorous videos, we develop a second framework for generating humor labels on social media posts by learning from user reactions to Facebook posts. We collect 785K posts with humor and non-humor scores and build models that detect humor with performance comparable to human labelers.
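As a rough illustration of reaction-based weak labeling, the sketch below scores a post by its smoothed share of "haha" reactions and keeps only confidently humorous or non-humorous posts; the scoring function and thresholds are hypothetical assumptions, not the labeling procedure used in the thesis.

```python
# Hypothetical sketch of deriving weak humor labels from reaction counts.
# The "haha" share, smoothing, and thresholds are illustrative.
def humor_score(reactions: dict, smoothing: float = 1.0) -> float:
    """Return the smoothed share of 'haha' reactions on a post."""
    total = sum(reactions.values())
    haha = reactions.get("haha", 0)
    return (haha + smoothing) / (total + 2 * smoothing)

def label_post(reactions: dict, humor_thresh=0.5, nonhumor_thresh=0.05):
    """Map a reaction histogram to a weak humor label, or None if ambiguous."""
    score = humor_score(reactions)
    if score >= humor_thresh:
        return "humor"
    if score <= nonhumor_thresh:
        return "non-humor"
    return None  # ambiguous posts are discarded

print(label_post({"like": 20, "haha": 80}))  # humor
print(label_post({"like": 95, "haha": 2}))   # non-humor
```

Discarding the ambiguous middle band is a common trick in weak supervision: it trades dataset size for cleaner labels.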
The third part of the thesis focuses on charisma, a commonly observed but less studied speaker state with unique challenges: the definition of charisma varies considerably among perceivers, and the perception of charisma also varies with speakers' and perceivers' demographic backgrounds. To better understand charisma, we conduct the first gender-balanced study of charismatic speech, including speakers and raters from diverse backgrounds. We collect personality and demographic information from the raters, as well as samples of their own speech, and examine individual differences in the perception and production of charismatic speech. We also extend this work to politicians' speech by collecting speaker trait ratings on representative speech segments and studying how genre, speaker gender, and the rater's political stance influence the segments' charisma ratings.
2. Power control for mobile radio systems using perceptual speech quality metrics. Rohani Mehdiabadi, Behrooz. January 2007.
As the characteristics of mobile radio channels vary over time, transmit power must be controlled to keep the received signal level within the receiver's sensitivity. Modern mobile radio systems therefore employ power control to regulate the received signal level so that it is neither below nor excessively above the receiver sensitivity, maintaining adequate service quality. In this context, speech quality measurement is an important aspect of delivering speech services, since it affects both customer satisfaction and the usage of scarce system resources. A variety of speech quality measurement techniques has emerged in recent years from research on perceptual speech quality estimation; these are mainly based on psychoacoustic models of the human auditory system. However, such techniques cannot be applied directly to real-time communications, as they typically require copies of both the transmitted and received speech signals. This thesis presents a novel technique for incorporating perceptual speech quality metrics into power control for mobile radio systems. The technique allows standardized perceptual speech quality measurement algorithms to be used for in-service measurement of speech quality. The accuracy of the proposed Real-Time Perceptual Speech Quality Measurement (RTPSQM) technique is first validated by extensive simulations. On this basis, RTPSQM is applied to power control in the Global System for Mobile Communications (GSM) and the Universal Mobile Telecommunications System (UMTS). Simulations show that perceptual-based power control in GSM and UMTS outperforms conventional power control in reducing the transmit power required to provide adequate speech quality. This in turn increases system capacity and makes better use of available system resources. To enable an analytical performance assessment of perceptual speech quality metrics in power control, mathematical frameworks for conventional and perceptual-based power control are derived. The derivations are performed for Code Division Multiple Access (CDMA) systems and kept as generic as possible. Numerical results are presented that could be used in a system design to readily find the Erlang capacity per cell for either of the considered power control algorithms.
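To illustrate the general idea of quality-driven power control, the sketch below runs a fixed-step closed loop that raises transmit power when an in-service quality estimate (e.g., a MOS-like score such as RTPSQM might provide) falls below a target and lowers it otherwise; the step size, target, and power limits are illustrative assumptions, not the thesis's algorithm.

```python
# Hypothetical sketch of perceptual-quality-driven closed-loop power
# control. The fixed 1 dB step mirrors conventional closed-loop control;
# the target MOS and power limits are illustrative assumptions.
def power_control_step(tx_power_dbm: float, measured_mos: float,
                       target_mos: float = 3.5, step_db: float = 1.0,
                       p_min: float = -50.0, p_max: float = 24.0) -> float:
    """One control iteration: compare the estimated speech quality
    against the target and nudge transmit power by a fixed step."""
    if measured_mos < target_mos:
        tx_power_dbm += step_db   # quality too low: increase power
    else:
        tx_power_dbm -= step_db   # quality adequate: save power
    return min(max(tx_power_dbm, p_min), p_max)  # respect hardware limits

# Example run: quality estimates arriving once per control interval.
power = 0.0
for mos in [2.8, 3.1, 3.6, 3.9, 3.4]:
    power = power_control_step(power, mos)
    print(f"MOS {mos:.1f} -> tx power {power:+.1f} dBm")
```

Driving the loop with a perceptual quality estimate rather than a raw signal level is what lets such a controller stop increasing power once listeners would no longer notice the difference, which is the capacity gain the abstract describes.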