Global ETD Search

1	Paralinguistic event detection in children's speech Rao, Hrishikesh 07 January 2016 (has links) Paralinguistic events are useful indicators of the affective state of a speaker. These cues, in children's speech, are used to form social bonds with their caregivers. They have also been found to be useful in the very early detection of developmental disorders such as autism spectrum disorder (ASD) in children's speech. Prior work on children's speech has focused on the use of a limited number of subjects which don't have sufficient diversity in the type of vocalizations that are produced. Also, the features that are necessary to understand the production of paralinguistic events is not fully understood. To account for the lack of an off-the-shelf solution to detect instances of laughter and crying in children's speech, the focus of the thesis is to investigate and develop signal processing algorithms to extract acoustic features and use machine learning algorithms on various corpora. Results obtained using baseline spectral and prosodic features indicate the ability of the combination of spectral, prosodic, and dysphonation-related features that are needed to detect laughter and whining in toddlers' speech with different age groups and recording environments. The use of long-term features were found to be useful to capture the periodic properties of laughter in adults' and children's speech and detected instances of laughter to a high degree of accuracy. Finally, the thesis focuses on the use of multi-modal information using acoustic features and computer vision-based smile-related features to detect instances of laughter and to reduce the instances of false positives in adults' and children's speech. The fusion of the features resulted in an improvement of the accuracy and recall rates than when using either of the two modalities on their own. Paralinguistic Speech signal processing Pattern recognition
2	Single-Microphone Speech Dereverberation: Modulation Domain Processing and Quality Assessment ZHENG, CHENXI 25 July 2011 (has links) In a reverberant enclosure, acoustic speech signals are degraded by reflections from walls, ceilings, and objects. Restoring speech quality and intelligibility from reverberated speech has received increasing interest over the past few years. Although multiple channel dereverberation methods provide some improvements in speech quality/ intelligibility, single-channel dereverberation remains an open challenge. Two types of advanced single-channel dereverberation methods, namely acoustic domain spectral subtraction and modulation domain filtering, provide small improvement in speech quality and intelligibility. In this thesis, we study single-channel dereverberation algorithms. Firstly, an upper bound of time-frequency masking (TFM) performance for dereverberation is obtained using ideal time-frequency masking (ITFM). ITFM has access to both the clean and reverberated speech signals in estimating the binary-mask matrix. ITFM implements binary masking in the short time Fourier transform (STFT) domain, preserving only those spectral components less corrupted by reverberation. The experiment results show that single-channel ITFM outperforms four existing multi-channel dereverberation methods and suggest that large potential improvements could be obtained using TFM for speech dereverberation. Secondly, a novel modulation domain spectral subtraction method is proposed for dereverberation. This method estimates modulation domain long reverberation spectral variance (LRSV) from time domain LRSV using a statistical room impulse response (RIR) model and implements spectral subtraction in the modulation domain. On one hand, different from acoustic domain spectral subtraction, our method implements spectral subtraction in the modulation domain, which has been shown to play an important role in speech perception. On the other hand, different from modulation domain filtering which uses a time-invariant filter, our method takes the changes of reverberated speech spectral variance along time into account and implements spectral subtraction adaptively. Objective and informal subjective tests show that our proposed method outperforms two existing state-of-the-art single-channel dereverberation algorithms. / Thesis (Master, Electrical & Computer Engineering) -- Queen's University, 2011-07-20 03:18:30.021 Speech enhancement Dereverberation Speech signal processing
3	Implementation of i-vector algorithm in speech emotion recognition by using two different classifiers : Gaussian mixture model and support vector machine Gomes, Joan January 2016 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / Emotions are essential for our existence, as they exert great influence on the mental health of people. Speech is the most powerful mode to communicate. It controls our intentions and emotions. Over the past years many researchers worked hard to recognize emotion from speech samples. Many systems have been proposed to make the Speech Emotion Recognition (SER) process more correct and accurate. This thesis research discusses the design of speech emotion recognition system implementing a comparatively new method, i-vector model. I-vector model has found much success in the areas of speaker identification, speech recognition, and language identification. But it has not been much explored in recognition of emotion. In this research, i-vector model was implemented in processing extracted features for speech representation. Two different classification schemes were designed using two different classifiers - Gaussian Mixture Model (GMM) and Support Vector Machine (SVM), along with i-vector algorithm. Performance of these two systems was evaluated using the same emotional speech database to identify four emotional speech signals: Angry, Happy, Sad and Neutral. Results were analyzed, and more than 75% of accuracy was obtained by both systems, which proved that our proposed i-vector algorithm can identify speech emotions with less error and with more accuracy. Speech Signal Processing Emotion Recognition I-vector Algorithm
4	DSP Techniques for Performance Enhancement of Digital Hearing Aid Udayashankara, V 12 1900 (has links) Hearing impairment is the number one chronic disability affecting people in the world. Many people have great difficulty in understanding speech with background noise. This is especially true for a large number of elderly people and the sensorineural impaired persons. Several investigations on speech intelligibility have demonstrated that subjects with sensorineural loss may need a 5-15 dB higher signal-to-noise ratio than the normal hearing subjects. While most defects in transmission chain up to cochlea can nowadays be successfully rehabilitated by means of surgery, the great majority of the remaining inoperable cases are sensorineural hearing impaired, Recent statistics of the hearing impaired patients applying for a hearing aid reveal that 20% of the cases are due to conductive losses, more than 50% are due to sensorineural losses, and the rest 30% of the cases are of mixed origin. Presenting speech to the hearing impaired in an intelligible form remains a major challenge in hearing-aid research today. Even-though various methods have been suggested in the literature for the minimization of noise from the contaminated speech signals, they fail to give good SNR improvement and intelligibility improvement for moderate to-severe sensorineural loss subjects. So far, the power and capability of Newton's method, Nonlinear adaptive filtering methods and the feedback type artificial neural networks have not been exploited for this purpose. Hence we resort to the application of all these methods for improving SNR and intelligibility for the sensorineural loss subjects. Digital hearing aids frequently employ the concept of filter banks. One of the major drawbacks of this techniques is the complexity of computation requiring more number of multiplications. This increases the power consumption. Therefore this Thesis presents the new approach to speech enhancement for the hearing impaired and also the construction of filter bank in Digital hearing aid with minimum number of multiplications. The following are covered in this thesis. One of the most important application of adaptive systems is in noise cancellation using adaptive filters. The ANC setup requires two input signals (viz., primary and reference). The primary input consists of the sum of the desired signal and noise which is uncorrelated. The reference input consists of mother noise which is correlated in Some unknown way with noise of primary input. The primary signal is obtained by placing the omnidirectional microphone just above one ear on the head of the KEMAR mannikan and the reference signal is obtained by placing the hypercardioid microphone at the center of the vertebral column on the back. Conventional speech enhancement techniques use linear schemes for enhancing speech signals. So far Nonlinear adaptive filtering techniques are not used in hearing aid applications. The motivation behind the use of nonlinear model is that it gives better noise suppression as compared to linear model. This is because the medium through which signals reach the microphone may be highly nonlinear. Hence the use of linear schemes, though motivated by computational simplicity and mathematical tractability, may be suboptimal. Hence, we propose the use of nonlinear models to enhance the speech signals for the hearing impaired: We propose both Linear LMS and Nonlinear second order Volterra LMS schemes to enhance speech signals. Studies conducted for different environmental noise including babble, cafeteria and low frequency noise show that the second-order Volterra LMS performs better compared to linear LMS algorithm. We use measures such as signal-to-noise ratio (SNR), time plots, and intelligibility tests for performance comparison. We also propose an ANC scheme which uses Newton's method to enhance speech signals. The main problem associated with LMS based ANC is that their convergence is slow and hence their performance becomes poor for hearing aid applications. The reason for choosing Newton's method is that they have high performance adaptive-filtering methods that often converge and track faster than LMS method. We propose two models to enhance speech signals: one is conventional linear model and the other is a nonlinear model using a second order Volterra function. Development of Newton's type algorithm for linear mdel results in familiar Recursive least square (RLS) algorithm. The performance of both linear and non-linear Newton's algorithm is evaluated for babble, cafeteria and frequency noise. SNR, timeplots and intelligibility tests are used for performance comparison. The results show that Newton's method using Volterra nonlinearity performs better than RLS method. ln addition to the ANC based schemes, we also develop speech enhancement for the hearing impaired by using the feedback type neural network (FBNN). The main reason is that here we have parallel algorithm which can be implemented directly in hardware. We translate the speech enhancement problem into a neural network (NN) framework by forming an appropriate energy function. We propose both linear and nonlinear FBNN for enhancing the speech signals. Simulated studies on different environmental noise reveal that the FBNN using the Volterra nonlinearity is superior to linear FBNN in enhancing speech signals. We use SNR, time plots, and intelligibility tests for performance comparison. The design of an effective hearing aid is a challenging problem for sensorineural hearing impaired people. For persons with sensorineural losses it is necessary that the frequency response should be optimally fitted into their residual auditory area. Digital filter enhances the performance of the hearing aids which are either difficult or impossible to realize using analog techniques. The major problem in digital hearing aid is that of reducing power consumption. Multiplication is one of the most power consuming operation in digital filtering. Hence a serious effort has been made to design filter bank with minimum number of multiplications, there by minimizing the power consumption. It is achieved by using Interpolated and complementary FIR filters. This method gives significant savings in the number of arithmetic operations. The Thesis is concluded by summarizing the results of analysis, and suggesting scope for further investigation Electrical Communication Speech Signal Processing Digital Hearing Aid Speech Enhancement Digital Filter Bank Linear Model
5	Classification of affect using novel voice and visual features Kim, Jonathan Chongkang 07 January 2016 (has links) Emotion adds an important element to the discussion of how information is conveyed and processed by humans; indeed, it plays an important role in the contextual understanding of messages. This research is centered on investigating relevant features for affect classification, along with modeling the multimodal and multitemporal nature of emotion. The use of formant-based features for affect classification is explored. Since linear predictive coding (LPC) based formant estimators often encounter problems with modeling speech elements, such as nasalized phonemes and give inconsistent results for bandwidth estimation, a robust formant-tracking algorithm was introduced to better model the formant and spectral properties of speech. The algorithm utilizes Gaussian mixtures to estimate spectral parameters and refines the estimates using maximum a posteriori (MAP) adaptation. When the method was used for features extraction applied to emotion classification, the results indicate that an improved formant-tracking method will also provide improved emotion classification accuracy. Spectral features contain rich information about expressivity and emotion. However, most of the recent work in affective computing has not progressed beyond analyzing the mel-frequency cepstral coefficients (MFCC’s) and their derivatives. A novel method for characterizing spectral peaks was introduced. The method uses a multi-resolution sinusoidal transform coding (MRSTC). Because of MRSTC’s high precision in representing spectral features, including preservation of high frequency content not present in the MFCC’s, additional resolving power was demonstrated. Facial expressions were analyzed using 53 motion capture (MoCap) markers. Statistical and regression measures of these markers were used for emotion classification along the voice features. Since different modalities use different sampling frequencies and analysis window lengths, a novel classifier fusion algorithm was introduced. This algorithm is intended to integrate classifiers trained at various analysis lengths, as well as those obtained from other modalities. Classification accuracy was statistically significantly improved using a multimodal-multitemporal approach with the introduced classifier fusion method. A practical application of the techniques for emotion classification was explored using social dyadic plays between a child and an adult. The Multimodal Dyadic Behavior (MMDB) dataset was used to automatically predict young children’s levels of engagement using linguistic and non-linguistic vocal cues along with visual cues, such as direction of a child’s gaze or a child’s gestures. Although this and similar research is limited by inconsistent subjective boundaries, and differing theoretical definitions of emotion, a significant step toward successful emotion classification has been demonstrated; key to the progress has been via novel voice and visual features and a newly developed multimodal-multitemporal approach. Speech signal processing Machine learning Classifier fusion Multimodal analysis Emotion Affective computing Human behavior analysis Speech production
6	Μοντελοποίηση και ψηφιακή επεξεργασία προσωδιακών φαινομένων της ελληνικής γλώσσας με εφαρμογή στην σύνθεση ομιλίας / Modeling and signal processing of greek language prosodic events with application to speech synthesis Ζέρβας, Παναγιώτης 04 February 2008 (has links) Αντικείμενο της παρούσης διδακτορικής διατριβής αποτελεί η μελέτη και μοντελοποίηση των φαινομένων επιτονισμού της Ελληνικής γλώσσας με εφαρμογές στην σύνθεση ομιλίας. Στα πλαίσια της διατριβής αυτής αναπτύχθηκαν πόροι ομιλίας και εργαλεία για την επεξεργασία και μελέτη προσωδιακών παραγόντων οι οποίοι επηρεάζουν την πληροφορία που μεταφέρεται μέσω του προφορικού λόγου. Για την διαχείρηση και επεξεργασία των παραπάνω πόρων υλοποιήθηκε πλατφόρμα μετατροπής κειμένου σε ομιλία βασισμένη στην συνένωση δομικών μονάδων ομιλίας. Για την μελέτη και την δημιουργία των μοντέλων μηχανικής μάθησης χρησιμοποιήθηκε η γλωσσολογική αναπαράσταση GRToBI των φαινομένων επιτονισμού. / In this thesis we cope with the task of studying and modeling prosodic phenomena encountered in Greek language with applications to the task of speech synthesis from tex. Thus, spoken corpora with various levels of morphosyntactical and linguistic representation as well as tools for their processing, we constructed. For the task of coding the emerged prosodic phenomena of our recorded utterences we have utilized the GRToBI annotation of speech. Προσωδία Επιτονισμός Σύνθεση ομιλίας Μηχανική μάθηση 621.382 23 Prosody Intonation Speech signal processing Speech synthesis Machine learning
7	Steuerung sprechernormalisierender Abbildungen durch künstliche neuronale Netzwerke Müller, Knut 01 November 2000 (has links) No description available. 530 Physik Mathematics and Computer Science Automatic Speech Recognition Speaker Normalization Speech Signal Processing Artificial Neural Networks 33.06 RDH00
8	Fluency Features and Elicited Imitation as Oral Proficiency Measurement Christensen, Carl V. 07 July 2012 (has links) (PDF) The objective and automatic grading of oral language tests has been the subject of significant research in recent years. Several obstacles lie in the way of achieving this goal. Recent work has suggested a testing technique called elicited imitation (EI) can be used to accurately approximate global oral proficiency. This testing methodology, however, does not incorporate some fundamental aspects of language such as fluency. Other work has suggested another testing technique, simulated speech (SS), as a supplement to EI that can provide automated fluency metrics. In this work, I investigate a combination of fluency features extracted for SS testing and EI test scores to more accurately predict oral language proficiency. I also investigate the role of EI as an oral language test, and the optimal method of extracting fluency features from SS sound files. Results demonstrate the ability of EI and SS to more effectively predict hand-scored SS test item scores. I finally discuss implications of this work for future automated oral testing scenarios. Second language oral proficiency Elicited Imitation Simulated Speech Automatic Speech Recognition language modalities speech signal processing computerized oral test Linguistics
9	Investigating Prompt Difficulty in an Automatically Scored Speaking Performance Assessment Cox, Troy L. 14 March 2013 (has links) (PDF) Speaking assessments for second language learners have traditionally been expensive to administer because of the cost of rating the speech samples. To reduce the cost, many researchers are investigating the potential of using automatic speech recognition (ASR) as a means to score examinee responses to open-ended prompts. This study examined the potential of using ASR timing fluency features to predict speech ratings and the effect of prompt difficulty in that process. A speaking test with ten prompts representing five different intended difficulty levels was administered to 201 subjects. The speech samples obtained were then (a) rated by human raters holistically, (b) rated by human raters analytically at the item level, and (c) scored automatically using PRAAT to calculate ten different ASR timing fluency features. The ratings and scores of the speech samples were analyzed with Rasch measurement to evaluate the functionality of the scales and the separation reliability of the examinees, raters, and items. There were three ASR timed fluency features that best predicted human speaking ratings: speech rate, mean syllables per run, and number of silent pauses. However, only 31% of the score variance was predicted by these features. The significance in this finding is that those fluency features alone likely provide insufficient information to predict human rated speaking ability accurately. Furthermore, neither the item difficulties calculated by the ASR nor those rated analytically by the human raters aligned with the intended item difficulty levels. The misalignment of the human raters with the intended difficulties led to a further analysis that found that it was problematic for raters to use a holistic scale at the item level. However, modifying the holistic scale to a scale that examined if the response to the prompt was at-level resulted in a significant correlation (r = .98, p < .01) between the item difficulties calculated analytically by the human raters and the intended difficulties. This result supports the hypothesis that item prompts are important when it comes to obtaining quality speech samples. As test developers seek to use ASR to score speaking assessments, caution is warranted to ensure that score differences are due to examinee ability and not the prompt composition of the test. Automatic Speech Recognition second language oral proficiency language testing and assessment English as a second language tests speech signal processing Educational Psychology
10	Análise cepstral baseada em diferentes famílias transformada wavelet / Cepstral analysis based on different family of wavelet transform Sanchez, Fabrício Lopes 02 December 2008 (has links) Este trabalho apresenta um estudo comparativo entre diferentes famílias de transformada Wavelet aplicadas à análise cepstral de sinais digitais de fala humana, com o objetivo específico de determinar o período de pitch dos mesmos e, ao final, propõe um algoritmo diferencial para realizar tal operação, levando-se em consideração aspectos importantes do ponto de vista computacional, tais como: desempenho, complexidade do algoritmo, plataforma utilizada, dentre outros. São apresentados também, os resultados obtidos através da implementação da nova técnica (baseada na transformada wavelet) em comparação com a abordagem tradicional (baseada na transformada de Fourier). A implementação da técnica foi testada em linguagem C++ padrão ANSI sob as plataformas Windows XP Professional SP3, Windows Vista Business SP1, Mac OSX Leopard e Linux Mandriva 10. / This work presents a comparative study between different family of wavelets applied on cepstral analysis of the digital speech human signal with specific objective for determining of pitch period of the same and in the end, proposes an differential algorithm to make such a difference operation take into consideration important aspects of computational point of view, such as: performance, algorithm complexity, used platform, among others. They are also present, the results obtained through of the technique implementation compared with the traditional approach. The technique implementation was tested in C++ language standard ANSI under the platform Windows XP Professional SP3 Edition, Windows Vista Business SP1, MacOSX Leopard and Linux Mandriva 10. Análise cepstral Cepstrum analysis Digital speech signal processing Discrete Fourier transforn Discrete wavelet transform Período de pitch Pitch period Processamento digital de sinais de fala Transformada discreta de Fourier Transformada discreta wavelet

Search results