Global ETD Search

1	Popular Music Analysis: Chorus and Emotion Detection Lin, Yu-Dun 16 August 2010 In this thesis, a chorus detection algorithm and an emotion detection algorithm for popular music are proposed. First, a popular song is decomposed into chorus and verse segments based on its color representation and MFCCs (Mel-frequency cepstral coefficients). Four features, including intensity, tempo and rhythm regularity, are extracted from these structured segments for emotion detection. The emotion of a song is classified into one of four classes (happy, angry, depressed and relaxed) using two classification methods: a back-propagation neural network classifier and an AdaBoost classifier. A test database of 350 popular songs is used in the experiments. Experimental results show that the average recall and precision of the proposed chorus detection are approximately 95% and 84%, respectively; the average precision of emotion detection is 86% for the neural network classifier and 92% for the AdaBoost classifier. The emotions of songs with different cover versions are also detected in the experiments, with a precision of 92%. Keywords: tempo; MFCCs; rhythm; emotion; neural network
2	Bio-inspired noise robust auditory features Javadi, Ailar 12 June 2012 The purpose of this work is to investigate a series of biologically inspired modifications to state-of-the-art Mel-frequency cepstral coefficients (MFCCs) that may improve automatic speech recognition results. We provide recommendations for improving speech recognition results depending on the signal-to-noise ratio of the input signal. This work was motivated by noise-robust auditory features (NRAF). In the feature extraction technique, after a signal is filtered using bandpass filters, a spatial derivative step is used to sharpen the results, followed by an envelope detector (rectification and smoothing) and down-sampling for each filter bank before being compressed. A DCT is then applied to the results of all filter banks to produce features. The Hidden Markov Model Toolkit (HTK) is used as the recognition back-end to perform speech recognition on the extracted features. We investigate the role of filter types, window size, spatial derivative, rectification types, smoothing, down-sampling and compression, and compare the final results to state-of-the-art MFCCs. A series of conclusions and insights is provided for each step of the process. The goal of this work was not to outperform MFCCs; however, we show that by changing the compression type from log compression to 0.07 root compression we are able to outperform MFCCs for all noisy conditions. Keywords: Speech recognition; MFCCs; Noise-robust features; Feature extraction; Biologically-inspired computing; Automatic speech recognition; Computational auditory scene analysis
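The compression and DCT steps mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the energy floor, the number of coefficients and the sample band energies are all assumptions chosen for illustration, and the filtering, spatial derivative, envelope detection and down-sampling stages are omitted. It only contrasts log compression with root compression at exponent 0.07 before an unnormalized DCT-II.

```python
import math

def log_compress(energies, floor=1e-10):
    """Natural-log compression, as used in conventional MFCCs.
    The floor (an assumed value) avoids log(0)."""
    return [math.log(max(e, floor)) for e in energies]

def root_compress(energies, exponent=0.07):
    """Root (power-law) compression: each energy raised to `exponent`."""
    return [max(e, 0.0) ** exponent for e in energies]

def dct_ii(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first `num_coeffs` coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n)
            for i, v in enumerate(values))
        for k in range(num_coeffs)
    ]

def cepstral_features(energies, compress, num_coeffs=4):
    """Compress per-band energies, then decorrelate them with a DCT."""
    return dct_ii(compress(energies), num_coeffs)

# Hypothetical per-band energies for one frame (illustrative values only).
band_energies = [0.002, 0.5, 3.0, 12.0, 0.8, 0.05]
log_feats = cepstral_features(band_energies, log_compress)
root_feats = cepstral_features(band_energies, root_compress)
```

Unlike the logarithm, root compression stays bounded as a band energy approaches zero, which is one commonly cited reason it can behave differently on low-energy, noise-dominated bands; the abstract itself does not state the mechanism.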
3	Perceptually motivated speech recognition and mispronunciation detection Koniaris, Christos January 2012 This doctoral thesis is the result of a research effort in two fields of speech technology: speech recognition and mispronunciation detection. Although the two areas are clearly distinguishable, the proposed approaches share a common hypothesis based on psychoacoustic processing of speech signals. The conjecture implies that the human auditory periphery provides a relatively good separation of different sound classes. Hence, it is possible to use recent findings from psychoacoustic perception together with mathematical and computational tools to model the auditory sensitivities to small speech signal changes. The performance of an automatic speech recognition system strongly depends on the representation used for the front-end. If the extracted features do not include all relevant information, the performance of the classification stage is inherently suboptimal. The work described in Papers A, B and C is motivated by the fact that humans perform better at speech recognition than machines, particularly in noisy environments. The goal is to make use of knowledge of human perception in the selection and optimization of speech features for speech recognition. These papers show that maximizing the similarity of the Euclidean geometry of the features to the geometry of the perceptual domain is a powerful tool to select or optimize features. Experiments with a practical speech recognizer confirm the validity of the principle. An approach to improving mel frequency cepstrum coefficients (MFCCs) through offline optimization is also presented. The method has three advantages: i) it is computationally inexpensive, ii) it does not use the auditory model directly, thus avoiding its computational cost, and iii) importantly, it provides better recognition performance than traditional MFCCs for both clean and noisy conditions.
The second task concerns automatic pronunciation error detection. The research, described in Papers D, E and F, is motivated by the observation that almost all native speakers perceive, relatively easily, the acoustic characteristics of their own language when it is produced by speakers of the language. Small variations within a phoneme category, sometimes different for various phonemes, do not significantly change the perception of the language’s own sounds. Several methods are introduced to identify the problematic phonemes for a given speaker, based on similarity measures between the Euclidean space spanned by the acoustic representations of the speech signal and the Euclidean space spanned by an auditory model output. The methods are tested on groups of speakers from different languages and evaluated against a theoretical linguistic study, showing that they can capture many of the problematic phonemes that speakers from each language mispronounce. Finally, a listening test on the same dataset verifies the validity of these methods. / European Union FP6-034362 research project ACORNS / Computer-Animated language Teachers (CALATea) Keywords: feature extraction; feature selection; auditory models; MFCCs; speech recognition; distortion measures; perturbation analysis; psychoacoustics; human perception; sensitivity matrix; pronunciation error detection; phoneme; second language; perceptual assessment
