  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

A Design of Korean Speech Recognition System

Wu, Bing-Yang 24 August 2010 (has links)
This thesis investigates the design and implementation strategies for a Korean speech recognition system. It uses the speech features of common Korean mono-syllables as the basis for training and recognition. A training database of 10 utterances per mono-syllable is established by applying Korean pronunciation rules: the utterances are collected by reading 5 rounds of the same mono-syllables twice with different tones, where the first pronunciation has the high pitch of tone 1 and the second has the falling pitch of tone 4. Mel-frequency cepstral coefficients and linear predictive cepstral coefficients serve as the two feature models, and a hidden Markov model serves as the recognition model. On a 2.4 GHz Pentium personal computer running Ubuntu 9.04, a correct phrase recognition rate of 92.25% is reached on a database of 4865 Korean phrases, with an average computation time of about 1.5 seconds per phrase.
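The MFCC feature extraction stage described above can be illustrated from first principles. The thesis's exact framing parameters and filterbank configuration are not given, so the window sizes, filter count, and coefficient count below are assumed values, and only the MFCC branch (not the LPC cepstrum) is shown:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Toy MFCC extraction: framing, FFT power spectrum, triangular
    mel filterbank, log compression, DCT-II. Illustrative only."""
    # frame the signal (25 ms window, 10 ms hop -- assumed values)
    win, hop = int(0.025*sr), int(0.010*sr)
    frames = np.array([signal[i:i+win]
                       for i in range(0, len(signal)-win+1, hop)])
    frames = frames * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n_fft))**2 / n_fft
    # triangular filters spaced evenly on the mel scale
    def hz2mel(f): return 2595*np.log10(1 + f/700.0)
    def mel2hz(m): return 700*(10**(m/2595.0) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr/2), n_mels+2)
    bins = np.floor((n_fft+1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft//2 + 1))
    for m in range(1, n_mels+1):
        l, c, r = bins[m-1], bins[m], bins[m+1]
        fbank[m-1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m-1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies into cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi/n_mels * (n[:, None] + 0.5) * np.arange(n_ceps)[None, :])
    return logmel @ dct

sig = np.sin(2*np.pi*440*np.arange(16000)/16000)   # 1 s test tone
feats = mfcc(sig)
print(feats.shape)  # → (98, 13): one 13-coefficient vector per frame
```

The resulting per-frame vectors are what a recogniser (here, the HMM stage) would consume.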
22

A Design of Speech Recognition System under Noisy Environment

Cheng, Po-Wen 11 August 2003 (has links)
The objective of this thesis is to build a phrase recognition system that can be used in real life under noisy environments. In this system, the noisy speech is first filtered by an enhanced spectral subtraction method to reduce the noise level. MFCCs with cepstral mean subtraction are then applied to extract the speech features. Finally, a hidden Markov model (HMM) is used in the last stage to build a probabilistic model for each phrase. A Mandarin microphone database of 514 names of companies listed on Taiwan's stock market was collected, and a speaker-independent noisy phrase recognition system was implemented and tested under various noise environments and noise strengths.
23

Prediction and Estimation of Random Fields

Kohli, Priya 2012 August 1900 (has links)
For a stationary two-dimensional random field, we utilize the classical Kolmogorov-Wiener theory to develop prediction methodology that requires minimal assumptions on the dependence structure of the random field. We also provide solutions for several non-standard prediction problems which deal with the "modified past," in which a finite number of observations are added to the past. These non-standard prediction problems are motivated by network site selection in environmental and geostatistical applications. Unlike the time series situation, the prediction results for random fields seem to be expressible only in terms of the moving average parameters, and attempts to express them in terms of the autoregressive parameters lead to a new and mysterious projection operator which captures the nature of edge effects. We put forward an approach for estimating the predictor coefficients by extending the exponential models. Through simulation studies and a real-data example, we demonstrate the impressive performance of our prediction method. To the best of our knowledge, the proposed method is the first to deliver a unified framework for forecasting random fields in both the time and spectral domains without making a subjective choice of the covariance structure. Finally, we focus on estimating the Hurst parameter of long-range dependent stationary random fields, motivated by applications in environmental and atmospheric processes. Current methods for estimating the Hurst parameter include parametric models such as fractional autoregressive integrated moving average models, and semiparametric estimators which are either inefficient or inconsistent. We propose a novel semiparametric estimator based on the fractional exponential spectrum, and develop three data-driven methods which automatically select the optimal model order for the fractional exponential models.
Extensive simulation studies and an analysis of Mercer and Hall's wheat data illustrate the performance of the proposed estimator and the model order selection criteria. The results show that our estimator outperforms existing estimators, including the GPH (Geweke and Porter-Hudak) estimator. We show that the proposed estimator is consistent, works for different definitions of long-range dependent random fields, is computationally simple, and is not susceptible to model misspecification or poor efficiency.
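For intuition, the GPH baseline named above can be sketched in its one-dimensional form (the thesis's fractional-exponential estimator for two-dimensional fields is more involved; the bandwidth choice below is an assumed tuning value):

```python
import numpy as np

def gph_hurst(x, frac=0.6):
    """Geweke-Porter-Hudak estimator: regress the log periodogram on
    log|2 sin(w/2)| over the m lowest Fourier frequencies. The slope
    is -2d; for a stationary long-memory series, H = d + 1/2."""
    n = len(x)
    m = max(int(n**frac), 2)                       # bandwidth (tuning choice)
    w = 2*np.pi*np.arange(1, m+1)/n                # low Fourier frequencies
    # periodogram at those frequencies
    I = np.abs(np.fft.fft(x - x.mean())[1:m+1])**2 / (2*np.pi*n)
    X = np.log(2*np.sin(w/2))
    slope = np.polyfit(X, np.log(I), 1)[0]
    d = -slope/2
    return d + 0.5

# white noise has no long memory, so the estimate should be near H = 0.5
rng = np.random.default_rng(0)
H = gph_hurst(rng.standard_normal(4096))
print(round(H, 2))
```

The inefficiency the abstract mentions comes from using only the m lowest frequencies; spectrum-wide models such as the fractional exponential family avoid that restriction.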
24

A robust audio-based symbol recognition system using machine learning techniques

Wu, Qiming 02 1900 (has links)
Master of Science / This research investigates the creation of an audio-shape recognition system that is able to interpret a user's drawn audio shapes—fundamental shapes, digits and/or letters—on a given surface such as a table-top, using a generic stylus such as the back of a pen. The system makes use of one, two or three piezo microphones, as required, to capture the sound of the audio gestures, and a combination of the Mel-Frequency Cepstral Coefficients (MFCC) feature descriptor and Support Vector Machines (SVMs) to recognise the audio shapes. The novelty of the system is in the use of piezo microphones, which are low-cost, light-weight and portable, and the main investigation is around determining whether these microphones provide sufficiently rich information to recognise the audio shapes in such a framework.
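The SVM classification stage can be illustrated with a minimal linear SVM trained by Pegasos-style sub-gradient descent on toy two-dimensional features. The actual system would feed MFCC vectors to a full library SVM (likely with a kernel); the data, dimensions, and hyperparameters here are invented for illustration:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Tiny linear SVM via Pegasos-style sub-gradient descent on the
    hinge loss; y must be in {-1, +1}. A stand-in sketch, not the
    thesis's actual SVM implementation."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1]); b = 0.0; t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0/(lam*t)                   # decaying step size
            margin = y[i]*(X[i] @ w + b)
            w *= (1 - eta*lam)                  # regularisation shrinkage
            if margin < 1:                      # hinge-loss violation
                w += eta*y[i]*X[i]
                b += eta*y[i]
    return w, b

# toy two-class "shape" features: two well-separated clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 0.3, (50, 2)),
               rng.normal([-2, -2], 0.3, (50, 2))])
y = np.array([1]*50 + [-1]*50)
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
print((pred == y).mean())
```

A one-vs-rest bank of such classifiers would extend this to the multi-class digit/letter setting the abstract describes.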
25

Identifikace řečové aktivity v rušeném řečovém signálu / Identification of Speech Activity in Noisy Speech Signal

Pelikán, Martin January 2013 (has links)
This paper focuses on the identification of pauses in a noisy speech signal and the subsequent filtering of the noise from the signal. First, the signal processing methods are described theoretically, followed by voice activity detectors and, finally, noise filtering methods. Several voice activity detectors were implemented and their pause detection rates were compared.
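One of the simplest members of the detector family such a comparison would include is an energy-threshold detector. A minimal sketch (frame length, threshold factor, and the pause-fraction assumption are illustrative choices, not the thesis's settings):

```python
import numpy as np

def energy_vad(signal, sr=8000, frame_ms=20, k=3.0):
    """Energy-threshold voice activity detection: mark a frame as
    speech when its short-time energy exceeds k times an estimated
    noise floor (the 20th percentile of frame energies, which
    assumes at least ~20% of the recording is pause)."""
    n = int(sr*frame_ms/1000)
    frames = signal[:len(signal)//n*n].reshape(-1, n)
    energy = (frames**2).mean(axis=1)
    noise_floor = np.percentile(energy, 20)
    return energy > k*noise_floor

sr = 8000
rng = np.random.default_rng(0)
noise = 0.05*rng.standard_normal(sr)                        # 1 s of noise
tone = np.sin(2*np.pi*300*np.arange(sr)/sr) + 0.05*rng.standard_normal(sr)
decisions = energy_vad(np.concatenate([noise, tone, noise]), sr)
print(decisions[:50].any(), decisions[50:100].all())  # → False True
```

Pause frames identified this way also supply the noise estimate that spectral filtering methods need.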
26

Voice Onset Time in Children With and Without Vocal Fold Nodules

Colletti, Lauren Anna January 2022 (has links)
Purpose: This study examined voice onset time (VOT) in children with and without vocal fold nodules (VFN). Its purpose was to provide further evidence of the need for individualized research and treatment dedicated to the pediatric population. Children have a distinctly different laryngeal mechanism from adults, as they are still developing, yet although the pediatric system is anatomically different from a fully mature adult system, treatment for children with VFN is largely based on adult research. The study examined the VOTs of voiceless consonants, as the transition from a voiceless consonant to the subsequent vowel requires significant vocal and articulatory control and coordination, and VOT follows a significant developmental pattern throughout maturation. Children with and without VFN were enrolled in order to examine the effects VFN have on VOT. Hypotheses: We hypothesized that children with VFN would differ from the control group in 1) average VOT values, with no prediction for direction of difference (shorter or longer), and 2) between-word variability of VOT values, with no prediction for direction of difference (more or less variable). Methods: Participant data were retrospectively collected and included children between 6 and 12 years old with VFN and age- and sex-matched controls. Participants were recorded producing the six CAPE-V sentences, and four voiceless consonants were selected for VOT analysis. The current researcher used Praat to manually mark the vocal onset of each stop consonant; a previous researcher had identified the vocal offset, and each placement was confirmed by the current researcher. VOT was calculated as the time between the stop consonant burst and the vocal onset of the vowel. Results: There was no significant difference between the VFN and control groups in average VOT or VOT variability.
Within the VFN group, participants who were more dysphonic (lower cepstral peak prominence (CPP) values) had more variable VOT values. Participants in the VFN group had lower CPP values than the control group, supporting CPP as a reliable indicator of dysphonia. Additionally, within the VFN group, male children had lower CPP values than female children. Conclusion: Although no group difference was found, the within-group analyses indicated that VFNs impacted productions: children with VFN who were more dysphonic had increased VOT variability. This may suggest that VFN impact a child's ability to phonate, thereby causing more variability within productions. Future research is needed to study the impact that dysphonia treatment for children with VFN may have on VOT values; a longitudinal study of the impact of VFNs on VOT values during developmental stages may also be warranted. / Public Health
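The cepstral peak prominence measure used to index dysphonia here can be approximated from first principles. Clinical tools compute a smoothed variant (CPPS); the pitch search range and window below are assumptions, and this sketch conveys only the bare idea:

```python
import numpy as np

def cepstral_peak_prominence(frame, sr):
    """Rough CPP: height of the cepstral peak within a plausible pitch
    range (60-400 Hz, an assumed range) above a linear trend fitted to
    the cepstrum. Periodic (well-phonated) frames show a tall peak;
    aperiodic (dysphonic) frames do not."""
    spec = np.log(np.abs(np.fft.rfft(frame*np.hamming(len(frame)))) + 1e-12)
    ceps = np.fft.irfft(spec)                 # real cepstrum of the frame
    lo, hi = int(sr/400), int(sr/60)          # quefrency range for 60-400 Hz
    q = np.arange(lo, hi)
    trend = np.polyval(np.polyfit(q, ceps[lo:hi], 1), q)
    return float(np.max(ceps[lo:hi] - trend))

sr = 16000
t = np.arange(int(0.04*sr))/sr
voiced = np.sign(np.sin(2*np.pi*120*t))       # strongly periodic "phonation"
rng = np.random.default_rng(0)
aperiodic = rng.standard_normal(len(t))       # dysphonia-like noise
print(cepstral_peak_prominence(voiced, sr),
      cepstral_peak_prominence(aperiodic, sr))
```

The periodic frame yields a markedly larger prominence than the noise frame, which is the contrast the study exploits when relating low CPP to dysphonia severity.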
27

Comparison of Acoustic Measures in Discriminating Between Those With Friedreich's Ataxia and Neurologically Normal Peers

Luna-Webb, Sophia 01 January 2015 (has links)
Background: Technological advancements in speech acoustic analysis have led to the development of spectral/cepstral analyses, prompted by questions about the validity of traditional time-based measures (i.e., Jitter, Shimmer, and Harmonics-to-Noise Ratio) for objectifying perturbations in dysphonic voices. Aim: This study investigated the validity of time-based measures, compared to cepstral-spectral measures, in discriminating those with Friedreich's ataxia (FA) from normal-voiced (NV) peers. Method: A total of 120 sustained phonations of the vowels /ɑ/, /i/, and /o/, from an existing database of 40 participants (20 FA; 20 NV), were analyzed to determine which set of variables (time-based vs. cepstral-spectral) better predicted group membership. Five time-based measures (Jitter Local %, Jitter RAP, Shimmer Local %, Shimmer APQ11, and HNR), extracted with the freeware program PRAAT, were compared to four cepstral-spectral measures (Cepstral Peak Prominence, Cepstral Peak Prominence Standard Deviation, Low/High Ratio Standard Deviation, and the Cepstral/Spectral Index of Dysphonia) extracted from the Analysis of Dysphonia in Speech and Voice (ADSV) software program. Results: A discriminant analysis showed better sensitivity and specificity for the ADSV measures: 100% of the FA group were classified correctly (sensitivity) and 95% of the NV group were correctly identified (specificity), as compared to PRAAT (70% sensitivity and 85% specificity). Conclusions: Cepstral-spectral measures are considerably more accurate than time-based estimates in discriminating between those with FA and NV peers.
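The discriminant analysis step can be sketched with a from-scratch two-group Fisher linear discriminant on synthetic "measures." The real study used the five PRAAT and four ADSV variables; the toy data, dimensionality, and group separation here are invented for illustration:

```python
import numpy as np

def lda_classify(X, y):
    """Two-group Fisher linear discriminant with resubstitution,
    reporting sensitivity and specificity as in the study's analysis
    (an illustrative sketch, not the original software's procedure)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled within-group scatter matrix
    Sw = np.cov(X0.T)*(len(X0)-1) + np.cov(X1.T)*(len(X1)-1)
    w = np.linalg.solve(Sw, m1 - m0)            # discriminant direction
    thresh = w @ (m0 + m1) / 2                  # midpoint decision rule
    pred = (X @ w > thresh).astype(int)
    sens = (pred[y == 1] == 1).mean()           # "FA" group found correctly
    spec = (pred[y == 0] == 0).mean()           # "NV" group found correctly
    return sens, spec

# toy data: 20 "NV" and 20 "FA" participants with 4 acoustic measures each
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (20, 4)),   # NV-like group
               rng.normal(3.0, 1.0, (20, 4))])  # FA-like group, shifted means
y = np.array([0]*20 + [1]*20)
sens, spec = lda_classify(X, y)
print(sens, spec)
```

Sensitivity and specificity computed this way are exactly the per-group correct-classification rates the abstract reports.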
28

Towards Development of Intelligibility Assessment for Dysphonic Speech

Ishikawa, Keiko 16 June 2017 (has links)
No description available.
29

Channel Compensation for Speaker Recognition Systems

Neville, Katrina Lee, katrina.neville@rmit.edu.au January 2007 (has links)
This thesis addresses the problem of how best to remedy different types of channel distortion on speech that is to be used in automatic speaker recognition and verification systems. In automatic speaker recognition, a machine analyses a person's voice and determines the person's identity by comparing speech features to a known set of speech features; in automatic speaker verification, a person claims an identity and the machine determines whether the claimed identity is correct or the person is an impostor. Channel distortion occurs whenever information is sent electronically through any type of channel, whether a basic wired telephone channel or a wireless channel. The distortions that can corrupt the information include time-variant or time-invariant filtering and the addition of 'thermal noise', both of which can cause varying degrees of error in the information being received and analysed. The experiments presented in this thesis investigate the effects of channel distortion on average speaker recognition rates and test the effectiveness of various channel compensation algorithms designed to mitigate those effects. The speaker recognition system was represented by a basic recognition algorithm consisting of speech analysis, extraction of feature vectors in the form of Mel-cepstral coefficients, and a classification stage based on the minimum distance rule.
Two types of channel distortion were investigated:
• Convolutional (or lowpass filtering) effects
• Addition of white Gaussian noise

Three different methods of channel compensation were tested:
• Cepstral Mean Subtraction (CMS)
• RelAtive SpecTrAl (RASTA) processing
• Constant Modulus Algorithm (CMA)

The results showed that, for both CMS and RASTA processing, filtering at low cutoff frequencies (3 or 4 kHz) improved the average speaker recognition rates compared to speech with no compensation, with RASTA processing giving larger improvements than CMS. Neither the CMS nor the RASTA method improved the accuracy of the speaker recognition system at cutoff frequencies of 5 kHz, 6 kHz or 7 kHz. In the case of noisy speech, all of the methods analysed could compensate at high SNRs of 40 dB and 30 dB, and only RASTA processing could compensate and improve the average recognition rate for speech corrupted with a high level of noise (SNRs of 20 dB and 10 dB).
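Of the three compensation methods, cepstral mean subtraction is the simplest to sketch. A time-invariant convolutional channel multiplies the speech spectrum, so in the log-cepstral domain it adds a constant offset to every frame; subtracting the per-utterance mean of each coefficient removes that offset exactly (the toy cepstra below are random stand-ins for real MFCC frames):

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """CMS: subtract the per-utterance mean of each cepstral
    coefficient. A time-invariant channel appears as a constant
    additive term per coefficient, so mean removal cancels it."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# demo: a fixed "channel" offset added to every frame is removed exactly
rng = np.random.default_rng(0)
clean = rng.standard_normal((100, 13))        # toy cepstral frames
channel = rng.standard_normal(13)             # time-invariant channel offset
distorted = clean + channel
a = cepstral_mean_subtraction(clean)
b = cepstral_mean_subtraction(distorted)
print(np.allclose(a, b))  # → True
```

This exact cancellation only holds for time-invariant channels, which is consistent with the finding above that CMS cannot help against additive noise at low SNR.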
30

Classification of Affective Emotion in Musical Themes : How to understand the emotional content of the soundtracks of the movies?

Diaz Banet, Paula January 2021 (has links)
Music is created by composers to arouse different emotions and feelings in the listener and, in the case of soundtracks, to support the storytelling of scenes. The goal of this project is to find the best method for evaluating the emotional content of soundtracks. This emotional content can be measured quantitatively using Russell's model of valence, arousal and dominance (VAD), which converts mood labels into numbers. To conduct the analysis, MFCC and VGGish features were extracted from the soundtracks and used as inputs to a CNN and an LSTM model, in order to study which achieved the better prediction. A database of 6757 soundtracks with their corresponding VAD values was created to perform this analysis. Ultimately, the results of the experiments will help the start-up Vionlabs better understand the content of movies and therefore make more accurate recommendations about what users want to consume on Video on Demand platforms according to their emotions or moods.
