Global ETD Search

1	Robust speech features for speech recognition in hostile environments Toh, Aik January 1900 Speech recognition systems have improved in robustness in recent years with respect to both speaker and acoustical variability. Nevertheless, it is still a challenge to deploy speech recognition systems in real-world applications that are exposed to diverse and significant levels of noise. Robustness and recognition accuracy are the essential criteria in determining the extent to which a speech recognition system can be deployed in real-world applications. This work involves the development of techniques and extensions to extract robust features from speech and achieve substantial performance in speech recognition. Robustness and recognition accuracy are the top concerns in this research. In this work, the robustness issue is approached through front-end processing, in particular robust feature extraction. The author proposes a unified framework for robust features and presents a comprehensive evaluation of robustness in speech features. The framework addresses three distinct approaches: robust feature extraction, temporal information inclusion and normalization strategies. The author discusses the issue of robust feature selection primarily in the spectral and cepstral context. Several enhancements and extensions are explored for the purpose of robustness. These include a computationally efficient approach proposed for moment normalization. In addition, a simple back-end approach is incorporated to improve recognition performance in reverberant environments. Speech features in this work are evaluated in three distinct environments that occur in real-world scenarios. The thesis also discusses the effect of noise on speech features and their parameters. The author has established that statistical properties play an important role in mismatches. 
The significance of the research is strengthened by the evaluation of robust approaches in more than one scenario and the comparison with the performance of state-of-the-art features. The contributions and limitations of each robust feature in all three environments are highlighted. The novelty of the work lies in the diverse hostile environments in which speech features are evaluated for robustness. The author has obtained recognition accuracy of more than 98.5% for channel distortion. Recognition accuracy greater than 90.0% has also been maintained for a reverberation time of 0.4 s and additive babble noise at an SNR of 10 dB. The thesis delivers comprehensive research on robust speech features for speech recognition in hostile environments, supported by significant experimental results. Several observations, recommendations and relevant issues associated with robust speech features are presented. Robust speech features Speech recognition Noisy environments Speech processing
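The abstract above mentions moment normalization without giving details. As an illustration only, the following is a minimal sketch of the most common form, per-utterance mean and variance normalization of feature vectors. It is not the computationally efficient variant proposed in the thesis, and the function name and data layout are assumptions made for this example.

```python
import math

def mean_variance_normalize(frames):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: list of equal-length lists, one feature vector per speech frame.
    This is the generic textbook form, not the thesis's own algorithm.
    """
    n = len(frames)
    dims = len(frames[0])
    # First moment (mean) of every feature dimension over the utterance.
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        # Second central moment (variance) of the same dimension.
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        # A constant dimension has zero variance; fall back to 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]
```

Normalizing per utterance removes constant offsets such as those introduced by a fixed channel, which is one reason this kind of normalization is often used against train/test mismatch.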
2	Statistické zpracování řečových parametrů / Statistical Processing of Speech Features Svozil, Martin January 2015 This diploma thesis deals with speech signal processing and vowel analysis, mainly to uncover differences in speech features depending on the emotional state of the speaker. The application ARePa, created for speech signal processing, was developed in the Matlab environment and includes a graphical user interface (GUI) for easier handling of ARePa and the analysed recordings. The application performs a complete analysis of the speech signal and then compares the current feature with feature values from a database using histograms. The developed application also allows currently analysed recordings to be archived in the database.
3	Classification of Parkinson’s Disease using MultiPass Lvq,Logistic Model Tree,K-Star for Audio Data set : Classification of Parkinson Disease using Audio Dataset Udaya Kumar, Magesh Kumar January 2011 Parkinson's disease (PD) is a degenerative illness whose cardinal symptoms include rigidity, tremor, and slowness of movement. In addition to its widely recognized effects, PD can have a profound effect on speech and voice. The speech symptoms most commonly demonstrated by patients with PD are reduced vocal loudness, monopitch, disruptions of voice quality, and an abnormally fast rate of speech. This cluster of speech symptoms is often termed hypokinetic dysarthria. The disease can be difficult to diagnose accurately, especially in its early stages. For this reason, automatic techniques based on artificial intelligence should increase diagnostic accuracy and help doctors make better decisions. The aim of the thesis work is to predict PD based on audio files collected from various patients. The audio files are preprocessed in order to obtain the features. The preprocessed data contains 23 attributes and 195 instances. On average there are six voice recordings per person. Using a data compression technique such as the Discrete Cosine Transform (DCT), the number of instances can be reduced. After data compression, attribute selection is done using several WEKA built-in methods such as ChiSquared, GainRatio, and Infogain. After identifying the important attributes, we evaluate the attributes one by one using stepwise regression. Based on the selected attributes, we process the data in WEKA using a cost-sensitive classifier with various algorithms such as MultiPass LVQ, Logistic Model Tree (LMT), and K-Star. The classification results show, on average, 80% accuracy. Using these features, approximately 95% classification accuracy for PD is achieved. This shows that, using the audio dataset, PD could be predicted with a higher level of accuracy. 
Parkinson Audio Data set MultiPass Lvq Logistic Model Tree K-Star Hypokinetic Dysarthria Weka Artificial Intelligence Speech Speech Features Acoustics
4	Vers un système indiquant la distance d'un locuteur par transformation de sa voix / Speech transformation for distance perception Fux, Thibaut 24 May 2012 Cette thèse porte sur la transformation de la voix d’un locuteur dans l’objectif d’indiquer la distance de celui-ci : une transformation en voix chuchotée pour indiquer une distance proche et une transformation en voix criée pour une distance plutôt éloignée. Nous effectuons dans un premier temps des analyses approfondies pour déterminer les paramètres les plus pertinents dans une voix chuchotée et surtout dans une voix criée (beaucoup plus difficile). La contribution principale de cette partie est de montrer la pertinence des paramètres prosodiques dans la perception de l’effort vocal dans une voix criée. Nous proposons ensuite des descripteurs permettant de mieux caractériser les contours prosodiques. Pour la transformation proprement dite, nous proposons plusieurs nouvelles règles de transformation qui contrôlent de manière primordiale la qualité des voix transformées. Les résultats ont montré une très bonne qualité des voix chuchotées transformées ainsi que pour des voix criées pour des structures linguistiques relativement simples (CVC, CVCV, etc.). / This thesis focuses on transforming a speaker's voice in order to indicate the speaker's distance: a spoken-to-whispered voice transformation to indicate a close distance and a spoken-to-shouted voice transformation for a rather far distance. We first perform in-depth analyses to determine the most relevant features in whispered voices and especially in shouted voices (which are much harder). The main contribution of this part is to show the relevance of prosodic parameters in the perception of vocal effort in a shouted voice. Then, we propose some descriptors to better characterize the prosodic contours. For the actual transformation, we propose several new transformation rules which crucially control the quality of the transformed voice. The results showed very good quality for transformed whispered voices, and for transformed shouted voices with relatively simple linguistic structures (CVC, CVCV, etc.). Perception de la distance Caractéristiques de la parole Effort vocal Son 3D Speech perception Speech features Vocal effort 3D-sound 620
5	Αναγνώριση ομιλητή και ομιλίας με χρήση κυματιδίων / Speaker and speech recognition using wavelets Σιαφαρίκας, Μιχαήλ 06 September 2010 Σκοπός της παρούσας διατριβής είναι η εκμετάλλευση των κυματιδίων με σκοπό την βελτίωση της απόδοσης συστημάτων αναγνώρισης ομιλητή και ομιλίας. Στα πλαίσια αυτά, εισάγονται τέσσερις νέοι τρόποι παραμετροποίησης του σήματος ομιλίας: (1) Η πρώτη μέθοδος προσαρμόζει την ανάλυση συχνότητας των πακέτων κυματιδίων για την προσέγγιση της ψυχοακουστικής επίδρασης των κρίσιμων ζωνών του ακουστικού συστήματος ενσωματώνοντας τις τελευταίες εξελίξεις για τον υπολογισμό τους. (2) Η δεύτερη μέθοδος εισάγει μια επέκταση του μετασχηματισμού πακέτων κυματιδίων, τον επικαλυπτόμενο μετασχηματισμό πακέτων κυματιδίων, ο οποίος χρησιμοποιείται για να δοθεί έμφαση στις περιοχές αλλαγής των κρίσιμων ζωνών από μια μικρότερη σε μια μεγαλύτερη τιμή. (3) Η τρίτη μέθοδος αξιολογεί τη συνεισφορά μη επικαλυπτόμενων ζωνών συχνοτήτων στην αναγνώριση ομιλητή και κατασκευάζεται ανάλογα ένας μετασχηματισμός πακέτων κυματιδίων ο οποίος προσαρμόζει την συχνοτική του ανάλυση σύμφωνα με την απόδοση κάθε μίας από τις ζώνες. (4) Η τέταρτη μέθοδος επιλέγει τη βέλτιστη βάση από το σύνολο των μετασχηματισμών που είναι διαθέσιμοι με τα πακέτα κυματιδίων με εφαρμογή την αναγνώριση ομιλητή και κριτήριο το μέτρο EER. Οι παραπάνω τέσσερις τρόποι παραμετροποίησης του σήματος ομιλίας αξιολογήθηκαν με το σύστημα αναγνώρισης ομιλητή WCL-1 του εργαστηρίου ενσύρματης τηλεπικοινωνίας του Πανεπιστημίου Πατρών στις βάσεις δεδομένων POLYCOST και NIST και αποδείχθηκε η ανωτερότητά τους τόσο σε σχέση με προηγούμενες μεθόδους των κυματιδίων όσο και σε σχέση με ευρέως χρησιμοποιούμενες παραμέτρους ομιλίας, όπως οι παράμετροι cepstral με βάση την κλίμακα mel (MFCC). Επιπλέον, στη διατριβή αναλύονται οι ιδιότητες των σημαντικότερων συναρτήσεων κυματιδίων, επιλέγεται η βέλτιστη για την αναπαράσταση του σήματος ομιλίας και πιστοποιείται στην πράξη αυτή η επιλογή. 
Τέλος, οι δύο πρώτες από τις προαναφερόμενες μεθόδους παραμετροποίησης τροποποιήθηκαν και επεκτάθηκαν κατάλληλα για την εφαρμογή στην αναγνώριση ομιλίας όπου αξιολογήθηκαν και διαπιστώθηκε η υπεροχή τους έναντι παραδοσιακών και ευρέως διαδεδομένων μεθόδων παραμετροποίησης του σήματος ομιλίας που στηρίζονται στον μετασχηματισμό Fourier. Το κύριο συμπέρασμα που προέκυψε από τη παρούσα διδακτορική διατριβή είναι ότι τα κυματίδια και συγκεκριμένα τα πακέτα κυματιδίων είναι δυνατόν να χρησιμοποιηθούν με επιτυχία στη βελτίωση της απόδοσης συστημάτων αναγνώρισης ομιλητή και ομιλίας. / The main goal of the present thesis is the exploitation of wavelets for the optimization of speaker and speech recognition system performance. In this context, four new speech parameterization methods are introduced: (1) The first method adapts the frequency resolution of the wavelet packet transform to the critical bandwidth of auditory filters, incorporating recent advances in their estimation. (2) The second method introduces a generalization of the wavelet packet transform, named the overlapping wavelet packet transform, which emphasizes those frequency sub-bands in which the critical bandwidth changes from a finer to a coarser value. (3) The third method evaluates the contribution to the speaker recognition task of each of eight non-overlapping frequency sub-bands into which the Nyquist interval is divided, and constructs a wavelet packet transform that adapts its frequency resolution according to the performance of each sub-band. (4) The fourth method introduces a new technique for seeking and selecting the best basis among all wavelet packet transforms available for the speaker recognition task, taking the equal error rate (EER) as the criterion. 
The aforementioned four speech signal parameterizations were evaluated on the speaker verification system WCL-1 of the Wire Communications Laboratory, University of Patras, using the speaker recognition corpora POLYCOST and NIST, and their superiority was demonstrated over previous wavelet-based parameterizations as well as the widely used Mel Frequency Cepstral Coefficients (MFCC). Among the four proposed methods, the second parameterization technique exhibited the best performance. Furthermore, the properties of the most important wavelet functions are thoroughly analyzed, the optimal wavelet function is selected for the representation of the speech signal, and this choice is experimentally verified. Finally, the first two parameterization methods were further modified and extended appropriately for application to the speech recognition task, where their superiority was demonstrated over traditional and widely used speech parameterization techniques based on the Fourier transform. The main conclusion of the present doctoral thesis is that wavelets, and specifically wavelet packet transforms, can be used successfully for the tasks of speaker and speech recognition. Αναγνώριση ομιλητή Επιβεβαίωση ομιλητή Αναγνώριση ομιλίας Κυματίδια Πακέτα κυματιδίων Παράμετροι ομιλίας 006.454 Speaker recognition Speaker verification Speech recognition Wavelets Wavelet packets Speech features Critical bands
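The fourth method in the record above uses the equal error rate (EER) as its selection criterion. For readers unfamiliar with the measure, here is a minimal sketch of how an EER can be estimated from verification scores. The function name, the score lists and the simple threshold sweep are assumptions for this illustration and do not reproduce the thesis's evaluation code.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from two lists of verification scores.

    genuine:  scores of true-speaker trials (higher means "accept").
    impostor: scores of impostor trials.
    Sweeps every observed score as a threshold and returns the error rate
    at the point where false acceptance and false rejection are closest.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection rate: true speakers scoring below the threshold.
        frr = sum(1 for s in genuine if s < t) / len(genuine)
        # False acceptance rate: impostors scoring at or above the threshold.
        far = sum(1 for s in impostor if s >= t) / len(impostor)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

A lower EER means the two score distributions overlap less, so comparing candidate parameterizations by EER amounts to asking which one separates true speakers from impostors best at a balanced operating point.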
6	Objektivizace Testu 3F - dysartrický profil pomocí akustické analýzy / Objectification of the Test 3F - dysarthric profile based on acoustic analysis Bezůšek, Marek January 2021 Test 3F is used to diagnose the extent of the motor speech disorder dysarthria in Czech speakers. The evaluation of dysarthric speech is distorted by subjective assessment. The motivation behind this thesis is that there are not many automatic and objective analysis tools that can be used to evaluate the phonation, articulation, prosody and respiration of disordered speech. The aim of this diploma thesis is to identify, implement and test acoustic features of speech that could be used to objectify and automate the evaluation. These features should be easily interpretable by the clinician. It is assumed that the evaluation could be more precise because of the detailed analysis that acoustic features provide. The performance of these features was tested on a database of 151 Czech speakers consisting of 51 healthy speakers and 100 patients. Statistical analysis and machine learning methods were used to identify the correlation between features and subjective assessment. 27 of the 30 speech tasks of Test 3F were identified as suitable for automatic evaluation. Within the scope of this thesis only 10 tasks of Test 3F were tested because only a limited part of the database could be preprocessed. The statistical analysis yielded 14 features that were most useful for the test evaluation. The most significant features are: MET (respiration), relF0SD (intonation), relSEOVR (voice intensity – prosody). The lowest prediction error of the machine learning regression models was 7.14 %. The conclusion is that the evaluation of most of the tasks of Test 3F can be automated. The results of the analysis of 10 tasks show that the most significant factors in dysarthria evaluation are limited expiration, monotone voice and low variability of speech intensity.
7	Αναγνώριση ομιλητή / Speaker recognition Ganchev, Todor 25 June 2007 Η παρούσα διατριβή πραγματεύεται την αναγνώριση ομιλητή σε πραγματικές συνθήκες. Τα κύρια σημεία της εργασίας είναι: (1) αξιολόγηση διαφόρων προσεγγίσεων εξαγωγής χαρακτηριστικών παραμέτρων ομιλίας, (2) μείωση της ισχύος της περιβαλλοντικής επίδρασης στην απόδοση της αναγνώρισης ομιλητή, και (3) μελέτη τεχνικών κατηγοριοποίησης, εναλλακτικών προς τις υπάρχουσες. Συγκεκριμένα, στο (1), προτείνεται μια νέα δομή εξαγωγής παραμέτρων ομιλίας βασισμένη σε πακέτα κυματομορφών, κατάλληλα σχεδιασμένη για αναγνώριση ομιλητή. Εξάγεται με ένα αντικειμενικό τρόπο σε σχέση με την απόδοση αναγνώρισης ομιλητή, σε αντίθεση με την MFCC προσέγγιση, που βασίζεται στην προσέγγιση της αντίληψης της ανθρώπινης ακοής. Έπειτα, στο (2), δίνεται μια δομή για την εξαγωγή παραμέτρων βασισμένη στα MFCC, ανεκτική στο θόρυβο, για την βελτίωση της απόδοσης της αναγνώρισης ομιλητή σε πραγματικό περιβάλλον. Συνοπτικά, μια τεχνική μείωσης του θορύβου βασισμένη σε μοντέλο προσαρμοσμένη στο πρόβλημα της επιβεβαίωσης ομιλητή ενσωματώνεται απευθείας στη δομή υπολογισμού των MFCC. Αυτή η προσέγγιση επέδειξε σημαντικό πλεονέκτημα σε πραγματικό και ταχέως μεταβαλλόμενο περιβάλλον. Τέλος, στο (3), εισάγονται δύο νέοι κατηγοριοποιητές που αναφέρονται ως Locally Recurrent Probabilistic Neural Network (LR PNN), και Generalized Locally Recurrent Probabilistic Neural Network (GLR PNN). Είναι υβρίδια μεταξύ των Recurrent Neural Network (RNN) και Probabilistic Neural Network (PNN) και συνδυάζουν τα πλεονεκτήματα των γεννετικών και διαφορικών προσσεγγίσεων κατηγοριοποίησης. Επιπλέον, τα νέα αυτά νευρωνικά δίκτυα είναι ευαίσθητα σε παροδικές και ειδικές συσχετίσεις μεταξύ διαδοχικών εισόδων, και έτσι, είναι κατάλληλα για να αξιοποιήσουν την συσχέτιση παραμέτρων ομιλίας μεταξύ πλαισίων ομιλίας. Κατά την εξαγωγή των πειραμάτων, διαφάνηκε ότι οι αρχιτεκτονικές LR PNN και GLR PNN παρέχουν καλύτερη απόδοση, σε σχέση με τα αυθεντικά PNN. 
/ This dissertation deals with speaker recognition in real-world conditions. The main emphasis falls on: (1) evaluation of various speech feature extraction approaches, (2) reduction of the impact of environmental interference on speaker recognition performance, and (3) the study of alternatives to the present state-of-the-art classification techniques. Specifically, within (1), a novel wavelet packet-based speech feature extraction scheme fine-tuned for speaker recognition is proposed. It is derived in an objective manner with respect to the speaker recognition performance, in contrast to the state-of-the-art MFCC scheme, which is based on an approximation of human auditory perception. Next, within (2), an advanced noise-robust feature extraction scheme based on MFCC is offered for improving speaker recognition performance in real-world environments. In brief, a model-based noise reduction technique adapted to the specifics of the speaker verification task is incorporated directly into the MFCC computation scheme. This approach demonstrated a significant advantage in real-world fast-varying environments. Finally, within (3), two novel classifiers referred to as the Locally Recurrent Probabilistic Neural Network (LR PNN) and the Generalized Locally Recurrent Probabilistic Neural Network (GLR PNN) are introduced. They are hybrids between the Recurrent Neural Network (RNN) and the Probabilistic Neural Network (PNN) and combine the virtues of the generative and discriminative classification approaches. Moreover, these novel neural networks are sensitive to temporal and spatial correlations among consecutive inputs and are therefore capable of exploiting the inter-frame correlations among speech features derived for successive speech frames. The experiments demonstrated that the LR PNN and GLR PNN architectures provide a performance benefit compared to the original PNN. 
Αναγνώριση ομιλητή Επιβεβαίωση ομιλητή Παράμετροι ομιλίας Πακέτα κυματομορφών Καταστολή θορύβου 006.454 Speaker recognition Speaker verification Hybrid classifiers Probabilistic neural networks Recurrent neural networks Speech features Wavelet packets Noise suppression
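The record above takes the original Probabilistic Neural Network (PNN) as its baseline. As background only, the sketch below shows the basic PNN decision rule: a Gaussian kernel is placed on every training pattern, the kernel responses are averaged per class, and the class with the largest average wins. It does not include the recurrent layer of the LR PNN or GLR PNN, and the function name, data layout and the smoothing value `sigma=0.5` are assumptions chosen for this example.

```python
import math

def pnn_classify(x, training, sigma=0.5):
    """Classify feature vector x with a basic probabilistic neural network.

    training: dict mapping a class label to a list of training vectors.
    sigma:    kernel width (smoothing parameter); the default is arbitrary.
    """
    scores = {}
    for label, patterns in training.items():
        total = 0.0
        for p in patterns:
            # Pattern layer: Gaussian kernel centred on one training vector.
            dist2 = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-dist2 / (2 * sigma ** 2))
        # Summation layer: average kernel response for this class.
        scores[label] = total / len(patterns)
    # Decision layer: pick the class with the highest estimated density.
    return max(scores, key=scores.get)
```

Because each input frame is scored independently here, a plain PNN ignores any dependence between consecutive frames, which is the limitation the locally recurrent variants in the abstract are described as addressing.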
8	Effective Speech Features for Cognitive Load Assessment: Classification and Regression Herms, Robert 03 June 2019 This thesis is about the effectiveness of speech features for cognitive load assessment, with particular attention being paid to new perspectives of this research area. A new cognitive load database, called CoLoSS, is introduced containing speech recordings of users who performed a learning task. Various acoustic features from different categories including prosody, voice quality, and spectrum are investigated in terms of their relevance. Moreover, Teager energy parameters, which have proven highly successful in stress detection, are introduced for cognitive load assessment and it is demonstrated how automatic speech recognition technology can be used to extract potential indicators. The suitability of the extracted features is systematically evaluated by recognition experiments with speaker-independent systems designed for discriminating between three levels of load. Additionally, a novel approach to speech-based cognitive load modelling is introduced, whereby the load is represented as a continuous quantity and its prediction can thus be regarded as a regression problem. / Die vorliegende Arbeit befasst sich mit der automatischen Erkennung von kognitiver Belastung auf Basis menschlicher Sprachmerkmale. Der Schwerpunkt liegt auf der Effektivität von akustischen Parametern, wobei die aktuelle Forschung auf diesem Gebiet um neuartige Ansätze erweitert wird. Hierzu wird ein neuer Datensatz – als CoLoSS bezeichnet – vorgestellt, welcher Sprachaufzeichnungen von Nutzern enthält und speziell auf Lernprozesse fokussiert. Zahlreiche Parameter der Prosodie, Stimmqualität und des Spektrums werden im Hinblick auf deren Relevanz analysiert. Darüber hinaus werden die Eigenschaften des Teager Energy Operators, welche typischerweise bei der Stressdetektion Verwendung finden, im Rahmen dieser Arbeit berücksichtigt. 
Ebenso wird gezeigt, wie automatische Spracherkennungssysteme genutzt werden können, um potenzielle Indikatoren zu extrahieren. Die Eignung der extrahierten Merkmale wird systematisch evaluiert. Dabei kommen sprecherunabhängige Klassifikationssysteme zur Unterscheidung von drei Belastungsstufen zum Einsatz. Zusätzlich wird ein neuartiger Ansatz zur sprachbasierten Modellierung der kognitiven Belastung vorgestellt, bei dem die Belastung eine kontinuierliche Größe darstellt und eine Vorhersage folglich als ein Regressionsproblem betrachtet werden kann. info:eu-repo/classification/ddc/000 ddc:000 info:eu-repo/classification/ddc/004 ddc:004 info:eu-repo/classification/ddc/006 ddc:006 Signalverarbeitung Sprachverarbeitung Maschinelles Lernen
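The record above refers to Teager energy parameters. For orientation, the discrete Teager energy operator is commonly defined as ψ[x(n)] = x(n)² − x(n−1)·x(n+1). The sketch below implements only that basic operator; how the thesis derives its parameters from it is not stated in the abstract, and the function name is an assumption for this example.

```python
import math

def teager_energy(x):
    """Apply the discrete Teager energy operator to a sequence of samples.

    Returns len(x) - 2 values, because the first and last samples
    lack one of the two neighbours the operator needs.
    """
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]
```

For a pure sinusoid A·cos(Ωn) the operator yields the constant A²·sin²(Ω), so its output tracks both amplitude and frequency, which is one reason it is often used for speech produced under stress.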
