Global ETD Search

11	Explicit Segmentation Of Speech For Indian Languages Ranjani, H G 03 1900 (has links) Speech segmentation is the process of identifying the boundaries between words, syllables or phones in the recorded waveforms of spoken natural languages. The lowest level of speech segmentation is the breakup and classification of the sound signal into a string of phones. The difficulty of this problem is compounded by the phenomenon of co-articulation of speech sounds. The classical solution to this problem is to manually label and segment spectrograms. In the first step of this two step process, a trained person listens to a speech signal, recognizes the word and phone sequence, and roughly determines the position of each phonetic boundary. The second step involves examining several features of the speech signal to place a boundary mark at the point where these features best satisfy a certain set of conditions specific for that kind of phonetic boundary. Manual segmentation of speech into phones is a highly time-consuming and painstaking process. Required for a variety of applications, such as acoustic analysis, or building speech synthesis databases for high-quality speech output systems, the time required to carry out this process for even relatively small speech databases can rapidly accumulate to prohibitive levels. This calls for automating the segmentation process. The state-of-art segmentation techniques use Hidden Markov Models (HMM) for phone states. They give an average accuracy of over 95% within 20 ms of manually obtained boundaries. However, HMM based methods require large training data for good performance. Another major disadvantage of such speech recognition based segmentation techniques is that they cannot handle very long utterances, Which are necessary for prosody modeling in speech synthesis applications. Development of Text to Speech (TTS) systems in Indian languages has been difficult till date owing to the non-availability of sizeable segmented speech databases of good quality. Further, no prosody models exist for most of the Indian languages. Therefore, long utterances (at the paragraph level and monologues) have been recorded, as part of this work, for creating the databases. This thesis aims at automating segmentation of very long speech sentences recorded for the application of corpus-based TTS synthesis for multiple Indian languages. In this explicit segmentation problem, we need to force align boundaries in any utterance from its known phonetic transcription. The major disadvantage of forcing boundary alignments on the entire speech waveform of a long utterance is the accumulation of boundary errors. To overcome this, we force boundaries between 2 known phones (here, 2 successive stop consonants are chosen) at a time. Here, the approach used is silence detection as a marker for stop consonants. This method gives around 89% (for Hindi database) accuracy and is language independent and training free. These stop consonants act as anchor points for the next stage. Two methods for explicit segmentation have been proposed. Both the methods rely on the accuracy of the above stop consonant detection stage. Another common stage is the recently proposed implicit method which uses Bach scale filter bank to obtain the feature vectors. The Euclidean Distance of the Mean of the Logarithm (EDML) of these feature vectors shows peaks at the point where the spectrum changes. The method performs with an accuracy of 87% within 20 ms of manually obtained boundaries and also achieves a low deletion and insertion rate of 3.2% and 21.4% respectively, for 100 sentences of Hindi database. The first method is a three stage approach. The first is the stop consonant detection stage followed by the next, which uses Quatieri’s sinusoidal model to classify sounds as voiced/unvoiced within 2 successive stop consonants. The final stage uses the EDML function of Bach scale feature vectors to further obtain boundaries within the voiced and unvoiced regions. It gives a Frame Error Rate (FER) of 26.1% for Hindi database. The second method proposed uses duration statistics of the phones of the language. It again uses the EDML function of Bach scale filter bank to obtain the peaks at the phone transitions and uses the duration statistics to assign probability to each peak being a boundary. In this method, the FER performance improves to 22.8% for the Hindi database. Both the methods are equally promising for the fact that they give low frame error rates. Results show that the second method outperforms the first, because it incorporates the knowledge of durations. For the proposed approaches to be useful, manual interventions are required at the output of each stage. However, this intervention is less tedious and reduces the time taken to segment each sentence by around 60% as compared to the time taken for manual segmentation. The approaches have been successfully tested on 3 different languages, 100 sentences each -Kannada, Tamil and English (we have used TIMIT database for validating the algorithms). In conclusion, a practical solution to the segmentation problem is proposed. Also, the algorithm being training free, language independent (ES-SABSF method) and speaker independent makes it useful in developing TTS systems for multiple languages reducing the segmentation overhead. This method is currently being used in the lab for segmenting long Kannada utterances, spoken by reading a set of 1115 phonetically rich sentences. Speech Processing Speech Segmentation Indian Languages - Speech Segmentation Stop-Consonant Detection Speech Segmentation - Algorithms Batch-Scale Filter Bank ES-SABSF Segmentation Method ES-DSBSF Segmentation Method Hidden Markov Models HMM) Text to Speech (TTS) Computer Science
12	Sélection de paramètres acoustiques pertinents pour la reconnaissance de la parole / Relevant acoustic feature selection for speech recognition Hacine-Gharbi, Abdenour 09 December 2012 (has links) L’objectif de cette thèse est de proposer des solutions et améliorations de performance à certains problèmes de sélection des paramètres acoustiques pertinents dans le cadre de la reconnaissance de la parole. Ainsi, notre première contribution consiste à proposer une nouvelle méthode de sélection de paramètres pertinents fondée sur un développement exact de la redondance entre une caractéristique et les caractéristiques précédemment sélectionnées par un algorithme de recherche séquentielle ascendante. Le problème de l’estimation des densités de probabilités d’ordre supérieur est résolu par la troncature du développement théorique de cette redondance à des ordres acceptables. En outre, nous avons proposé un critère d’arrêt qui permet de fixer le nombre de caractéristiques sélectionnées en fonction de l’information mutuelle approximée à l’itération j de l’algorithme de recherche. Cependant l’estimation de l’information mutuelle est difficile puisque sa définition dépend des densités de probabilités des variables (paramètres) dans lesquelles le type de ces distributions est inconnu et leurs estimations sont effectuées sur un ensemble d’échantillons finis. Une approche pour l’estimation de ces distributions est basée sur la méthode de l’histogramme. Cette méthode exige un bon choix du nombre de bins (cellules de l’histogramme). Ainsi, on a proposé également une nouvelle formule de calcul du nombre de bins permettant de minimiser le biais de l’estimateur de l’entropie et de l’information mutuelle. Ce nouvel estimateur a été validé sur des données simulées et des données de parole. Plus particulièrement cet estimateur a été appliqué dans la sélection des paramètres MFCC statiques et dynamiques les plus pertinents pour une tâche de reconnaissance des mots connectés de la base Aurora2. / The objective of this thesis is to propose solutions and performance improvements to certain problems of relevant acoustic features selection in the framework of the speech recognition. Thus, our first contribution consists in proposing a new method of relevant feature selection based on an exact development of the redundancy between a feature and the feature previously selected using Forward search algorithm. The estimation problem of the higher order probability densities is solved by the truncation of the theoretical development of this redundancy up to acceptable orders. Moreover, we proposed a stopping criterion which allows fixing the number of features selected according to the mutual information approximated at the iteration J of the search algorithm. However, the mutual information estimation is difficult since its definition depends on the probability densities of the variables (features) in which the type of these distributions is unknown and their estimates are carried out on a finite sample set. An approach for the estimate of these distributions is based on the histogram method. This method requires a good choice of the bin number (cells of the histogram). Thus, we also proposed a new formula of computation of bin number that allows minimizing the estimator bias of the entropy and mutual information. This new estimator was validated on simulated data and speech data. More particularly, this estimator was applied in the selection of the static and dynamic MFCC parameters that were the most relevant for a recognition task of the connected words of the Aurora2 base. Reconnaissance de la parole Paramètres acoustiques Coefficients MFCC Modèles de Markov cachés (MMC) Entropie Information mutuelle Histogramme Nombre de bins Sélection des paramètres Pertinence Redondance Biais Speech recognition Acoustic feature MFCC coefficient Hidden Markov models (HMM) Entropy Mutual information Histogram Bins number Feature selection Relevance Redundancy Bias

Search results

Explicit Segmentation Of Speech For Indian Languages

Sélection de paramètres acoustiques pertinents pour la reconnaissance de la parole / Relevant acoustic feature selection for speech recognition