Global ETD Search

11	Spectro-Temporal Features For Robust Automatic Speech Recognition Suryanarayana, Venkata K 01 1900 The speech signal is inherently characterized by its variations in time, which are reflected as variations in frequency. The spectro-temporal changes are due to changes in the vocal tract, intonation, co-articulation and the successive articulation of different phonetic sounds. In this thesis we aim to improve speech recognition performance through better feature parameters, using a non-stationary model of speech. One effective means of modeling a general non-stationary signal is the AM-FM model. The AM-FM model can be extended to speech through a sub-band analysis, which can mimic auditory analysis. In this thesis, we explore new methods for estimating AM and FM parameters based on non-uniform samples of the signal. The non-uniform sampling approach, together with adaptive window estimation, provides an important advantage because it yields a multi-resolution analysis. We develop several new methods based on zero-crossing (ZC) intervals, local extrema intervals and the signal derivative at ZCs as different sample measures of the signal, and explore their effectiveness for instantaneous frequency (IF) and instantaneous envelope (IE) estimation. For automatic speech recognition, we explore auditory-motivated spectro-temporal information using an auditory filter bank; signal parameters (or features) are derived from the instantaneous energy in each band using the non-linear energy operator over a larger window length. The temporal correlation present in the signal is exploited by applying the DCT and retaining only the first few DCT coefficients, which capture the trend of the energy in each band. The DCT coefficients from different frequency bands are concatenated, and further spectral decorrelation is achieved through the KLT (Karhunen-Loeve Transform) of the concatenated feature vector. 
The changes in the vocal tract are well captured by the change in the formant structure; to emphasize these details for ASR, we define a temporal formant using the AM-FM decomposition of sub-band speech. Uniform wideband non-overlapping filters are used for the sub-band decomposition. The temporal formant is defined using the AM-FM parameters of each sub-band signal. The temporal evolution of a formant is represented by the lower-order DCT coefficients of the temporal formant in each band, and its use for ASR is explored. To address the robustness of ASR performance under noisy environmental conditions, we use a hybrid approach that enhances the speech signal using statistical models of the speech and the noise. The use of GMMs for statistical speech enhancement has been shown to be effective. It is found that the spectro-temporal features derived from enhanced speech provide a further improvement in ASR performance. Speech Recognition Speech Signal Processing Automatic Speech Recognition (ASR) Robust Speech Recognition AM-FM Modeling Computer Science
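The abstract above derives band features from a non-linear energy operator averaged over a window. As a minimal sketch, assuming the operator meant is the discrete Teager-Kaiser form (the abstract does not name it), the per-window energy could be computed as below. The function names, the window length and the test tone are hypothetical and are not taken from the thesis.

```python
import math

def teager_kaiser_energy(x):
    """Discrete Teager-Kaiser operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Assumed form of the 'non-linear energy operator'; the output is two
    samples shorter than the input because the end points have no neighbours.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def mean_band_energy(x, win):
    """Average the operator output over non-overlapping windows of `win` samples."""
    psi = teager_kaiser_energy(x)
    return [sum(psi[i:i + win]) / win for i in range(0, len(psi) - win + 1, win)]

# Hypothetical test tone: amplitude 2.0, digital frequency 0.3 rad/sample.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(200)]
energies = mean_band_energy(tone, win=50)
```

For a pure sinusoid the operator output is constant and equal to the squared amplitude times the squared sine of the digital frequency, so each window average tracks both the envelope and the frequency of the band signal. In a feature pipeline of the kind described, one such energy trajectory per filter-bank channel would then be passed to the DCT.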
12	Speech Signal Classification Using Support Vector Machines Sood, Gaurav 07 1900 Hidden Markov Models (HMMs) are, undoubtedly, the most widely employed core technique for Automatic Speech Recognition (ASR). Nevertheless, we are still far from achieving high-performance ASR systems. Some alternative approaches, most of them based on Artificial Neural Networks (ANNs), were proposed during the late eighties and early nineties. Some of them tackled the ASR problem using predictive ANNs, while others proposed hybrid HMM/ANN systems. Despite some achievements, however, the dependency on Hidden Markov Models remains a fact today. During the last decade, a new tool appeared in the field of machine learning that has proved able to cope with hard classification problems in several fields of application: the Support Vector Machine (SVM). SVMs are effective discriminative classifiers with several outstanding characteristics, namely: their solution is the one with maximum margin; they can handle samples of very high dimensionality; and their convergence to the minimum of the associated cost function is guaranteed. In this work, a novel approach based on probabilistic kernels in support vector machines has been attempted for speech data classification. The classification accuracy of support vector classification depends on the kernel function used, which in turn depends on the data set at hand. As of now, however, there is no way to know a priori which kernel will give the best results. The kernel used in this work normalizes the time dimension inherent to speech signals by fitting a probability distribution over individual data points; this facilitates the use of support vector machines, which act on static data only. The divergence between the probability distributions fitted over individual speech utterances is used to form the kernel matrix. 
Vowel classification and isolated word recognition (digit recognition) have been attempted, and the results are compared with state-of-the-art systems. Speech Recognition Speech Signal Processing Automatic Speech Recognition Artificial Neural Networks Support Vector Machine Time Normalization Hidden Markov Models (HMMs) Computer Science
13	Speech Encryption Using Wavelet Packets Bopardikar, Ajit S 02 1900 The aim of speech scrambling algorithms is to transform clear speech into an unintelligible signal so that it is difficult to decrypt in the absence of the key. Most existing speech scrambling algorithms tend to retain considerable residual intelligibility in the scrambled speech and are easy to break. Typically, a speech scrambling algorithm involves permutation of speech segments in the time, frequency or time-frequency domain, or permutation of the transform coefficients of each speech block. The time-frequency algorithms have given very low residual intelligibility and have attracted much attention. We first study the uniform filter bank based time-frequency scrambling algorithm with respect to the block length and the number of channels. We use objective distance measures to estimate the departure of the scrambled speech from the clear speech. Simulations indicate that the distance measures increase as we increase the block length and the number of channels. This algorithm derives its security only from the time-frequency segment permutation, and it has been estimated that the effective number of permutations that give a low residual intelligibility is much smaller than the total number of possible permutations. In order to increase the effective number of permutations, we propose a time-frequency scrambling algorithm based on wavelet packets. By using different wavelet packet filter banks at the analysis and synthesis ends, we add an extra level of security, since the eavesdropper has to choose the correct analysis filter bank, correctly rearrange the time-frequency segments, and choose the correct synthesis bank to recover the original speech signal. Simulations performed with this algorithm give distance measures comparable to those obtained for the uniform filter bank based algorithm. 
Finally, we introduce the 2-channel perfect reconstruction circular convolution filter bank and give a simple method for its design. The filters designed using this method satisfy the paraunitary properties on a discrete equispaced set of points in the frequency domain. Electrical Communication Wavelets Speech - Signal Processing Speech Encryption Speech Scrambling Algorithm Filter Banks Wavelet Packets Scrambling Algorithms Wavelet Transform Speech Scrambling Wavelet Packet
14	Análise cepstral baseada em diferentes famílias transformada wavelet / Cepstral analysis based on different family of wavelet transform Fabrício Lopes Sanchez 02 December 2008 Este trabalho apresenta um estudo comparativo entre diferentes famílias de transformada Wavelet aplicadas à análise cepstral de sinais digitais de fala humana, com o objetivo específico de determinar o período de pitch dos mesmos e, ao final, propõe um algoritmo diferencial para realizar tal operação, levando-se em consideração aspectos importantes do ponto de vista computacional, tais como: desempenho, complexidade do algoritmo, plataforma utilizada, dentre outros. São apresentados também, os resultados obtidos através da implementação da nova técnica (baseada na transformada wavelet) em comparação com a abordagem tradicional (baseada na transformada de Fourier). A implementação da técnica foi testada em linguagem C++ padrão ANSI sob as plataformas Windows XP Professional SP3, Windows Vista Business SP1, Mac OSX Leopard e Linux Mandriva 10. / This work presents a comparative study of different wavelet transform families applied to the cepstral analysis of digital human speech signals, with the specific objective of determining their pitch period. It then proposes a differential algorithm for this task, taking into account important computational aspects such as performance, algorithm complexity and the platform used, among others. The results obtained by implementing the new technique (based on the wavelet transform) are also presented and compared with the traditional approach (based on the Fourier transform). The implementation was tested in ANSI-standard C++ on Windows XP Professional SP3, Windows Vista Business SP1, Mac OS X Leopard and Linux Mandriva 10. 
Análise cepstral Período de pitch Processamento digital de sinais de fala Transformada discreta de Fourier Transformada discreta wavelet Cepstrum analysis Digital speech signal processing Discrete Fourier transform Discrete wavelet transform Pitch period
15	Vokinesis : instrument de contrôle suprasegmental de la synthèse vocale / Vokinesis : an instrument for suprasegmental control of voice synthesis Delalez, Samuel 28 November 2017 (has links) Ce travail s'inscrit dans le domaine du contrôle performatif de la synthèse vocale, et plus particulièrement de la modification temps-réel de signaux de voix pré-enregistrés. Dans un contexte où de tels systèmes n'étaient en mesure de modifier que des paramètres de hauteur, de durée et de qualité vocale, nos travaux étaient centrés sur la question de la modification performative du rythme de la voix. Une grande partie de ce travail de thèse a été consacrée au développement de Vokinesis, un logiciel de modification performative de signaux de voix pré-enregistrés. Il a été développé selon 4 objectifs: permettre le contrôle du rythme de la voix, avoir un système modulaire, utilisable en situation de concert ainsi que pour des applications de recherche. Son développement a nécessité une réflexion sur la nature du rythme vocal et sur la façon dont il doit être contrôlé. Il est alors apparu que l'unité rythmique inter-linguistique de base pour la production du rythme vocale est de l'ordre de la syllabe, mais que les règles de syllabification sont trop variables d'un langage à l'autre pour permettre de définir un motif rythmique inter-linguistique invariant. Nous avons alors pu montrer que le séquencement précis et expressif du rythme vocal nécessite le contrôle de deux phases, qui assemblées forment un groupe rythmique: le noyau et la liaison rythmiques. Nous avons mis en place plusieurs méthodes de contrôle rythmique que nous avons testées avec différentes interfaces de contrôle. Une évaluation objective a permis de valider l'une de nos méthodes du point de vue de la précision du contrôle rythmique. De nouvelles stratégies de contrôle de la hauteur et de paramètres de qualité vocale avec une tablette graphique ont été mises en place. 
Une réflexion sur la pertinence de cette interface au regard de l'essor des nouvelles interfaces musicales continues nous a permis de conclure que la tablette est la mieux adaptée au contrôle expressif de l'intonation (parole), mais que les PMC (Polyphonic Multidimensional Controllers) sont mieux adaptés au contrôle de la mélodie (chant, ou autres instruments). Le développement de Vokinesis a également nécessité la mise en place de la méthode de traitement de signal VoPTiQ (Voice Pitch, Time and Quality modification), combinant une adaptation de l'algorithme RT-PSOLA et des techniques particulières de filtrage pour les modulations de qualité vocale. L'utilisation musicale de Vokinesis a été évaluée avec succès dans le cadre de représentations publiques du Chorus Digitalis, pour du chant de type variété ou musique contemporaine. L'utilisation dans un cadre de musique électro a également été explorée par l'interfaçage du logiciel de création musicale Ableton Live à Vokinesis. Les perspectives d'application sont multiples: études scientifiques (recherches en prosodie, en parole expressive, en neurosciences...), productions sonores et musicales, pédagogie des langues, thérapies vocales. / This work belongs to the field of performative control of voice synthesis, and more precisely to the real-time modification of pre-recorded voice signals. In a context where such systems were only capable of modifying parameters such as pitch, duration and voice quality, our work centered on the question of performative modification of voice rhythm. A significant part of this thesis was devoted to the development of Vokinesis, a program for the performative modification of pre-recorded voice. It was developed with four goals: to allow control of voice rhythm, and to obtain a system that is modular and usable both in public performance and in research applications. 
This development required reflection on the nature of voice rhythm and on how it should be controlled. It appeared that the basic inter-linguistic rhythmic unit is syllable-sized, but that syllabification rules are too language-dependent to provide an invariant inter-linguistic rhythmic pattern. We showed that accurate and expressive sequencing of vocal rhythm is achieved by controlling the timing of two phases, which together form a rhythmic group: the rhythmic nucleus and the rhythmic link. We developed several rhythm control methods and tested them with several control interfaces. An objective evaluation showed that one of our methods allows very accurate control of rhythm. New strategies for controlling voice pitch and voice quality with a graphic tablet have been established. Reflection on the pertinence of graphic tablets for pitch control, given the rise of new continuous musical interfaces, led us to the conclusion that they best fit intonation control (speech), but that PMCs (Polyphonic Multidimensional Controllers) are better suited to melodic control (singing, or other instruments). The development of Vokinesis also required the implementation of the VoPTiQ (Voice Pitch, Time and Quality modification) signal processing method, which combines an adaptation of the RT-PSOLA algorithm with specific filtering techniques for voice quality modulations. The use of Vokinesis as a musical instrument has been successfully evaluated in public performances of the Chorus Digitalis ensemble, for various singing styles (from pop to contemporary music). Its use for electro music has also been explored by interfacing the Ableton Live composition environment with Vokinesis. Application perspectives are diverse: scientific studies (research in prosody, expressive speech, neurosciences), sound and music production, language learning and teaching, and speech therapies. 
Synthèse vocale Interaction humain-machine Informatique musicale Prosodie Traitement du signal vocal Vocal synthesis Human-Computer Interaction Computer music Speech prosody Speech signal processing
16	Music And Speech Analysis Using The 'Bach' Scale Filter-Bank Ananthakrishnan, G 04 1900 The aim of this thesis is to define a perceptual scale for the ‘Time-Frequency’ analysis of music signals. The equal tempered ‘Bach’ scale is a suitable scale, since it covers most of the genres of music and the error is equally distributed for each semi-tone. However, it may be necessary to allow a tolerance of around 50 cents or half the interval of the Bach scale, so that the interval can accommodate other common intonation schemes. The thesis covers the formulation of the Bach scale filter-bank as a time-varying model. It makes a comparative study with other commonly used perceptual scales. Two applications for the Bach scale filter-bank are also proposed, namely automated segmentation of speech signals and transcription of singing voice for query-by-humming applications. Even though this filter-bank is suggested with a motivation from music, it could also be applied to speech. A method for automatically segmenting continuous speech into phonetic units is proposed. The results, obtained from the proposed method, show around 82% accuracy for the English and 85% accuracy for the Hindi databases. This is an improvement of around 2-3% when the performance is compared with other popular methods in the literature. Interestingly, the Bach scale filters perform better than the filters designed for other common perceptual scales, such as Mel and Bark scales. ‘Musical transcription’ refers to the process of converting a musical rendering or performance into a set of symbols or notations. A query in a ‘query-by-humming system’ can be made in several ways, some of which are singing with words, or with arbitrary syllables, or whistling. Two algorithms are suggested to annotate a query. The algorithms are designed to be fairly robust for these various forms of queries. The first algorithm is a frequency selection based method. 
It works on the basis of selecting the most likely frequency components at any given time instant. The second algorithm works on the basis of finding time-connected contours of high energy in the ‘Time-Frequency’ plane of the input signal. The time domain algorithm works better in terms of instantaneous pitch estimates. It results in an error of around 10-15%, while the frequency domain method results in an error of around 12-20%. A song rendered by two different people will have quite a few different properties. Their absolute pitches, rates of rendering, timbres based on voice quality and inaccuracies, may be different. The thesis discusses a method to quantify the distance between two different renderings of musical pieces. The distance function has been evaluated by attempting a search for a particular song from a database of a size of 315, made up of songs sung by both male and female singers and whistled queries. Around 90% of the time, the correct song is found among the top five best choices picked. Thus, the Bach scale has been proposed as a suitable scale for representing the perception of music. It has been explored in two applications, namely automated segmentation of speech and transcription of singing voices. Using the transcription obtained, a measure of the distance between renderings of musical pieces has also been suggested. Speech Analysis Speech Processing Filter Bank Musical Transcription Speech Recognition Speech - Signal Processing Audio Signals Music - Pitch Tracking Algorithms Music Signals - Analysis Time Frequency Analysis Bach Scale Automated Speech Segmentation Computer Science
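The equal-tempered scale and the 50-cent tolerance mentioned in this abstract can be illustrated with a short sketch. It assumes a reference pitch of 440 Hz and the standard definitions of a semitone (a frequency ratio of 2^(1/12)) and a cent (1/100 of a semitone); the reference value and the function names are illustrative choices, not details from the thesis.

```python
import math

A4_HZ = 440.0  # assumed reference pitch; the abstract does not state one

def semitone_frequency(k, ref_hz=A4_HZ):
    """Frequency k equal-tempered semitones above (k > 0) or below (k < 0) the reference."""
    return ref_hz * 2.0 ** (k / 12.0)

def cents(f_hz, ref_hz):
    """Interval from ref_hz to f_hz in cents (100 cents = one equal-tempered semitone)."""
    return 1200.0 * math.log2(f_hz / ref_hz)

def nearest_semitone(f_hz, ref_hz=A4_HZ):
    """Return (semitone index, deviation in cents) for the closest scale step."""
    offset = cents(f_hz, ref_hz)
    k = round(offset / 100.0)
    return k, offset - 100.0 * k

# A slightly sharp 450 Hz tone still maps to step 0 (the reference itself).
step, deviation = nearest_semitone(450.0)
```

Because every frequency lies within 50 cents of some scale step, a tolerance of half a semitone on either side of each centre frequency covers the whole frequency axis, which is one way to read the abstract's remark about accommodating other intonation schemes.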
17	The language learning infant: Effects of speech input, vocal output, and feedback Gustavsson, Lisa January 2009 This thesis studies the characteristics of the acoustic signal in speech, especially in speech directed to infants and in infant vocal development, to gain insight into essential aspects of speech processing, speech production and communicative interaction in early language acquisition. Three sets of experimental studies are presented in this thesis. From a phonetic point of view they investigate the fundamental processes involved in first language acquisition. The first set (study 1.1 and study 1.2) investigated how linguistic structure in the speech signal can be derived and which strategy infants and adults use to process information depending on its presentation. The second set (study 2.1 and study 2.2) studied acoustic consequences of the anatomical geometry of the infant vocal tract and the development of sensory-motor control for articulatory strategies. The third set of studies (study 3.1 and study 3.2) explored the infant's interaction with the linguistic environment, specifically how vocal imitation and reinforcement may assist infants to converge towards adult-like speech. The first set of studies suggests that the structure and quality of simultaneous sensory input affect the establishment of initial linguistic representations. The second set indicates that the anatomy of the infant vocal tract does not constrain the production of adult-like speech sounds and that some degree of articulatory motor control is present from six months of age. The third set of studies suggests that the adult interprets and reinforces vocalizations produced by the infant in a developmentally-adjusted fashion that can guide the infant towards the sounds of the ambient language. 
The results are discussed in terms of essential aspects of early speech processing and speech production that can be accounted for by biological general-purpose mechanisms in the language learning infant. / To order the book, send an e-mail to exp@ling.su.se human language language acquisition perception production humanoid development model embodied system speech signal processing vocal tract morphology acoustic speech input information processing scaling interaction growth infant imitation feedback perceptual salience modeling Phonetics Fonetik
18	Timbre Perception of Time-Varying Signals Arthi, S January 2014 Every auditory event provides an information-rich signal to the brain. The signal comprises the perceptual attributes of pitch, loudness and timbre, as well as conceptual attributes such as location, emotion and meaning. In the present work we examine the timbre perception of time-varying signals in particular. While the timbre of a stationary signal is by itself perceptually complex, the timbre of a time-varying signal introduces an evolving pattern that adds to its multi-dimensionality. To characterize timbre, we conduct psycho-acoustic perception tests with normal-hearing human subjects. We focus on time-varying synthetic speech signals (the approach can be extended to music) because listeners are perceptually consistent with speech. Also, we can parametrically control the timbre and pitch glides using linear time-varying models. In order to quantify the timbre change in time-varying signals, we define the JND (just noticeable difference) of timbre using diphthongs synthesized with a time-varying formant frequency model. The diphthong JND is defined as a two-dimensional contour on the plane of percentage change of the formant frequencies of the terminal vowels. Thus, we simplify the perceptual probing to a lower-dimensional space, i.e., 2-D, even for a diphthong, which is multi-parametric. We also study the impact of pitch glide on the timbre JND of the diphthong. It is observed that the timbre JND is influenced by the occurrence of a pitch glide. Focusing on the magnitude of perceptual timbre change, we design a MUSHRA-like listening test using the vowel continuum in the formant-frequency space. We provide explicit anchors for reference, 0% and 100%, thus quantifying the perceptual timbre change on a 1-D scale. We also propose an objective measure of timbre change and observe that there is good correlation between the objective measure and the subjective human responses of percentage timbre change. 
Using the above experimental methodology, we studied the influence of pitch shift on timbre perception and observed that the perceptual timbre change increases with the change in pitch. We used vowels and diphthongs with 5 different types of pitch glide: (i) constant pitch, (ii) 3-semitone linearly-up, (iii) 3-semitone linearly-down, (iv) V-like pitch glide and (v) hat-like pitch glide. The present study shows that timbre change can be measured on a 1-D scale if the perturbation is along one dimension. We observe that for bright vowels (/a/ and /i/), a linearly decreasing pitch glide (dull pitch glide) causes more timbre change than a linearly increasing pitch glide (bright pitch glide). For dull vowels (/u/), it is vice versa. To summarize, incongruent pitch glides cause more perceptual timbre change than congruent pitch glides. (A congruent pitch glide is a bright pitch glide in a bright vowel or a dull pitch glide in a dull vowel; an incongruent pitch glide is a bright pitch glide in a dull vowel or a dull pitch glide in a bright vowel.) Experiments with quadratic pitch glides show that the decay portion of the pitch glide affects timbre perception more than the attack portion in short-duration signals with little or no sustained part. In the case of time-varying timbre, bright diphthongs show patterns similar to bright vowels. Also, for bright diphthongs (/ai/), the perceived timbre change is greatest with a decreasing pitch glide (dull pitch glide). We also observed that listeners perceive more timbre change with constant pitch than with pitch glides that are congruent with the timbre or with quadratic pitch glides. The main conclusion of this study is that pitch and timbre do interact, and that incongruent pitch glides cause more timbre change than congruent pitch glides. In the case of quadratic pitch glides, listener perception of vowels is influenced more by the decay than by the attack of the pitch glide in short-duration signals. 
In the case of time-varying timbre also, incongruent pitch glides cause the most timbre change, followed by constant pitch. For congruent pitch glides and quadratic pitch glides in time-varying timbre, the listeners perceive less timbre change than otherwise. Time-Varying Signals Signal Perception Timbre Perception Time-Varying Synthetic Speech Signals Signal Processing Diphthongs Pitch Glides Speech Signal Processing Time-Varying Signal Timbre Linear Time-varying (LTV) Model Pitch and Timbre Communication Engineering
19	Akustická analýza vět složitých na artikulaci u pacientů s Parkinsonovou nemocí / Acoustic analysis of sentences complicated for articulation in patients with Parkinson's disease Kiska, Tomáš January 2015 This work deals with the design of a system for analysing hypokinetic dysarthria, a speech motor dysfunction that is present in approximately 90% of patients with Parkinson's disease. It first describes Parkinson's disease and the changes it causes in the speech signal, and then the indicators used to diagnose Parkinson's disease (FCR, VSA, VAI, etc.). The work is mainly focused on parameterization techniques that can be used to diagnose or monitor the disease as well as to estimate its progress. A protocol for acquiring dysarthric speech is also described. In combination with acoustic analysis, it can be used to estimate the grade of hypokinetic dysarthria in the fields of faciokinesis, phonorespiration and phonetics (correlation with the 3F test). Regarding the parameterization, new features based on the RASTA method are proposed. The analysis is based on the parameterization of sentences that are difficult to articulate. The experimental dataset consists of 101 PD patients at different stages of disease progress and 53 healthy controls. The mRMR method was selected for feature selection prior to classification.
20	Analýza Parkinsonovy nemoci pomocí segmentálních řečových příznaků / Analysis of Parkinson's disease using segmental speech parameters Mračko, Peter January 2015 This project describes the design of a system for diagnosing Parkinson's disease from speech. Parkinson's disease is a neurodegenerative disorder of the central nervous system. One of its symptoms is an impairment of the motor aspects of speech, called hypokinetic dysarthria. The system design in this work is based on well-known segmental features such as LPC, PLP, MFCC and LPCC coefficients, as well as less common ones such as CMS, ACW and MSC. These coefficients are computed from speech recordings of patients with Parkinson's disease and of healthy controls, followed by a feature selection process and subsequent classification. The best result obtained in this project was a classification accuracy of 77.19%, a sensitivity of 74.69% and a specificity of 78.95%.
