561. Words have power: Speech recognition in interactive jewelry: a case study with newcome LGBT+ immigrants. Poikolainen Rosén, Anton, January 2017.
This paper presents a design exploration of interactive jewelry conducted with newcome LGBT+ immigrants in Sweden, leading to a necklace named PoWo that is "powered" by the spoken word: a mobile application reacts to customizable keywords and triggers LED lights in the necklace. Interactive jewelry is viewed in this paper as a medium with a simultaneous relation to wearer and spectator, thus affording use around the themes of symbolism, emotion, body and communication. These themes are demonstrated through specific use scenarios of the necklace relating to the participants of the design exploration, e.g. addressing consent, societal issues, meeting situations, and expressions of love and sexuality. The potential of speech-based interactive jewelry is investigated, e.g. finding that speech recognition in LED jewelry can act as an amplifier of spoken words, actions and meaning, and as a visible extension of the smartphone and the human body. In addition, the use qualities of visibility, ambiguity, continuity and fluency are discussed in relation to speech-based LED jewelry.
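To make the interaction pattern concrete, a minimal sketch of keyword-triggered feedback is shown below; the keyword list, the transcript input, and the set_led function are hypothetical stand-ins for the mobile application's speech recognizer and the necklace's LED controller, not the PoWo implementation itself.

```python
# Minimal sketch of keyword-triggered LED feedback, assuming a speech
# recognizer that yields text transcripts and an LED controller exposed
# as a simple function. Both are hypothetical stand-ins.

CUSTOM_KEYWORDS = {"power", "love", "yes"}  # user-customisable trigger words


def set_led(on: bool) -> None:
    """Placeholder for the call that lights the necklace."""
    print("LED on" if on else "LED off")


def react_to_transcript(transcript: str) -> None:
    """Light the LED whenever a customisable keyword is spoken."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    set_led(bool(words & CUSTOM_KEYWORDS))


if __name__ == "__main__":
    react_to_transcript("Yes, words have power!")  # -> LED on
    react_to_transcript("Nothing to see here.")    # -> LED off
```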
562. An acoustic comparison of the vowels and diphthongs of first-language and African-mother-tongue South African English. Brink, Janus Daniel, 31 October 2005.
Speaker accent influences the accuracy of automatic speech recognition (ASR) systems. Knowledge of accent-based acoustic variations can therefore be used in the development of more robust systems. This project investigates the differences between first-language (L1) and second-language (L2) English in South Africa with respect to vowels and diphthongs. The study is specifically aimed at L2 English speakers with a native African mother tongue, for instance speakers of isiZulu, isiXhosa, Tswana or South Sotho. The vowel systems of English and the African languages, as described in the linguistic literature, are compared to predict the expected deviations of L2 South African English from L1. A number of vowels and diphthongs from L1 and L2 speakers are acoustically compared and the results are correlated with the linguistic predictions. The comparison is first made in formant space using the first three formants found with the Split Levinson algorithm. The L1 vowel centroids and diphthong trajectories in this three-dimensional space are then compared to their L2 counterparts using analysis of variance. The second analysis method is based on simple hidden Markov models (HMMs) using Mel-scaled cepstral features. Each HMM models a vowel or diphthong from one of the two speaker groups, and analysis of variance is again used to compare the L1 and L2 HMMs. Significant differences are found in the vowel and diphthong qualities of the two language groups, which support the linguistically predicted effects such as vowel substitution, peripheralisation and changes in diphthong strength. The long-term goal of this project is to enable the adaptation of existing L1 English recognition systems to perform equally well on South African L2 English. Dissertation (MEng (Computer Engineering)), University of Pretoria, 2005. Department of Electrical, Electronic and Computer Engineering.
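As an illustration of the formant-space comparison described above, the sketch below runs a one-way ANOVA per formant on two small groups of vowel tokens; the formant values are invented placeholders, and scipy.stats.f_oneway stands in for the dissertation's full analysis of variance.

```python
# Sketch: comparing L1 and L2 vowel centroids in formant space with a
# one-way ANOVA per formant. The formant values below are made up for
# illustration; real values would come from a formant tracker such as
# the Split Levinson algorithm used in the dissertation.
import numpy as np
from scipy.stats import f_oneway

# F1/F2/F3 (Hz) for tokens of one vowel, per speaker group.
l1_tokens = np.array([[310, 2100, 2900],
                      [305, 2150, 2950],
                      [320, 2080, 2890]], dtype=float)
l2_tokens = np.array([[360, 1950, 2800],
                      [355, 1990, 2820],
                      [370, 1930, 2780]], dtype=float)

for i, name in enumerate(["F1", "F2", "F3"]):
    stat, p = f_oneway(l1_tokens[:, i], l2_tokens[:, i])
    print(f"{name}: centroid L1={l1_tokens[:, i].mean():.0f} Hz, "
          f"L2={l2_tokens[:, i].mean():.0f} Hz, F={stat:.2f}, p={p:.3f}")
```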
563. Attelage de systèmes de transcription automatique de la parole (Harnessing automatic speech transcription systems). Bougares, Fethi, 23 November 2012.
This thesis presents work in the area of Large Vocabulary Continuous Speech Recognition (LVCSR) system combination. It focuses on methods for harnessing heterogeneous transcription systems in order to improve recognition quality under latency constraints. Automatic Speech Recognition (ASR) is affected by the many sources of variability present in the speech signal, and a single ASR system is usually unable to model all of them. Combination methods therefore exploit the strengths of several recognizers, developed at different research sites with different recognition strategies, to produce an improved final transcription. System combination techniques are usually applied within a multi-pass ASR architecture: the outputs of two or more ASR systems are combined to estimate the most likely hypothesis among conflicting word pairs or differing hypotheses for the same part of an utterance, at the cost of the considerable latency induced by waiting for all systems before combining.

The contribution of this thesis is twofold. First, we study the robustness of the integrated driven decoding combination method (DDA), which guides the search algorithm of a primary ASR system with the one-best hypotheses of auxiliary systems, and we propose an efficient, generalizable improvement called BONG that uses bag-of-n-gram auxiliary hypotheses for the driven decoding. Second, we propose a framework for harnessing several single-pass recognizers so that they collaboratively build the final recognition hypothesis at reduced latency. We study various theoretical harnessing architectures and present an example implementation based on a distributed client/server architecture. We then propose combination methods adapted to this architecture: an extension of BONG for the low-latency collaboration of several single-pass recognizers running in parallel, and an adaptation of ROVER that is applied during the decoding process through a local alignment step followed by voting based on word frequencies. Both methods reduce the latency of combining several single-pass systems while yielding a significant WER improvement.
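As a rough illustration of the ROVER-style word-frequency vote mentioned above (not the thesis's on-the-fly adaptation, which performs local alignment during decoding), the sketch below votes over hypotheses that are assumed to be already aligned position by position.

```python
# Simplified sketch of ROVER-style combination: once hypotheses from
# several single-pass recognizers have been aligned position by
# position, each slot is decided by a frequency vote. The alignment
# step is assumed to have been done already; '*' marks a deletion.
from collections import Counter


def rover_vote(aligned_hypotheses):
    """Pick the most frequent word at each aligned position."""
    combined = []
    for slot in zip(*aligned_hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        if word != "*":          # drop slots where the winning vote is a deletion
            combined.append(word)
    return combined


hyps = [
    ["the", "cat", "sat", "*",  "the", "mat"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a",   "cat", "sat", "on", "the", "mat"],
]
print(" ".join(rover_vote(hyps)))  # -> "the cat sat on the mat"
```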
564. Auditory-based processing of communication sounds. Walters, Thomas C., January 2011.
This thesis examines the possible benefits of adapting a biologically-inspired model of human auditory processing as part of a machine-hearing system. Features were generated by an auditory model, and used as input to machine learning systems to determine the content of the sound. Features were generated using the auditory image model (AIM) and were used for speech recognition and audio search. AIM comprises processing to simulate the human cochlea, and a 'strobed temporal integration' process which generates a stabilised auditory image (SAI) from the input sound.

The communication sounds which are produced by humans, other animals, and many musical instruments take the form of a pulse-resonance signal: pulses excite resonances in the body, and the resonance following each pulse contains information both about the type of object producing the sound and its size. In the case of humans, vocal tract length (VTL) determines the size properties of the resonance.

In the speech recognition experiments, an auditory filterbank was combined with a Gaussian fitting procedure to produce features which are invariant to changes in speaker VTL. These features were compared against standard mel-frequency cepstral coefficients (MFCCs) in a size-invariant syllable recognition task. The VTL-invariant representation was found to produce better results than MFCCs when the system was trained on syllables from simulated talkers of one range of VTLs and tested on those from simulated talkers with a different range of VTLs.

The image stabilisation process of strobed temporal integration was analysed. Based on the properties of the auditory filterbank being used, theoretical constraints were placed on the properties of the dynamic thresholding function used to perform strobe detection. These constraints were used to specify a simple, yet robust, strobe detection algorithm.

The syllable recognition system described above was then extended to produce features from profiles of the SAI and tested with the same syllable database as before. For clean speech, performance of the features was comparable to that of those generated from the filterbank output. However when pink noise was added to the stimuli, performance dropped more slowly as a function of signal-to-noise ratio when using the SAI-based AIM features than when using either the filterbank-based features or the MFCCs, demonstrating the noise-robustness properties of the SAI representation.

The properties of the auditory filterbank in AIM were also analysed. Three models of the cochlea were considered: the static gammatone filterbank, dynamic compressive gammachirp (dcGC) and the pole-zero filter cascade (PZFC). The dcGC and gammatone are standard filterbank models, whereas the PZFC is a filter cascade, which more accurately models signal propagation in the cochlea. However, while the architecture of the filterbanks is different, they have all been successfully fitted to psychophysical masking data from humans. The abilities of the filterbanks to measure pitch strength were assessed, using stimuli which evoke a weak pitch percept in humans, in order to ascertain whether there is any benefit in the use of the more computationally efficient PZFC.

Finally, a complete sound effects search system using auditory features was constructed in collaboration with Google research. Features were computed from the SAI by sampling the SAI space with boxes of different scales. Vector quantization (VQ) was used to convert this multi-scale representation to a sparse code.
The 'passive-aggressive model for image retrieval' (PAMIR) was used to learn the relationships between dictionary words and these auditory codewords. These auditory sparse codes were compared against sparse codes generated from MFCCs, and the best performance was found when using the auditory features.
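A rough sketch of the box-sampling and vector-quantization step is given below, using random arrays in place of real SAI frames and a small k-means codebook; the box sizes, codebook size, and downsampling are illustrative assumptions, not those of the system built with Google.

```python
# Sketch of the multi-scale box sampling + vector quantization idea:
# rectangular patches of an SAI-like 2-D frame are downsampled to a
# fixed shape, quantized against a learned codebook, and the frame
# becomes a sparse histogram of codeword counts. Random data and the
# codebook size are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.random((200, 32, 64))            # toy stand-in for SAI frames


def sample_boxes(frame, box_shapes=((8, 16), (16, 32), (32, 64))):
    """Cut boxes at several scales, downsample each to 4x8, flatten."""
    feats = []
    for h, w in box_shapes:
        for r in range(0, frame.shape[0] - h + 1, h):
            for c in range(0, frame.shape[1] - w + 1, w):
                box = frame[r:r + h, c:c + w]
                small = box.reshape(4, h // 4, 8, w // 8).mean(axis=(1, 3))
                feats.append(small.ravel())
    return np.array(feats)


all_feats = np.vstack([sample_boxes(f) for f in frames])
codebook = KMeans(n_clusters=64, n_init=4, random_state=0).fit(all_feats)


def sparse_code(frame):
    """Histogram of codeword assignments = sparse code for one frame."""
    return np.bincount(codebook.predict(sample_boxes(frame)), minlength=64)


print(sparse_code(frames[0]).nonzero()[0])    # indices of active codewords
```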
565. Exploiting phonological constraints and automatic identification of speaker classes for Arabic speech recognition. Alsharhan, Iman, January 2014.
The aim of this thesis is to investigate a number of factors that could affect the performance of an Arabic automatic speech understanding (ASU) system. The work described in this thesis belongs to the automatic speech recognition (ASR) phase, but the fact that it is part of an ASU project rather than a stand-alone piece of work on ASR influences the way in which it is carried out. Our main concern in this work is to determine the best way to exploit the phonological properties of the Arabic language in order to improve the performance of the speech recogniser. One of the main challenges facing the processing of Arabic is the effect of the local context, which induces changes in the phonetic representation of a given text, thereby causing the recognition engine to misclassify it. The proposed solution is to develop a set of language-dependent grapheme-to-allophone rules that can predict such allophonic variations and eventually provide the ASR system with a phonetic transcription that is sensitive to the local context. The novel aspect of this method is that the pronunciation of each word is extracted directly from a context-sensitive phonetic transcription rather than from a predefined dictionary that typically does not reflect the actual pronunciation of the word. Besides investigating the boundary effect on pronunciation, the research also seeks to address the problem of Arabic's complex morphology. Two solutions are proposed to tackle this problem, namely, using underspecified phonetic transcriptions to build the system, and using phonemes instead of words to build the hidden Markov models (HMMs). The research also investigates several technical settings that might affect the system's performance. These include training on sub-populations to minimise the variation caused by training on the main undifferentiated population, as well as investigating the correlation between training size and the performance of the ASR system.
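To illustrate the flavour of context-sensitive grapheme-to-allophone rewriting (with invented toy rules, not the Arabic rule set developed in the thesis), a minimal sketch might look like this:

```python
# Toy sketch of a context-sensitive grapheme-to-allophone pass: each
# rule rewrites a phoneme depending on its neighbours, so the
# pronunciation is derived from the running text rather than looked up
# in a fixed dictionary. The rules below are invented placeholders.
import re

# (pattern, replacement) pairs applied to a space-separated phoneme string.
RULES = [
    (r"n(?= b)", "m"),      # nasal place assimilation before a labial
    (r"l(?= sh)", "sh"),    # definite-article /l/ assimilating to a "sun" letter
    (r"a(?= q)", "A"),      # illustrative backing of /a/ before uvular /q/
]


def to_allophones(phonemes):
    """Apply the ordered context-sensitive rules to a phoneme sequence."""
    s = " ".join(phonemes)
    for pattern, repl in RULES:
        s = re.sub(pattern, repl, s)
    return s.split()


print(to_allophones(["a", "l", "sh", "a", "m", "s"]))
# -> ['a', 'sh', 'sh', 'a', 'm', 's']  (assimilated article, toy rule set)
```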
566. Automatic Recognition of Speech-Evoked Brainstem Responses to English Vowels. Samimi, Hamed, January 2015.
The objective of this study is to investigate automatic recognition of speech-evoked auditory brainstem responses (speech-evoked ABR) to the five English vowels (/a/, /ae/, /ao (ɔ)/, /i/ and /u/). We used different automatic speech recognition methods to discriminate between the responses to the vowels. The best recognition result was obtained by applying principal component analysis (PCA) to the amplitudes of the first ten harmonic components of the envelope following response (based on spectral components at the fundamental frequency and its harmonics) and of the frequency following response (based on spectral components in the first formant region), and combining these two feature sets. With this combined feature set used as input to an artificial neural network, a recognition accuracy of 83.8% was achieved. This study could be extended to more complex stimuli to improve assessment of the auditory system for speech communication in hearing-impaired individuals, and potentially help in the objective fitting of hearing aids.
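The sketch below illustrates the general shape of such a pipeline (harmonic-amplitude features reduced with PCA and classified by a small neural network); the random arrays stand in for real EFR/FFR spectra, the layer sizes are invented, and the ordering of PCA and feature combination is simplified relative to the study.

```python
# Sketch of a PCA + neural-network classification pipeline for
# harmonic-amplitude features. Random numbers stand in for the real
# speech-evoked ABR spectra; all sizes are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_trials, n_harmonics = 300, 10
efr = rng.random((n_trials, n_harmonics))      # envelope-following response
ffr = rng.random((n_trials, n_harmonics))      # frequency-following response
X = np.hstack([efr, ffr])                      # combined feature set
y = rng.integers(0, 5, n_trials)               # 5 vowel classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=8),
                    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                                  random_state=1))
clf.fit(X_tr, y_tr)
print(f"accuracy on held-out trials: {clf.score(X_te, y_te):.2f}")
```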
567. The Variable Pronunciations of Word-final Consonant Clusters in a Force Aligned Corpus of Spoken French. Milne, Peter, January 2014.
This thesis project examined both schwa insertion and simplification following word-final consonant clusters in a large corpus of spoken French. Two main research questions were addressed. Can a system of forced alignment reliably reproduce pronunciation judgements that closely match those of a human researcher? How do variables such as speech style, following context, motivation for simplification and speech rate affect the variable pronunciations of word-final consonant clusters? This project describes the creation and testing of a novel system of forced alignment capable of segmenting recorded French speech. The results of comparing the pronunciation judgements between automatic and manual methods of recognition suggest that a system of forced alignment using speaker-adapted acoustic models performed better than other acoustic models; produced results that are likely to be similar to those produced by manual identification; and that the results of forced alignment are not likely to be affected by changes in speech style or speech rate. This project also describes the application of forced alignment to a corpus of natural-language spoken French. The results presented in this large-sample corpus analysis suggest that the dialectal differences between Québec and France are not as simple as "simplification in Québec, schwa insertion in France". While the results presented here suggest that the process of simplification following a word-final consonant cluster is similar in both dialects, the process of schwa insertion is likely to be different in each dialect. In both dialects, word-final consonant cluster simplification is more frequent in a preconsonantal context, is most likely in a spontaneous or less formal speech style, and in that speech style is positively associated with higher speaking rates. Schwa insertion following a word-final consonant cluster displays much stronger dialectal differences. Schwa insertion in the dialect from France is strongly affected by following context and possibly speech style. Schwa insertion in the dialect from Québec is not affected by following context and is strongly predicted by a lack of consonant cluster simplification.
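The automatic-versus-manual comparison described above amounts to measuring agreement between two sets of categorical pronunciation judgements; a minimal sketch of such a check is shown below, with invented labels and sklearn's cohen_kappa_score standing in for the thesis's own evaluation procedure.

```python
# Sketch of comparing automatic (forced-alignment) and manual judgements
# of how a word-final cluster was realised. The label sequences are
# invented; cohen_kappa_score gives chance-corrected agreement.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

manual    = ["simplified", "full", "full", "simplified", "full",
             "schwa", "full", "simplified", "schwa", "full"]
automatic = ["simplified", "full", "simplified", "simplified", "full",
             "schwa", "full", "full", "schwa", "full"]

print("kappa:", round(cohen_kappa_score(manual, automatic), 2))
print(confusion_matrix(manual, automatic,
                       labels=["full", "schwa", "simplified"]))
```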
568. Pronunciation support for Arabic learners. Alsabaan, Majed Soliman K., January 2015.
The aim of the thesis is to find out whether providing feedback to Arabic language learners will help them improve their pronunciation, particularly of words involving sounds that are not distinguished in their native languages. In addition, it aims to find out, if possible, what type of feedback will be most helpful. In order to achieve this aim, we developed a computational tool with a number of component sub-tools, which involved the implementation of several substantial pieces of software. The first task was to ensure the system we were building could distinguish between the more challenging sounds when they were produced by a native speaker, since without that it would not be possible to classify learners' attempts at these sounds. To this end, a number of experiments were carried out with the hidden Markov model toolkit (HTK), a well-known speech recognition toolkit, in order to ensure that it can distinguish between the confusable sounds, i.e. the ones that people have difficulty with. The developed computational tool analyses the differences between the user's pronunciation and that of a native speaker by using a grammar of minimal pairs, where each utterance is treated as coming from a family of similar words. This provides the ability to categorise learners' errors: if someone is trying to say cat and the recogniser thinks they have said cad, then it is likely that they are voicing the final consonant when it should be unvoiced. Extensive testing shows that the system can reliably distinguish such minimal pairs when they are produced by a native speaker, and that this approach does provide effective diagnostic information about errors. The tool provides feedback through three different sub-tools: as an animation of the vocal tract, as a synthesised version of the target utterance, and as a set of written instructions. The tool was evaluated by placing it in a classroom setting and asking 50 Arabic students to use the different versions of the tool. Each student had a thirty-minute session with the tool, working their way through a set of pronunciation exercises at their own pace. The results for this group showed that their pronunciation does improve over the course of a session, though it was not possible to determine whether the improvement is sustained over an extended period. The evaluation was carried out from three points of view: quantitative analysis, qualitative analysis, and a questionnaire. The quantitative analysis gives raw numbers indicating whether a learner improved their pronunciation or not; the qualitative analysis shows behaviour patterns of what learners did and how they used the tool; and the questionnaire gathers feedback from learners and their comments about the tool. We found that providing feedback does appear to help Arabic language learners, but we did not have enough data to see which form of feedback is most helpful. However, we provide an informative analysis of behaviour patterns showing how Arabic students used and interacted with the tool, which could be useful for further data analysis.
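As an illustration of the minimal-pair diagnosis idea (with invented English word families and feedback messages rather than the Arabic families used in the tool), a minimal sketch could look like this:

```python
# Toy sketch of minimal-pair diagnosis: the recogniser is constrained to
# a small family of words that differ in one sound, and the word it
# returns indicates which contrast the learner missed. The pairs and
# feedback messages are invented examples.
MINIMAL_PAIR_FAMILIES = {
    "cat": {"cad": "final consonant was voiced; it should be voiceless",
            "cut": "vowel was too central; aim for a more open front vowel"},
    "heart": {"hat": "the /r/ after the vowel was dropped"},
}


def diagnose(target: str, recognised: str) -> str:
    """Map a (target, recognised) pair to a feedback message."""
    if recognised == target:
        return "pronunciation matched the target"
    family = MINIMAL_PAIR_FAMILIES.get(target, {})
    return family.get(recognised, "unrecognised error; try again")


print(diagnose("cat", "cad"))   # voicing feedback, as in the example above
print(diagnose("cat", "cat"))   # matched
```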
569. A First Study on Hidden Markov Models and one Application in Speech Recognition. Servitja Robert, Maria, January 2016.
Speech is intuitive, fast and easy to generate, but it is hard to index and easy to forget. What is more, listening to speech is slow. Text is easier to store, process and consume, both for computers and for humans, but writing text is slow and requires some intention. In this thesis, we study speech recognition, which allows converting speech into text, making it easier both to create and to use information. Our tool of study is Hidden Markov Models, one of the most important classes of machine learning models in speech and language processing. The aim of this thesis is to do a first study of Hidden Markov Models and understand their importance, particularly in speech recognition. We go through three fundamental problems that come up naturally with Hidden Markov Models: computing the likelihood of an observation sequence, finding an optimal state sequence given an observation sequence and the model, and adjusting the model parameters. A solution to each problem is given together with an example and the corresponding simulations using MATLAB. The main importance lies in the last example, in which a first approach to speech recognition is made.
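A minimal sketch of the first problem (scoring an observation sequence with the forward algorithm) is shown below, using a toy two-state, two-symbol model rather than the models from the thesis's MATLAB simulations.

```python
# Minimal forward-algorithm sketch for the first HMM problem: the
# likelihood of an observation sequence given the model. The two-state,
# two-symbol model below is a toy example.
import numpy as np

A = np.array([[0.7, 0.3],        # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],        # emission probabilities per state
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])        # initial state distribution
obs = [0, 1, 1, 0]               # observed symbol indices


def forward_likelihood(A, B, pi, obs):
    """P(observations | model), summing over all state paths."""
    alpha = pi * B[:, obs[0]]                 # initialisation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # induction step
    return alpha.sum()                        # termination


print(f"P(O | lambda) = {forward_likelihood(A, B, pi, obs):.5f}")
```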
570. Identifying Speaker State from Multimodal Cues. Yang, Zixiaofan, January 2021.
Automatic identification of speaker state is essential for spoken language understanding, with broad potential in various real-world applications. However, most existing work has focused on recognizing a limited set of emotional states using cues from a single modality. This thesis describes my research that addresses these limitations and challenges associated with speaker state identification by studying a wide range of speaker states, including emotion and sentiment, humor, and charisma, using features from speech, text, and visual modalities.
The first part of this thesis focuses on emotion and sentiment recognition in speech. Emotion and sentiment recognition is one of the most studied topics in speaker state identification and has gained increasing attention in speech research recently, with extensive emotional speech models and datasets published every year. However, most work focuses only on recognizing a set of discrete emotions in high-resource languages such as English, while in real-life conversations emotion changes continuously and exists in all spoken languages. To address this mismatch, we propose a deep neural network model that recognizes continuous emotion by combining inputs from raw waveform signals and spectrograms. Experimental results on two datasets show that the proposed model achieves state-of-the-art results by exploiting both waveforms and spectrograms as input. Because low-resource languages have more existing textual sentiment models than speech models, we also propose a method to bootstrap sentiment labels from text transcripts and use these labels to train a sentiment classifier for speech. Utilizing the speaker state information shared across modalities, we extend speech sentiment recognition from high-resource languages to low-resource languages. Moreover, using the natural verse-level alignment of audio Bibles across different languages, we also explore cross-lingual and cross-modality sentiment transfer.
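A rough sketch of the waveform-plus-spectrogram idea is given below as a small PyTorch module with two convolutional branches fused before a regression head; the layer sizes, input shapes, and two-dimensional output are illustrative assumptions, not the architecture from the thesis.

```python
# Sketch of a two-branch model: one branch reads the raw waveform with
# 1-D convolutions, the other reads a spectrogram with 2-D convolutions,
# and the pooled embeddings are fused to predict two continuous
# emotion dimensions. All sizes are illustrative.
import torch
import torch.nn as nn


class WaveSpecRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=16), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.spec_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(64, 2)   # two continuous emotion dimensions

    def forward(self, waveform, spectrogram):
        w = self.wave_branch(waveform).flatten(1)      # (batch, 32)
        s = self.spec_branch(spectrogram).flatten(1)   # (batch, 32)
        return self.head(torch.cat([w, s], dim=1))


model = WaveSpecRegressor()
wave = torch.randn(4, 1, 16000)        # 1 s of 16 kHz audio, batch of 4
spec = torch.randn(4, 1, 64, 100)      # 64 mel bands x 100 frames
print(model(wave, spec).shape)         # -> torch.Size([4, 2])
```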
In the second part of the thesis, we focus on recognizing humor, whose expression is related to emotion and sentiment but has very different characteristics. Unlike emotion and sentiment, which can be identified by crowdsourced annotators, humorous expressions are highly individualistic and culture-specific, making it hard to obtain reliable labels. This results in a lack of data annotated for humor, so we propose two different methods to automatically and reliably label humor. First, we develop a framework for generating humor labels on videos by learning from extensive user-generated comments. We collect and analyze 100 videos and build multimodal humor detection models using speech, text, and visual features, achieving an F1-score of 0.76. In addition to humorous videos, we also develop a framework for generating humor labels on social media posts by learning from user reactions to Facebook posts. We collect 785K posts with humor and non-humor scores and build models that detect humor with performance comparable to human labelers.
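As an illustration of deriving weak humor labels from user reactions (the reaction fields, thresholds, and filtering rule below are invented for the sketch, not those used in the thesis), one might write:

```python
# Sketch of weak humor labeling from user reactions: a post is scored by
# the share of laughter reactions, and posts far from the decision
# boundary become (weak) positive or negative training labels.
def humor_score(reactions: dict) -> float:
    """Fraction of total reactions that are laughter reactions."""
    total = sum(reactions.values())
    return reactions.get("haha", 0) / total if total else 0.0


def weak_label(reactions: dict, pos=0.5, neg=0.05, min_total=20):
    """Return 1 (humorous), 0 (not humorous), or None (too uncertain)."""
    if sum(reactions.values()) < min_total:
        return None
    score = humor_score(reactions)
    return 1 if score >= pos else 0 if score <= neg else None


posts = [
    {"haha": 40, "like": 10, "sad": 0},    # clearly humorous
    {"haha": 1, "like": 80, "sad": 4},     # clearly not
    {"haha": 10, "like": 30, "sad": 2},    # ambiguous -> skipped
]
print([weak_label(p) for p in posts])      # -> [1, 0, None]
```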
The third part of the thesis focuses on charisma, a commonly observed but less studied speaker state with unique challenges: the definition of charisma varies considerably among perceivers, and the perception of charisma also varies with speakers' and perceivers' demographic backgrounds. To better understand charisma, we conduct the first gender-balanced study of charismatic speech, including speakers and raters from diverse backgrounds. We collect personality and demographic information from the raters, as well as their own speech, and examine individual differences in the perception and production of charismatic speech. We also extend the work to politicians' speech by collecting speaker trait ratings on representative speech segments of politicians and studying how genre, gender, and the rater's political stance influence the charisma ratings of the segments.