351

Automatic Recognition of Speech-Evoked Brainstem Responses to English Vowels

Samimi, Hamed January 2015 (has links)
The objective of this study is to investigate automatic recognition of speech-evoked auditory brainstem responses (speech-evoked ABR) to the five English vowels (/a/, /ae/, /ao (ɔ)/, /i/ and /u/). We used different automatic speech recognition methods to discriminate between the responses to the vowels. The best recognition result was obtained by applying principal component analysis (PCA) on the amplitudes of the first ten harmonic components of the envelope following response (based on spectral components at fundamental frequency and its harmonics) and of the frequency following response (based on spectral components in first formant region) and combining these two feature sets. With this combined feature set used as input to an artificial neural network, a recognition accuracy of 83.8% was achieved. This study could be extended to more complex stimuli to improve assessment of the auditory system for speech communication in hearing impaired individuals, and potentially help in the objective fitting of hearing aids.
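A minimal sketch (not from the thesis) of the pipeline the abstract describes: PCA is applied separately to the harmonic-amplitude features of the two response types, the projections are concatenated, and a small neural network classifies the vowel. All values, dimensions, and component counts below are placeholders.

```python
# Illustrative sketch: PCA on EFR and FFR harmonic amplitudes, combined features
# fed to a small neural network for five-way vowel classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trials, n_harmonics = 200, 10
efr = rng.normal(size=(n_trials, n_harmonics))   # envelope-following-response amplitudes (placeholder)
ffr = rng.normal(size=(n_trials, n_harmonics))   # frequency-following-response amplitudes (placeholder)
labels = rng.integers(0, 5, size=n_trials)       # five vowel classes: /a/, /ae/, /ao/, /i/, /u/

# Reduce each feature set with PCA, then concatenate the two projections.
efr_pcs = PCA(n_components=5).fit_transform(efr)
ffr_pcs = PCA(n_components=5).fit_transform(ffr)
features = np.hstack([efr_pcs, ffr_pcs])

X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000).fit(X_train, y_train)
print("vowel recognition accuracy:", clf.score(X_test, y_test))
```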
352

The Variable Pronunciations of Word-final Consonant Clusters in a Force Aligned Corpus of Spoken French

Milne, Peter January 2014 (has links)
This thesis project examined both schwa insertion and simplification following word-final consonant clusters in a large corpus of spoken French. Two main research questions were addressed. Can a system of forced alignment reliably reproduce pronunciation judgements that closely match those of a human researcher? How do variables such as speech style, following context, motivation for simplification and speech rate affect the variable pronunciations of word-final consonant clusters? This project describes the creation and testing of a novel system of forced alignment capable of segmenting recorded French speech. The results of comparing pronunciation judgements between automatic and manual methods of recognition suggest that a system of forced alignment using speaker-adapted acoustic models performed better than other acoustic models; produced results that are likely to be similar to those produced by manual identification; and that the results of forced alignment are not likely to be affected by changes in speech style or speech rate. This project also describes the application of forced alignment to a corpus of natural spoken French. The results presented in this large-sample corpus analysis suggest that the dialectal differences between Québec and France are not as simple as "simplification in Québec, schwa insertion in France". While the results presented here suggest that the process of simplification following a word-final consonant cluster is similar in both dialects, the process of schwa insertion is likely to be different in each dialect. In both dialects, word-final consonant cluster simplification is more frequent in a preconsonantal context, is most likely in a spontaneous or less formal speech style, and in that speech style is positively associated with higher speaking rates. Schwa insertion following a word-final consonant cluster displays much stronger dialectal differences. Schwa insertion in the dialect from France is strongly affected by following context and possibly speech style. Schwa insertion in the dialect from Québec is not affected by following context and is strongly predicted by a lack of consonant cluster simplification.
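A minimal sketch (not from the thesis) of how agreement between automatic (forced-alignment) and manual pronunciation judgements can be quantified. The labels are invented, and Cohen's kappa is one common choice of agreement statistic, used here purely for illustration.

```python
# Illustrative sketch: comparing forced-alignment judgements against manual ones.
from sklearn.metrics import cohen_kappa_score

# 1 = cluster simplified / schwa inserted, 0 = not, for the same set of tokens (toy data)
manual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
automatic = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

agreement = sum(m == a for m, a in zip(manual, automatic)) / len(manual)
kappa = cohen_kappa_score(manual, automatic)
print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```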
353

Hopfield Networks as an Error Correcting Technique for Speech Recognition

Bireddy, Chakradhar 05 1900 (has links)
I experimented with Hopfield networks in the context of a voice-based, query-answering system. Hopfield networks are used to store and retrieve patterns. I used this technique to store queries represented as natural language sentences and I evaluated the accuracy of the technique for error correction in a spoken question-answering dialog between a computer and a user. I show that the use of an auto-associative Hopfield network helps make the speech recognition system more fault tolerant. I also looked at the available encoding schemes to convert a natural language sentence into a pattern of zeroes and ones that can be stored in the Hopfield network reliably, and I suggest scalable data representations which allow storing a large number of queries.
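A minimal sketch (not from the thesis) of the auto-associative mechanism: bipolar patterns are stored with a Hebbian rule, and a corrupted input is iteratively driven back to the nearest stored pattern. The patterns here are toy stand-ins for encoded query sentences.

```python
# Illustrative sketch: a Hopfield network storing +/-1 patterns and recalling the
# nearest stored pattern from a corrupted recognizer output.
import numpy as np

def train_hopfield(patterns):
    """Hebbian learning: sum of outer products of the stored +/-1 patterns."""
    n = patterns.shape[1]
    w = np.zeros((n, n))
    for p in patterns:
        w += np.outer(p, p)
    np.fill_diagonal(w, 0)
    return w / patterns.shape[0]

def recall(w, state, steps=10):
    """Synchronous updates until the state stops changing (or steps run out)."""
    for _ in range(steps):
        new_state = np.where(w @ state >= 0, 1, -1)
        if np.array_equal(new_state, state):
            break
        state = new_state
    return state

patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, -1, -1, 1, 1, -1, -1]])
w = train_hopfield(patterns)

noisy = patterns[0].copy()
noisy[0] *= -1                      # flip one bit, mimicking a recognition error
print(recall(w, noisy))             # recovers the first stored pattern
```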
354

Articulation modelling of vowels in dysarthric and non-dysarthric speech

Albalkhi, Rahaf 25 May 2020 (has links)
People with motor function disorders that cause dysarthric speech find it difficult to use state-of-the-art automatic speech recognition (ASR) systems. These systems are developed from non-dysarthric speech models, which explains their poor performance when used by individuals with dysarthria. Thus, a solution is needed to compensate for the poor performance of these systems. This thesis examines the possibility of quantizing vowels of dysarthric and non-dysarthric speech into codewords that are robust to inter-speaker variability and can be implemented on machines with limited processing capability. I show that it is possible to model all the vowels and vowel-like sounds that a North American speaker can produce if the frequencies of the first and second formants are used to encode these sounds. The proposed solution is aligned with the use of neural networks and hidden Markov models to build an acoustic model in conventional ASR systems. A secondary finding of this study is the feasibility of reducing the set of the ten most common vowels in North American English to only eight vowels. / Graduate / 2021-05-11
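A minimal sketch (not from the thesis) of assigning a vowel token to a codeword by nearest-neighbour lookup in (F1, F2) space. The reference formant values below are rough textbook averages for North American English vowels, not the codebook developed in the thesis.

```python
# Illustrative sketch: quantizing a vowel into a codeword from its first two formants.
codebook = {                     # codeword: (F1 Hz, F2 Hz), rough reference values
    "i":  (270, 2290),
    "u":  (300,  870),
    "ae": (660, 1720),
    "a":  (730, 1090),
}

def quantize(f1, f2):
    """Return the codeword whose (F1, F2) pair is closest to the measured formants."""
    return min(codebook, key=lambda v: (f1 - codebook[v][0]) ** 2 + (f2 - codebook[v][1]) ** 2)

print(quantize(310, 2200))   # -> "i"
```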
355

Grapheme-based continuous speech recognition for some of the under-resourced languages of Limpopo Province

Manaileng, Mabu Johannes January 2015 (has links)
Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2015 / This study investigates the potential of using graphemes, instead of phonemes, as acoustic sub-word units for monolingual and cross-lingual speech recognition for some of the under-resourced languages of the Limpopo Province, namely IsiNdebele, Sepedi and Tshivenda. The performance of a grapheme-based recognition system is compared to that of a phoneme-based recognition system. For each selected under-resourced language, an automatic speech recognition (ASR) system based on hidden Markov models (HMMs) was developed using both graphemes and phonemes as acoustic sub-word units. The ASR framework modelled emission distributions with 16-component Gaussian mixture models (GMMs), increased in increments of 2 mixtures. A third-order n-gram language model was used in all experiments. Identical speech datasets were used for each experiment per language. The LWAZI speech corpora and the National Centre for Human Language Technologies (NCHLT) speech corpora were used for training and testing the tied-state context-dependent acoustic models. The performance of all systems was evaluated at the word level using word error rate (WER). The results of our study show that grapheme-based continuous speech recognition, which copes with the problem of low-quality or unavailable pronunciation dictionaries, is comparable to phoneme-based recognition for the selected under-resourced languages in both monolingual and cross-lingual speech recognition tasks. The study demonstrates that context-dependent grapheme-based sub-word units can be reliable for small and medium-large vocabulary speech recognition tasks for these languages. / Telkom SA
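A minimal sketch of the core idea behind grapheme-based sub-word units: the pronunciation dictionary is derived directly from spelling, removing the dependence on hand-crafted phoneme dictionaries for under-resourced languages. The example words are placeholders, not drawn from the corpora used in the thesis.

```python
# Illustrative sketch: building a grapheme-based pronunciation dictionary, where each
# word's "pronunciation" is simply its sequence of letters.
def grapheme_lexicon(words):
    """Map every word to a space-separated list of its graphemes (letters)."""
    return {w: " ".join(w.lower()) for w in words}

for word, pron in grapheme_lexicon(["thuto", "dumela"]).items():   # placeholder words
    print(f"{word}\t{pron}")
```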
356

Speech Recognition Using a Synthesized Codebook

Smith, Lloyd A. (Lloyd Allen) 08 1900 (has links)
Speech sounds generated by a simple waveform synthesizer were used to create a vector quantization codebook for use in speech recognition. Recognition was tested over the TI-20 isolated word database using a conventional DTW matching algorithm. Input speech was band-limited to 300–3300 Hz, then passed through the Scott Instruments Corp. Coretechs process, implemented on a VET3 speech terminal, to create the speech representation for matching. Synthesized sounds were processed in software by a VET3 signal processing emulation program. Emulation and recognition were performed on a DEC VAX 11/750. The experiments were organized in two series. A preliminary experiment, using no vector quantization, provided a baseline for comparison. The original codebook contained 109 vectors, all derived from two-formant synthesized sounds. This codebook was decimated through the course of the first series of experiments, based on the number of times each vector was used in quantizing the training data for the previous experiment, in order to determine the smallest subset of vectors suitable for coding the speech database. The second series of experiments altered several test conditions in order to evaluate the applicability of the minimal synthesized codebook to conventional codebook training. The baseline recognition rate was 97%. The recognition rate for synthesized codebooks was approximately 92% for sizes ranging from 109 to 16 vectors. Accuracy for smaller codebooks was slightly less than 90%. Error analysis showed that the primary loss in dropping below 16 vectors was in the coding of voiced sounds with high-frequency second formants. The 16-vector synthesized codebook was chosen as the seed for the second series of experiments. After one training iteration, and using a normalized distortion score, trained codebooks performed with an accuracy of 95.1%. When codebooks were trained and tested on different sets of speakers, accuracy was 94.9%, indicating that very little speaker dependence was introduced by the training.
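A minimal sketch (not from the thesis) of dynamic time warping over two sequences of vector-quantized frames, using a simple 0/1 symbol-mismatch cost; the thesis used DTW over Coretechs-processed frames coded with the synthesized codebook, and its actual local distance is not specified in the abstract.

```python
# Illustrative sketch: classic DTW with insertion/deletion/substitution moves over
# codeword sequences, comparing a stored word template with an input utterance.
def dtw_distance(seq_a, seq_b):
    """Return the cumulative DTW cost between two codeword sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

template = [3, 3, 7, 7, 12, 12, 5]     # codeword indices for a stored word template (toy)
test     = [3, 7, 7, 7, 12, 5, 5]      # codeword indices for an input utterance (toy)
print(dtw_distance(template, test))
```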
357

Identifying Speaker State from Multimodal Cues

Yang, Zixiaofan January 2021 (has links)
Automatic identification of speaker state is essential for spoken language understanding, with broad potential in various real-world applications. However, most existing work has focused on recognizing a limited set of emotional states using cues from a single modality. This thesis describes my research addressing these limitations and the challenges associated with speaker state identification by studying a wide range of speaker states, including emotion and sentiment, humor, and charisma, using features from the speech, text, and visual modalities. The first part of this thesis focuses on emotion and sentiment recognition in speech. Emotion and sentiment recognition is one of the most studied topics in speaker state identification and has gained increasing attention in speech research recently, with extensive emotional speech models and datasets published every year. However, most work focuses only on recognizing a set of discrete emotions in high-resource languages such as English, while in real-life conversations emotion changes continuously and exists in all spoken languages. To address this mismatch, we propose a deep neural network model that recognizes continuous emotion by combining inputs from raw waveform signals and spectrograms. Experimental results on two datasets show that the proposed model achieves state-of-the-art results by exploiting both waveforms and spectrograms as input. Because more textual sentiment models than speech models exist for low-resource languages, we also propose a method to bootstrap sentiment labels from text transcripts and use these labels to train a sentiment classifier on speech. Utilizing the speaker state information shared across modalities, we extend speech sentiment recognition from high-resource languages to low-resource languages. Moreover, using the natural verse-level alignment of audio Bibles across different languages, we also explore cross-lingual and cross-modal sentiment transfer. In the second part of the thesis, we focus on recognizing humor, whose expression is related to emotion and sentiment but has very different characteristics. Unlike emotion and sentiment, which can be identified by crowdsourced annotators, humorous expressions are highly individualistic and culture-specific, making it hard to obtain reliable labels. This results in a lack of data annotated for humor, so we propose two different methods to label humor automatically and reliably. First, we develop a framework for generating humor labels on videos by learning from extensive user-generated comments. We collect and analyze 100 videos and build multimodal humor detection models using speech, text, and visual features, which achieve an F1-score of 0.76. In addition to humorous videos, we also develop a framework for generating humor labels on social media posts by learning from user reactions to Facebook posts. We collect 785K posts with humor and non-humor scores and build models that detect humor with performance comparable to that of human labelers. The third part of the thesis focuses on charisma, a commonly found but less studied speaker state with unique challenges: the definition of charisma varies considerably among perceivers, and the perception of charisma also varies with speakers' and perceivers' demographic backgrounds. To better understand charisma, we conduct the first gender-balanced study of charismatic speech, including speakers and raters from diverse backgrounds. We collect personality and demographic information from the raters, as well as their own speech, and examine individual differences in the perception and production of charismatic speech. We also extend the work to politicians' speech by collecting speaker trait ratings on representative speech segments of politicians and studying how genre, gender, and the rater's political stance influence the charisma ratings of the segments.
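A minimal sketch (not the thesis model) of a two-branch network that encodes the raw waveform with 1-D convolutions and the spectrogram with 2-D convolutions, then fuses the two embeddings for continuous emotion prediction. Layer sizes, pooling choices, and the two-dimensional output (e.g., arousal and valence) are illustrative assumptions.

```python
# Illustrative sketch: fusing waveform and spectrogram encodings for emotion prediction.
import torch
import torch.nn as nn

class WaveSpecFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=80, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())          # -> (batch, 16)
        self.spec_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())          # -> (batch, 16)
        self.head = nn.Linear(32, 2)                        # e.g., arousal and valence

    def forward(self, waveform, spectrogram):
        fused = torch.cat([self.wave_branch(waveform), self.spec_branch(spectrogram)], dim=1)
        return self.head(fused)

model = WaveSpecFusion()
out = model(torch.randn(4, 1, 16000), torch.randn(4, 1, 128, 100))   # dummy batch
print(out.shape)   # torch.Size([4, 2])
```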
358

Strojový překlad mluvené řeči přes fonetickou reprezentaci zdrojové řeči / Spoken Language Translation via Phoneme Representation of the Source Language

Polák, Peter January 2020 (has links)
We refactor the traditional two-step approach to spoken language translation, in which automatic speech recognition is followed by machine translation. Instead of conventional graphemes, we use phonemes as the intermediate speech representation. Starting with the acoustic model, we revise cross-lingual transfer and propose a coarse-to-fine method providing a further speed-up and performance boost. Further, we review the translation model. We experiment with source and target encoding, boosting robustness by utilizing fine-tuning and transfer across ASR and SLT. We empirically document that this conventional setup with an alternative representation not only performs well on standard test sets but also provides robust transcripts and translations on challenging (e.g., non-native) test sets. Notably, our ASR system outperforms commercial ASR systems.
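A minimal sketch of the two-step interface with phonemes as the intermediate representation (not the thesis implementation); asr_to_phonemes() and translate_phonemes() are hypothetical stand-ins for the acoustic and translation models, and the G2P table is a toy example.

```python
# Illustrative sketch: ASR emits phoneme strings, which the translation model consumes.
PHONEME_TABLE = {"hello": "HH AH L OW", "world": "W ER L D"}   # toy G2P lookup

def asr_to_phonemes(transcript_words):
    """Stand-in for the acoustic model: emit a phoneme string instead of graphemes."""
    return " ".join(PHONEME_TABLE.get(w, w) for w in transcript_words)

def translate_phonemes(phoneme_string):
    """Stand-in for the MT model trained on phoneme-encoded source sentences."""
    return f"<translation of: {phoneme_string}>"

phonemes = asr_to_phonemes(["hello", "world"])
print(translate_phonemes(phonemes))
```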
359

User-centered Visualizations of Transcription Uncertainty in AI-generated Subtitles of News Broadcast

Karlsson, Fredrik January 2020 (has links)
AI-generated subtitles have recently started to automate the process of subtitling with automatic speech recognition. However, viewers may not perceive that the transcription is based on probabilities and may contain errors. For news that is broadcast live, this may be controversial and cause misinterpretation. A user-centered design approach was used to investigate three possible solutions for visualizing transcription uncertainty in real-time presentation. Based on the user needs, one proposed solution was used in a qualitative comparison with AI-generated subtitles without visualizations. The results suggest that visualization of uncertainty supports users' interpretation of AI-generated subtitles and helps them identify possible errors; however, it does not improve transcription intelligibility. The results also suggest that unnoticed transcription errors during news broadcasts are perceived as critical and decrease trust in the news. Uncertainty visualizations may increase trust and reduce the risk of misinterpreting important information.
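A minimal sketch of one way such a visualization could be driven (not the design evaluated in the thesis): words whose recognizer confidence falls below a threshold are marked so the subtitle renderer can style them differently. The words, confidence scores, and threshold are invented.

```python
# Illustrative sketch: marking low-confidence words in a subtitle line.
def render_subtitle(words, threshold=0.8):
    """Wrap low-confidence words so a renderer can style them (e.g. grey them out)."""
    return " ".join(w if c >= threshold else f"[{w}?]" for w, c in words)

line = [("the", 0.97), ("minister", 0.91), ("denied", 0.55), ("the", 0.96), ("report", 0.88)]
print(render_subtitle(line))   # -> the minister [denied?] the report
```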
360

Cross-lingual and Multilingual Automatic Speech Recognition for Scandinavian Languages

Černiavski, Rafal January 2022 (has links)
Research into Automatic Speech Recognition (ASR), the task of transforming speech into text, remains highly relevant due to its countless applications in industry and academia. State-of-the-art ASR models are able to produce nearly perfect, sometimes described as human-like, transcriptions; however, accurate ASR models are most often available only for high-resource languages. Furthermore, the vast majority of ASR models are monolingual, that is, only able to handle one language at a time. In this thesis, we extensively evaluate the quality of existing monolingual ASR models for Swedish, Danish, and Norwegian. In addition, we search for parallels between monolingual ASR models and the comprehension of foreign languages by native speakers of these languages. Lastly, we extend the Swedish monolingual model to handle all three languages. The research conducted in this thesis project is divided into two main sections, namely monolingual and multilingual models. In the former, we analyse and compare the performance of monolingual ASR models for Scandinavian languages in monolingual and cross-lingual settings. We compare these results against the levels of mutual intelligibility of the Scandinavian languages among native speakers of Swedish, Danish, and Norwegian to see whether the monolingual models favour the same languages as native speakers do. We also examine the performance of the monolingual models on the regional dialects of all three languages and perform a qualitative analysis of the most common errors. As for multilingual models, we expand the most accurate monolingual ASR model to handle all three languages. To do so, we explore the most suitable settings via trial models. In addition, we propose an extension to the well-established Wav2Vec 2.0-CTC architecture by incorporating a language classification component. The extension enables the use of language models, thus boosting the overall performance of the multilingual models. The results reported in this thesis suggest that in a cross-lingual setting, monolingual ASR models for Scandinavian languages perform better on the languages that are easier for native speakers to comprehend. Furthermore, the addition of a statistical language model boosts the performance of ASR models in monolingual, cross-lingual, and multilingual settings. ASR models appear to favour certain regional dialects, though the gap narrows in a multilingual setting. Contrary to our expectations, our multilingual model performs comparably with the monolingual Swedish ASR models and outperforms the Danish and Norwegian models. The multilingual architecture proposed in this thesis project is fairly simple yet effective. With greater computational resources at hand, the extensions offered in the conclusions might improve the models further.
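A minimal sketch of the routing idea (not the thesis implementation): a language-identification component selects which language-specific language model rescores the CTC hypothesis. identify_language() and the per-language rescoring functions are hypothetical stand-ins for the classifier and n-gram LMs added on top of the Wav2Vec 2.0-CTC architecture described in the abstract.

```python
# Illustrative sketch: routing decoding through a language-specific LM based on
# a language-ID component's prediction.
LANGUAGE_MODELS = {
    "sv": lambda hyp: f"<sv-LM rescoring of: {hyp}>",
    "da": lambda hyp: f"<da-LM rescoring of: {hyp}>",
    "no": lambda hyp: f"<no-LM rescoring of: {hyp}>",
}

def identify_language(language_scores):
    """Stand-in for the language classification head; returns a language code."""
    return max(language_scores, key=language_scores.get)

def decode(ctc_hypothesis, language_scores):
    lang = identify_language(language_scores)
    return lang, LANGUAGE_MODELS[lang](ctc_hypothesis)

print(decode("jag heter rafal", {"sv": 0.8, "da": 0.15, "no": 0.05}))
```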
