361

Automatic Speech Recognition for low-resource languages using Wav2Vec2 : Modern Standard Arabic (MSA) as an example of a low-resource language

Zouhair, Taha January 2021 (has links)
The need for fully automatic translation at DigitalTolk, a Stockholm-based company providing translation services, motivates exploring Automatic Speech Recognition as a first step for Modern Standard Arabic (MSA). Facebook AI recently released the second version of its Wav2Vec models, dubbed Wav2Vec 2.0, which uses deep neural networks and provides several English pretrained models along with a multilingual model trained on 53 different languages, referred to as the Cross-Lingual Speech Representation (XLSR-53). The small English model and the XLSR-53 pretrained model are tested on Arabic data from Mozilla Common Voice, and the resulting performance is discussed. In this research, the small model did not yield usable results and may have needed more unlabelled data for pretraining, whereas the large model successfully transcribed the Arabic recordings, achieving an unprecedented Word Error Rate of 24.40%. The small model turned out to be unsuitable for training, especially on languages other than English where unlabelled data is scarce. The large model, on the other hand, gave very promising results despite the small amount of data, and should be the model of choice for any future training on low-resource languages such as Arabic.
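The thesis fine-tunes the pretrained XLSR-53 checkpoint on the Arabic portion of Mozilla Common Voice. Below is a minimal sketch of that kind of setup using the Hugging Face transformers and datasets libraries; the dataset version, vocabulary file, and configuration values are illustrative assumptions rather than the thesis' exact recipe.

```python
# Sketch of preparing Wav2Vec 2.0 XLSR-53 for Arabic CTC fine-tuning.
# Model/dataset identifiers and settings are assumptions, not the thesis' setup.
from datasets import load_dataset, Audio
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC)

# Arabic subset of Mozilla Common Voice (version/config is an assumption).
cv_ar = load_dataset("mozilla-foundation/common_voice_11_0", "ar", split="train")
cv_ar = cv_ar.cast_column("audio", Audio(sampling_rate=16_000))

# Character-level vocabulary built beforehand from the training transcripts
# (vocab.json is assumed to exist).
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16_000,
                                             padding_value=0.0,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Pretrained multilingual encoder with a fresh CTC head sized to the Arabic vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # common when labelled data is limited
```

From here, a Trainer with a CTC-aware padding collator and a word-error-rate metric would complete a typical fine-tuning loop.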
362

Deep networks for sign language video caption

Zhou, Mingjie 12 August 2020 (has links)
In the hearing-loss community, sign language is a primary tool for communication, yet there is a communication gap between people with hearing loss and people with normal hearing. Sign language is different from spoken language: it has its own vocabulary and grammar. Recent work concentrates on sign language video captioning, which consists of sign language recognition and sign language translation. Continuous sign language recognition, which can help bridge the communication gap, is a challenging task because of the weakly supervised ordered annotations, where no frame-level label is provided. To overcome this problem, connectionist temporal classification (CTC) is the most widely used method. However, CTC learning can perform badly if the extracted features are poor. For better feature extraction, this thesis presents novel self-attention-based fully-inception (SAFI) networks for vision-based end-to-end continuous sign language recognition. Because the lengths of sign words differ, we introduce a fully inception network with different receptive fields to extract dynamic clip-level features. To further boost performance, the fully inception network with an auxiliary classifier is trained with an aggregation cross-entropy (ACE) loss. The encoder of a self-attention network is then used as the global sequential feature extractor to model the clip-level features with CTC. The proposed model is optimized by jointly training with ACE on clip-level feature learning and CTC on global sequential feature learning in an end-to-end fashion. The best baseline method achieves 35.6% WER on the validation set and 34.5% WER on the test set; it employs a better decoding algorithm to generate pseudo labels for EM-like optimization that fine-tunes the CNN module. In contrast, our approach focuses on better feature extraction for end-to-end learning. To alleviate overfitting on the limited dataset, we employ temporal elastic deformation to triple the real-world dataset RWTH-PHOENIX-Weather 2014. Experimental results on this dataset demonstrate the effectiveness of our approach, which achieves 31.7% WER on the validation set and 31.2% WER on the test set. Even though sign language recognition can, to some extent, help bridge the communication gap, its output is still organized in sign language grammar, which differs from spoken language. Unlike sign language recognition, which recognizes sign gestures, sign language translation (SLT) converts sign language into the target spoken-language text that hearing people commonly use in daily life. To achieve this goal, this thesis provides an effective sign language translation approach that attains state-of-the-art performance on the largest real-life German sign language translation database, RWTH-PHOENIX-Weather 2014T. In addition, a direct end-to-end sign language translation approach yields promising results (an impressive gain from 9.94 to 13.75 BLEU on the validation set and from 9.58 to 14.07 BLEU on the test set) without intermediate recognition annotations. These comparative and promising experimental results show the feasibility of direct end-to-end SLT.
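Since CTC is central to the recognition pipeline described above, a minimal PyTorch sketch of CTC training on clip-level features follows; the vocabulary size, sequence lengths, and the stand-in encoder are illustrative assumptions, and the ACE term used in the thesis is omitted.

```python
# Minimal sketch of CTC training on clip-level features, assuming an encoder
# (a stand-in for the SAFI feature extractor) that maps video clips to
# per-clip logits over the gloss vocabulary. Shapes and names are illustrative.
import torch
import torch.nn as nn

num_glosses = 1232          # gloss vocabulary size; blank index is 0 (assumption)
batch, time_steps = 4, 80   # clips per video after temporal pooling (assumption)

encoder = nn.Sequential(    # stand-in for the clip-level feature extractor head
    nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, num_glosses + 1)
)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

clip_features = torch.randn(batch, time_steps, 1024)      # from the video backbone
logits = encoder(clip_features)                            # (batch, time, classes)
log_probs = logits.log_softmax(-1).transpose(0, 1)         # CTC expects (time, batch, classes)

targets = torch.randint(1, num_glosses + 1, (batch, 20))   # gloss label sequences
input_lengths = torch.full((batch,), time_steps, dtype=torch.long)
target_lengths = torch.full((batch,), 20, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # the thesis trains jointly with ACE; only the CTC term is shown here
```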
363

The automatic recognition of emotions in speech

Manamela, Phuti John January 2020 (has links)
Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2020 / Speech emotion recognition (SER) refers to a technology that enables machines to detect and recognise human emotions from spoken phrases. In the literature, numerous attempts have been made to develop systems that can recognise human emotions from the voice; however, not much work has been done in the context of South African indigenous languages. The aim of this study was to develop an SER system that can classify and recognise six basic human emotions (i.e., sadness, fear, anger, disgust, happiness, and neutral) from speech spoken in Sepedi (one of South Africa’s official languages). One of the major challenges encountered in this study was the lack of a proper corpus of emotional speech. Therefore, three different Sepedi emotional speech corpora consisting of acted speech were developed: a Recorded-Sepedi corpus collected from recruited native speakers (9 participants), a TV-broadcast corpus collected from professional Sepedi actors, and an Extended-Sepedi corpus combining the two. Features were extracted from the speech corpora and assembled into a data file, which was used to train four machine learning (ML) algorithms (SVM, KNN, MLP and Auto-WEKA) under 10-fold cross-validation. Three experiments were then performed on the developed speech corpora and the performance of the algorithms was compared. The best results in all experiments were achieved with Auto-WEKA. Good results might have been expected for the TV-broadcast corpus, since it was collected from professional actors, but the results showed otherwise. From the findings of this study, one can conclude that there are no precise or exact techniques for the development of SER systems; it is a matter of experimenting and finding the best technique for the study at hand. The study also highlighted the scarcity of SER resources for South African indigenous languages and showed that the quality of the dataset plays a vital role in the performance of SER systems. / National Research Foundation (NRF) and Telkom Center of Excellence (CoE)
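As an illustration of the training protocol described above, the following hedged sketch trains SVM, KNN and MLP classifiers with 10-fold cross-validation on a file of pre-extracted acoustic features; the feature file, feature set and hyperparameters are assumptions, and Auto-WEKA (a Java/WEKA tool) is not reproduced here.

```python
# Illustrative sketch: classical classifiers on pre-extracted acoustic features
# with 10-fold cross-validation. Column and file names are hypothetical.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Each row: acoustic features (e.g., MFCCs, pitch, energy) plus an emotion label.
data = pd.read_csv("sepedi_emotion_features.csv")       # hypothetical feature file
X, y = data.drop(columns=["emotion"]), data["emotion"]

models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=500)),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)         # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```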
364

Semi-Supervised Learning with Sparse Autoencoders in Automatic Speech Recognition / Semi-övervakad inlärning med glesa autoencoders i automatisk taligenkänning

DHAKA, AKASH KUMAR January 2016 (has links)
This work explores semi-supervised learning techniques to improve the performance of Automatic Speech Recognition systems. Semi-supervised learning takes advantage of unlabeled data in order to improve the quality of the representations extracted from the data. The proposed model is a neural network whose weights are updated by simultaneously minimizing the weighted sum of a supervised and an unsupervised cost function, evaluated on the labeled and unlabeled portions of the data set, respectively. The combined cost is optimized through mini-batch stochastic gradient descent via standard backpropagation. The model was tested on a phone classification task on the TIMIT American English data set and on a handwritten digit classification task on the MNIST data set. Our results show that the model outperforms a network trained with standard backpropagation on the labelled material alone, and are in line with state-of-the-art graph-based semi-supervised training methods.
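The combined objective described above can be sketched in a few lines. The illustrative PyTorch fragment below minimizes a weighted sum of a supervised cross-entropy term on labeled frames and an autoencoder reconstruction term on unlabeled frames; the layer sizes, weighting factor, and optimizer settings are assumptions, not the thesis' configuration.

```python
# Semi-supervised training step: supervised loss on labeled data plus a weighted
# unsupervised reconstruction loss on unlabeled data, optimized jointly with SGD.
import torch
import torch.nn as nn

feat_dim, hidden_dim, num_phones = 39, 256, 48   # e.g., MFCC features, TIMIT phone set
encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Sigmoid())
decoder = nn.Linear(hidden_dim, feat_dim)        # reconstruction head (unsupervised)
classifier = nn.Linear(hidden_dim, num_phones)   # phone classification head (supervised)

params = list(encoder.parameters()) + list(decoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
alpha = 0.3                                      # weight of the unsupervised term (assumption)

def training_step(x_labeled, y_labeled, x_unlabeled):
    h_lab = encoder(x_labeled)
    supervised = nn.functional.cross_entropy(classifier(h_lab), y_labeled)
    h_unlab = encoder(x_unlabeled)
    # Sparsity could be encouraged here with an L1 penalty on h_unlab (sparse autoencoder).
    unsupervised = nn.functional.mse_loss(decoder(h_unlab), x_unlabeled)
    loss = supervised + alpha * unsupervised
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One mini-batch with random stand-in data:
loss = training_step(torch.randn(32, feat_dim),
                     torch.randint(0, num_phones, (32,)),
                     torch.randn(32, feat_dim))
```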
365

Robust Dialog Management Through A Context-centric Architecture

Hung, Victor C. 01 January 2010 (has links)
This dissertation presents and evaluates a method of managing spoken dialog interactions with robust attention to fulfilling the human user’s goals in the presence of speech recognition limitations. Assistive speech-based embodied conversation agents are computer-based entities that interact with humans to help accomplish a certain task or communicate information via spoken input and output. A challenging aspect of this task involves open dialog, where the user is free to converse in an unstructured manner. With this style of input, the machine’s ability to communicate may be hindered by poor reception of utterances, caused by a user’s inadequate command of a language and/or faults in the speech recognition facilities. Since speech-based input is emphasized, this endeavor involves the fundamental issues associated with natural language processing, automatic speech recognition and dialog system design. Driven by Context-Based Reasoning, the presented dialog manager features a discourse model that implements mixed-initiative conversation with a focus on the user’s assistive needs. The discourse behavior must maintain a sense of generality, where the assistive nature of the system remains constant regardless of its knowledge corpus. The dialog manager was encapsulated into a speech-based embodied conversation agent platform for prototyping and testing purposes. A battery of user trials was performed on this agent to evaluate its performance as a robust, domain-independent, speech-based interaction entity capable of satisfying the needs of its users.
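To make the context-centric idea concrete, the following is a deliberately simplified, hypothetical sketch of a dialog manager that keeps an active context, switches contexts based on the user's utterance, and takes the initiative to clarify when recognition confidence is low; none of the names or selection heuristics come from the dissertation.

```python
# Toy context-centric dialog manager: one active context at a time, context switches
# driven by utterance relevance, and a clarification move when ASR confidence is low.
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str
    asr_confidence: float   # confidence reported by the speech recognizer

class Context:
    def __init__(self, name, keywords, response):
        self.name, self.keywords, self.response = name, keywords, response
    def relevance(self, utt: Utterance) -> float:
        words = utt.text.lower().split()
        return sum(w in words for w in self.keywords) / max(len(self.keywords), 1)
    def respond(self, utt: Utterance) -> str:
        return self.response

class DialogManager:
    def __init__(self, contexts, min_confidence=0.5):
        self.contexts, self.min_confidence = contexts, min_confidence
        self.active = contexts[0]
    def handle(self, utt: Utterance) -> str:
        if utt.asr_confidence < self.min_confidence:
            # Mixed initiative: the system takes the turn to repair the misrecognition.
            return f"Sorry, I didn't catch that. Are we still talking about {self.active.name}?"
        best = max(self.contexts, key=lambda c: c.relevance(utt))
        if best.relevance(utt) > self.active.relevance(utt):
            self.active = best      # context switch driven by the user's goal
        return self.active.respond(utt)

dm = DialogManager([
    Context("directions", ["where", "find", "room"], "The office is down the hall to your left."),
    Context("schedule", ["when", "time", "open"], "We are open from 9 am to 5 pm."),
])
print(dm.handle(Utterance("where can I find room 204", 0.9)))
print(dm.handle(Utterance("uh when are you open", 0.35)))
```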
366

Phoneme-based Video Indexing Using Phonetic Disparity Search

Barth, Carlos Leon 01 January 2010 (has links)
This dissertation presents and evaluates an approach to the video indexing problem, investigating a categorization method that transcribes audio content through Automatic Speech Recognition (ASR) combined with Dynamic Contextualization (DC), Phonetic Disparity Search (PDS) and Metaphone indexation. The suggested approach applies genome pattern-matching algorithms with computational summarization to build a database infrastructure that provides an indexed summary of the original audio content. PDS complements the contextual phoneme-indexing approach by optimizing topic search performance and accuracy in large video content structures. A prototype was established to translate news broadcast video into text and phonemes automatically using ASR utterance conversions. Each extracted phonetic utterance was then categorized, converted to Metaphones, and stored in a repository with contextual topical information attached and indexed for subsequent search analysis. Following the original design strategy, a custom parallel interface was built to measure the capabilities of dissimilar phonetic queries and provide an interface for result analysis. The postulated solution provides evidence of superior topic matching compared to traditional word and phoneme search methods. Experimental results demonstrate that PDS can be 3.7% better than the same phoneme query, while Metaphone search proved to be 154.6% better than the same phoneme search and 68.1% better than the equivalent word search.
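One ingredient of the approach, Metaphone indexation, can be illustrated with a small sketch: transcribed words are keyed by their Metaphone codes so that misrecognized or misspelled forms still match phonetically. The jellyfish library's Metaphone implementation is used here, and the transcript snippets and index layout are assumptions.

```python
# Sketch of a Metaphone-keyed index over ASR transcripts for phonetic search.
from collections import defaultdict
import jellyfish

transcripts = {                      # hypothetical ASR output keyed by video segment
    "news_001@00:02:10": "senate votes on the climate bill today",
    "news_002@00:15:42": "the sennate vote was postponed",   # ASR misspelling
}

index = defaultdict(set)
for segment, text in transcripts.items():
    for word in text.split():
        index[jellyfish.metaphone(word)].add(segment)   # index by phonetic key

def phonetic_search(query_word):
    """Return segments whose transcript contains a word that sounds like the query."""
    return sorted(index[jellyfish.metaphone(query_word)])

print(phonetic_search("senate"))   # matches both 'senate' and the misrecognized 'sennate'
```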
367

Automatic Speech Separation for Brain-Controlled Hearing Technologies

Han, Cong January 2024 (has links)
Speech perception in crowded acoustic environments is particularly challenging for hearing-impaired listeners. While assistive hearing devices can suppress background noises distinct from speech, they struggle to attenuate interfering speakers without knowing which speaker the listener is focusing on. The human brain has a remarkable ability to pick out individual voices in a noisy environment like a crowded restaurant or a busy city street. This ability inspires brain-controlled hearing technologies: a brain-controlled hearing aid acts as an intelligent filter, reading wearers’ brainwaves and enhancing the voice they want to focus on. Two essential elements form the core of brain-controlled hearing aids: automatic speech separation (SS), which isolates individual speakers from the mixed audio of an acoustic scene, and auditory attention decoding (AAD), in which the listener’s brainwaves are compared with the separated speakers to determine the attended one, which can then be amplified to facilitate hearing. This dissertation focuses on speech separation and its integration with AAD, aiming to propel the evolution of brain-controlled hearing technologies. The goal is to help users engage in conversations with people around them seamlessly and efficiently. This dissertation is structured into two parts. The first part focuses on automatic speech separation models, beginning with the introduction of a real-time monaural speech separation model, followed by more advanced real-time binaural speech separation models. The binaural models use both spectral and spatial features to separate speakers and are more robust to noise and reverberation. Beyond performing speech separation, the binaural models preserve the interaural cues of the separated sound sources, a significant step towards immersive augmented hearing. Additionally, the first part explores using speaker identification to improve the performance and robustness of models in long-form speech separation, and delves into unsupervised learning methods for multi-channel speech separation, aiming to improve the models' ability to generalize to real-world audio. The second part of the dissertation integrates the speech separation models introduced in the first part with auditory attention decoding (SS-AAD) to develop brain-controlled augmented hearing systems. It is demonstrated that auditory attention decoding with automatically separated speakers is as accurate and fast as with clean speech sounds. Furthermore, to better align the experimental environment of SS-AAD systems with real-life scenarios, the second part introduces a new AAD task that closely simulates real-world complex acoustic settings. The results show that the SS-AAD system improves speech intelligibility and facilitates tracking the attended speaker in realistic acoustic environments. Finally, this part presents the use of self-supervised speech representations in the SS-AAD system to enhance the neural decoding of attentional selection.
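As a rough illustration of the first building block, the sketch below shows a mask-based monaural separation skeleton in PyTorch: an encoder maps the mixture waveform to a latent representation, a separator predicts one mask per speaker, and a decoder reconstructs each source. Layer sizes and the shallow separator are placeholders, not the dissertation's architectures.

```python
# Minimal mask-based single-channel speech separation skeleton (untrained).
import torch
import torch.nn as nn

class MaskSeparator(nn.Module):
    def __init__(self, n_filters=256, kernel=16, n_speakers=2):
        super().__init__()
        self.n_speakers = n_speakers
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=kernel // 2)
        self.separator = nn.Sequential(          # stand-in for a deep separation network
            nn.Conv1d(n_filters, n_filters, 1), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters * n_speakers, 1),
        )
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=kernel // 2)

    def forward(self, mixture):                  # mixture: (batch, 1, samples)
        latent = torch.relu(self.encoder(mixture))         # (batch, filters, frames)
        masks = torch.sigmoid(self.separator(latent))      # one mask per speaker
        masks = masks.view(mixture.size(0), self.n_speakers, -1, latent.size(-1))
        return torch.stack([self.decoder(latent * masks[:, s])
                            for s in range(self.n_speakers)], dim=1)

model = MaskSeparator()
mixture = torch.randn(2, 1, 16000)               # one second of 16 kHz audio, batch of 2
separated = model(mixture)                       # (batch, n_speakers, 1, samples)
print(separated.shape)
```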
368

Preserving subsegmental variation in modeling word segmentation (or, the raising of baby Mondegreen)

Rytting, Christopher Anton 05 January 2007 (has links)
No description available.
369

Arabic Language Modeling with Stem-Derived Morphemes for Automatic Speech Recognition

Heintz, Ilana 01 September 2010 (has links)
No description available.
370

Improving Automatic Transcription Using Natural Language Processing

Kiefer, Anna 01 March 2024 (has links) (PDF)
Digital Democracy is a CalMatters and California Polytechnic State University initiative to promote transparency in state government by increasing access to the California legislature. While Digital Democracy is made up of many resources, one foundational step of the project is obtaining accurate, timely transcripts of California Senate and Assembly hearings. The information extracted from these transcripts provides crucial data for subsequent steps in the pipeline. In the context of Digital Democracy, upleveling is when humans verify, correct, and annotate the transcript results after the legislative hearings have been automatically transcribed. The upleveling process is done with the assistance of a software application called the Transcription Tool. The human upleveling process is the most costly and time-consuming step of the Digital Democracy pipeline. In this thesis, we hypothesize that we can make significant reductions to the time needed for upleveling by using Natural Language Processing (NLP) systems and techniques. The main contribution of this thesis is engineering a new automatic transcription pipeline. Specifically, this thesis integrates a new automatic speech recognition service, a new speaker diarization model, additional text post-processing changes, and a new process for speaker identification. To evaluate the system's improvements, we measure the accuracy and speed of the newly integrated features and record editor upleveling time both before and after the additions.
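A hedged sketch of one possible transcription-plus-diarization pipeline of the kind described above pairs an off-the-shelf ASR model (OpenAI Whisper) with a pretrained speaker-diarization model (pyannote.audio) and merges their outputs with a simple midpoint heuristic; the specific models, token, file names, and merging rule are assumptions rather than the thesis' actual components.

```python
# Sketch: transcribe a hearing with Whisper, diarize with pyannote, and attach
# speaker labels to ASR segments. All identifiers below are illustrative.
import whisper
from pyannote.audio import Pipeline

AUDIO = "hearing.wav"                      # hypothetical hearing recording

asr_model = whisper.load_model("base")
asr = asr_model.transcribe(AUDIO)          # segments with start/end times and text

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="HF_TOKEN")  # needs a Hugging Face token
diarization = diarizer(AUDIO)

def speaker_at(t):
    """Return the diarized speaker label active at time t, if any."""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"

# Attach a speaker label to each ASR segment using its midpoint (a simple heuristic).
for seg in asr["segments"]:
    mid = (seg["start"] + seg["end"]) / 2
    print(f'[{speaker_at(mid)}] {seg["text"].strip()}')
```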
