691

Nízko-dimenzionální faktorizace pro "End-To-End" řečové systémy / Low-Dimensional Matrix Factorization in End-To-End Speech Recognition Systems

Gajdár, Matúš January 2020 (has links)
The project covers automatic speech recognition with neural network training using low-dimensional matrix factorization. We describe time-delay neural networks with factorization (TDNN-F) and without it (TDNN), implemented in PyTorch. We compare the PyTorch implementation against the Kaldi toolkit and achieve similar results in experiments with various network architectures. The last chapter describes the impact of low-dimensional matrix factorization on end-to-end speech recognition systems, as well as a modification of the system using TDNN(-F) networks. With specific network settings, we were able to achieve better results with systems using factorization, while also reducing training complexity by decreasing the number of network parameters.
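The factorization the abstract refers to replaces a TDNN layer's full weight matrix with a product of two thinner matrices, shrinking the parameter count. A minimal PyTorch sketch of the idea (layer sizes are illustrative, and the semi-orthogonal constraint that Kaldi's TDNN-F recipe places on the bottleneck is omitted):

```python
import torch
import torch.nn as nn

class FactorizedTDNNLayer(nn.Module):
    """One TDNN-F-style layer: a temporal convolution whose weight
    matrix is factorized through a low-dimensional bottleneck."""
    def __init__(self, in_dim=512, bottleneck=128, out_dim=512, context=3):
        super().__init__()
        # A full-rank layer would be Conv1d(in_dim, out_dim, context):
        # in_dim * context * out_dim weights. The factorized version uses
        # in_dim * context * bottleneck + bottleneck * out_dim weights.
        self.down = nn.Conv1d(in_dim, bottleneck, kernel_size=context)
        self.up = nn.Conv1d(bottleneck, out_dim, kernel_size=1)
        self.relu = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_dim)

    def forward(self, x):                       # x: (batch, in_dim, time)
        return self.norm(self.relu(self.up(self.down(x))))

layer = FactorizedTDNNLayer()
full = 512 * 3 * 512
fact = 512 * 3 * 128 + 128 * 512
print(f"parameter ratio: {fact / full:.2f}")    # about 0.33 for these sizes
x = torch.randn(4, 512, 100)
print(layer(x).shape)                           # torch.Size([4, 512, 98])
```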
692

Vad Innebär Det Att Skriva I Skolan? : Diktera – en digital möjlighet i en lärmiljö för alla / What Does It Mean to Write in School? : Dictation – a Digital Opportunity in a Learning Environment for All

Toresson, Anna-Karin January 2021 (has links)
This study of primary school students uses quantitative and qualitative methods and aims to build knowledge about what it means to write, examining whether dictation offers a digital opportunity in a learning environment for everyone. It is a case study with an explanatory sequential mixed-methods design, built on two quantitative methods (measurement of the LIX readability value of student texts, and the students' grades) and two qualitative methods (a questionnaire answered by seven students in eighth grade, and a semi-structured interview with a teacher). The theoretical framework rests on a socio-cultural perspective, drawing on Vygotsky's theories about language and communication and Säljö's thoughts about artefacts and dictation as a writing tool. A hermeneutic perspective frames the qualitative parts of the study, describing an interaction between theory and method analysis that opens the way to deeper understanding. The results show that students consider dictation a functional writing tool. The questionnaire responses show that students think it is important to plan their writing before dictating, and that they discover they must adapt their voice to the dictation program; by learning the software, the students develop their writing ability. Students also note that processing a dictated text differs from traditional writing and requires different correction strategies. Perhaps the biggest obstacle is that the transcriber needs access to a quiet place. The knowledge contribution to the problem area and previous research is a deeper understanding of the factors that affect students' writing through dictation. The study is relevant to the teaching profession in showing that dictation can be a way of writing for students, and its findings can support teachers in developing their schools' learning environments. Coupled with teachers' broad repertoire in writing and writing development, this can give more students the opportunity to reach the approved knowledge requirements in Swedish compulsory school, as Nilholm asserts. / Digital presentation
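The LIX readability value used as one of the quantitative measures has a simple closed form: average sentence length plus the percentage of long words (more than six characters). A minimal sketch, with deliberately simplified tokenization:

```python
import re

def lix(text: str) -> float:
    """LIX = words/sentences + 100 * long_words/words,
    where a long word has more than six characters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

sample = "Eleverna dikterar sina texter. Programmet skriver ner det som sägs."
print(round(lix(sample), 1))   # 10/2 + 100*4/10 = 45.0 for this sample
```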
693

Rozpoznávání řeči s pomocí nástroje Sphinx-4 / Speech recognition using Sphinx-4

Kryške, Lukáš January 2014 (has links)
This diploma thesis aims to find an effective method for continuous speech recognition; more precisely, it applies speech-to-text recognition to the keyword-spotting task, a solution applicable to phone-call analysis and similar applications. Most of the thesis describes and implements the speech recognition framework Sphinx-4, which uses hidden Markov models (HMMs) as the acoustic models of a language. It explains how these models can be trained for a new language or a new dialect. Finally, it describes in detail how to implement keyword spotting in the Java language.
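Sphinx-4 itself is a Java framework, but the keyword-spotting idea it supports can be sketched independently of it: score the audio under a keyword HMM and under a background (filler) model, and report a hit when the likelihood ratio exceeds a threshold. A toy numpy illustration of that ratio test (the models and probabilities below are invented for the example and are not Sphinx-4 APIs):

```python
import numpy as np

def forward_loglik(obs_loglik, log_trans, log_init):
    """Log-likelihood of an observation window under an HMM via the
    forward algorithm. obs_loglik: (T, S) per-frame log-likelihoods."""
    alpha = log_init + obs_loglik[0]
    for t in range(1, len(obs_loglik)):
        # logsumexp over previous states for each current state
        alpha = obs_loglik[t] + np.logaddexp.reduce(
            alpha[:, None] + log_trans, axis=0)
    return np.logaddexp.reduce(alpha)

rng = np.random.default_rng(0)
T, S = 40, 3                                # frames, keyword-HMM states
kw_obs = rng.normal(-2.0, 0.5, (T, S))      # toy per-frame log-likelihoods
bg_obs = rng.normal(-2.3, 0.5, (T, 1))      # single-state filler model

with np.errstate(divide="ignore"):          # log(0) -> -inf is intended
    log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                                 [0.0, 0.6, 0.4],
                                 [0.0, 0.0, 1.0]]))  # left-to-right topology
    log_init = np.log(np.array([1.0, 0.0, 0.0]))

kw = forward_loglik(kw_obs, log_trans, log_init)
bg = forward_loglik(bg_obs, np.zeros((1, 1)), np.zeros(1))
print("keyword detected" if kw - bg > 0.0 else "no keyword")
```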
694

Optimalizace rozpoznávání řeči pro mobilní zařízení / Optimization of Voice Recognition for Mobile Devices

Tomec, Martin January 2010 (has links)
This work deals with the optimization of keyword-spotting algorithms for the ARM Cortex-A8 processor architecture. It first describes this architecture, especially the NEON unit for vector computing, then briefly describes the keyword-spotting algorithms and proposes optimizations of these algorithms for the described architecture. The main part of the work is the implementation of these optimizations and an analysis of their impact on performance.
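The NEON unit accelerates exactly the kind of dense inner-loop arithmetic that keyword-spotting scoring consists of. The effect is easy to reproduce in miniature: below, a scalar loop is compared against the same dot product done as a single vector operation, with numpy standing in for SIMD (the thesis itself works with NEON intrinsics on ARM, not Python):

```python
import time
import numpy as np

a = np.random.rand(100_000).astype(np.float32)
b = np.random.rand(100_000).astype(np.float32)

t0 = time.perf_counter()
s = 0.0
for x, y in zip(a, b):           # scalar loop: one multiply-add per step
    s += x * y
t_scalar = time.perf_counter() - t0

t0 = time.perf_counter()
v = float(np.dot(a, b))          # vectorized: batched multiply-adds
t_vector = time.perf_counter() - t0

print(f"speedup: {t_scalar / t_vector:.0f}x, "
      f"results agree: {np.isclose(s, v, rtol=1e-3)}")
```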
695

Learning representations of speech from the raw waveform / Apprentissage de représentations de la parole à partir du signal brut

Zeghidour, Neil 13 March 2019 (has links)
While deep neural networks are now used in almost every component of a speech recognition system, from acoustic to language modeling, the input to such systems is still a fixed, handcrafted, spectral representation such as mel-filterbanks. This contrasts with computer vision, in which deep neural networks are now trained on raw pixels. Mel-filterbanks contain valuable and documented prior knowledge from human auditory perception as well as signal processing, and are the input to state-of-the-art speech recognition systems that are now on par with human performance in certain conditions. However, mel-filterbanks, like any fixed representation, are inherently limited by the fact that they are not fine-tuned for the task at hand. We hypothesize that learning the low-level representation of speech jointly with the rest of the model, rather than using fixed features, could push the state of the art even further. We first explore a weakly supervised setting and show that a single neural network can learn to separate phonetic information and speaker identity from mel-filterbanks or the raw waveform, and that these representations are robust across languages. Moreover, learning from the raw waveform provides significantly better speaker embeddings than learning from mel-filterbanks. These encouraging results lead us to develop a learnable alternative to mel-filterbanks that can be used directly in their place. In the second part of this thesis we introduce Time-Domain filterbanks, a lightweight neural network that takes the waveform as input, can be initialized as an approximation of mel-filterbanks, and is then learned with the rest of the neural architecture. Across extensive and systematic experiments, we show that Time-Domain filterbanks consistently outperform mel-filterbanks and can be integrated into a new state-of-the-art speech recognition system, trained directly from the raw audio signal. Since fixed speech features are also used for non-linguistic classification tasks, for which they are even less optimal, we perform dysarthria detection from the waveform with Time-Domain filterbanks and show that they significantly improve over mel-filterbanks or low-level descriptors. Finally, we discuss how our contributions fall within a broader shift towards fully learnable audio understanding systems.
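The architectural idea, a learnable convolutional frontend on the raw waveform that mimics the filtering, squared-modulus, lowpass and log-compression stages of mel-filterbanks, can be sketched in a few lines of PyTorch. This is a simplified caricature: the Gabor-wavelet initialization and the exact pooling of the thesis are omitted, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class LearnableFilterbank(nn.Module):
    """Waveform -> (batch, n_filters, frames), trainable end to end."""
    def __init__(self, n_filters=40, win=400, hop=160):  # 25 ms / 10 ms at 16 kHz
        super().__init__()
        # Complex-like filtering approximated with 2*n_filters real filters
        self.conv = nn.Conv1d(1, 2 * n_filters, kernel_size=win,
                              stride=1, padding=win // 2, bias=False)
        self.n_filters = n_filters
        self.lowpass = nn.AvgPool1d(kernel_size=win, stride=hop)

    def forward(self, wav):                     # wav: (batch, 1, samples)
        x = self.conv(wav)                      # (batch, 2F, samples)
        x = x.view(x.size(0), self.n_filters, 2, -1).pow(2).sum(2)  # squared modulus
        return torch.log1p(self.lowpass(x))     # compress, like log-mel

fb = LearnableFilterbank()
wav = torch.randn(2, 1, 16000)                  # one second of 16 kHz audio
print(fb(wav).shape)                            # torch.Size([2, 40, 98])
```

Because the frontend is an ordinary module, its weights receive gradients from the recognition loss like any other layer, which is the point of the approach.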
696

Forced alignment pomocí neuronových sítí / Forced Alignment via Neural Networks

Beňovič, Marek January 2020 (has links)
Watching videos with subtitles in the original language is one of the most effective ways of learning a foreign language, and highlighting words at the moment they are pronounced helps to synchronize visual and auditory perception and increases learning efficiency. The method for aligning orthographic transcriptions to audio recordings is known as forced alignment. This work implements a tool for aligning the transcripts of YouTube videos with the speech in their audio recordings, providing a web user interface with a video player that presents the results. It integrates two state-of-the-art Kaldi-based forced aligners, the first using a standard HMM approach and the second based on neural networks, and compares their accuracy. The integrated aligners also provide phone-level alignments, which can be used for training statistical models in further speech recognition research. The work describes the implementation and the architectural concepts the tool is based on, which can be reused in other software projects.
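At its core, forced alignment is constrained decoding: given per-frame phone posteriors from an acoustic model and the known phone sequence of the transcript, find the best monotonic frame-to-phone mapping. A minimal dynamic-programming sketch with toy posteriors (real aligners such as the Kaldi-based ones integrated here add HMM topologies, optional silence, and lexicon lookup):

```python
import numpy as np

def align(log_post, phones):
    """Viterbi alignment of a phone sequence to frame posteriors.
    log_post: (T, P) log-probabilities; phones: indices into P."""
    T, N = len(log_post), len(phones)
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0, 0] = log_post[0, phones[0]]
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1, j]                               # same phone
            advance = score[t - 1, j - 1] if j > 0 else -np.inf  # next phone
            back[t, j] = 0 if stay >= advance else 1
            score[t, j] = max(stay, advance) + log_post[t, phones[j]]
    # Backtrace: which phone each frame belongs to
    path, j = [N - 1], N - 1
    for t in range(T - 1, 0, -1):
        j -= back[t, j]
        path.append(j)
    return path[::-1]

rng = np.random.default_rng(1)
log_post = np.log(rng.dirichlet(np.ones(5), size=20))  # 20 frames, 5 phones
print(align(log_post, phones=[0, 3, 1]))               # frame -> phone index
```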
697

A Swedish wav2vec versus Google speech-to-text

Lagerlöf, Ester January 2022 (has links)
As automatic speech recognition technology becomes more advanced, the range of fields in which it can operate is growing. The best automatic speech recognition systems today are mainly based on, and made for, the English language. However, the National Library of Sweden recently released open-source wav2vec models built purposefully with the Swedish language in mind. To investigate their performance, one of these models is chosen and assessed on how well it transcribes the Swedish news broadcasts known as the "kvart-i-fem" ekot, comparing its results with Google speech-to-text. The results show wav2vec to be the stronger model for this type of audio data, achieving a word error rate that is on average 9 percentage points lower than Google speech-to-text's. Part of this performance can be attributed to the self-supervised method the wav2vec model uses to exploit large amounts of unlabeled data in its training. In spite of this, both models had difficulty transcribing audio of poor quality, such as recordings with disturbing background noise and stationary sounds. Abbreviations and names were also difficult for both models to transcribe correctly, although Google speech-to-text performed better than the wav2vec model on this point.
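Word error rate, the metric behind the 9-percentage-point gap reported above, is the word-level Levenshtein distance between hypothesis and reference divided by the number of reference words. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming over edit distance between word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

print(wer("det var en gång en katt", "det var en gång katt"))  # 1/6 ≈ 0.167
```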
698

AI’s implications for International Entrepreneurship in the digital and pandemic world : From external and internal perspectives

Lampic Aaltonen, Ibb, Fust, Fiona January 2022 (has links)
In the fourth industrial revolution, technological advancement and digital transformation are inevitable, and they impact individuals, organizations, and governments tremendously and extensively. The ongoing COVID-19 pandemic has been a catalyst accelerating the pace and scale of digitalization, leading to a dramatic shift in the business environment. Artificial intelligence (AI) attracts increasing attention for the business opportunities and value it can create from both internal and external aspects alike. Grounded in the digital context, with AI as an enabler from the external perspective and AI as a core resource from the internal perspective, the research attempts to identify 1) AI's implications for international entrepreneurs' possibilities to explore business opportunities, and 2) AI's significance for international entrepreneurs in enhancing performance and generating value in the international market. The research conducts qualitative analysis based on six case studies to examine and explore the aforementioned research area. It supports the theoretical framework in which AI as an enabler provides international entrepreneurs with conducive conditions to test and experiment with new business initiatives, which positively impacts innovation and opens a wide new window of business opportunity across borders. In parallel, the research is consistent with theories that regard AI, from the resource-based view, as a valuable resource that contributes to SMEs' enhanced performance and paves the way for international entrepreneurs to stay competitive. In addition, the study proposes that combining entrepreneurs' heuristic approaches to strategic decision-making with the assistance of AI in uncertain circumstances is crucial for conducting business in the digital environment. The research highlights the integration of innovation resources from external and internal aspects alike to stimulate and catalyse the growth of international entrepreneurship in the digital industry in established markets, and it underlines that the COVID-19 pandemic has caused changes in the digital environment that affect international entrepreneurial activities. The article concludes with the implications of these circumstances for international entrepreneurship, proposing a theoretical framework and providing an agenda for future research in the area.
699

Development of a text-independent automatic speaker recognition system

Mokgonyane, Tumisho Billson January 2021 (has links)
Thesis (M. Sc. (Computer Science)) -- University of Limpopo, 2021 / The task of automatic speaker recognition, wherein a system verifies or identifies speakers from a recording of their voices, has been researched for several decades. However, research in this area has largely been carried out on freely accessible speaker datasets built on well-resourced languages such as English. This study undertakes automatic speaker recognition research focused on a low-resourced language, Sepedi. As one of the 11 official languages of South Africa, Sepedi is spoken by at least 2.8 million people. Pre-recorded voices were acquired from a national speech and language repository, the National Centre for Human Language Technology (NCHLT), from which we selected the Sepedi NCHLT Speech Corpus. The open-source pyAudioAnalysis Python library was used to extract three types of acoustic features, namely time, frequency, and cepstral domain features, from the acquired speech data. The effects and compatibility of these acoustic features were investigated, and it was observed that combining the three feature types had a more significant effect on speaker recognition accuracy than using individual features. The study also investigated the performance of machine learning algorithms on low-resourced languages such as Sepedi. Five machine learning (ML) algorithms implemented in scikit-learn, namely k-nearest neighbours (KNN), support vector machines (SVM), random forest (RF), logistic regression (LR), and multi-layer perceptrons (MLP), were used to train classifier models, and the GridSearchCV algorithm, also implemented in scikit-learn, was used to find suitable hyper-parameters for each of the five ML algorithms. The classifier models were evaluated on recognition accuracy, and the results show that the MLP classifier, with a recognition accuracy of 98%, outperforms the KNN, RF, LR, and SVM classifiers. A graphical user interface (GUI) was developed, and the best-performing model, the MLP classifier, was deployed on it for real-time speaker identification and verification tasks. Participants were recruited to evaluate the GUI's performance, and acceptable results were obtained.
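The training pipeline described, combined acoustic features fed to scikit-learn classifiers tuned with GridSearchCV, follows a standard pattern. A condensed sketch of the MLP branch, with synthetic vectors standing in for the pyAudioAnalysis time, frequency and cepstral features and with an illustrative parameter grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for per-utterance feature vectors (e.g. short-term features
# averaged over each utterance) and speaker labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 68))
y = rng.integers(0, 10, size=400)            # 10 speakers

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))
grid = GridSearchCV(
    pipe,
    param_grid={"mlpclassifier__hidden_layer_sizes": [(64,), (128, 64)],
                "mlpclassifier__alpha": [1e-4, 1e-2]},
    cv=3)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```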
700

Probabilistic Modelling of Hearing : Speech Recognition and Optimal Audiometry

Stadler, Svante January 2009 (has links)
Hearing loss afflicts as many as 10% of our population. Fortunately, technologies designed to alleviate the effects of hearing loss are improving rapidly, including cochlear implants and the increasing computing power of digital hearing aids. This thesis focuses on theoretically sound methods for improving hearing aid technology. The main contributions are documented in three research articles, which treat two separate topics: modelling of human speech recognition (Papers A and B) and optimization of diagnostic methods for hearing loss (Paper C). Papers A and B present a hidden Markov model-based framework for simulating speech recognition in noisy conditions using auditory models and signal detection theory. In Paper A, a model of normal and impaired hearing is employed, in which a subject's pure-tone hearing thresholds are used to adapt the model to the individual. In Paper B, the framework is modified to simulate hearing with a cochlear implant (CI). Two models of hearing with CI are presented: a simple, functional model and a biologically inspired model. The models are adapted to the individual CI user by simulating a spectral discrimination test. The framework can estimate speech recognition ability for a given hearing impairment or cochlear implant user. This estimate could potentially be used to optimize hearing aid settings. Paper C presents a novel method for sequentially choosing the sound level and frequency for pure-tone audiometry. A Gaussian mixture model (GMM) is used to represent the probability distribution of hearing thresholds at 8 frequencies. The GMM is fitted to over 100,000 hearing thresholds from a clinical database. After each response, the GMM is updated using Bayesian inference. The sound level and frequency are chosen so as to maximize a predefined objective function, such as the entropy of the probability distribution. It is found through simulation that an average of 48 tone presentations are needed to achieve the same accuracy as the standard method, which requires an average of 135 presentations.
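Paper C's sequential procedure can be sketched at a single frequency: keep a posterior over the hearing threshold (here a discretized two-component mixture as a stand-in for the clinically fitted GMM), update it after each yes/no response through a psychometric likelihood, and present the next tone at the level that minimizes expected posterior entropy. All parameters below are invented for illustration:

```python
import numpy as np

levels = np.arange(-10, 101)                 # candidate thresholds, dB HL
# Toy two-component mixture prior over the threshold
prior = (0.7 * np.exp(-0.5 * ((levels - 10) / 8) ** 2)
         + 0.3 * np.exp(-0.5 * ((levels - 55) / 15) ** 2))
prior /= prior.sum()

def p_yes(tone_db, threshold_db, slope=2.0):
    """Psychometric function: probability of hearing the tone."""
    return 1.0 / (1.0 + np.exp(-(tone_db - threshold_db) / slope))

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_tone(posterior):
    """Pick the presentation level minimizing expected posterior entropy."""
    best, best_h = None, np.inf
    for tone in range(0, 91, 5):
        like = p_yes(tone, levels)
        py = (posterior * like).sum()         # marginal P(response = yes)
        h = (py * entropy(posterior * like / py)
             + (1 - py) * entropy(posterior * (1 - like) / (1 - py)))
        if h < best_h:
            best, best_h = tone, h
    return best

posterior, true_threshold = prior.copy(), 45.0
for trial in range(10):
    tone = best_tone(posterior)
    heard = np.random.default_rng(trial).random() < p_yes(tone, true_threshold)
    like = p_yes(tone, levels) if heard else 1 - p_yes(tone, levels)
    posterior = posterior * like
    posterior /= posterior.sum()
print("estimated threshold:", levels[posterior.argmax()], "dB HL")
```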
