Development of a cloud platform for automatic speech recognition

Klejch, Ondřej January 2015
This thesis presents CloudASR, a cloud platform for automatic speech recognition built on top of the Kaldi speech recognition toolkit. The platform supports both batch and online speech recognition modes and provides an annotation interface for transcribing submitted recordings. Its key features are scalability, customizability and easy deployment. Benchmarks show that the platform achieves latency comparable to the Google Speech API and can achieve better accuracy on limited domains. The benchmarks also show that the platform can handle more than 1000 parallel requests given enough computational resources.

Achieving Automatic Speech Recognition for Swedish using the Kaldi toolkit / Automatisk taligenkänning på svenska med verktyget Kaldi

Mossberg, Zimon January 2016
The meager offering of online commercial Swedish Automatic Speech Recognition services prompts the effort to develop a speech recognizer for Swedish using the open source toolkit Kaldi and the publicly available NST speech corpus. Using a previous Kaldi recipe, several GMM-HMM models are trained and evaluated against commercial options, so that the performance of a customized Automatic Speech Recognition solution can be compared with that of commercial services. The evaluation takes both accuracy and computational speed into consideration. Initial results of the evaluation indicate a systematic bias in the selected test set, which is confirmed by a follow-up investigative evaluation. The conclusion is that building a speech recognizer for Swedish using the NST corpus and Kaldi without expert knowledge is feasible but requires further work. / A speech recognizer for Swedish is developed with the aim of evaluating how a recognizer built with freely available tools compares with commercial speech recognition services. The tool used is the open source toolkit Kaldi, and the training data is the publicly available Swedish speech corpus from NST. The resulting models are compared with commercially available speech recognition services for Swedish. Early results of the comparison indicate a systematic bias in the selected test data, which is confirmed by a follow-up investigative evaluation. The conclusion of the work is that the prospects for building a speech recognizer for Swedish are good, but that doing so requires extensive work.

Uticaj morfoloških obeležja na modelovanje jezika primenom neuronskih mreža u sistemima za prepoznavanje govora / Influence of Morphological Features on Language Modeling With Neural Networks in Speech Recognition Systems

Pakoci, Edvin 30 December 2019
Automatic speech recognition is a technology that enables computers to convert spoken words into text. It can be applied in many modern systems that involve communication between humans and machines. This dissertation describes in detail one of the two main components of a speech recognition system, namely the language model, which specifies the vocabulary of the system as well as the rules by which individual words can be combined into a sentence. Serbian belongs to the group of highly inflective and morphologically rich languages, which means that it uses a large number of different word endings to express the desired grammatical, syntactic, or semantic function of a given word. This behavior often leads to a large number of recognition errors in which, because of a good acoustic match, the recognizer gets the base form of the word right but gets its ending wrong. That ending may denote a different morphological category, for example case, gender, or number. The dissertation presents a new language modeling tool that, in addition to word identity, can use additional lexical and morphological features of words, thereby testing the hypothesis that this additional information can help overcome a significant number of recognition errors that result from the inflectivity of Serbian. / Automatic speech recognition is a technology that allows computers to convert spoken words into text. It can be applied in various areas which involve communication between humans and machines. This thesis primarily deals with one of the two main components of speech recognition systems, the language model, which specifies the vocabulary of the system as well as the rules by which individual words can be linked into sentences. The Serbian language belongs to a group of highly inflective and morphologically rich languages, which means that it uses a number of different word endings to express the desired grammatical, syntactic, or semantic function of the given word. Such behavior often leads to a significant number of errors in speech recognition systems where, due to good acoustic matching, the recognizer correctly guesses the basic form of the word, but an error occurs in the word ending. This word ending may indicate a different morphological category, for example, word case, grammatical gender, or grammatical number. The thesis presents a new language modeling tool which, along with the word identity, can also model additional lexical and morphological features of the word, thus testing the hypothesis that this additional information can help overcome a significant number of recognition errors that result from the high inflectivity of the Serbian language.

Speech to Text for Swedish using KALDI / Tal till text, utvecklandet av en svensk taligenkänningsmodell i KALDI

Kullmann, Emelie January 2016
Over the last decade, the field of speech recognition has left the research stage and found its way into the public market. Most computers and mobile phones sold today support dictation and transcription in a number of chosen languages. Swedish is often not one of them. In this thesis, which is carried out on behalf of the Swedish Radio, an Automatic Speech Recognition model for Swedish is trained and its performance evaluated. The model is built using the open source toolkit Kaldi. Two approaches to training the acoustic part of the model are investigated: first, using Hidden Markov Models and Gaussian Mixture Models, and second, using Hidden Markov Models and Deep Neural Networks. The latter approach, using deep neural networks, is found to achieve better performance in terms of Word Error Rate. / In recent years, various applications in human-computer interaction, and speech recognition in particular, have found their way onto the general market. Many systems and technical products today support transcribing speech and dictating text. However, this applies mainly to the larger languages, and the same support is rarely available for smaller languages such as Swedish. In this degree project, a speech recognition model for Swedish has been developed. The project was carried out on behalf of Sveriges Radio, which would benefit greatly from a working speech recognition model for Swedish. The model is developed in the Kaldi framework. Two approaches to the acoustic training of the model are implemented, and their performance is evaluated and compared. First a model is trained using Hidden Markov Models and Gaussian Mixture Models, and then a model using Hidden Markov Models and Deep Neural Networks. The latter turns out to achieve a better result in terms of Word Error Rate.

Rozpoznávání řeči pomocí KALDI / Speech Recognition Using KALDI

Plátek, Ondřej January 2014
The topic of this thesis is the implementation of an efficient decoder for the Kaldi ASR training toolkit (http://kaldi.sourceforge.net/). Kaldi already ships with decoders, but they are not convenient for dialogue systems. The main goal of this thesis is to develop a real-time decoder for a dialogue system that minimizes latency and optimizes speed. The methods used for speeding up the decoder are not limited to multi-threaded decoding or the use of GPU cards for general computations. Part of this work is devoted to training an acoustic model and testing it in the "Vystadial" dialogue system.

Automatic Speech Recognition Model for Swedish using Kaldi

Wang, Yihan January 2020
With the advent of the intelligent era, speech recognition has become a hot topic. Although many automatic speech recognition (ASR) tools have been put on the market, a considerable number of them do not support Swedish because of its small number of speakers. In this project, a Swedish ASR model based on Hidden Markov Models and Gaussian Mixture Models is built using Kaldi, with the aim of helping ICA Banken classify after-sales voice calls. A variety of model configurations are explored, differing in their phoneme combination methods and in their eigenvalue extraction and processing methods. Word Error Rate and Real Time Factor are selected as evaluation criteria to compare the recognition accuracy and speed of the models. For large vocabulary continuous speech recognition, triphone models perform much better than monophone models. Adding feature transformations further improves both accuracy and speed. The combination of linear discriminant analysis, maximum likelihood linear transform and speaker adaptive training obtains the best performance in this implementation. Among the feature extraction methods, mel-frequency cepstral coefficients are more conducive to higher accuracy, while perceptual linear prediction tends to improve the overall speed. / Several solutions for automatic transcription exist on the market, but a large share of them do not support Swedish because of its relatively small number of speakers. In this project, automatic transcription for Swedish was created with Hidden Markov models and Gaussian mixture models using Kaldi, in order to enable ICA Banken to classify calls to its customer service. A range of model variants with different phoneme combination methods, eigenvalue computation and data processing methods were explored. Word error rate and real time factor were chosen as evaluation criteria to compare accuracy and speed across the models. For large vocabulary continuous transcription, triphone models give much better performance than monophone models. Transformations improve both accuracy and speed. The combination of linear discriminant analysis, maximum likelihood linear transform and speaker adaptive training gives the best performance in this implementation. Among the feature extraction methods, mel-frequency cepstral coefficients contribute to better accuracy, while perceptual linear prediction tends to increase speed.
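
The two evaluation criteria named in this abstract can be sketched in a few lines of Python. This is an illustrative sketch rather than the thesis's own scoring code: it assumes whitespace-tokenized transcripts, and the function names `word_error_rate` and `real_time_factor` are hypothetical. Word error rate is taken as the word-level edit distance divided by the number of reference words, and real time factor as decoding time divided by audio duration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(decoding_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return decoding_seconds / audio_seconds


if __name__ == "__main__":
    # One deleted word out of five reference words.
    print(word_error_rate("det är en bra dag", "det är bra dag"))
    # 30 s of decoding for 60 s of audio.
    print(real_time_factor(30.0, 60.0))
```

Because insertions count as errors, the word error rate can exceed 1.0 when the hypothesis is much longer than the reference.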

Nízko-dimenzionální faktorizace pro "End-To-End" řečové systémy / Low-Dimensional Matrix Factorization in End-To-End Speech Recognition Systems

Gajdár, Matúš January 2020
The project covers automatic speech recognition with neural network training using low-dimensional matrix factorization. We describe time delay neural networks with factorization (TDNN-F) and without it (TDNN), implemented in PyTorch. We compare the PyTorch implementation with the Kaldi toolkit and achieve similar results in experiments with various network architectures. The last chapter describes the impact of low-dimensional matrix factorization on End-to-End speech recognition systems, as well as a modification of the system with TDNN(-F) networks. With specific network settings, we achieved better results with systems using factorization. Additionally, we reduced the complexity of training by decreasing the number of network parameters through the use of TDNN(-F) networks.
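
The parameter reduction mentioned in this abstract follows from replacing one weight matrix by the product of two thinner ones. The sketch below is a plain-Python illustration, not the thesis's PyTorch code: the helper names are hypothetical, and the layer width of 1536 and bottleneck of 160 are assumed example values, not figures from the thesis.

```python
def dense_params(rows: int, cols: int) -> int:
    """Weights in a single rows x cols matrix."""
    return rows * cols


def factorized_params(rows: int, cols: int, rank: int) -> int:
    """Weights in A (rows x rank) plus B (rank x cols), where W is replaced by A @ B."""
    return rank * (rows + cols)


def matvec(matrix, vector):
    """Multiply a matrix, given as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def factorized_forward(a, b, x):
    """Compute A (B x): project down to the bottleneck, then back up."""
    return matvec(a, matvec(b, x))


if __name__ == "__main__":
    # Assumed example sizes: a 1536 x 1536 layer with a 160-dimensional bottleneck.
    print(dense_params(1536, 1536))            # 2359296
    print(factorized_params(1536, 1536, 160))  # 491520

    a = [[1, 0], [0, 1], [1, 1]]      # 3 x 2
    b = [[1, 2, 3, 4], [0, 1, 0, 1]]  # 2 x 4
    print(factorized_forward(a, b, [1, 1, 1, 1]))  # [10, 2, 12]
```

The factorized form saves parameters whenever the rank is smaller than rows * cols / (rows + cols), which is half the layer width for a square matrix.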

Forced alignment pomocí neuronových sítí / Forced Alignment via Neural Networks

Beňovič, Marek January 2020
Watching videos with subtitles in the original language is one of the most effective ways of learning a foreign language. Highlighting words at the moment they are pronounced helps to synchronize visual and auditory perception and increases learning efficiency. The method for aligning orthographic transcriptions to audio recordings is known as forced alignment. This work implements a tool for aligning transcripts of YouTube videos with the speech in their audio recordings, and provides a web user interface with a video player that presents the results. It integrates two state-of-the-art forced aligners based on Kaldi, the first using a standard HMM approach and the second based on neural networks, and compares their accuracy. The integrated aligners also provide a phone-level alignment, which can be used for training statistical models in further speech recognition research. The work describes the implementation and the architectural concepts the tool is based on, which can be used in various software projects.
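
The core idea of forced alignment can be sketched as a small dynamic program. The example below is a simplified illustration and not the code of the Kaldi-based aligners the thesis integrates: it assumes a precomputed table `frame_scores[t][s]` holding the log-score of audio frame `t` for the `s`-th unit of the known transcript, ignores HMM transition probabilities, and uses hypothetical function names. It finds the best monotonic assignment of frames to transcript units in which every unit covers at least one frame.

```python
import math


def forced_align(frame_scores):
    """Return, for each frame, the index of the transcript unit it is aligned to."""
    n_frames = len(frame_scores)
    n_units = len(frame_scores[0])
    if n_frames < n_units:
        raise ValueError("need at least one frame per transcript unit")
    neg_inf = -math.inf
    best = [[neg_inf] * n_units for _ in range(n_frames)]
    advanced = [[False] * n_units for _ in range(n_frames)]
    best[0][0] = frame_scores[0][0]  # the alignment must start in the first unit
    for t in range(1, n_frames):
        for s in range(n_units):
            stay = best[t - 1][s]                          # remain in the same unit
            move = best[t - 1][s - 1] if s > 0 else neg_inf  # enter the next unit
            if move > stay:
                best[t][s] = move + frame_scores[t][s]
                advanced[t][s] = True
            else:
                best[t][s] = stay + frame_scores[t][s]
    # Backtrace from the last unit at the last frame.
    path = [0] * n_frames
    s = n_units - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = s
        if advanced[t][s]:
            s -= 1
    return path


def to_segments(path, labels):
    """Collapse a frame-level path into (label, first_frame, last_frame) tuples."""
    segments = []
    start = 0
    for frame in range(1, len(path) + 1):
        if frame == len(path) or path[frame] != path[start]:
            segments.append((labels[path[start]], start, frame - 1))
            start = frame
    return segments


if __name__ == "__main__":
    scores = [[-0.1, -3.0], [-0.2, -2.5], [-2.0, -0.3], [-3.0, -0.1], [-2.5, -0.2]]
    path = forced_align(scores)
    print(path)                           # [0, 0, 1, 1, 1]
    print(to_segments(path, ["h", "i"]))  # [('h', 0, 1), ('i', 2, 4)]
```

Multiplying the frame indices by the frame shift of the feature extractor would turn the segments into timestamps suitable for highlighting words or phones during playback.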

Zobrazení a analýza aktivit neuronové sítě ve skrytých vrstvách / Activity of Neural Network in Hidden Layers - Visualisation and Analysis

Fábry, Marko January 2016
The goal of this work was to create a system capable of visualising the activation function values produced by neurons in the hidden layers of neural networks used for speech recognition. The work also describes experiments comparing visualisation methods, visualisations of neural networks with different architectures, and neural networks trained with different types of input data. The visualisation system implemented in this work is based on previous work by Mr. Khe Chai Sim and is extended with new methods of data normalization. The Kaldi toolkit was used to prepare the neural network training data, and the CNTK framework was used for neural network training. The core of this work, the visualisation system, was implemented in the scripting language Python.

Automatic Speech Recognition in Somali

Gabriel, Naveen January 2020
During the last decade, the field of speech recognition has left the research stage and found its way into the public market, and today speech recognition software is ubiquitous. An automatic speech recognizer understands human speech and represents it as text. Most current speech recognition software employs variants of deep neural networks. Before the deep learning era, the hybrid of hidden Markov model and Gaussian mixture model (HMM-GMM) was a popular statistical model for speech recognition. In this thesis, an automatic speech recognizer using HMM-GMM was trained on Somali data consisting of voice recordings and their transcriptions. HMM-GMM is a hybrid system in which the framework is composed of an acoustic model and a language model. The acoustic model represents the time-variant aspect of the speech signal, and the language model determines how probable the observed sequence of words is. The thesis begins with background on speech recognition, and a literature survey covers some of the work that has been done in this field. The thesis evaluates how different language models and discounting methods affect the performance of speech recognition systems. In addition, log scores were calculated for the top 5 predicted sentences, along with confidence measures for the predicted sentences. The model was trained on 4.5 hours of voiced data and its corresponding transcription, and it was evaluated on 3 minutes of test data. The performance of the trained model on the test set was good, given that the data was free of background noise and lacked variability. The performance of the model is measured using word error rate (WER) and sentence error rate (SER). The performance of the implemented model is also compared with the results of other research work. The thesis also discusses why the log and confidence scores of a sentence might not be a good way to measure the performance of the resulting model. It also discusses the shortcomings of the HMM-GMM model, how the existing model can be improved, and different alternatives for solving the problem.
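
To illustrate what a discounting method does in a language model, the following sketch implements an interpolated bigram model with absolute discounting. The abstract does not say which discounting methods or toolkit the thesis compares, so the choice of absolute discounting, the discount value of 0.75, and the class name `DiscountedBigramModel` are all assumptions made for illustration. A fixed discount is subtracted from every observed bigram count, and the freed probability mass is redistributed according to the unigram distribution, so that unseen word pairs no longer receive zero probability.

```python
from collections import Counter, defaultdict


class DiscountedBigramModel:
    """Interpolated bigram model with absolute discounting (illustrative only)."""

    def __init__(self, sentences, discount=0.75):
        if not 0.0 < discount < 1.0:
            raise ValueError("discount must lie strictly between 0 and 1")
        self.discount = discount
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(tokens[1:])  # <s> is never predicted
            for history, word in zip(tokens, tokens[1:]):
                self.bigrams[history][word] += 1
        self.total = sum(self.unigrams.values())

    def unigram_prob(self, word):
        return self.unigrams[word] / self.total

    def prob(self, word, history):
        followers = self.bigrams.get(history)
        if not followers:
            # Unseen history: fall back to the unigram distribution.
            return self.unigram_prob(word)
        history_count = sum(followers.values())
        discounted = max(followers[word] - self.discount, 0.0) / history_count
        # Mass removed from the seen bigrams, spread according to the unigrams.
        backoff_weight = self.discount * len(followers) / history_count
        return discounted + backoff_weight * self.unigram_prob(word)


if __name__ == "__main__":
    model = DiscountedBigramModel(["a b", "a c", "a b"])
    print(model.prob("b", "a"))     # seen bigram
    print(model.prob("</s>", "a"))  # unseen bigram, still non-zero
```

For any seen history the probabilities over the vocabulary sum to one, which is what makes scores from differently discounted models comparable.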

