721 |
[en] ASSESSMENT OF FINE-TUNING ON END-TO-END SPEECH RECOGNITION MODELS / [pt] AVALIAÇÃO DE AJUSTE FINO EM MODELOS DE PONTA A PONTA PARA RECONHECIMENTO DE FALA. Grosman, Jonatas dos Santos 04 November 2022 (has links)
[en] Using representations given by a large pre-trained model has become
the primary strategy to reach the state of the art in a wide variety of tasks. A
recently proposed large pre-trained model, wav2vec 2.0, was seminal for several
other works on pre-training large models on speech data. Many models are
being pre-trained using the same transformer-based architecture as wav2vec
2.0 and are achieving state-of-the-art results in various speech-related tasks. However, few works have offered deeper analysis of how these models behave in different fine-tuning scenarios. Our work investigates these models concerning two different
aspects. The first is about the cross-lingual transferability of these models. Our
experiments showed us that the size of the data used during the pre-training of these models is not as crucial to transferability as its diversity. We noticed that the performance on Indo-European languages is superior to that on non-Indo-European languages in the evaluated models. We have seen a positive cross-lingual transfer of knowledge using monolingual models, which was noticed
in all the languages we used but was more evident when the language used
during the pre-training was more similar to the downstream task language. The
second aspect we investigated in our work is how well these models perform
in data imbalance scenarios, where there is a more representative subset in
the fine-tuning dataset. Our results showed that data imbalance in fine-tuning
generally affects the final result of the models, with better performance in
the most representative subsets. However, greater variability in the training
set favors model performance for a more representative subset. Nevertheless,
this greater variability in the data did not favor languages not seen during
training. We also observed that the models seem more robust in dealing with
gender imbalance than age or accent. With these findings, we hope to help the
scientific community in the use of existing pre-trained models, as well as assist
in the pre-training of new models.
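To make the fine-tuning setup studied here concrete, the following is a minimal sketch of CTC fine-tuning of a pre-trained wav2vec 2.0 model with the Hugging Face transformers library. The checkpoint name, learning rate, and batch format are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch of CTC fine-tuning for a pre-trained wav2vec 2.0 model.
# Checkpoint, learning rate, and batch format are illustrative assumptions.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "facebook/wav2vec2-base-960h"  # assumed checkpoint with a CTC head
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# Freeze the convolutional feature encoder; only the transformer layers and
# the CTC head are updated, a common choice when fine-tuning data is limited.
model.freeze_feature_encoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def training_step(speech_arrays, transcripts):
    """One CTC fine-tuning step on raw 16 kHz waveforms and their transcripts."""
    inputs = processor(speech_arrays, sampling_rate=16_000,
                       return_tensors="pt", padding=True)
    labels = processor.tokenizer(transcripts, return_tensors="pt",
                                 padding=True).input_ids
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding
    loss = model(inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Freezing the feature encoder is one of the design choices whose effect varies across the imbalance and cross-lingual scenarios analyzed above.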
|
722 |
Development of an Application for Voice Control During Stocktaking / Utveckling av applikation för röststyrning vid inventering. Hall, Melvin January 2023 (has links)
The manufacturing industry plays an important role in Sweden's economy and has long produced high-quality goods that are exported all over the world. By using modern technologies such as advanced warehouse systems and digital tools, companies can increase productivity and reduce costs. An example of such modern technology is Automatic Speech Recognition (ASR). Most previous research in the field of ASR has focused on analyzing and addressing various kinds of problems related to the performance of an ASR system. Furthermore, there are also a number of studies on how ASR has been used in the manufacturing industry, more specifically to facilitate order picking. In this work, the use of speech recognition was investigated as a possible method to facilitate and streamline the inventory process. To investigate this, a prototype web application was developed. The application enables a user, through speech recognition, to speak an article number together with the quantity available in the warehouse. The user then receives both visual and audible confirmation, after which the application automatically registers the count in the Monitor ERP software. Data was collected by observing user tests and conducting interviews with individuals who all have some connection to warehouse work at different manufacturing companies. The results indicated that the inventory process could become more efficient by using the application. However, some deficiencies were identified during the user tests, which means the prototype needs further development and increased robustness before it can serve as a tool for inventory management.
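As a rough illustration of the core interaction the prototype implements, the sketch below captures an utterance, transcribes it, and parses an article number and quantity. The SpeechRecognition library, the Swedish language code, and the utterance format are assumptions for illustration; the actual prototype is a web application integrated with Monitor ERP.

```python
# Hedged sketch of the recognition step: capture speech, transcribe it, and
# parse "<article number> <quantity>". Library choice and utterance format
# are assumptions, not the thesis's implementation.
import re
import speech_recognition as sr

recognizer = sr.Recognizer()

def capture_inventory_entry():
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    # Swedish language code, since the application targets Swedish warehouses.
    text = recognizer.recognize_google(audio, language="sv-SE")
    # Assumed utterance shape: "<article number> <quantity>", e.g. "1204 75".
    match = re.match(r"(\d+)\s+(\d+)", text)
    if match is None:
        raise ValueError(f"Could not parse utterance: {text!r}")
    article, quantity = match.groups()
    return article, int(quantity)
```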
|
723 |
Automatic Voice Trading Surveillance : Achieving Speech and Named Entity Recognition in Voice Trade Calls Using Language Model Interpolation and Named Entity Abstraction. Sundberg, Martin, Ohlsson, Mikael January 2023 (has links)
This master's thesis explores the effectiveness of interpolating a larger generic speech recognition model with smaller domain-specific models to enable transcription of domain-specific conversations. The study uses a corpus from the financial domain, collected from the web and processed by abstracting named entities such as financial instruments, numbers, and names of people and companies. Each named entity in the domain-specific corpus is substituted with a tag representing its entity type, so that during the hypothesis search the tag can be replaced by words added to the system's pronunciation dictionary. This makes instruments and other domain-specific terms extensible purely through configuration. A proof-of-concept automatic speech recognition system with the ability to transcribe and extract named entities within the constantly changing domain of voice trading was created. The system achieved a Word Error Rate of 25.08 and an F1-score of 0.9091 using stochastic and neural-net-based language models. The best configuration proved to be a combination of both stochastic and neural-net-based domain-specific models interpolated with a generic model. This shows that even though the models were trained on the same corpus, different models learned different aspects of the material. The study was deemed successful by the authors, as the Word Error Rate was improved by model interpolation and all but one of the named entities were found in the test recordings by all configurations. By adjusting how much influence the domain-specific models had relative to the generic model, transcription accuracy could be improved at the cost of named entity recognition, and vice versa. Ultimately, the choice of configuration depends on the business case and the relative importance of named entity recognition versus accurate transcriptions.
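A minimal sketch of the interpolation idea follows, assuming two already-trained language models exposed as probability functions; the interpolation weight is the knob the authors adjust to trade transcription accuracy against named entity recognition. The weight value and the toy stand-in models are assumptions for illustration.

```python
# Minimal sketch of linear language-model interpolation: the probability of a
# word is a weighted mix of a generic model and a domain-specific model.
def interpolated_prob(word, history, p_generic, p_domain, lam=0.3):
    """P(word | history) = lam * P_domain + (1 - lam) * P_generic."""
    return lam * p_domain(word, history) + (1 - lam) * p_generic(word, history)

# Usage with toy stand-in models (a real system would query trained n-gram
# or neural models instead):
p_generic = lambda w, h: 0.0001  # broad coverage, weak on financial terms
p_domain = lambda w, h: 0.02     # sharp on financial terms, narrow otherwise
print(interpolated_prob("OMXS30", ("buy",), p_generic, p_domain))
```

Raising the weight lam pushes probability mass toward domain terms such as instrument names, which is exactly the accuracy-versus-entity-recognition trade-off described above.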
|
724 |
Swedish Language End-to-End Automatic Speech Recognition for Media Monitoring using Deep Learning. Nyblom, Hector January 2022 (has links)
In order to extract relevant information from speech recordings, the general approach is to first convert the audio into transcribed text, which can then be analysed using well-researched methods. NewsMachine AB provides customers with an overview of how they are represented in media by analysing articles in text form. Their plans to scale up their monitoring of publicly available speech recordings were the basis for this thesis. In this thesis I compare three end-to-end Automatic Speech Recognition (ASR) models, in order to find the one that currently works best for transcribing Swedish-language radio recordings, considering accuracy and inference speed (computational complexity). The results show that the QuartzNet architecture is the fastest, but wav2vec models pre-trained on Swedish speech, provided by KBLab, have by far the best accuracy. The KBLab model was used for further fine-tuning on subsets with varying amounts of training data from radio recordings. The results show that further fine-tuning the KBLab models on low-resource Swedish speech domains achieves impressive accuracy. With just 5 hours of training data, the result is an 11.5% Word Error Rate and a 3.8% Character Error Rate. A final model was fine-tuned on all 35 hours of the radio-domain dataset, achieving a 10.4% Word Error Rate and a 3.5% Character Error Rate. The thesis presents a complete pipeline able to convert audio of any length into a transcription. As a pre-processing step, the audio is segmented based on silence, where silence marks the point at which one sentence ends and a new one begins. The audio segments are passed to the final fine-tuned ASR model, and the outputs are concatenated into a complete punctuated transcript. This implementation allows for punctuation, as well as timestamps for when sentences occur in the audio. The results show that the complete pipeline performs well on high-quality audio recordings, but further work is needed to achieve optimal performance on noisy and disruptive audio.
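For reference, the two metrics reported above, Word Error Rate and Character Error Rate, can be computed with a standard edit-distance formulation; the sketch below is a self-contained implementation over words (WER) and characters (CER).

```python
# Self-contained sketch of WER and CER: edit distance between reference and
# hypothesis, divided by the length of the reference.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("det regnar i stockholm", "det regnade i stockholm"))  # 0.25
```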
|
725 |
Parallel Viterbi Search For Continuous Speech Recognition On A Multi-Core Architecture. Parihar, Naveen 11 December 2009 (has links)
State-of-the-art speech recognition systems can successfully perform simple tasks in real time on most computers, when the tasks are performed in controlled and noise-free environments. However, current algorithms and processors are not yet powerful enough for real-time large-vocabulary conversational speech recognition in noisy, real-world environments. Parallel processing can improve the real-time performance of speech recognition systems and increase their applicability, and developing an effective approach to parallelization is especially important given the recent trend toward multi-core processor design. In this dissertation, we introduce methods for parallelizing a single-pass across-word n-gram lexical-tree-based Viterbi recognizer, which is the most popular architecture for Viterbi-based large-vocabulary continuous speech recognition. We parallelize two different open-source implementations of such a recognizer, one developed at Mississippi State University and the other developed at Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen in Germany. We describe three methods for parallelization. The first, called parallel fast likelihood computation, parallelizes likelihood computations by decomposing mixtures among CPU cores, so that each core computes the likelihood of the set of mixtures allocated to it. A second method, lexical-tree division, parallelizes the search-management component of a speech recognizer by dividing the lexical tree among the cores. A third and alternative method for parallelizing the search-management component, called lexical-tree copies decomposition, dynamically distributes the active lexical-tree copies among the cores. All parallelization methods were tested on two and four cores of an Intel Core 2 Quad processor and significantly improved real-time performance. Several challenges in parallelizing a lexical-tree-based Viterbi speech recognizer are also identified and discussed.
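The first method can be illustrated with a short sketch: the components of a Gaussian mixture model are partitioned among worker processes, each computes log-likelihoods for its share, and the partial results are combined. Diagonal covariances and the Python process pool are illustrative assumptions; the dissertation's recognizers are not Python systems.

```python
# Hedged sketch of "parallel fast likelihood computation": split Gaussian
# mixture components across cores, evaluate each share, combine with
# log-sum-exp. Diagonal covariances are an assumption for illustration.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def chunk_log_likelihoods(args):
    """Log-likelihoods of one feature vector under a chunk of diagonal Gaussians."""
    x, means, variances, log_weights = args
    # log N(x; mu, diag(var)) summed over dimensions, per mixture component
    ll = -0.5 * np.sum(np.log(2 * np.pi * variances)
                       + (x - means) ** 2 / variances, axis=1)
    return ll + log_weights

def gmm_log_likelihood(x, means, variances, log_weights, n_cores=4):
    chunks = [(x, m, v, w)
              for m, v, w in zip(np.array_split(means, n_cores),
                                 np.array_split(variances, n_cores),
                                 np.array_split(log_weights, n_cores))]
    with ProcessPoolExecutor(max_workers=n_cores) as pool:
        partial = list(pool.map(chunk_log_likelihoods, chunks))
    all_ll = np.concatenate(partial)
    m = all_ll.max()  # log-sum-exp over all components for the total likelihood
    return m + np.log(np.sum(np.exp(all_ll - m)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, D = 1024, 39  # e.g. 39-dimensional MFCC-style features
    means = rng.normal(size=(K, D))
    variances = rng.uniform(0.5, 2.0, size=(K, D))
    log_weights = np.full(K, -np.log(K))
    print(gmm_log_likelihood(rng.normal(size=D), means, variances, log_weights))
```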
|
726 |
A Neurophysiologically-Inspired Statistical Language Model. Dehdari, Jonathan 02 October 2014 (has links)
No description available.
|
727 |
The Development of Auditory “Spectral Attention Bands” in Children. Youngdahl, Carla L. 15 October 2015 (has links)
No description available.
|
728 |
Domain Adaptation with N-gram Language Models for Swedish Automatic Speech Recognition : Using text data augmentation to create domain-specific n-gram models for a Swedish open-source wav2vec 2.0 model / Domänanpassning Med N-gram Språkmodeller för Svensk Taligenkänning : Datautökning av text för att skapa domänspecifika n-gram språkmodeller för en öppen svensk wav2vec 2.0 modell. Enzell, Viktor January 2022 (has links)
Automatic Speech Recognition (ASR) enables a wide variety of practical applications. However, many applications have their own domain-specific words, creating a gap between training and test data when the models are used in practice. Domain adaptation can be achieved through model fine-tuning, but this requires domain-specific speech data paired with transcripts, which is labor-intensive to produce. Fortunately, the dependence on audio data can be mitigated to a certain extent by incorporating text-based language models during decoding. This thesis explores approaches for creating domain-specific 4-gram models for a Swedish open-source wav2vec 2.0 model. The three main approaches extend a social media corpus with domain-specific data from which the models are estimated. The first approach utilizes a relatively small set of in-domain text data, and the second utilizes machine transcripts from another ASR system. Finally, the third approach utilizes Named Entity Recognition (NER) to find words of a given entity type in a corpus and replace them with in-domain words. The 4-gram models are evaluated by the error rate (ERR) of recognizing in-domain words in a custom dataset. Additionally, the models are evaluated by the Word Error Rate (WER) on the Common Voice test set to ensure good overall performance. Compared to not having a language model, the base model (estimated from the social media corpus alone) improves the WER on Common Voice by 2.55 percentage points and the in-domain ERR by 6.11 percentage points. Adding in-domain text to the base model's corpus results in a 2.61 percentage point WER improvement and a 10.38 percentage point ERR improvement over not having a language model. Finally, adding in-domain machine transcripts and using the NER approach result in the same 10.38 percentage point ERR improvement as adding in-domain text, but slightly smaller WER improvements of 2.56 and 2.47 percentage points, respectively. These results contribute to the exploration of state-of-the-art Swedish ASR and have the potential to enable the adoption of open-source ASR models for more use cases.
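To illustrate the third approach, the sketch below uses a NER pipeline to locate entities of one type in a sentence and substitute in-domain words, producing augmented text from which an n-gram model could then be estimated. The spaCy pipeline name, entity label, and replacement list are assumptions for illustration, not the thesis's actual choices.

```python
# Hedged sketch of NER-based text augmentation: replace each entity of a
# chosen type with a randomly drawn in-domain word. Pipeline name, entity
# label, and replacement words are illustrative assumptions.
import random
import spacy

nlp = spacy.load("sv_core_news_sm")  # assumed Swedish spaCy pipeline with NER
in_domain_names = ["Billerud", "Sandvik", "Getinge"]  # assumed in-domain words

def augment(sentence, entity_label="ORG"):
    """Replace each entity of the given type with a random in-domain word."""
    doc = nlp(sentence)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ == entity_label:
            out.append(sentence[last:ent.start_char])
            out.append(random.choice(in_domain_names))
            last = ent.end_char
    out.append(sentence[last:])
    return "".join(out)
```

Running every corpus sentence through such a function yields n-gram contexts around the in-domain words without requiring any new audio, which is the point of the text-only adaptation strategy.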
|
729 |
M8 the Four-legged Robot / M8 den fyrbenta roboten. Anflo, Fredrik January 2020 (has links)
In recent times, robots have become more and more common. They are everywhere: walking, running, swimming, flying, and many of them have much in common with the creatures inhabiting this planet, largely in order to make them appeal more to us rather than come across as stone-cold machines. Continuing on the path evolution has laid out before us seems a wise decision, aspiring to efficiently utilize our knowledge of science and engineering with the vision of improving our future. With the intention of simulating a four-legged animal and evaluating the means of interacting with one's surroundings, a quadruped locomotion system together with two types of sound- and voice-interaction systems has been assessed. A demonstrator was built to test real-world problems and decide which kind of interaction is most beneficial. The results indicate that voice commands and speech recognition, rather than sounds from the environment, are more practical and robust as a way of interacting with one's surroundings.
|
730 |
VATS : Voice-Activated Targeting System / VATS : Röstaktiverat Identifieringssystem. Mello, Simon January 2020 (has links)
Machine learning implementations in computer vision and speech recognition are widespread and growing, with both low- and high-level applications in demand. This paper takes a look at the former and asks whether basic implementations are good enough for real-world applications. To demonstrate this, a simple artificial neural network coded in Python, together with existing Python libraries, is used to control a laser pointer via a servomotor and an Arduino, creating a voice-activated targeting system. The neural network trained on MNIST data consistently achieves an accuracy of 0.95 ± 0.01 when classifying MNIST test data, and also classifies captured images correctly if noise levels are low. The same holds for the speech recognition, which rarely gives wrong readings. The final prototype succeeds in all domains except turning the correctly classified images into targets that the Arduino can read and aim at, thus failing to merge the computer vision and speech recognition components.
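The missing integration step, turning a classified image position into a servo command, could look roughly like the sketch below, which maps a horizontal pixel coordinate to an angle and writes it to the Arduino over a serial port with pySerial. The port name, baud rate, and message format are assumptions; the Arduino-side parsing is not shown.

```python
# Hedged sketch of the host-to-Arduino link: map a target's pixel position to
# a 0-180 degree servo angle and send it over serial. Port, baud rate, and
# message format are illustrative assumptions.
import serial  # pySerial

def aim_at(target_x, image_width, port="/dev/ttyACM0", baud=9600):
    """Map a horizontal pixel coordinate to a servo angle and transmit it."""
    angle = int(round(180 * target_x / image_width))
    with serial.Serial(port, baud, timeout=1) as ser:
        ser.write(f"{angle}\n".encode("ascii"))  # Arduino parses the integer

# aim_at(target_x=320, image_width=640)  # frame center -> 90 degrees
# (call requires a connected board on the assumed port)
```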
|