Global ETD Search

1	En undersökning av AI-verktyget Whisper som potentiell ersättare till det manuella arbetssättet inom undertextframtagning / A Study of the AI-tool Whisper as a Potential Substitute to the Manual Process of Subtitling Kaka, Mailad Waled Kider, Oummadi, Yassin January 2023 (has links) Det manuella arbetssättet för undertextframtagning är en tidskrävande och kostsam process. Arbetet undersöker AI-verktyget Whisper och dess potential att ersätta processen som används idag. Processen innefattar både transkribering och översättning. För att verktyget ska kunna göra denna transkribering och översättning behöver den i första hand kunna omvandla tal till text. Detta kallas för taligenkänning och är baserat på upptränade språkmodeller. Precisionen för transkriberingen kan mätas med ordfelfrekvens (Word Error Rate – WER) och för översättningen med COMET-22. Resultaten visade sig klara av Microsofts krav för maximalt tillåten WER och anses därför vara tillräckligt bra för användning. Resultaten indikerade även att de maskinproducerade översättningarna uppnår tillfredställande kvalitet. Undertextframtagning, som är det andra steget i processen, visade sig Whisper ha svårare för när det gäller skapandet av undertexter. Detta gällde både för transkriberingen i originalspråk samt den engelsköversatta versionen. Kvaliteten på undertexternas formatering, som mäts med SubER-metoden, kan tolkas som för låga för att anses vara användbara. Resultaten låg i intervallet 59 till 96% vilket innebär hur stor del av den automatiskt tillverkade undertexten behöver korrigeras för att matcha referensen. Den övergripande slutsatsen man kan dra är att Whisper eventuellt kan ersätta den faktiska transkriberings -och översättningsprocessen, då den både är snabbare och kostar mindre resurser än det manuella tillvägagångssättet. Den är dock inte i skrivande stund tillräcklig för att ersätta undertextframtagningen. / The manual process of subtitling creation is a time consuming and costly process. This study examines the AI-tool Whisper and its potential of substituting the process used today. The process consists of both speech recognition and speech translation. For the tool to accomplish the transcription and translation, it first needs to be able to convert speech-to-text. This is called speech recognition and is based on trained speech models. The precision for the transcription can be measured using the Word Error Rate (WER), while the translation uses COMET-22 for measuring precision. The results met the requirements for maximal allowed WER-value and were therefore considered to be usable. The results also indicated that the machine produced translations reached satisfactory quality. Subtitle creation, which is the second part of the process, turned out to be more of a challenge for Whisper. This applied to both the transcription in the original language and the English translated version. The quality of the subtitling format, measured using the SubER-method, can be interpreted as too low to be considered useful. The results were in the interval of 59 to 96% which informs how large part of the automatically created subtitle need to be corrected to match the reference. The conclusion one can draw is that Whisper could eventually substitute the actual transcription and translation process, since it is both faster and costs less resources than the manual process. Though it is not good enough, in the moment of writing, to substitute the subtitling creation. Manual subtitling creation AI Whisper speech recognition speech translation Word Error Rate COMET-22 SubER Manuell undertextframtagning AI Whisper taligenkänning talöversättning Word Error Rate COMET-22 SubER
2	Examining Machine Learning as an alternative for scalable video analysis / En utvärdering av maskininlärning som alternativ för skalbar videoanalys Ragnar, Niclas, Tolic, Zoran January 2019 (has links) Video is a large part of today’s society where surveillance cameras represent the biggest source of big data, and real-time entertainment is the largest network traffic category. There is currently a large interest in analysing the contents of video where video analysis is mainly conducted by people. This increase in video has for instance made it difficult for professional editors to analyse movies and series in a scalable way, and alternative solutions are needed. The media technology company June, want to explore scalable alternatives for extracting metadata from video. With recent advances in Machine Learning and the rise of machine-learning-asa-service platforms, June wished more specifically to explore how these Machine Learning services can be utilised for extracting metadata from videos, and from it construct a summary regarding its contents. This work examined Machine Learning as an option for scalable video summarisation which resulted in developing and evaluating an application that utilised transcription, summarisation, and translation services to produce a text based summarisation of video. Furthermore to examine the services current state of affairs, multiple services from different providers were tested, evaluated and compared to each other. Lastly, in order to evaluate the summarisation services an evaluation model was developed. The test results showed that the translation services were the only service that produced good results. Transcription and summarisation performed poorly in the tests which renders the suggested solution of combining the three services for video summarisation as impractical. / Video är en stor del av dagens samhälle där bland annat övervakningskameror är den största källan av data och underhållning i realtid är den kategori som står för mest nätverkstrafik. Det finns i dagsläget ett stort intresse i att analysera innehållet av video, denna videoanalys utförs även främst av människor. Ökningen av video har gjort det svårt för exempelvis professionella redaktörer att hinna analysera filmer och serier och mer skalbara alternativ behövs. Mediaföretaget June vill utforska alternativ för att extrahera metadata från video på ett skalbart sätt. Med de senaste framstegen inom maskininlärning och framväxten av machine-learningas-a-service plattformar, önskar June mer specifikt att utforska hur maskininlärning kan nyttjas för att extrahera metadata från video och med det konstruera en sammanfattning av innehållet. Det utförda arbetet undersökte maskininlärning som skalbart alternativ för att kunna sammanfatta videos innehåll. Arbetet resulterade i utvecklandet samt utvärderingen av en applikation som nyttjade maskininlärningstjänster för transkribering, sammanfattning samt översättning för att producera en textbaserad sammanfattning av videos innehåll. För att utvärdera tjänsternas nuvarande tillstånd så testades samt utvärderades tjänster från olika leverantörer för att sedan jämföras mot varandra. Slutligen framtogs en egenutvecklad modell för att kunna utvärdera tjänsterna för sammanfattning. Testresultaten visade att tjänsterna för översättning var de enda tjänsterna som gav bra resultat. Tjänsterna för transkribering och sammanfattning gav dåliga resultat vilket gör den föreslagna lösningen av att kombinera de tre tjänsterna för att sammanfatta videoinnehåll som opraktisk. Machine Learning MLaaS Microsoft Google DeepAI Aylien video analysis transcription translation summarisation Word Error Rate BLEU Maskininlärning MLaaS Microsoft Google videoanalys transkribering översättning sammanfattning word error rate BLEU Computer Sciences Datavetenskap (datalogi)
3	Automatic Speech Recognition System for Somali in the interest of reducing Maternal Morbidity and Mortality. Laryea, Joycelyn, Jayasundara, Nipunika January 2020 (has links) Developing an Automatic Speech Recognition (ASR) system for the Somali language, though not novel, is not actively explored; hence there has been no success in a model for conversational speech. Neither are related works accessible as open-source. The unavailability of digital data is what labels Somali as a low resource language and poses the greatest impediment to the development of an ASR for Somali. The incentive to develop an ASR system for the Somali language is to contribute to reducing the Maternal Mortality Rate (MMR) in Somalia. Researchers acquire interview audio data regarding maternal health and behaviour in the Somali language; to be able to engage the relevant stakeholders to bring about the needed change, these audios must be transcribed into text, which is an important step towards translation into any language. This work investigates available ASR for Somali and attempts to develop a prototype ASR system to convert Somali audios into Somali text. To achieve this target, we first identified the available open-source systems for speech recognition and selected the DeepSpeech engine for the implementation of the prototype. With three hours of audio data, the accuracy of transcription is not as required and cannot be deployed for use. This we attribute to insufficient training data and estimate that the effort towards an ASR for Somali will be more significant by acquiring about 1200 hours of audio to train the DeepSpeech engine Automatic Speech Recognition (ASR) DeepSpeech Natural Language Processing (NLP) Word Error Rate (WER) Character Error Rate (CER) Social Sciences Samhällsvetenskap
4	A Comparative Analysis of Whisper and VoxRex on Swedish Speech Data Fredriksson, Max, Ramsay Veljanovska, Elise January 2024 (has links) With the constant development of more advanced speech recognition models, the need to determine which models are better in specific areas and for specific purposes becomes increasingly crucial. Even more so for low-resource languages such as Swedish, dependent on the progress of models for the large international languages. Lagerlöf (2022) conducted a comparative analysis between Google’s speech-to-text model and NLoS’s VoxRex B, concluding that VoxRex was the best for Swedish audio. Since then, OpenAI released their Automatic Speech Recognition model Whisper, prompting a reassessment of the preferred choice for transcribing Swedish. In this comparative analysis using data from Swedish radio news segments, Whisper performs better than VoxRex in tests on the raw output, highly affected by more proficient sentence constructions. It is not possible to conclude which model is better regarding pure word prediction. However, the results favor VoxRex, displaying a lower variability, meaning that even though Whisper can predict full text better, the decision of what model to use should be determined by the user’s needs. ASR Automatic Speech Recognition Swedish Speech Recognition Speech Recognition Models Speech-to-Text Whisper VoxRex Wav2Vec Model Comparison Transformer Models Neural Networks Machine Learning WER Word Error Rate Transcription Probability Theory and Statistics Sannolikhetsteori och statistik
5	Röststyrning i industriella miljöer : En undersökning av ordfelsfrekvens för olika kombinationer mellan modellarkitekturer, kommandon och brusreduceringstekniker / Voice command in industrial environments : An investigation of Word Error Rate for different combinations of model architectures, commands and noise reduction techniques Eriksson, Ulrika, Hultström, Vilma January 2024 (has links) Röststyrning som användargränssnitt kan erbjuda flera fördelar jämfört med mer traditionella styrmetoder. Det saknas dock färdiga lösningar för specifika industriella miljöer, vilka ställer särskilda krav på att korta kommandon tolkas korrekt i olika grad av buller och med begränsad eller ingen internetuppkoppling. Detta arbete ämnade undersöka potentialen för röststyrning i industriella miljöer. Ett koncepttest genomfördes där ordfelsfrekvens (på engelska Word Error Rate eller kortare WER) användes för att utvärdera träffsäkerheten för olika kombinationer av taligenkänningsarkitekturer, brusreduceringstekniker samt kommandolängder i verkliga bullriga miljöer. Undersökningen tog dessutom hänsyn till Lombard-effekten. Resultaten visar att det för samtliga testade miljöer finns god potential för röststyrning med avseende på träffsäkerheten. Framför allt visade DeepSpeech, en djupinlärd taligenkänningsmodell med rekurrent lagerstruktur, kompletterad med domänspecifika språkmodeller och en riktad kardioid-mikrofon en ordfelsfrekvens på noll procent i vissa scenarier och sällan över fem procent. Resultaten visar även att utformningen av kommandon påverkar ordfelsfrekvensen. För en verklig implementation i industriell miljö behövs ytterligare studier om säkerhetslösningar, inkluderande autentisering och hantering av risker med falskt positivt tolkade kommandon. / Voice command as a user interface can offer several advantages over more traditional control methods. However, there is a lack of ready-made solutions for specific industrial environments, which place particular demands on short commands being interpreted correctly in varying degrees of noise and with limited or no internet connection. This work aimed to investigate the potential for voice command in industrial environments. A proof of concept was conducted where Word Error Rate (WER) was used to evaluate the accuracy of various combinations of speech recognition architectures, noise reduction techniques, and command lengths in authentic noisy environments. The investigation also took into account the Lombard effect. The results indicate that for all tested environments there is good potential for voice command with regard to accuracy. In particular, DeepSpeech, a deep-learned speech recognition model with recurrent layer structure, complemented with domain-specific language models and a directional cardioid microphone, showed WER values of zero percent in certain scenarios and rarely above five percent. The results also demonstrate that the design of commands influences WER. For a real implementation in an industrial environment, further studies are needed on security solutions, including authentication and management of risks with false positive interpreted commands. voice command automatic speech recognition speech-to-text industry noise microphone hidden Markov models neural networks transformers word error rate röststyrning taligenkänning tal-till-text industri buller mikrofon dolda Markov-modeller neurala nätverk transformers ordfelsfrekvens Computer Sciences Datavetenskap (datalogi)
6	Mining of Textual Data from the Web for Speech Recognition / Mining of Textual Data from the Web for Speech Recognition Kubalík, Jakub January 2010 (has links) Prvotním cílem tohoto projektu bylo prostudovat problematiku jazykového modelování pro rozpoznávání řeči a techniky pro získávání textových dat z Webu. Text představuje základní techniky rozpoznávání řeči a detailněji popisuje jazykové modely založené na statistických metodách. Zvláště se práce zabývá kriterii pro vyhodnocení kvality jazykových modelů a systémů pro rozpoznávání řeči. Text dále popisuje modely a techniky dolování dat, zvláště vyhledávání informací. Dále jsou představeny problémy spojené se získávání dat z webu, a v kontrastu s tím je představen vyhledávač Google. Součástí projektu byl návrh a implementace systému pro získávání textu z webu, jehož detailnímu popisu je věnována náležitá pozornost. Nicméně, hlavním cílem práce bylo ověřit, zda data získaná z Webu mohou mít nějaký přínos pro rozpoznávání řeči. Popsané techniky se tak snaží najít optimální způsob, jak data získaná z Webu použít pro zlepšení ukázkových jazykových modelů, ale i modelů nasazených v reálných rozpoznávacích systémech.

1

Page generated in 0.0432 seconds