481 |
Variation phonologique régionale en interaction conversationnelle / Mental representations of regional phonological variation in conversational interaction. Aubanel, Vincent. 21 January 2011.
It is in social interaction, the primary site of spoken language (Local, 2003), that speech is learned, produced daily, and evolves. New interdisciplinary approaches to the study of speech, notably sociophonetics and recent developments in conversational interaction research, open new avenues for modeling speech processing. A central question in this enterprise is the characterization of the mental representations associated with speech sounds. We address this question using the exemplar-based approach to speech processing, which proposes that speech sounds are stored in memory along with detailed contextual information. We present a new interactional task, GMUP (for "Group ’em up"), designed to collect realizations of tightly controlled phonological material produced by two interactants in an ecologically valid experimental setting. The phonological variables capture differences between two varieties of spoken French: standard (Northern) French and Southern French. Automatic speech recognition tools were developed to evaluate phonetic convergence, an observable of the evolution of mental representations, at two levels of granularity: the categorical level of the phonological variable and a finer-grained, subphonemic level. Large-scale, detailed acoustic measurements allow us to finely characterize inter-individual differences in how the acoustic realizations associated with mental representations evolve in conversational interaction.
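A common way to operationalize the phonetic convergence this abstract measures is as a drop in acoustic distance between the two interactants' realizations from the early to the late part of the interaction. The sketch below illustrates that idea only; the thesis's actual measures (categorical and subphonemic) are more detailed, and the feature choice here is an assumption.

```python
# Minimal sketch: convergence as early-vs-late acoustic distance between
# two speakers' realizations of the same phonological variable.
# Features (e.g. F1/F2 formant values) are illustrative assumptions.
import numpy as np

def convergence(tokens_a: np.ndarray, tokens_b: np.ndarray) -> float:
    """tokens_a, tokens_b: (n_tokens, n_features) arrays, time-ordered.
    Returns early-half distance minus late-half distance between the
    speakers' mean feature vectors; a positive value means convergence."""
    ha, hb = len(tokens_a) // 2, len(tokens_b) // 2
    early = np.linalg.norm(tokens_a[:ha].mean(0) - tokens_b[:hb].mean(0))
    late = np.linalg.norm(tokens_a[ha:].mean(0) - tokens_b[hb:].mean(0))
    return float(early - late)
```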
|
482 |
Rozpoznávací sítě založené na konečných stavových převodnících pro dopředné a zpětné dekódování v rozpoznávání řeči / Finite-state based recognition networks for forward-backward speech decoding. Hannemann, Mirko. Unknown Date.
A number of tasks, including automatic speech recognition (ASR), can be formulated within the mathematical formalism of weighted finite-state transducers (WFSTs). Today's ASR systems make extensive use of composed probabilistic models called decoding graphs or recognition networks, which are constructed from individual components using WFST operations such as composition. Each component is a knowledge source that constrains the search for the best path through the composed graph, an operation called decoding. Using a coherent theoretical framework guarantees that the resulting structure is optimal according to a defined criterion. Within a given semiring, WFSTs can be optimized by determinization and minimization; applying these algorithms yields a structure optimal for search, and an optimal weight distribution is then obtained with the weight-pushing algorithm. The goal of this thesis is to improve the procedures and algorithms for constructing optimal recognition networks. We introduce an alternative weight-pushing algorithm suitable for an important class of models, language model transducers, and more generally for all cyclic WFSTs and WFSTs with back-off transitions. We also present a way to construct a recognition network suitable for decoding backwards in time that provably produces the same probabilities as the forward network; for this purpose, we developed an algorithm for the exact reversal of back-off language models and of the transducers that represent them. We use backward recognition networks to optimize decoding: in a static decoder, they enable two-pass decoding (forward and backward search). This approach, called tracked decoding, allows the search results of the first pass to be incorporated into the second pass by tracking hypotheses contained in the first-pass lattice. The result is a substantial decoding speed-up, because the technique allows searching with a variable beam width that is mostly much narrower than in the baseline approach. We also show that this technique can be used in a dynamic decoder by progressively refining the recognition, which additionally leads to a partial parallelization of the decoding.
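For illustration, the standard construction of such a decoding graph by WFST composition and optimization can be sketched with OpenFst's pywrapfst bindings. This is a bare-bones sketch, not the thesis's code: the component file names are hypothetical, and real recipes (e.g., Kaldi's) additionally insert disambiguation symbols before determinization and self-loops afterwards.

```python
# Minimal sketch of decoding-graph construction HCLG = H o C o L o G,
# assuming OpenFst's pywrapfst bindings and pre-built component FSTs.
# File names are hypothetical; disambiguation-symbol handling is omitted.
import pywrapfst as fst

H = fst.Fst.read("H.fst")  # HMM topology
C = fst.Fst.read("C.fst")  # context dependency
L = fst.Fst.read("L.fst")  # pronunciation lexicon
G = fst.Fst.read("G.fst")  # language model acceptor

# Compose knowledge sources right to left; arc-sort for efficient composition.
LG = fst.determinize(fst.compose(L.arcsort(sort_type="olabel"), G))
CLG = fst.compose(C.arcsort(sort_type="olabel"), LG)
HCLG = fst.compose(H.arcsort(sort_type="olabel"), CLG)
HCLG.minimize()
# Weight pushing moves weight toward the initial state so pruned search
# sees discriminative scores early (the step the thesis's alternative
# algorithm generalizes to cyclic WFSTs and back-off transitions).
HCLG = fst.push(HCLG, push_weights=True)
HCLG.write("HCLG.fst")
```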
|
483 |
Automatické hodnocení anglické výslovnosti nerodilých mluvčích / Automatic Pronunciation Evaluation of Non-Native English Speakers. Gazdík, Peter. January 2019.
Computer-Assisted Pronunciation Training (CAPT) is becoming more and more popular these days. However, the accuracy of existing CAPT systems is still quite low. This diploma thesis therefore focuses on improving existing methods for automatic pronunciation evaluation at the segmental level. The first part describes common techniques for this task. We then propose a system based on two approaches. Finally, the experiments performed show a significant improvement over the reference system.
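The abstract does not name its two approaches, but a common baseline for segmental pronunciation scoring, and a useful reference point for systems like this one, is the Goodness of Pronunciation (GOP) score. The sketch below is illustrative only and is not taken from the thesis.

```python
# Minimal sketch of a Goodness of Pronunciation (GOP) score computed from
# frame-level phone posteriors of a force-aligned segment. Illustrative only.
import numpy as np

def gop(posteriors: np.ndarray, canonical_phone: int) -> float:
    """posteriors: (n_frames, n_phones) posterior probabilities for one
    aligned segment; canonical_phone: index of the expected phone.
    Returns the mean log-ratio of the canonical phone's posterior to the
    best-scoring phone: 0 means the canonical phone wins every frame;
    more negative values suggest mispronunciation."""
    log_post = np.log(posteriors + 1e-10)   # avoid log(0)
    canonical = log_post[:, canonical_phone]
    best = log_post.max(axis=1)
    return float((canonical - best).mean())
```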
|
484 |
Integrace hlasových technologií na mobilní platformy / Integration of Voice Technologies on Mobile Platforms. Černičko, Sergij. January 2013.
The goal of this thesis is to become familiar with the methods and techniques used in speech processing, to describe the current state of research and development in speech technology, to design and implement a server-based speech recognizer that uses BSAPI, and to integrate a client that uses the server for speech recognition into the mobile dictionaries of the Lingea company.
|
485 |
Модел бежичних акустичких сензора за командовање гласом у паметним кућама / Model of wireless acoustic sensors for voice commands in smart homes. Četić, Nenad. 05 June 2020.
The main goal of the dissertation is to examine the application of distributed acoustic sensors to the problem of automatic speech recognition. Due to reverberation indoors, the recorded speech signal contains more echoes and noise than direct sound. This problem is addressed with various algorithms based on processing the signals of microphone arrays and grids. Based on an analytical model of a reverberant room, a system of distributed acoustic sensors was built to support voice communication in a smart-home environment. The results show that simple distributed sensors can achieve speech recognition accuracy comparable to that of commercial, state-of-the-art systems.
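One of the classic microphone-array techniques alluded to above is delay-and-sum beamforming, which time-aligns the channels toward the source so the direct sound adds coherently while reverberation and noise do not. The sketch below is a generic illustration under simplifying assumptions (synchronized channels, known non-negative integer sample delays), not the dissertation's algorithm.

```python
# Minimal delay-and-sum beamformer sketch. Assumes synchronized mono
# channels and known, non-negative integer steering delays per microphone.
import numpy as np

def delay_and_sum(signals: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """signals: (n_mics, n_samples); delays: steering delay in samples for
    each mic. Shifting each channel by its delay aligns the direct-path
    sound before averaging, boosting it relative to echoes and noise."""
    n_mics, n = signals.shape
    out = np.zeros(n)
    for m in range(n_mics):
        d = int(delays[m])
        out[d:n] += signals[m, : n - d]
    return out / n_mics
```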
|
486 |
Automatic Annotation of Speech: Exploring Boundaries within Forced Alignment for Swedish and Norwegian / Automatisk Anteckning av Tal: Utforskning av Gränser inom Forced Alignment för Svenska och Norska. Biczysko, Klaudia. January 2022.
In Automatic Speech Recognition, there is an extensive need for time-aligned data. Manual speech segmentation has been shown to be more laborious than manual transcription, especially when dealing with tens of hours of speech. Forced alignment is a technique for matching a signal with its orthographic transcription with respect to the duration of linguistic units. Most forced aligners, however, are language-dependent and trained on English data, whereas under-resourced languages lack both the resources needed to develop the acoustic model an aligner requires and manually aligned data. An alternative to training new models is cross-language forced alignment, in which an aligner trained on one language is used to align data in another. This thesis aimed to evaluate state-of-the-art forced alignment algorithms available for Swedish and to test whether a Swedish model could be applied to aligning Norwegian. Three approaches were employed: (1) a forced aligner based on Dynamic Time Warping and text-to-speech synthesis, Aeneas; (2) two forced aligners based on Hidden Markov Models, namely the Munich AUtomatic Segmentation System (WebMAUS) and the Montreal Forced Aligner (MFA); and (3) a Connectionist Temporal Classification (CTC) segmentation algorithm with two pre-trained and fine-tuned Swedish Wav2Vec2 models. First, small speech test sets for Norwegian and Swedish, covering different degrees of spontaneity in the speech, were created and manually aligned to produce gold-standard alignments. Second, the aligners' performance on the Swedish dataset was evaluated against the gold standard. Finally, it was tested whether the Swedish forced aligners could be applied to aligning Norwegian data. Performance was assessed by measuring the difference between the boundaries set in the gold standard and those of the comparison alignment, and accuracy was estimated as the proportion of alignments below particular thresholds proposed in the literature. The CTC segmentation algorithm with Wav2Vec2 (VoxRex) was found to outperform the other forced alignment systems. The differences between the alignments of the two Wav2Vec2 models suggest that the training data may have a larger influence on the alignments than the architecture of the algorithm. At lower thresholds, the traditional HMM approach outperformed the deep learning models. Finally, the thesis demonstrates promising results for cross-language forced alignment, using Swedish models to align related languages such as Norwegian.
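The accuracy metric described above, the proportion of predicted boundaries falling within a tolerance of the gold standard, can be sketched as follows; the tolerance values shown are ones commonly used in the forced-alignment literature, not necessarily the exact thresholds used in the thesis.

```python
# Minimal sketch of boundary-accuracy evaluation for forced alignment:
# the share of boundaries within a tolerance of the gold standard.
# Tolerances are common literature values, assumed here for illustration.
import numpy as np

def boundary_accuracy(gold_ms, pred_ms, tolerances=(10, 25, 50, 100)):
    """gold_ms, pred_ms: parallel sequences of boundary times (ms) for the
    same linguistic units. Returns {tolerance: proportion within it}."""
    diffs = np.abs(np.asarray(gold_ms, float) - np.asarray(pred_ms, float))
    return {t: float((diffs <= t).mean()) for t in tolerances}

# Example: three boundaries, two of which land within 10 ms of the gold.
print(boundary_accuracy([120.0, 480.0, 900.0], [118.0, 495.0, 905.0]))
```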
|
487 |
Tal till text för relevant metadatataggning av ljudarkiv hos Sveriges Radio / Speech to text for relevant metadata tagging of audio archive at Sveriges Radio. Jansson, Annika. January 2015.
In 2009-2013, Sveriges Radio digitized its program archive, and its ambition is that more material from the 175,000 hours of radio broadcast every year should be archived. Making all this material searchable is a relatively time-consuming process, and it is far from certain that the quality of the metadata is equally high for all items. The question addressed in this thesis is: what technical solutions exist for developing a system for Sveriges Radio that automatically converts Swedish speech to text, based on its audio archive? Speech-to-text systems were analyzed and examined to give Sveriges Radio an up-to-date overview of the field. Interviews were conducted with similar organizations working in the area to see how far they have come with the same problem, and a literature study of recent research reports in speech recognition was carried out to determine which system would best match Sveriges Radio's needs and requirements. To build an ASR (Automatic Speech Recognition) system, Sveriges Radio should first concentrate on transcribing its audio material. There are three options: transcribe in-house, selecting a range of programs with different orientations to cover as much content breadth as possible, preferably with different speakers so that speaker recognition can later be developed as well (the easiest way is to let the various professionals who enter segments and programs into the system do it); start a project similar to the BBC's and enlist the public's help; or buy transcription as a service. My advice is to continue evaluating the Kaldi system, since it has developed significantly recently and appears relatively easy to extend; the open-source software that Lingsoft uses is also worth studying further.
|
488 |
Automatisk taligenkänning som metod för att undersöka artikulationshastighet i svenska / Automatic speech recognition as a method to investigate articulation rate in Swedish. Martin Björkdahl, Liv. January 2022.
Recent progress in automatic speech recognition has led to models that are less resource-demanding and more effective, which opens new possibilities for research on spontaneous speech. In this study, the National Library of Sweden's (KB) Swedish version of Wav2Vec 2.0 is used to build a speech corpus from Sveriges Radio audio clips and to investigate articulation rate in spontaneous speech, and to assess whether this is a sound method. Previous studies have found articulation rate to be negatively correlated with information density. Starting from the Uniform Information Density (UID) hypothesis, which assumes that speakers strive to distribute information evenly across an utterance, the study examines whether the summed dependency lengths between all heads and dependents in a sentence are correlated with articulation rate. The results show that computing articulation rate with KB's Wav2Vec 2.0 yields systematically higher rates than manual computation, and that the correlation between the number of syllables in a word and articulation rate is the reverse of what earlier manual studies have found. The hypothesis that longer dependency lengths would be related to higher articulation rates receives no support; instead, the opposite effect appears, with articulation rate decreasing as dependency length increases. The study highlights the need for a model specialized in estimating durations in order to explore articulation rate further through automatic speech recognition.

Keywords: ASR, automatic speech recognition, UID, articulation rate, dependency length, dependency minimization, corpus studies, information density
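For illustration, articulation rate is typically computed as syllables per second of phonation, with pauses excluded, over time-stamped words such as those an ASR alignment yields. The sketch below makes simplifying assumptions and does not reproduce the thesis's pipeline: syllables are approximated as vowel-letter clusters, and pauses as the gaps between word timestamps.

```python
# Minimal sketch of articulation rate (syllables per second of phonation,
# pauses excluded) from word-level timestamps. The syllable counter
# (one syllable per vowel-letter cluster) is a rough assumption.
import re

VOWELS = "aeiouyåäö"  # Swedish vowel letters

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(f"[{VOWELS}]+", word.lower())))

def articulation_rate(words) -> float:
    """words: list of (token, start_s, end_s). Pauses are the gaps between
    words and are excluded by summing only the word durations."""
    syllables = sum(count_syllables(w) for w, _, _ in words)
    phonation = sum(end - start for _, start, end in words)
    return syllables / phonation if phonation > 0 else 0.0

# Example: 3 syllables over 0.8 s of phonation -> 3.75 syllables/s.
print(articulation_rate([("hej", 0.00, 0.25), ("världen", 0.30, 0.85)]))
```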
|
489 |
Mispronunciation Detection with SpeechBlender Data Augmentation Pipeline / Uttalsfelsdetektering med SpeechBlender data-förstärkning. Elkheir, Yassine. January 2023.
The rise of multilingualism has fueled the demand for computer-assisted pronunciation training (CAPT) systems for language learning. CAPT systems make use of advances in speech technology and offer features such as learner assessment and curriculum management. Mispronunciation detection (MD) is a crucial aspect of CAPT, aimed at identifying and correcting mispronunciations in second-language learners' speech. One of the significant challenges in developing MD models is the limited availability of labeled second-language speech data. To overcome this, the thesis introduces SpeechBlender, a fine-grained data augmentation pipeline designed to generate mispronunciations. SpeechBlender targets different regions of a phonetic unit and blends raw speech signals through linear interpolation, resulting in erroneous pronunciation instances. This method generates samples more effectively than traditional cut/paste methods. The thesis also explores the use of pre-trained automatic speech recognition (ASR) systems for mispronunciation detection and examines various phone-level features that can be extracted from pre-trained ASR models and used for MD tasks. A deep neural (LSTM) model is proposed that enhances the representations of the extracted acoustic features combined with positional phoneme embeddings. The efficacy of the augmentation technique is demonstrated through a phone-level pronunciation quality assessment task using only non-native speech data with good pronunciation. Our proposed technique achieves state-of-the-art results on the Speechocean762 dataset [54] for ASR-dependent MD models at the phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) over the previous state of the art [17], and we demonstrate a 5.0% improvement at the phoneme level over our baseline. In this thesis, we also developed AraVoiceL2, the first Arabic pronunciation learning corpus, to demonstrate the generality of our proposed model and augmentation technique, and used it to evaluate the effectiveness of our approach in improving mispronunciation detection for non-native learners of Arabic. Our experiments showed promising results, with a 4.6% increase in F1-score on the AraVoiceL2 test set, demonstrating the effectiveness of our model and augmentation technique in improving pronunciation learning for non-native speakers of Arabic.
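The core augmentation idea described above, blending the raw signals of two realizations of a phonetic unit by linear interpolation, can be sketched as below. This is a simplified illustration, not the SpeechBlender code: region selection within the phone, the choice of the mixing weight, and the length-matching strategy are all stand-in assumptions.

```python
# Minimal sketch of interpolation-based blending of two phone realizations
# to synthesize a mispronunciation-like sample. Length matching by simple
# resampling and a fixed mixing weight are illustrative assumptions.
import numpy as np

def blend(seg_a: np.ndarray, seg_b: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """seg_a, seg_b: mono waveforms of the same phone (same sample rate),
    e.g. a good realization and a confusable one. Returns their pointwise
    linear interpolation (1 - lam) * seg_a + lam * seg_b after stretching
    both to a common length."""
    n = max(len(seg_a), len(seg_b))
    def fit(x: np.ndarray) -> np.ndarray:
        # Stretch/compress by linear interpolation onto n sample points.
        return np.interp(np.linspace(0, len(x) - 1, n), np.arange(len(x)), x)
    return (1.0 - lam) * fit(seg_a) + lam * fit(seg_b)
```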
|
490 |
Computational auditory scene analysis and robust automatic speech recognition. Narayanan, Arun. 14 November 2014.
No description available.
|