61 |
Teknik för dokumentering av möten och konferenser / Technology for documenting meetings and conferences
Stojanovic, Milan January 2019
Documentation of meetings and conferences is performed at most companies by one or more people sitting at a computer and typing what is said during the meeting. This may lead to typing mistakes or incorrect perception by the person who records. The human factor is quite large. This work focuses on developing proposals for new technologies that reduce or eliminate the human factor, thus improving the documentation of meetings and conferences. This is a problem for many companies and institutions, including Seavus Stockholm, where this study is conducted. It is assumed that most companies do not document their meetings and conferences in video or audio format, so this study covers only text-based documentation. The aim of this study was to investigate how to implement new features and build a modern conference system, using modern technologies and new applications to improve the documentation of meetings and conferences. Speech to text in combination with speaker recognition is something that has not yet been implemented for such a purpose, and it can facilitate documenting meetings and conferences. To complete the study, several methods were combined to achieve the desired goals. First, the project's scope and objectives were defined. Then, based on analysis of observations of the company's documentation process, a design proposal was created. Following this, interviews with the stakeholders were conducted, where the proposals were presented and a requirement specification was created. Then the theory was studied to understand how the different techniques work, in order to design and create a proposal for the architecture. The result of this study contains a proposal for an architecture that shows that it is possible to implement these techniques to improve the documentation process. 
Furthermore, possible use cases and interaction diagrams are presented that show how the system may work. Although the proof of concept is considered satisfactory, additional work and testing are needed to fully implement and integrate the concept in practice. / Dokumentering av möten och konferenser utförs på de flesta företag av en eller flera personer som sitter vid en dator och antecknar det som har sagts under mötet. Det kan medföra att det som skrivs ner inte stämmer med det som har sagts eller att det uppfattades felaktigt av personen som antecknar. Den mänskliga faktorn är ganska stor. Detta arbete kommer att fokusera på att ta fram förslag på nya tekniker som minskar eller eliminerar den mänskliga faktorn, och därmed förbättrar dokumenteringen av möten och konferenser. Det föreställer ett problem för många företag och institutioner, däribland för Seavus Stockholm, där denna studie utförs. Det antas att de flesta företag inte dokumenterar deras möten och konferenser i video eller ljudformat, och därmed kommer denna studie bara att handla om dokumentering i textformat. Målet med denna studie var att undersöka hur man, med hjälp av moderna tekniker och nya tillämpningar, kan implementera nya funktioner och bygga ett modernt konferenssystem, för att förbättra dokumenteringen av möten och konferenser. Tal till text i kombination med talarigenkänning är något som ännu inte har implementerats för ett sådant ändamål, och det kan underlätta dokumenteringen av möten och konferenser. För att slutföra studien kombinerades flera metoder för att uppnå de önskade målen. Först definierades projektens omfattning och mål. Därefter, baserat på analys och observationer av företagets dokumenteringsprocess, skapades ett designförslag. Därefter genomfördes intervjuer med intressenterna där förslagen presenterades och en kravspecifikation skapades. 
Då studerades teorin för att skapa förståelse för hur olika tekniker arbetar, för att sedan designa och skapa ett förslag till arkitekturen. Resultatet av denna studie innehåller ett förslag till arkitektur, som visar att det är möjligt att implementera dessa tekniker för att förbättra dokumentationsprocessen. Dessutom presenteras möjliga användningsfall och interaktionsdiagram som visar hur systemet kan fungera. Även om beviset av konceptet anses vara tillfredsställande, ytterligare arbete och test behövs för att fullt ut implementera och integrera konceptet i verkligheten.
|
62 |
Analysis of speaking time and content of the various debates of the presidential campaign : Automated AI analysis of speech time and content of presidential debates based on the audio using speaker detection and topic detection / Analys av talartid och innehåll i de olika debatterna under presidentvalskampanjen. : Automatiserad AI-analys av taltid och innehåll i presidentdebatter baserat på ljudet med hjälp av talardetektering och ämnesdetektering.
Valentin Maza, Axel January 2023
The field of artificial intelligence (AI) has grown rapidly in recent years and its applications are becoming more widespread in various fields, including politics. In particular, presidential debates have become a crucial aspect of election campaigns, and it is important to analyze the information exchanged in these debates objectively so that voters can choose without being influenced by biased data. The objective of this project was to create an automatic analysis tool for presidential debates using AI. The main challenge of the final system was to determine the speaking time of each candidate and to analyze what each candidate said, in order to detect the topics discussed and to calculate the time spent on each topic. This thesis focuses mainly on the speaker detection part of this system. In addition, the high overlap rate in the debates, where candidates cut each other off, posed a significant challenge for speaker diarization, which aims to determine who speaks when. This problem was considered appropriate for a Master’s thesis project, as it involves a combination of advanced techniques in AI and speech processing, making it an important and difficult task. The application to political debates and the accompanying overlapping speech makes this task both challenging and innovative. There are several ways to solve the problem of speaker detection. We implemented classical approaches that involve segmentation techniques, speaker representation using embeddings such as i-vectors or x-vectors, and clustering. However, because of the speech overlaps, an end-to-end solution was implemented using pyannote-audio (an open-source toolkit written in Python for speaker diarization), and the diarization error rate was significantly reduced after fine-tuning the model on our own labeled data. The results of this project showed that it was possible to create an automated presidential debate analysis tool using AI. 
Specifically, this thesis has established a state of the art of speaker detection that takes into account the particularities of political debates, such as the high speaker overlap rate. / AI-området (artificiell intelligens) har vuxit snabbt de senaste åren och dess tillämpningar blir alltmer utbredda inom olika områden, inklusive politik. Särskilt presidentdebatter har blivit en viktig aspekt av valkampanjerna och det är viktigt att analysera den information som utbyts i dessa debatter på ett objektivt sätt så att väljarna kan välja utan att påverkas av partiska uppgifter. Målet med detta projekt var att skapa ett automatiskt analysverktyg för presidentdebatter med hjälp av AI. Den största utmaningen för det slutliga systemet var att bestämma taltid för varje kandidat och att analysera vad varje kandidat sa, att upptäcka diskuterade ämnen och att beräkna den tid som spenderades på varje ämne. Denna avhandling fokuserar huvudsakligen på detektering av talare i detta system. Dessutom innebar den höga överlappningsgraden i debatterna, där kandidaterna avbröt varandra, en stor utmaning för talardiarisering, som syftar till att fastställa vem som talar när. Detta problem ansågs lämpligt för ett examensarbete, eftersom det omfattar en kombination av avancerade tekniker inom AI och talbehandling, vilket gör det till en viktig och svår uppgift. Tillämpningen på politiska debatter och den åtföljande överlappande vägar gör denna uppgift både utmanande och innovativ. Det finns flera sätt att lösa problemet med att upptäcka talare. Vi har genomfört klassiska metoder som innefattar segmenteringstekniker, representation av talare med hjälp av inbäddningar som i-vektorer eller x-vektorer och klustering. På grund av talöverlappningar implementerades dock end-to-end-lösningen med pyannote-audio (en verktygslåda med öppen källkod skriven i Python för diarisering av talare) och diariseringsfelprocenten reducerades avsevärt efter att modellen förfinats med hjälp av våra egna märkta data. 
Resultaten av detta projekt visade att det var möjligt att skapa ett automatiserat verktyg för analys av presidentdebatten med hjälp av AI. Specifikt har denna avhandling etablerat en state of the art av talardetektion med hänsyn till politikens särdrag såsom den höga överlappningsfrekvensen av talare.
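The speaking-time analysis described in this abstract can be illustrated with a minimal sketch. It assumes that a diarization step has already produced a list of `(start, end, speaker)` segments in seconds; the function names and the segment format are hypothetical and are not taken from the thesis or from pyannote-audio.

```python
from collections import defaultdict


def speaking_time(segments):
    """Sum the duration of each speaker's segments.

    Overlapping speech is counted once for every speaker active in it.
    """
    totals = defaultdict(float)
    for start, end, speaker in segments:
        if end < start:
            raise ValueError("segment end must not precede its start")
        totals[speaker] += end - start
    return dict(totals)


def overlap_time(segments):
    """Total time during which at least two segments are active.

    Assumes that segments of the same speaker do not overlap each other.
    """
    events = []
    for start, end, _ in segments:
        events.append((start, 1))   # a segment opens
        events.append((end, -1))    # a segment closes
    # At equal timestamps, process closings before openings.
    events.sort(key=lambda e: (e[0], e[1]))

    active = 0
    overlap = 0.0
    previous = None
    for time, delta in events:
        if previous is not None and active >= 2:
            overlap += time - previous
        active += delta
        previous = time
    return overlap


if __name__ == "__main__":
    # Hypothetical diarization output for a two-candidate debate.
    demo = [(0.0, 10.0, "A"), (8.0, 15.0, "B"), (15.0, 20.0, "A")]
    print(speaking_time(demo))
    print(overlap_time(demo))
```

Counting overlapped speech for every active speaker is one possible convention; a real analysis might instead report overlap separately per pair of candidates.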
|
63 |
Sequential organization in computational auditory scene analysis
Shao, Yang 21 September 2007
No description available.
|
64 |
Биометријско обележје за препознавање говорника: дводимензионална информациона ентропија говорног сигнала / Biometrijsko obeležje za prepoznavanje govornika: dvodimenzionalna informaciona entropija govornog signala / A novel solution for indoor human presence and motion detection in wireless sensor networks based on the analysis of radio signals propagation
Božilović Boško 26 September 2016
<p>Mотив за истраживање је унапређење процеса аутоматског препознавања говорника без обзира на садржај изговоренoг текста.<br />Циљ ове докторске дисертације је дефинисање новог биометријског обележја за препознавање говорника независно од изговореног текста − дводимензионалне информационе ентропије говорног сигнала.<br />Дефинисање новог обележја се врши искључиво у временском домену, па је рачунарска сложеност алгоритма за његово издвајање знатно мања у односу на обележја која се издвајају у фреквенцијском домену. Оцена перформанси дводимензионалне информационе ентропије је урађена над репрезентативним скупом случајно одабраних говорника. Показано је да предложено обележје има малу варијабилност унутар говорног сигнала једног говорника, а велику варијабилност између говорних сигнала различитих говорника.</p> / <p>Motiv za istraživanje je unapređenje procesa automatskog prepoznavanja govornika bez obzira na sadržaj izgovorenog teksta.<br />Cilj ove doktorske disertacije je definisanje novog biometrijskog obeležja za prepoznavanje govornika nezavisno od izgovorenog teksta − dvodimenzionalne informacione entropije govornog signala.<br />Definisanje novog obeležja se vrši isključivo u vremenskom domenu, pa je računarska složenost algoritma za njegovo izdvajanje znatno manja u odnosu na obeležja koja se izdvajaju u frekvencijskom domenu. Ocena performansi dvodimenzionalne informacione entropije je urađena nad reprezentativnim skupom slučajno odabranih govornika. 
Pokazano je da predloženo obeležje ima malu varijabilnost unutar govornog signala jednog govornika, a veliku varijabilnost između govornih signala različitih govornika.</p> / <p>The motivation for the research is the improvement of the automatic speaker recognition process regardless of the content of the spoken text.<br />The objective of this dissertation is to define a new biometric feature for text-independent speaker recognition: the two-dimensional information entropy of the speech signal.<br />The new feature is defined exclusively in the time domain, so the computational complexity of the feature extraction algorithm is significantly lower than for features extracted in the spectral domain. The performance of the two-dimensional information entropy is evaluated on a representative set of randomly chosen speakers. The new feature is shown to have small within-speaker variability and significant between-speaker variability.</p>
|
65 |
Channel Compensation for Speaker Recognition Systems
Neville, Katrina Lee, katrina.neville@rmit.edu.au January 2007
This thesis addresses the problem of how best to remedy different types of channel distortion on speech when that speech is to be used in automatic speaker recognition and verification systems. In automatic speaker recognition, a person's voice is analysed by a machine and the person's identity is determined by comparing speech features to a known set of speech features. In automatic speaker verification, a person claims an identity and the machine determines whether that claimed identity is correct or whether the person is an impostor. Channel distortion occurs whenever information is sent electronically through any type of channel, whether that channel is a basic wired telephone channel or a wireless channel. The types of distortion that can corrupt the information include time-variant or time-invariant filtering of the information and the addition of 'thermal noise' to the information. Both types of distortion can cause varying degrees of error in the information being received and analysed. The experiments presented in this thesis investigate the effects of channel distortion on the average speaker recognition rates and test the effectiveness of various channel compensation algorithms designed to mitigate the effects of channel distortion. The speaker recognition system was represented by a basic recognition algorithm consisting of speech analysis, extraction of feature vectors in the form of Mel-cepstral coefficients, and a classification stage based on the minimum distance rule. 
Two types of channel distortion were investigated: convolutional (lowpass filtering) effects and the addition of white Gaussian noise. Three methods of channel compensation were tested: Cepstral Mean Subtraction (CMS), RelAtive SpecTrAl (RASTA) processing, and the Constant Modulus Algorithm (CMA). The results of the experiments showed that both CMS and RASTA processing improved the average speaker recognition rates for speech filtered at low cutoff frequencies (3 or 4 kHz), compared to speech with no compensation. The improvement due to RASTA processing was higher than that achieved with the CMS method. Neither the CMS nor the RASTA method was able to improve the accuracy of the speaker recognition system for cutoff frequencies of 5 kHz, 6 kHz or 7 kHz. In the case of noisy speech, all methods analysed were able to compensate at high SNRs of 40 dB and 30 dB, and only RASTA processing was able to compensate and improve the average recognition rate for speech corrupted with a high level of noise (SNR of 20 dB and 10 dB).
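Of the three compensation methods listed, Cepstral Mean Subtraction is the simplest to illustrate. A time-invariant convolutional channel adds a roughly constant offset to every cepstral frame, so subtracting the per-coefficient mean over an utterance removes it. The sketch below is a generic illustration in plain Python, not the implementation used in the thesis; the function name and the list-of-lists frame format are assumptions.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean taken over all frames of one utterance.

    `frames` is a list of equal-length lists of cepstral coefficients,
    one inner list per analysis frame.
    """
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient across the utterance.
    means = [sum(frame[k] for frame in frames) / num_frames for k in range(dim)]
    return [[frame[k] - means[k] for k in range(dim)] for frame in frames]


if __name__ == "__main__":
    clean = [[1.0, 2.0], [3.0, 6.0]]
    # A fixed channel shows up as a constant offset in the cepstral domain.
    channel = [0.5, -1.0]
    distorted = [[c + o for c, o in zip(frame, channel)] for frame in clean]
    print(cepstral_mean_subtraction(clean))
    print(cepstral_mean_subtraction(distorted))
```

Both calls in the usage example produce the same normalised frames, which is the property that makes CMS useful against time-invariant filtering. It does not address additive noise, which is consistent with the abstract testing it alongside other methods.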
|
66 |
Αναγνώριση ομιλητή / Speaker recognition
Ganchev, Todor 25 June 2007
Η παρούσα διατριβή πραγματεύεται την αναγνώριση ομιλητή σε πραγματικές συνθήκες. Τα κύρια σημεία της εργασίας είναι: (1) αξιολόγηση διαφόρων προσεγγίσεων εξαγωγής χαρακτηριστικών παραμέτρων ομιλίας, (2) μείωση της ισχύος της περιβαλλοντικής επίδρασης στην απόδοση της αναγνώρισης ομιλητή, και (3) μελέτη τεχνικών κατηγοριοποίησης, εναλλακτικών προς τις υπάρχουσες. Συγκεκριμένα, στο (1), προτείνεται μια νέα δομή εξαγωγής παραμέτρων ομιλίας βασισμένη σε πακέτα κυματομορφών, κατάλληλα σχεδιασμένη για αναγνώριση ομιλητή. Εξάγεται με ένα αντικειμενικό τρόπο σε σχέση με την απόδοση αναγνώρισης ομιλητή, σε αντίθεση με την MFCC προσέγγιση, που βασίζεται στην προσέγγιση της αντίληψης της ανθρώπινης ακοής. Έπειτα, στο (2), δίνεται μια δομή για την εξαγωγή παραμέτρων βασισμένη στα MFCC, ανεκτική στο θόρυβο, για την βελτίωση της απόδοσης της αναγνώρισης ομιλητή σε πραγματικό περιβάλλον. Συνοπτικά, μια τεχνική μείωσης του θορύβου βασισμένη σε μοντέλο προσαρμοσμένη στο πρόβλημα της επιβεβαίωσης ομιλητή ενσωματώνεται απευθείας στη δομή υπολογισμού των MFCC. Αυτή η προσέγγιση επέδειξε σημαντικό πλεονέκτημα σε πραγματικό και ταχέως μεταβαλλόμενο περιβάλλον. Τέλος, στο (3), εισάγονται δύο νέοι κατηγοριοποιητές που αναφέρονται ως Locally Recurrent Probabilistic Neural Network (LR PNN), και Generalized Locally Recurrent Probabilistic Neural Network (GLR PNN). Είναι υβρίδια μεταξύ των Recurrent Neural Network (RNN) και Probabilistic Neural Network (PNN) και συνδυάζουν τα πλεονεκτήματα των γεννετικών και διαφορικών προσσεγγίσεων κατηγοριοποίησης. Επιπλέον, τα νέα αυτά νευρωνικά δίκτυα είναι ευαίσθητα σε παροδικές και ειδικές συσχετίσεις μεταξύ διαδοχικών εισόδων, και έτσι, είναι κατάλληλα για να αξιοποιήσουν την συσχέτιση παραμέτρων ομιλίας μεταξύ πλαισίων ομιλίας. Κατά την εξαγωγή των πειραμάτων, διαφάνηκε ότι οι αρχιτεκτονικές LR PNN και GLR PNN παρέχουν καλύτερη απόδοση, σε σχέση με τα αυθεντικά PNN. / This dissertation deals with speaker recognition in real-world conditions. 
The main focus is on: (1) evaluation of various speech feature extraction approaches, (2) reduction of the impact of environmental interference on speaker recognition performance, and (3) study of alternatives to the current state-of-the-art classification techniques. Specifically, within (1), a novel wavelet packet-based speech feature extraction scheme fine-tuned for speaker recognition is proposed. It is derived in an objective manner with respect to speaker recognition performance, in contrast to the state-of-the-art MFCC scheme, which is based on an approximation of human auditory perception. Next, within (2), an advanced noise-robust feature extraction scheme based on MFCC is offered for improving speaker recognition performance in real-world environments. In brief, a model-based noise reduction technique adapted to the specifics of the speaker verification task is incorporated directly into the MFCC computation scheme. This approach demonstrated a significant advantage in real-world, fast-varying environments. Finally, within (3), two novel classifiers referred to as Locally Recurrent Probabilistic Neural Network (LR PNN) and Generalized Locally Recurrent Probabilistic Neural Network (GLR PNN) are introduced. They are hybrids between the Recurrent Neural Network (RNN) and the Probabilistic Neural Network (PNN) and combine the virtues of the generative and discriminative classification approaches. Moreover, these novel neural networks are sensitive to temporal and spatial correlations among consecutive inputs and are therefore able to exploit the inter-frame correlations among speech features derived for successive speech frames. The experiments demonstrated that the LR PNN and GLR PNN architectures provide a performance benefit compared to the original PNN.
|
67 |
Reliability of voice comparison for forensic applications / Fiabilité de la comparaison des voix dans le cadre judiciaire
Ajili, Moez 28 November 2017
Dans les procédures judiciaires, des enregistrements de voix sont de plus en plus fréquemment présentés comme élément de preuve. En général, il est fait appel à un expert scientifique pour établir si l’extrait de voix en question a été prononcé par un suspect donné (prosecution hypothesis) ou non (defence hypothesis). Ce prosessus est connu sous le nom de “Forensic Voice Comparison (FVC)” (comparaison de voix dans le cadre judiciaire). Depuis l’émergence du modèle DNA typing, l’approche Bayesienne est devenue le nouveau “golden standard” en sciences criminalistiques. Dans cette approche, l’expert exprime le résultat de son analyse sous la forme d’un rapport de vraisemblance (LR). Ce rapport ne favorise pas seulement une des hypothèses (“prosecution” ou “defence”) mais il fournit également le poids de cette décision. Bien que le LR soit théoriquement suffisant pour synthétiser le résultat, il est dans la pratique assujetti à certaines limitations en raison de son processus d’estimation. Cela est particulièrement vrai lorsque des systèmes de reconnaissance automatique du locuteur (ASpR) sont utilisés. Ces systèmes produisent un score dans toutes les situations sans prendre en compte les conditions spécifiques au cas étudié. Plusieurs facteurs sont presque toujours ignorés par le processus d’estimation tels que la qualité et la quantité d’information dans les deux enregistrements vocaux, la cohérence de l’information entre les deux enregistrements, leurs contenus phonétiques ou encore les caractéristiques intrinsèques des locuteurs. Tous ces facteurs mettent en question la notion de fiabilité de la comparaison de voix dans le cadre judiciaire. Dans cette thèse, nous voulons adresser cette problématique dans le cadre des systèmes automatiques (ASpR) sur deux points principaux. Le premier consiste à établir une échelle hiérarchique des catégories phonétiques des sons de parole selon la quantité d’information spécifique au locuteur qu’ils contiennent. 
Cette étude montre l’importance du contenu phonétique: Elle met en évidence des différences intéressantes entre les phonèmes et la forte influence de la variabilité intra-locuteurs. Ces résultats ont été confirmés par une étude complémentaire sur les voyelles orales basée sur les paramètres formantiques, indépendamment de tout système de reconnaissance du locuteur. Le deuxième point consiste à mettre en œuvre une approche afin de prédire la fiabilité du LR à partir des deux enregistrements d’une comparaison de voix sans recours à un ASpR. À cette fin, nous avons défini une mesure d’homogénéité (NHM) capable d’estimer la quantité d’information et l’homogénéité de cette information entre les deux enregistrements considérés. Notre hypothèse ainsi définie est que l’homogénéité soit directement corrélée avec le degré de fiabilité du LR. Les résultats obtenus ont confirmé cette hypothèse avec une mesure NHM fortement corrélée à la mesure de fiabilité du LR. Nos travaux ont également mis en évidence des différences significatives du comportement de NHM entre les comparaisons cibles et les comparaisons imposteurs. Nos travaux ont montré que l’approche “force brute” (reposant sur un grand nombre de comparaisons) ne suffit pas à assurer une bonne évaluation de la fiabilité en FVC. En effet, certains facteurs de variabilité peuvent induire des comportements locaux des systèmes, liés à des situations particulières. Pour une meilleure compréhension de l’approche FVC et/ou d’un système ASpR, il est nécessaire d’explorer le comportement du système à une échelle aussi détaillée que possible (le diable se cache dans les détails) / It is common to see voice recordings being presented as a forensic trace in court. Generally, a forensic expert is asked to analyse both suspect and criminal’s voice samples in order to indicate whether the evidence supports the prosecution (same-speaker) or defence (different-speakers) hypotheses. This process is known as Forensic Voice Comparison (FVC). 
Since the emergence of the DNA typing model, the likelihood-ratio (LR) framework has become the new “gold standard” in forensic sciences. The LR not only supports one of the hypotheses but also quantifies the strength of its support. However, the LR has some practical limitations that stem from its estimation process. This is particularly true when Automatic Speaker Recognition (ASpR) systems are used, as they output a score in all situations regardless of the case-specific conditions. Indeed, several factors are not taken into account by the estimation process, such as the quality and quantity of information in both voice recordings, their phonological content, or the speakers' intrinsic characteristics. All these factors call into question the validity and reliability of FVC. In this thesis, we wish to address these issues. First, we propose to analyse how the phonetic content of a pair of voice recordings affects FVC accuracy. We show that oral vowels, nasal vowels and nasal consonants bring more speaker-specific information than averaged phonemic content. In contrast, plosives, liquids and fricatives do not have a significant impact on LR accuracy. This investigation demonstrates the importance of the phonemic content and highlights interesting differences between inter-speaker effects and intra-speaker ones. A further study examines the speaker-specific information of each vowel based on formant parameters, without any use of an ASpR system. This study revealed interesting differences between vowels in terms of the quantity of speaker information. The results clearly show the importance of intra-speaker variability effects in FVC reliability estimation. Second, we investigate an approach to predict the LR reliability based only on the pair of voice recordings. 
We define a homogeneity criterion (NHM) able to measure the presence of relevant information and the homogeneity of this information between the pair of voice recordings. We expect the lowest homogeneity values to be correlated with the lowest LR accuracy, and the highest values with the highest accuracy. The results showed the value of the homogeneity measure for FVC reliability. Our studies also reported large differences in behaviour between FVC genuine and impostor trials. The results confirmed the importance of intra-speaker variability effects in FVC reliability estimation. The main takeaway of this thesis is that averaging the system behaviour over a large number of factors (speaker, duration, content...) potentially hides many important details. For a better understanding of the FVC approach and/or an ASpR system, it is necessary to explore the behaviour of the system at as detailed a scale as possible (the devil is in the details).
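For readers unfamiliar with the likelihood-ratio framework this abstract relies on, the standard definition can be written as follows. The symbols are assumptions introduced here for illustration: $E$ denotes the evidence (the pair of voice recordings), $H_p$ the prosecution (same-speaker) hypothesis and $H_d$ the defence (different-speakers) hypothesis.

```latex
\[
\mathrm{LR} = \frac{p(E \mid H_p)}{p(E \mid H_d)},
\qquad
\underbrace{\frac{P(H_p \mid E)}{P(H_d \mid E)}}_{\text{posterior odds}}
= \mathrm{LR} \times
\underbrace{\frac{P(H_p)}{P(H_d)}}_{\text{prior odds}}
\]
```

An LR above 1 supports the same-speaker hypothesis and an LR below 1 supports the different-speakers hypothesis; its distance from 1 expresses the strength of that support, which is why the reliability of its estimate matters.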
|
68 |
Convergence phonétique en interaction / Phonetic convergence in interaction
Lelong, Amélie 03 July 2012
Le travail présenté dans cette thèse est basé sur l’étude d’un phénomène appelé convergence phonétique qui postule que deux interlocuteurs en interaction vont avoir tendance à adapter leur façon de parler à leur interlocuteur dans un but communicatif. Nous avons donc mis en place un paradigme appelé « Dominos verbaux » afin de collecter un corpus large pour caractériser ce phénomène, le but final étant de doter un agent conversationnel animé de cette capacité d’adaptation afin d’améliorer la qualité des interactions homme-machine.Nous avons mené différentes études pour étudier le phénomène entre des paires d’inconnus, d’amis de longue date, puis entre des personnes provenant de la même famille. On s’attend à ce que l’amplitude de la convergence soit liée à la distance sociale entre les deux interlocuteurs. On retrouve bien ce résultat. Nous avons ensuite étudié l’impact de la connaissance de la cible linguistique sur l’adaptation. Pour caractériser la convergence phonétique, nous avons développé deux méthodes : la première basée sur une analyse discriminante linéaire entre les coefficients MFCC de chaque locuteur, la seconde utilisant la reconnaissance de parole. La dernière méthode nous permettra par la suite d’étudier le phénomène en condition moins contrôlée.Finalement, nous avons caractérisé la convergence phonétique à l’aide d’une mesure subjective en utilisant un nouveau test de perception basé sur la détection « en ligne » d’un changement de locuteur. Le test a été réalisé à l’aide signaux extraits des interactions mais également avec des signaux obtenus avec une synthèse adaptative basé sur la modélisation HNM. Nous avons obtenus des résultats comparables démontrant ainsi la qualité de notre synthèse adaptative. / The work presented in this manuscript is based on the study of a phenomenon called phonetic convergence which postulates that two people in interaction will tend to adapt how they talk to their partner in a communicative purpose. 
We have developed a paradigm called “Verbal Dominoes” to collect a large corpus to characterize this phenomenon, the ultimate goal being to endow a conversational agent with this adaptability in order to improve the quality of human-machine interactions. We conducted several studies to investigate the phenomenon between pairs of strangers, long-time friends, and people from the same family. We expected the amplitude of convergence to be related to the social distance between the two speakers, and the results confirmed this. We then studied the impact of knowledge of the linguistic target on adaptation. To characterize phonetic convergence, we developed two methods: the first is based on a linear discriminant analysis between the MFCC coefficients of each speaker, and the second uses speech recognition techniques. The latter method will allow us to study the phenomenon in less controlled conditions. Finally, we characterized phonetic convergence with a subjective measurement using a new perceptual test called speaker switching. The test was performed using signals coming from real interactions but also with synthetic data obtained with the harmonic plus
|
69 |
Characterization of the Voice Source by the DCT for Speaker Information
Abhiram, B January 2014
Extracting speaker-specific information from speech is of great interest to both researchers and developers alike, since speaker recognition technology finds application in a wide range of areas, primary among them being forensics and biometric security systems.
Several models and techniques have been employed to extract speaker information from the speech signal. Speech production is generally modeled as an excitation source followed by a filter. Physiologically, the source corresponds to the vocal fold vibrations and the filter corresponds to the spectrum-shaping vocal tract. Vocal tract-based features like the mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstral coefficients have been shown to contain speaker information. However, high-speed videos of the larynx show that the vocal folds of different individuals vibrate differently. Voice source (VS)-based features have also been shown to perform well in speaker recognition tasks, thereby revealing that the VS does contain speaker information. Moreover, a combination of the vocal tract and VS-based features has been shown to give improved performance, showing that the latter contain supplementary speaker information.
In this study, the focus is on extracting speaker information from the VS. The existing techniques for this are reviewed, and it is observed that features obtained by fitting a time-domain model to the VS perform worse than those obtained by simple transformations of the VS. Here, an attempt is made to propose an alternative way of characterizing the VS to extract speaker information, and to study the merits and shortcomings of the proposed speaker-specific features.
The VS cannot be measured directly. Thus, to characterize the VS, we first need an estimate of the VS, and the integrated linear prediction residual (ILPR) extracted from the speech signal is used as the VS estimate in this study. The voice source linear prediction model, which was proposed in an earlier study to obtain the ILPR, is used in this work.
It is hypothesized here that a speaker’s voice may be characterized by the relative proportions of the harmonics present in the VS. The pitch synchronous discrete cosine transform (DCT) is shown to capture these, and the gross shape of the ILPR in a few coefficients. The ILPR and hence its DCT coefficients are visually observed to distinguish between speakers. However, it is also observed that they do have intra-speaker variability, and thus it is hypothesized that the distribution of the DCT coefficients may capture speaker information, and this distribution is modeled by a Gaussian mixture model (GMM).
The DCT coefficients of the ILPR (termed the DCTILPR) are directly used as a feature vector in speaker identification (SID) tasks. Issues related to the GMM, like the type of covariance matrix, are studied, and it is found that diagonal covariance matrices perform better than full covariance matrices. Thus, mixtures of Gaussians having diagonal covariances are used as speaker models, and by conducting SID experiments on three standard databases, it is found that the proposed DCTILPR features fare comparably with the existing VS-based features. It is also found that the gross shape of the VS contains most of the speaker information, and that the very fine structure of the VS does not help in distinguishing speakers but instead leads to more confusion between speakers. The major drawbacks of the DCTILPR are its session and handset variability, but these are also present in existing state-of-the-art speaker-specific VS-based features and the MFCCs, and hence seem to be common problems. There are techniques to compensate for these variabilities, which need to be used when systems using these features are deployed in an actual application.
The DCTILPR is found to improve the SID accuracy of a system trained with MFCC features by 12%, indicating that the DCTILPR features capture speaker information which is missed by the MFCCs. It is also found that a combination of MFCC and DCTILPR features on a speaker verification task gives significant performance improvement in the case of short test utterances. Thus, on the whole, this study proposes an alternate way of extracting speaker information from the VS, and adds to the evidence for speaker information present in the VS.
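The idea of summarising one pitch cycle of the voice source estimate by a few DCT coefficients can be sketched as follows. This is only an illustration under stated assumptions: it takes a single pitch cycle that has already been extracted, uses an unnormalised DCT-II, and the function names and the number of retained coefficients are hypothetical. The abstract does not specify the DCT variant, the normalisation or the cycle extraction used in the thesis.

```python
import math


def dct2(signal):
    """Unnormalised DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    length = len(signal)
    return [
        sum(
            signal[n] * math.cos(math.pi / length * (n + 0.5) * k)
            for n in range(length)
        )
        for k in range(length)
    ]


def cycle_features(cycle, num_coeffs):
    """Keep the first `num_coeffs` DCT coefficients of one pitch cycle.

    Low-order coefficients describe the gross shape of the cycle;
    dropping the rest discards its fine structure.
    """
    return dct2(cycle)[:num_coeffs]


if __name__ == "__main__":
    # Toy stand-in for one pitch cycle of a voice source estimate.
    toy_cycle = [0.0, 0.4, 0.9, 1.0, 0.6, -0.8, -0.3, 0.0]
    print(cycle_features(toy_cycle, 4))
```

Feature vectors of this kind, collected over many cycles, are what a Gaussian mixture model with diagonal covariances would then be fitted to in a setup like the one the abstract describes.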
|
70 |
Rozšíření pro pravděpodobnostní lineární diskriminační analýzu v rozpoznávání mluvčího / Extensions to Probabilistic Linear Discriminant Analysis for Speaker Recognition
Plchot, Oldřich Unknown Date
This thesis deals with probabilistic models for automatic speaker recognition. In particular, it analyses in detail probabilistic linear discriminant analysis (PLDA), which models low-dimensional representations of utterances in the form of i-vectors. The thesis proposes two extensions to the currently used PLDA model. The newly proposed PLDA model with full posterior distribution models the uncertainty in i-vector generation. The thesis also proposes a new discriminative approach to training a speaker verification system based on PLDA. When the original PLDA is compared with the model extended by i-vector uncertainty modelling, the extended model achieves up to 20% relative improvement in tests with short recordings. For longer test segments (more than one minute) the gain in accuracy is smaller, but the accuracy of the new model is never lower than that of the baseline system. Training data, however, are usually available as sufficiently long segments, so in these cases the new model provides no advantage in training. The original PLDA model can be used for training, and its extended version can be used to obtain scores when testing on short speech segments. The discriminative model is based on classifying pairs of i-vectors into two classes representing target and non-target trials. The functional form for obtaining the score of each pair is derived from PLDA, and training is based on logistic regression, which minimises the cross-entropy between the correct labelling of all trials and the probabilistic labelling of trials proposed by the system. The results obtained with the discriminatively trained classifier are similar to those of the generative PLDA, but the discriminative system shows the ability to produce better calibrated scores. This ability leads to better actual accuracy on an unseen evaluation set, which is an important property for real-world use.
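The logistic-regression objective mentioned in this abstract, the cross-entropy between the true trial labels and the labels implied by the system's scores, can be sketched as below. This is a generic, unweighted version that treats each score as the log-odds of a target trial; the function name is hypothetical, and any weighting of target and non-target trials used in the thesis is not reproduced here.

```python
import math


def trial_cross_entropy(scores, labels):
    """Mean binary cross-entropy over verification trials.

    `scores` are interpreted as log-odds that a trial is a target trial;
    `labels` are 1 for target trials and 0 for non-target trials.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equally long")
    total = 0.0
    for score, label in zip(scores, labels):
        # Loss is log(1 + exp(-s)) for targets and log(1 + exp(s)) otherwise.
        z = -score if label == 1 else score
        # Numerically stable evaluation of log(1 + exp(z)).
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(scores)


if __name__ == "__main__":
    good = trial_cross_entropy([4.0, -3.0], [1, 0])
    bad = trial_cross_entropy([-4.0, 3.0], [1, 0])
    print(good, bad)
```

A lower value means the scores behave more like well-calibrated log-odds, which is the property the abstract highlights for the discriminatively trained system.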
|