11 |
Verificação de locutores independente de texto: uma análise de robustez a ruído / Text-independent speaker verification: an analysis of robustness to noise
Pinheiro, Hector Natan Batista. 25 February 2015
The personal identification of individuals is a task executed millions of times every day
by organizations from diverse fields. Questions such as "Who is this individual?" or "Is
this person who he or she claims to be?" are constantly asked by organizations in financial
services, health care, e-commerce, telecommunication systems and governments. Biometric
identification is the process of identifying people using their physiological or behavioral
characteristics. These characteristics are generally known as biometrics and examples
of these include face, fingerprint, iris, handwriting and speech. Speaker recognition is
a biometric modality that performs personal identification using speaker-specific
information extracted from speech. This work focuses on the development of text-independent
speaker verification systems. In these systems, speech from an individual is used to verify the
claimed identity of that individual. Furthermore, the verification must occur independently
of the pronounced word or phrase. The main challenge in the development of speaker
recognition systems comes from the mismatches which may occur in the acquisition of
the speech signals. The techniques proposed to mitigate the mismatch effects are referred
to as compensation methods. They may operate in three domains: in the feature extraction
process, in the estimation of the speaker models and in the computation of the decision
score. Besides giving a broad description of the main techniques used in the development
of text-independent speaker verification systems, this work describes
the main feature-, model- and score-based compensation methods. The experiments
provide comprehensive comparisons between the conventional techniques and
the alternative compensation methods. Furthermore, two compensation methods are
proposed: one operates in the model domain and the other in the score domain. The proposed
score-domain compensation method is based on the Normal cumulative distribution
function and, in some contexts, outperformed the main score-domain
compensation techniques. On the other hand, the model-domain compensation technique
proposed in this work is based on a method presented in the literature which combines
two concepts: the multi-condition training and the Missing Data Theory. The formulation
proposed by its authors is based on Posterior Union Models and is not completely
appropriate for the text-independent speaker verification task. This work proposes a more
appropriate formulation for this context which combines the concepts used by the authors
with a type of modeling using Universal Background Models (UBMs). The proposed
method outperformed the usual GMM-UBM modeling technique, based on Gaussian
Mixture Models (GMMs).
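As a rough illustration of the GMM-UBM baseline and of a score mapping through the Normal cumulative distribution function, the sketch below scores a set of feature frames against a speaker model and a UBM, then passes the raw score through the Normal CDF. The abstract does not give the author's formulation, so everything here is an assumption made for illustration: the function names, the toy model parameters, and the use of an impostor-score mean and standard deviation to standardize the score.

```python
import math
import numpy as np

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames (T, D) under a diagonal GMM."""
    diff = frames[:, None, :] - means[None, :, :]                       # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (K,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)   # (T, K)
    log_weighted = np.log(weights) + log_comp
    peak = log_weighted.max(axis=1, keepdims=True)                      # log-sum-exp trick
    log_frame = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(log_frame.mean())

def llr_score(frames, speaker_gmm, ubm):
    """GMM-UBM verification score: speaker log-likelihood minus UBM log-likelihood."""
    return gmm_avg_loglik(frames, *speaker_gmm) - gmm_avg_loglik(frames, *ubm)

def normal_cdf_score(score, impostor_mean, impostor_std):
    """Map a raw score into [0, 1] with the Normal CDF (hypothetical normalization)."""
    z = (score - impostor_mean) / impostor_std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Toy models as (weights, means, variances); all values are made up for illustration.
ubm = (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [4.0, 4.0]]), np.ones((2, 2)))
speaker = (np.array([0.5, 0.5]), np.array([[1.0, 1.0], [4.0, 4.0]]), np.ones((2, 2)))

rng = np.random.default_rng(0)
target_frames = rng.normal(loc=1.0, scale=0.5, size=(200, 2))  # frames near the speaker mean

raw = llr_score(target_frames, speaker, ubm)
calibrated = normal_cdf_score(raw, impostor_mean=-1.0, impostor_std=0.5)
```

In a typical GMM-UBM system the speaker model is adapted from the UBM rather than set by hand as it is here, and the impostor statistics would be estimated from held-out impostor trials.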
|
12 |
Reconhecimento automático do locutor com redes neurais pulsadas. / Automatic speaker recognition using pulse coupled neural networks.
Timoszczuk, Antonio Pedro. 22 March 2004
Pulsed Neural Networks are currently the subject of intense research. This work evaluates the potential of this neural paradigm for automatic speaker recognition. After a review of the fundamentals of automatic speaker recognition and artificial neural networks, a spike response model of neurons is implemented and tested. A novel neural network architecture based on this neuron model is proposed and used in a speaker recognition system. Text-dependent and text-independent tests were performed using the Speaker Recognition v1.0 database from the CSLU (Center for Spoken Language Understanding) of the Oregon Graduate Institute, U.S.A., which contains sentences recorded over digital telephone lines. A multilayer perceptron is used as the classifier. The results confirm the viability of Pulsed Neural Networks for automatic speaker recognition and show that this neural paradigm is promising for handling the temporal information in the speech signal.
|
13 |
Análise das concentrações energéticas no limiar entre fonemas vozeados e não-vozeados e suas implicações para fins de reconhecimento de locutores dependente do discurso / Analysis of energy concentrations in the threshold between voiced and unvoiced phonemes and their implications for text-dependent speaker recognition
Ishizawa, William Habaro. 19 February 2015
Many studies and applications currently focus on computational speaker recognition. As interest in real applications within this area grows, especially in biometrics, where security and effectiveness are extremely important, studies that evaluate these applications become proportionally more necessary. This work therefore measures the accuracy of a speaker recognition system based on elementary features, namely the energies of frequency sub-bands, combined with a probabilistic classifier, and studies the viability of extracting those features from the transitions between voiced and unvoiced segments (TTVNV, from the Portuguese "transições entre trechos vozeados e não-vozeados") of the signals. Tests are carried out with different numbers of speakers and a fixed utterance (text-dependent). The accuracy obtained in the tests ranges from 20.18% to 92.53%. The results are compared and reported, complementing existing claims in the literature about the use of TTVNV with quantitative data.
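A minimal sketch of what sub-band frequency energies for one signal frame could look like is given below. The abstract does not specify the band layout, windowing, or normalization, so the equal-width bands, the Hann window, and the function name `subband_energies` are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def subband_energies(frame, n_bands=8):
    """Fraction of spectral energy in each of n_bands equal-width frequency bands."""
    windowed = frame * np.hanning(len(frame))        # taper the frame edges
    power = np.abs(np.fft.rfft(windowed)) ** 2       # one-sided power spectrum
    bands = np.array_split(power, n_bands)           # equal-width bands (assumed layout)
    energies = np.array([band.sum() for band in bands])
    total = energies.sum()
    return energies / total if total > 0 else energies

# Toy frame: a 750 Hz tone sampled at 8 kHz (256 samples), for illustration only.
sample_rate = 8000
n = np.arange(256)
tone = np.sin(2.0 * np.pi * 750.0 * n / sample_rate)
features = subband_energies(tone)
```

In a system like the one described, such a vector would be computed on frames taken around each voiced/unvoiced transition and passed to the classifier.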
|
15 |
Text-Independent Speaker Recognition Using Source Based Features
Wildermoth, Brett Richard. January 2001
The speech signal is primarily meant to carry a linguistic message, but it also contains speaker-specific information. It is generated by acoustically exciting the cavities of the mouth and nose, and can be used to recognize (identify or verify) a person. This thesis deals with the speaker identification task, i.e., finding the identity of a person from his or her speech among a group of persons already enrolled during the training phase. Listeners use many audible cues in identifying speakers. These range from high-level cues, such as the semantics and linguistics of the speech, to low-level cues relating to the speaker's vocal tract and voice source characteristics. In modern speaker identification systems, the vocal tract characteristics are generally modeled by cepstral coefficients. Although these coefficients are good at representing vocal tract information, they can be supplemented by pitch and voicing information. Pitch provides very important and useful information for identifying speakers, yet current speaker recognition systems rarely use it because it cannot be reliably extracted and is not always present in the speech signal. In this thesis, an attempt is made to utilize pitch and voicing information for speaker identification.
Using a text-independent speaker identification system, this thesis illustrates the reasonable performance of the cepstral coefficients, which achieve an identification error of 6%. Using pitch as a feature in a straightforward manner results in identification errors in the range of 86% to 94%, which is not very helpful. There are two main reasons why the direct use of pitch as a feature does not work for speaker recognition. First, speech is not always periodic; only about half of the frames are voiced. Pitch therefore cannot be estimated for the other half (the unvoiced frames), and the problem is how to account for pitch information for these frames during the recognition phase. Second, pitch estimation methods are not very reliable. They classify some voiced frames as unvoiced, and they make pitch estimation errors (such as doubling or halving of the pitch value, depending on the method).
To use pitch information for speaker recognition, these problems have to be overcome. We need a method that does not use the pitch value directly as a feature and that works reliably for voiced as well as unvoiced frames. We propose a method which uses the autocorrelation function of the given frame to derive pitch-related features. We call these the maximum autocorrelation value (MACV) features. They can be extracted for voiced as well as unvoiced frames and do not suffer from pitch doubling or halving errors. Using the MACV features along with the cepstral features improves speaker identification performance by 45%.
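One way to read "maximum autocorrelation value" features is sketched below: compute the normalized autocorrelation of a frame, restrict it to a plausible pitch-lag range, split that range into a few intervals, and keep the maximum in each. The abstract only states that the features are derived from the frame's autocorrelation function, so the lag range (2 to 16 ms), the number of intervals, and the function name `macv_features` are assumptions for illustration, not the thesis's exact recipe.

```python
import numpy as np

def macv_features(frame, sample_rate, n_intervals=5, min_lag_ms=2.0, max_lag_ms=16.0):
    """Maximum normalized autocorrelation value in each of n_intervals lag intervals."""
    frame = frame - frame.mean()
    energy = float(np.dot(frame, frame))
    if energy == 0.0:
        return np.zeros(n_intervals)                  # silent frame: no periodicity
    full = np.correlate(frame, frame, mode="full")    # lags -(N-1) .. (N-1)
    acf = full[len(frame) - 1:] / energy              # lags 0 .. N-1, acf[0] == 1
    lo = int(round(min_lag_ms * 1e-3 * sample_rate))
    hi = min(int(round(max_lag_ms * 1e-3 * sample_rate)), len(acf) - 1)
    if hi - lo + 1 < n_intervals:
        raise ValueError("frame too short for the requested lag range")
    segments = np.array_split(acf[lo:hi + 1], n_intervals)
    return np.array([segment.max() for segment in segments])

# Toy frames (30 ms at 8 kHz): a periodic 100 Hz signal and white noise.
sample_rate = 8000
n = np.arange(240)
voiced_like = np.sin(2.0 * np.pi * 100.0 * n / sample_rate)
noise_like = np.random.default_rng(0).normal(size=240)

voiced_macv = macv_features(voiced_like, sample_rate)
noise_macv = macv_features(noise_like, sample_rate)
```

Because every frame yields a full feature vector, no voiced/unvoiced decision or explicit pitch value is needed: a periodic frame shows a large value in the interval containing its pitch period, while a noise-like frame gives small values throughout.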
|
16 |
That voice sounds familiar : factors in speaker recognition
Eriksson, Erik J. January 2007
Humans have the ability to recognize other humans by voice alone. This is important both socially and for the robustness of speech perception. This Thesis contains a set of eight studies that investigate how different factors impact on speaker recognition and how these factors can help explain how listeners perceive and evaluate speaker identity. The first study is a review paper overviewing emotion decoding and encoding research. The second study compares the relative importance of the emotional tone in the voice and the emotional content of the message. A mismatch between these was shown to impact upon decoding speed. The third study investigates dialect as a factor in speaker recognition and shows, using a bidialectal speaker as the target voice to control all other variables, that the dominance of dialect cannot be overcome. The fourth paper investigates whether imitated stage dialects are as perceptually dominant as natural dialects. It was found that a professional actor could disguise his voice successfully by imitating a dialect, yet that a listener's proficiency in a language or accent can reduce susceptibility to a dialect imitation. Papers five to seven focus on automatic techniques for speaker separation. Paper five shows that a method developed for Australian English diphthongs produced comparable results with a Swedish glide + vowel transition. The sixth and seventh papers investigate a speaker separation technique developed for American English. It was found that the technique could be used to separate Swedish speakers and that it is robust against professional imitations. Paper eight investigates how age and hearing impact upon earwitness reliability. This study shows that a senior citizen with corrected hearing can be as reliable an earwitness as a younger adult with no hearing problem, but suggests that a witness's general cognitive skill deterioration needs to be considered when assessing a senior citizen's earwitness evidence.
On the basis of the studies a model of speaker recognition is presented, based on the face recognition model by Bruce and Young (1986; British Journal of Psychology, 77, pp. 305-327) and the voice recognition model by Belin, Fecteau and Bédard (2004; TRENDS in Cognitive Science, 8, pp. 129-134). The merged and modified model handles both familiar and unfamiliar voices. The findings presented in this Thesis, in particular the findings of the individual papers in Part II, have implications for criminal cases in which speaker recognition forms a part. The findings feed directly into the growing body of forensic phonetic and forensic linguistic research.
|
17 |
Confidence Measures for Speech/Speaker Recognition and Applications on Turkish LVCSR
Mengusoglu, Erhan. 24 May 2004
Confidence measures for the results of speech/speaker recognition make the systems more useful in real-time applications. Confidence measures provide a test statistic for accepting or rejecting the recognition hypothesis of the speech/speaker recognition system.
Speech/speaker recognition systems are usually based on statistical modeling techniques. In this thesis we defined confidence measures for the statistical modeling techniques used in speech/speaker recognition systems.
For speech recognition we tested available confidence measures and the newly defined confidence measure based on acoustic prior information under two conditions that cause errors: out-of-vocabulary words and the presence of additive noise. We showed that the newly defined confidence measure performs better in both tests.
A review of speech recognition and speaker recognition techniques and some related statistical methods is given throughout the thesis.
We also defined a new interpretation technique for confidence measures, based on the Fisher transformation of the likelihood ratios obtained in speaker verification. The transformation provided us with a linearly interpretable confidence level which can be used directly in real-time applications such as dialog management.
We have also tested the confidence measures for speaker verification systems and evaluated their efficiency for the adaptation of speaker models. We showed that using confidence measures to select adaptation data improves the accuracy of the speaker model adaptation process.
Another contribution of this thesis is the preparation of a phonetically rich continuous speech database for the Turkish language. The database is used for developing an HMM/MLP hybrid speech recognition system for Turkish. Experiments on the test sets of the database showed that the speech recognition system has good accuracy for long speech sequences while performance is lower for short words, as is the case for current speech recognition systems for other languages.
A new language modeling technique for the Turkish language is introduced in this thesis, which can also be used for other agglutinative languages. Performance evaluations showed that the newly defined language modeling technique outperforms the classical n-gram language modeling technique.
|
19 |
A Design of Multi-Session, Text Independent, TV-Recorded Audio-Video Database for Speaker Recognition
Wang, Long-Cheng. 07 September 2006
A four-session, text-independent, TV-recorded audio-video database for speaker recognition is collected in this thesis. The speaker data is used to verify the applicability of a design methodology based on Mel-frequency cepstrum coefficients and the Gaussian mixture model. Both single-session and multi-session problems are discussed in the thesis. Experimental results indicate that a 90% correct rate can be achieved for a single-session 3000-speaker corpus, while only a 67% correct rate can be obtained for a two-session 800-speaker dataset. The performance of a multi-session speaker recognition system is greatly reduced by the variability in the recording environment, the speakers' mood during recording and other unknown factors. Increasing system performance under multi-session conditions remains a challenging task for future work, and the establishment of such a large-scale multi-session speaker database plays an indispensable role in this task.
|
20 |
A design of text-independent medium-size speaker recognition system
Zheng, Shun-De. 13 September 2002
This paper presents text-independent speaker identification results for medium-size speaker populations of up to 400 speakers, for TV speech and the TIMIT database. A system based on Gaussian mixture speaker models is used for speaker identification, and experiments are conducted on the TV database and the TIMIT database. The TV database results show medium-size population performance under TV conditions. These are believed to be the first speaker identification experiments on the complete 400-speaker TV database and the largest text-independent speaker identification task reported to date. Identification accuracies of 94.5% on the TV database and 98.5% on the TIMIT database are obtained.
|