1 |
Learning speaker-specific characteristics with deep neural architecture
Salman, Ahmad, January 2012
Robust Speaker Recognition (SR) has long been a focus of research. Advances in speech-aided technologies, especially biometrics, highlight the need for foolproof SR systems. However, the performance of an SR system depends critically on the quality of the speech features used to represent speaker-specific information. This research aims to extract speaker-specific information from Mel-frequency Cepstral Coefficients (MFCCs) using deep learning. Speech is a mixture of information components, including linguistic, speaker-specific and emotional-state information. Extracting features for each information component is necessary for robust performance in different speech-related tasks. However, almost all forms of speech representation carry all of this information as a whole, which compromises the performance of SR systems. Motivated by the ability of deep architectures to solve complex problems by learning high-level, task-specific information from data, we propose a novel Deep Neural Architecture (DNA) to extract speaker-specific information (SI) from MFCCs, a popular frequency-domain representation of the speech signal. A two-stage learning strategy is adopted, based on unsupervised training for network initialisation followed by regularised contrastive learning. To train the network in the second stage, we devise a contrastive loss function that discriminates between speakers on the basis of their intrinsic statistical patterns, distributed in the representations yielded by the deep network. This is achieved through contrastive pairwise comparison of these representations for similar or dissimilar speakers. To improve generalisation and reduce the interference of environmental effects with the speaker-specific representation, we regularise the contrastive loss with the data reconstruction loss in a multi-objective optimisation.
A detailed study analyses the parameter space in training the proposed deep architecture for optimum performance. Finally, we compare the performance of our learned speaker-specific representations with several state-of-the-art techniques in speaker verification and speaker segmentation tasks. The results show that the representations acquired through the learned DNA are comparatively insensitive to text, language and environmental variability.
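The abstract does not give the exact form of its loss, so the following is only a minimal sketch of the general idea of combining a pairwise contrastive term with a reconstruction term. It assumes a generic margin-based contrastive loss and a mean-squared reconstruction error; the function names, the `margin` parameter and the weighting factor `alpha` are illustrative assumptions, not the thesis's definitions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length representation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(rep_a, rep_b, same_speaker, margin=1.0):
    """Generic margin-based contrastive loss (an assumed stand-in).

    Same-speaker pairs are pulled together; different-speaker pairs are
    pushed apart until their distance exceeds `margin`.
    """
    d = euclidean(rep_a, rep_b)
    if same_speaker:
        return d ** 2
    return max(0.0, margin - d) ** 2


def reconstruction_loss(x, x_hat):
    """Mean squared error between an input frame and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def total_loss(rep_a, rep_b, same_speaker, inputs, reconstructions,
               alpha=0.5, margin=1.0):
    """Weighted sum of the contrastive term and the reconstruction term.

    `alpha` (hypothetical) trades discrimination against reconstruction.
    """
    rec = sum(reconstruction_loss(x, x_hat)
              for x, x_hat in zip(inputs, reconstructions))
    con = contrastive_loss(rep_a, rep_b, same_speaker, margin)
    return alpha * con + (1.0 - alpha) * rec
```

In this sketch the reconstruction term acts as the regulariser: a representation that discards everything except what separates the training pairs is penalised because the inputs can no longer be rebuilt from it.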
|
2 |
Speaker Recognition in a handheld computer
Domínguez Sánchez, Carlos, January 2010
Handheld computers are widely used, whether as mobile phones, personal digital assistants (PDAs), or media players. Although these devices are personal, a given device is often used by a small set of people, for example a group of friends or a family. The most natural way for most humans to communicate is through speech. A natural way for these devices to know who is using them is therefore to listen to the user's speech, i.e., to recognize the speaker based upon their speech. This project exploits the microphone built into most of these devices and asks whether it is possible to develop an effective speaker recognition system that can operate within the limited resources of these devices (as compared to a desktop PC). The goal of this speaker recognition is to distinguish between the small set of people who could share a handheld device and those outside this set. The criterion is therefore that the device should work for any member of this small set and not work for anyone outside it. Furthermore, the device should recognize which specific person within this small group is using it. An application for a Windows Mobile PDA has been developed using C++. This application and its underlying theoretical concepts, as well as parts of the code and the results obtained (in terms of accuracy rate and performance), are presented in this thesis. The experiments conducted within this research indicate that it is feasible to recognize the user based upon their speech within a small group and, furthermore, to identify which member of the group is the user. This has great potential for automatically configuring devices within a home or office environment for the specific user. Potentially, all a user needs to do is speak within hearing range of the device to identify themselves to it. The device in turn can configure itself for this user.
|
3 |
Speaker verification incorporating high-level linguistic features
Baker, Brendan J., January 2008
Speaker verification is the process of verifying or disputing the claimed identity of a speaker based on a recorded sample of their speech. Automatic speaker verification technology can be applied to a variety of person authentication and identification applications including forensics, surveillance, national security measures for combating terrorism, credit card and transaction verification, automation and indexing of speakers in audio data, voice-based signatures, and over-the-phone security access. The ubiquitous nature of modern telephony systems allows for the easy acquisition and delivery of speech signals for processing by an automated speaker recognition system. Traditionally, approaches to automatic speaker verification have involved holistic modelling of low-level acoustic-based features in order to characterise physiological aspects of a speaker such as the length and shape of the vocal tract. Although the use of these low-level features has proved highly successful, there are numerous other sources of speaker-specific information in the speech signal that have largely been ignored. In spontaneous and conversational speech, perceptually higher levels of information such as the linguistic content, pronunciation idiosyncrasies, idiolectal word usage, speaking rates and prosody can also provide useful cues as to the identity of a speaker. The main aim of this work is to explore the incorporation of higher levels of information into the verification process. Specifically, linguistic constructs such as words, syllables and phones are examined for their usefulness as features for text-independent speaker verification. Two main approaches to incorporating these linguistic features are explored. Firstly, the direct modelling of linguistic feature sequences is examined. Stochastic language models are used to model word and phonetic sequences obtained from automatically generated transcripts.
Experimentation indicates that significant speaker-characterising information is indeed contained in both word- and phone-level transcripts. It is shown, however, that model estimation issues arise when limited speech is available for training. This speaker model estimation problem is addressed by employing an adaptive model training strategy that significantly improves the performance and extends the usefulness of both lexical and phonetic techniques to short training length situations. An alternate approach to incorporating linguistic information is also examined. Rather than modelling the high-level features independently of acoustic information, linguistic information is instead used to constrain and aid acoustic-based speaker verification techniques. It is hypothesised that a "text-constrained" approach provides direct benefits by facilitating more detailed modelling, as well as providing useful insight into which articulatory events provide the most useful speaker-characterising information. A novel framework for text-constrained speaker verification is developed. This technique is presented as a generalised framework capable of using different feature sets and modelling paradigms, and is based upon the use of a newly defined pseudo-syllabic segmentation unit. A detailed exploration of the speaker-characterising power of both broad phonetic and syllabic events is performed and used to optimise the system configuration. An evaluation of the proposed text-constrained framework using cepstral features demonstrates the benefits of such an approach over holistic approaches, particularly in extended training length scenarios. Finally, a complete evaluation of the developed techniques on the NIST 2005 speaker recognition evaluation database is presented. The benefit of including high-level linguistic information is demonstrated when a fusion of both high- and low-level techniques is performed.
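The direct modelling of phone or word sequences with stochastic language models can be illustrated with a small sketch. It assumes a bigram model with add-k smoothing and a log-likelihood ratio against a background model; the abstract does not specify the model order, smoothing or scoring rule, so all names and choices below are illustrative assumptions.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab, k=1.0):
    """Return a log-probability function for an add-k smoothed bigram model.

    `sequences` is a list of token lists (e.g. phone labels from a
    transcript); `vocab` is the set of all possible tokens.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    size = len(vocab)

    def log_prob(prev, cur):
        num = pair_counts[(prev, cur)] + k
        den = context_counts[prev] + k * size
        return math.log(num / den)

    return log_prob


def llr_score(seq, speaker_lp, background_lp):
    """Average per-bigram log-likelihood ratio of speaker vs. background."""
    pairs = list(zip(seq, seq[1:]))
    total = sum(speaker_lp(p, c) - background_lp(p, c) for p, c in pairs)
    return total / len(pairs)
```

A positive score means the test sequence is better explained by the claimed speaker's model than by the background model. With little training data most bigram counts are zero and the smoothing term dominates, which is one way to see the estimation problem with limited speech that the abstract describes.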
|
4 |
DSP Base Independent Phrase Real Time Speaker Recognition System
Yan, Ming-Xiang, 27 July 2004
This thesis presents a DSP-based speaker recognition system. To keep the modules within the floating-point representation of the processor, the algorithm is simplified. The system comprises the hardware setup and the implementation of the speaker recognition algorithm. The DSP chip is a floating-point DSP (the ADSP-21161 from the ADI SHARC series), and the recognition algorithm is based on the Gaussian mixture model. Experimental results show that the DSP-based system achieves good recognition accuracy and speed.
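As a rough illustration of Gaussian mixture model scoring for speaker identification, the sketch below evaluates diagonal-covariance mixtures and picks the speaker whose model gives the highest average log-likelihood. The data layout (a list of `(weight, mean, variance)` tuples per speaker) and the function names are assumptions made for this example; the thesis's simplified DSP implementation is not described in the abstract.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood under a GMM.

    `gmm` is a list of (weight, mean, variance) components. The
    log-sum-exp trick keeps the sum over components numerically stable.
    """
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        peak = max(logs)
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total / len(frames)


def identify(frames, models):
    """Return the name of the speaker model that best explains the frames."""
    return max(models, key=lambda name: gmm_log_likelihood(frames, models[name]))
```

For example, `identify(frames, {"alice": gmm_a, "bob": gmm_b})` returns the key of the best-scoring model; the training of the mixtures (typically by expectation-maximisation) is not shown here.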
|
5 |
Feature Design for Text Independent Speaker Recognition in Numerous Speaker Cases
Huang, Chun-Hao, 28 June 2001
A Microsoft Windows program is designed to implement a text-independent speaker recognition system for large speaker populations, based on Mel-cepstrum features, a hierarchical tree classifier and binary vector quantization. Experimental results show that accuracy is barely affected by increasing population size, and that recognition is faster than with traditional methods.
|
6 |
Automatic speaker recognition using phase based features
Thiruvaran, Tharmarajah, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, January 2009
Despite recent advances, improving the accuracy of automatic speaker recognition systems remains an important and challenging area of research. This thesis investigates two phase-based features, namely the frequency modulation (FM) feature and the group delay feature, in order to improve speaker recognition accuracy. Introducing features complementary to spectral envelope-based features is a promising approach for increasing the information content of the speaker recognition system. Although phase-based features are motivated by psychophysics and speech production considerations, they have rarely been incorporated into speaker recognition front-ends. A theory is developed and reported in this thesis to show that the FM component can be extracted using second-order all-pole modelling, and a technique for extracting FM features using this model is proposed, producing very smooth, slowly varying FM features that are effective for speaker recognition tasks. This approach is shown herein to significantly improve speaker recognition performance over other existing FM extraction methods. A highly computationally efficient FM estimation technique is then proposed, and its efficiency is shown through a comparative study with other methods with respect to the trade-off between computational complexity and performance. In order to further enhance the FM-based front-end specifically for speaker recognition, optimum frequency band allocation is studied in terms of the number of sub-bands and the spacing of centre frequencies, and two new frequency band re-allocations are proposed for FM-based speaker recognition. Two group delay features are also proposed, the log-compressed group delay feature and the sub-band group delay feature, to address problems in group delay caused by the zeros of the z-transform polynomial of a speech signal being close to the unit circle.
It is shown that the combination of group delay and FM complements Mel Frequency Cepstral Coefficients (MFCCs) in speaker recognition tasks. Furthermore, the proposed FM feature is successfully utilised for automatic forensic speaker recognition, which is implemented based on the likelihood ratio framework with two-stage modelling and calibration, and is shown to behave in a complementary manner to MFCCs. Notably, the FM-based system provides better calibration loss than the MFCC-based system, suggesting less ambiguity in FM information than in MFCC information in an automatic forensic speaker recognition system. In order to demonstrate the effectiveness of FM features in a large-scale speaker recognition environment, an FM-based speaker recognition subsystem is developed and submitted to the NIST 2008 speaker recognition evaluation as part of the I4U submission. Post-evaluation analysis shows a 19.7% relative improvement over the traditional MFCC-based subsystem when it is augmented by the FM-based subsystem. Consistent improvements in performance are obtained when MFCC is augmented with FM in all sub-categories of NIST 2008, in three development tasks and for the NIST 2001 database, demonstrating the complementary behaviour of MFCC and FM features.
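To make the group delay problem concrete, here is a minimal sketch of the standard (unmodified) group delay computation, using the identity tau(w) = (X_R Y_R + X_I Y_I) / |X|^2, where X is the Fourier transform of x[n] and Y is the Fourier transform of n x[n]. It uses a direct DFT for self-containment; the function names are illustrative, and the thesis's log-compressed and sub-band variants are not reproduced here.

```python
import cmath


def dft(x):
    """Direct O(N^2) discrete Fourier transform, adequate for short frames."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def group_delay(x):
    """Group delay (in samples) at each DFT bin of the frame x.

    Uses tau = Re(Y / X) = (X_R * Y_R + X_I * Y_I) / |X|^2, where Y is the
    DFT of n * x[n]. No flooring of |X|^2 is applied in this sketch.
    """
    spectrum = dft(x)
    weighted = dft([t * v for t, v in enumerate(x)])
    return [
        (a.real * b.real + a.imag * b.imag) / (abs(a) ** 2)
        for a, b in zip(spectrum, weighted)
    ]
```

The division by |X|^2 is where the difficulty arises: when a zero of the z-transform lies close to the unit circle, |X|^2 becomes very small at nearby frequencies and the raw group delay spikes, which is the behaviour the two proposed features are designed to tame.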
|
7 |
Multibiometric security in wireless communication systems
Sepasian, Mojtaba, January 2010
This thesis explores an application of multibiometrics to secure wireless communications. The media studied included Wi-Fi, 3G, and WiMAX, over which simulations and experimental studies were carried out to assess performance. Specifically, access is restricted to authorized users by a technique referred to hereafter as a multibiometric cryptosystem. In brief, the system is built upon a complete challenge/response methodology in order to obtain a high level of security on the basis of user identification by fingerprint and further confirmation by verification of the user through text-dependent speaker recognition. In the enrolment phase, a database of fingerprints watermarked with memorable texts, along with voice features based on the same texts, is created by sending them to the server over the wireless channel. In the verification stage, which consists of five steps, users claiming to be genuine are verified against the database. At the identification level, the user is first asked to present a fingerprint and a memorable word, the latter being watermarked into the former, so that the system can authenticate the fingerprint and verify its validity by retrieving the challenge for an accepted user. The following three steps then involve speaker recognition: the user responds to the challenge by text-dependent voice, the server authenticates the response, and finally the server accepts or rejects the user. In order to implement fingerprint watermarking, i.e. incorporating the memorable word as a watermark message into the fingerprint image, an algorithm of five steps has been developed.
The first three steps, which are novel, concern fingerprint image enhancement (CLAHE with 'Clip Limit', standard deviation analysis and sliding neighborhood); they are followed by two further steps for embedding the watermark into, and extracting it from, the enhanced fingerprint image using the Discrete Wavelet Transform (DWT). In the speaker recognition stage, the limitations of this technique in wireless communication have been addressed by sending voice features (cepstral coefficients) instead of raw samples. This scheme reduces the transmission time and the dependency of the data on the communication channel, and avoids packet loss. The results obtained support these claims.
|
8 |
Automatic speaker recognition by linear prediction : a study of the parametric sensitivity of the model
Collins, Anthony McLaren, January 1982
The application of the linear prediction model for speech waveform analysis to context-independent automatic speaker recognition is explored, primarily in terms of the parametric sensitivity of the model. Feature vectors to characterize speakers are formed from linear prediction speech parameters computed as inverse filter coefficients, reflection coefficients or cepstral coefficients, and also power spectrum parameters via Fast Fourier Transform coefficients. The comparative performance of these parameters is investigated in speaker recognition experiments. The stability of the linear prediction parameters is tested over a range of model order from p=6 to p=30. Two independent speech databases are used to substantiate the experimental results.
The quality of the automatic recognition technique is assessed in a novel experiment based on a direct performance comparison with the human skill of aural recognition. Correlation is sought between the performance of the aural and automatic recognition methods, for each of the four parameter sets. Although the recognition accuracy of the automatic system is superior to that of the direct aural technique, the error distributions are highly variable. The performance of the automatic system is shown to be empirically based and unlike the intuitive human process.
An extended preamble to the description of the experiments reviews the current art of automatic speaker recognition, with a critical consideration of the performance of linear prediction techniques. As supported by our experimental results, it is concluded that success in the laboratory rests upon a rather fragile foundation. Application to problems beyond the controlled laboratory environment is seen, therefore, to be still more precarious.
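The three linear prediction parameter sets named above are closely related, which a short sketch can show. It assumes the autocorrelation method with the Levinson-Durbin recursion and the convention A(z) = 1 + a1 z^-1 + ... + ap z^-p for the inverse filter; sign conventions for reflection coefficients vary between texts, and the function names are illustrative rather than taken from the thesis.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of a (windowed) speech frame."""
    return [
        sum(x[t] * x[t - lag] for t in range(lag, len(x)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the linear prediction normal equations.

    Returns (a, k, err): inverse filter coefficients a[0..order] with
    a[0] = 1, reflection coefficients k[0..order-1], and the final
    prediction error energy.
    """
    a = [1.0] + [0.0] * order
    k = []
    err = float(r[0])
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        ki = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + ki * prev[i - j]
        a[i] = ki
        k.append(ki)
        err *= 1.0 - ki * ki
    return a, k, err


def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the all-pole model 1 / A(z)."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for m in range(1, n):
            if n - m <= p:
                acc -= (m / n) * c[m] * a[n - m]
        c[n] = acc
    return c[1:]
```

Calling `levinson_durbin(autocorrelation(frame, p), p)` for several values of the model order p is one way to probe the kind of order sensitivity (p=6 to p=30) studied in the thesis.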
|
9 |
Robust speaker verification system
Nosratighods, Mohaddeseh, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, January 2008
Identity verification or biometric recognition systems play an important role in our daily lives. Applications include Automatic Teller Machines (ATMs), banking and share information retrieval, and personal verification for credit cards. Among the biometric techniques, authentication of speakers by their voice is of great importance, since it employs a non-invasive approach and is the only available modality in many applications. However, the performance of Automatic Speaker Verification (ASV) systems degrades significantly under adverse conditions which cause recordings from the same speaker to differ. The objective of this research is to investigate and develop robust techniques for performing automatic speaker recognition over various channel conditions, such as telephony and recorded microphone speech. This research is shown to improve the robustness of ASV systems in three main areas: feature extraction, speaker modelling and score normalization. At the feature level, a new set of dynamic features, termed Delta Cepstral Energy (DCE), is proposed in place of traditional delta cepstra. It not only greatly reduces the dimensionality of the feature vector compared with delta and delta-delta cepstra, but is also shown to provide the same performance for matched testing and training conditions on TIMIT and a subset of the NIST 2002 dataset. The concept of speaker entropy, which conveys the information contained in a speaker's speech based on the extracted features, facilitates comparative evaluation of the proposed methods. In addition, Frequency Modulation features are combined in a complementary manner with the Mel Frequency Cepstral Coefficients (MFCCs) to improve the performance of the ASV system under channel variability of various types. The proposed fused system shows a relative reduction of up to 23% in Equal Error Rate (EER) over the MFCC-based system when evaluated on the NIST 2008 dataset.
Currently, the main challenge in speaker modelling is channel variability across different sessions. A recent approach to channel compensation, based on Support Vector Machines (SVMs), is Nuisance Attribute Projection (NAP). The proposed multi-component approach to NAP attempts to compensate for the main sources of inter-session variation through an additional optimization criterion, to allow more accurate estimates of the most dominant channel artefacts and to improve system performance under mismatched training and test conditions. Another major issue in speaker recognition is that the variability of score distributions due to incompletely modelled regions of the feature space can produce segments of the test speech that are poorly matched to the claimed speaker model. A segment selection technique in score normalization is proposed that relies only on discriminative and reliable segments of the test utterance to verify the speaker. This approach is particularly useful in noisy conditions where speech activity detection is not reliable at the feature level. Another source of score variability comes from the fact that not all phonemes are equally discriminative. To address this, a new score re-weighting technique is applied to likelihood values based on the discriminative level of each Gaussian component, i.e. each particular region of the feature space. It is found that a limited number of Gaussian mixtures, herein termed discriminative components, are responsible for the overall performance, and that inclusion of the other non-discriminative components may only degrade system performance.
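The core operation of standard Nuisance Attribute Projection can be sketched in a few lines: each feature vector is projected onto the complement of a low-dimensional nuisance subspace, x' = (I - U U^T) x. The sketch assumes the columns of U are already orthonormal and are supplied as a list of vectors; how the subspace is estimated, and the additional criterion of the multi-component variant proposed in the thesis, are not shown.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nap_project(x, nuisance_basis):
    """Remove the components of x that lie in the nuisance subspace.

    `nuisance_basis` is a list of orthonormal direction vectors (the
    columns of U); the result equals (I - U U^T) x.
    """
    y = list(x)
    for u in nuisance_basis:
        coeff = dot(x, u)
        y = [yi - coeff * ui for yi, ui in zip(y, u)]
    return y
```

After projection, the vector has no component along any nuisance direction, so session differences confined to that subspace no longer affect the SVM kernel computed on the projected vectors.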
|