Return to search

Speaker verification incorporating high-level linguistic features

Speaker verification is the process of verifying or disputing the claimed identity of a speaker based on a recorded sample of their speech. Automatic speaker verification technology can be applied to a variety of person authentication and identification applications including forensics, surveillance, national security measures for combating terrorism, credit card and transaction verification, automation and indexing of speakers in audio data, voice based signatures, and over-the-phone security access. The ubiquitous nature of modern telephony systems allows for the easy acquisition and delivery of speech signals for processing by an automated speaker recognition system. Traditionally, approaches to automatic speaker verification have involved holistic modelling of low-level acoustic-based features in order to characterise physiological aspects of a speaker such as the length and shape of the vocal tract. Although the use of these low-level features has proved highly successful, there are numerous other sources of speaker specific information in the speech signal that have largely been ignored. In spontaneous and conversational speech, perceptually higher levels of in- formation such as the linguistic content, pronunciation idiosyncrasies, idiolectal word usage, speaking rates and prosody, can also provide useful cues as to identify of a speaker. The main aim of this work is to explore the incorporation of higher levels of information into the verification process. Specifically, linguistic constructs such as words, syllables and phones are examined for their usefulness as features for text-independent speaker verification. Two main approaches to incorporating these linguistic features are explored. Firstly, the direct modelling of linguistic feature sequences is examined. Stochastic language models are used to model word and phonetic sequences obtained from automatically obtained transcripts. Experimentation indicates that significant speaker characterising information is indeed contained in both word and phone-level transcripts. It is shown, however, that model estimation issues arise when limited speech is available for training. This speaker model estimation problem is addressed by employing an adaptive model training strategy that significantly improves the performance and extended the usefulness of both lexical and phonetic techniques to short training length situations. An alternate approach to incorporating linguistic information is also examined. Rather than modelling the high-level features independently of acoustic information, linguistic information is instead used to constrain and aid acoustic- based speaker verification techniques. It is hypothesised that a ext-constrained" approach provides direct benefits by facilitating more detailed modelling, as well as providing useful insight into which articulatory events provide the most useful speaker-characterising information. A novel framework for text-constrained speaker verification is developed. This technique is presented as a generalised framework capable of using diĀ®erent feature sets and modelling paradigms, and is based upon the use of a newly defined pseudo-syllabic segmentation unit. A detailed exploration of the speaker characterising power of both broad phonetic and syllabic events is performed and used to optimise the system configuration. An evaluation of the proposed text- constrained framework using cepstral features demonstrates the benefits of such an approach over holistic approaches, particularly in extended training length scenarios. Finally, a complete evaluation of the developed techniques on the NIST2005 speaker recognition evaluation database is presented. The benefit of including high-level linguistic information is demonstrated when a fusion of both high- and low-level techniques is performed.

Identiferoai:union.ndltd.org:ADTP/265738
Date January 2008
CreatorsBaker, Brendan J.
PublisherQueensland University of Technology
Source SetsAustraliasian Digital Theses Program
Detected LanguageEnglish

Page generated in 0.002 seconds