Return to search

Automatic Person Verification Using Speech and Face Information

Identity verification systems are an important part of our every day life. A typical example is the Automatic Teller Machine (ATM) which employs a simple identity verification scheme: the user is asked to enter their secret password after inserting their ATM card; if the password matches the one prescribed to the card, the user is allowed access to their bank account. This scheme suffers from a major drawback: only the validity of the combination of a certain possession (the ATM card) and certain knowledge (the password) is verified. The ATM card can be lost or stolen, and the password can be compromised. Thus new verification methods have emerged, where the password has either been replaced by, or used in addition to, biometrics such as the person’s speech, face image or fingerprints. Apart from the ATM example described above, biometrics can be applied to other areas, such as telephone & internet based banking, airline reservations & check-in, as well as forensic work and law enforcement applications. Biometric systems based on face images and/or speech signals have been shown to be quite effective. However, their performance easily degrades in the presence of a mismatch between training and testing conditions. For speech based systems this is usually in the form of channel distortion and/or ambient noise; for face based systems it can be in the form of a change in the illumination direction. A system which uses more than one biometric at the same time is known as a multi-modal verification system; it is often comprised of several modality experts and a decision stage. Since a multi-modal system uses complimentary discriminative information, lower error rates can be achieved; moreover, such a system can also be more robust, since the contribution of the modality affected by environmental conditions can be decreased. This thesis makes several contributions aimed at increasing the robustness of single- and multi-modal verification systems. Some of the major contributions are listed below. The robustness of a speech based system to ambient noise is increased by using Maximum Auto-Correlation Value (MACV) features, which utilize information from the source part of the speech signal. A new facial feature extraction technique is proposed (termed DCT-mod2), which utilizes polynomial coefficients derived from 2D Discrete Cosine Transform (DCT) coefficients of spatially neighbouring blocks. The DCT-mod2 features are shown to be robust to an illumination direction change as well as being over 80 times quicker to compute than 2D Gabor wavelet derived features. The fragility of Principal Component Analysis (PCA) derived features to an illumination direction change is solved by introducing a pre-processing step utilizing the DCT-mod2 feature extraction. We show that the enhanced PCA technique retains all the positive aspects of traditional PCA (that is, robustness to compression artefacts and white Gaussian noise) while also being robust to the illumination direction change. Several new methods, for use in fusion of speech and face information under noisy conditions, are proposed; these include a weight adjustment procedure, which explicitly measures the quality of the speech signal, and a decision stage comprised of a structurally noise resistant piece-wise linear classifier, which attempts to minimize the effects of noisy conditions via structural constraints on the decision boundary.

Identiferoai:union.ndltd.org:ADTP/195453
Date January 2003
CreatorsSanderson, Conrad, conradsand@ieee.org
PublisherGriffith University. School of Microelectronic Engineering
Source SetsAustraliasian Digital Theses Program
LanguageEnglish
Detected LanguageEnglish
Rightshttp://www.gu.edu.au/disclaimer.html), Copyright Conrad Sanderson

Page generated in 0.0022 seconds