This thesis addresses the significance of speech features within the task of speaker recognition. Motivated by the perception of simple attributes like 'loud', 'smooth' and 'fast', more than 70 new speech features are developed. Basic speech features such as pitch, loudness and speech speed are combined with these new features into a feature set, one set per utterance. A neural network classifier is used to evaluate the significance of these features by creating a speaker recognition system and analysing the behaviour of successfully trained single-speaker networks. An in-depth analysis of network weights allows a rating of significance and feature contribution. A subjective listening experiment validates the results of the neural network analysis. The work starts with an extended sentence analysis; ten sentences are uttered by each of 630 speakers. The extraction of 100 speech features is outlined and a 100-element feature vector is derived for each utterance. Some features and the methods of analysing them have been used elsewhere, for example pitch, sound pressure level, spectral envelope, loudness, speech speed and glottal-to-noise excitation. However, more than 70 of the 100 features are derivatives of these basic features and have not previously been described or used in speaker recognition research, especially not within a rating of feature significance. These derivatives include histograms, 3rd and 4th moments, function approximation, as well as other statistical analyses applied to the basic features. The first approach to assessing the significance of features and their possible use in a recognition system is based on a probability analysis. The analysis rests on the assumption that, within a speaker's ten utterances, single feature values have a small deviation and cluster around that speaker's mean value.
The presented features indeed cluster into groups and show significant differences between speakers, thus enabling a clear separation of voices when applied to a small database of fewer than 20 speakers. However, recognition and the assessment of individual feature contribution become impossible when the database is extended to 200 speakers. To ensure continuous validation of feature contribution it is necessary to consider a different type of classifier. These limitations are overcome with the introduction of neural network classifiers. A separate network is assigned to each speaker, resulting in the creation of 630 networks. All networks are of the standard feed-forward backpropagation type and have an architecture of 100 inputs, 20 hidden nodes and one output. The 6300 available feature vectors are split into training, validation and test sets in the ratio 5:3:2. The networks are initially trained with the same 100-feature input database. Successful training was achieved within 30 to 100 epochs per network. The speaker related to the network with the highest output is declared as the speaker represented by the input. The achieved recognition rate for 630 speakers is approximately 49%. A subsequent exclusion of features with minor significance raises the recognition rate to 57%. The analysis of the network weight behaviour reveals two major points. First, a definite ranking order of significance exists between the 100 features. Many of the newly introduced derivatives of pitch, brightness, spectral voice patterns and speech speed contribute strongly to recognition, whereas feature groups related to glottal-to-noise excitation ratio and sound pressure level play a less important role. The significance of features is rated by the training, testing and validation behaviour of the networks under data sets with reduced information content, the post-training weight distribution and the standard deviation of the weight distribution within networks. The findings match the results of a subjective listening experiment.
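The data split and the decision rule described above can be illustrated with a minimal sketch. It assumes a shuffled split over all vectors (the abstract gives only the 5:3:2 ratio, not how vectors were assigned to the sets), and the function names, speaker identifiers and output values are hypothetical; the trained networks themselves are not modelled.

```python
import random


def split_5_3_2(items, seed=0):
    """Shuffle items and split them into training/validation/test sets (5:3:2).

    The shuffling is an assumption; the abstract states only the ratio.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = n * 5 // 10
    n_val = n * 3 // 10
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


def identify_speaker(outputs):
    """Return the speaker whose network produced the highest output.

    `outputs` maps a speaker id to the scalar output of that speaker's
    network for one input feature vector.
    """
    return max(outputs, key=outputs.get)


# 6300 feature vectors, represented here only by their indices.
train, val, test = split_5_3_2(range(6300))
sizes = (len(train), len(val), len(test))

# Hypothetical outputs of three single-speaker networks for one utterance.
winner = identify_speaker({"spk_001": 0.12, "spk_002": 0.87, "spk_003": 0.40})
```

With 6300 vectors the three sets contain 3150, 1890 and 1260 vectors, and the utterance in the toy example is attributed to `spk_002`.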
As a second major result, the analysis shows that there are large differences between speakers in the significance of features, i.e. not all speakers use the same feature set to the same extent. The speaker-related networks exhibit key features by which they are uniquely identifiable, and these key features vary from speaker to speaker. Some features like pitch are used by all networks; other features like sound pressure level and glottal-to-noise excitation ratio are used by only a few distinct classifiers. Again, the findings correspond with the results of a subjective listening experiment. This thesis presents more than 70 new features which have never been used before in speaker recognition. A quantitative ranking order of 100 speech features is introduced. Such a ranking order has not been documented elsewhere and is comparatively new to the area of speaker recognition. This ranking order is further extended to describe the extent to which a classifier uses or omits single features, depending solely on the characteristics of the voice sample. Such a separation has not yet been documented and is a novel contribution. The close correspondence between the subjective listening experiment and the findings of the network classifiers shows that it is plausible to model the behaviour of human speech recognition with an artificial neural network. Again, such a validation is original in the area of speaker recognition.
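As an illustration of how trained weights might yield a per-speaker feature ranking, the sketch below scores each input feature by its mean absolute input-to-hidden weight. This measure is an assumption chosen for simplicity; the abstract does not state the exact formula used in the thesis, and the toy weight values and function names are hypothetical.

```python
def feature_scores(input_hidden_weights):
    """Mean absolute input-to-hidden weight per input feature.

    input_hidden_weights[h][f] is the weight from feature f to hidden node h.
    This score is an assumed stand-in for the thesis's significance rating.
    """
    n_hidden = len(input_hidden_weights)
    n_features = len(input_hidden_weights[0])
    return [
        sum(abs(input_hidden_weights[h][f]) for h in range(n_hidden)) / n_hidden
        for f in range(n_features)
    ]


def rank_features(scores):
    """Feature indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)


# Toy networks with 3 features and 2 hidden nodes
# (the networks in the thesis have 100 inputs and 20 hidden nodes).
weights_speaker_a = [[0.9, -0.1, 0.3], [-0.7, 0.2, 0.1]]
weights_speaker_b = [[0.1, 0.8, -0.2], [0.2, -0.6, 0.3]]

rank_a = rank_features(feature_scores(weights_speaker_a))
rank_b = rank_features(feature_scores(weights_speaker_b))
```

In this toy case feature 0 ranks first for speaker A and last for speaker B, mirroring the observation that key features vary from speaker to speaker.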
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:288845 |
Date | January 2002 |
Creators | Schuy, Lars |
Publisher | University of Sussex |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |