Adaption of reference patterns in word-based speech recognition

The word-based approach to automatic speech recognition is one which has received attention from many researchers and has been exploited in various practical applications. A typical recognition system has a set of stored reference patterns, one or more for each word in the vocabulary to be recognised. These reference patterns are formed from training utterances supplied before a recognition session begins, either by the intended user of the system or, for a speaker-independent system, by a representative set of speakers. When the system is used for recognition, each new input utterance is compared with the stored patterns and is recognised as the word (or sequence of words) for which the minimal value of a distance (dissimilarity) measure, or equivalently the maximal likelihood, is obtained. The comparison of the input with the reference patterns is typically accomplished by an algorithm incorporating dynamic programming, which finds the optimal alignment of input and reference patterns and the corresponding distance or likelihood. This approach to recognition, in its basic form, retains the same reference patterns unchanged throughout the recognition of any sequence of input utterances. Thus the recognition system has no capability of learning from the new utterances presented during a recognition session. If a recognition system can be made to adapt its reference patterns during its operation, to incorporate information from the recognised utterances, then this may be expected to allow progressive improvement of the modelling of the words (as pronounced by the current speaker), and hence enhancement of the accuracy of recognition, provided that the adaptation of incorrect words' reference patterns in cases of misrecognition can be prevented or kept to a sufficiently low level.
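As a rough illustration of the dynamic-programming comparison described above, the following sketch computes a dynamic time warping distance between an input and each stored template and picks the word with the minimal distance. It is a minimal sketch and not the system described in the thesis: the function names (`frame_distance`, `dtw_distance`, `recognise`), the Euclidean frame distance, and the unconstrained three-way step pattern are all assumptions made for the example.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed local measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(inp, ref):
    """Cumulative distance along the best time alignment of two frame sequences."""
    n, m = len(inp), len(ref)
    inf = float("inf")
    # cost[i][j] holds the best cumulative distance aligning inp[:i] with ref[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(inp[i - 1], ref[j - 1])
            # Assumed step pattern: advance the input, the reference, or both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognise(inp, templates):
    """Return (word, distance) for the template with the minimal DTW distance."""
    scored = ((word, dtw_distance(inp, ref)) for word, ref in templates.items())
    return min(scored, key=lambda pair: pair[1])


# Toy one-dimensional "features"; real systems would use spectral frames.
templates = {
    "yes": [[0.0], [1.0], [2.0]],
    "no": [[5.0], [5.0], [4.0]],
}
word, dist = recognise([[0.0], [0.0], [1.0], [2.0]], templates)
```

In this toy case the input is a time-stretched copy of the "yes" template, so the alignment absorbs the repeated first frame and the input is recognised as "yes".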
By adaptation, speaker-specific initial reference patterns can be made more reliably representative of the speaker's typical pronunciations, by the use of data from additional utterances of the words; and speaker-independent reference patterns can be made speaker-specific through the incorporation of information from utterances by the speaker currently using the recognition system. Adaptation can also permit the dynamic adjustment of reference patterns to track any gradual drift, or systematic difference from one occasion to another, in the speaker's voice or pronunciation or in the level and characteristics of background noise. In this thesis, the development of an isolated word recognition system which incorporates various adaptation options is described, and the results of experiments to measure the effects of adaptation are presented and discussed. Both supervised adaptation (which is controlled by feedback from the user as to the correctness or incorrectness of each recognition) and unsupervised adaptation (without such feedback) are explored. The adaptation operates by a weighted averaging of the current reference pattern (template) with the recognised input. Two main weighting options have been defined: one which results in optimisation of the templates for the speaker's typical realisations of the words (if these are assumed to be invariant in time), and one which results in tracking of gradual variations in time. Various values of the relative weights on the existing template and on the input have been tested. Adaptation has been applied both to speaker-specific initial templates and to speaker-independent ones. In each case, the statistical significances of comparative results are computed from the means and variations across a set of test speakers. A compensation technique has been introduced, whereby the distance obtained in matching a template with an input utterance is adjusted according to the number of times that template has been adapted. 
This is necessary because adaptation reduces the typical distances obtained for the adapted template even when this template does not correspond to the correct recognition of the input. Appropriate values of the compensation parameters, to optimise the recognition performance, have been found for various adaptation options. The main conclusions from the experiments are that adaptation, especially supervised adaptation, can yield consistent and useful improvements in the performance of an isolated word recognition system, and that the application of appropriate word distance compensation is important for the attainment of the maximum benefit from the adaptation. Possible refinements and extensions of the adaptation technique are discussed. Results of a limited evaluation of template adaptation in a connected word recognition system are presented. Other aspects of the recognition system which are described and discussed include an efficient multiple-stage decision procedure and some features of the user-system interface design.
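The weighted averaging and the distance compensation summarised in this abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis's actual formulation: the abstract does not give the weights or the compensation formula, so the running-mean weight `1 / (count + 1)`, the fixed tracking weight `alpha`, the linear compensation with parameter `c`, and the function names are all hypothetical. The input is also assumed to have been time-aligned to the template already, so that frames can be averaged one to one.

```python
def adapt_template(template, aligned_input, count, mode="mean", alpha=0.2):
    """Average a template with an aligned input; return (new_template, new_count).

    count is the number of utterances already averaged into the template.
    mode="mean"  : weight 1/(count+1) on the input, giving an equal-weight
                   running mean (assumed form of the 'typical realisation' option).
    mode="track" : fixed weight alpha on the input, so older utterances fade
                   (assumed form of the 'tracking' option).
    """
    if mode == "mean":
        w = 1.0 / (count + 1)
    elif mode == "track":
        w = alpha
    else:
        raise ValueError("mode must be 'mean' or 'track'")
    new_template = [
        [(1.0 - w) * t + w * x for t, x in zip(t_frame, x_frame)]
        for t_frame, x_frame in zip(template, aligned_input)
    ]
    return new_template, count + 1


def compensated_distance(distance, times_adapted, c=0.05):
    """Penalise templates that have been adapted more often.

    Purely illustrative linear form; c is a hypothetical tuning parameter.
    """
    return distance * (1.0 + c * times_adapted)


# A template built from one utterance, adapted once with a recognised input.
template, count = [[0.0], [2.0]], 1
template, count = adapt_template(template, [[2.0], [4.0]], count, mode="mean")
```

With the running-mean weight, every utterance contributes equally to the template, which suits a pronunciation assumed to be stable over time. With the fixed weight, recent utterances dominate, which suits gradual drift. The compensation raises the distances of frequently adapted templates so that their typically smaller raw distances do not unduly favour them in the comparison.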

Identifier oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:666256
Date January 1988
Creators McInnes, F. R.
Publisher University of Edinburgh
Source Sets Ethos UK
Detected Language English
Type Electronic Thesis or Dissertation