Visual translation of speech as an aid for the deaf has long been a subject of electronic research and development. This thesis is concerned with a technique of sound visualisation based upon the theory of the primacy of dynamic, rather than static, information in the perception of speech sounds. The goal is the design and evaluation of a system to display the perceptually important features of an input sound in a dynamic format as similar as possible to the auditory representation of that sound. The human auditory system, as the most effective system of sound representation, is first studied. Then, based on the latest theories of hearing and techniques of auditory modelling, a simplified model of the human ear is developed. In this model, the outer and middle ears together are simulated by a high-pass filter, and the inner ear is modelled by a bank of band-pass filters, the outputs of which, after rectification and compression, are applied to a visualiser block. To design an appropriate visualiser block, theories of sound and speech perception are reviewed. Then the perceptually important properties of sound, and their relations to the physical attributes of the sound pressure wave, are considered to map the outputs from the auditory model onto an informative and recognisable running image, like the one known as a cochleagram. This conveyor-like image is then sampled by a 20-millisecond window at a rate of 50 samples per second, so that a sequence of phase-locked, rectangular images is produced. Animation of these images results in a novel method of spectrography displaying both the time-varying and the time-independent information of the underlying sound with high resolution in real time. The resulting system translates a spoken word into a visual gesture, and displays a still picture when the input is a steady-state sound. Finally, the implementation of this visualiser system is evaluated through several experiments undertaken by normal-hearing subjects. In these experiments, recognition of the gestures of a number of spoken words is examined through a set of two-word and multi-word forced-choice tests. The results of these preliminary experiments show a high recognition score (40-90 percent, where zero represents chance expectation) after only 10 learning trials. General conclusions drawn from the results suggest rapid learning of the gestures, language independence of the system, fidelity of the system in translating the auditory information, and persistence of the learned gestures in long-term memory. The results are very promising and motivate further investigation.
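The abstract outlines a signal-processing front end: a high-pass filter standing in for the outer and middle ears, a bank of band-pass filters for the inner ear, rectification and compression of each channel, and framing of the resulting cochleagram into 20 ms windows at 50 frames per second. The Python sketch below illustrates one plausible arrangement of those stages; it is not the thesis's implementation, and the filter orders, channel count, centre frequencies, and cube-root compression are illustrative assumptions.

```python
# A minimal sketch of the auditory front end described in the abstract.
# Assumptions (not from the thesis): Butterworth filters, 32 log-spaced
# channels, a 300 Hz high-pass cut-off, and cube-root compression.
import numpy as np
from scipy.signal import butter, lfilter

def auditory_frames(signal, fs, n_channels=32, f_lo=100.0, f_hi=6000.0):
    # Outer/middle ear: simple high-pass filter.
    b, a = butter(1, 300.0 / (fs / 2), btype="high")
    x = lfilter(b, a, signal)

    # Inner ear: bank of band-pass filters with log-spaced centre frequencies.
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_channels + 1)
    channels = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        y = lfilter(b, a, x)
        y = np.maximum(y, 0.0)        # rectification (half-wave)
        y = np.cbrt(y)                # compression (cube root, assumed)
        channels.append(y)
    env = np.stack(channels)          # running cochleagram: (channels, samples)

    # Sample the running image with a 20 ms window at 50 frames per second
    # (non-overlapping, since 20 ms x 50/s tiles the signal exactly).
    hop = int(0.020 * fs)
    n_frames = env.shape[1] // hop
    frames = env[:, :n_frames * hop].reshape(n_channels, n_frames, hop)
    return frames.mean(axis=2)        # one value per channel per frame
```

Animating the successive columns returned by such a function, rather than plotting them as a static spectrogram, corresponds to the "visual gesture" display the abstract describes.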
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:312098 |
Date | January 1998 |
Creators | Soltani-Farani, A. A. |
Publisher | University of Surrey |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://epubs.surrey.ac.uk/844112/ |