Return to search

Audio Source Separation Using Perceptual Principles for Content-Based Coding and Information Management

The information age has brought with it a dual problem. In the first place, the ready access to mechanisms to capture and store vast amounts of data in all forms (text, audio, image and video), has resulted in a continued demand for ever more efficient means to store and transmit this data. In the second, the rapidly increasing store demands effective means to structure and access the data in an efficient and meaningful manner. In terms of audio data, the first challenge has traditionally been the realm of audio compression research that has focused on statistical, unstructured audio representations that obfuscate the inherent structure and semantic content of the underlying data. This has only served to further complicate the resolution of the second challenge resulting in access mechanisms that are either impractical to implement, too inflexible for general application or too low level for the average user. Thus, an artificial dichotomy has been created from what is in essence a dual problem. The founding motivation of this thesis is that, although the hypermedia model has been identified as the ideal, cognitively justified method for organising data, existing audio data representations and coding models provide little, if any, support for, or resemblance to, this model. It is the contention of the author that any successful attempt to create hyperaudio must resolve this schism, addressing both storage and information management issues simultaneously. In order to achieve this aim, an audio representation must be designed that provides compact data storage while, at the same time, revealing the inherent structure of the underlying data. Thus it is the aim of this thesis to present a representation designed with these factors in mind. Perhaps the most difficult hurdle in the way of achieving the aims of content-based audio coding and information management is that of auditory source separation. The MPEG committee has noted this requirement during the development of its MPEG-7 standard, however, the mechanics of "how" to achieve auditory source separation were left as an open research question. This same committee proposed that MPEG-7 would "support descriptors that can act as handles referring directly to the data, to allow manipulation of the multimedia material." While meta-data tags are a part solution to this problem, these cannot allow manipulation of audio material down to the level of individual sources when several simultaneous sources exist in a recording. In order to achieve this aim, the data themselves must be encoded in such a manner that allows these descriptors to be formed. Thus, content-based coding is obviously required. In the case of audio, this is impossible to achieve without effecting auditory source separation. Auditory source separation is the concern of computational auditory scene analysis (CASA). However, the findings of CASA research have traditionally been restricted to a limited domain. To date, the only real application of CASA research to what could loosely be classified as information management has been in the area of signal enhancement for automatic speech recognition systems. In these systems, a CASA front end serves as a means of separating the target speech from the background "noise". As such, the design of a CASA-based approach, as presented in this thesis, to one of the most significant challenges facing audio information management research represents a significant contribution to the field of information management. Thus, this thesis unifies research from three distinct fields in an attempt to resolve some specific and general challenges faced by all three. It describes an audio representation that is based on a sinusoidal model from which low-level auditory primitive elements are extracted. The use of a sinusoidal representation is somewhat contentious with the modern trend in CASA research tending toward more complex approaches in order to resolve issues relating to co-incident partials. However, the choice of a sinusoidal representation has been validated by the demonstration of a method to resolve many of these issues. The majority of the thesis contributes several algorithms to organise the low-level primitives into low-level auditory objects that may form the basis of nodes or link anchor points in a hyperaudio structure. Finally, preliminary investigations in the representation’s suitability for coding and information management tasks are outlined as directions for future research.

Identiferoai:union.ndltd.org:ADTP/195310
Date January 2004
CreatorsMelih, Kathy, n/a
PublisherGriffith University. School of Information Technology
Source SetsAustraliasian Digital Theses Program
LanguageEnglish
Detected LanguageEnglish
Rightshttp://www.gu.edu.au/disclaimer.html), Copyright Kathy Melih

Page generated in 0.0019 seconds