1 |
Sequential organization in computational auditory scene analysis
Shao, Yang, January 2007 (has links)
Thesis (Ph. D.)--Ohio State University, 2007. / Title from first page of PDF file. Includes bibliographical references (p. 156-168).
|
2 |
Audio Source Separation Using Perceptual Principles for Content-Based Coding and Information Management
Melih, Kathy, n/a January 2004 (has links)
The information age has brought with it a dual problem. In the first place, ready access to mechanisms that capture and store vast amounts of data in all forms (text, audio, image and video) has resulted in a continued demand for ever more efficient means to store and transmit these data. In the second, the rapidly growing store demands effective means to structure and access the data in an efficient and meaningful manner. In terms of audio data, the first challenge has traditionally been the realm of audio compression research, which has focused on statistical, unstructured audio representations that obfuscate the inherent structure and semantic content of the underlying data. This has only served to further complicate the second challenge, resulting in access mechanisms that are either impractical to implement, too inflexible for general application or too low-level for the average user. Thus, an artificial dichotomy has been created from what is in essence a dual problem.

The founding motivation of this thesis is that, although the hypermedia model has been identified as the ideal, cognitively justified method for organising data, existing audio data representations and coding models provide little, if any, support for, or resemblance to, this model. It is the contention of the author that any successful attempt to create hyperaudio must resolve this schism, addressing both storage and information management issues simultaneously. To achieve this, an audio representation must be designed that provides compact data storage while, at the same time, revealing the inherent structure of the underlying data. It is thus the aim of this thesis to present a representation designed with these factors in mind.

Perhaps the most difficult hurdle in the way of content-based audio coding and information management is auditory source separation. The MPEG committee noted this requirement during the development of its MPEG-7 standard; however, the mechanics of "how" to achieve auditory source separation were left as an open research question. The same committee proposed that MPEG-7 would "support descriptors that can act as handles referring directly to the data, to allow manipulation of the multimedia material." While meta-data tags are a partial solution to this problem, they cannot support manipulation of audio material down to the level of individual sources when several simultaneous sources exist in a recording. To achieve that, the data themselves must be encoded in a manner that allows such descriptors to be formed; content-based coding is therefore required, and in the case of audio this is impossible without effecting auditory source separation.

Auditory source separation is the concern of computational auditory scene analysis (CASA). However, the findings of CASA research have traditionally been restricted to a limited domain. To date, the only real application of CASA research to what could loosely be classified as information management has been signal enhancement for automatic speech recognition systems, where a CASA front end serves to separate the target speech from the background "noise". As such, the CASA-based approach presented in this thesis to one of the most significant challenges facing audio information management represents a significant contribution to the field.
Thus, this thesis unifies research from three distinct fields in an attempt to resolve some specific and general challenges faced by all three. It describes an audio representation based on a sinusoidal model from which low-level auditory primitive elements are extracted. The use of a sinusoidal representation is somewhat contentious, with the modern trend in CASA research tending toward more complex approaches in order to resolve issues relating to coincident partials. However, the choice of a sinusoidal representation is validated here by the demonstration of a method that resolves many of these issues. The majority of the thesis contributes several algorithms that organise the low-level primitives into low-level auditory objects that may form the basis of nodes or link anchor points in a hyperaudio structure. Finally, preliminary investigations into the representation's suitability for coding and information management tasks are outlined as directions for future research.
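As a point of reference for the kind of analysis such a representation rests on, the following minimal sketch extracts per-frame spectral peaks, the low-level sinusoidal primitives, from a signal. It assumes an STFT front end; the FFT size, hop and threshold are illustrative choices rather than the thesis's settings, and a partial tracker would still be needed to link peaks across frames into auditory objects.

import numpy as np
from scipy.signal import stft, find_peaks

def sinusoidal_peaks(x, fs, n_fft=2048, hop=512, floor_db=-60.0):
    # Short-time Fourier transform; Z has shape (n_freqs, n_frames).
    f, t, Z = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    mag_db = 20.0 * np.log10(np.abs(Z) + 1e-12)
    frames = []
    for m in range(mag_db.shape[1]):
        # Local spectral maxima above the threshold are the low-level
        # sinusoidal primitives (frequency, level) for this frame.
        idx, _ = find_peaks(mag_db[:, m], height=floor_db)
        frames.append([(f[i], mag_db[i, m]) for i in idx])
    return t, frames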
|
3 |
Computational auditory saliency
Delmotte, Varinthira Duangudom, 07 November 2012 (has links)
The objective of this dissertation research is to identify sounds that grab a listener's attention; sounds that draw a person's attention in this way are considered salient. The focus here is on investigating the role of saliency in the auditory attentional process. To identify these salient sounds, we have developed a computational auditory saliency model inspired by our understanding of the human auditory system and auditory perception.

By identifying salient sounds, we can better understand how sounds are processed by the auditory system and, in particular, the key features contributing to sound salience. Additionally, studying the salience of different auditory stimuli can improve the performance of current computational models in several different areas, by making use of information about what stands out perceptually to observers in a particular scene.

Auditory saliency also helps to rapidly sort the information present in a complex auditory scene. Since our resources are finite, not all information can be processed equally; we must therefore be able to quickly determine the importance of different objects in a scene. Moreover, an immediate response or decision may be required, and in order to respond, the observer needs to know the key elements of the scene. The issue of saliency is closely related to many different areas, including scene analysis.
The thesis provides a comprehensive look at auditory saliency. It explores the advantages and limitations of using auditory saliency models through different experiments and presents a general computational auditory saliency model that can be used for various applications.
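The abstract does not spell out the model's internals. For orientation only, one common construction in the auditory saliency literature derives a saliency map from center-surround contrast on a spectrogram; the sketch below shows that generic construction and should not be read as the model developed in this dissertation (the function name and scale parameters are invented for illustration).

import numpy as np
from scipy.ndimage import gaussian_filter

def auditory_saliency_map(spectrogram, center_sigma=1.0, surround_sigma=8.0):
    # Center-surround contrast: fine-scale smoothing minus coarse-scale
    # smoothing highlights regions that stand out from their local
    # time-frequency neighborhood.
    center = gaussian_filter(spectrogram, sigma=center_sigma)
    surround = gaussian_filter(spectrogram, sigma=surround_sigma)
    saliency = np.maximum(center - surround, 0.0)  # half-wave rectify
    return saliency / (saliency.max() + 1e-12)     # normalize to [0, 1]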
|
4 |
Bayesian Microphone Array Processing / ベイズ法によるマイクロフォンアレイ処理
Otsuka, Takuma, 24 March 2014 (has links)
Kyoto University / 0048 / New degree system, doctoral program / Doctor of Informatics / Kou No. 18412 / Joho No. 527 / 新制||情||93 (University Library) / 31270 / Kyoto University Graduate School of Informatics, Department of Intelligence Science and Technology / (Chief examiner) Professor Hiroshi G. Okuno, Professor Tatsuya Kawahara, Associate Professor Marco Cuturi Cameto, Lecturer Kazuyoshi Yoshii / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
|
5 |
Sound source segregation of multiple concurrent talkers via Short-Time Target Cancellation
Cantu, Marcos Antonio, 22 October 2018 (links)
The Short-Time Target Cancellation (STTC) algorithm, developed as part of this dissertation research, is a “Cocktail Party Problem” processor that can boost speech intelligibility for a target talker from a specified “look” direction, while suppressing the intelligibility of competing talkers. The algorithm holds promise for both automatic speech recognition and assistive listening device applications. The STTC algorithm operates on a frame-by-frame basis, leverages the computational efficiency of the Fast Fourier Transform (FFT), and is designed to run in real time. Notably, performance in objective measures of speech intelligibility and sound source segregation is comparable to that of the Ideal Binary Mask (IBM) and Ideal Ratio Mask (IRM). Because the STTC algorithm computes a time-frequency mask that can be applied independently to both the left and right signals, binaural cues for spatial hearing, including Interaural Time Differences (ITDs), Interaural Level Differences (ILDs) and spectral cues, can be preserved in potential hearing aid applications. A minimalist design for a proposed STTC Assistive Listening Device (ALD), consisting of six microphones embedded in the frame of a pair of eyeglasses, is presented and evaluated using virtual room acoustics and both objective and behavioral measures. The results suggest that the proposed STTC ALD can provide a significant speech intelligibility benefit in complex auditory scenes comprised of multiple spatially separated talkers.
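The mask computation at the heart of STTC is not reproduced here, but the binaural property the abstract highlights is easy to illustrate. In this minimal sketch (an assumption-laden illustration, not the dissertation's code), a single real-valued time-frequency mask, however it was estimated, is applied identically to the left and right ear signals; because phases are untouched and the left/right level ratio is preserved, ITDs, ILDs and spectral cues survive in whatever the mask passes. The mask is assumed to match the STFT grid.

import numpy as np
from scipy.signal import stft, istft

def apply_binaural_mask(left, right, mask, fs, n_fft=512, hop=256):
    # STFT both ear signals on the same frame grid.
    _, _, L = stft(left, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, R = stft(right, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    # One real-valued gain per time-frequency cell, applied to both ears:
    # phases are untouched (ITDs survive) and the left/right level ratio
    # is unchanged (ILDs survive) wherever the mask passes energy.
    _, out_l = istft(L * mask, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, out_r = istft(R * mask, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return out_l, out_r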
|
6 |
Bio-inspired noise robust auditory features
Javadi, Ailar, 12 June 2012 (links)
The purpose of this work is to investigate a series of biologically inspired modifications to state-of-the-art Mel-frequency cepstral coefficients (MFCCs) that may improve automatic speech recognition results. We have provided recommendations to improve speech recognition results depending on the signal-to-noise ratio level of the input signal. This work has been motivated by noise-robust auditory features (NRAF).

In the feature extraction technique, after a signal is filtered using bandpass filters, a spatial derivative step is used to sharpen the results, followed by an envelope detector (rectification and smoothing) and down-sampling for each filter bank before being compressed. A DCT is then applied to the results of all filter banks to produce features. The Hidden Markov Model Toolkit (HTK) is used as the recognition back-end to perform speech recognition given the features we have extracted.

In this work, we investigate the role of filter types, window size, spatial derivative, rectification types, smoothing, down-sampling and compression, and compare the final results to state-of-the-art Mel-frequency cepstral coefficients (MFCCs). A series of conclusions and insights are provided for each step of the process. The goal of this work has not been to outperform MFCCs; however, we have shown that by changing the compression type from log compression to 0.07 root compression we are able to outperform MFCCs in all noisy conditions.
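The pipeline described above translates fairly directly into code. The sketch below follows the listed steps (bandpass filterbank, spatial derivative, rectification and smoothing, downsampling, 0.07 root compression, DCT across bands); the filter design, band spacing and all parameter values are assumptions for illustration, not the settings studied in the thesis.

import numpy as np
from scipy.signal import butter, sosfilt
from scipy.fftpack import dct

def nraf_like_features(x, fs, n_bands=32, fmin=100.0, fmax=6000.0,
                       smooth_ms=10.0, decim=160, root=0.07):
    # 1. Bandpass filterbank (log-spaced second-order Butterworth bands).
    edges = np.geomspace(fmin, fmax, n_bands + 1)
    bands = np.stack([
        sosfilt(butter(2, [lo, hi], btype='bandpass', fs=fs, output='sos'), x)
        for lo, hi in zip(edges[:-1], edges[1:])
    ])
    # 2. Spatial derivative across adjacent channels sharpens the output.
    sharp = np.diff(bands, axis=0)
    # 3. Envelope detector: half-wave rectification plus moving-average smoothing.
    env = np.maximum(sharp, 0.0)
    k = max(1, int(fs * smooth_ms / 1000.0))
    env = np.apply_along_axis(np.convolve, 1, env, np.ones(k) / k, mode='same')
    # 4. Downsample each channel, then compress with a 0.07 root
    #    (the change the abstract credits for beating MFCCs in noise).
    comp = env[:, ::decim] ** root
    # 5. DCT across channels yields the final features, one vector per frame.
    return dct(comp, axis=0, norm='ortho')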
|
7 |
AUDIO SCENE SEGMENTATION USING A MICROPHONE ARRAY AND AUDITORY FEATURES
Unnikrishnan, Harikrishnan, 01 January 2010 (links)
Auditory stream denotes the abstract effect a source creates in the mind of the listener. An auditory scene consists of many streams, which the listener uses to analyze and understand the environment. Computer analyses that attempt to mimic human analysis of a scene must first perform Audio Scene Segmentation (ASS). ASS finds applications in surveillance, automatic speech recognition and human-computer interfaces. Microphone arrays can be employed to extract streams corresponding to spatially separated sources. However, when a source moves to a new location during a period of silence, such a system loses track of the source, resulting in multiple spatially localized streams for the same source. This thesis proposes to identify local streams associated with the same source using auditory features extracted from the beamformed signal. ASS using the spatial cues is performed first. Auditory features are then extracted, and segments are linked together based on the similarity of their feature vectors. An experiment was carried out with two simultaneous speakers, using a classifier to assign each localized stream to one speaker or the other. The best performance was achieved when pitch appended with Gammatone Frequency Cepstral Coefficients (GFCC) was used as the feature vector, yielding an accuracy of 96.2%.
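A minimal sketch of the stream-linking step described above: each localized stream segment is represented by a feature vector (pitch appended with GFCCs, both assumed to be extracted elsewhere) and assigned to a speaker by a classifier. The abstract does not name the classifier, so the nearest-neighbor choice here is an assumption, as are the function and variable names.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# train_feats: rows of [pitch, gfcc_1, ..., gfcc_n] from labeled stream
# segments; train_ids: which speaker produced each segment.
def link_streams(train_feats, train_ids, new_feats, k=5):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(train_feats, train_ids)
    # Each new localized stream is assigned to the speaker whose
    # training segments it most resembles.
    return clf.predict(new_feats)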
|
8 |
Monaural Speech Segregation in Reverberant Environments
Jin, Zhaozhang, 27 September 2010 (links)
No description available.
|
9 |
A biologically inspired approach to the cocktail party problem
Chou, Kenny F., 19 May 2020 (links)
At a cocktail party, one can choose to scan the room for conversations of interest, attend to a specific conversation partner, switch between conversation partners, or not attend to anything at all. The ability of the normal-functioning auditory system to flexibly listen in complex acoustic scenes plays a central role in solving the cocktail party problem (CPP). In contrast, certain demographics (e.g., individuals with hearing impairment or older adults) are unable to solve the CPP, leading to psychological ailments and reduced quality of life. Since the normal auditory system still outperforms machines in solving the CPP, an effective solution may be found by mimicking the normal-functioning auditory system.
Spatial hearing likely plays an important role in CPP-processing in the auditory system. This thesis details the development of a biologically based approach to the CPP by modeling specific neural mechanisms underlying spatial tuning in the auditory cortex. First, we modeled bottom-up, stimulus-driven mechanisms using a multi-layer network model of the auditory system. To convert spike trains from the model output into audible waveforms, we designed a novel reconstruction method based on the estimation of time-frequency masks. We showed that our reconstruction method produced sounds with significantly higher intelligibility and quality than previous reconstruction methods. We also evaluated the algorithm's performance using a psychoacoustic study, and found that it provided the same amount of benefit to normal-hearing listeners as a current state-of-the-art acoustic beamforming algorithm.
Finally, we modeled top-down, attention-driven mechanisms that allowed the network to flexibly operate in different regimes, e.g., monitor the acoustic scene, attend to a specific target, and switch between attended targets. The model explains previous experimental observations and proposes candidate neural mechanisms underlying flexible listening in cocktail-party scenarios. The strategies proposed here would benefit hearing-assistive devices for CPP processing (e.g., hearing aids), whose users would gain the ability to switch between various modes of listening in different social situations.
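The reconstruction idea described in the second paragraph, turning model spike trains into audible waveforms via estimated time-frequency masks, can be sketched as follows. This is a simplified illustration under stated assumptions (spike counts already binned onto the mixture's STFT grid, simple peak normalization as the mask estimate), not the dissertation's actual estimator.

import numpy as np
from scipy.signal import stft, istft

def reconstruct_from_spikes(mixture, spike_counts, fs, n_fft=512, hop=256):
    _, _, Z = stft(mixture, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    # spike_counts: per-channel spike counts binned onto the same
    # (n_freqs, n_frames) grid as Z; normalize them into a soft mask.
    mask = spike_counts / (spike_counts.max() + 1e-12)
    # Apply the mask to the mixture STFT while reusing the mixture phase,
    # then invert back to a waveform.
    _, y = istft(Z * mask, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return y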
|
10 |
串流式音訊分類於智慧家庭之應用 / Streaming audio classification for smart home environments
溫景堯, Wen, Jing Yao, Unknown Date (links)
Hearing, together with vision, is among the most important human senses. Computational auditory scene analysis (CASA) draws on the connection, established in the psychology of hearing, between the properties of the human ear and psychological perception to define a possible direction for bringing computer audition closer to human perception. The goal of this research is to apply principles from the psychology of hearing, using image processing and pattern recognition techniques, to design corresponding processing for audio enhancement, segmentation and description, and to achieve real-time audio classification in a smart home environment through similarity computation.

This research consists of three parts. The first is audio processing, which converts environmental sounds into signals a computer can process and enhance. The second designs image processing guided by CASA principles, aiming to achieve the results of audio processing in the image domain and to describe audio events with image features. The third defines a distance on the image features and applies the K-Nearest Neighbor (KNN) technique to audio events common in smart home environments, achieving real-time recognition and classification. Experimental results show that the proposed audio classification method performs well, reaching 80-90% recognition accuracy on eight sound types common in home environments and maintaining roughly 70% accuracy under noise or other acoustic interference. / Humans receive sounds such as language and music through audition; audition and vision are therefore viewed as the two most important aspects of human perception. Computational auditory scene analysis (CASA) defines a possible direction for closing the gap between computerized audition and human perception, using the correlation between properties of the ear and mental perception from the psychology of hearing. In this research, we develop and integrate methods for real-time streaming audio classification based on the principles of the psychology of hearing as well as pattern recognition techniques.

There are three major parts to this research. The first is audio processing, translating sounds into information that can be enhanced by computers; the second uses the principles of CASA to design a framework for audio signal description and event detection by means of computer vision and image processing techniques; the third defines the distance between image feature vectors and uses a K-Nearest Neighbor (KNN) classifier to accomplish audio recognition and classification in real time. Experimental results show that the proposed approach is quite effective, achieving an overall recognition rate of 80-90% for 8 types of audio input. Performance degrades only slightly in the presence of noise and other interference.
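A minimal sketch of the third part, the distance-plus-vote step of KNN classification, under assumptions: Euclidean distance stands in for the feature distance defined in the thesis, the feature vectors are assumed to be extracted already, and the names are illustrative.

import numpy as np
from collections import Counter

def classify_event(query, stored_feats, stored_labels, k=5):
    # Euclidean distance from the query event's feature vector to every
    # stored example, then a majority vote among the k nearest.
    dists = np.linalg.norm(stored_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(stored_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]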
|