1 |
Robuste Genre-Klassifikation [Robust Genre Classification]. Klausing, Tilo, 22 February 2008 (has links) (PDF)
The automatic classification of music into genres has been systematically researched for several years. In that time, genre classification systems and their components have been steadily improved, with a wide variety of approaches being pursued. The first part of this thesis therefore gives a comprehensive overview of the field of genre classification, from the basic techniques to the current state of research. The second part presents a novel approach that aims to make genre classification more robust against possible disturbances. This is achieved by specifically detecting and filtering out regions in which a piece of music undergoes a change in its characteristics. An implementation of this approach is evaluated on a music collection with five genres, and the results are analysed in detail.
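A minimal sketch of the filtering idea described above, assuming frame-wise MFCC features and a simple z-score deviation test; the threshold, feature choice, and file name are illustrative, not the thesis's actual implementation:

```python
import numpy as np
import librosa

def stable_frames(y, sr, z_thresh=2.0):
    """Keep only frames whose features stay close to the track's overall statistics."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # shape: (13, n_frames)
    dist = np.linalg.norm(mfcc - mfcc.mean(axis=1, keepdims=True), axis=0)
    z = (dist - dist.mean()) / (dist.std() + 1e-9)
    return mfcc[:, z < z_thresh]                               # drop atypical regions

y, sr = librosa.load("track.wav", sr=22050)        # hypothetical input file
pooled = stable_frames(y, sr).mean(axis=1)         # feature vector for a genre classifier
```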
|
2 |
Language identification with language and feature dependency. Yin, Bo, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW, January 2009 (has links)
The purpose of Language Identification (LID) is to automatically identify a specific language from a spoken utterance. Language-specific characteristics are always associated with different languages. Most existing LID approaches utilise a statistical modelling process with common acoustic/phonotactic features to model specific languages while avoiding any language-specific knowledge. Great successes have been achieved in this area over the past decades. However, there is still a huge gap between these language-independent methods and the actual language-specific patterns. It is extremely useful to address these specific acoustic or semantic construction patterns without spending huge labour on annotation, which requires language-specific knowledge. Inspired by this goal, this research focuses on the language-feature dependency. Several practical methods have been proposed, and various features and modelling techniques have been studied. Some of them carry additional language-specific information without manual labelling, such as a novel duration modelling method based on articulatory features and a novel Frequency-Modulation (FM) based feature. The performance of each individual feature is studied for each of the language-pair combinations. The similarity between languages and the contribution of a particular feature to identifying a language are defined for the first time, in a quantitative manner. These distance measures and language-dependent contributions become the foundations of the two frameworks presented later: language-dependent weighting and hierarchical language identification. The latter provides particularly remarkable flexibility and enhancement when identifying a relatively large number of languages and accents, because the most discriminative feature or feature combination is used when separating each of the languages. The proposed systems are evaluated on various corpora and task contexts, including NIST language recognition evaluation tasks, and performance improved to varying degrees. The key techniques developed for this work have also been applied to a different problem from LID: speech-based cognitive load monitoring.
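As a rough illustration of the language-dependent weighting framework, the sketch below fuses per-feature scores using weights that differ per candidate language; all score and weight values are hypothetical placeholders, not results from the thesis:

```python
import numpy as np

# Hypothetical scores[f][l]: log-likelihood of language l under feature stream f.
scores = np.array([[-1.2, -0.8, -2.0],   # acoustic feature
                   [-0.9, -1.5, -1.1],   # FM-based feature
                   [-1.4, -0.7, -1.6]])  # duration feature

# Language-dependent weights: how much each feature contributes per language
# (illustrative values; the thesis derives such weights from contribution measures).
w = np.array([[0.5, 0.2, 0.4],
              [0.3, 0.5, 0.3],
              [0.2, 0.3, 0.3]])

fused = (w * scores).sum(axis=0)                  # weighted sum per language
print("identified language index:", int(np.argmax(fused)))
```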
|
3 |
Real-time Audio Classification on an Edge Device: Using YAMNet and TensorFlow Lite. Malmberg, Christoffer, January 2021 (has links)
Edge computing is the idea of moving computations away from the cloud and instead performing them at the edge of the network. The benefits of edge computing are reduced latency, increased integrity, and less strain on networks. Edge AI is the practice of deploying machine learning algorithms to perform computations on the edge. In this project, a pre-trained model, YAMNet, is retrained and used to perform audio classification in real time to detect gunshots, glass shattering, and speech. The model is deployed onto the edge device both as a full TensorFlow model and as TensorFlow Lite models, and accuracy, inference time, and memory allocation are compared for the full TensorFlow and TensorFlow Lite models with and without optimization. The results show that both TensorFlow and TensorFlow Lite are valid options, but that TensorFlow Lite offers substantial performance gains with little downside.
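A minimal sketch of the deployment side, assuming a retrained YAMNet-style classifier already converted to TensorFlow Lite; the model path and class labels are placeholders:

```python
import numpy as np
import tensorflow as tf

# Load the converted model and prepare its tensors.
interpreter = tf.lite.Interpreter(model_path="yamnet_retrained.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Silent stand-in buffer shaped to the model's expected input; on a real edge
# device this would be a frame of 16 kHz microphone audio.
waveform = np.zeros(inp["shape"], dtype=np.float32)
interpreter.set_tensor(inp["index"], waveform)
interpreter.invoke()
scores = interpreter.get_tensor(out["index"]).squeeze()

labels = ["gunshot", "glass_shatter", "speech"]   # the thesis's three target classes
print("detected:", labels[int(np.argmax(scores))])
```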
|
4 |
Audio Moment Retrieval based on Natural Language Query. Shevchuk, Danylo, January 2020 (has links)
Background. Users spend a lot of time searching through media content to find a desired fragment. Most of the time, people can verbally describe what they are looking for, but today there is little way to make use of such a description. Using that verbal description as a query to search for the right interval in a given audio sample would save people a lot of time. Objectives. The aim of this thesis is to compare the performance of methods suitable for retrieving desired intervals from an audio recording of arbitrary length using a natural language query. There are two objectives. The first is to train models that match a natural language input to a specific interval of a given soundtrack. The second is to evaluate the models' performance using conventional metrics. Methods. The research method is mixed. Literature on existing methods suitable for audio classification was reviewed, and three models were selected for the experiments: YamNet, AlexNet and ResNet-50. Two experiments were conducted. The goal of the first was to measure the models' performance on classifying audio samples; the goal of the second was to measure the same models' performance on the interval retrieval problem, which uses classification as part of the approach. The steps taken to conduct the experiments (data collection, data preprocessing, model training and performance evaluation) were reported, together with the statistical data obtained. Results. The two tests show how each model performs on the two separate problems: audio classification and interval retrieval based on a natural language query. The degree (performance-wise) to which we can match a natural language query to a corresponding interval of an audio recording of arbitrary length was calculated for each of the selected models. The aggregated performance of the models is mostly comparable, with YamNet occasionally outperforming the other two. The average Area Under the Curve and accuracy for the studied models are (67, 71.62), (68.99, 67.72) and (66.59, 71.93) for YamNet, AlexNet and ResNet-50, respectively. Conclusions. We found that the tested models were not capable of reliably retrieving intervals from an audio recording of arbitrary length based on a natural language query; however, the degree to which they can retrieve intervals varies with the queried keyword and with hyperparameters such as the threshold used to filter out audio patches whose probability for the queried class is too low.
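A minimal sketch of the detection-by-classification retrieval step the abstract describes: classify fixed-length audio patches, keep those where the queried class exceeds a probability threshold, and merge consecutive hits into intervals. The patch probabilities, hop size, and threshold are illustrative assumptions:

```python
import numpy as np

def retrieve_intervals(patch_probs, class_idx, hop_s=0.5, thresh=0.5):
    """Merge consecutive above-threshold patches into (start, end) intervals in seconds."""
    hits = patch_probs[:, class_idx] >= thresh
    intervals, start = [], None
    for i, h in enumerate(hits):
        if h and start is None:
            start = i
        elif not h and start is not None:
            intervals.append((start * hop_s, i * hop_s))
            start = None
    if start is not None:
        intervals.append((start * hop_s, len(hits) * hop_s))
    return intervals

# Hypothetical per-patch class probabilities from YamNet/AlexNet/ResNet-50.
probs = np.array([[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]])
print(retrieve_intervals(probs, class_idx=1))   # -> [(0.0, 1.0), (1.5, 2.0)]
```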
|
5 |
A Large-Scale UAV Audio Dataset and Audio-Based UAV Classification Using CNN. Yaqin Wang (8797037), 17 July 2023 (has links)
The growing popularity and increased accessibility of unmanned aerial vehicles (UAVs) have raised concerns about the potential threats they may pose. In response, researchers have devoted significant effort to developing UAV detection and classification systems, using diverse methodologies such as computer vision, radar, radio frequency, and audio-based approaches. However, the availability of publicly accessible UAV audio datasets remains limited. This research addresses that gap through the collection of a comprehensive UAV audio dataset and the development of a precise and efficient audio-based UAV classification system.
The research project is structured into three distinct phases, each serving a unique purpose in data collection and in training the proposed UAV classifier: data collection, dataset evaluation, and the implementation and training of a proposed convolutional neural network, followed by an in-depth analysis and evaluation of the obtained results. To assess the effectiveness of the model, several evaluation metrics are employed, including training accuracy, loss rate, the confusion matrix, and ROC curves.
The findings from this study demonstrate that the proposed CNN classifier exhibits nearly flawless performance in accurately classifying UAVs across 22 distinct categories.
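A minimal sketch of a CNN of the kind the study describes, operating on spectrogram patches and ending in 22 output classes; the layer sizes and input shape are assumptions, not the architecture from the thesis:

```python
import tensorflow as tf

# Illustrative spectrogram-based UAV classifier with 22 output categories.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),          # mel-spectrogram patch
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(22, activation="softmax"),      # 22 UAV categories
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```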
|
6 |
Content-based Audio Management and Retrieval System for News Broadcasts. Dogan, Ebru, 01 September 2009 (has links) (PDF)
Audio signals can provide rich semantic cues for analyzing multimedia content, so audio information has recently been used for content-based multimedia indexing and retrieval. Due to the growing amount of audio data, demand for efficient retrieval techniques is increasing. In this thesis work, we propose a complete, scalable and extensible audio-based content management and retrieval system for news broadcasts. The proposed system covers classification, segmentation, analysis and retrieval of an audio stream. In the sound classification and segmentation stage, a sound stream is segmented by classifying each sub-segment, in multiple steps, as silence, pure speech, music, environmental sound, speech over music, or speech over environmental sound. Support Vector Machines and Hidden Markov Models are employed for classification, and these models are trained using different sets of MPEG-7 features. In the analysis and retrieval stage, users have two alternatives for querying audio data. The first isolates the user from the underlying acoustic classes by providing semantic, domain-based fuzzy classes. The second lets users query audio by giving an audio sample, to find similar segments, or by directly requesting an expressive summary of the content. Additionally, a series of tests was conducted on audio tracks of TRECVID news broadcasts to evaluate the performance of the proposed solution.
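A minimal sketch of the SVM side of the classification stage, with random placeholder data standing in for the MPEG-7 feature sets (the HMM path is omitted):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

CLASSES = ["silence", "pure_speech", "music", "environmental",
           "speech_over_music", "speech_over_environmental"]

# Placeholder training data: one feature vector per audio sub-segment.
X = np.random.rand(600, 24)                 # stand-in for MPEG-7 descriptors
y = np.random.randint(0, len(CLASSES), 600)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)

segment = np.random.rand(1, 24)             # feature vector for a new sub-segment
print(CLASSES[int(clf.predict(segment)[0])])
```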
|
7 |
串流式音訊分類於智慧家庭之應用 / Streaming audio classification for smart home environments. 溫景堯, Wen, Jing Yao, Unknown Date (has links)
Hearing and vision are the two most important human senses; humans receive sounds such as language and music through audition. Computational auditory scene analysis (CASA) uses the correlations, established in the psychology of hearing, between the characteristics of the human ear and mental perception to define a possible direction for bringing computer audition closer to human perception. The purpose of this research is to apply the principles of the psychology of hearing, together with image processing and pattern recognition techniques, to design the corresponding audio enhancement, segmentation, and description steps, and to realise real-time audio classification in a smart home environment through similarity computation.
There are three major parts in this research. The first is audio processing, translating environmental sounds into signals that a computer can process and enhance. The second part uses the principles of CASA to design image-processing methods that achieve the audio-processing results on the image representation and describe audio events with image features. The third part defines a distance between image feature vectors and uses a K-Nearest Neighbor (KNN) classifier to recognise and classify common smart-home audio events in real time. Experimental results show that the proposed approach is quite effective, achieving a recognition rate of 80-90% for eight sounds commonly heard in home environments, and maintaining around 70% in the presence of noise and other acoustic interference.
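A minimal sketch of the KNN classification step, with placeholder spectrogram-derived feature vectors and illustrative home-event labels (the real label set and features are not given in the abstract):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

EVENTS = ["door", "phone", "alarm", "water", "footsteps",
          "speech", "TV", "glass"]           # illustrative home-event labels

# Placeholder spectrogram-derived feature vectors for training events.
X_train = np.random.rand(400, 32)
y_train = np.random.randint(0, len(EVENTS), 400)

knn = KNeighborsClassifier(n_neighbors=5)    # K nearest neighbours, as in the thesis
knn.fit(X_train, y_train)

incoming = np.random.rand(1, 32)             # feature vector for a new audio event
print(EVENTS[int(knn.predict(incoming)[0])])
```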
|
8 |
Feature selection for multimodal acoustic event detection. Butko, Taras, 08 July 2011 (has links)
The detection of acoustic events (AEs) naturally produced in a meeting room may help describe human and social activity. The automatic description of interactions between humans and the environment can be useful for providing implicit assistance to the people inside the room, context-aware and content-aware information requiring a minimum of human attention or interruptions, support for high-level analysis of the underlying acoustic scene, etc. On the other hand, the recent fast growth of available audio and audiovisual content strongly demands tools for analyzing, indexing, searching and retrieving the available documents. Given an audio document, the first processing step is usually audio segmentation (AS), i.e. the partitioning of the input audio stream into acoustically homogeneous regions which are labelled according to a predefined broad set of classes like speech, music, noise, etc. Acoustic event detection (AED) is the objective of this thesis work. A variety of features, coming not only from audio but also from the video modality, is proposed to deal with the detection problem in meeting-room and broadcast-news domains. Two basic detection approaches are investigated: joint segmentation and classification using Hidden Markov Models (HMMs) with Gaussian Mixture Densities (GMMs), and detection-by-classification using discriminative Support Vector Machines (SVMs). For the first case, a fast one-pass-training feature selection algorithm is developed to select, for each AE class, the subset of multimodal features that shows the best detection rate. AED in meeting-room environments aims at processing the signals collected by distant microphones and video cameras in order to obtain the temporal sequence of (possibly overlapping) AEs produced in the room. When applied to interactive seminars with a certain degree of spontaneity, detecting acoustic events from the audio modality alone produces a large number of errors, mostly due to temporal overlaps of sounds. This thesis includes several novelties regarding the task of multimodal AED. First, the use of video features: since in the video modality the acoustic sources do not overlap (except for occlusions), the proposed features improve AED in such rather spontaneous scenario recordings. Second, the inclusion of acoustic localization features, which, in combination with the usual spectro-temporal audio features, yield a further improvement in recognition rate. Third, the comparison of feature-level and decision-level fusion strategies for combining the audio and video modalities. In the latter case, the system output scores are combined using two statistical approaches: weighted arithmetic mean and fuzzy integral. Finally, due to the scarcity of annotated multimodal data, and in particular of data with temporal sound overlaps, a new multimodal database with a rich variety of meeting-room AEs has been recorded, manually annotated, and made publicly available for research purposes.
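A minimal sketch of the decision-level fusion by weighted arithmetic mean mentioned above; the class scores and the modality weight are hypothetical, and the fuzzy-integral variant is omitted:

```python
import numpy as np

def weighted_mean_fusion(audio_scores, video_scores, w_audio=0.6):
    """Decision-level fusion: per-class weighted arithmetic mean of modality scores."""
    return w_audio * audio_scores + (1.0 - w_audio) * video_scores

# Hypothetical per-class scores for one acoustic event candidate.
audio = np.array([0.1, 0.7, 0.2])   # e.g. [chair_move, applause, footsteps]
video = np.array([0.3, 0.4, 0.3])
fused = weighted_mean_fusion(audio, video)
print("detected class index:", int(np.argmax(fused)))
```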
|