51 |
Multi-objective optimization for model selection in music classification / Flermålsoptimering för modellval i musikklassificering
Ujihara, Rintaro, January 2021
With the breakthrough of machine learning techniques, research on music emotion classification has made notable progress by combining various audio features with state-of-the-art machine learning models. Still, how music samples are preprocessed and which classification algorithm is chosen depends on the data set and on the objective of each project. The collaborating company of this thesis, Ichigoichie AB, is currently developing a system to categorize music data into positive/negative classes. To enhance the accuracy of the existing system, this project aims to identify the best model through experiments with six audio features (Mel spectrogram, MFCC, HPSS, Onset, CENS, Tonnetz) and several machine learning models, including deep neural network models, for the classification task. For each model, hyperparameter tuning is performed and the model is evaluated according to Pareto optimality with regard to accuracy and execution time. The results show that the most promising model accomplished 95% correct classification with an execution time of less than 15 seconds.
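As a hedged illustration of the Pareto-optimality criterion described above, the sketch below keeps a model only if no other model is both more accurate and faster; the model names and numbers are hypothetical, not results from the thesis.

```python
# Sketch: select Pareto-optimal models w.r.t. accuracy (maximize) and runtime (minimize).
# The candidate tuples below are hypothetical illustrations, not results from the thesis.

def pareto_front(candidates):
    """Return candidates not dominated by any other (higher accuracy AND lower runtime)."""
    front = []
    for name, acc, runtime in candidates:
        dominated = any(
            other_acc >= acc and other_rt <= runtime and (other_acc > acc or other_rt < runtime)
            for _, other_acc, other_rt in candidates
        )
        if not dominated:
            front.append((name, acc, runtime))
    return front

models = [
    ("mfcc_svm", 0.91, 4.2),      # (model, accuracy, execution time in seconds)
    ("melspec_cnn", 0.95, 14.0),
    ("tonnetz_knn", 0.88, 2.1),
    ("hpss_dnn", 0.93, 21.5),
]
print(pareto_front(models))  # hpss_dnn is dominated by melspec_cnn and drops out
```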
|
52 |
Adaptive Voice Control System using AI
Steen, Jasmine; Wilroth, Markus, January 2021
Controlling external actions with the voice is something humans have tried to do for a long time. There are many ways to implement a voice control system, but many of them require an internet connection, which limits the application area, and commercially available voice controllers have lagged behind due to the cost of development and maintenance. In this project an artifact was created to serve as an easy-to-use, generic voice controller tool that lets the user create voice commands that can be implemented in many different applications and platforms. The user needs no prior understanding of or experience with voice control in order to use and implement the controller.
|
53 |
[en] CONTINUOUS SPEECH RECOGNITION BY COMBINING MFCC AND PNCC ATTRIBUTES WITH SS, WD, MAP AND FRN METHODS OF ROBUSTNESS / [pt] RECONHECIMENTO DE VOZ CONTINUA COMBINANDO OS ATRIBUTOS MFCC E PNCC COM METODOS DE ROBUSTEZ SS, WD, MAP E FRN
CHRISTIAN DAYAN ARCOS GORDILLO, 09 June 2014
[en] The increasing interest in imitating, through machines, the model that governs the everyday process of human communication has become one of the most researched areas of knowledge and of great importance in recent decades. This technological area, known as speech recognition, has as its main challenge the development of robust systems that reduce the additive noise of the environments in which the voice signal is acquired before that signal feeds the speech recognizers. For this reason, this work presents four different ways to improve the performance of continuous speech recognition in the presence of additive noise: Wavelet Denoising and Spectral Subtraction for speech enhancement, and Histogram Mapping and Filtering with Neural Networks for feature compensation. These methods are applied separately and simultaneously, two by two, in order to minimize the mismatches caused by the noise introduced into the voice signal. In addition to the proposed robustness methods, and because speech recognizers depend mainly on the voice features used, two feature extraction algorithms, MFCC and PNCC, are examined, through which the voice signal is represented as a sequence of vectors containing short-time spectral information. The methods considered are evaluated through experiments using the HTK and Matlab software and the TIMIT (speech) and NOISEX-92 (noise) databases. Finally, two types of tests were carried out to obtain the experimental results. In the first case, a reference system based only on MFCC and PNCC features was assessed, showing how strongly the signal is degraded when the signal-to-noise ratios are lower. In the second case, the reference system is combined with the robustness methods proposed here, and the results of the methods acting alone and simultaneously are analyzed comparatively. It is noted that the simultaneous combination of the methods is not always more attractive; in general, however, the best result is achieved by combining MAP with PNCC features.
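As an illustration of one of the robustness methods named above, the sketch below shows a basic magnitude spectral subtraction with the noise spectrum estimated from the leading frames; it is an assumed, simplified formulation, not the thesis code, and the frame parameters are arbitrary.

```python
# Sketch (assumption, not the thesis implementation): magnitude spectral subtraction
# for speech enhancement, with the noise spectrum estimated from the first frames.
import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=10, floor=0.01):
    window = np.hanning(frame_len)
    starts = range(0, len(noisy) - frame_len, hop)
    frames = [noisy[i:i + frame_len] * window for i in starts]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    mags, phases = np.abs(spectra), np.angle(spectra)
    noise_mag = mags[:noise_frames].mean(axis=0)             # noise estimate from leading frames
    clean_mag = np.maximum(mags - noise_mag, floor * mags)   # subtract, keep a spectral floor
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phases), n=frame_len)
    out = np.zeros(len(noisy))
    for k, start in enumerate(starts):
        out[start:start + frame_len] += clean_frames[k]      # overlap-add resynthesis
    return out
```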
|
54 |
Sélection de paramètres acoustiques pertinents pour la reconnaissance de la parole / Relevant acoustic feature selection for speech recognition
Hacine-Gharbi, Abdenour, 09 December 2012
The objective of this thesis is to propose solutions and performance improvements for certain problems of relevant acoustic feature selection in the framework of speech recognition. Our first contribution is a new method of relevant feature selection based on an exact expansion of the redundancy between a feature and the features previously selected by a sequential forward search algorithm. The problem of estimating higher-order probability densities is solved by truncating the theoretical expansion of this redundancy at acceptable orders. Moreover, we proposed a stopping criterion that fixes the number of selected features according to the mutual information approximated at iteration j of the search algorithm. However, estimating the mutual information is difficult, since its definition depends on the probability densities of the variables (features), whose distribution types are unknown and whose estimates are carried out on a finite sample set. One approach for estimating these distributions is based on the histogram method, which requires a good choice of the number of bins (histogram cells). We therefore also proposed a new formula for computing the number of bins that minimizes the bias of the entropy and mutual information estimators. This new estimator was validated on simulated data and on speech data. In particular, it was applied to select the most relevant static and dynamic MFCC parameters for a connected-word recognition task on the Aurora2 database.
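The sketch below illustrates the plug-in (histogram-based) estimate of mutual information discussed above; the bin-count rule used here (Sturges' rule) is only a common placeholder, not the formula proposed in the thesis, and the data are synthetic.

```python
# Sketch: plug-in (histogram) estimate of mutual information between two features.
# The bin-count rule (Sturges) is a generic default, NOT the formula proposed in the thesis.
import numpy as np

def mutual_information_hist(x, y, bins=None):
    if bins is None:
        bins = int(np.ceil(np.log2(len(x)) + 1))             # Sturges' rule as a placeholder
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                                     # joint probability estimate
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)                 # marginals
    nz = pxy > 0                                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x + 0.5 * rng.normal(size=5000)                           # correlated -> positive MI (in nats)
print(mutual_information_hist(x, y))
```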
|
55 |
Ljudklassificering med Tensorflow och IOT-enheter : En teknisk studie
Karlsson, David, January 2020
Artificial intelligence and machine learning have started to become established as recognizable terms to the general public in their daily lives. Applications such as voice recognition and image recognition are widely used in mobile phones and in autonomous systems such as self-driving cars. This study examines how this technique can be used to classify sound as a complement to video surveillance in different settings, for example a bus station or other areas that might need monitoring. To do this, a Convolutional Neural Network has been used, since this is a popular architecture for image classification. In this model every sound has a visual representation in the form of a spectrogram that shows frequencies over time. One of the main goals of this study has been to apply this technique on so-called IoT units in order to classify sounds in real time, since these units are relatively affordable and require few resources. A Raspberry Pi was used to run a prototype version with TensorFlow and Keras as base APIs. The study's results show which parts are important to consider in order to get a good and reliable system, for example which hardware and software are needed to get started. The results also show which factors matter for streaming live sound with reliable results: the architecture of the classification model is very important, and different layers and parameters can have a large impact on the end result.
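A minimal sketch of the kind of spectrogram-based Keras CNN described above; the input shape, layer sizes and number of classes are assumptions for illustration, not the architecture used in the study.

```python
# Sketch (assumed architecture, not the one from the study): a small Keras CNN that
# classifies fixed-size spectrogram "images" of sounds.
import tensorflow as tf

def build_model(input_shape=(128, 128, 1), num_classes=5):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()   # model.fit(...) would then be called on spectrogram batches
```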
|
56 |
SPARSE DISCRETE WAVELET DECOMPOSITION AND FILTER BANK TECHNIQUES FOR SPEECH RECOGNITION
Jingzhao Dai, 11 June 2019
Speech recognition is widely applied to speech-to-text transcription, voice-driven commands, human-machine interfaces, and so on [1]-[8], and it has increasingly proliferated into human lives in the modern age. To improve the accuracy of speech recognition, various algorithms such as artificial neural networks and hidden Markov models have been developed [1], [2].

In this thesis work, speech recognition tasks with various classifiers are investigated. The classifiers employed include the support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), and convolutional neural network (CNN). Two novel feature extraction methods, sparse discrete wavelet decomposition (SDWD) and bandpass filtering (BPF) based on the Mel filter banks [9], are developed and proposed. To accommodate the diversity of classification algorithms, both one-dimensional (1D) and two-dimensional (2D) features are obtained. The 1D features are arrays of power coefficients in frequency bands, dedicated to training the SVM, KNN, and RF classifiers, while the 2D features capture both the frequency domain and the temporal variations: each 2D feature consists of the power values in the decomposed bands across consecutive speech frames. Most importantly, the 2D features, with a geometric transformation, are adopted to train the CNN.

Speech samples from male and female speakers are taken from a recorded data set as well as from a standard data set. First, the proposed feature extraction methods are applied to recordings with little noise and clear pronunciation; after many trials and experiments using this dataset, a high recognition accuracy is achieved. Then, these feature extraction methods are further applied to the standard recordings, which have random characteristics, ambient noise, and unclear pronunciation. Many experimental results validate the effectiveness of the proposed feature extraction techniques.
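As a rough illustration of wavelet-based 1D band-power features feeding a classical classifier (an assumed toy setup for illustration, not the exact SDWD method proposed in the thesis), a sketch follows; the synthetic signals and parameters are hypothetical.

```python
# Sketch (illustration only, not the thesis's SDWD method): 1D band-power features
# from a discrete wavelet decomposition, fed to an SVM classifier.
import numpy as np
import pywt
from sklearn.svm import SVC

def dwt_band_powers(signal, wavelet="db4", level=5):
    coeffs = pywt.wavedec(signal, wavelet, level=level)      # [cA_L, cD_L, ..., cD_1]
    return np.array([np.mean(c ** 2) for c in coeffs])       # power per sub-band

# Hypothetical toy data: two classes of synthetic frames at an assumed 8 kHz sample rate.
rng = np.random.default_rng(0)
t = np.arange(4000) / 8000.0
low = np.sin(2 * np.pi * 200 * t) + 0.1 * rng.normal(size=(50, 4000))    # 200 Hz tone + noise
high = np.sin(2 * np.pi * 3000 * t) + 0.1 * rng.normal(size=(50, 4000))  # 3 kHz tone + noise
X = np.array([dwt_band_powers(s) for s in np.vstack([low, high])])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))   # the band powers separate the two toy classes
```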
|
57 |
An IoT Solution for Urban Noise Identification in Smart Cities : Noise Measurement and Classification
Alsouda, Yasser, January 2019
Noise is defined as any undesired sound. Urban noise and its effect on citizens are a significant environmental problem, and the increasing level of noise has become a critical issue in some cities. Fortunately, noise pollution can be mitigated by better planning of urban areas or controlled by administrative regulations. However, the execution of such actions requires well-established systems for noise monitoring. In this thesis, we present a solution for noise measurement and classification using a low-power and inexpensive IoT unit. To measure the noise level, we implement an algorithm for calculating the sound pressure level in dB and achieve a measurement error of less than 1 dB. Our machine learning-based method for noise classification uses Mel-frequency cepstral coefficients for audio feature extraction and four supervised classification algorithms (namely support vector machine, k-nearest neighbors, bootstrap aggregating, and random forest). We evaluate our approach experimentally with a dataset of about 3000 sound samples grouped into eight sound classes (such as car horn, jackhammer, or street music). We explore the parameter space of the four algorithms to estimate the optimal parameter values for classifying the sound samples in the dataset under study, and achieve a noise classification accuracy in the range of 88%-94%.
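A minimal sketch of the dB sound-level idea described above, assuming uncalibrated float samples and a separate calibration offset; this is an assumed formulation, not the thesis implementation.

```python
# Sketch (assumption, not the thesis's calibrated code): RMS of a microphone frame
# converted to dB relative to full scale, plus an optional calibration offset.
import numpy as np

def sound_level_db(frame, calibration_offset_db=0.0, eps=1e-12):
    """frame: float samples in [-1, 1]. Returns level in dB (dBFS + calibration offset)."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return 20.0 * np.log10(rms + eps) + calibration_offset_db

rng = np.random.default_rng(0)
quiet = 0.01 * rng.uniform(-1, 1, 16000)    # one second at a hypothetical 16 kHz
loud = 0.5 * rng.uniform(-1, 1, 16000)
print(sound_level_db(quiet), sound_level_db(loud))   # the loud frame reads ~34 dB higher
```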
|
58 |
Semantic Classification And Retrieval System For Environmental Sounds
Okuyucu, Cigdem, 01 October 2012
The growth of multimedia content in recent years has motivated research in the area of audio classification and content retrieval. In this thesis, a general environmental audio classification and retrieval approach is proposed in which higher-level semantic classes (outdoor, nature, meeting and violence) are obtained from lower-level acoustic classes (emergency alarm, car horn, gun-shot, explosion, automobile, motorcycle, helicopter, wind, water, rain, applause, crowd and laughter). In order to classify an audio sample into acoustic classes, MPEG-7 audio features, the Mel Frequency Cepstral Coefficients (MFCC) feature and the Zero Crossing Rate (ZCR) feature are used with Hidden Markov Model (HMM) and Support Vector Machine (SVM) classifiers. Additionally, a new classification method using a Genetic Algorithm (GA) is proposed for the classification of the semantic classes. Query by Example (QBE) and keyword-based query capabilities are implemented for content retrieval.
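As an illustration of one of the low-level features mentioned above, the sketch below computes a frame-wise Zero Crossing Rate; the frame sizes and test signals are hypothetical choices, not those of the thesis.

```python
# Sketch: frame-wise zero-crossing rate (ZCR) over a 1D audio signal.
import numpy as np

def zero_crossing_rate(signal, frame_len=1024, hop=512):
    rates = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # count sign changes between consecutive samples
        crossings = np.count_nonzero(np.signbit(frame[1:]) != np.signbit(frame[:-1]))
        rates.append(crossings / frame_len)
    return np.array(rates)

rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)    # pure tone -> low ZCR
noise = rng.uniform(-1, 1, 16000)                            # white noise -> high ZCR
print(zero_crossing_rate(tone).mean(), zero_crossing_rate(noise).mean())
```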
|
59 |
Sélection de paramètres acoustiques pertinents pour la reconnaissance de la parole
Hacine-Gharbi, Abdenour, 09 December 2012
The objective of this thesis is to propose solutions and performance improvements for certain problems of relevant acoustic feature selection in the framework of speech recognition. Our first contribution is a new method of relevant feature selection based on an exact expansion of the redundancy between a feature and the features previously selected by a sequential forward search algorithm. The problem of estimating higher-order probability densities is solved by truncating the theoretical expansion of this redundancy at acceptable orders. Moreover, we proposed a stopping criterion that fixes the number of selected features according to the mutual information approximated at iteration j of the search algorithm. However, estimating the mutual information is difficult, since its definition depends on the probability densities of the variables (features), whose distribution types are unknown and whose estimates are carried out on a finite sample set. One approach for estimating these distributions is based on the histogram method, which requires a good choice of the number of bins (histogram cells). We therefore also proposed a new formula for computing the number of bins that minimizes the bias of the entropy and mutual information estimators. This new estimator was validated on simulated data and on speech data. In particular, it was applied to select the most relevant static and dynamic MFCC parameters for a connected-word recognition task on the Aurora2 database.
|
60 |
Analýza zvukových nahrávek pomocí hlubokého učení / Deep learning based sound records analysis
Kramář, Denis, January 2021
This master thesis deals with the problem of audio classification of chainsaw logging sounds in a natural environment, using mainly convolutional neural networks. First, the theory of graphical representation of audio signals is discussed. The following part is devoted to the machine learning area. The third chapter reviews existing work dealing with this problem. In the practical part, the dataset used and the tested neural networks are presented. The final results are compared by achieved accuracy and by ROC curves. The robustness of the presented solutions was tested with the proposed detection program and evaluated using objective criteria.
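As an illustration of the ROC-based comparison mentioned above (with hypothetical detector scores, not the thesis data), a short sketch follows.

```python
# Sketch: comparing a sound detector by ROC curve and AUC, with hypothetical scores
# for "chainsaw" (positive) and "background" (negative) samples.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(100), np.zeros(100)])        # 1 = chainsaw, 0 = background
scores = np.concatenate([rng.normal(0.7, 0.2, 100),           # detector scores for positives
                         rng.normal(0.3, 0.2, 100)])          # detector scores for negatives

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC:", roc_auc_score(y_true, scores))
# An operating threshold is then chosen on the curve to trade false alarms
# against missed chainsaw detections.
```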
|