21

A robust audio-based symbol recognition system using machine learning techniques

Wu, Qiming 02 1900 (has links)
Masters of Science / This research investigates the creation of an audio-shape recognition system that can interpret a user’s drawn audio shapes—fundamental shapes, digits and/or letters—on a given surface, such as a table top, using a generic stylus such as the back of a pen. The system uses one, two or three piezo microphones, as required, to capture the sound of the audio gestures, and a combination of the Mel-Frequency Cepstral Coefficient (MFCC) feature descriptor and Support Vector Machines (SVMs) to recognise the audio shapes. The novelty of the system lies in the use of piezo microphones, which are low-cost, lightweight and portable; the main investigation is whether these microphones provide sufficiently rich information to recognise the audio shapes within such a framework.
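A minimal sketch of the MFCC-plus-SVM pipeline this abstract describes, using librosa and scikit-learn (neither is named in the thesis); the file names, labels and SVM parameters are illustrative assumptions.

```python
# Hedged sketch: summarise each recorded audio gesture with MFCC statistics and
# classify it with an SVM. Paths, labels and hyperparameters are hypothetical.
import librosa
import numpy as np
from sklearn.svm import SVC

def mfcc_features(path, n_mfcc=13):
    """Load one recorded gesture and summarise it as a fixed-length vector."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    # Mean and standard deviation over time give one vector per recording.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical training data: recordings of drawn shapes and their labels.
train_files = ["circle_01.wav", "square_01.wav", "digit7_01.wav"]
train_labels = ["circle", "square", "7"]

X = np.vstack([mfcc_features(f) for f in train_files])
clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(X, train_labels)

print(clf.predict([mfcc_features("unknown_gesture.wav")]))
```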
22

Classification of Affective Emotion in Musical Themes : How to understand the emotional content of the soundtracks of the movies?

Diaz Banet, Paula January 2021 (has links)
Music is created by composers to arouse different emotions and feelings in the listener and, in the case of soundtracks, to support the storytelling of scenes. The goal of this project is to find the best method to evaluate the emotional content of soundtracks. This emotional content can be measured quantitatively using Russell’s model of valence, arousal and dominance (VAD), which converts mood labels into numbers. To conduct the analysis, MFCCs and VGGish features were extracted from the soundtracks and used as inputs to a CNN and an LSTM model, in order to study which one achieved the better prediction. A database of 6757 soundtracks with their corresponding VAD values was created to perform this analysis. Ultimately, the results of the experiments will help the start-up Vionlabs better understand the content of movies and therefore make more accurate recommendations about what users want to consume on Video on Demand platforms according to their emotions or moods.
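As a rough illustration of one of the models compared above, the sketch below (assuming Keras, which the abstract does not name) maps an MFCC sequence to the three Russell dimensions with a small 1-D CNN; the input shape and layer sizes are assumptions, not the thesis's architecture.

```python
# Hedged sketch: a 1-D CNN regressing valence, arousal and dominance from an
# MFCC sequence. Shapes, layer sizes and training settings are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

N_FRAMES, N_MFCC = 1292, 20   # e.g. ~30 s of audio at ~43 MFCC frames/s (assumed)

model = tf.keras.Sequential([
    layers.Input(shape=(N_FRAMES, N_MFCC)),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3)             # valence, arousal, dominance
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# model.fit(mfcc_batches, vad_targets, epochs=..., validation_split=0.1)
```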
23

Application of LabVIEW and myRIO to voice controlled home automation

Lindstål, Tim, Marklund, Daniel January 2019 (has links)
The aim of this project is to use NI myRIO and LabVIEW for voice-controlled home automation. The NI myRIO is an embedded device with a Xilinx FPGA and a dual-core ARM Cortex-A9 processor as well as analog and digital input/output, and it is programmed with LabVIEW, a graphical programming language. The voice control is implemented in two different systems. The first system is based on an Amazon Echo Dot for voice recognition, a commercial smart speaker developed by Amazon Lab126. The Echo Dot devices are connected via the Internet to the voice-controlled intelligent personal assistant service known as Alexa (developed by Amazon), which is capable of voice interaction, music playback, and controlling smart devices for home automation. For this system, the present thesis project focuses on the myRIO used for wireless control of smart home devices, where smart lamps, sensors, speakers and an LCD display were implemented. The second system focuses on the myRIO for speech recognition and was built on the myRIO with a microphone connected. The speech recognition was implemented using mel-frequency cepstral coefficients and dynamic time warping. A few commands could be recognized, including a wake word, "Bosse", as well as four other commands for controlling the colors of a smart lamp. The thesis project is shown to be successful, having demonstrated that the implementation of home automation using the NI myRIO with two voice-controlled systems can correctly control home devices such as smart lamps, sensors, speakers and an LCD display.
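A minimal sketch of the MFCC-plus-dynamic-time-warping matching used in the second system; the wake word "Bosse" appears in the thesis, but the other template names shown here are illustrative assumptions.

```python
# Hedged sketch: classify an utterance by the stored MFCC template that is
# closest under dynamic time warping (DTW). Templates are hypothetical.
import numpy as np

def dtw_distance(a, b):
    """DTW cost between two MFCC sequences of shape (frames, coeffs)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])        # frame-to-frame cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognise(utterance_mfcc, templates):
    """Return the command whose MFCC template has the lowest DTW cost."""
    return min(templates, key=lambda name: dtw_distance(utterance_mfcc, templates[name]))

# templates = {"Bosse": mfcc_bosse, "red": mfcc_red, "green": mfcc_green, ...}
# print(recognise(new_utterance_mfcc, templates))
```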
24

A Design of Recognition Rate Improving Strategy for Japanese Speech Recognition System

Lin, Cheng-Hung 24 August 2010 (has links)
This thesis investigates recognition rate improvement strategies for a Japanese speech recognition system. Both training data development and a consonant correction scheme are studied. For training data development, a database of 995 two-syllable Japanese words is established by phonetically balanced sieving. Furthermore, feature models for the 188 common Japanese mono-syllables are derived through a mixed-position training scheme to increase the recognition rate. For consonant correction, a sub-syllable model is developed to enhance consonant recognition accuracy and hence further improve the overall correct rate for whole Japanese phrases. Experimental results indicate that the average correct rate for a Japanese phrase recognition system with 34 thousand phrases can be improved from 86.91% to 92.38%.
25

Inversion acoustique articulatoire à partir de coefficients cepstraux / Acoustic-to-articulatory inversion from cepstral coefficients

Busset, Julie 25 March 2013 (has links)
The acoustic-to-articulatory inversion of speech consists in recovering the vocal tract shape from the speech signal. This problem is tackled with an analysis-by-synthesis method relying on a physical model of speech production controlled by a small number of parameters describing the vocal tract shape: the jaw opening, the shape and position of the tongue, and the positions of the lips and larynx. In order to approximate the geometry of our speaker, the articulatory model is built from articulatory contours extracted from cineradiographic images giving a sagittal view of the vocal tract. This articulatory synthesizer allows us to create a table of pairs associating an articulatory vector with the corresponding acoustic vector. Formants (the resonance frequencies of the vocal tract) are not used as the acoustic vector because their extraction is not always reliable, causing errors during inversion. Cepstral coefficients are used as the acoustic vector instead. Moreover, the source effect and the mismatch between the speaker's vocal tract and the articulatory model are taken into account explicitly by comparing the natural spectra with those produced by the synthesizer, since both signals are available.
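A minimal numpy sketch of the table-lookup step in such an analysis-by-synthesis inversion: articulatory vectors are paired with the cepstra the synthesizer produces for them, and inversion keeps the codebook entries whose synthetic cepstra best match the observed ones. The file names, parameter count and distance measure are assumptions, not taken from the thesis.

```python
# Hedged sketch: nearest-neighbour lookup in a (articulatory, cepstral) codebook.
# The codebook files and their shapes are hypothetical.
import numpy as np

# articulatory_table[i] -> a few parameters (jaw, tongue shape/position, lips, larynx)
# cepstral_table[i]     -> cepstral vector synthesized from articulatory_table[i]
articulatory_table = np.load("articulatory_table.npy")   # (N, 7), assumed layout
cepstral_table = np.load("cepstral_table.npy")            # (N, n_ceps), assumed layout

def invert_frame(observed_cepstrum, k=5):
    """Return the k articulatory candidates whose synthetic cepstra are closest."""
    dists = np.linalg.norm(cepstral_table - observed_cepstrum, axis=1)
    best = np.argsort(dists)[:k]
    return articulatory_table[best], dists[best]
```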
26

Optimizing text-independent speaker recognition using an LSTM neural network

Larsson, Joel January 2014 (has links)
In this paper a novel speaker recognition system is introduced. With advances in computer science, automated speaker recognition has become increasingly popular as an aid in crime investigations and authorization processes. Here, a recurrent neural network approach is used to learn to identify ten speakers within a set of 21 audio books. Audio signals are processed via spectral analysis into Mel Frequency Cepstral Coefficients that serve as speaker-specific features, which are input to the neural network. The Long Short-Term Memory algorithm is examined for the first time within this area, with interesting results. Experiments are made to find the optimum network model for the problem. These show that the network learns to identify the speakers well, text-independently, when the recording situation is the same. However, the system has problems recognizing speakers from different recordings, which is probably due to the noise sensitivity of the speech processing algorithm in use.
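A minimal sketch, assuming Keras (the thesis does not specify the toolkit), of an LSTM classifier over MFCC frame sequences for a closed set of ten speakers; window length, layer sizes and training settings are assumptions.

```python
# Hedged sketch: an LSTM maps a fixed-length window of MFCC frames to one of
# ten speaker identities. All sizes here are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

N_FRAMES, N_MFCC, N_SPEAKERS = 200, 13, 10   # window and feature sizes (assumed)

model = tf.keras.Sequential([
    layers.Input(shape=(N_FRAMES, N_MFCC)),
    layers.LSTM(128),                          # final hidden state summarises the window
    layers.Dense(N_SPEAKERS, activation="softmax")
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(mfcc_windows, speaker_ids, epochs=..., validation_split=0.1)
```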
27

Accent Classification from Speech Samples by Use of Machine Learning

Carol Pedersen Unknown Date (has links)
“Accent” is the pattern of speech pronunciation by which one can identify a person’s linguistic, social or cultural background. It is an important source of inter-speaker variability and a particular problem for automated speech recognition. The aim of the study was to investigate a new computational approach to accent classification which did not require phonemic segmentation or the identification of phonemes as input, and which could therefore be used as a simple, effective accent classifier. Through a series of structured experiments this study investigated the effectiveness of Support Vector Machines (SVMs) for speech accent classification using time-based units rather than linguistically informed ones, and compared it to the accuracy of other machine learning methods, as well as the ability of humans to classify speech according to accent. A corpus of read speech was collected in two accents of English (Arabic and “Indian”) and used as the main data source for the experiments. Mel-frequency cepstral coefficients were extracted from the speech samples and combined into larger units of 10 to 150 ms duration, which then formed the input data for the various machine learning systems. Support Vector Machines were found to classify the samples with up to 97.5% accuracy, with very high precision and recall, using samples of between 1 and 4 seconds of speech. This compared favourably with a human listener study in which subjects were able to distinguish between the two accent groups with an average of 92.5% accuracy in approximately 8 seconds. Repeating the SVM experiments on a different corpus resulted in a best classification accuracy of 84.6%. Experiments using a decision tree learner and a rule-based classifier on the original corpus gave a best accuracy of 95%, but results over the range of conditions were much more variable than those using the SVM. Rule extraction was performed in order to help explain the results and better inform the design of the system. The new approach was therefore shown to be effective for accent classification, and a plan for its role within various other larger speech-related contexts was developed.
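As a rough sketch of the time-based-unit idea, the code below (librosa and scikit-learn assumed; neither is named in the thesis) stacks consecutive MFCC frames into fixed-length units and trains an SVM on them; file names, labels and unit length are illustrative.

```python
# Hedged sketch: build fixed-duration units from consecutive MFCC frames and
# classify each unit by accent with an SVM. Paths and sizes are hypothetical.
import numpy as np
import librosa
from sklearn.svm import SVC

FRAMES_PER_UNIT = 10   # consecutive frames per unit; duration depends on the hop size

def time_units(path, n_mfcc=13):
    """Split a recording's MFCC matrix into flattened fixed-length units."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T    # (frames, n_mfcc)
    n_units = len(mfcc) // FRAMES_PER_UNIT
    return mfcc[:n_units * FRAMES_PER_UNIT].reshape(n_units, -1)

# Hypothetical corpus: one file per speaker, labelled by accent group.
units_a = time_units("arabic_speaker_01.wav")
units_i = time_units("indian_speaker_01.wav")
X = np.vstack([units_a, units_i])
y = ["arabic"] * len(units_a) + ["indian"] * len(units_i)
clf = SVC(kernel="rbf").fit(X, y)
```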
29

[en] INDEPENDENT TEXT ROBUST SPEAKER RECOGNITION IN THE PRESENCE OF NOISE USING PAC-MFCC AND SUB BAND CLASSIFIERS / [pt] RECONHECIMENTO DE LOCUTOR INDEPENDENTE DO TEXTO EM PRESENÇA DE RUÍDO USANDO PAC-MFCC E CLASSIFICADORES EM SUB-BANDAS

HARRY ARNOLD ANACLETO SILVA 06 September 2011 (has links)
[en] In this work, the PAC-MFCC feature combined with sub-band classifiers is proposed for the task of text-independent speaker identification in noise. The proposed scheme is compared with the MFCC (Mel-Frequency Cepstral Coefficients), PAC-MFCC (Phase Autocorrelation MFCC) without sub-band classifiers, SSCH (Subband Spectral Centroid Histograms) and TECC (Teager Energy Cepstrum Coefficients) features. For this recognition task we used the TIMIT database, which consists of 630 speakers, each speaking 10 utterances of approximately 3 seconds; eight utterances per speaker were used for training and two for testing, giving a total of 1260 test utterances. We investigated the performance of these techniques using different types of noise from the Noisex-92 database at different signal-to-noise ratios. It was found that the PAC-MFCC feature with sub-band classifiers achieves the best accuracy in comparison with the other techniques at low signal-to-noise ratios (below 10 dB).
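The sketch below illustrates only the sub-band score-fusion idea, using ordinary mel sub-band cepstra and per-speaker GMMs (librosa, scipy and scikit-learn assumed); it does not implement the PAC-MFCC feature itself, and the band split, model sizes and data layout are assumptions.

```python
# Hedged sketch: split the log-mel spectrogram into a few frequency bands, take a
# small cepstrum per band, train one GMM per band per speaker, and fuse the
# per-band log-likelihoods with equal weights. All settings are illustrative.
import numpy as np
import librosa
from scipy.fftpack import dct
from sklearn.mixture import GaussianMixture

N_MELS, N_BANDS = 40, 4

def subband_cepstra(path):
    """Return a list of per-band cepstral feature matrices for one recording."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS)
    logmel = np.log(mel + 1e-10).T                          # (frames, N_MELS)
    bands = np.array_split(logmel, N_BANDS, axis=1)
    return [dct(b, axis=1, norm="ortho")[:, :5] for b in bands]

def train_speaker(paths, n_components=16):
    """One diagonal-covariance GMM per sub-band for one speaker's training files."""
    per_band = [np.vstack(b) for b in zip(*(subband_cepstra(p) for p in paths))]
    return [GaussianMixture(n_components=n_components, covariance_type="diag").fit(b)
            for b in per_band]

def score(path, band_models):
    """Equal-weight fusion: sum of per-band average log-likelihoods."""
    return sum(m.score(b) for m, b in zip(band_models, subband_cepstra(path)))

# speaker_models = {spk: train_speaker(files) for spk, files in train_files.items()}
# identified = max(speaker_models, key=lambda s: score("test.wav", speaker_models[s]))
```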
30

Improved MFCC Front End Using Spectral Maxima For Noisy Speech Recognition

Sujatha, J 11 1900 (has links) (PDF)
No description available.
