Global ETD Search

41	Rozpoznávání mluvčího na mobilním telefonu / Speaker Recognition on Mobile Phone Pešán, Jan January 2011 This thesis focuses on implementing a computer-based speaker recognition system on a mobile phone. It describes the principle, operation, and implementation of the recognizer on the Nokia N900 mobile phone.
42	Development of a text-independent automatic speaker recognition system Mokgonyane, Tumisho Billson January 2021 Thesis (M. Sc. (Computer Science)) -- University of Limpopo, 2021 / The task of automatic speaker recognition, wherein a system verifies or identifies speakers from a recording of their voices, has been researched for several decades. However, research in this area has been carried out largely on freely accessible speaker datasets built on well-resourced languages such as English. This study undertakes automatic speaker recognition research focused on a low-resourced language, Sepedi. As one of the 11 official languages in South Africa, Sepedi is spoken by at least 2.8 million people. Pre-recorded voices were acquired from a national speech and language repository, the National Centre for Human Language Technology (NCHLT), from which we selected the Sepedi NCHLT Speech Corpus. The open-source pyAudioAnalysis Python library was used to extract three types of acoustic features of speech, namely time, frequency and cepstral domain features, from the acquired speech data. The effects and compatibility of these acoustic features were investigated. It was observed that combining the three acoustic features of speech had a more significant effect than using individual features as far as speaker recognition accuracy is concerned. The study also investigated the performance of machine learning algorithms on low-resourced languages such as Sepedi. Five machine learning (ML) algorithms implemented in Scikit-learn, namely K-nearest neighbours (KNN), support vector machines (SVM), random forest (RF), logistic regression (LR), and multi-layer perceptrons (MLP), were used to train different classifier models. The GridSearchCV algorithm, also implemented in Scikit-learn, was used to determine ideal hyper-parameters for each of the five ML algorithms. 
The classifier models were evaluated on recognition accuracy, and the results show that the MLP classifier, with a recognition accuracy of 98%, outperforms the KNN, RF, LR and SVM classifiers. A graphical user interface (GUI) was developed, and the best-performing classifier model, MLP, was deployed on it for real-time speaker identification and verification tasks. Participants were recruited to evaluate the GUI's performance, and acceptable results were obtained. Automatic speaker recognition Recording of voices Graphical user interface Automatic speech recognition Speech processing systems Icons (Computer graphics)
43	Novel Architectures for Human Voice and Environmental Sound Recognition using Machine Learning Algorithms Dhakal, Parashar January 2018 No description available. Computer Engineering Electrical Engineering classifiers end-to-end architecture feature extraction machine learning speaker recognition voice interface background sound identification
44	Measuring, refining and calibrating speaker and language information extracted from speech Brummer, Niko 12 1900 Thesis (PhD (Electrical and Electronic Engineering))--University of Stellenbosch, 2010. / ENGLISH ABSTRACT: We propose a new methodology, based on proper scoring rules, for the evaluation of the goodness of pattern recognizers with probabilistic outputs. The recognizers of interest take an input, known to belong to one of a discrete set of classes, and output a calibrated likelihood for each class. This is a generalization of the traditional use of proper scoring rules to evaluate the goodness of probability distributions. A recognizer with outputs in well-calibrated probability distribution form can be applied to make cost-effective Bayes decisions over a range of applications, having different cost functions. A recognizer with likelihood output can additionally be employed for a wide range of prior distributions for the to-be-recognized classes. We use automatic speaker recognition and automatic spoken language recognition as prototypes of this type of pattern recognizer. The traditional evaluation methods in these fields, as represented by the series of NIST Speaker and Language Recognition Evaluations, evaluate hard decisions made by the recognizers. This makes these recognizers cost-and-prior-dependent. The proposed methodology generalizes that of the NIST evaluations, allowing for the evaluation of recognizers which are intended to be usefully applied over a wide range of applications, having variable priors and costs. The proposal includes a family of evaluation criteria, where each member of the family is formed by a proper scoring rule. We emphasize two members of this family: (i) A non-strict scoring rule, directly representing error-rate at a given prior. 
(ii) The strict logarithmic scoring rule, which represents information content, or equivalently represents summarized error-rate, or expected cost, over a wide range of applications. We further show how to form a family of secondary evaluation criteria which, by contrast with the primary criteria, provide an analysis of the goodness of calibration of the recognizers' likelihoods. Finally, we show how to use the logarithmic scoring rule as an objective function for the discriminative training of fusion and calibration of speaker and language recognizers. / AFRIKAANS SUMMARY: We show how to represent, measure, calibrate and optimize the uncertainty in the output of automatic speaker recognition and language recognition systems. This makes the existing technology more accurate, more effective and more generally applicable. Automatic speaker recognition Automatic spoken language recognition Proper scoring rule Calibration Dissertations -- Electronic engineering Theses -- Electronic engineering Automatic speech recognition Speech processing systems
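As an illustration of the logarithmic scoring rule mentioned in record 44, the following is a minimal Python sketch of one commonly used form of that cost for a two-class (target / non-target) recognizer, often written Cllr. The abstract does not state the formula, so the exact form used here, the function name `cllr`, and the assumption that scores are natural-log likelihood ratios are all illustrative assumptions rather than the thesis's own definitions.

```python
import math


def cllr(target_llrs, nontarget_llrs):
    """Logarithmic scoring-rule cost, in bits, for log-likelihood-ratio scores.

    Assumption: scores are natural-log likelihood ratios, positive values
    favouring the target hypothesis. The name `cllr` is illustrative.
    """

    def softplus_bits(x):
        # log2(1 + exp(x)), arranged so that exp() never overflows
        if x > 0:
            return (x + math.log1p(math.exp(-x))) / math.log(2)
        return math.log1p(math.exp(x)) / math.log(2)

    # Penalty for target trials grows as their scores become more negative.
    cost_target = sum(softplus_bits(-s) for s in target_llrs) / len(target_llrs)
    # Penalty for non-target trials grows as their scores become more positive.
    cost_nontarget = sum(softplus_bits(s) for s in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (cost_target + cost_nontarget)
```

Under these assumptions, a recognizer that always outputs a log-likelihood ratio of 0 (no information) scores 1 bit, confident correct scores approach 0, and confident wrong scores are penalized without bound, which is what makes the rule sensitive to calibration as well as discrimination.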
45	Speaker recognition by voice / Asmens atpažinimas pagal balsą Kamarauskas, Juozas 15 June 2009 This dissertation investigates speaker recognition by voice. It reviews speaker recognition systems, their evolution, recognition problems, feature systems, and the speaker modeling and matching methods used in text-independent and text-dependent speaker recognition. A text-independent speaker recognition system was developed in this work. The Gaussian mixture model approach was used for speaker modeling and pattern matching. An automatic method for voice activity detection is proposed. This method is fast and does not require any additional actions from the user, such as indicating samples of the speech signal and noise. A feature system is proposed that consists of parameters of the excitation source (glottal) and parameters of the vocal tract. The fundamental frequency was taken as the excitation source parameter, and four formants with three antiformants were taken as parameters of the vocal tract. To equalize the dispersions of the lower and higher formants and antiformants, we propose using them on the mel-frequency scale. Standard mel-frequency cepstral coefficients (MFCC), the baseline features in speech and speaker recognition, were also implemented in the recognition system for comparison. The speaker recognition experiments showed that the proposed feature system outperformed standard mel-frequency cepstral coefficients. The equal error rate (EER) was equal to 5.17% using proposed... Informatics Engineering Automatic speaker recognition system Gaussian mixture models Formants Antiformants Pitch
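Record 45 proposes expressing formants and antiformants on the mel-frequency scale so that lower and higher ones have comparable dispersions. The Python sketch below shows why such a mapping has that effect. The abstract does not say which mel formula the author used; the conversion here (2595 · log10(1 + f/700)) is one widely used variant and is an assumption, as are the function name `hz_to_mel` and the example frequencies.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mels.

    Assumption: uses the common 2595 * log10(1 + f / 700) variant; the
    dissertation may have used a different mel formula.
    """
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


# Hypothetical example: the same 200 Hz spread around a low and a high formant.
low_spread = hz_to_mel(600.0) - hz_to_mel(400.0)
high_spread = hz_to_mel(3600.0) - hz_to_mel(3400.0)
```

With this formula, a 200 Hz spread around a 3500 Hz formant maps to a much smaller mel spread than the same 200 Hz spread around a 500 Hz formant. Higher formants typically vary more in Hz, so the compression brings their dispersions closer to those of the lower formants.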
46	Reconhecimento automático de locutor em modo independente de texto por Self-Organizing Maps. / Text independent automatic speaker recognition using Self-Organizing Maps. Mafra, Alexandre Teixeira 18 December 2002 The design of machines that can identify people is a problem whose solution has a wide range of applications. Software systems based on measurements of personal physical attributes (biometrics) are beginning to be produced on a commercial scale. Automatic Speaker Recognition systems fall into this category, using voice as the identifying attribute. At present, the most popular methods are based on the extraction of mel-frequency cepstral coefficients (MFCCs), followed by speaker identification by Hidden Markov Models (HMMs), Gaussian Mixture Models (GMMs) or vector quantization. This preference is motivated by the quality of the results these methods obtain. Making these systems robust, so that they remain effective in noisy environments, is now a major concern. Just as relevant are the problems of performance degradation in applications involving a large number of speakers, and the possibility of fraud using recorded voices. Another important subject is embedding these systems as sub-systems of existing devices, enabling them to work according to their operator. This work presents the relevant concepts and algorithms concerning the implementation of a text-independent Automatic Speaker Recognition software system. First, the voice signal processing and the extraction of its essential features for recognition are treated. Next, the way each speaker's voice is modeled by a Self-Organizing Map (SOM) neural network is described, along with the method for comparing the models' responses when a new utterance from an unknown speaker is presented. Finally, the construction of the speech corpus used for training and testing the models, the neural network architectures tested, and the experimental results obtained in a speaker identification task are described. neural networks Self-Organizing Maps SOM speaker recognition speech recognition vector quantization
47	The face in your voice–how audiovisual learning benefits vocal communication Schall, Sonja 12 September 2014 The face and voice of a person are strongly associated with each other and are usually perceived as a single entity. Despite the natural co-occurrence of faces and voices, brain research has traditionally approached their perception from a unisensory perspective. This means that research into face perception has focused exclusively on the visual system, while research into voice perception has probed only the auditory system. In this thesis, I suggest that the brain has adapted to the multisensory nature of faces and voices and that this adaptation is evident even when one input stream is missing, that is, when input is actually unisensory. Specifically, the current work investigates how the brain exploits previously learned voice-face associations to optimize the auditory processing of voices and vocal speech. This thesis comprises three empirical studies providing spatiotemporal brain data obtained via functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG). All data were acquired while participants listened to auditory-only speech samples of previously familiarized speakers (with or without seeing the speakers’ faces). Three key findings demonstrate that previously learned visual speaker information supports the auditory analysis of vocal sounds: (i) face-sensitive areas were part of the sensory network activated by voices, (ii) the auditory analysis of voices was temporally facilitated by learned facial associations and (iii) multisensory interactions between face- and voice/speech-sensitive regions were increased. The current work challenges traditional unisensory views on vocal perception and instead suggests that voice and vocal speech perception profit from a multisensory neural processing scheme. fMRI Face Voice Person Recognition Speech Speaker Recognition Multisensory Neural Mechanisms MEG 150 Psychology 11 Psychology ddc:150
48	Reconhecimento automático de locutor em modo independente de texto por Self-Organizing Maps. / Text independent automatic speaker recognition using Self-Organizing Maps. Alexandre Teixeira Mafra 18 December 2002 The design of machines that can identify people is a problem whose solution has a wide range of applications. Software systems based on measurements of personal physical attributes (biometrics) are beginning to be produced on a commercial scale. Automatic Speaker Recognition systems fall into this category, using voice as the identifying attribute. At present, the most popular methods are based on the extraction of mel-frequency cepstral coefficients (MFCCs), followed by speaker identification by Hidden Markov Models (HMMs), Gaussian Mixture Models (GMMs) or vector quantization. This preference is motivated by the quality of the results these methods obtain. Making these systems robust, so that they remain effective in noisy environments, is now a major concern. Just as relevant are the problems of performance degradation in applications involving a large number of speakers, and the possibility of fraud using recorded voices. Another important subject is embedding these systems as sub-systems of existing devices, enabling them to work according to their operator. This work presents the relevant concepts and algorithms concerning the implementation of a text-independent Automatic Speaker Recognition software system. First, the voice signal processing and the extraction of its essential features for recognition are treated. Next, the way each speaker's voice is modeled by a Self-Organizing Map (SOM) neural network is described, along with the method for comparing the models' responses when a new utterance from an unknown speaker is presented. Finally, the construction of the speech corpus used for training and testing the models, the neural network architectures tested, and the experimental results obtained in a speaker identification task are described. neural networks Self-Organizing Maps SOM speaker recognition speech recognition vector quantization
49	Die supramodale Verarbeitung individueller Konzepte am Beispiel menschlicher Stimmen und visuell präsentierter Comicfiguren : eine fMRT-Studie der Temporallappen / Supramodal processing of unique entities using human voices and drawings of cartoon characters : an fMRI study on the temporal lobes Bethmann, Anja January 2012 It is assumed that neural pathways run from the primary sensory cortices through the temporal lobes towards their poles, crossing areas necessary for object recognition. In particular, the most anterior temporal parts have been associated with processes contributing to the identification of objects. Yet there is little agreement on the kinds of objects that are processed by the anterior temporal lobes. For example, there are assumptions regarding linguistic processing, voice recognition, the processing of general semantic information or the identification of unique entities. To differentiate between these theories, four event-related fMRI experiments were performed in healthy young adults. In three experiments, the subjects heard the voices of famous and unknown persons. In addition, characteristic sounds of animals and musical instruments were presented in one of these experiments. During the fourth experiment, drawings of famous cartoon characters were shown together with drawings of animals and of fruit and vegetables. The neural activity in response to these stimuli compared to rest was analyzed using a regions-of-interest approach. Twelve regions of interest that covered the majority of the temporal lobes were defined in each hemisphere. With both auditory and visual stimuli, there were clear activation differences between the semantic categories in the anterior temporal lobes. Unique entities (human voices and cartoon characters) evoked a significantly stronger signal than categorical concepts (animals, musical instruments, fruit and vegetables). Furthermore, the signal in response to voices of familiar persons was significantly higher than that to unfamiliar voices. Thus, the results are most compatible with the assumption that the anterior temporal lobes process supramodal features of familiar unique entities. As the aforementioned signal differences between unique and categorical concepts and between familiar and unfamiliar voices increased from the transverse temporal gyri towards the temporal poles, the results support the notion of a ventral processing pathway running rostrally through the temporal lobes. In accordance with the convergence zone theory described by A. R. Damasio, the specific function of that pathway seems to consist in the incremental combination of sensorimotor concept features. Since familiar unique entities possess an especially high number of features, their processing was found to extend into more anterior portions of the temporal lobe than the processing of unfamiliar or categorical concepts. voice processing famous speaker recognition unique entities anterior temporal lobes hemispheric differences Language, Linguistics
50	Improved GMM-Based Classification Of Music Instrument Sounds Krishna, A G 05 1900 This thesis is concerned with the recognition of music instruments from isolated notes. Music instrument recognition is a relatively nascent problem that is fast gaining importance, not only because of its academic value but also because of its potential to enable applications such as music content analysis and music transcription. Line spectral frequencies are proposed as features for music instrument recognition and are shown to perform better than Mel filtered cepstral coefficients and linear prediction cepstral coefficients. Assuming a linear model of sound production, features based on the prediction residual, which represents the excitation signal, are proposed. Four improvements are proposed for classification using Gaussian mixture model (GMM) based classifiers. One of them involves characterizing the regions of overlap between classes in the feature space to improve classification. Applications to music instrument recognition and speaker recognition are shown. An experiment is proposed for discovering the hierarchy of music instruments in a data-driven manner. The hierarchy thus discovered closely corresponds to the hierarchy defined by musicians and experts, and therefore shows that the feature space has successfully captured the features required for music instrument characterization. Sound-Pattern Perception Music Instrument Recognition Speaker Recognition Gaussian Mixture Models GMM MIR Speaker Identification Speaker Segmentation Music Instruments Improved Classification Communication Engineering
