641 |
Probabilistic Modelling of Hearing: Speech Recognition and Optimal Audiometry. Stadler, Svante, January 2009.
Hearing loss afflicts as many as 10% of our population. Fortunately, technologies designed to alleviate the effects of hearing loss are improving rapidly, including cochlear implants and the increasing computing power of digital hearing aids. This thesis focuses on theoretically sound methods for improving hearing aid technology. The main contributions are documented in three research articles, which treat two separate topics: modelling of human speech recognition (Papers A and B) and optimization of diagnostic methods for hearing loss (Paper C). Papers A and B present a hidden Markov model-based framework for simulating speech recognition in noisy conditions using auditory models and signal detection theory. In Paper A, a model of normal and impaired hearing is employed, in which a subject's pure-tone hearing thresholds are used to adapt the model to the individual. In Paper B, the framework is modified to simulate hearing with a cochlear implant (CI). Two models of hearing with CI are presented: a simple, functional model and a biologically inspired model. The models are adapted to the individual CI user by simulating a spectral discrimination test. The framework can estimate speech recognition ability for a given hearing impairment or cochlear implant user. This estimate could potentially be used to optimize hearing aid settings. Paper C presents a novel method for sequentially choosing the sound level and frequency for pure-tone audiometry. A Gaussian mixture model (GMM) is used to represent the probability distribution of hearing thresholds at 8 frequencies. The GMM is fitted to over 100,000 hearing thresholds from a clinical database. After each response, the GMM is updated using Bayesian inference. The sound level and frequency are chosen so as to maximize a predefined objective function, such as the entropy of the probability distribution. It is found through simulation that an average of 48 tone presentations are needed to achieve the same accuracy as the standard method, which requires an average of 135 presentations.
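Paper C's procedure lends itself to a short illustration. The sketch below is a minimal, simplified stand-in rather than the thesis implementation: the GMM prior over 8-frequency threshold vectors is approximated by weighted samples, each yes/no response reweights them through an assumed sigmoid psychometric function, and the next tone is chosen where the predicted response is most uncertain (a common simplification of the entropy objective). The frequency and level grids, the prior, and all numbers are illustrative.

```python
# Minimal sketch of entropy-driven pure-tone audiometry (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
FREQS = [125, 250, 500, 1000, 2000, 4000, 8000, 16000]   # Hz (assumed grid)
LEVELS = np.arange(-10, 105, 5)                          # dB HL (assumed grid)

# Stand-in 2-component GMM prior over 8-dim threshold vectors (illustrative).
weights = [0.6, 0.4]
means = [np.full(8, 15.0), np.full(8, 45.0)]
covs = [np.eye(8) * 150.0, np.eye(8) * 200.0]

# Approximate the GMM by weighted particles so the Bayesian update stays simple.
N = 5000
comp = rng.choice(2, size=N, p=weights)
particles = np.array([rng.multivariate_normal(means[c], covs[c]) for c in comp])
w = np.full(N, 1.0 / N)

def p_yes(level, thresholds, slope=0.2):
    """Sigmoid psychometric function: P(heard | presented level, true threshold)."""
    return 1.0 / (1.0 + np.exp(-slope * (level - thresholds)))

def select_stimulus():
    """Pick the (frequency index, level) whose predicted response entropy is maximal."""
    best, best_h = None, -1.0
    for fi in range(len(FREQS)):
        for level in LEVELS:
            p = np.sum(w * p_yes(level, particles[:, fi]))
            h = -(p * np.log2(p + 1e-12) + (1 - p) * np.log2(1 - p + 1e-12))
            if h > best_h:
                best, best_h = (fi, level), h
    return best

def update(fi, level, heard):
    """Bayesian reweighting of the particles after one observed response."""
    global w
    like = p_yes(level, particles[:, fi])
    w = w * (like if heard else 1.0 - like)
    w /= w.sum()

# One simulated presentation against a hypothetical true threshold vector.
true_thr = np.full(8, 30.0)
fi, level = select_stimulus()
heard = rng.random() < p_yes(level, true_thr[fi])
update(fi, level, heard)
print(FREQS[fi], level, heard, np.sum(w * particles[:, 0]))  # posterior mean at 125 Hz
```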
|
642 |
Machine Learning Methods for Articulatory Data. Berry, Jeffrey James, January 2012.
Humans make use of more than just the audio signal to perceive speech. Behavioral and neurological research has shown that a person's knowledge of how speech is produced influences what is perceived. With methods for collecting articulatory data becoming more ubiquitous, methods for extracting useful information are needed to make these data useful to speech scientists and for speech technology applications. This dissertation presents feature extraction methods for ultrasound images of the tongue and for data collected with an Electro-Magnetic Articulograph (EMA). The usefulness of these features is tested in several phoneme classification tasks. Feature extraction methods for ultrasound tongue images presented here consist of automatically tracing the tongue surface contour using a modified Deep Belief Network (DBN) (Hinton et al. 2006), and methods inspired by research in face recognition which use the entire image. The tongue tracing method consists of training a DBN as an autoencoder on concatenated images and traces, and then retraining the first two layers to accept only the image at runtime. This 'translational' DBN (tDBN) method is shown to produce traces comparable to those made by human experts. An iterative bootstrapping procedure is presented for using the tDBN to assist a human expert in labeling a new data set. Tongue contour traces are compared with the Eigentongues method (Hueber et al. 2007) and a Gabor Jet representation in a 6-class phoneme classification task using Support Vector Classifiers (SVC), with Gabor Jets performing the best. These SVC methods are compared to a tDBN classifier, which extracts features from raw images and classifies them with accuracy only slightly lower than the Gabor Jet SVC method. For EMA data, supervised binary SVC feature detectors are trained for each feature in three versions of Distinctive Feature Theory (DFT): Preliminaries (Jakobson et al. 1954), The Sound Pattern of English (Chomsky and Halle 1968), and Unified Feature Theory (Clements and Hume 1995). Each of these feature sets, together with a fourth, unsupervised feature set learned using Independent Components Analysis (ICA), is compared on its usefulness in a 46-class phoneme recognition task. Phoneme recognition is performed using a linear-chain Conditional Random Field (CRF) (Lafferty et al. 2001), which takes advantage of the temporal nature of speech by looking at observations adjacent in time. Results of the phoneme recognition task show that Unified Feature Theory performs slightly better than the other versions of DFT. Surprisingly, ICA actually performs worse than running the CRF on raw EMA data.
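As a rough illustration of the classification setup described above (not the dissertation's own code), the sketch below trains a support vector classifier on per-frame articulatory feature vectors. Random arrays stand in for the Gabor-jet or EMA features, and the class count mirrors the 6-class task.

```python
# Minimal sketch of an SVC phoneme-classification experiment on articulatory features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_frames, n_dims, n_classes = 1200, 40, 6      # e.g. a 6-class task
X = rng.normal(size=(n_frames, n_dims))        # placeholder feature vectors
y = rng.integers(0, n_classes, size=n_frames)  # placeholder phoneme labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))  # roughly chance on random data
```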
|
643 |
Towards a robust Arabic speech recognition system based on reservoir computing. Alalshekmubarak, Abdulrahman, January 2014.
In this thesis we investigate the potential of developing a speech recognition system based on a recently introduced artificial neural network (ANN) technique, namely Reservoir Computing (RC). This technique has, in theory, a higher capability for modelling dynamic behaviour compared to feed-forward ANNs due to the recurrent connections between the nodes in the reservoir layer, which serves as a memory. We conduct this study on the Arabic language (one of the most widely spoken languages in the world and the official language in 26 countries) because there is a serious gap in the literature on speech recognition systems for Arabic, making the potential impact high. The investigation covers a variety of tasks, including the implementation of the first reservoir-based Arabic speech recognition system. In addition, a thorough evaluation of the developed system is conducted, including several comparisons to other state-of-the-art models found in the literature and to baseline models. The impact of feature extraction methods is studied in this work, and a new biologically inspired feature extraction technique, namely the Auditory Nerve feature, is applied to the speech recognition domain. Comparing different feature extraction methods requires access to the original recorded sound, which is not possible with the only publicly accessible Arabic corpus. We have therefore developed the largest public Arabic corpus for isolated words, which contains roughly 10,000 samples. Our investigation has led us to develop two novel approaches based on reservoir computing, ESNSVMs (Echo State Networks with Support Vector Machines) and ESNEKMs (Echo State Networks with Extreme Kernel Machines). These aim to improve the performance of the conventional RC approach by proposing different readout architectures. These two approaches have been compared to the conventional RC approach and other state-of-the-art systems. Finally, the developed approaches have been evaluated in the presence of different types and levels of noise to examine their resilience to noise, which is crucial for real-world applications.
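The ESNSVM idea can be sketched compactly. The following is an assumed, minimal illustration rather than the thesis system: a fixed random reservoir with leaky-tanh units maps each feature sequence to a final state vector, and an SVM readout is trained on those states. The reservoir size, leak rate, spectral radius and the synthetic "utterances" are all placeholder choices.

```python
# Minimal sketch of an echo state network with an SVM readout (ESNSVM-style).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_in, n_res, leak, rho = 13, 200, 0.3, 0.9     # assumed hyper-parameters

W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))  # rescale to spectral radius rho

def reservoir_state(seq):
    """Run one sequence (T x n_in) through the leaky reservoir; return the final state."""
    x = np.zeros(n_res)
    for u in seq:
        x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
    return x

# Synthetic "utterances": 100 variable-length feature sequences, 2 fake word classes.
seqs = [rng.normal(size=(rng.integers(30, 60), n_in)) for _ in range(100)]
labels = rng.integers(0, 2, size=100)
states = np.array([reservoir_state(s) for s in seqs])

readout = SVC(kernel="rbf", gamma="scale").fit(states[:80], labels[:80])
print("toy accuracy:", readout.score(states[80:], labels[80:]))
```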
|
644 |
Distributed Recognition for Continuous Speech in Large Vocabulary Brazilian Portuguese. Vladimir Fabregas Surigue de Alencar, 05 October 2009.
This thesis explores several approaches for improving the performance of large-vocabulary automatic speech recognition systems for Brazilian Portuguese when applied in a distributed scenario (distributed speech recognition). For this purpose, a continuous speech recognition database for Brazilian Portuguese was constructed with 100 speakers, each uttering 1000 phonetically balanced sentences. The recordings were made in a studio (a noise-free environment), with a recording specification able to accommodate the input of the various speech codecs used in cellular mobile and IP telephony, in particular the ITU-T G.723.1, AMR-NB and AMR-WB codecs. To work properly, automatic speech recognition systems require that the recognition features be extracted at a high rate; the speech codecs for cellular mobile and IP telephony, however, normally generate their parameters at lower rates, which degrades recognizer performance. Linear interpolation in the LSF (Line Spectral Frequencies) domain is usually used to address this problem. This thesis proposes performing the interpolation with a digital interpolation filter, which is shown to give much better recognition performance than linear interpolation. The use of interpolated ISFs (Immittance Spectral Frequencies) as recognition features was also evaluated; like the LSFs, they proved inadequate for this purpose. Another aspect of fundamental importance for distributed speech recognizers is the recovery of lost packets, which has a direct impact on recognition performance. Coders normally insert zeros in place of lost packets or linearly interpolate the received packets to restore them. A new technique based on neural networks is proposed in this thesis and shown to be more efficient at restoring these lost packets for the purpose of speech recognition.
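The feature-rate problem and the two interpolation strategies can be illustrated with a short sketch. It assumes 10-dimensional LSF vectors at 50 frames per second upsampled to roughly 100 frames per second, and it is not the thesis's digital interpolation filter, only a generic polyphase FIR interpolator contrasted with per-track linear interpolation.

```python
# Minimal sketch: raising the LSF frame rate by linear vs. FIR interpolation.
import numpy as np
from scipy.signal import resample_poly

rng = np.random.default_rng(0)
n_frames, order = 50, 10
lsf = np.sort(rng.uniform(0.05, 3.1, size=(n_frames, order)), axis=1)  # fake LSF tracks

# (a) Linear interpolation: double the frame rate by interpolating each track.
t_in = np.arange(n_frames)
t_out = np.arange(0, n_frames - 1 + 0.5, 0.5)
lsf_lin = np.stack([np.interp(t_out, t_in, lsf[:, k]) for k in range(order)], axis=1)

# (b) Digital interpolation filter: polyphase FIR upsampling by 2 along the time axis.
lsf_fir = resample_poly(lsf, up=2, down=1, axis=0)

print(lsf_lin.shape, lsf_fir.shape)  # roughly twice as many frames in both cases
```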
|
645 |
Sensitivity analysis of blind separation of speech mixtures. Unknown date.
Blind source separation (BSS) refers to a class of methods by which multiple sensor signals are combined with the aim of estimating the original source signals. Independent component analysis (ICA) is one such method that effectively resolves static linear combinations of independent non-Gaussian distributions. We propose a method that can track variations in the mixing system by seeking a compromise between adaptive and block methods through the use of mini-batches. The resulting permutation indeterminacy is resolved based on the correlation continuity principle. Methods employing higher-order cumulants in the separation criterion are susceptible to outliers in the finite-sample case. We propose a robust method based on low-order non-integer moments that exploits the Laplacian model of speech signals. We study separation methods for even- and over-determined linear convolutive mixtures in the frequency domain based on joint diagonalization of matrices employing time-varying second-order statistics. We investigate the factors affecting the sensitivity of the solution in the finite-sample case, such as the set size, overlap amount and cross-spectrum estimation method. / by Savaskan Bulek. / Thesis (Ph.D.)--Florida Atlantic University, 2010. / Includes bibliography. / Electronic reproduction. Boca Raton, Fla., 2010. Mode of access: World Wide Web.
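Two of the ideas above, mini-batch separation and permutation alignment by correlation continuity, can be sketched as follows. This is an assumed toy version, not the dissertation's method: FastICA is run on overlapping mini-batches of a static two-channel mixture, and each batch's output ordering is chosen so that the overlapping samples correlate best with the previous batch.

```python
# Minimal sketch: mini-batch ICA with permutation alignment across batches.
import numpy as np
from itertools import permutations
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
T, batch, hop = 8000, 1000, 500                 # 50% overlap between mini-batches
s = np.vstack([rng.laplace(size=T), rng.laplace(size=T)])   # Laplacian-like "speech"
x = rng.normal(size=(2, 2)) @ s                 # static 2x2 instantaneous mixture (toy case)

prev = None
for start in range(0, T - batch + 1, hop):
    seg = x[:, start:start + batch].T
    y = FastICA(n_components=2, random_state=0).fit_transform(seg).T
    if prev is not None:
        # Align this batch's ordering to the previous one using the overlapping
        # samples: keep the permutation with the highest absolute correlation.
        ov_new, ov_old = y[:, :hop], prev[:, -hop:]
        best = max(permutations(range(2)),
                   key=lambda p: sum(abs(np.corrcoef(ov_new[p[i]], ov_old[i])[0, 1])
                                     for i in range(2)))
        y = y[list(best)]
    prev = y
print("processed", (T - batch) // hop + 1, "mini-batches")
```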
|
646 |
Applicability and Benefits of Digital Speech Recognition in Radiological Diagnostics. Arndt, Holger, 17 February 1999.
Purpose: To test the applicability and benefits of digital speech recognition in radiological diagnostics using the speech recognition system SP 6000. Methods: The speech recognition system SP 6000 was integrated into the institute's network and connected to the existing Radiological Information System (RIS). Three test users dictated 2305 reports with this system. For every dictation, the date, dictation length, time required for checking and correction, type of examination, and error rate after the recognition process were recorded. The results were compared with 625 reports written conventionally by the same users. Results: After a one-hour initial training session, average error rates were 8.4-13.3%. The first adaptation of the speech recognition system (after 9 working days) reduced the average error rate to 2.4-10.7%, owing to the program's ability to learn. The second and third adaptations produced only small changes in the error rate. An inter-individual comparison of how the error rate developed for the same type of examination showed that the error rate is relatively independent of the individual user. Conclusion: Given these results, the digital speech recognition system SP 6000 can be regarded as an advantageous alternative for the rapid production of radiological reports. Comparing the time needed to write reports with the time needed to dictate them reveals individual differences in typing speed and thus a time advantage for reporting via speech recognition at normal keyboard skill levels.
|
647 |
Automatic phonological transcription using forced alignment: FAVE toolkit performance on four non-standard varieties of English. Sella, Valeria, January 2018.
Forced alignment, a speech recognition technique that performs semi-automatic phonological transcription, constitutes a methodological revolution in the recent history of linguistic research. Its use is progressively becoming the norm in research fields such as sociophonetics, but its general performance and range of applications have been relatively understudied. This thesis investigates the performance and portability of the Forced Alignment and Vowel Extraction program suite (FAVE), an aligner that was trained on, and designed to study, American English. FAVE was tested on four non-American varieties of English (Scottish, Irish, Australian and Indian English) and a control variety (General American). First, the performance of FAVE was compared with that of human annotators, and then it was tested on three potentially problematic variables: /p, t, k/ realization, rhotic consonants and /l/. Although FAVE was found to perform significantly differently from human annotators on identical datasets, further analysis revealed that the aligner performed quite similarly on the non-standard varieties and the control variety, suggesting that the difference in accuracy does not constitute a major drawback to its extended usage. The study discusses the implications of the findings in relation to doubts expressed about the usage of such technology and argues for a wider implementation of forced alignment tools such as FAVE in sociophonetic research.
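The aligner-versus-annotator comparison can be illustrated with a small sketch. The segment format below is made up (it is not FAVE's actual output format); it simply computes boundary offsets between an automatic and a manual segmentation of the same utterance and summarises them as mean absolute displacement and agreement within a 20 ms tolerance.

```python
# Minimal sketch: comparing forced-alignment boundaries with manual annotation.
auto = [("DH", 0.000, 0.050), ("AH", 0.050, 0.130), ("K", 0.130, 0.210)]   # aligner output (made up)
manual = [("DH", 0.000, 0.045), ("AH", 0.045, 0.150), ("K", 0.150, 0.210)] # human annotation (made up)

offsets = [abs(a[2] - m[2]) for a, m in zip(auto, manual)]   # segment-end offsets in seconds
mean_abs = sum(offsets) / len(offsets)
within_20ms = sum(o <= 0.020 for o in offsets) / len(offsets)
print(f"mean |offset| = {mean_abs * 1000:.1f} ms, within 20 ms: {within_20ms:.0%}")
```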
|
648 |
A framework for multimodal interfaces development in ubiquitous computing applications. Inacio Junior, Valter dos Reis, 26 April 2007.
Multimodal interfaces process several types of user input, such as voice, gestures, and pen interaction, in a manner that is combined and coordinated with the system's multimedia output. Applications that support multimodality provide a more natural and flexible way of executing tasks with computers, since they allow users with different levels of ability to choose the mode of interaction that best fits their needs. The use of interfaces that move beyond the conventional keyboard-and-mouse style of interaction is in line with the concept of ubiquitous computing, which has become established as a research area studying the social and technological aspects arising from the integration of computing systems and devices into everyday environments. In this context, the work reported here investigated the implementation of multimodal interfaces in ubiquitous computing applications by building a software framework for integrating handwriting and speech modalities.
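A sketch of the kind of integration layer such a framework needs is given below. It is purely illustrative and not the thesis framework's API: recognizers for each modality push timestamped events, and a simple fusion step pairs a spoken command with the pen gesture that arrives closest to it within a time window.

```python
# Minimal sketch of time-windowed fusion of speech and pen events.
from dataclasses import dataclass

@dataclass
class ModalityEvent:
    modality: str      # "speech" or "pen"
    payload: str       # recognized word or gesture label
    t: float           # timestamp in seconds

def fuse(events, window=1.0):
    """Pair each speech event with the nearest pen event within `window` seconds."""
    speech = [e for e in events if e.modality == "speech"]
    pen = [e for e in events if e.modality == "pen"]
    pairs = []
    for s in speech:
        near = [p for p in pen if abs(p.t - s.t) <= window]
        if near:
            pairs.append((s.payload, min(near, key=lambda p: abs(p.t - s.t)).payload))
    return pairs

events = [ModalityEvent("speech", "delete", 2.1), ModalityEvent("pen", "circle-object-3", 2.4)]
print(fuse(events))   # [('delete', 'circle-object-3')]
```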
|
649 |
Detection of Nonstationary Noise and Improved Voice Activity Detection in an Automotive Hands-free Environment. Laverty, Stephen William, 11 May 2005.
Speech processing in the automotive environment is a challenging problem due to the presence of powerful and unpredictable nonstationary noise. This thesis addresses two detection problems involving both nonstationary noise signals and nonstationary desired signals. Two detectors are developed: one to detect passing vehicle noise in the presence of speech and one to detect speech in the presence of passing vehicle noise. The latter is then measured against a state-of-the-art voice activity detector used in telephony. The process of compiling a library of recordings in the automobile to facilitate this research is also detailed.
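For context, the sketch below shows the kind of baseline frame-energy voice activity detector against which the problem above is posed. It is a generic textbook detector, not the one developed in the thesis, and it is exactly the sort of scheme that struggles with powerful nonstationary car noise such as passing vehicles.

```python
# Minimal sketch of a frame-energy voice activity detector (baseline only).
import numpy as np

def simple_vad(x, fs=8000, frame_ms=20, margin_db=6.0):
    """Return a boolean speech/non-speech decision per frame."""
    n = int(fs * frame_ms / 1000)
    frames = x[:len(x) // n * n].reshape(-1, n)
    log_e = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    noise_floor = np.percentile(log_e, 20)      # crude stationary-noise estimate
    return log_e > noise_floor + margin_db

fs = 8000
t = np.arange(fs) / fs
x = 0.01 * np.random.default_rng(0).normal(size=fs)            # stand-in "car noise"
x[3000:5000] += 0.2 * np.sin(2 * np.pi * 200 * t[3000:5000])   # a voiced burst
print(simple_vad(x, fs).astype(int))
```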
|
650 |
Application of LabVIEW and myRIO to voice controlled home automation. Lindstål, Tim and Marklund, Daniel, January 2019.
The aim of this project is to use the NI myRIO and LabVIEW for voice-controlled home automation. The NI myRIO is an embedded device with a Xilinx FPGA, a dual-core ARM Cortex-A9 processor, and analog and digital input/output; it is programmed with LabVIEW, a graphical programming language. The voice control is implemented in two different systems. The first system is based on an Amazon Echo Dot for voice recognition, a commercial smart speaker developed by Amazon Lab126. Echo Dot devices are connected via the Internet to Alexa, the voice-controlled intelligent personal assistant service developed by Amazon, which is capable of voice interaction, music playback, and controlling smart devices for home automation. In this system the myRIO is mainly used for the wireless control of smart home devices; smart lamps, sensors, speakers and an LCD display were implemented. The second system focuses on the myRIO for speech recognition and was built on the myRIO with a microphone connected. The speech recognition was implemented using mel-frequency cepstral coefficients and dynamic time warping. A few commands could be recognized, including a wake word, "Bosse", as well as four other commands for controlling the colours of a smart lamp. The thesis project is shown to be successful, having demonstrated that a home automation implementation using the NI myRIO with two voice-controlled systems can correctly control home devices such as smart lamps, sensors, speakers and an LCD display.
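The MFCC-plus-DTW scheme named above can be sketched briefly. The code below is not the LabVIEW/myRIO implementation but a generic template-matching version of the same idea; it assumes librosa is available, and the wav file names are placeholders.

```python
# Minimal sketch: isolated-command recognition with MFCC features and DTW matching.
import numpy as np
import librosa

def mfcc_seq(path, sr=16000, n_mfcc=13):
    """Load an utterance and return its MFCC sequence (frames x coefficients)."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def dtw_cost(a, b):
    """Classic DTW with Euclidean frame distance; returns the total alignment cost."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

# Placeholder template recordings, one per command (file names are assumptions).
templates = {cmd: mfcc_seq(f"{cmd}.wav") for cmd in ["bosse", "red", "green", "blue", "white"]}
query = mfcc_seq("recorded_command.wav")
print(min(templates, key=lambda c: dtw_cost(query, templates[c])))  # best-matching command
```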
|