731 |
Normalizing Flow based Hidden Markov Models for Phone Recognition / Normalisering av flödesbaserade dolda Markov-modeller för fonemigenkänning. Ghosh, Anubhab, January 2020.
Phone recognition is a fundamental task in speech recognition and often serves a critical role in benchmarking. Researchers have addressed this task with a variety of models, using both generative and discriminative learning approaches. Among them, generative approaches such as Gaussian mixture model-based hidden Markov models have long been favored because of their mathematical tractability. However, the use of generative models such as hidden Markov models and their hybrid varieties has fallen out of fashion owing to a strong shift toward discriminative learning approaches, which have been found to perform better. The downside is that these approaches do not always ensure mathematical tractability or convergence guarantees, as opposed to their generative counterparts. The research problem was therefore to investigate whether the modeling capability of generative models could be augmented with neural-network-based architectures that are simultaneously mathematically tractable and expressive. Normalizing flows are a class of generative models that have recently garnered a lot of attention in the field of density estimation and offer a method for exact likelihood computation and inference. In this project, several varieties of normalizing-flow-based hidden Markov models were used for the task of phone recognition on the TIMIT dataset. It was found that these models and their mixture-model varieties outperformed classical generative models such as Gaussian mixture models. A decision-fusion approach using classical Gaussian and normalizing-flow-based mixtures showed competitive results compared to discriminative learning approaches. Further analysis based on classes of speech phones was carried out to compare the generative models used. Additionally, the robustness of these algorithms to noisy speech conditions was studied.
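As an illustration of the exact likelihood computation that makes normalizing flows attractive as HMM emission densities, the sketch below applies the change-of-variables rule with a single element-wise affine flow layer. The layer type, feature dimension and number of states are assumptions for illustration; the models studied in the thesis use richer flow architectures.

```python
# Minimal sketch of the change-of-variables rule behind normalizing flows: an
# invertible map f sends an observation x to a latent z with a simple base density,
# and log p(x) = log p_z(f(x)) + log|det df/dx|. Here f is one element-wise affine
# layer; real flows stack many richer invertible layers.
import numpy as np

class AffineFlow:
    """Element-wise affine flow z = (x - shift) * exp(-log_scale)."""
    def __init__(self, dim, rng):
        self.shift = rng.normal(size=dim)
        self.log_scale = rng.normal(scale=0.1, size=dim)

    def log_prob(self, x):
        z = (x - self.shift) * np.exp(-self.log_scale)
        # Standard-normal base density plus the log-determinant of the Jacobian.
        log_base = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=-1)
        log_det = -np.sum(self.log_scale)
        return log_base + log_det

# Used as an HMM emission density, each hidden state s owns its own flow, and the
# usual forward algorithm consumes log b_s(x_t) = flows[s].log_prob(x_t).
rng = np.random.default_rng(0)
flows = [AffineFlow(dim=13, rng=rng) for _ in range(3)]   # e.g. 3 states, 13-dim features
frame = rng.normal(size=13)
emission_loglikes = np.array([f.log_prob(frame) for f in flows])
```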
|
732 |
Malleable Contextual Partitioning and Computational Dreaming. Brar, Gurkanwal Singh, 20 January 2015.
Computer architecture is entering an era where hundreds of Processing Elements (PEs) can be integrated onto a single chip, even as decades-long, steady advances in instruction- and thread-level parallelism are coming to an end. And yet, conventional methods of parallelism fail to scale beyond 4-5 PEs, well short of the levels of parallelism found in the human brain. The human brain is able to maintain constant real-time performance as cognitive complexity grows virtually unbounded through our lifetime. Our underlying thesis is that contextual categorization leading to simplified algorithmic processing is crucial to the brain's performance efficiency. But, since the overheads of such reorganization are unaffordable in real time, we also observe the critical role of sleep and dreaming in the lives of all intelligent beings. Based on the importance of dream sleep in memory consolidation, we propose that it is also responsible for contextual reorganization. We target mobile device applications that can be personalized to the user, including speech, image and gesture recognition, as well as other kinds of personalized classification, which are arguably the foundation of intelligence. These algorithms rely on a knowledge database of symbols, where the database size determines the level of intelligence. Essential to achieving intelligence and a seamless user interface, however, is that real-time performance be maintained. Observing this, we define our chief performance goal as maintaining constant real-time performance against ever-increasing algorithmic and architectural complexities. Our solution is a method for Malleable Contextual Partitioning (MCP) that enables closer personalization to user behavior. We conceptualize a novel architectural framework, the Dream Architecture for Lateral Intelligence (DALI), that demonstrates the MCP approach. The DALI implements a dream phase to execute MCP in ideal MISD parallelism and reorganize its architecture to enable contextually simplified real-time operation. With speech recognition as an example application, we show that the DALI is successful in achieving the performance goal: it maintains constant real-time recognition, scaling almost ideally, with PE counts up to 16 and vocabulary sizes up to 220 words. / Master of Science
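A toy sketch of the contextual-partitioning idea is given below, under the assumption that the symbol knowledge database can be represented as feature vectors: an offline "dream" phase clusters the database, and the online phase searches only the cluster matching the current context, keeping per-query work roughly constant as the database grows. The clustering method (plain k-means), the data layout and all sizes are assumptions for illustration and do not reflect the actual DALI design.

```python
# Toy illustration (not the DALI implementation) of contextual partitioning: an
# offline "dream" phase clusters a template database so that the online phase only
# searches the cluster that matches the current context.
import numpy as np

def dream_phase(templates, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) as a stand-in for the reorganization step."""
    rng = np.random.default_rng(seed)
    centroids = templates[rng.choice(len(templates), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(np.linalg.norm(templates[:, None] - centroids[None], axis=2), axis=1)
        centroids = np.array([templates[assign == c].mean(axis=0) if np.any(assign == c)
                              else centroids[c] for c in range(k)])
    return centroids, assign

def online_recognize(query, templates, centroids, assign):
    """Search only the partition whose centroid is closest to the query."""
    c = np.argmin(np.linalg.norm(centroids - query, axis=1))
    members = np.flatnonzero(assign == c)
    best = members[np.argmin(np.linalg.norm(templates[members] - query, axis=1))]
    return best, c

rng = np.random.default_rng(1)
db = rng.normal(size=(220, 12))          # e.g. 220 word templates, 12-dim features (assumed)
centroids, assign = dream_phase(db, k=16)
match, context = online_recognize(rng.normal(size=12), db, centroids, assign)
```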
|
733 |
Multichannel audio processing for speaker localization, separation and enhancement. Martí Guerola, Amparo, 29 October 2013.
This thesis is related to the field of acoustic signal processing and its applications to emerging
communication environments. Acoustic signal processing is a very wide research area covering
the design of signal processing algorithms involving one or several acoustic signals to perform
a given task, such as locating the sound source that originated the acquired signals, improving
their signal to noise ratio, separating signals of interest from a set of interfering sources or recognizing
the type of source and the content of the message. Among the above tasks, Sound Source
localization (SSL) and Automatic Speech Recognition (ASR) have been specially addressed in
this thesis. In fact, the localization of sound sources in a room has received a lot of attention in
the last decades. Most real-world microphone array applications require the localization of one
or more active sound sources in adverse environments (low signal-to-noise ratio and high reverberation).
Some of these applications are teleconferencing systems, video-gaming, autonomous
robots, remote surveillance, hands-free speech acquisition, etc. Indeed, performing robust sound
source localization under high noise and reverberation is a very challenging task. One of the
most well-known algorithms for source localization in noisy and reverberant environments is
the Steered Response Power - Phase Transform (SRP-PHAT) algorithm, which constitutes the
baseline framework for the contributions proposed in this thesis. Another challenge in the design
of SSL algorithms is to achieve real-time performance and high localization accuracy with a reasonable
number of microphones and limited computational resources. Although the SRP-PHAT
algorithm has been shown to be an effective localization algorithm for real-world environments,
its practical implementation is usually based on a costly fine grid-search procedure, making the
computational cost of the method a real issue. In this context, several modifications and optimizations
have been proposed to improve its performance and applicability. An effective strategy
that extends the conventional SRP-PHAT functional is presented in this thesis. This approach
performs a full exploration of the sampled space rather than computing the SRP at discrete spatial
positions, increasing its robustness and allowing for a coarser spatial grid that reduces the
computational cost required in a practical implementation with a small hardware cost (reduced
number of microphones). This strategy makes it possible to implement real-time applications based on
location information, such as automatic camera steering or the detection of speech/non-speech
fragments in advanced videoconferencing systems.
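As a concrete illustration of the baseline method mentioned above, the following sketch computes the conventional SRP-PHAT functional from GCC-PHAT cross-correlations accumulated over a grid of candidate source positions. It is a minimal reimplementation for illustration only: the microphone geometry, frame length and sampling rate are assumed, and the modified functional proposed in the thesis (which accumulates the correlations over the volume surrounding each grid point) is not reproduced here.

```python
# Minimal sketch of conventional SRP-PHAT: for every candidate grid point, sum the
# GCC-PHAT values at the time differences of arrival (TDOAs) that point would produce
# for every microphone pair, then pick the point with maximum steered response power.
import numpy as np

def gcc_phat(x1, x2, nfft):
    """Generalized cross-correlation with the phase transform (PHAT) weighting."""
    X1, X2 = np.fft.rfft(x1, nfft), np.fft.rfft(x2, nfft)
    cross = X1 * np.conj(X2)
    cross /= np.maximum(np.abs(cross), 1e-12)        # PHAT: keep phase only
    return np.fft.fftshift(np.fft.irfft(cross, nfft))

def srp_phat(frames, mic_pos, grid, fs, c=343.0):
    """frames: (n_mics, n_samples) signals; mic_pos, grid: positions in metres."""
    n_mics, n_samples = frames.shape
    nfft = 2 * n_samples
    center = nfft // 2                                # index of zero lag after fftshift
    srp = np.zeros(len(grid))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cc = gcc_phat(frames[i], frames[j], nfft)
            # TDOA (in samples) that each grid point implies for this microphone pair.
            tdoa = (np.linalg.norm(grid - mic_pos[i], axis=1)
                    - np.linalg.norm(grid - mic_pos[j], axis=1)) / c * fs
            srp += cc[np.clip(center + np.round(tdoa).astype(int), 0, nfft - 1)]
    return grid[np.argmax(srp)], srp
```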
As stated before, besides the contributions related to SSL, this thesis is also related to the
field of ASR. This technology allows a computer or electronic device to identify the words spoken
by a person so that the message can be stored or processed in a useful way. ASR is used on
a day-to-day basis in a number of applications and services such as natural human-machine
interfaces, dictation systems, electronic translators and automatic information desks. However,
there are still some challenges to be solved. A major problem in ASR is to recognize people
speaking in a room by using distant microphones. In distant-speech recognition, the microphone
receives not only the direct-path signal but also delayed replicas as a result of multi-path
propagation. Moreover, there are multiple situations in teleconferencing meetings when multiple
speakers talk simultaneously. In this context, when multiple speaker signals are present, Sound
Source Separation (SSS) methods can be successfully employed to improve ASR performance
in multi-source scenarios. This is the motivation behind the training method for multiple-talker
situations proposed in this thesis. This training, which is based on a robust transformed model
constructed from separated speech in diverse acoustic environments, makes use of an SSS method
as a speech enhancement stage that suppresses the unwanted interferences. The combination
of source separation and this specific training has been explored and evaluated under different
acoustical conditions, leading to improvements of up to 35% in ASR performance. / Martí Guerola, A. (2013). Multichannel audio processing for speaker localization, separation and enhancement [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/33101
|
734 |
Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing. Granell Romero, Emilio, 01 September 2017.
Natural Language Processing (NLP) is an interdisciplinary research field of Computer Science, Linguistics, and Pattern Recognition that studies, among others, the use of human natural languages in Human-Computer Interaction (HCI). Most NLP research tasks can be applied to solving real-world problems. This is the case of natural language recognition and natural language translation, which can be used for building automatic systems for document transcription and document translation.
Regarding digitalised handwritten text documents, transcription is used to obtain easy digital access to the contents, since simple image digitalisation only provides, in most cases, search by image and not by linguistic contents (keywords, expressions, syntactic or semantic categories). Transcription is even more important in historical manuscripts, since most of these documents are unique and the preservation of their contents is crucial for cultural and historical reasons.
The transcription of historical manuscripts is usually done by paleographers, who are experts on ancient script and vocabulary. Recently, Handwritten Text Recognition (HTR) has become a common tool for assisting paleographers in their task, by providing a draft transcription that they may amend with more or less sophisticated methods. This draft transcription is useful when it presents an error rate low enough to make the amending process more comfortable than a complete transcription from scratch. Thus, obtaining a draft transcription with an acceptable low error rate is crucial to have this NLP technology incorporated into the transcription process.
The work described in this thesis is focused on the improvement of the draft transcription offered by an HTR system, with the aim of reducing the effort made by paleographers for obtaining the actual transcription on digitalised historical manuscripts.
This problem is faced from three different, but complementary, scenarios:
· Multimodality: The use of HTR systems allows paleographers to speed up the manual transcription process, since they are able to correct a draft transcription. Another alternative is to obtain the draft transcription by dictating the contents to an Automatic Speech Recognition (ASR) system. When both sources (image and speech) are available, a multimodal combination is possible and an iterative process can be used in order to refine the final hypothesis (a toy sketch of such a combination appears after this list).
· Interactivity: The use of assistive technologies in the transcription process allows one to reduce the time and human effort required for obtaining the actual transcription, given that the assistive system and the paleographer cooperate to generate a perfect transcription.
Multimodal feedback can be used to provide the assistive system with additional sources of information, using signals that represent the same whole sequence of words to transcribe (e.g. a text image, and the speech of the dictation of the contents of this text image), or that represent just a word or character to correct (e.g. an on-line handwritten word).
· Crowdsourcing: Open distributed collaboration emerges as a powerful tool for massive transcription at a relatively low cost, since the paleographer supervision effort may be dramatically reduced. Multimodal combination allows one to use the speech dictation of handwritten text lines in a multimodal crowdsourcing platform, where collaborators may provide their speech by using their own mobile device instead of using desktop or laptop computers, which makes it possible to recruit more collaborators.
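Referenced from the Multimodality item above, the following toy sketch shows one simple way to combine hypotheses from the two modalities: a log-linear fusion of HTR and ASR n-best scores over the union of their candidate transcriptions. The fusion rule, the weight and the example hypotheses are assumptions for illustration, not the iterative combination scheme actually developed in the thesis.

```python
# Toy sketch (not the thesis's actual method) of fusing an HTR n-best list with an
# ASR n-best list for the same text line: scores for hypotheses proposed by either
# modality are combined log-linearly and the best joint hypothesis is returned.
import math

def combine_nbest(htr_nbest, asr_nbest, alpha=0.6):
    """htr_nbest / asr_nbest: dicts mapping a hypothesis string to a log-probability.
    alpha is an assumed interpolation weight favouring the image modality."""
    fused = {}
    for hyp in set(htr_nbest) | set(asr_nbest):
        # Back off to a large penalty when one modality did not propose the hypothesis.
        h = htr_nbest.get(hyp, -1e3)
        a = asr_nbest.get(hyp, -1e3)
        fused[hyp] = alpha * h + (1.0 - alpha) * a
    return max(fused, key=fused.get), fused

best, scores = combine_nbest(
    {"quixote de la mancha": math.log(0.5), "quixote de la manga": math.log(0.4)},
    {"quixote de la mancha": math.log(0.6), "quijote de la mancha": math.log(0.3)},
)
```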
Granell Romero, E. (2017). Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86137
|
735 |
Phoneme set design for second language speech recognition / 第二言語音声認識のための音素セットの構築に関する研究. 王 暁芸 (Xiaoyun Wang), 22 March 2017.
This thesis presents a study on how to construct phoneme sets for recognizing the speech of second-language speakers with high accuracy. Second-language speech is treated as an information source whose acoustic features follow a frequency distribution different from that of native speech, and a method is proposed for building a phoneme set appropriate for representing it. Specifically, the optimal phoneme set is determined by a criterion that combines the similarity in place and manner of articulation between the target second language and the speaker's mother tongue with the loss of word discriminability caused by newly created homophones. The proposed method is applied to the recognition of English spoken by Japanese students, and improvements in recognition accuracy are verified under various conditions. / This dissertation focuses on the problem caused by confused mispronunciation in order to improve the recognition performance of second-language speech. A novel method considering integrated acoustic and linguistic features is proposed to derive a reduced phoneme set for L2 speech recognition. The customized phoneme set is created with a phonetic decision tree (PDT)-based top-down sequential splitting method that utilizes the phonological knowledge between L1 and L2. The dissertation verifies the efficacy of the proposed method for Japanese English and shows that a speech recognizer built with the proposed method can alleviate the problems caused by confused mispronunciation by second-language speakers. / Doctor of Philosophy in Engineering / Doshisha University
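The trade-off described here can be illustrated with a small sketch: merging phoneme pairs that second-language speakers confuse reduces acoustic confusability, but each merge may collapse distinct words into homophones. The greedy bottom-up merge below, the toy lexicon and the homophone threshold are assumptions for illustration; the thesis instead derives the reduced set with PDT-based top-down sequential splitting.

```python
# Hedged sketch of the criterion above: accept a phoneme merge only if it does not
# create too many new homophones in the lexicon. This greedy bottom-up merge is a
# stand-in for the PDT-based splitting actually used in the thesis.
def homophone_count(lexicon, merge_map):
    seen = {}
    for word, phones in lexicon.items():
        key = tuple(merge_map.get(p, p) for p in phones)
        seen.setdefault(key, []).append(word)
    return sum(len(ws) - 1 for ws in seen.values() if len(ws) > 1)

def reduce_phoneme_set(lexicon, confusable_pairs, max_new_homophones=1):
    merge_map = {}
    baseline = homophone_count(lexicon, merge_map)
    for a, b in confusable_pairs:                 # pairs ordered by assumed L1/L2 confusability
        trial = dict(merge_map, **{b: merge_map.get(a, a)})
        if homophone_count(lexicon, trial) - baseline <= max_new_homophones:
            merge_map = trial                     # accept the merge
    return merge_map

lexicon = {"right": ["r", "ay", "t"], "light": ["l", "ay", "t"],
           "sink": ["s", "ih", "ng", "k"], "think": ["th", "ih", "ng", "k"]}
merges = reduce_phoneme_set(lexicon, [("s", "th"), ("r", "l")])   # toy example only
```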
|
736 |
An integrated approach to feature compensation combining particle filters and Hidden Markov Models for robust speech recognition. Mushtaq, Aleem, 19 September 2013.
The performance of automatic speech recognition systems often degrades in adverse conditions where there is a mismatch between training and testing conditions. This is true for most modern systems, which employ Hidden Markov Models (HMMs) to decode speech utterances. One strategy is to map the distorted features back to clean speech features that correspond well to the features used for training the HMMs. This can be achieved by treating the noisy speech as a distorted version of the clean speech of interest. Under this framework, we can track and consequently extract the underlying clean speech from the noisy signal and use this derived signal to perform utterance recognition. The particle filter is a versatile tracking technique that can be used where conventional techniques such as the Kalman filter often fall short. We propose a particle-filter-based algorithm to compensate the corrupted features according to an additive noise model, incorporating both the statistics from clean speech HMMs and the observed background noise to map noisy features back to clean speech features. Instead of using specific knowledge at the model and state levels from the HMMs, which is hard to estimate, we pool model states into clusters as side information. Since each cluster encompasses more statistics than the original HMM states, there is a higher possibility that the newly formed probability density function at the cluster level can cover the underlying speech variation and generate appropriate particle filter samples for feature compensation. Additionally, a dynamic joint tracking framework to monitor the clean speech signal and noise simultaneously is introduced to obtain good noise statistics. In this approach, the information available from clean speech tracking can be effectively used for noise estimation. The availability of dynamic noise information can enhance the robustness of the algorithm in the case of large fluctuations in noise parameters within an utterance. Testing the proposed particle filter compensation (PFC) scheme on the Aurora 2 connected digit recognition task, we achieve an error reduction of 12.15% from the best multi-condition trained models using this integrated PF-HMM framework to estimate the cluster-based HMM state sequence information. Finally, we extended the PFC framework and evaluated it on a large-vocabulary recognition task, showing that PFC works well for large-vocabulary systems as well.
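The following sketch illustrates the particle-filter idea described above for a single frame: particles for the clean-speech feature are drawn from a prior density (a single Gaussian here, standing in for the cluster-level statistics pooled from the HMM states), weighted by how well they explain the observed noisy feature under an additive-noise model in the log-spectral domain, and averaged to give the compensated feature. The noise model, prior and all parameter values are assumptions for illustration, not the exact formulation of the thesis.

```python
# Hedged sketch of particle-filter feature compensation for one frame. The prior,
# noise statistics and observation variance are assumed values, not the thesis's.
import numpy as np

def compensate_frame(y, clean_mean, clean_cov, noise_mean, obs_var=0.1,
                     n_particles=500, rng=np.random.default_rng(0)):
    # 1. Propose clean-speech particles from the prior.
    particles = rng.multivariate_normal(clean_mean, clean_cov, size=n_particles)
    # 2. Predict the noisy observation each particle implies: y ~ log(exp(x) + exp(n)).
    predicted = np.logaddexp(particles, noise_mean)
    # 3. Weight particles by the likelihood of the actual observation y.
    log_w = -0.5 * np.sum((y - predicted) ** 2, axis=1) / obs_var
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # 4. The compensated (estimated clean) feature is the weighted particle mean.
    return w @ particles

dim = 10
rng = np.random.default_rng(1)
clean_hat = compensate_frame(y=rng.normal(size=dim), clean_mean=np.zeros(dim),
                             clean_cov=np.eye(dim), noise_mean=-1.0 * np.ones(dim))
```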
|
737 |
Speech recognition in children with unilateral and bilateral cochlear implants in quiet and in noise. Dawood, Gouwa, 12 1900.
Thesis (MAud (Interdisciplinary Health Sciences. Speech-Language and Hearing Therapy))--Stellenbosch University, 2008. / Individuals are increasingly undergoing bilateral cochlear implantation in an attempt to
benefit from binaural hearing. The main aim of the present study was to compare the
speech recognition of children fitted with bilateral cochlear implants, under binaural and
monaural listening conditions, in quiet and in noise. Ten children, ranging in age from 5
years 7 months to 15 years 4 months, were tested using the Children’s Realistic Index for
Speech Perception (CRISP). All the children were implanted with Nucleus multi-channel
cochlear implant systems in sequential operations and used the ACE coding strategy
bilaterally. The duration of cochlear implant use ranged from 4 years to 8 years 11
months for the first implant and 7 months to 3 years 5 months for the second implant.
Each child was tested in eight listening conditions, which included testing in the presence
and absence of competing speech. Performance with bilateral cochlear implants was not
statistically better than performance with the first cochlear implant, for both quiet and
noisy listening conditions. A ceiling effect may have resulted in the lack of a significant
finding as the scores obtained during unilateral conditions were already close to
maximum. A positive correlation between the length of use of the second cochlear
implant and speech recognition performance was established. The results of the present
study strongly indicated the need for testing paradigms to be devised which are more
sensitive and representative of the complex auditory environments in which cochlear
implant users communicate.
|
738 |
Measuring, refining and calibrating speaker and language information extracted from speech. Brummer, Niko, 12 1900.
Thesis (PhD (Electrical and Electronic Engineering))--University of Stellenbosch, 2010. / ENGLISH ABSTRACT: We propose a new methodology, based on proper scoring rules, for the evaluation
of the goodness of pattern recognizers with probabilistic outputs. The
recognizers of interest take an input, known to belong to one of a discrete set
of classes, and output a calibrated likelihood for each class. This is a generalization
of the traditional use of proper scoring rules to evaluate the goodness
of probability distributions. A recognizer with outputs in well-calibrated probability
distribution form can be applied to make cost-effective Bayes decisions
over a range of applications, having different cost functions. A recognizer
with likelihood output can additionally be employed for a wide range of prior
distributions for the to-be-recognized classes.
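The application-independence claim above can be made concrete with a short sketch: a recognizer that outputs well-calibrated class likelihoods can be converted into minimum-expected-cost Bayes decisions for any prior and cost matrix chosen at application time. The priors and costs in the example are assumptions, not values from the thesis.

```python
# Sketch of making Bayes decisions from calibrated class log-likelihoods: the same
# recognizer output can lead to different optimal decisions under different priors
# and cost functions, which is why likelihood outputs are application-independent.
import numpy as np

def bayes_decision(log_likelihoods, prior, cost):
    """cost[d, c] = cost of deciding class d when the true class is c."""
    log_post = np.log(prior) + log_likelihoods
    post = np.exp(log_post - np.logaddexp.reduce(log_post))   # normalized posterior
    expected_cost = cost @ post                                # one entry per decision
    return int(np.argmin(expected_cost))

llks = np.array([-1.2, -0.9])                # calibrated log-likelihoods for 2 classes (assumed)
print(bayes_decision(llks, prior=np.array([0.5, 0.5]), cost=np.array([[0, 1], [1, 0]])))
print(bayes_decision(llks, prior=np.array([0.99, 0.01]), cost=np.array([[0, 1], [10, 0]])))
```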
We use automatic speaker recognition and automatic spoken language
recognition as prototypes of this type of pattern recognizer. The traditional
evaluation methods in these fields, as represented by the series of NIST Speaker
and Language Recognition Evaluations, evaluate hard decisions made by the
recognizers. This makes these recognizers cost-and-prior-dependent. The proposed
methodology generalizes that of the NIST evaluations, allowing for the
evaluation of recognizers which are intended to be usefully applied over a wide
range of applications, having variable priors and costs.
The proposal includes a family of evaluation criteria, where each member
of the family is formed by a proper scoring rule. We emphasize two members
of this family: (i) A non-strict scoring rule, directly representing error-rate
at a given prior. (ii) The strict logarithmic scoring rule which represents
information content, or which equivalently represents summarized error-rate,
or expected cost, over a wide range of applications.
We further show how to form a family of secondary evaluation criteria,
which by contrasting with the primary criteria, form an analysis of the goodness
of calibration of the recognizers' likelihoods.
Finally, we show how to use the logarithmic scoring rule as an objective
function for the discriminative training of fusion and calibration of speaker
and language recognizers. / AFRIKAANSE OPSOMMING: We show how to represent, measure, calibrate and optimise the uncertainty in the output of automatic speaker recognition and language recognition systems. This makes the existing technology more accurate, more efficient and more generally applicable.
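As an illustration of the logarithmic scoring rule used above as a summary evaluation measure, the sketch below computes it in the form commonly reported for detection tasks (Cllr) from target and non-target log-likelihood-ratio scores. This is an illustrative reimplementation, not the evaluation code used in the thesis, and the scores shown are placeholders.

```python
# Illustrative reimplementation of the logarithmic scoring rule summarized as Cllr,
# computed from detection log-likelihood-ratio (LLR) scores. A well-calibrated system
# achieves a low Cllr; 1 bit is the reference value of a system that always outputs LLR = 0.
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    c_tar = np.mean(np.log1p(np.exp(-np.asarray(target_llrs))))
    c_non = np.mean(np.log1p(np.exp(np.asarray(nontarget_llrs))))
    return (c_tar + c_non) / (2.0 * np.log(2.0))

# Placeholder scores, only to show the call; real evaluations use per-trial LLRs.
print(cllr(target_llrs=[2.1, 3.0, 0.5, 4.2], nontarget_llrs=[-1.5, -3.2, -0.2, -2.8]))
```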
|
739 |
A comparison of Gaussian mixture variants with application to automatic phoneme recognition. Brand, Rinus, 12 1900.
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2007. / The diagonal covariance Gaussian Probability Density Function (PDF) has been a very
popular choice as the base PDF for Automatic Speech Recognition (ASR) systems. The
only choices thus far have been between the spherical, diagonal and full covariance Gaussian
PDFs. These classic methods have been used for some time, but no single document could be
found that contains a comparative study of these methods in the context of Pattern Recognition
(PR).
There is also a gap between the complexity and speed of the diagonal and full covariance
Gaussian implementations. The accuracy, speed and size of these two methods differ
drastically. There is a need to find one or more models that cover
this area between these two classic methods.
The objectives of this thesis are to evaluate three new PDF types that fit into the area
between the diagonal and full covariance Gaussian implementations to broaden the choices
for ASR, to document a comparative study on the three classic methods and the newly
implemented methods (from previous work) and to construct a test system to evaluate these
methods on phoneme recognition.
The three classic density functions are examined and issues regarding the theory, implementation
and usefulness of each are discussed. A visual example of each is given to show
the impact of assumptions made by each (if any).
The three newly implemented PDFs are the Sparse-, Probabilistic Principal Component
Analysis- (PPCA) and Factor Analysis (FA) covariance Gaussian PDFs. The theory, implementation
and practical usefulness are shown and discussed. Again visual examples are
provided to show the difference in modelling methodologies.
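To make the gap between the diagonal and full covariance extremes concrete, the sketch below evaluates the log-density of a Gaussian whose covariance is constrained to the factor-analysis form Sigma = Lambda Lambda^T + Psi, with a d-by-q loading matrix (q much smaller than d) and a diagonal Psi; PPCA is the special case Psi = sigma^2 I. The dimensions and parameter values are assumed for illustration, and the density is evaluated naively by forming Sigma, whereas an efficient implementation would use the Woodbury identity.

```python
# Hedged sketch of the factor-analysis (FA) covariance Gaussian: parameter count grows
# with the number of factors q rather than with d**2, placing it between the diagonal
# and full covariance variants. Values below are assumed for illustration only.
import numpy as np

def fa_gaussian_logpdf(x, mean, loadings, psi_diag):
    d = len(mean)
    sigma = loadings @ loadings.T + np.diag(psi_diag)   # Sigma = Lambda Lambda^T + Psi
    diff = x - mean
    sign, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

rng = np.random.default_rng(0)
d, q = 39, 5                                            # e.g. 39-dim features, 5 factors (assumed)
logp = fa_gaussian_logpdf(rng.normal(size=d), np.zeros(d),
                          rng.normal(size=(d, q)) * 0.1, np.ones(d))
```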
The construction of a test system using two speech corpora is shown and includes issues
involving signal processing, PR and evaluation of the results. The NTIMIT and AST speech
corpora were used for initialising and training the test system. The use of the system to
evaluate the PDFs discussed in this work is explained.
The testing results of the three new methods confirmed that they indeed fill the gap
between the diagonal and full covariance Gaussians. In our tests the newly implemented
methods produced a relative improvement in error rate over a similarly implemented diagonal
covariance Gaussian of 0.3–4%, but took 35–78% longer to evaluate. When compared relative
to the full covariance Gaussian the error rates were 18–22% worse, but the evaluation times
were 61–70% faster. When all the methods were scaled to approximately the same accuracy,
all the above methods were 29–143% slower than the diagonal covariance Gaussian (excluding the spherical covariance method).
|
740 |
Kalbos atpažinimas kompiuteriu / Speech recognition by computer. Bardauskas, Justinas, 04 July 2014.
This work focuses on speech recognition by computer, the stages of pattern recognition, and the associated problems; it also aims to create a speech recognition tool. At the beginning, a general overview of the audio signal and of language concepts is given. This is followed by a presentation of the essential tasks of speech recognition and an introduction to the matrix algebra used in the algorithms described. Information is provided on how, and on what basis, features are extracted. LPC is often used for this purpose: it is one of the most popular algorithms for extracting features from the speech signal, so it is reviewed in this work, together with its modification WLPC. The text then presents the theory of how the extracted features are used. The section "Acoustic modeling" describes the recognition of speech units and one of the most commonly used acoustic modeling technologies, Hidden Markov Models, while the section "Speech modeling" describes language modeling, whose purpose is to correct the available data using dictionaries and the structure of the language being analyzed. The rest of the text concentrates on speech recognition using the spectrogram and on the implementation of a speech recognition system. Finally, experiments were carried out to assess the quality of the speech recognition.
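As an illustration of the LPC feature extraction reviewed above, the sketch below computes LPC coefficients for one frame using the autocorrelation method and the Levinson-Durbin recursion. The windowing, frame length and model order are assumptions for illustration, and the WLPC modification discussed in the work is not shown.

```python
# Minimal sketch of LPC analysis for one frame: window, autocorrelate, then run the
# Levinson-Durbin recursion. Returns the polynomial coefficients of A(z) = 1 + a1 z^-1 + ...
# and the residual prediction-error energy. Frame length, window and order are assumed.
import numpy as np

def lpc(signal, order):
    x = np.asarray(signal, dtype=float) * np.hamming(len(signal))
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                 # remaining prediction-error energy
    return a, err

rng = np.random.default_rng(0)
frame = rng.normal(size=400)                 # stand-in for a 25 ms frame at 16 kHz (assumed)
coeffs, residual_energy = lpc(frame, order=12)
```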
|