731 |
Normalizing Flow based Hidden Markov Models for Phone Recognition / Normalisering av flödesbaserade dolda Markov-modeller för fonemigenkänning. Ghosh, Anubhab, January 2020.
Phone recognition is a fundamental task in speech recognition and often serves a critical role in benchmarking. Researchers have addressed this task with a variety of models, using both generative and discriminative learning approaches. Among them, generative approaches such as Gaussian mixture model-based hidden Markov models have long been favored because of their mathematical tractability. However, the use of generative models such as hidden Markov models and their hybrid varieties has fallen out of fashion owing to a strong shift toward discriminative learning approaches, which have been found to perform better. The downside is that these approaches do not always ensure mathematical tractability or convergence guarantees, as opposed to their generative counterparts. The research problem was therefore to investigate whether the modeling capability of generative models could be augmented with neural-network-based architectures that are simultaneously mathematically tractable and expressive. Normalizing flows are a class of generative models that have recently garnered a lot of attention in the field of density estimation and offer a method for exact likelihood computation and inference. In this project, several varieties of normalizing-flow-based hidden Markov models were used for the task of phone recognition on the TIMIT dataset. It was found that these models and their mixture-model varieties outperformed classical generative models such as Gaussian mixture models. A decision-fusion approach using classical Gaussian and normalizing-flow-based mixtures showed competitive results compared to discriminative learning approaches. Further analysis based on classes of speech phones was carried out to compare the generative models used. Additionally, the robustness of these algorithms to noisy speech conditions was studied.
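As an illustration of the exact likelihood computation that makes normalizing flows attractive as HMM emission densities, the sketch below applies the change-of-variables rule with a single element-wise affine flow layer. The layer type, feature dimension and number of states are assumptions for illustration; the models studied in the thesis use richer flow architectures.

```python
# Minimal sketch of the change-of-variables rule behind normalizing flows: an
# invertible map f sends an observation x to a latent z with a simple base density,
# and log p(x) = log p_z(f(x)) + log|det df/dx|. Here f is one element-wise affine
# layer; real flows stack many richer invertible layers.
import numpy as np

class AffineFlow:
    """Element-wise affine flow z = (x - shift) * exp(-log_scale)."""
    def __init__(self, dim, rng):
        self.shift = rng.normal(size=dim)
        self.log_scale = rng.normal(scale=0.1, size=dim)

    def log_prob(self, x):
        z = (x - self.shift) * np.exp(-self.log_scale)
        # Standard-normal base density plus the log-determinant of the Jacobian.
        log_base = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=-1)
        log_det = -np.sum(self.log_scale)
        return log_base + log_det

# Used as an HMM emission density, each hidden state s owns its own flow, and the
# usual forward algorithm consumes log b_s(x_t) = flows[s].log_prob(x_t).
rng = np.random.default_rng(0)
flows = [AffineFlow(dim=13, rng=rng) for _ in range(3)]   # e.g. 3 states, 13-dim features
frame = rng.normal(size=13)
emission_loglikes = np.array([f.log_prob(frame) for f in flows])
```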
|
732 |
Malleable Contextual Partitioning and Computational Dreaming. Brar, Gurkanwal Singh, 20 January 2015.
Computer architecture is entering an era where hundreds of Processing Elements (PEs) can be integrated onto a single chip, even as decades-long, steady advances in instruction- and thread-level parallelism are coming to an end. And yet, conventional methods of parallelism fail to scale beyond 4-5 PEs, well short of the levels of parallelism found in the human brain. The human brain is able to maintain constant real-time performance as cognitive complexity grows virtually unbounded through our lifetime. Our underlying thesis is that contextual categorization leading to simplified algorithmic processing is crucial to the brain's performance efficiency. But, since the overheads of such reorganization are unaffordable in real time, we also observe the critical role of sleep and dreaming in the lives of all intelligent beings. Based on the importance of dream sleep in memory consolidation, we propose that it is also responsible for contextual reorganization. We target mobile device applications that can be personalized to the user, including speech, image and gesture recognition, as well as other kinds of personalized classification, which are arguably the foundation of intelligence. These algorithms rely on a knowledge database of symbols, where the database size determines the level of intelligence. Essential to achieving intelligence and a seamless user interface, however, is that real-time performance be maintained. Observing this, we define our chief performance goal as maintaining constant real-time performance against ever-increasing algorithmic and architectural complexities. Our solution is a method for Malleable Contextual Partitioning (MCP) that enables closer personalization to user behavior. We conceptualize a novel architectural framework, the Dream Architecture for Lateral Intelligence (DALI), that demonstrates the MCP approach. The DALI implements a dream phase to execute MCP in ideal MISD parallelism and reorganize its architecture to enable contextually simplified real-time operation. With speech recognition as an example application, we show that the DALI is successful in achieving the performance goal: it maintains constant real-time recognition, scaling almost ideally, with PE counts up to 16 and vocabulary sizes up to 220 words. / Master of Science
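A toy sketch of the contextual-partitioning idea is given below, under the assumption that the symbol knowledge database can be represented as feature vectors: an offline "dream" phase clusters the database, and the online phase searches only the cluster matching the current context, keeping per-query work roughly constant as the database grows. The clustering method (plain k-means), the data layout and all sizes are assumptions for illustration and do not reflect the actual DALI design.

```python
# Toy illustration (not the DALI implementation) of contextual partitioning: an
# offline "dream" phase clusters a template database so that the online phase only
# searches the cluster that matches the current context.
import numpy as np

def dream_phase(templates, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) as a stand-in for the reorganization step."""
    rng = np.random.default_rng(seed)
    centroids = templates[rng.choice(len(templates), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(np.linalg.norm(templates[:, None] - centroids[None], axis=2), axis=1)
        centroids = np.array([templates[assign == c].mean(axis=0) if np.any(assign == c)
                              else centroids[c] for c in range(k)])
    return centroids, assign

def online_recognize(query, templates, centroids, assign):
    """Search only the partition whose centroid is closest to the query."""
    c = np.argmin(np.linalg.norm(centroids - query, axis=1))
    members = np.flatnonzero(assign == c)
    best = members[np.argmin(np.linalg.norm(templates[members] - query, axis=1))]
    return best, c

rng = np.random.default_rng(1)
db = rng.normal(size=(220, 12))          # e.g. 220 word templates, 12-dim features (assumed)
centroids, assign = dream_phase(db, k=16)
match, context = online_recognize(rng.normal(size=12), db, centroids, assign)
```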
|
733 |
Multichannel audio processing for speaker localization, separation and enhancement. Martí Guerola, Amparo, 29 October 2013.
This thesis is related to the field of acoustic signal processing and its applications to emerging
communication environments. Acoustic signal processing is a very wide research area covering
the design of signal processing algorithms involving one or several acoustic signals to perform
a given task, such as locating the sound source that originated the acquired signals, improving
their signal to noise ratio, separating signals of interest from a set of interfering sources or recognizing
the type of source and the content of the message. Among the above tasks, Sound Source
localization (SSL) and Automatic Speech Recognition (ASR) have been specially addressed in
this thesis. In fact, the localization of sound sources in a room has received a lot of attention in
the last decades. Most real-world microphone array applications require the localization of one
or more active sound sources in adverse environments (low signal-to-noise ratio and high reverberation).
Some of these applications are teleconferencing systems, video-gaming, autonomous
robots, remote surveillance, hands-free speech acquisition, etc. Indeed, performing robust sound
source localization under high noise and reverberation is a very challenging task. One of the
most well-known algorithms for source localization in noisy and reverberant environments is
the Steered Response Power - Phase Transform (SRP-PHAT) algorithm, which constitutes the
baseline framework for the contributions proposed in this thesis. Another challenge in the design
of SSL algorithms is to achieve real-time performance and high localization accuracy with a reasonable
number of microphones and limited computational resources. Although the SRP-PHAT
algorithm has been shown to be an effective localization algorithm for real-world environments,
its practical implementation is usually based on a costly fine grid-search procedure, making the
computational cost of the method a real issue. In this context, several modifications and optimizations
have been proposed to improve its performance and applicability. An effective strategy
that extends the conventional SRP-PHAT functional is presented in this thesis. This approach
performs a full exploration of the sampled space rather than computing the SRP at discrete spatial
positions, increasing its robustness and allowing for a coarser spatial grid that reduces the
computational cost required in a practical implementation with a small hardware cost (reduced
number of microphones). This strategy makes it possible to implement real-time applications based on
location information, such as automatic camera steering or the detection of speech/non-speech
fragments in advanced videoconferencing systems.
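As a concrete illustration of the baseline method mentioned above, the following sketch computes the conventional SRP-PHAT functional from GCC-PHAT cross-correlations accumulated over a grid of candidate source positions. It is a minimal reimplementation for illustration only: the microphone geometry, frame length and sampling rate are assumed, and the modified functional proposed in the thesis (which accumulates the correlations over the volume surrounding each grid point) is not reproduced here.

```python
# Minimal sketch of conventional SRP-PHAT: for every candidate grid point, sum the
# GCC-PHAT values at the time differences of arrival (TDOAs) that point would produce
# for every microphone pair, then pick the point with maximum steered response power.
import numpy as np

def gcc_phat(x1, x2, nfft):
    """Generalized cross-correlation with the phase transform (PHAT) weighting."""
    X1, X2 = np.fft.rfft(x1, nfft), np.fft.rfft(x2, nfft)
    cross = X1 * np.conj(X2)
    cross /= np.maximum(np.abs(cross), 1e-12)        # PHAT: keep phase only
    return np.fft.fftshift(np.fft.irfft(cross, nfft))

def srp_phat(frames, mic_pos, grid, fs, c=343.0):
    """frames: (n_mics, n_samples) signals; mic_pos, grid: positions in metres."""
    n_mics, n_samples = frames.shape
    nfft = 2 * n_samples
    center = nfft // 2                                # index of zero lag after fftshift
    srp = np.zeros(len(grid))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cc = gcc_phat(frames[i], frames[j], nfft)
            # TDOA (in samples) that each grid point implies for this microphone pair.
            tdoa = (np.linalg.norm(grid - mic_pos[i], axis=1)
                    - np.linalg.norm(grid - mic_pos[j], axis=1)) / c * fs
            srp += cc[np.clip(center + np.round(tdoa).astype(int), 0, nfft - 1)]
    return grid[np.argmax(srp)], srp
```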
As stated before, besides the contributions related to SSL, this thesis is also related to the
field of ASR. This technology allows a computer or electronic device to identify the words spoken
by a person so that the message can be stored or processed in a useful way. ASR is used on
a day-to-day basis in a number of applications and services such as natural human-machine
interfaces, dictation systems, electronic translators and automatic information desks. However,
there are still some challenges to be solved. A major problem in ASR is to recognize people
speaking in a room by using distant microphones. In distant-speech recognition, the microphone
receives not only the direct-path signal but also delayed replicas as a result of multi-path
propagation. Moreover, there are multiple situations in teleconferencing meetings when multiple
speakers talk simultaneously. In this context, when multiple speaker signals are present, Sound
Source Separation (SSS) methods can be successfully employed to improve ASR performance
in multi-source scenarios. This is the motivation behind the training method for multiple-talker
situations proposed in this thesis. This training, which is based on a robust transformed model
constructed from separated speech in diverse acoustic environments, makes use of an SSS method
as a speech enhancement stage that suppresses the unwanted interferences. The combination
of source separation and this specific training has been explored and evaluated under different
acoustical conditions, leading to improvements of up to 35% in ASR performance. / Martí Guerola, A. (2013). Multichannel audio processing for speaker localization, separation and enhancement [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/33101
|
734 |
Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing. Granell Romero, Emilio, 01 September 2017.
Natural Language Processing (NLP) is an interdisciplinary research field of Computer Science, Linguistics, and Pattern Recognition that studies, among others, the use of human natural languages in Human-Computer Interaction (HCI). Most NLP research tasks can be applied to solving real-world problems. This is the case of natural language recognition and natural language translation, which can be used for building automatic systems for document transcription and document translation.
Regarding digitalised handwritten text documents, transcription is used to obtain easy digital access to the contents, since simple image digitalisation only provides, in most cases, search by image and not by linguistic contents (keywords, expressions, syntactic or semantic categories). Transcription is even more important in historical manuscripts, since most of these documents are unique and the preservation of their contents is crucial for cultural and historical reasons.
The transcription of historical manuscripts is usually done by paleographers, who are experts on ancient script and vocabulary. Recently, Handwritten Text Recognition (HTR) has become a common tool for assisting paleographers in their task, by providing a draft transcription that they may amend with more or less sophisticated methods. This draft transcription is useful when it presents an error rate low enough to make the amending process more comfortable than a complete transcription from scratch. Thus, obtaining a draft transcription with an acceptable low error rate is crucial to have this NLP technology incorporated into the transcription process.
The work described in this thesis is focused on the improvement of the draft transcription offered by an HTR system, with the aim of reducing the effort made by paleographers for obtaining the actual transcription on digitalised historical manuscripts.
This problem is faced from three different, but complementary, scenarios:
· Multimodality: The use of HTR systems allows paleographers to speed up the manual transcription process, since they are able to correct a draft transcription. Another alternative is to obtain the draft transcription by dictating the contents to an Automatic Speech Recognition (ASR) system. When both sources (image and speech) are available, a multimodal combination is possible and an iterative process can be used in order to refine the final hypothesis (a toy sketch of such a combination appears after this list).
· Interactivity: The use of assistive technologies in the transcription process allows one to reduce the time and human effort required for obtaining the actual transcription, given that the assistive system and the paleographer cooperate to generate a perfect transcription.
Multimodal feedback can be used to provide the assistive system with additional sources of information, using signals that represent the same whole sequence of words to transcribe (e.g. a text image, and the speech of the dictation of the contents of this text image), or that represent just a word or character to correct (e.g. an on-line handwritten word).
· Crowdsourcing: Open distributed collaboration emerges as a powerful tool for massive transcription at a relatively low cost, since the paleographer supervision effort may be dramatically reduced. Multimodal combination allows one to use the speech dictation of handwritten text lines in a multimodal crowdsourcing platform, where collaborators may provide their speech by using their own mobile device instead of using desktop or laptop computers, which makes it possible to recruit more collaborators.
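Referenced from the Multimodality item above, the following toy sketch shows one simple way to combine hypotheses from the two modalities: a log-linear fusion of HTR and ASR n-best scores over the union of their candidate transcriptions. The fusion rule, the weight and the example hypotheses are assumptions for illustration, not the iterative combination scheme actually developed in the thesis.

```python
# Toy sketch (not the thesis's actual method) of fusing an HTR n-best list with an
# ASR n-best list for the same text line: scores for hypotheses proposed by either
# modality are combined log-linearly and the best joint hypothesis is returned.
import math

def combine_nbest(htr_nbest, asr_nbest, alpha=0.6):
    """htr_nbest / asr_nbest: dicts mapping a hypothesis string to a log-probability.
    alpha is an assumed interpolation weight favouring the image modality."""
    fused = {}
    for hyp in set(htr_nbest) | set(asr_nbest):
        # Back off to a large penalty when one modality did not propose the hypothesis.
        h = htr_nbest.get(hyp, -1e3)
        a = asr_nbest.get(hyp, -1e3)
        fused[hyp] = alpha * h + (1.0 - alpha) * a
    return max(fused, key=fused.get), fused

best, scores = combine_nbest(
    {"quixote de la mancha": math.log(0.5), "quixote de la manga": math.log(0.4)},
    {"quixote de la mancha": math.log(0.6), "quijote de la mancha": math.log(0.3)},
)
```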
Granell Romero, E. (2017). Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86137
|
735 |
Phoneme set design for second language speech recognition / 第二言語音声認識のための音素セットの構築に関する研究. 王 暁芸 (Xiaoyun Wang), 22 March 2017.
This thesis presents a study on how to construct phoneme sets for recognizing the speech of second-language speakers with high accuracy. Second-language speech is treated as an information source whose acoustic features follow a frequency distribution different from that of native speech, and a method is proposed for building a phoneme set appropriate for representing it. Specifically, the optimal phoneme set is determined by a criterion that combines the similarity in place and manner of articulation between the target second language and the speaker's mother tongue with the loss of word discriminability caused by newly created homophones. The proposed method is applied to the recognition of English spoken by Japanese students, and improvements in recognition accuracy are verified under various conditions. / This dissertation focuses on the problem caused by confused mispronunciation in order to improve the recognition performance of second-language speech. A novel method considering integrated acoustic and linguistic features is proposed to derive a reduced phoneme set for L2 speech recognition. The customized phoneme set is created with a phonetic decision tree (PDT)-based top-down sequential splitting method that utilizes the phonological knowledge between L1 and L2. The dissertation verifies the efficacy of the proposed method for Japanese English and shows that a speech recognizer built with the proposed method can alleviate the problems caused by confused mispronunciation by second-language speakers. / Doctor of Philosophy in Engineering / Doshisha University
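The trade-off described here can be illustrated with a small sketch: merging phoneme pairs that second-language speakers confuse reduces acoustic confusability, but each merge may collapse distinct words into homophones. The greedy bottom-up merge below, the toy lexicon and the homophone threshold are assumptions for illustration; the thesis instead derives the reduced set with PDT-based top-down sequential splitting.

```python
# Hedged sketch of the criterion above: accept a phoneme merge only if it does not
# create too many new homophones in the lexicon. This greedy bottom-up merge is a
# stand-in for the PDT-based splitting actually used in the thesis.
def homophone_count(lexicon, merge_map):
    seen = {}
    for word, phones in lexicon.items():
        key = tuple(merge_map.get(p, p) for p in phones)
        seen.setdefault(key, []).append(word)
    return sum(len(ws) - 1 for ws in seen.values() if len(ws) > 1)

def reduce_phoneme_set(lexicon, confusable_pairs, max_new_homophones=1):
    merge_map = {}
    baseline = homophone_count(lexicon, merge_map)
    for a, b in confusable_pairs:                 # pairs ordered by assumed L1/L2 confusability
        trial = dict(merge_map, **{b: merge_map.get(a, a)})
        if homophone_count(lexicon, trial) - baseline <= max_new_homophones:
            merge_map = trial                     # accept the merge
    return merge_map

lexicon = {"right": ["r", "ay", "t"], "light": ["l", "ay", "t"],
           "sink": ["s", "ih", "ng", "k"], "think": ["th", "ih", "ng", "k"]}
merges = reduce_phoneme_set(lexicon, [("s", "th"), ("r", "l")])   # toy example only
```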
|
736 |
An integrated approach to feature compensation combining particle filters and Hidden Markov Models for robust speech recognition. Mushtaq, Aleem, 19 September 2013.
The performance of automatic speech recognition systems often degrades in adverse conditions where there is a mismatch between training and testing conditions. This is true for most modern systems, which employ Hidden Markov Models (HMMs) to decode speech utterances. One strategy is to map the distorted features back to clean speech features that correspond well to the features used for training the HMMs. This can be achieved by treating the noisy speech as a distorted version of the clean speech of interest. Under this framework, we can track and consequently extract the underlying clean speech from the noisy signal and use this derived signal to perform utterance recognition. The particle filter is a versatile tracking technique that can be used where conventional techniques such as the Kalman filter often fall short. We propose a particle-filter-based algorithm to compensate the corrupted features according to an additive noise model, incorporating both the statistics from clean speech HMMs and the observed background noise to map noisy features back to clean speech features. Instead of using specific knowledge at the model and state levels from the HMMs, which is hard to estimate, we pool model states into clusters as side information. Since each cluster encompasses more statistics than the original HMM states, there is a higher possibility that the newly formed probability density function at the cluster level can cover the underlying speech variation and generate appropriate particle filter samples for feature compensation. Additionally, a dynamic joint tracking framework to monitor the clean speech signal and noise simultaneously is introduced to obtain good noise statistics. In this approach, the information available from clean speech tracking can be effectively used for noise estimation. The availability of dynamic noise information can enhance the robustness of the algorithm in the case of large fluctuations in noise parameters within an utterance. Testing the proposed particle filter compensation (PFC) scheme on the Aurora 2 connected digit recognition task, we achieve an error reduction of 12.15% from the best multi-condition trained models using this integrated PF-HMM framework to estimate the cluster-based HMM state sequence information. Finally, we extended the PFC framework and evaluated it on a large-vocabulary recognition task, showing that PFC works well for large-vocabulary systems as well.
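The following sketch illustrates the particle-filter idea described above for a single frame: particles for the clean-speech feature are drawn from a prior density (a single Gaussian here, standing in for the cluster-level statistics pooled from the HMM states), weighted by how well they explain the observed noisy feature under an additive-noise model in the log-spectral domain, and averaged to give the compensated feature. The noise model, prior and all parameter values are assumptions for illustration, not the exact formulation of the thesis.

```python
# Hedged sketch of particle-filter feature compensation for one frame. The prior,
# noise statistics and observation variance are assumed values, not the thesis's.
import numpy as np

def compensate_frame(y, clean_mean, clean_cov, noise_mean, obs_var=0.1,
                     n_particles=500, rng=np.random.default_rng(0)):
    # 1. Propose clean-speech particles from the prior.
    particles = rng.multivariate_normal(clean_mean, clean_cov, size=n_particles)
    # 2. Predict the noisy observation each particle implies: y ~ log(exp(x) + exp(n)).
    predicted = np.logaddexp(particles, noise_mean)
    # 3. Weight particles by the likelihood of the actual observation y.
    log_w = -0.5 * np.sum((y - predicted) ** 2, axis=1) / obs_var
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # 4. The compensated (estimated clean) feature is the weighted particle mean.
    return w @ particles

dim = 10
rng = np.random.default_rng(1)
clean_hat = compensate_frame(y=rng.normal(size=dim), clean_mean=np.zeros(dim),
                             clean_cov=np.eye(dim), noise_mean=-1.0 * np.ones(dim))
```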
|
737 |
Speech recognition in children with unilateral and bilateral cochlear implants in quiet and in noise. Dawood, Gouwa, 12 1900.
Thesis (MAud (Interdisciplinary Health Sciences. Speech-Language and Hearing Therapy))--Stellenbosch University, 2008. / Individuals are increasingly undergoing bilateral cochlear implantation in an attempt to
benefit from binaural hearing. The main aim of the present study was to compare the
speech recognition of children fitted with bilateral cochlear implants, under binaural and
monaural listening conditions, in quiet and in noise. Ten children, ranging in age from 5
years 7 months to 15 years 4 months, were tested using the Children’s Realistic Index for
Speech Perception (CRISP). All the children were implanted with Nucleus multi-channel
cochlear implant systems in sequential operations and used the ACE coding strategy
bilaterally. The duration of cochlear implant use ranged from 4 years to 8 years 11
months for the first implant and 7 months to 3 years 5 months for the second implant.
Each child was tested in eight listening conditions, which included testing in the presence
and absence of competing speech. Performance with bilateral cochlear implants was not
statistically better than performance with the first cochlear implant, for both quiet and
noisy listening conditions. A ceiling effect may have resulted in the lack of a significant
finding as the scores obtained during unilateral conditions were already close to
maximum. A positive correlation between the length of use of the second cochlear
implant and speech recognition performance was established. The results of the present
study strongly indicated the need for testing paradigms to be devised which are more
sensitive and representative of the complex auditory environments in which cochlear
implant users communicate.
|
738 |
Measuring, refining and calibrating speaker and language information extracted from speech. Brummer, Niko, 12 1900.
Thesis (PhD (Electrical and Electronic Engineering))--University of Stellenbosch, 2010. / ENGLISH ABSTRACT: We propose a new methodology, based on proper scoring rules, for the evaluation
of the goodness of pattern recognizers with probabilistic outputs. The
recognizers of interest take an input, known to belong to one of a discrete set
of classes, and output a calibrated likelihood for each class. This is a generalization
of the traditional use of proper scoring rules to evaluate the goodness
of probability distributions. A recognizer with outputs in well-calibrated probability
distribution form can be applied to make cost-effective Bayes decisions
over a range of applications, having different cost functions. A recognizer
with likelihood output can additionally be employed for a wide range of prior
distributions for the to-be-recognized classes.
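The application-independence claim above can be made concrete with a short sketch: a recognizer that outputs well-calibrated class likelihoods can be converted into minimum-expected-cost Bayes decisions for any prior and cost matrix chosen at application time. The priors and costs in the example are assumptions, not values from the thesis.

```python
# Sketch of making Bayes decisions from calibrated class log-likelihoods: the same
# recognizer output can lead to different optimal decisions under different priors
# and cost functions, which is why likelihood outputs are application-independent.
import numpy as np

def bayes_decision(log_likelihoods, prior, cost):
    """cost[d, c] = cost of deciding class d when the true class is c."""
    log_post = np.log(prior) + log_likelihoods
    post = np.exp(log_post - np.logaddexp.reduce(log_post))   # normalized posterior
    expected_cost = cost @ post                                # one entry per decision
    return int(np.argmin(expected_cost))

llks = np.array([-1.2, -0.9])                # calibrated log-likelihoods for 2 classes (assumed)
print(bayes_decision(llks, prior=np.array([0.5, 0.5]), cost=np.array([[0, 1], [1, 0]])))
print(bayes_decision(llks, prior=np.array([0.99, 0.01]), cost=np.array([[0, 1], [10, 0]])))
```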
We use automatic speaker recognition and automatic spoken language
recognition as prototypes of this type of pattern recognizer. The traditional
evaluation methods in these fields, as represented by the series of NIST Speaker
and Language Recognition Evaluations, evaluate hard decisions made by the
recognizers. This makes these recognizers cost-and-prior-dependent. The proposed
methodology generalizes that of the NIST evaluations, allowing for the
evaluation of recognizers which are intended to be usefully applied over a wide
range of applications, having variable priors and costs.
The proposal includes a family of evaluation criteria, where each member
of the family is formed by a proper scoring rule. We emphasize two members
of this family: (i) A non-strict scoring rule, directly representing error-rate
at a given prior. (ii) The strict logarithmic scoring rule which represents
information content, or which equivalently represents summarized error-rate,
or expected cost, over a wide range of applications.
We further show how to form a family of secondary evaluation criteria,
which by contrasting with the primary criteria, form an analysis of the goodness
of calibration of the recognizers' likelihoods.
Finally, we show how to use the logarithmic scoring rule as an objective
function for the discriminative training of fusion and calibration of speaker
and language recognizers. / AFRIKAANSE OPSOMMING: We show how to represent, measure, calibrate and optimise the uncertainty in the output of automatic speaker recognition and language recognition systems. This makes the existing technology more accurate, more efficient and more generally applicable.
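As an illustration of the logarithmic scoring rule used above as a summary evaluation measure, the sketch below computes it in the form commonly reported for detection tasks (Cllr) from target and non-target log-likelihood-ratio scores. This is an illustrative reimplementation, not the evaluation code used in the thesis, and the scores shown are placeholders.

```python
# Illustrative reimplementation of the logarithmic scoring rule summarized as Cllr,
# computed from detection log-likelihood-ratio (LLR) scores. A well-calibrated system
# achieves a low Cllr; 1 bit is the reference value of a system that always outputs LLR = 0.
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    c_tar = np.mean(np.log1p(np.exp(-np.asarray(target_llrs))))
    c_non = np.mean(np.log1p(np.exp(np.asarray(nontarget_llrs))))
    return (c_tar + c_non) / (2.0 * np.log(2.0))

# Placeholder scores, only to show the call; real evaluations use per-trial LLRs.
print(cllr(target_llrs=[2.1, 3.0, 0.5, 4.2], nontarget_llrs=[-1.5, -3.2, -0.2, -2.8]))
```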
|
739 |
A comparison of Gaussian mixture variants with application to automatic phoneme recognition. Brand, Rinus, 12 1900.
Thesis (MScEng (Electrical and Electronic Engineering))--University of Stellenbosch, 2007. / The diagonal covariance Gaussian Probability Density Function (PDF) has been a very
popular choice as the base PDF for Automatic Speech Recognition (ASR) systems. The
only choices thus far have been between the spherical, diagonal and full covariance Gaussian
PDFs. These classic methods have been used for some time, but no single document could be
found that contains a comparative study of these methods in the context of Pattern Recognition
(PR).
There is also a gap between the complexity and speed of the diagonal and full covariance
Gaussian implementations. The accuracy, speed and size of these two methods differ
drastically. There is a need to find one or more models that cover
this area between these two classic methods.
The objectives of this thesis are to evaluate three new PDF types that fit into the area
between the diagonal and full covariance Gaussian implementations to broaden the choices
for ASR, to document a comparative study on the three classic methods and the newly
implemented methods (from previous work) and to construct a test system to evaluate these
methods on phoneme recognition.
The three classic density functions are examined and issues regarding the theory, implementation
and usefulness of each are discussed. A visual example of each is given to show
the impact of assumptions made by each (if any).
The three newly implemented PDFs are the Sparse-, Probabilistic Principal Component
Analysis- (PPCA) and Factor Analysis (FA) covariance Gaussian PDFs. The theory, implementation
and practical usefulness are shown and discussed. Again visual examples are
provided to show the difference in modelling methodologies.
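To make the gap between the diagonal and full covariance extremes concrete, the sketch below evaluates the log-density of a Gaussian whose covariance is constrained to the factor-analysis form Sigma = Lambda Lambda^T + Psi, with a d-by-q loading matrix (q much smaller than d) and a diagonal Psi; PPCA is the special case Psi = sigma^2 I. The dimensions and parameter values are assumed for illustration, and the density is evaluated naively by forming Sigma, whereas an efficient implementation would use the Woodbury identity.

```python
# Hedged sketch of the factor-analysis (FA) covariance Gaussian: parameter count grows
# with the number of factors q rather than with d**2, placing it between the diagonal
# and full covariance variants. Values below are assumed for illustration only.
import numpy as np

def fa_gaussian_logpdf(x, mean, loadings, psi_diag):
    d = len(mean)
    sigma = loadings @ loadings.T + np.diag(psi_diag)   # Sigma = Lambda Lambda^T + Psi
    diff = x - mean
    sign, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

rng = np.random.default_rng(0)
d, q = 39, 5                                            # e.g. 39-dim features, 5 factors (assumed)
logp = fa_gaussian_logpdf(rng.normal(size=d), np.zeros(d),
                          rng.normal(size=(d, q)) * 0.1, np.ones(d))
```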
The construction of a test system using two speech corpora is shown and includes issues
involving signal processing, PR and evaluation of the results. The NTIMIT and AST speech
corpora were used for initialising and training the test system. The use of the system to
evaluate the PDFs discussed in this work is explained.
The testing results of the three new methods confirmed that they indeed fill the gap
between the diagonal and full covariance Gaussians. In our tests the newly implemented
methods produced a relative improvement in error rate over a similarly implemented diagonal
covariance Gaussian of 0.3–4%, but took 35–78% longer to evaluate. When compared relative
to the full covariance Gaussian the error rates were 18–22% worse, but the evaluation times
were 61–70% faster. When all the methods were scaled to approximately the same accuracy,
all the above methods were 29–143% slower than the diagonal covariance Gaussian (excluding the spherical covariance method).
|
740 |
Kalbos atpažinimas kompiuteriu / Speech recognition by computer. Bardauskas, Justinas, 04 July 2014.
This work focuses on speech recognition by computer, the stages of pattern recognition, and the associated problems; it also aims to create a speech recognition tool. At the beginning, a general overview of the audio signal and of language concepts is given. This is followed by a presentation of the essential tasks of speech recognition and an introduction to the matrix algebra used in the algorithms described. Information is provided on how, and on what basis, features are extracted. LPC is often used for this purpose: it is one of the most popular algorithms for extracting features from the speech signal, so it is reviewed in this work, together with its modification WLPC. The text then presents the theory of how the extracted features are used. The section "Acoustic modeling" describes the recognition of speech units and one of the most commonly used acoustic modeling technologies, Hidden Markov Models, while the section "Speech modeling" describes language modeling, whose purpose is to correct the available data using dictionaries and the structure of the language being analyzed. The rest of the text concentrates on speech recognition using the spectrogram and on the implementation of a speech recognition system. Finally, experiments were carried out to assess the quality of the speech recognition.
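As an illustration of the LPC feature extraction reviewed above, the sketch below computes LPC coefficients for one frame using the autocorrelation method and the Levinson-Durbin recursion. The windowing, frame length and model order are assumptions for illustration, and the WLPC modification discussed in the work is not shown.

```python
# Minimal sketch of LPC analysis for one frame: window, autocorrelate, then run the
# Levinson-Durbin recursion. Returns the polynomial coefficients of A(z) = 1 + a1 z^-1 + ...
# and the residual prediction-error energy. Frame length, window and order are assumed.
import numpy as np

def lpc(signal, order):
    x = np.asarray(signal, dtype=float) * np.hamming(len(signal))
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                 # remaining prediction-error energy
    return a, err

rng = np.random.default_rng(0)
frame = rng.normal(size=400)                 # stand-in for a 25 ms frame at 16 kHz (assumed)
coeffs, residual_energy = lpc(frame, order=12)
```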
|