191
Interactive speech-driven facial animation. Hodgkinson, Warren. 18 July 2008.
One of the fastest developing areas in the entertainment industry is digital animation. Television programmes and movies frequently use 3D animations to enhance or replace actors and scenery. With the increase in computing power, research is also being done to apply these animations in an interactive manner. Two of the biggest obstacles to the success of these undertakings are control (manipulating the models) and realism. This text describes many of the ways to improve the control and realism aspects so that interactive animation becomes possible. Specifically, lip-synchronisation (driven by human speech) and various modelling and rendering techniques are discussed. A prototype showing that interactive animation is feasible is also described. / Mr. A. Hardy; Prof. S. von Solms
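As a rough illustration of the speech-driven lip-synchronisation idea, the sketch below maps timed phonemes to viseme keyframes. The phoneme-to-viseme table, keyframe format and function names are illustrative assumptions, not the prototype described in the thesis.

```python
# Hypothetical sketch: timed phonemes -> viseme keyframes for interactive lip-sync.
PHONEME_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "lip_bite", "v": "lip_bite",
    "a": "open", "o": "rounded", "u": "rounded",
    "s": "narrow", "i": "spread",
}

def phonemes_to_keyframes(timed_phonemes, default="rest"):
    """timed_phonemes: list of (phoneme, start_time_seconds) tuples."""
    keyframes = []
    for phoneme, start in timed_phonemes:
        # unknown phonemes fall back to a neutral mouth shape
        viseme = PHONEME_TO_VISEME.get(phoneme, default)
        keyframes.append({"time": start, "viseme": viseme})
    return keyframes

if __name__ == "__main__":
    # rough, invented timings for an "open"-like utterance
    print(phonemes_to_keyframes([("o", 0.00), ("p", 0.12), ("e", 0.20), ("n", 0.30)]))
```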
192
Recording and automatic detection of African elephant (Loxodonta africana) infrasonic rumbles. Venter, Petrus Jacobus. 01 October 2008.
The value of studying elephant vocalizations lies in the abundant information that can be retrieved from them. Recordings of elephant rumbles can be used by researchers to determine the size and composition of the herd, as well as the sexual state and emotional condition of an elephant. It is a difficult task for researchers to obtain large volumes of continuous recordings of elephant vocalizations. Recordings are normally analysed manually to identify the location of rumbles via the tedious and time-consuming methods of sped-up listening and visual evaluation of spectrograms. The application of speech processing to elephant vocalizations is a largely unexploited resource. The aim of this study was to contribute to the current body of knowledge and resources of elephant research by developing a tool for recording high volumes of continuous acoustic data in harsh natural conditions, as well as examining the possibilities of applying human speech processing techniques to elephant rumbles to achieve automatic detection of these rumbles in recordings. The recording tool was designed and implemented as an elephant recording collar with an onboard data storage capacity of 128 gigabytes, enough memory to record sound data continuously for a period of nine months. Data is stored in the wave file format, and the device can navigate and control the FAT32 file system so that the files can be read and downloaded to a personal computer. The collar can also stamp sound files with the time and date, ambient temperature and GPS coordinates. Several different options for microphone placement and protection were tested experimentally to find an acceptable solution. A relevant voice activity detection algorithm was chosen as a base for the automatic detection of infrasonic elephant rumbles. The chosen algorithm is based on a robust pitch determination algorithm that has been experimentally verified to function correctly at signal-to-noise ratios as low as -8 dB when more than four harmonic structures exist in a sound. The algorithm was modified for elephant rumbles and was tested with previously recorded elephant vocalization data. The results obtained suggest that the algorithm can accurately detect elephant rumbles in recordings. The number of false alarms and undetected calls increases when recordings are contaminated with unwanted noise that contains harmonic structures or when the harmonic nature of a rumble is lost. Data obtained from the recording collar is less prone to such contamination than far-field recordings, and the automatic detection algorithm should provide an accurate tool for detecting any rumbles that appear in the recordings. / Dissertation (MEng)--University of Pretoria, 2008. / Electrical, Electronic and Computer Engineering / unrestricted
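As a rough sketch of the harmonic-structure detection idea described above (not the thesis's actual modified pitch-determination algorithm), the following checks one long audio frame for several harmonics of a low-frequency fundamental. The frequency range, frame length assumption and thresholds are illustrative only.

```python
# Minimal sketch of harmonic-based rumble detection on a mono NumPy frame.
# Assumes a frame several seconds long so low-frequency bins are resolved.
import numpy as np

def frame_has_rumble(frame, fs, f0_range=(15.0, 35.0), n_harmonics=4, ratio=3.0):
    """frame: 1-D array of samples; fs: sampling rate in Hz."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    noise_floor = np.median(spectrum) + 1e-12

    def peak(f):
        # strongest bin within +/- 1 Hz of the target frequency f
        band = (freqs > f - 1.0) & (freqs < f + 1.0)
        return spectrum[band].max() if band.any() else 0.0

    # scan candidate fundamentals; require all harmonics well above the noise floor
    for f0 in np.arange(f0_range[0], f0_range[1], 0.5):
        harmonics = [peak(k * f0) for k in range(1, n_harmonics + 1)]
        if all(h > ratio * noise_floor for h in harmonics):
            return True
    return False
```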
193
Exploiting phonological constraints and automatic identification of speaker classes for Arabic speech recognition. Alsharhan, Iman. January 2014.
The aim of this thesis is to investigate a number of factors that could affect the performance of an Arabic automatic speech understanding (ASU) system. The work described in this thesis belongs to the automatic speech recognition (ASR) phase, but the fact that it is part of an ASU project rather than a stand-alone piece of work on ASR influences the way in which it is carried out. Our main concern in this work is to determine the best way to exploit the phonological properties of the Arabic language in order to improve the performance of the speech recogniser. One of the main challenges facing the processing of Arabic is the effect of the local context, which induces changes in the phonetic representation of a given text, thereby causing the recognition engine to misclassify it. The proposed solution is to develop a set of language-dependent grapheme-to-allophone rules that can predict such allophonic variations and eventually provide a phonetic transcription that is sensitive to the local context for the ASR system. The novel aspect of this method is that the pronunciation of each word is extracted directly from a context-sensitive phonetic transcription rather than a predefined dictionary that typically does not reflect the actual pronunciation of the word. Besides investigating the boundary effect on pronunciation, the research also seeks to address the problem of Arabic's complex morphology. Two solutions are proposed to tackle this problem, namely, using underspecified phonetic transcription to build the system, and using phonemes instead of words to build the hidden Markov models (HMMs). The research also investigates several technical settings that might affect the system's performance. These include training on sub-populations to minimise the variation caused by training on the main undifferentiated population, as well as investigating the correlation between training set size and the performance of the ASR system.
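The grapheme-to-allophone idea can be pictured as ordered, context-sensitive rewrite rules applied to a transcription. The sketch below is a toy illustration; the rules, romanization and example are invented placeholders, not the thesis's actual Arabic rule set.

```python
# Hypothetical context-sensitive rewrite rules over a romanized transcription:
# each entry is (regex pattern encoding the local context, replacement).
import re

RULES = [
    (r"n(?=\s?[bm])", "m"),   # invented: /n/ assimilates to [m] before labials, even across a word boundary
    (r"a(?=[qr])", "A"),      # invented: /a/ backed near emphatic-like consonants
    (r"aa", "a:"),            # invented: long-vowel notation
]

def to_allophones(transcription: str) -> str:
    """Apply the rules in order; later rules see the output of earlier ones."""
    for pattern, replacement in RULES:
        transcription = re.sub(pattern, replacement, transcription)
    return transcription

if __name__ == "__main__":
    print(to_allophones("min baab qaala"))  # illustrative input only -> "mim ba:b qa:la"
```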
194
Estudo de um sistema de conversão texto-fala baseado em HMM / Study of a HMM-based text-to-speech system. Carvalho, Sarah Negreiros de, 1985-. 22 August 2018.
Advisor: Fábio Violaro / Dissertation (Master's) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação
Previous issue date: 2013 / Abstract: With the continuous development of technology, there is a growing demand for text-to-speech systems that are able to speak like humans, in order to integrate them into the most diverse applications, whether in the field of automation and robotics, for the accessibility of people with disabilities, or for culture and leisure activities. Speech synthesis based on hidden Markov models (HMMs) shows promise in addressing this need. Its statistical and parametric nature makes it a flexible system, capable of adapting artificial voices, inserting emotions into speech and producing artificial speech of good quality using a limited amount of speech data for HMM training. This thesis presents a study of the HMM-based speech synthesis system (HTS), describing the steps involved in training the HMM models and generating the speech signal. The spectral, pitch and duration models that make up the context-dependent phoneme HMMs are presented, together with the various techniques for structuring them. Some of the problems encountered in HTS, such as the muffled and monotone character of the artificial speech, are analysed along with some of the techniques proposed to improve the final quality of the synthesized speech signal. / Master's / Telecommunications and Telematics / Master in Electrical Engineering
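A highly simplified sketch of how context-dependent models are selected and concatenated at synthesis time is given below. The label format, model store and fallback are assumptions for illustration and do not reflect the actual HTS full-context labels or decision-tree clustering.

```python
# Toy sketch: context-dependent model lookup and frame concatenation.
def full_context_label(prev_ph, ph, next_ph, pos, total):
    # simplified stand-in for an HTS full-context label
    return f"{prev_ph}-{ph}+{next_ph}/pos:{pos}_{total}"

def synthesize(phonemes, models, fallback_key="default"):
    """models: dict label -> (n_frames, list of parameter frames); returns frame list."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    frames = []
    for i, ph in enumerate(phonemes, start=1):
        label = full_context_label(padded[i - 1], ph, padded[i + 1], i, len(phonemes))
        n_frames, frame_seq = models.get(label, models[fallback_key])
        frames.extend(frame_seq[:n_frames])  # the duration model decides how many frames
    return frames

if __name__ == "__main__":
    models = {"default": (2, [{"mcep": 0.0, "f0": 120.0}] * 5)}  # placeholder parameters
    print(len(synthesize(["a", "b", "a"], models)))  # -> 6 frames
```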
195
Characterizing Dysarthric Speech with Transfer Learning. January 2020.
Speech is known to serve as an early indicator of neurological decline, particularly in motor diseases. There is significant interest in developing automated, objective signal analytics that detect clinically relevant changes, and in evaluating these algorithms against the existing gold standard: perceptual evaluation by trained speech and language pathologists. Hypernasality, the result of poor control of the velopharyngeal flap (the soft palate regulating airflow between the oral and nasal cavities), is one such speech symptom of interest, as precise velopharyngeal control is difficult to achieve under neuromuscular disorders. However, a host of co-modulating variables give hypernasal speech a complex and highly variable acoustic signature, making it difficult for skilled clinicians to assess and for automated systems to evaluate. Previous work in rating hypernasality from speech relies either on engineered features based on statistical signal processing or on machine learning models trained end-to-end on clinical ratings of disordered speech examples. Engineered features often fail to capture the complex acoustic patterns associated with hypernasality, while end-to-end methods tend to overfit to the small datasets on which they are trained. In this thesis, I present a set of acoustic features, models, and strategies for characterizing hypernasality in dysarthric speech that split the difference between these two approaches, with the aim of capturing the complex perceptual character of hypernasality without overfitting to the small datasets available. The features are based on acoustic models trained on a large corpus of healthy speech, integrating expert knowledge to capture known perceptual characteristics of hypernasal speech. They are then used in relatively simple linear models to predict clinician hypernasality scores. These simple models are robust, generalizing across diseases and outperforming a comprehensive set of baselines in accuracy and correlation. This novel approach represents a new state of the art in objective hypernasality assessment. / Dissertation/Thesis / Masters Thesis Electrical Engineering 2020
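The modelling step described above (simple linear models over features derived from acoustic models trained on healthy speech) can be sketched as follows. The feature matrix and ratings here are random placeholders, and Ridge regression with cross-validation is an assumed, illustrative choice rather than the thesis's exact set-up.

```python
# Minimal sketch: linear model predicting clinician hypernasality ratings
# from precomputed acoustic-model features (placeholder data).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 12))                        # placeholder per-utterance features
y = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=80)   # placeholder clinician ratings

model = Ridge(alpha=1.0)                             # simple, robust linear model
r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2: %.2f" % r2.mean())
```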
196
Speech Recognition Using a Synthesized Codebook. Smith, Lloyd A. (Lloyd Allen). 08 1900.
Speech sounds generated by a simple waveform synthesizer were used to create a vector quantization codebook for use in speech recognition. Recognition was tested on the TI-20 isolated-word database using a conventional DTW matching algorithm. Input speech was band-limited to 300-3300 Hz, then passed through the Scott Instruments Corp. Coretechs process, implemented on a VET3 speech terminal, to create the speech representation for matching. Synthesized sounds were processed in software by a VET3 signal processing emulation program. Emulation and recognition were performed on a DEC VAX 11/750.
The experiments were organized in two series. A preliminary experiment, using no vector quantization, provided a baseline for comparison.
The original codebook contained 109 vectors, all derived from two-formant synthesized sounds. This codebook was decimated over the course of the first series of experiments, based on the number of times each vector was used in quantizing the training data for the previous experiment, in order to determine the smallest subset of vectors suitable for coding the speech database. The second series of experiments altered several test conditions in order to evaluate the applicability of the minimal synthesized codebook to conventional codebook training.
The baseline recognition rate was 97%. The recognition rate for synthesized codebooks was approximately 92% for sizes ranging from 109 to 16 vectors. Accuracy for smaller codebooks was slightly less than 90%. Error analysis showed that the primary loss in dropping below 16 vectors was in coding of voiced sounds with high frequency second formants. The 16 vector synthesized codebook was chosen as the seed for the second series of experiments.
After one training iteration, and using a normalized distortion score, trained codebooks performed with an accuracy of 95.1%. When codebooks were trained and tested on different sets of speakers, accuracy was 94.9%, indicating that very little speaker dependence was introduced by the training.
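A minimal sketch of the pipeline described above (vector quantization against a codebook followed by DTW template matching) is given below. The features, codebook and per-symbol local cost are illustrative assumptions, not the VET3/Coretechs representation used in the experiments.

```python
# Sketch: codebook quantization of frame features, then DTW over codeword sequences.
import numpy as np

def quantize(frames, codebook):
    # index of the nearest codebook vector for every frame (Euclidean distance)
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

def dtw_distance(seq_a, seq_b, local_cost=lambda a, b: float(a != b)):
    # classic dynamic-programming DTW; 0/1 local cost between codeword symbols
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = local_cost(seq_a[i - 1], seq_b[j - 1])
            cost[i, j] = c + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(test_frames, codebook, templates):
    """templates: dict word -> codeword sequence from a training utterance."""
    test_seq = quantize(test_frames, codebook)
    return min(templates, key=lambda w: dtw_distance(test_seq, templates[w]))
```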
197
Odhad formantových kmitočtů pomocí strojového učení / Estimation of formant frequencies using machine learning. Káčerová, Erika. January 2019.
This Master's thesis deals with the issue of formant extraction. A system of scripts in the Matlab environment is created to generate values of the first three formant frequencies from speech recordings with the use of Praat and Snack (WaveSurfer). Mel-frequency cepstral coefficients and linear predictive coefficients are extracted from the audio files and added to the database. This database is then used to train a neural network. Finally, the designed neural network is tested.
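A minimal sketch of the regression set-up (frame-level MFCC/LPC features mapped to the first three formant frequencies by a small neural network) is shown below. The data is a random placeholder and the scikit-learn MLP is an assumed stand-in for the network actually used; real targets would come from the Praat/Snack formant tracks.

```python
# Sketch: multi-output regression of F1-F3 from placeholder frame features.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 25))                                        # placeholder MFCC + LPC features
F = rng.uniform([300, 900, 1900], [900, 2500, 3500], size=(500, 3))   # placeholder F1-F3 targets (Hz)

X_train, X_test, F_train, F_test = train_test_split(X, F, test_size=0.2, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(X_train, F_train)
print("test R^2:", net.score(X_test, F_test))
```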
198
Modul pro výuku výslovnosti cizích jazyků / Module for Pronunciation Training and Foreign Language Learning. Kudláč, Vladan. January 2021.
The aim of this thesis is to improve the implementation of a mobile-application module for pronunciation training, to identify places suitable for optimization, and to carry out optimizations aimed at increasing accuracy, reducing processing time and reducing the memory requirements of processing.
199
Webový prohlížeč audio/video záznamů přednášek: převod prohlížeče na MySQL databázi / Web Based Audio/Video Lecture Browser: Porting of the Browser to MySQL Database. Janovič, Jakub. January 2010.
This project deals with a web-based lecture browser whose goal is to make knowledge easier to acquire through multimedia. It presents an existing lecture browser, created for a diploma thesis at FIT VUT Brno, and describes the technologies that are used, and that will be used, to migrate the browser to a MySQL database and to develop a speech transcription module. The reader is acquainted with an analysis and model of the new application. Furthermore, implementation methods for development and subsequent testing are discussed. The project closes with a conclusion on the future development of web-based lecture browsers.
200
Phoneme-based Video Indexing Using Phonetic Disparity Search. Barth, Carlos Leon. 01 January 2010.
This dissertation presents and evaluates an approach to the video indexing problem by investigating a categorization method that transcribes audio content through Automatic Speech Recognition (ASR) combined with Dynamic Contextualization (DC), Phonetic Disparity Search (PDS) and Metaphone indexation. The suggested approach applies genome pattern matching algorithms with computational summarization to build a database infrastructure that provides an indexed summary of the original audio content. PDS complements the contextual phoneme indexing approach by optimizing topic search performance and accuracy in large video content structures. A prototype was established to translate news broadcast video into text and phonemes automatically by using ASR utterance conversions. Each extracted phonetic utterance was then categorized, converted to Metaphones, and stored in a repository with contextual topical information attached and indexed for subsequent search analysis. Following the original design strategy, a custom parallel interface was built to measure the capabilities of dissimilar phonetic queries and provide an interface for result analysis. The postulated solution provides evidence of superior topic matching when compared to traditional word and phoneme search methods. Experimental results demonstrate that PDS can be 3.7% better than the same phoneme query, while Metaphone search proved to be 154.6% better than the same phoneme search and 68.1% better than the equivalent word search.
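The phonetic-key indexing idea can be sketched as below. The simple_key() function is a deliberately crude stand-in and is not the real Metaphone algorithm; the index layout and matching scheme are illustrative assumptions rather than the dissertation's implementation.

```python
# Sketch: index transcripts by a crude phonetic key so misspelled or
# mis-recognized query words can still hit the right videos.
import re
from collections import defaultdict

def simple_key(word: str) -> str:
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return ""
    key = w[0] + re.sub(r"[aeiouhwy]", "", w[1:])   # keep first letter, drop weak letters
    return re.sub(r"(.)\1+", r"\1", key)            # collapse repeated letters

def build_index(transcripts):
    """transcripts: dict video_id -> transcript string; returns key -> set of video ids."""
    index = defaultdict(set)
    for vid, text in transcripts.items():
        for word in text.split():
            index[simple_key(word)].add(vid)
    return index

def search(index, query):
    hits = [index.get(simple_key(w), set()) for w in query.split()]
    return set.intersection(*hits) if hits else set()

if __name__ == "__main__":
    index = build_index({"vid1": "budget meeting tonight", "vid2": "weather report"})
    print(search(index, "budgett meating"))  # misspelled query still maps to {'vid1'}
```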