411 |
Speaker normalisation for large vocabulary multiparty conversational speech recognition
Garau, Giulia, January 2009 (has links)
One of the main problems faced by automatic speech recognition is the variability of the testing conditions. This is due both to the acoustic conditions (different transmission channels, recording devices, noise, etc.) and to the variability of speech across speakers (different accents, coarticulation of phonemes and different vocal tract characteristics). Vocal tract length normalisation (VTLN) aims at normalising the acoustic signal, making it independent of the vocal tract length. This is done by a speaker-specific warping of the frequency axis, parameterised through a warping factor.

In this thesis the application of VTLN to multiparty conversational speech was investigated, focusing on the meeting domain. This is a challenging task showing great variability of the speech acoustics both across different speakers and across time for a given speaker. VTL, the distance between the lips and the glottis, varies over time. We observed that the warping factors estimated using maximum likelihood seem to be context dependent: they appear to be influenced by the current conversational partner and are correlated with the behaviour of formant positions and the pitch. This is because VTL also influences the frequency of vibration of the vocal cords and thus the pitch.

In this thesis we also investigated pitch-adaptive acoustic features with the goal of further improving the speaker normalisation provided by VTLN. We explored the use of acoustic features obtained using a pitch-adaptive analysis in combination with conventional features such as Mel frequency cepstral coefficients. These spectral representations were combined both at the acoustic feature level, using heteroscedastic linear discriminant analysis (HLDA), and at the system level, using ROVER. We evaluated this approach on a challenging large vocabulary speech recognition task: multiparty meeting transcription. We found that VTLN benefits the most from pitch-adaptive features. Our experiments also suggested that combining conventional and pitch-adaptive acoustic features using HLDA results in a consistent, significant decrease in the word error rate across all the tasks, and that combining at the system level using ROVER yields a further significant improvement.

Further experiments compared the use of a pitch-adaptive spectral representation with the adoption of a smoothed spectrogram for the extraction of cepstral coefficients. It was found that pitch-adaptive spectral analysis, providing a representation less affected by pitch artefacts (especially for high-pitched speakers), delivers features with improved speaker independence; this also proved advantageous when HLDA is applied. The combination of a pitch-adaptive spectral representation and VTLN-based speaker normalisation in the context of LVCSR for multiparty conversational speech led to more speaker-independent acoustic models, improving the overall recognition performance.
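To make the warping-factor idea concrete, the following is a minimal sketch of maximum-likelihood VTLN estimation by grid search. The warped front end and acoustic-model scorer are hypothetical callables passed in by the caller, and the candidate range 0.80-1.20 is an illustrative assumption rather than the configuration used in the thesis.

```python
# Hedged sketch of maximum-likelihood VTLN warping-factor estimation.
# "extract_warped_mfccs" and "acoustic_model_loglik" are hypothetical
# placeholders standing in for a real front end and acoustic model.
import numpy as np

def estimate_warp_factor(waveform, transcript, extract_warped_mfccs,
                         acoustic_model_loglik,
                         candidates=np.arange(0.80, 1.21, 0.02)):
    """Grid-search the frequency-warping factor that maximises the
    likelihood of a speaker's data under the current acoustic model."""
    best_alpha, best_ll = 1.0, -np.inf
    for alpha in candidates:
        feats = extract_warped_mfccs(waveform, warp=alpha)   # warped front end
        ll = acoustic_model_loglik(feats, transcript)        # e.g. forced-alignment score
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha
```

In practice the chosen factor is then applied to all of that speaker's features before training or decoding, which is what makes its observed dependence on the conversational context notable.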
|
412 |
Linear dynamic models for automatic speech recognition
Frankel, Joe, January 2004 (has links)
The majority of automatic speech recognition (ASR) systems rely on hidden Markov models (HMMs), in which the output distribution associated with each state is modelled by a mixture of diagonal-covariance Gaussians. Dynamic information is typically included by appending time-derivatives to feature vectors. This approach, whilst successful, makes the false assumption of framewise independence of the augmented feature vectors and ignores the spatial correlations in the parametrised speech signal.

This dissertation seeks to address these shortcomings by exploring acoustic modelling for ASR with an application of a form of state-space model, the linear dynamic model (LDM). Rather than modelling individual frames of data, LDMs characterise entire segments of speech. An auto-regressive state evolution through a continuous space gives a Markovian model of the underlying dynamics, and spatial correlations between feature dimensions are absorbed into the structure of the observation process. LDMs have been applied to speech recognition before; however, a smoothed Gauss-Markov form was used which ignored the potential for subspace modelling. The continuous dynamical state means that information is passed along the length of each segment. Furthermore, if the state is allowed to be continuous across segment boundaries, long-range dependencies are built into the system and the assumption of independence of successive segments is loosened. The state provides an explicit model of temporal correlation, which sets this approach apart from frame-based and some segment-based models where the ordering of the data is unimportant. The benefits of such a model are examined both within and between segments.

LDMs are well suited to modelling smoothly varying, continuous, yet noisy trajectories such as those found in measured articulatory data. Using speaker-dependent data from the MOCHA corpus, the performance of systems which model acoustic, articulatory, and combined acoustic-articulatory features is compared. As well as measured articulatory parameters, experiments use the output of neural networks trained to perform an articulatory inversion mapping. The speaker-independent TIMIT corpus provides the basis for larger-scale acoustic-only experiments. Classification tasks provide an ideal means to compare modelling choices without the confounding influence of recognition search errors, and are used to explore issues such as the choice of state dimension, front-end acoustic parametrisation and parameter initialisation.

Recognition for segment models is typically more computationally expensive than for frame-based models. Unlike frame-level models, it is not always possible to share likelihood calculations for observation sequences which occur within hypothesised segments that have different start and end times. Furthermore, the Viterbi criterion is not necessarily applicable at the frame level. This work introduces a novel approach to decoding for segment models in the form of a stack decoder with A* search. Such a scheme allows flexibility in the choice of acoustic and language models, since the Viterbi criterion is not integral to the search and hypothesis generation is independent of the particular language model. Furthermore, the time-asynchronous ordering of the search means that only likely paths are extended, and so a minimum number of models are evaluated. The decoder is used to give full recognition results for feature sets derived from the MOCHA and TIMIT corpora.
Conventional train/test divisions and choice of language model are used so that results can be directly compared to those in other studies. The decoder is also used to implement Viterbi training, in which model parameters are alternately updated and then used to re-align the training data.
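As a rough illustration of how an LDM scores a speech segment, here is a minimal Kalman-filter likelihood sketch for the state-space form x_t = A x_{t-1} + w, y_t = H x_t + v. All parameter matrices are assumed inputs; none of the values correspond to models trained in the dissertation.

```python
# Hedged sketch of the segment likelihood under a linear dynamic model:
# a hidden continuous state evolves auto-regressively, and each observed
# acoustic (or articulatory) frame is a noisy linear projection of it.
import numpy as np

def ldm_log_likelihood(Y, A, Q, H, R, mu0, P0):
    """Kalman-filter log-likelihood of an observation segment Y (T x d)."""
    x, P, ll = mu0, P0, 0.0
    for y in Y:
        # predict the hidden state one frame forward
        x, P = A @ x, A @ P @ A.T + Q
        # innovation of the observed frame and its covariance
        S = H @ P @ H.T + R
        e = y - H @ x
        ll += -0.5 * (e @ np.linalg.solve(S, e)
                      + np.linalg.slogdet(S)[1] + len(y) * np.log(2 * np.pi))
        # update the state estimate with the Kalman gain
        K = P @ H.T @ np.linalg.inv(S)
        x, P = x + K @ e, (np.eye(len(x)) - K @ H) @ P
    return ll
```

Because the state is filtered forward through the whole segment, temporal correlation is modelled explicitly rather than through appended time-derivatives.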
|
413 |
End-to-End Speech Recognition Models
Chan, William, 01 December 2016 (has links)
For the past few decades, the bane of Automatic Speech Recognition (ASR) systems has been phonemes and Hidden Markov Models (HMMs). HMMs assume conditional independence between observations, and the reliance on explicit phonetic representations requires expensive handcrafted pronunciation dictionaries. Learning often proceeds via detached proxy problems, and in particular there is a disconnect between acoustic model performance and actual speech recognition performance. Connectionist Temporal Classification (CTC) character models were recently proposed as an attempt to solve some of these issues, namely jointly learning the pronunciation model and acoustic model. However, HMM and CTC models still suffer from conditional independence assumptions and must rely heavily on language models during decoding.

In this thesis, we question the traditional paradigm of ASR and highlight the limitations of HMM and CTC models. We propose a novel approach to ASR with neural attention models, and we directly optimize speech transcriptions. Our proposed method is not only an end-to-end trained system but also an end-to-end model. The end-to-end model jointly learns all the traditional components of a speech recognition system: the pronunciation model, acoustic model and language model. Our model can directly emit English/Chinese characters or even word pieces given the audio signal. There is no need for explicit phonetic representations, intermediate heuristic loss functions or conditional independence assumptions.

We demonstrate our end-to-end speech recognition model on various ASR tasks. We show competitive results compared to a state-of-the-art HMM-based system on the Google voice search task. We demonstrate an online end-to-end Chinese Mandarin model and show how to jointly optimize the Pinyin transcriptions during training. Finally, we also show state-of-the-art results on the Wall Street Journal ASR task compared to other end-to-end models.
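The core mechanism that lets such a model condition each emitted character on the whole audio signal is attention. The sketch below shows one attention step in the style of an attention-based encoder-decoder; the dot-product scoring, shapes, and surrounding training machinery are illustrative assumptions, not the thesis's exact architecture.

```python
# Hedged sketch of the attention step in an end-to-end recogniser: at each
# output step the decoder state attends over the encoder's representation of
# the audio and receives a context vector for predicting the next character.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attend(decoder_state, encoder_states):
    """Return a context vector: a weighted sum over encoder time steps."""
    scores = encoder_states @ decoder_state      # (T,) similarity per audio frame
    alphas = softmax(scores)                     # attention distribution over time
    context = alphas @ encoder_states            # weighted sum of encoder frames
    return context, alphas

# The next character is then predicted from [decoder_state, context] and fed
# back in, so pronunciation, acoustic and language modelling are learned
# jointly rather than as separate components with heuristic proxy losses.
```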
|
414 |
Development of an English public transport information dialogue system
Vejman, Martin, January 2015 (has links)
This thesis presents the development of an English spoken dialogue system based on the Alex dialogue system framework. The work describes the adaptation of the framework's components to a different domain and language. The system provides public transport information for New York. The work involves creating a statistical model and deploying a custom Kaldi speech recognizer, whose performance was better than that of the Google Speech API; the comparison was based on subjective user satisfaction acquired through crowdsourcing.
|
415 |
Speech recognition with KALDI / Rozpoznávání řeči pomocí KALDI
Plátek, Ondřej, January 2014 (has links)
The topic of this thesis is the implementation of an efficient decoder for the Kaldi ASR training toolkit (http://kaldi.sourceforge.net/). Kaldi already ships with decoders, but they are not convenient for dialogue systems. The main goal of this thesis is to develop a real-time decoder for a dialogue system, one that minimizes latency and optimizes speed. Methods used for speeding up the decoder include, but are not limited to, multi-threaded decoding and the use of GPU cards for general-purpose computation. Part of this work is devoted to training an acoustic model and testing it in the "Vystadial" dialogue system.
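For intuition about why latency matters here, the following is a minimal sketch of the kind of chunk-by-chunk decoding loop an online dialogue-system decoder needs. The "OnlineDecoder"-style interface and the chunk size are hypothetical placeholders, not the actual Kaldi API.

```python
# Hedged sketch of low-latency streaming decoding for a dialogue system:
# feed short audio chunks and surface partial hypotheses immediately,
# instead of waiting for the end of the utterance. "decoder" is a
# hypothetical object, not a real Kaldi class.
def stream_decode(decoder, mic_chunks):
    """Yield partial hypotheses as audio chunks arrive, then the final one."""
    for chunk in mic_chunks:                   # e.g. ~250 ms of PCM samples
        decoder.accept_waveform(chunk)         # incremental features + search
        yield decoder.partial_hypothesis()     # dialogue manager can react now
    decoder.finalize()
    yield decoder.best_hypothesis()            # final result once audio ends
```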
|
416 |
Adaptive threshold optimisation for colour-based lip segmentation in automatic lip-reading systems
Gritzman, Ashley Daniel, January 2016 (has links)
A thesis submitted to the Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Doctor of Philosophy. Johannesburg, September 2016 / Having survived the ordeal of a laryngectomy, the patient must come to terms with the resulting loss of speech. With recent advances in portable computing power, automatic lip-reading (ALR) may become a viable approach to voice restoration. This thesis addresses the image processing aspect of ALR and focuses on three contributions to colour-based lip segmentation.

The first contribution concerns the colour transform used to enhance the contrast between the lips and skin. This thesis presents the most comprehensive study to date by measuring the overlap between lip and skin histograms for 33 different colour transforms. The hue component of HSV obtains the lowest overlap of 6.15%, and results show that selecting the correct transform can increase the segmentation accuracy by up to three times.

The second contribution is the development of a new lip segmentation algorithm that utilises the best colour transforms from the comparative study. The algorithm is tested on 895 images and achieves a percentage overlap (OL) of 92.23% and a segmentation error (SE) of 7.39%.

The third contribution focuses on the impact of the histogram threshold on the segmentation accuracy and introduces a novel technique called Adaptive Threshold Optimisation (ATO) to select a better threshold value. The first stage of ATO incorporates SVR to train the lip shape model. ATO then uses feedback of shape information to validate and optimise the threshold. After applying ATO, the SE decreases from 7.65% to 6.50%, corresponding to an absolute improvement of 1.15 pp or a relative improvement of 15.1%. While this thesis concerns lip segmentation in particular, ATO is a threshold selection technique that can be used in various segmentation applications.
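To illustrate the colour-based part of the pipeline, here is a minimal sketch of hue-threshold lip segmentation in OpenCV: convert to HSV, threshold the hue channel, and keep the largest connected region. The hue bounds are illustrative assumptions, and the threshold is exactly the quantity ATO would adapt using shape feedback.

```python
# Hedged sketch of hue-based lip segmentation. Hue bounds are illustrative,
# not the thesis's values; ATO would tune the threshold per image.
import cv2
import numpy as np

def segment_lips(bgr_image, hue_lo=160, hue_hi=179):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0]                                   # OpenCV hue range is 0-179
    mask = ((hue >= hue_lo) & (hue <= hue_hi)).astype(np.uint8)
    # keep only the largest connected component, assumed to be the lip blob
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    if n <= 1:
        return mask
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return (labels == largest).astype(np.uint8)
```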
|
417 |
Triboelectricity and Piezoelectricity Based 3D Printed Bio-skin Sensor for Capturing Subtle Human Movements
Mo Lv (6640484), 14 May 2019 (has links)
This thesis presents the fabrication of two types of soft wearable electrical devices, utilizing 3D printing techniques. The devices are capable of detecting human heart pulse waves and sound waves for health evaluation and speech recognition.
|
418 |
How Speech Recognition can be Implemented in a VR Helicopter Door-Gunner Simulator : A Qualitative Study / Hur röststyrning kan implementeras i en VR-simulator för dörrskyttar i helikoptrar
Löfstrand, Alexander, January 2019 (has links)
This study investigates the possibilities for implementing speech recognition software in order to ease the use of a virtual reality simulator. The Pitch Door-Gunner simulator is described, followed by a general discussion about simulators and simulator environments. Previous research and theories regarding speech recognition technology are presented, and aspects relevant for training effects, such as stress, are accounted for. Interviews are conducted with military personnel in order to better grasp how the simulator is actually used and how it can be used to elicit learning. An implementation with a subsequent feasibility test is conducted to investigate practical limitations and give more insight into the possibilities of using speech recognition in the simulator. The results show that it is feasible to use speech recognition software to control simple functions; more elaborate functionality requires further research. Furthermore, the study discusses which functions would be favourable to control, considering the pros and cons of speech recognition. It is suggested that speech recognition can be important as a tool to make usage more convenient and to support the instructor, for example by bookmarking certain segments of training for later review with the help of voice commands.
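As a concrete picture of what "controlling simple functions" by voice could look like, the sketch below maps recognised phrases to simulator actions such as bookmarking. The command names and the simulator interface are hypothetical, not part of the Pitch simulator's actual API.

```python
# Hedged sketch of dispatching recognised utterances to simple simulator
# commands, e.g. bookmarking a training segment for later instructor review.
COMMANDS = {
    "bookmark": "add_bookmark",        # mark the current moment in the exercise
    "pause exercise": "pause",
    "resume exercise": "resume",
}

def handle_transcript(transcript, simulator):
    """Call the simulator method matching a recognised phrase, if any."""
    text = transcript.lower().strip()
    for phrase, action in COMMANDS.items():
        if phrase in text:
            getattr(simulator, action)()   # e.g. simulator.add_bookmark()
            return action
    return None                            # no command recognised; ignore
```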
|
419 |
Face recognition and speech recognition for access control
Tran, Thao; Tkauc, Nathalie, January 2019 (has links)
This project is a collaboration with the company JayWay in Halmstad. In order to enter the office today, a tag-key is needed for the employees and a doorbell for the guests. If someone rings the doorbell, someone on the inside has to open the door manually, which is considered a disturbance during work time. The purpose of the project is to minimize the disturbances in the office. The goal is to develop a system that uses face recognition and speech-to-text to control the lock system for the entrance door.

The components used for the project are two Raspberry Pis, a 7-inch LCD touch display, a Raspberry Pi Camera Module V2, an external sound card, a microphone and a speaker. The whole project was written in Python; Amazon Web Services (AWS) was used for storage and the face recognition, while speech-to-text was provided by Google.

The system is divided into three functions, for employees, guests and deliveries. The employee function has two authentication steps, the face recognition and a randomly generated code that needs to be confirmed, in order to avoid biometric spoofing. The guest function includes the speech-to-text service: the guest states the name of the employee they want to meet, and that employee is then notified. The delivery function informs the specific persons in the office who are responsible for deliveries by sending a notification.

The tests show that the system always matches the right person when using the face recognition. They also show what the threshold for the face recognition can be set to in order to make sure that only authorized people enter the office. Using the two-step authentication, the face recognition and the code, makes the system secure and protects it against spoofing. One downside is that it is an extra step that takes time. The speech-to-text is set to Swedish and works quite well for Swedish-speaking persons. However, for a multicultural company it can be hard to use the speech-to-text service. It can also be hard for the service to listen and translate if there is a lot of background noise or if several people speak at the same time.
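The employee flow described above can be summarised in a short sketch: a face match first, then confirmation of a one-time code as the anti-spoofing step. The functions "match_face", "show_code_on_display", "read_confirmed_code" and "unlock_door" are hypothetical placeholders for the AWS-backed face comparison and the Raspberry Pi peripherals, and the threshold value is an illustrative assumption.

```python
# Hedged sketch of the two-step employee entry flow: face recognition plus
# confirmation of a randomly generated code. All callables are hypothetical
# placeholders, not the project's actual AWS or hardware interfaces.
import secrets

def employee_entry(camera_frame, match_face, show_code_on_display,
                   read_confirmed_code, unlock_door, threshold=0.9):
    person, similarity = match_face(camera_frame)          # cloud face comparison
    if person is None or similarity < threshold:
        return "denied: no confident face match"
    code = f"{secrets.randbelow(10_000):04d}"              # random 4-digit code
    show_code_on_display(code)                             # employee confirms on the touch display
    if read_confirmed_code() != code:
        return "denied: code not confirmed (possible spoof)"
    unlock_door()
    return f"welcome {person}"
```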
|
420 |
Automated biometrics of audio-visual multiple modals
Unknown Date (has links)
Biometrics is the science and technology of measuring and analyzing biological data for authentication purposes. Its progress has brought about a large number of civilian and government applications. Candidate modalities used in biometrics include retinas, fingerprints, signatures, audio, faces, etc. There are two types of biometric system: single-modal systems and multiple-modal systems. Single-modal systems perform person recognition based on a single biometric modality and are affected by problems such as noisy sensor data, intra-class variations, lack of distinctiveness and non-universality. Multiple-modal systems that consolidate evidence from multiple biometric modalities can alleviate these problems. Integration of the evidence obtained from multiple cues, also known as fusion, is a critical part of multiple-modal systems, and it may be performed at several levels, such as the feature level, the matching-score level and the decision level. Among biometric modalities, both audio and face are easy to use and generally acceptable to users. Furthermore, the increasing availability and low cost of audio and visual instruments make it feasible to apply such Audio-Visual (AV) systems for security applications. This dissertation therefore proposes a face recognition algorithm. In addition, it develops novel fusion algorithms at different levels for multiple-modal biometrics, which have been tested on a virtual database and proved to be more reliable and robust than systems that rely on a single modality. / by Lin Huang. / Thesis (Ph.D.)--Florida Atlantic University, 2010. / Includes bibliography.
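To make the fusion idea concrete, the following is a minimal sketch of matching-score-level fusion for an audio-visual system: each modality's score is normalised to a common range and then combined with a weighted sum. The score ranges, weights and acceptance threshold are illustrative assumptions, not the dissertation's trained values.

```python
# Hedged sketch of matching-score-level fusion for audio-visual biometrics.
# Ranges, weights and threshold are illustrative assumptions.
def minmax(score, lo, hi):
    return (score - lo) / (hi - lo)

def fuse_scores(face_score, voice_score,
                face_range=(0.0, 1.0), voice_range=(-50.0, 0.0),
                w_face=0.6, w_voice=0.4, accept_threshold=0.7):
    """Accept a claimed identity only if the fused score clears the threshold."""
    f = minmax(face_score, *face_range)     # face-matcher similarity
    v = minmax(voice_score, *voice_range)   # e.g. a speaker-verification score
    fused = w_face * f + w_voice * v
    return fused >= accept_threshold, fused
```

Feature-level and decision-level fusion follow the same pattern, except that the combination happens before matching or after each modality's accept/reject decision, respectively.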
|