431 |
Continuous space models with neural networks in natural language processing. Le, Hai Son. 20 December 2012.
The purpose of language models is, in general, to capture and model the regularities of language, thereby capturing the morphological, syntactic, and distributional properties of word sequences in a given language. They play an important role in many successful applications of Natural Language Processing, such as Automatic Speech Recognition, Machine Translation and Information Extraction. The most successful approaches to date are based on the n-gram assumption and on adjusting statistics from the training data with smoothing and back-off techniques, notably the Kneser-Ney technique introduced twenty years ago. In this way, language models predict a word based on its n-1 previous words. In spite of their prevalence, conventional n-gram language models still suffer from several limitations that could intuitively be overcome by consulting human expert knowledge. One critical limitation is that, ignoring all linguistic properties, they treat each word as a discrete symbol with no relation to the others. Another is that, even with huge amounts of data, data sparsity remains a major issue, so the optimal value of n is typically only 4 or 5, which is insufficient in practice. Because such models are built from counts of n-grams in the training data, their pertinence is conditioned entirely on the characteristics of the training text (its quantity, and how well it represents the content in terms of theme and date).

Recently, one of the most successful attempts to learn word similarities directly is the use of distributed word representations in language modeling, where words with similar semantic and syntactic (i.e., distributional) properties are expected to be represented as neighbors in a continuous space. These representations and the associated objective function (the likelihood of the training data) are learned jointly using a multi-layer neural network architecture, so that word similarities are acquired automatically. This approach has shown significant and consistent improvements when applied to automatic speech recognition and statistical machine translation tasks. A major difficulty with the continuous-space neural network approach remains its computational burden, which does not scale well to the massive corpora that are nowadays available.

For this reason, the first contribution of this dissertation is a neural architecture based on a tree representation of the output vocabulary, the Structured OUtput Layer (SOUL), which makes neural language models well suited to large-scale frameworks. The SOUL model combines the neural network approach with the class-based approach, and achieves significant improvements on state-of-the-art large-scale automatic speech recognition and statistical machine translation tasks. The second contribution is a set of analyses of the performance of these models, their pros and cons, and the word-space representations they induce. The third contribution is the successful adoption of continuous-space neural networks into a machine translation framework: new translation models are proposed and shown to achieve significant improvements over state-of-the-art baseline systems.
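As an illustration of the class-based factorization that SOUL generalizes to a tree-structured vocabulary, the following sketch (with random stand-in parameters and an arbitrary hard word-to-class assignment; it does not reproduce the thesis's actual architecture) shows how P(w|h) = P(c(w)|h) * P(w|c(w),h) avoids normalizing over the full vocabulary:

```python
import numpy as np

# Two-level class-factored softmax: P(w | h) = P(c(w) | h) * P(w | c(w), h).
# SOUL generalizes this to a tree over the vocabulary; the shapes, random
# parameters, and hard word-to-class assignment here are illustrative.

rng = np.random.default_rng(0)
V, C, H = 10000, 100, 128           # vocabulary size, classes, hidden size
word2class = rng.integers(0, C, V)  # hypothetical word-to-class assignment
class_members = {c: np.flatnonzero(word2class == c) for c in range(C)}

W_class = rng.normal(0, 0.01, (H, C))   # hidden state -> class logits
W_word = {c: rng.normal(0, 0.01, (H, len(class_members[c])))
          for c in range(C)}            # hidden state -> within-class logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_prob(h, w):
    """P(w | h) without normalizing over the full vocabulary."""
    c = word2class[w]
    p_class = softmax(h @ W_class)[c]        # O(C) instead of O(V)
    p_in_class = softmax(h @ W_word[c])      # O(|class c|) words
    idx = int(np.where(class_members[c] == w)[0][0])
    return p_class * p_in_class[idx]

h = rng.normal(0, 1, H)   # stand-in for the network's hidden layer output
print(word_prob(h, w=42))
```

Scoring a word then costs O(C + |class|) rather than O(|V|), which is the property that makes such models tractable at large scale.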
|
432 |
Multisensor Segmentation-based Noise Suppression for Intelligibility Improvement in MELP Coders. Demiroglu, Cenk. 18 January 2006.
This thesis investigates the use of an auxiliary sensor, the GEMS device, for improving the quality of noisy speech and for designing noise preprocessors for MELP speech coders. The use of auxiliary sensors for noise-robust ASR applications is also investigated, with the aim of developing speech enhancement algorithms that exploit acoustic-phonetic properties of the speech signal.
A Bayesian risk minimization framework is developed that can incorporate the acoustic-phonetic properties of speech sounds and knowledge of human auditory perception into speech enhancement. Two noise suppression systems are presented using the ideas developed in this mathematical framework. In the first system, an aharmonic comb filter is proposed for voiced speech, in which low-energy frequencies are severely suppressed while high-energy frequencies are suppressed only mildly. The proposed system outperformed an MMSE estimator in subjective listening tests and in DRT intelligibility tests for MELP-coded noisy speech.
The effect of aharmonic comb filtering on the linear predictive coding (LPC) parameters is analyzed using a missing-data approach. Suppressing the low-energy frequencies without modifying the high-energy frequencies is shown to improve the LPC spectrum, as measured by the Itakura-Saito distance.
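As a rough sketch of the idea (the gain values, bandwidth, and fixed pitch below are illustrative assumptions, not the thesis's perceptually derived settings), an aharmonic comb gain can be built in the STFT domain by suppressing inter-harmonic bins severely and near-harmonic bins mildly:

```python
import numpy as np

def aharmonic_comb_gain(freqs_hz, f0_hz, g_harm=0.9, g_gap=0.2, width_hz=40.0):
    """Per-bin gains for one voiced frame: mild suppression near harmonics of
    f0 (the high-energy regions of voiced speech), severe suppression in the
    gaps between them. Gain values and bandwidth are illustrative."""
    nearest_harmonic = np.round(freqs_hz / f0_hz) * f0_hz
    near = np.abs(freqs_hz - nearest_harmonic) < (width_hz / 2.0)
    return np.where(near, g_harm, g_gap)

# Toy usage on one STFT frame (random stand-in for real noisy voiced speech):
fs, n_fft = 8000, 256
freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
frame_spectrum = np.fft.rfft(np.random.randn(n_fft))
enhanced = aharmonic_comb_gain(freqs, f0_hz=120.0) * frame_spectrum
```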
The second system combines the aharmonic comb filter with the acoustic-phonetic properties of speech to improve the intelligibility of MELP-coded noisy speech. The noisy speech signal is segmented into broad-level sound classes using a multi-sensor automatic segmentation/classification tool, and each sound class is enhanced differently based on its acoustic-phonetic properties. The proposed system is shown to outperform both the MELPe noise preprocessor and the aharmonic comb filter in intelligibility tests when used in cascade with the MELP coder.
Since the second noise suppression system uses an automatic segmentation/classification algorithm, exploiting the GEMS signal in an automatic
segmentation/classification task is also addressed using an ASR
approach. Current ASR engines can segment and classify speech utterances
in a single pass; however, they are sensitive to ambient noise.
Features that are extracted from the GEMS signal can be fused with the noisy MFCC features
to improve the noise-robustness of the ASR system. In the first phase, a voicing
feature is extracted from the clean speech signal and fused with the MFCC features.
The actual GEMS signal could not be used in this phase because of insufficient sensor data to train the ASR system.
Tests are done using the Aurora2 noisy digits database. The speech-based voicing
feature is found to be effective at around 10 dB but, below 10 dB, the effectiveness rapidly drops with decreasing SNR
because of the severe distortions in the speech-based features at these SNRs. Hence, a novel system is proposed that treats the MFCC features in a speech frame as missing data if the global SNR is below 10 dB and the frame is unvoiced. If the global SNR is above 10 dB or the frame is voiced, both the MFCC features and the voicing feature are used. The proposed system is shown to outperform some popular noise-robust techniques at all SNRs.
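The frame-level decision rule described above is simple enough to state directly in code; this is a hedged sketch, with the marginalization of missing MFCCs left to the recognizer:

```python
def select_features(mfcc, voicing, global_snr_db, is_voiced, snr_floor_db=10.0):
    """Frame-level feature selection rule (sketch). Returns (mfcc, voicing),
    with mfcc set to None when it is to be treated as missing data; how the
    recognizer marginalizes missing MFCCs is system-specific."""
    if global_snr_db < snr_floor_db and not is_voiced:
        return None, voicing   # low SNR and unvoiced: MFCCs marked missing
    return mfcc, voicing       # otherwise use both feature streams
```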
In the second phase, a new isolated monosyllable database is prepared that contains both speech and GEMS data. ASR experiments conducted
for clean speech showed that the GEMS-based feature, when fused with the MFCC features, decreases the performance.
The reason for this unexpected result is found to be partly related to a portion of the GEMS data being severely noisy. Non-acoustic sensor noise is present in all GEMS data, but severe noise occurs only rarely. A missing
data technique is proposed to alleviate the effects of severely noisy sensor data. The GEMS-based feature is treated as missing data
when it is detected to be severely noisy. The combined features are shown to outperform the MFCC features for clean
speech when the missing data technique is applied.
|
433 |
Acoustic segment modeling and preference ranking for music information retrieval. Reed, Jeremy T. 27 October 2010.
This dissertation focuses on improving content-based recommendation systems for music. Progress in the development of such systems has stalled in recent years due to several faulty assumptions:
1. most acoustic content-based systems for music information retrieval (MIR) assume a bag-of-frames model, in which a song is assumed to contain a simplistic, global audio texture
2. genre, style, mood, and authors are appropriate categories for machine-oriented recommendation
3. similarity is a universal construct and does not vary among different users
The main contribution of this dissertation is to address these faulty assumptions with a novel approach to MIR that provides user-centric, content-based recommendations based on statistics of acoustic sound elements. First, this dissertation presents an acoustic segment modeling framework that describes a piece of music as a temporal sequence of acoustic segment models (ASMs), which represent individual polyphonic sound elements. A dictionary of ASMs, generated by an unsupervised process, defines a vocabulary of acoustic tokens with which new musical pieces can be transcribed. Standard text-based information retrieval algorithms then use statistics of ASM counts to perform various retrieval tasks. Despite a simple feature set compared to other content-based genre recommendation algorithms, the acoustic segment modeling approach is highly competitive on standard genre classification databases. Fundamental to its success is the ability to model acoustic semantics in a musical piece, which is demonstrated by detecting musical attributes from their temporal characteristics. Further, the acoustic segment modeling procedure is shown to capture the inherent structure of melody, providing near state-of-the-art performance on an automatic chord recognition task.
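To make the text-IR analogy concrete, here is a minimal sketch (illustrative only, assuming per-song ASM token counts have already been produced by the decoder) of tf-idf weighting and cosine-similarity retrieval over ASM counts:

```python
import numpy as np

def tfidf(counts):
    """counts: (n_songs, n_tokens) matrix of ASM occurrence counts per song.
    Standard text-IR term weighting applied to acoustic tokens."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = np.count_nonzero(counts, axis=0)                 # document frequency
    idf = np.log((1 + counts.shape[0]) / (1 + df)) + 1.0  # smoothed idf
    return tf * idf

def top_k_similar(query_vec, song_vecs, k=5):
    """Rank songs by cosine similarity of their tf-idf ASM vectors."""
    norms = np.linalg.norm(song_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = (song_vecs @ query_vec) / np.maximum(norms, 1e-12)
    return np.argsort(-sims)[:k]

# Toy usage: 4 songs, 6 ASM tokens; query with song 0's own profile.
counts = np.array([[3, 0, 1, 0, 0, 2],
                   [0, 4, 0, 1, 0, 0],
                   [2, 0, 2, 0, 1, 1],
                   [0, 1, 0, 3, 2, 0]], dtype=float)
X = tfidf(counts)
print(top_k_similar(X[0], X, k=2))   # song 0 first, then its nearest neighbor
```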
This dissertation demonstrates that some classification tasks, such as genre, depend on information that is not contained in the acoustic signal; attempts to model these categories using only the acoustic content are therefore ill-fated. Further, notions of music similarity are personal in nature and are not derived from a universal ontology. This dissertation therefore addresses the second and third limitations of previous content-based retrieval approaches by presenting a user-centric preference rating algorithm. Individual users possess their own cognitive constructs of similarity, and retrieval algorithms must accommodate this flexibility. The proposed rating algorithm is based on the principle of minimum classification error (MCE) training, which has been demonstrated to be robust against outliers and which minimizes the Parzen estimate of the theoretical classification risk. The outlier-immunity property limits the effect of labels that arise from non-content-based sources. The MCE-based algorithm performs better than a comparable ratings prediction algorithm. Finally, the dissertation discusses extensions and future work.
|
434 |
Effective automatic speech recognition data collection for under-resourced languages. De Vries, Nicolaas Johannes. January 2011.
As building transcribed speech corpora for under-resourced languages plays a pivotal role in developing automatic speech recognition (ASR) technologies for such languages, a key step in developing these technologies is the effective collection of ASR data, consisting of transcribed audio and associated metadata.

The problem is that no suitable tool currently exists for effectively collecting ASR data for such languages. The specific context and requirements for effectively collecting ASR data for under-resourced languages render all currently known solutions unsuitable for the task. Such requirements include portability, Internet independence, and an open-source code base.
This work documents the development of such a tool, called Woefzela, from the determination of the requirements for effective data collection in this context to the verification and validation of its functionality. The study demonstrates the effectiveness of using smartphones without any Internet connectivity for ASR data collection for under-resourced languages. It introduces a semi-real-time quality control philosophy that increases the amount of usable ASR data collected from speakers.

Woefzela was developed for the Android operating system and is freely available for use on Android smartphones, with its source code also made available. More than 790 hours of ASR data for the eleven official languages of South Africa have been successfully collected with Woefzela.
As part of this study, a performance benchmark on a new National Centre for Human Language Technology (NCHLT) English corpus was established. / Thesis (M.Ing. (Electrical Engineering))--North-West University, Potchefstroom Campus, 2012.
|
435 |
Probabilistic modeling of neural data for analysis and synthesis of speech. Matthews, Brett Alexander. 13 August 2012.
This research consists of probabilistic modeling of speech audio signals and of deep-brain neurological signals in brain-computer interfaces. A significant portion of this research is a collaborative effort with Neural Signals Inc., Duluth, GA, and Boston University to develop an intracortical neural prosthetic system for speech restoration in a human subject living with Locked-In Syndrome, i.e., he is paralyzed and unable to speak. The work is carried out in three major phases.

We first use kernel-based classifiers to detect evidence of articulation gestures and phonological attributes in speech audio signals, and demonstrate that articulatory information can be used to decode speech content in such signals.

In the second phase of the research, we use neurological signals collected from a human subject with Locked-In Syndrome to predict intended speech content. The neural data were collected with a microwire electrode surgically implanted in the speech motor cortex of the subject's brain, with the implant location chosen to capture extracellular electric potentials related to speech motor activity. The data include extracellular traces and firing occurrence times for neural clusters in the vicinity of the electrode, identified by an expert. We compute continuous firing-rate estimates for the ensemble of neural clusters using several rate-estimation methods and apply statistical classifiers to the rate estimates to predict intended speech content. We use Gaussian mixture models to classify short frames of data into five vowel classes and to discriminate intended speech activity in the data from non-speech. We then perform a series of data collection experiments with the subject, designed to test explicitly for several speech articulation gestures, and decode the data offline.

In the third phase of the research, we develop an original probabilistic method for spike sorting in intracortical brain-computer interfaces, i.e., identifying and distinguishing action potential waveforms in extracellular traces. Our method uses both the action potential waveforms and their occurrence times to cluster the data. We apply the method to semi-artificial data and to partially labeled real data. We then classify neural spike waveforms, modeled with single multivariate Gaussians, using the method of minimum classification error for parameter estimation. Finally, we apply our joint waveform-and-occurrence-time spike-sorting method to neurological data in the context of a neural prosthesis for speech.
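As a hedged illustration of the vowel-classification step (data shapes, mixture sizes, and the use of scikit-learn are assumptions; the thesis's exact models are not reproduced), one GMM can be fitted per vowel class over firing-rate frames and a frame assigned to the class with the highest log-likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(frames_by_class, n_components=4):
    """One GMM per vowel class over firing-rate frames.
    frames_by_class: dict label -> (n_frames, n_channels) rate estimates."""
    return {label: GaussianMixture(n_components, covariance_type="diag",
                                   random_state=0).fit(X)
            for label, X in frames_by_class.items()}

def classify(frame, gmms):
    """Assign a single frame to the class with the highest log-likelihood."""
    scores = {label: g.score_samples(frame[None, :])[0]
              for label, g in gmms.items()}
    return max(scores, key=scores.get)

# Toy usage with synthetic, well-separated "firing rate" data:
rng = np.random.default_rng(0)
toy = {v: rng.normal(loc=3 * i, size=(60, 16))
       for i, v in enumerate(["a", "e", "i", "o", "u"])}
gmms = train_class_gmms(toy)
print(classify(toy["i"][0], gmms))   # expected: "i" on this toy data
```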
|
436 |
Objective-driven discriminative training and adaptation based on an MCE criterion for speech recognition and detection. Shin, Sung-Hwan. 13 January 2014.
Acoustic modeling in state-of-the-art speech recognition systems is commonly based on discriminative criteria. Unlike conventional distribution-estimation paradigms such as maximum a posteriori (MAP) and maximum likelihood (ML) estimation, the most popular discriminative criteria, such as minimum classification error (MCE) and minimum phone error (MPE), aim at direct minimization of the empirical error rate. As ASR applications become more diverse, it has been increasingly recognized that realistic applications often require a model optimized for a task-specific goal or a particular scenario beyond the general purposes of the current discriminative criteria. These specific requirements cannot be handled directly by the current criteria, whose sole objective is to minimize the overall empirical error rate.
In this thesis, we propose novel objective-driven discriminative training and adaptation frameworks, generalized from the MCE criterion, for various tasks and scenarios in speech recognition and detection. The proposed frameworks formulate new discriminative criteria that satisfy the varied requirements of recent ASR applications: each objective required by an application or a developer is embedded directly into the learning criterion, and the resulting objective-driven criterion is used to optimize an acoustic model toward that objective.

Three task-specific requirements that recent ASR applications often impose in practice are considered in developing the objective-driven discriminative criteria. First, the issue of minimizing individual error types in speech recognition is addressed, and a direct minimization algorithm for each error type is proposed. Second, a rapid-adaptation scenario is embedded into the formulation of discriminative linear transforms under the MCE criterion, and a regularized MCE criterion is proposed to efficiently improve the generalization capability of the MCE estimate in rapid adaptation. Finally, the operating scenario requiring a system model optimized at a given operating point is discussed with respect to conventional receiver operating characteristic (ROC) optimization, and a constrained discriminative training algorithm that can directly optimize a system model for any particular operating need is proposed. For each of the developed algorithms, an analytical solution and an appropriate optimization procedure are provided.
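For reference, the standard MCE embedding that these criteria generalize can be written compactly; the smoothing constants below are illustrative:

```python
import numpy as np

def mce_loss(scores, correct, eta=2.0, alpha=1.0):
    """Smoothed 0/1 loss of MCE training (smoothing constants illustrative).

    scores: discriminant values g_j(x) for all classes; 'correct' indexes the
    true class. d(x) = -g_correct + soft-max over competitors; the sigmoid
    turns d(x) into a differentiable approximation of the error count.
    """
    g_c = scores[correct]
    others = np.delete(scores, correct)
    g_comp = np.log(np.mean(np.exp(eta * others))) / eta  # smoothed max
    d = -g_c + g_comp
    return 1.0 / (1.0 + np.exp(-alpha * d))
```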
|
438 |
Graphical Models for Robust Speech Recognition in Adverse Environments. Rennie, Steven J. 01 August 2008.
Robust speech recognition in acoustic environments that contain multiple speech sources and/or complex non-stationary noise is a difficult problem, but one of great practical interest. The formalism of probabilistic graphical models constitutes a relatively new and very powerful tool for better understanding and extending existing models, learning algorithms, and inference algorithms, and a bedrock for the creative, quasi-systematic development of new ones. In this thesis, a collection of new graphical models and inference algorithms for robust speech recognition is presented.
The problem of speech separation using multiple microphones is treated first. A family of variational algorithms for tractably combining multiple acoustic models of speech with observed sensor likelihoods is presented. The algorithms recover high-quality estimates of the speech sources even when there are more sources than microphones, and have improved upon the state of the art in SNR gain by over 10 dB.
Next the problem of background compensation in non-stationary acoustic environments is treated. A new dynamic noise adaptation (DNA) algorithm for robust noise compensation is presented, and shown to outperform several existing state-of-the-art
front-end denoising systems on the new DNA + Aurora II and Aurora II-M extensions of the Aurora II task.
Finally, the problem of recognizing speech in the presence of competing speech using a single microphone is treated. The Iroquois system for multi-talker speech separation and recognition is presented. The system won the 2006 Pascal International Speech Separation Challenge and, remarkably, achieved super-human recognition performance on a majority of test cases in the task. The result marks a significant first in automatic speech recognition and a milestone in computing.
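As a rough, hedged illustration of the model family behind such factorial systems (the codebooks below are random stand-ins, not the Iroquois speaker models, and the real system performs full factorial-HMM inference), single-channel two-speaker separation under the max approximation can be sketched as a search over speaker-state pairs:

```python
import numpy as np

# Max-model sketch: in the log-spectral domain the mixture is approximated
# per frequency bin by the louder source, y ~= max(x1, x2). A single frame
# is searched exhaustively over pairs of speaker states.

rng = np.random.default_rng(1)
n_states, n_bins = 8, 64
mu1 = rng.normal(0, 1, (n_states, n_bins))  # speaker 1 log-spectral codebook
mu2 = rng.normal(0, 1, (n_states, n_bins))  # speaker 2 log-spectral codebook
y = np.maximum(mu1[3], mu2[5]) + 0.1 * rng.normal(0, 1, n_bins)  # toy mixture

best_pair, best_ll = None, -np.inf
for i in range(n_states):
    for j in range(n_states):
        pred = np.maximum(mu1[i], mu2[j])      # max-model prediction
        ll = -0.5 * np.sum((y - pred) ** 2)    # Gaussian log-likelihood
        if ll > best_ll:
            best_pair, best_ll = (i, j), ll

i, j = best_pair
mask1 = mu1[i] >= mu2[j]             # bins where speaker 1 is the louder source
x1_hat = np.where(mask1, y, mu1[i])  # speaker 1 estimate: observed or refill
print(best_pair)                     # expected (3, 5) on this toy frame
```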
|
439 |
An ensemble speaker and speaking environment modeling approach to robust speech recognition. Tsao, Yu. 18 November 2008.
In this study, an ensemble speaker and speaking environment modeling (ESSEM) approach is proposed to characterize environments and thereby enhance the robustness of automatic speech recognition (ASR) systems under adverse conditions. The ESSEM process comprises two stages: an offline phase and an online phase. In the offline phase, we prepare an ensemble speaker and speaking environment space formed by a collection of super-vectors; each super-vector consists of the entire set of means from all Gaussian mixture components of a set of hidden Markov models characterizing a particular environment. In the online phase, with the ensemble environment space prepared offline, we estimate the super-vector for a new testing environment based on a stochastic matching criterion.

A series of techniques is proposed to further improve the original ESSEM approach in both phases. For the offline phase, we focus on methods to enhance the construction and coverage of the environment space: we first present environment clustering and environment partitioning algorithms to structure the environment space, and then propose a discriminative training algorithm to enhance discrimination across environment super-vectors and thereby broaden the coverage of the ensemble environment space. For the online phase, we study methods to increase the efficiency and precision of estimating the target super-vector for the testing condition. To enhance efficiency, we incorporate dimensionality-reduction techniques to reduce the complexity of the original environment space. To improve precision, we first study different forms of mapping function and propose a weighted N-best information technique; we then propose cohort selection, environment space adaptation, and multiple-cluster matching algorithms to facilitate environment characterization.

We evaluate the proposed ESSEM framework on the Aurora-2 connected-digit recognition task. Experimental results verify that the original ESSEM approach already provides a clear improvement over a baseline system without environment compensation, and that its performance can be further enhanced by the proposed offline and online algorithms: with the optimal offline and online configuration, ESSEM achieves a significant 16.08% word error rate reduction over our best baseline system on the Aurora-2 task.
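A much-simplified sketch of the online phase follows (posterior-weighted interpolation is a stand-in here for the thesis's stochastic matching estimation): the test environment's super-vector is formed as a convex combination of the ensemble super-vectors.

```python
import numpy as np

def estimate_supervector(ensemble, env_loglik):
    """Convex combination of environment super-vectors.

    ensemble: (n_envs, d) matrix with one HMM-mean super-vector per seen
    environment; env_loglik: (n_envs,) log-likelihoods of the test utterance
    under each environment's models. Posterior weighting here is a
    simplification of the stochastic matching criterion used in ESSEM.
    """
    w = np.exp(env_loglik - env_loglik.max())  # stable softmax over envs
    w /= w.sum()
    return w @ ensemble  # interpolated super-vector for the test condition
```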
|
440 |
Lipreading across multiple views. Lucey, Patrick Joseph. January 2007.
Visual information from a speaker's mouth region is known to improve automatic speech recognition (ASR) robustness, especially in the presence of acoustic noise. Currently, the vast majority of audio-visual ASR (AVASR) studies assume frontal images of the speaker's face, which is a rather restrictive human-computer interaction (HCI) scenario. The lack of research into AVASR across multiple views has been dictated by the lack of large corpora containing speech data at varying poses/viewpoints. Recently, research has concentrated on recognising human behaviours within "meeting" or "lecture" type scenarios via "smart rooms", which has resulted in the collection of audio-visual speech data that allows visual speech to be recognised from both frontal and non-frontal views. Using these data, the main focus of this thesis was to investigate and develop methods, within the confines of a lipreading system, that can recognise visual speech across multiple views. This research constitutes the first published work in the field to examine this particular aspect of AVASR.

The task of recognising visual speech from non-frontal views (i.e. profile) is in principle very similar to that for frontal views, requiring the lipreading system first to locate and track the mouth region and subsequently to extract visual features. However, this task is far more complicated than the frontal case, because the facial features required to locate and track the mouth lie in a much more limited spatial plane. Nevertheless, accurate mouth-region tracking can be achieved by employing techniques similar to frontal facial-feature localisation. Once the mouth region has been extracted, the same visual-feature extraction process as in the frontal view can take place. A novel contribution of this thesis is to quantify the degradation in lipreading performance between the frontal and profile views. In addition, novel patch-based analysis of the various views is conducted, and as a result a novel multi-stream patch-based representation is formulated.

A lipreading system which can recognise visual speech from both frontal and profile views is itself a novel contribution to the field of AVASR. However, given both viewpoints, this begs the question: is there any benefit to having the additional viewpoint? Another major contribution of this thesis is the exploration of a novel multi-view lipreading system. This system shows that complementary information does exist in the additional viewpoint (possibly relating to lip protrusion), with the multi-view system achieving superior performance to the frontal-only system.

Even though a multi-view lipreading system which can recognise visual speech from both frontal and profile views is very beneficial, it can hardly be considered realistic, as each viewpoint is dedicated to a single pose (i.e. frontal or profile). In an effort to make the lipreading system more realistic, a unified system based on a single camera was developed that enables visual speech to be recognised from both frontal and profile poses. This is called pose-invariant lipreading. Pose-invariant lipreading can be performed on either stationary or continuous tasks.
Methods that effectively normalise the various poses into a single pose were investigated for the stationary scenario, and in another contribution of this thesis, an algorithm based on regularised linear regression was employed to project all visual speech features into a uniform pose. This method is shown to be beneficial when the lipreading system is biased towards the dominant pose (i.e. frontal). The final contribution of this thesis is the formulation of a continuous pose-invariant lipreading system containing a pose estimator at the start of the visual front-end. This system highlights the complexity of developing such a system, as introducing more flexibility into the lipreading system invariably means introducing more error. All the work contained in this thesis presents novel and innovative contributions to the field of AVASR, which will hopefully aid the future deployment of AVASR systems in realistic scenarios.
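As a hedged sketch of the pose-normalisation idea (feature dimensions and the regularisation weight are illustrative; the thesis's exact formulation and features are not reproduced), regularised linear regression maps profile-view features into the frontal feature space:

```python
import numpy as np

def fit_pose_projection(X_profile, Y_frontal, lam=1.0):
    """Ridge-regression map from profile-view features to the frontal feature
    space: W = (X'X + lam*I)^-1 X'Y. Feature dimensions and lam are
    illustrative; apply as X_new @ W to 'frontalise' new profile features."""
    d = X_profile.shape[1]
    return np.linalg.solve(X_profile.T @ X_profile + lam * np.eye(d),
                           X_profile.T @ Y_frontal)

# Toy usage with random stand-in features (200 paired training frames):
rng = np.random.default_rng(0)
Xp, Yf = rng.normal(size=(200, 30)), rng.normal(size=(200, 40))
W = fit_pose_projection(Xp, Yf, lam=10.0)
frontalised = rng.normal(size=(1, 30)) @ W   # projected into frontal pose
```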
|