1 |
Auditory Front-Ends for Noise-Robust Automatic Speech Recognition
Yeh, Ja-Zang 25 August 2010
The human auditory perception system is much more noise-robust than any state-of-the-art automatic speech recognition (ASR) system. It is expected that the noise robustness of speech features can be improved by employing a feature extraction procedure based on human auditory perception.
In this thesis, we investigate modifying the commonly used feature extraction process for automatic speech recognition systems. A novel frequency masking curve, which is based on modeling the basilar membrane as a cascade system of damped simple harmonic oscillators, is used to replace the critical-band masking curve to compute the masking threshold. We mathematically analyze the coupled motion of the oscillator system (basilar membrane) when it is driven by short-time stationary (speech) signals. Based on the analysis, we derive the relation between the amplitudes of neighboring oscillators, and accordingly insert a masking module in the front-end signal processing stage to modify the speech spectrum.
We evaluate the proposed method on the Aurora 2.0 noisy-digit speech database. When combined with the commonly used cepstral mean subtraction (CMS) post-processing, the proposed auditory front-end module achieves a significant improvement: the correlational masking effect curve combined with CMS achieves a relative improvement of 25.9% over the baseline. After applying the method iteratively, the relative improvement rises from 25.9% to 30.3%.
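
The cepstral mean subtraction step mentioned above can be illustrated with a minimal sketch. It assumes the textbook per-utterance form of CMS (subtract the mean of each cepstral coefficient over all frames); the function name and the toy feature values are hypothetical and are not taken from the thesis.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean from each cepstral coefficient.

    frames: list of equal-length lists, one list of coefficients per frame.
    Returns a new list of frames; the input is left unchanged.
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each coefficient across all frames of the utterance.
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Toy utterance with two frames and two coefficients (hypothetical values).
utterance = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_subtraction(utterance)
```

After normalization every coefficient has zero mean over the utterance, which removes a constant offset in the cepstral domain such as the one introduced by a fixed channel.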
|
2 |
Data-Driven Rescaling of Energy Features for Noisy Speech Recognition
Luan, Miau 18 July 2012
In this paper, we investigate the rescaling of energy features for noise-robust speech recognition. The performance of a speech recognition system degrades very quickly under the influence of environmental noise. As a result, speech robustness techniques have long been an important research issue, and many studies have pointed out that the impact of a noisy environment on speech recognition is enormous. Therefore, we propose data-driven energy feature rescaling (DEFR) to adjust the features. The method is divided into three parts: voice activity detection (VAD), a piecewise log rescaling function, and a parameter searching algorithm. The purpose is to reduce the difference between noisy and clean speech features. We apply this method to Mel-frequency cepstral coefficients (MFCC) and Teager energy cepstral coefficients (TECC), and we compare the proposed method with mean subtraction (MS) and mean and variance normalization (MVN). We use the Aurora 2.0 and Aurora 3.0 databases to evaluate the performance. The experimental results show that the proposed method can effectively improve the recognition accuracy.
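
For orientation, the MVN baseline that the abstract compares against can be sketched as follows. This is the standard per-utterance form (zero mean, unit variance per feature dimension), not the DEFR method itself, whose rescaling function is not specified in the abstract. The function name, the `eps` guard, and the toy values are assumptions.

```python
import math


def mean_variance_normalize(frames, eps=1e-12):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: non-empty list of equal-length lists (one feature vector per frame).
    eps: hypothetical guard against division by zero for constant dimensions.
    """
    n_frames = len(frames)
    n_dims = len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_dims)]
    # Population standard deviation per dimension.
    stds = [
        math.sqrt(sum((f[k] - means[k]) ** 2 for f in frames) / n_frames)
        for k in range(n_dims)
    ]
    return [
        [(f[k] - means[k]) / max(stds[k], eps) for k in range(n_dims)]
        for f in frames
    ]


# Toy features: two frames, two dimensions (hypothetical values).
features = [[1.0, 10.0], [3.0, 30.0]]
mvn_features = mean_variance_normalize(features)
```

Mean subtraction (MS) corresponds to the same computation without the division by the standard deviation.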
|
3 |
Auditory Based Modification of MFCC Feature Extraction for Robust Automatic Speech Recognition
Chiou, Sheng-chiuan 01 September 2009
The human auditory perception system is much more noise-robust than any state-of-the-art
automatic speech recognition (ASR) system. It is expected that the noise robustness of
speech feature vectors may be improved by employing more human auditory functions in the
feature extraction procedure.
Forward masking is a phenomenon of human auditory perception in which a weaker sound
is masked by a preceding stronger masker. In this work, two human auditory mechanisms,
synaptic adaptation and temporal integration, are implemented as filter functions and incorporated
into MFCC feature extraction to model forward masking. A filter optimization algorithm
is proposed to optimize the filter parameters.
The performance of the proposed method is evaluated on the Aurora 3 corpus, and the training/testing
procedure follows the standard setting provided by the Aurora 3 task. The
synaptic adaptation filter achieves a relative improvement of 16.6% over the baseline. The
temporal integration and modified temporal integration filters achieve relative improvements
of 21.6% and 22.5%, respectively. Combining synaptic adaptation with each of the temporal
integration filters results in further improvements of 26.3% and 25.5%. Applying the
filter optimization improves the synaptic adaptation filter and the two temporal integration filters,
resulting in improvements of 18.4%, 25.2%, and 22.6%, respectively. The performance of the
combined-filter models is also improved, with relative improvements of 26.9% and 26.3%.
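
The relative improvements quoted above are easier to interpret with the arithmetic written out. The sketch below assumes that "relative improvement" means the relative reduction in error rate with respect to the baseline, which the abstract does not state explicitly; the function name and the example error rates are hypothetical and are not results from the thesis.

```python
def relative_improvement(baseline_error, new_error):
    """Relative error reduction over the baseline, in percent.

    Assumes both arguments are error rates on the same scale
    (for example, word error rate in percent).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical example: a baseline error of 20% reduced to 15%.
example = relative_improvement(20.0, 15.0)
```

Under this definition, a relative improvement of 25% means that one quarter of the baseline errors have been removed, regardless of the absolute error level.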
|
4 |
Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation
Parihar, Naveen 13 December 2003
Over the past few years, speech recognition technology performance on tasks ranging from isolated digit recognition to conversational speech has dramatically improved. Performance on limited recognition tasks in noise-free environments is comparable to that achieved by human transcribers. This advancement in automatic speech recognition technology, along with an increase in the computing power of mobile devices, standardization of communication protocols, and the explosion in the popularity of mobile devices, has created an interest in flexible voice interfaces for mobile devices. However, speech recognition performance degrades dramatically in mobile environments, which are inherently noisy. In the recent past, a great amount of effort has been spent on the development of front ends based on advanced noise-robust approaches. The primary objective of this thesis was to analyze the performance of two advanced front ends, referred to as the QIO and MFA front ends, on a speech recognition task based on the Wall Street Journal database. Though the advanced front ends are shown to achieve a significant improvement over an industry-standard baseline front end, this improvement is not operationally significant. Further, we show that the results of this evaluation were not significantly impacted by suboptimal recognition system parameter settings. Without any front end-specific tuning, the MFA front end outperforms the QIO front end by 9.6% relative. With tuning, the relative performance gap increases to 15.8%. Finally, we also show that mismatched microphone and additive noise evaluation conditions resulted in a significant degradation in performance for both front ends.
|
5 |
Bio-inspired noise robust auditory features
Javadi, Ailar 12 June 2012
The purpose of this work is to investigate a series of biologically inspired modifications to state-of-the-art Mel-frequency cepstral coefficients (MFCCs) that may improve automatic speech recognition results. We have provided recommendations to improve speech recognition results depending on the signal-to-noise ratio levels of the input signals. This work has been motivated by noise-robust auditory features (NRAF). In the feature extraction technique, after a signal is filtered using bandpass filters, a spatial derivative step is used to sharpen the results, followed by an envelope detector (rectification and smoothing) and down-sampling for each filter bank before being compressed. DCT is then applied to the results of all filter banks to produce features. The Hidden Markov Model Toolkit (HTK) is used as the recognition back-end to perform speech recognition given the features we have extracted. In this work, we investigate the role of filter types, window size, spatial derivative, rectification types, smoothing, down-sampling, and compression, and compare the final results to state-of-the-art MFCCs. A series of conclusions and insights are provided for each step of the process. The goal of this work has not been to outperform MFCCs; however, we have shown that by changing the compression type from log compression to 0.07 root compression we are able to outperform MFCCs for all noisy conditions.
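
The two compression types contrasted above can be compared side by side in a short sketch. It assumes that "0.07 root compression" means raising each filter-bank output to the power 0.07, and it uses an unnormalized DCT-II as the final transform. The function names, the `floor` guard, and the band energy values are hypothetical and do not come from the thesis.

```python
import math


def log_compress(energies, floor=1e-10):
    """Log compression; `floor` (an assumption) avoids log(0)."""
    return [math.log(max(e, floor)) for e in energies]


def root_compress(energies, exponent=0.07):
    """Root compression: raise each non-negative energy to `exponent`."""
    return [max(e, 0.0) ** exponent for e in energies]


def dct_ii(values):
    """Unnormalized DCT-II, as commonly applied after compression."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


# Hypothetical filter-bank energies for one frame.
band_energies = [0.5, 2.0, 8.0, 1.0]
log_features = dct_ii(log_compress(band_energies))
root_features = dct_ii(root_compress(band_energies))
```

One practical difference is that root compression stays bounded as an energy approaches zero, whereas the logarithm diverges, so low-energy bands dominated by noise have less influence on the root-compressed features.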
|
6 |
Multisensor Segmentation-based Noise Suppression for Intelligibility Improvement in MELP Coders
Demiroglu, Cenk 18 January 2006
This thesis investigates the use of an auxiliary sensor, the GEMS device, for improving the quality of noisy speech and designing noise preprocessors for MELP speech coders. Use of auxiliary sensors for noise-robust
ASR applications is also investigated to develop speech enhancement algorithms that use acoustic-phonetic
properties of the speech signal.
A Bayesian risk minimization framework is developed that can incorporate the acoustic-phonetic properties
of speech sounds and knowledge of human auditory perception into the speech enhancement framework. Two noise suppression
systems are presented using the ideas developed in the mathematical framework. In the first system, an aharmonic
comb filter is proposed for voiced speech where low-energy frequencies are severely suppressed while
high-energy frequencies are suppressed mildly. The proposed
system outperformed an MMSE estimator in subjective listening tests and in a DRT intelligibility test for MELP-coded noisy speech.
The effect of aharmonic
comb filtering on the linear predictive coding (LPC) parameters is analyzed using a missing data approach.
Suppressing the low-energy frequencies without any modification of the high-energy frequencies is shown to
improve the LPC spectrum using the Itakura-Saito distance measure.
The second system combines the aharmonic comb filter with the acoustic-phonetic properties of speech
to improve the intelligibility of the MELP-coded noisy speech.
The noisy speech signal is segmented into broad-level sound classes using a multi-sensor automatic
segmentation/classification tool, and each sound class is enhanced differently based on its
acoustic-phonetic properties. The proposed system is shown to outperform both the MELPe noise preprocessor
and the aharmonic comb filter in intelligibility tests when used in concatenation with the MELP coder.
Since the second noise suppression system uses an automatic segmentation/classification algorithm, exploiting the GEMS signal in an automatic
segmentation/classification task is also addressed using an ASR
approach. Current ASR engines can segment and classify speech utterances
in a single pass; however, they are sensitive to ambient noise.
Features that are extracted from the GEMS signal can be fused with the noisy MFCC features
to improve the noise-robustness of the ASR system. In the first phase, a voicing
feature is extracted from the clean speech signal and fused with the MFCC features.
The actual GEMS signal could not be used in this phase because of insufficient sensor data to train the ASR system.
Tests are done using the Aurora2 noisy digits database. The speech-based voicing
feature is found to be effective at around 10 dB, but below 10 dB its effectiveness drops rapidly with decreasing SNR
because of the severe distortions in the speech-based features at these SNRs. Hence, a novel system is proposed that treats the
MFCC features in a speech frame as missing data if the global SNR is below 10 dB and the speech frame is
unvoiced. If the global SNR is above 10 dB or the speech frame is voiced, both the MFCC features and the voicing feature are used. The proposed
system is shown to outperform some of the popular noise-robust techniques at all SNRs.
In the second phase, a new isolated monosyllable database is prepared that contains both speech and GEMS data. ASR experiments conducted
for clean speech showed that the GEMS-based feature, when fused with the MFCC features, decreases the performance.
The reason for this unexpected result is found to be partly related to some of the GEMS data that is severely noisy.
The non-acoustic sensor noise exists in all GEMS data but the severe noise happens rarely. A missing
data technique is proposed to alleviate the effects of severely noisy sensor data. The GEMS-based feature is treated as missing data
when it is detected to be severely noisy. The combined features are shown to outperform the MFCC features for clean
speech when the missing data technique is applied.
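
The frame-level rule for treating MFCC features as missing data, described earlier in this abstract, can be written as a small predicate. This is a sketch of the decision logic only. The function and constant names are hypothetical, and the behavior at exactly 10 dB is an assumption, since the abstract only specifies "below" and "above" 10 dB.

```python
SNR_THRESHOLD_DB = 10.0  # threshold stated in the abstract


def mfcc_is_missing(global_snr_db, frame_is_voiced):
    """Return True if the frame's MFCC features are treated as missing data.

    MFCCs are treated as missing only when the global SNR is below the
    threshold and the frame is unvoiced. A global SNR of exactly 10 dB is
    treated here as "not below" (an assumption).
    """
    return global_snr_db < SNR_THRESHOLD_DB and not frame_is_voiced


# Hypothetical frames: (global SNR in dB, voiced flag).
frames = [(5.0, False), (5.0, True), (15.0, False), (15.0, True)]
decisions = [mfcc_is_missing(snr, voiced) for snr, voiced in frames]
```

In every other case, both the MFCC features and the voicing feature are used, as stated in the abstract.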
|