Acoustic Model Adaptation For Reverberation Robust Automatic Speech Recognition

Reverberation is a natural phenomenon observed in enclosed environments. It occurs when the signal is reflected from the walls and objects in the room. For humans, reverberation is beneficial because it reinforces sound and provides a sensation of space. For automatic speech recognition, however, even a moderate amount of reverberation is very harmful: it corrupts the clean speech, which degrades the performance of the speech recognizer. Moreover, in enclosed environments, reverberation has the most damaging effect on the accuracy of the recognizer. In the literature, mostly noise compensation techniques have been proposed to improve speech recognition performance against environmental artifacts. As a consequence, the problem of reverberation has received relatively little attention. Lately, some techniques have emerged that are specifically tailored to compensating the effects of reverberation. Nevertheless, the problem of reverberation is far from solved. Therefore, to handle reverberation and make speech recognition robust to it, we propose a "semi-blind adaptation" technique, which adapts the clean acoustic models to the reverberant environment and thus provides improved performance. The semi-blind adaptation technique works in two phases: in the first phase a reverberation model is estimated, and in the second phase the clean acoustic models are adapted using that reverberation model. The reverberation model proposed in this technique (Pw-EDC) models the non-diffuse nature of rooms. The Pw-EDC model therefore has a dual-slope energy decay, where the first slope represents the steep decay of the early reflections and the second slope represents the slow decay of the late reflections. The parameters that model the early-reflection decay were calculated empirically, and to find the late-reflection decay parameter we propose a reverberation time estimation technique based on Gaussian mixture models (GMMs).
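As an illustration of the dual-slope decay described above, the following is a minimal sketch of a two-segment energy decay curve in dB. It is not the thesis's Pw-EDC parameterization, which the abstract does not give: the function name `piecewise_edc_db`, the knee time `t_knee`, the early slope value, and the choice to derive the late slope from a reverberation time (a 60 dB decay over `t60_late` seconds) are all assumptions made for the example.

```python
import numpy as np

def piecewise_edc_db(t, t_knee, early_slope_db_per_s, t60_late):
    """Hypothetical two-slope energy decay curve in dB, with 0 dB at t = 0.

    t_knee               -- assumed time (s) where early reflections give way to late ones
    early_slope_db_per_s -- assumed steep decay rate of the early reflections (dB/s)
    t60_late             -- reverberation time (s) governing the slower late decay
    """
    t = np.asarray(t, dtype=float)
    # By definition of T60, the late part loses 60 dB in t60_late seconds.
    late_slope_db_per_s = 60.0 / t60_late
    early = -early_slope_db_per_s * t
    # The late segment continues from the level reached at the knee.
    late = -early_slope_db_per_s * t_knee - late_slope_db_per_s * (t - t_knee)
    return np.where(t < t_knee, early, late)

# Illustrative values only: a 200 dB/s early decay up to 50 ms, then T60 = 0.6 s.
t = np.arange(0.0, 0.5, 0.01)
edc = piecewise_edc_db(t, t_knee=0.05, early_slope_db_per_s=200.0, t60_late=0.6)
```

With these example values the curve drops 10 dB during the first 50 ms and then decays at 100 dB/s, so the early segment is twice as steep as the late one.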
The late-reflection decay parameter is estimated by first training a pool of GMMs, where each model represents the reverberation time of the data on which it was trained. In the test phase, the test data is matched against these models, and the GMM that matches with the highest probability provides the estimate of the late-reflection decay parameter. To adapt the acoustic models, reverberation energy contributions are estimated using the Pw-EDC model. The parameters of the current state in the model (only the means) are adapted by adding the reverberation energy contributions of the previous states to the current state. In this manner, the dispersion of energy caused by reverberation is compensated. Adaptation is performed not only on the static parameters but also on the dynamic parameters of the model. After adaptation, the models are evaluated on data from low, medium and high reverberation environments. The efficacy of the proposed adaptation technique is evaluated on small and medium vocabulary tasks. For these tasks, reverberant data is generated by convolving clean signals with room impulse responses (RIRs) taken from the SIREAC and AIR databases. SIREAC provides RIRs of an office and a living room, and it also allows the reverberation time to be modified. In our experiments, the reverberation time of the RIRs is therefore varied from 200 to 900 ms in steps of 100 ms for both SIREAC rooms. In the AIR environment, RIRs are obtained from studio booth, meeting, office and lecture rooms. These rooms have very low, low, medium and high reverberation times, respectively. For the small vocabulary task, Pw-EDC adaptation provides considerable improvements over the baseline results, especially at medium and high reverberation times in both environments. Pw-EDC adaptation is compared with a contemporary adaptation technique (Exp-EDC adaptation), which adapts the models in the same manner but uses a crude reverberation model.
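The mean adaptation described above, in which energy from previous states is added to the current state, can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's procedure: it assumes the state means are log-energies per frequency band (cepstral means would first have to be mapped to and from that domain), that the states form a left-to-right chain, and that a hypothetical vector `weights` gives the fraction of energy leaking from the state k steps back, as might be read off a decay model. The function name `adapt_state_means` is also hypothetical.

```python
import numpy as np

def adapt_state_means(log_means, weights):
    """Add weighted energy from preceding states to each state's mean.

    log_means -- array (n_states, n_bands) of log-energy means, in state order
    weights   -- weights[k-1] is the assumed linear energy fraction that the
                 state k steps back contributes to the current state
    """
    energies = np.exp(np.asarray(log_means, dtype=float))  # back to linear energy
    adapted = energies.copy()
    for j in range(len(energies)):
        for k, w in enumerate(weights, start=1):
            if j - k < 0:
                break  # no earlier state to draw energy from
            adapted[j] += w * energies[j - k]
    return np.log(adapted)  # return to the log-energy domain

# Toy example: three states with unit energy in a single band.
clean_means = np.zeros((3, 1))
adapted_means = adapt_state_means(clean_means, weights=[0.5, 0.25])
```

In this toy case the first state is unchanged, the second gains half the energy of the first, and the third gains energy from both predecessors, which mirrors the way reverberation smears energy forward in time.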
Pw-EDC adaptation was found to give better performance in all the rooms of both environments. Pw-EDC adaptation is also compared with a state-of-the-art adaptation technique, unsupervised MLLR, and was found to provide performance similar to MLLR only when the models contain static coefficients. For the medium vocabulary task, Pw-EDC adaptation provides better performance than Exp-EDC adaptation in both environments. However, when compared against unsupervised MLLR, it shows relatively poor performance. The reason for this poor performance is the inaccurate adaptation of the dynamic coefficients of the models. In conclusion, the robustness of the proposed adaptation technique is due to the precise modeling and estimation of reverberation energy decay by the Pw-EDC model. Using the Pw-EDC model, semi-blind adaptation has shown consistent improvements across low, medium and high reverberation environments in both small and medium vocabulary speech recognition tasks.
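The GMM-based reverberation time estimation mentioned in the abstract, where the best-matching model in a pool supplies the estimate, can be sketched as a maximum-likelihood selection. The sketch assumes diagonal-covariance GMMs that have already been trained, one per reverberation time; the features, the model sizes, and the names `gmm_avg_loglik`, `select_t60` and `gmm_pool` are hypothetical and not taken from the thesis.

```python
import numpy as np

def gmm_avg_loglik(x, weights, means, variances):
    """Average per-frame log-likelihood of frames x under a diagonal GMM.

    x         -- array (n_frames, dim)
    weights   -- array (n_comp,), mixture weights summing to 1
    means     -- array (n_comp, dim)
    variances -- array (n_comp, dim), diagonal covariances
    """
    diff = x[:, None, :] - means[None, :, :]                    # (n_frames, n_comp, dim)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)
    log_comp = log_comp + np.log(weights)                       # (n_frames, n_comp)
    # Log-sum-exp over components, shifted by the maximum for numerical stability.
    m = log_comp.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))
    return float(np.mean(frame_ll))

def select_t60(x, gmm_pool):
    """Return the reverberation time whose GMM scores the test frames highest.

    gmm_pool -- dict mapping a reverberation time in seconds to
                (weights, means, variances) of its pre-trained GMM
    """
    return max(gmm_pool, key=lambda t60: gmm_avg_loglik(x, *gmm_pool[t60]))

# Toy pool with two single-component models in a 2-D feature space.
pool = {
    0.3: (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]])),
    0.6: (np.array([1.0]), np.array([[5.0, 5.0]]), np.array([[1.0, 1.0]])),
}
frames = np.array([[4.8, 5.1], [5.2, 4.9], [5.0, 5.0]])
estimated_t60 = select_t60(frames, pool)
```

Here the toy frames lie close to the mean of the second model, so that model's reverberation time is selected. In a real system the pool would hold one model per reverberation time seen in training, and the selected value would then set the late-decay slope of the reverberation model.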

Identifier: oai:union.ndltd.org:unitn.it/oai:iris.unitn.it:11572/368890
Date: January 2014
Creators: Mohammed, Abdul Waheed
Contributors: Mohammed, Abdul Waheed; Matassoni, Marco
Publisher: Università degli studi di Trento, place:TRENTO
Source Sets: Università di Trento
Language: English
Detected Language: English
Type: info:eu-repo/semantics/doctoralThesis
Rights: info:eu-repo/semantics/closedAccess
Relation: firstpage:1, lastpage:153, numberofpages:153
