Model-based single-microphone speech separation using conditional random fields.

Abstract (translated from the Chinese):

The goal of single-microphone speech separation is to reconstruct two or more speech sources from a single speech mixture. The technique can serve as a front-end for speech applications, for example extracting information from multimedia sound tracks. As an extreme case of under-determined speech separation, the sources essentially cannot be recovered exactly, but the most probable sources can still be reconstructed with the help of statistical models of the sources.

The performance of speech separation has improved through the application of graphical modeling. This thesis compares the exact and approximate algorithms for the factorial hidden Markov model (HMM) in terms of their complexity and their effect on separation performance, and investigates how the state transition probabilities in the statistical source models affect separation performance.

Statistical model mismatch occurs frequently in speech separation. Limited training data and the use of a finite state space (acoustic states) to model the sources both lead to mismatch. This thesis studies the use of conditional random fields (CRF) to model the posterior probabilities of the source states directly. These posterior probabilities are required when computing the minimum mean-square error estimate of the sources. The CRF is a discriminative model and tolerates model mismatch better than generative models such as the HMM. Large-margin parameter estimation further improves separation performance.

Experimental results show that when the power ratio between the sources (signal-to-signal ratio) is similar, speech separation with CRF yields better objective quality measures and speech recognition results. Even with a simplified CRF, the results remain comparable to those of the factorial HMM.

Abstract (English):

Single-microphone speech separation requires reconstructing two or more sources from only one speech mixture. It can serve as the front-end for speech applications that demand robustness against interfering signals, such as information extraction from the sound streams of multimedia. As an extreme case of the under-determined source separation problem, a unique solution for source reconstruction is unlikely to be achieved, but in a statistical model-based setting the most probable source observations can be obtained through statistical inference, given prior information about the sources.

The performance of statistical model-based methods has been progressively improved by the use of graphical models to organize the prior information. In this thesis, the performance of the exact and the approximate statistical inference algorithms for single-microphone speech separation with factorial hidden Markov models (HMM) is evaluated in terms of speech quality and computational complexity. The important role of state transitions in the source models is also investigated.

Model mis-specification is a major problem in model-based speech separation. These mis-specifications are caused by various factors, including a limited amount of training data and a finite number of acoustic states. Compared with generative approaches such as the factorial HMM, direct models such as conditional random fields (CRF) are considered more robust to model mis-specification owing to their inherent discriminative ability. In this thesis, the application of CRF to single-microphone speech separation is investigated. The posterior probabilities of the acoustic states given the mixture, which are essential to minimum mean-square error estimation of the sources, are modeled with a maximum-entropy probability distribution. The performance of the CRF formulations is further improved with a large-margin approach to parameter estimation.

Experimental results confirm that the CRF formulations achieve improved objective quality measures and automatic speech recognition accuracy for the reconstructed sources, especially when the sources are competing at a similar signal-to-signal ratio. Even with a simplified CRF formulation, the performance is still comparable to that of the factorial HMM.

Detailed summary in vernacular field only.
Yeung, Yu Ting.
Thesis (Ph.D.) Chinese University of Hong Kong, 2014.
Includes bibliographical references (leaves 102-118).
Abstracts also in Chinese.

Identifier: oai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_1077687
Date: January 2014
Contributors: Yeung, Yu Ting (author.), Lee, Tan (thesis advisor.), Chinese University of Hong Kong Graduate School. Division of Electronic Engineering, (degree granting institution.)
Source Sets: The Chinese University of Hong Kong
Language: English, Chinese
Detected Language: English
Type: Text, bibliography, text
Format: electronic resource, electronic resource, remote, 1 online resource (xiii, 118 leaves) : illustrations (some color), computer, online resource
Rights: Use of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/)
