Return to search

Learning non-Gaussian factor analysis with different structures: comparative investigations on model selection and applications. / 基於多種結構的非高斯因數分析的模型選擇學習演算法比較研究及其應用 / CUHK electronic theses & dissertations collection / Ji yu duo zhong jie gou de fei Gaosi yin shu fen xi de mo xing xuan ze xue xi yan suan fa bi jiao yan jiu ji qi ying yong

高維資料的隱含結構挖掘是機器學習、模式識別和生物資訊學等領域中的重要問題。本論文從實踐和理論上研究了具有不同隱含結構模式的非高斯因數分析(Non-Gaussian Factor Analysis)模型。本文既從兩步法又從自動法的角度重點研究確定隱因數個數的模型選擇問題,及其在模式識別和生物資訊學上的實際應用。 / 非高斯因數分析在單高斯因數的情況下退化為傳統的因數分析(Factor Analysis)模型。我們發展了一套系統地比較模型選擇方法性能的工具,比較研究了經典的模型選擇準則(比如AIC 等),及近年來基於隨機矩陣理論的統計檢驗方法,還有貝葉斯陰陽(Bayesian Ying-Yang)和諧學習理論。同時,我們也對四個經典準則提供了一個適用於小樣本的低估因數數目傾向的相對排序的理論結果。 / 基於傳統的因數分析模型,我們還研究了參數化形式對模型選擇方法的性能的影響,一個重要的但被忽略或很少研究的問題,因為似然函數等價的參數化形式在傳統的模型選擇準則像AIC 下不會有性能差異。但是,我們通過大量的模擬資料和實際資料上的結果發現,在兩個常用的似然函數等價的因數分析參數化形式中,其中一個更加有利於在變分貝葉斯(Variational Bayes)和貝葉斯陰陽理論框架下做模型選擇。 進一步地,該兩個參數化形式被作為兩端拓展成一系列具有等價似然函數的參數化形式。實驗結果更加可靠地揭示了參數化形式的逐漸變化對模型選擇的影響。同時,實驗結果也顯示參數先驗分佈的引入可以提高模型選擇的準確度,並給出了相應的新的學習演算法。系統比較表明,不僅是兩步法還是自動法,貝葉斯陰陽學習理論都比變分貝葉斯的模型選擇的性能更佳,並且能在有利的參數化形式中獲得更大的提高。 / 二元因數分析(Binary FA)也是一種非高斯因數分析模型,它用伯努利因數去解釋隱含結構。首先,我們引入一種叫做正則對偶(canonical dual)的方法去解決在二元因數分析學習演算法中遇到的一個計算複雜度很大的二值二次規劃(Binary Quadratic Programming)問題。雖然它不能準確找到二值二次規劃的全域最優,它卻提高了整個學習演算法的計算速度和自動模型選擇的準確性。由此表明,局部嵌套的子優化問題的解不需要太精確反而能對整個學習演算法的性能更有利。然後,先驗分佈的引入進一步提高了模型選擇的性能,並且貝葉斯陰陽學習理論被系統的實驗結果證實要優於變分貝葉斯。接著,我們進一步發展了一個適用於二值資料的二元矩陣分解演算法。該演算法有理論的結果保證它的性能,並且在實際應用中,能以比其他相關演算法更優的性能從大規模的蛋白相互作用網路中檢測出蛋白功能複合物。 / 進一步,我們在一個半盲(semi-blind)的框架下研究了非高斯因數分析的演算法及其在系統生物學中的應用。非高斯因數分析模型被用於基因轉錄調控建模,並引入稀疏約束到連接矩陣,從而提出一個能有效估計轉錄因數調控信號的方法,而不需要像網路分量分析(Network Component Analysis)方法那樣預先給定轉錄因數調控基因的拓撲網路結構。特別地,借助二元因數分析,調控信號中的二元特徵能被直接捕捉。這種似開關的模式在很多生物過程的調控機制裡面起著重要作用。 / 最後,基於半盲非高斯因數分析學習演算法,我們提出了一套分析外顯子測序數據的方法,能有效地找出與疾病關聯的易感基因,提供了一個可能的方向去解決傳統的全基因組關聯分析(GWAS)方法在低頻高雜訊的外顯子測序數據上失效的問題。在一個1457 個樣本的大規模外顯子測序數據的初步結果顯示,我們的方法既能確認很多已經被認為是與疾病相關的基因,又能找到新的被重複驗證有顯著性的易感基因。相關的表達譜資料進一步顯示所找到的新基因在疾病和對照上有顯著的上下調的表達差異。 / Mining the underlying structure from high dimensional observations is of critical importance in machine learning, pattern recognition and bioinformatics. In this thesis, we, empirically or theoretically, investigate non-Gaussian Factor Analysis (NFA) models with different underlying structures. We focus on the problem of determining the number of latent factors of NFA, from two-stage approach model selection to automatic model selection, with real applications in pattern recognition and bioinformatics. / We start by a degenerate case of NFA, the conventional Factor Analysis (FA) with latent Gaussian factors. Many model selection methods have been proposed and used for FA, and it is important to examine their relative strengths and weaknesses. We develop an empirical analysis tool, to facilitate a systematic comparison on model selection performances of not only classical criteria (e.g., Akaike’s information criterion or shortly AIC) but also recently developed methods (e.g., Kritchman & Nadler’s hypothesis tests), as well as the Bayesian Ying-Yang (BYY) harmony learning. Also, we prove a theoretical relative order of underestimation tendency of four classical criteria. / Then, we investigate how parameterizations affect model selection performance, an issue that has been ignored or seldom studied since traditional model selection criteria, like AIC, perform equivalently on different parameterizations that have equivalent likelihood functions. Focusing on two typical parameterizations of FA, one of which is found to be better than the other under both Variational Bayes (VB) and BYY via extensive experiments on synthetic and real data. Moreover, a family of FA parameterizations that have equivalent likelihood functions are presented, where each one is featured by an integer r, with the two known parameterizations being both ends as r varies from zero to its upper bound. Investigations on this FA family not only confirm the significant difference between the two parameterizations in terms of model selection performance, but also provide insights into what makes a better parameterization. With a Bayesian treatment to the new FA family, alternative VB algorithms on FA are derived, and also BYY algorithms on FA are extended to be equipped with prior distributions on the parameters. A systematic comparison shows that BYY generally outperforms VB under various scenarios including varying simulation configurations and incrementally adding priors to parameters, as well as automatic model selection. / To describe binary latent features, we proceed to binary factor analysis (BFA), which considers Bernoulli factors. First, we introduce a canonical dual approach to tackling a difficult Binary Quadratic Programming (BQP) problem encountered as a computational bottleneck in BFA learning. Although it is not an exact BQP solver, it improves the learning speed and model selection accuracy, which indicates that some amount of error in solving the BQP, a problem nested in the hierarchy of the whole learning process, brings gain on both computational efficiency and model selection performance. The results also imply that optimization is important in learning, but learning is not just a simple optimization. Second, we develop BFA algorithms under VB and BYY to incorporate Bayesian priors on the parameters to improve the automatic model selection performance, and also show that BYY is superior to VB under a systematic comparison. Third, for binary observations, we propose a Bayesian Binary Matrix Factorization (BMF) algorithm under the BYY framework. The performance of the BMF algorithm is guaranteed with theoretical proofs and verified by experiments. We apply it to discovering protein complexes from protein-protein interaction (PPI) networks, an important problem in bioinformatics, with outperformance comparing to other related methods. / Furthermore, we investigate NFA under a semi-blind learning framework. In practice, there exist many scenarios of knowing partially either or both of the system and the input. Here, we modify Network Component Analysis (NCA) to model gene transcriptional regulation in system biology by NFA. The previous hardcut NFA algorithm is extended here as sparse BYY-NFA by considering either or both of a priori connectivity and a priori sparse constraint. Therefore, the a priori knowledge about the connection topology of the TF-gene regulatory network required by NCA is not necessary for our NFA algorithm. The sparse BYY-NFA can be further modified to get a sparse BYY-BFA algorithm, which directly models the switching patterns of latent transcription factor (TF) activities in gene regulation, e.g., whether or not a TF is activated. Mining switching patterns provides insights into exploring regulation mechanism of many biological processes. / Finally, the semi-blind NFA learning is applied to identify those single nucleotide polymorphisms (SNPs) that are significantly associated with a disease or a complex trait from exome sequencing data. By encoding each exon/gene (which may contain multiple SNPs) as a vector, an NFA classifier, obtained in a supervised way on a training set, is used for prediction on a testing set. The genes are selected according to the p-values of Fisher’s exact test on the confusion tables collected from prediction results. The selected genes on a real dataset from an exome sequencing project on psoriasis are consistent in part with published results, and some of them are probably novel susceptible genes of the disease according to the validation results. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Tu, Shikui. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 196-212). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. / Abstract --- p.i / Acknowledgement --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Background --- p.1 / Chapter 1.1.1 --- Motivations --- p.1 / Chapter 1.1.2 --- Independent Factor Analysis (IFA) --- p.2 / Chapter 1.1.3 --- Learning Methods --- p.6 / Chapter 1.2 --- Related Work --- p.14 / Chapter 1.2.1 --- Learning Gaussian FA --- p.14 / Chapter 1.2.2 --- Learning NFA --- p.16 / Chapter 1.2.3 --- Learning Semi-blind NFA --- p.18 / Chapter 1.3 --- Main Contribution of the Thesis --- p.18 / Chapter 1.4 --- Thesis Organization --- p.25 / Chapter 1.5 --- Publication List --- p.27 / Chapter 2 --- FA comparative analysis --- p.31 / Chapter 2.1 --- Determining the factor number --- p.32 / Chapter 2.2 --- Model Selection Methods --- p.34 / Chapter 2.2.1 --- Two-Stage Procedure and Classical Model Selection Criteria --- p.34 / Chapter 2.2.2 --- Kritchman&Nadler's Hypothesis Test (KN) --- p.35 / Chapter 2.2.3 --- Minimax Rank Estimation (MM) --- p.37 / Chapter 2.2.4 --- Minka's Criterion (MK) for PCA --- p.38 / Chapter 2.2.5 --- Bayesian Ying-Yang (BYY) Harmony Learning --- p.39 / Chapter 2.3 --- Empirical Analysis --- p.42 / Chapter 2.3.1 --- A New Tool for Empirical Comparison --- p.42 / Chapter 2.3.2 --- Investigation On Model Selection Performance --- p.44 / Chapter 2.4 --- A Theoretic Underestimation Partial Order --- p.49 / Chapter 2.4.1 --- Events of Estimating the Hidden Dimensionality --- p.49 / Chapter 2.4.2 --- The Structural Property of the Criterion Function --- p.49 / Chapter 2.4.3 --- Experimental Justification --- p.54 / Chapter 2.5 --- Concluding Remarks --- p.58 / Chapter 3 --- FA parameterizations affect model selection --- p.70 / Chapter 3.1 --- Parameterization Issue in Model Selection --- p.71 / Chapter 3.2 --- FAr: ML-equivalent Parameterizations of FA --- p.72 / Chapter 3.3 --- Variational Bayes on FAr --- p.74 / Chapter 3.4 --- Bayesian Ying-Yang Harmony Learning on FAr --- p.77 / Chapter 3.5 --- Empirical Analysis --- p.82 / Chapter 3.5.1 --- Three levels of investigations --- p.82 / Chapter 3.5.2 --- FA-a vs FA-b: performances of BYY, VB, AIC, BIC, and DNLL --- p.84 / Chapter 3.5.3 --- FA-r: performances of VB versus BYY --- p.87 / Chapter 3.5.4 --- FA-a vs FA-b: automatic model selection performance of BYYandVB --- p.90 / Chapter 3.5.5 --- Classification Performance on Real World Data Sets --- p.92 / Chapter 3.6 --- Concluding remarks --- p.93 / Chapter 4 --- BFA learning versus optimization --- p.104 / Chapter 4.1 --- Binary Factor Analysis --- p.105 / Chapter 4.2 --- BYY Harmony Learning on BFA --- p.107 / Chapter 4.3 --- Empirical Analysis --- p.108 / Chapter 4.3.1 --- BIC and Variational Bayes (VB) on BFA --- p.108 / Chapter 4.3.2 --- Error in solving BQP affects model selection --- p.110 / Chapter 4.3.3 --- Priors over parameters affect model selection --- p.114 / Chapter 4.3.4 --- Comparisons among BYY, VB, and BIC --- p.115 / Chapter 4.3.5 --- Applications in recovering binary images --- p.116 / Chapter 4.4 --- Concluding Remarks --- p.117 / Chapter 5 --- BMF for PPI network analysis --- p.124 / Chapter 5.1 --- The problem of protein complex prediction --- p.125 / Chapter 5.2 --- A novel binary matrix factorization (BMF) algorithm --- p.126 / Chapter 5.3 --- Experimental Results --- p.130 / Chapter 5.3.1 --- Other methods in comparison --- p.130 / Chapter 5.3.2 --- Data sets --- p.131 / Chapter 5.3.3 --- Evaluation criteria --- p.131 / Chapter 5.3.4 --- On altered graphs by randomly adding and deleting edges --- p.132 / Chapter 5.3.5 --- On real PPI data sets --- p.137 / Chapter 5.3.6 --- On gene expression data for biclustering --- p.137 / Chapter 5.4 --- A Theoretical Analysis on BYY-BMF --- p.138 / Chapter 5.4.1 --- Main results --- p.138 / Chapter 5.4.2 --- Experimental justification --- p.140 / Chapter 5.4.3 --- Proofs --- p.143 / Chapter 5.5 --- Concluding Remarks --- p.147 / Chapter 6 --- Semi-blind NFA: algorithms and applications --- p.148 / Chapter 6.1 --- Determining transcription factor activity --- p.148 / Chapter 6.1.1 --- A brief review on NCA --- p.149 / Chapter 6.1.2 --- Sparse NFA --- p.150 / Chapter 6.1.3 --- Sparse BFA --- p.156 / Chapter 6.1.4 --- On Yeast cell-cycle data --- p.160 / Chapter 6.1.5 --- On E. coli carbon source transition data --- p.166 / Chapter 6.2 --- Concluding Remarks --- p.170 / Chapter 7 --- Applications on Exome Sequencing Data Analysis --- p.172 / Chapter 7.1 --- From GWAS to Exome Sequencing --- p.172 / Chapter 7.2 --- Encoding An Exon/Gene --- p.173 / Chapter 7.3 --- An NFA Classifier --- p.175 / Chapter 7.4 --- Results --- p.176 / Chapter 7.4.1 --- Simulation --- p.176 / Chapter 7.4.2 --- On a real exome sequencing data set: AHMUe --- p.177 / Chapter 7.5 --- Concluding Remarks --- p.186 / Chapter 8 --- Conclusion and FutureWork --- p.187 / Chapter A --- Derivations of the learning algorithms on FA-r --- p.190 / Chapter A.1 --- The VB learning algorithm on FA-r --- p.190 / Chapter A.2 --- The BYY learning algorithm on FA-r --- p.193 / Bibliography --- p.195

Identiferoai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_327991
Date January 2012
ContributorsTu, Shikui., Chinese University of Hong Kong Graduate School. Division of Computer Science and Engineering.
Source SetsThe Chinese University of Hong Kong
LanguageEnglish, Chinese
Detected LanguageEnglish
TypeText, bibliography
Formatelectronic resource, electronic resource, remote, 1 online resource (xiv, 212 leaves) : ill. (chiefly col.)
RightsUse of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Page generated in 0.0031 seconds