
Bayesian approach for two model-selection-related bioinformatics problems. / CUHK electronic theses & dissertations collection

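Within the framework of Bayesian inference, Bayesian methods can infer the parameters and structures of complex probabilistic models from data. They are widely applied in many fields and are likewise well suited to bioinformatics problems. This thesis introduces new Bayesian models and computational methods to discuss and solve two bioinformatics problems related to model selection. / The first problem concerns pattern recognition in DNA sequences. Tandem repeat segments occur frequently in DNA sequences and are important for the study of genome evolution and human disease. This part of the thesis focuses on the case in which an unknown number of tandem repeats of the same pattern are dispersed throughout a single sequence. We first build a probabilistic model for the tandem repeat segments. Markov chain Monte Carlo algorithms are then used to explore the posterior distribution and thereby infer the pattern matrix and the positions of the repeat segments. In addition, the RJMCMC algorithm is used to solve the model selection problem caused by the unknown number of repeat segments. / The other problem is the analysis of conformational transitions of biomolecules. The conformations of a biomolecule can be divided into several distinct metastable states. Because of the inherent connection between a biomolecule's function and its conformation, conformational transitions play a very important role in the biological processes of various biomolecules. Conformational transition data are generally obtained from molecular dynamics simulations. Based on microstate-level conformational transition information from molecular dynamics simulations, we use a Bayesian approach to study the aggregation of microstates into a variable number of metastable states. / Through the discussion of these two problems, this thesis illustrates that Bayesian methods have advantages in many aspects of bioinformatics research, including accounting for the variability of biological problems, handling noisy and missing data, and solving model selection problems. / The Bayesian approach is a powerful framework for inferring the parameters and structures of complicated probabilistic models from data. It is widely applied in many areas and is also well suited to bioinformatics problems because of their usually high complexity. In this thesis, new Bayesian models and computing methods are introduced to solve two bioinformatics problems, both of which are related to model selection. / The first problem is repeat pattern recognition. Tandem repeats occur frequently in DNA sequences. They are important for studying genome evolution and human disease. This thesis focuses on the case in which an unknown number of tandem repeat segments of the same pattern are dispersed throughout a sequence. A probabilistic generative model is introduced for the tandem repeats. Markov chain Monte Carlo algorithms are used to explore the posterior distribution in order to infer both the specific pattern of the tandem repeats and the locations of the repeat segments. Furthermore, reversible jump Markov chain Monte Carlo algorithms are used to address the transdimensional model selection problem raised by the variable number of repeat segments. / The second part of this thesis concerns the conformational transitions of biomolecules. 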
Because the function of a biomolecule is inherently related to its variable conformations, which can be grouped into a set of metastable (long-lived) states, conformational transitions are important in biological processes. Changes in 3D structure are generally obtained from molecular dynamics simulations. Based on the microstate-level conformational transitions observed in molecular dynamics simulations, a Bayesian approach is developed to cluster the microstates into an uncertain number of metastable states, which gives rise to a model selection problem. / With these two problems, this thesis shows that the Bayesian approach to bioinformatics problems has advantages in accounting for the inherent uncertainty in biological data, handling noisy or missing data, and dealing with the model selection problem. / Detailed summary in vernacular field only. / Liang, Tong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2013. / Includes bibliographical references (leaves 120-130). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts also in Chinese. 
/ Abstract --- p.i / Acknowledgement --- p.iv
/ Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Motivation --- p.1 / Chapter 1.2 --- Statistical Background --- p.2 / Chapter 1.3 --- Tandem Repeats --- p.4 / Chapter 1.4 --- Conformational Space --- p.5 / Chapter 1.5 --- Outlines --- p.7
/ Chapter 2 --- Preliminaries --- p.9 / Chapter 2.1 --- Bayesian Inference --- p.9 / Chapter 2.2 --- Markov chain Monte Carlo --- p.10 / Chapter 2.2.1 --- Gibbs sampling --- p.11 / Chapter 2.2.2 --- Metropolis-Hastings algorithm --- p.12 / Chapter 2.2.3 --- Reversible Jump MCMC --- p.12
/ Chapter 3 --- Detection of Dispersed Short Tandem Repeats Using Reversible Jump MCMC --- p.14 / Chapter 3.1 --- Background --- p.14 / Chapter 3.2 --- Generative Model --- p.17 / Chapter 3.3 --- Statistical inference --- p.18 / Chapter 3.3.1 --- Likelihood --- p.19 / Chapter 3.3.2 --- Prior Distributions --- p.19 / Chapter 3.3.3 --- Sampling from Posterior Distribution via RJMCMC --- p.20 / Chapter 3.3.4 --- Extra MCMC moves for better mixing --- p.26 / Chapter 3.3.5 --- The complete algorithm --- p.29 / Chapter 3.4 --- Experiments --- p.29 / Chapter 3.4.1 --- Evaluation and comparison of the two RJMCMC versions using synthetic data --- p.30 / Chapter 3.4.2 --- Comparison with existing methods using synthetic data --- p.33 / Chapter 3.4.3 --- Sensitivity to Priors --- p.43 / Chapter 3.4.4 --- Real data experiment --- p.45 / Chapter 3.5 --- Discussion --- p.50
/ Chapter 4 --- A Probabilistic Clustering Algorithm for Conformational Changes of Biomolecules --- p.53 / Chapter 4.1 --- Introduction --- p.53 / Chapter 4.1.1 --- Molecular dynamic simulation --- p.54 / Chapter 4.1.2 --- Hierarchical Conformational Space --- p.55 / Chapter 4.1.3 --- Clustering Algorithms --- p.56 / Chapter 4.2 --- Generative Model --- p.58 / Chapter 4.2.1 --- Model 1: Vanilla Model --- p.59 / Chapter 4.2.2 --- Model 2: Zero-Inflated Model --- p.60 / Chapter 4.2.3 --- Model 3: Constrained Model --- p.61 / Chapter 4.2.4 --- Model 4: Constrained and Zero-Inflated Model --- p.61 / Chapter 4.3 --- Statistical Inference for Vanilla Model --- p.62 / Chapter 4.3.1 --- Priors --- p.62 / Chapter 4.3.2 --- Posterior distribution --- p.63 / Chapter 4.3.3 --- Collapsed Gibbs for Vanilla Model with a Fixed Number of Clusters --- p.63 / Chapter 4.3.4 --- Inference on the Number of Clusters --- p.65 / Chapter 4.3.5 --- Synthetic Data Study --- p.68 / Chapter 4.4 --- Statistical Inference for Zero-Inflated Model --- p.76 / Chapter 4.4.1 --- Method 1 --- p.78 / Chapter 4.4.2 --- Method 2 --- p.81 / Chapter 4.4.3 --- Synthetic Data Study --- p.84 / Chapter 4.5 --- Statistical Inference for Constrained Model --- p.85 / Chapter 4.5.1 --- Priors --- p.85 / Chapter 4.5.2 --- Posterior Distribution --- p.86 / Chapter 4.5.3 --- Collapsed Posterior Distribution --- p.86 / Chapter 4.5.4 --- Updating for Cluster Labels K --- p.89 / Chapter 4.5.5 --- Updating for Constrained Λ from Truncated Distribution --- p.89 / Chapter 4.5.6 --- Updating the Number of Clusters --- p.91 / Chapter 4.5.7 --- Uniform Background Parameters on Λ --- p.92 / Chapter 4.6 --- Real Data Experiments --- p.93 / Chapter 4.7 --- Discussion --- p.104
/ Chapter 5 --- Conclusion and Future Work --- p.107
/ Chapter A --- Appendix --- p.109 / Chapter A.1 --- Post-processing for indel treatment --- p.109 / Chapter A.2 --- Consistency Score --- p.111 / Chapter A.3 --- A Proof for Collapsed Posterior distribution in Constrained Model in Chapter 4 --- p.111 / Chapter A.4 --- Estimated Transition Matrices for Alanine Dipeptide by Chodera et al. (2006) --- p.117
/ Bibliography --- p.120

Identifier: oai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_328107
Date: January 2013
Contributors: Liang, Tong, Chinese University of Hong Kong Graduate School. Division of Information Engineering.
Source Sets: The Chinese University of Hong Kong
Language: English, Chinese
Detected Language: English
Type: Text, bibliography
Format: electronic resource, remote, 1 online resource (xi, 130 leaves) : ill. (some col.)
Rights: Use of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/)
