1

A study of the prediction performance and multivariate extensions of the horseshoe estimator

Yunfan Li, 14 May 2019
The horseshoe prior has been shown to successfully handle high-dimensional sparse estimation problems. It both adapts to sparsity efficiently and provides nearly unbiased estimates for large signals. In addition, efficient sampling algorithms have been developed and successfully applied to a vast array of high-dimensional sparse estimation problems. In this dissertation, we investigate the prediction performance of the horseshoe prior in sparse regression, and extend the horseshoe prior to two multivariate settings.

We begin with a study of the finite-sample prediction performance of shrinkage regression methods, where the risk can be unbiasedly estimated using Stein's approach. We show that the horseshoe prior achieves improved prediction risk over global shrinkage rules by using a component-specific local shrinkage term, learned from the data under a heavy-tailed prior, in combination with a global term providing shrinkage toward zero. We demonstrate improved prediction performance in a simulation study and on a pharmacogenomics data set, confirming our theoretical findings.

We then extend the horseshoe prior to two high-dimensional multivariate problems. First, we develop a new estimator of the inverse covariance matrix for high-dimensional multivariate normal data. The proposed graphical horseshoe estimator has attractive properties compared to other popular estimators. The most prominent benefit is that when the true inverse covariance matrix is sparse, the graphical horseshoe estimator provides estimates with small information divergence from the sampling model. The posterior mean under the graphical horseshoe prior can also be almost unbiased under certain conditions. In addition to these theoretical results, we provide a full Gibbs sampler for implementation. The graphical horseshoe estimator compares favorably to existing techniques in simulations and in a human gene network data analysis.

In our second setting, we apply the horseshoe prior to the joint estimation of regression coefficients and the inverse covariance matrix in normal models. The computational challenge in this problem is due to the dimensionality of the parameter space, which routinely exceeds the sample size. We show that the advantages of the horseshoe prior in estimating a mean vector or an inverse covariance matrix separately are also present when addressing both simultaneously. We propose a fully Bayesian treatment, with a sampling algorithm that is linear in the number of predictors. Extensive performance comparisons are provided with both frequentist and Bayesian alternatives, and both estimation and prediction performance are verified on a genomic data set.
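As an illustrative sketch (not code from the dissertation), the horseshoe hierarchy the abstract describes can be simulated directly: half-Cauchy priors on a global scale tau and component-specific local scales lambda_j, with a conditionally normal coefficient. The shrinkage-weight formula below assumes the standard normal-means model with unit noise variance; variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Horseshoe hierarchy: beta_j | lambda_j, tau ~ N(0, tau^2 * lambda_j^2),
# with half-Cauchy priors on the local scales lambda_j and the global scale tau.
p = 10_000
tau = abs(rng.standard_cauchy())       # global shrinkage scale (half-Cauchy)
lam = np.abs(rng.standard_cauchy(p))   # local shrinkage scales (half-Cauchy)
beta = rng.normal(0.0, tau * lam)      # draws from the horseshoe prior

# In the normal-means model with unit noise, the per-component shrinkage
# weight is kappa_j = 1 / (1 + tau^2 * lambda_j^2): kappa near 1 shrinks the
# component toward zero, while the heavy Cauchy tails let kappa fall near 0
# for large signals, leaving them nearly unbiased (the "horseshoe" shape).
kappa = 1.0 / (1.0 + tau**2 * lam**2)
```

This is the mechanism behind the improved prediction risk described above: the global term tau pulls noise components toward zero, while the heavy-tailed local terms lambda_j let genuine signals escape shrinkage.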
2

Gaussian Graphical Model Selection for Gene Regulatory Network Reverse Engineering and Function Prediction

Kontos, Kevin, 02 July 2009
One of the most important and challenging "knowledge extraction" tasks in bioinformatics is the reverse engineering of gene regulatory networks (GRNs) from DNA microarray gene expression data. Indeed, as a result of the development of high-throughput data-collection techniques, biology is experiencing a data flood that is pushing biologists toward a new, system-level view of biological systems: systems biology.

Unfortunately, even for small model organisms such as the yeast Saccharomyces cerevisiae, the number p of genes is much larger than the number n of expression data samples. The dimensionality issue induced by this "small n, large p" setting renders standard statistical learning methods inadequate. Restricting the complexity of the models makes it possible to deal with this serious impediment: by introducing (a priori undesirable) bias in the model selection procedure, one reduces the variance of the selected model, thereby increasing its accuracy.

Gaussian graphical models (GGMs) have proven to be a very powerful formalism for inferring GRNs from expression data. Standard GGM selection techniques can unfortunately not be used in the "small n, large p" setting. One way to overcome this issue is to resort to regularization. In particular, shrinkage estimators of the covariance matrix, which are required to infer GGMs, have proven to be very effective. Our first contribution is a new shrinkage estimator that improves upon existing ones through the use of a Monte Carlo (parametric bootstrap) procedure.

Another approach to GGM selection in the "small n, large p" setting is to reverse engineer limited-order partial correlation graphs (q-partial correlation graphs) that approximate GGMs. Our second contribution is an inference algorithm, the q-nested procedure, that builds a sequence of nested q-partial correlation graphs, exploiting the topology of the lower-order graphs to infer the higher-order ones. This significantly speeds up the inference of such graphs and avoids problems related to multiple testing. Consequently, we are able to consider higher-order graphs, thereby increasing the accuracy of the inferred graphs.

Another important challenge in bioinformatics is the prediction of gene function. An example of such a prediction task is the identification of genes that are targets of the nitrogen catabolite repression (NCR) selection mechanism in the yeast Saccharomyces cerevisiae; the study of model organisms such as Saccharomyces cerevisiae is indispensable for understanding more complex organisms. Our third contribution extends the standard two-class classification approach by enriching the set of variables and comparing several feature selection techniques and classification algorithms.

Finally, our fourth contribution formulates the prediction of NCR target genes as a network inference task. We use GGM selection to infer multivariate dependencies between genes and, starting from a set of genes known to be sensitive to NCR, classify the remaining genes. We hence avoid problems related to the choice of a negative training set and take advantage of the robustness of GGM selection techniques in the "small n, large p" setting.
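As a minimal sketch of the regularization idea this abstract relies on (not the dissertation's own estimator), the snippet below applies linear shrinkage of the sample covariance toward a diagonal target in a "small n, large p" setting, then reads GGM edge strengths off the resulting precision matrix as partial correlations. The fixed shrinkage intensity `a` is an illustrative assumption; data-driven choices such as Ledoit-Wolf estimate it from the data.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Small n, large p": n = 20 samples of p = 50 variables.
n, p = 20, 50
X = rng.standard_normal((n, p))

# Linear shrinkage toward a diagonal target, S* = (1 - a) * S + a * T.
# The sample covariance S is singular when p > n; the shrunk estimate S*
# is well-conditioned and invertible, which GGM inference requires.
S = np.cov(X, rowvar=False)
T = np.diag(np.diag(S))
a = 0.3  # illustrative fixed shrinkage intensity
S_shrunk = (1 - a) * S + a * T

# A GGM edge (i, j) corresponds to a nonzero partial correlation, obtained
# from the precision matrix Omega = inv(S*):
#   rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj)
Omega = np.linalg.inv(S_shrunk)
d = np.sqrt(np.diag(Omega))
partial_corr = -Omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
```

Thresholding or testing the entries of `partial_corr` then yields a graph estimate, which is the step that the shrinkage and q-partial correlation contributions above make feasible when p greatly exceeds n.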
