1 |
Hypothesis Testing for High-Dimensional Regression Under Extreme Phenotype Sampling of Continuous Traits
January 2018
acase@tulane.edu / Extreme phenotype sampling (EPS) is a widely used design for identifying candidate genetic factors that contribute to the variation of quantitative traits. By enriching the signal in the samples with extreme phenotypes, those within the top and bottom percentiles, EPS can boost study power compared with random sampling at the same sample size. The existing statistical methods for EPS data test variants or regions individually. However, many disorders are caused by multiple genetic factors. It is therefore critical to model the effects of genetic factors simultaneously, which may increase the power of current genetic studies and identify novel disease-associated genetic factors in EPS. The challenge of simultaneous analysis of genetic data is that the number of genetic factors (p ~10,000) is typically greater than the sample size (n ~1,000) in a single study. The standard linear model is inappropriate for this p>n problem because the design matrix is rank deficient. An alternative solution is to apply a penalized regression method, the least absolute shrinkage and selection operator (LASSO).
LASSO can handle this high-dimensional (p>n) problem by forcing certain regression coefficients to be zero. Although the application of LASSO in genetic studies under random sampling has been widely studied, its statistical inference and testing under EPS remain unknown. We propose a novel sparse model (EPS-LASSO) with a hypothesis test for high-dimensional regression under EPS, based on a decorrelated score function, to investigate genetic associations, including gene expression and rare variant analyses. Comprehensive simulations show that EPS-LASSO outperforms existing methods, with superior power when the effects are large, stable type I error, and FDR control. Together with a real data analysis of a genetic study of obesity, our results indicate that EPS-LASSO is an effective method for EPS data analysis that can account for correlated predictors. / 1 / Chao Xu
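The sampling design described in the abstract above can be illustrated with a minimal sketch. It is not the author's code: the function name `extreme_phenotype_sample`, the 20% tail fraction, and the simulated Gaussian trait are all assumptions made for illustration. It keeps only the individuals whose trait values fall in the bottom and top tails of the distribution.

```python
import random


def extreme_phenotype_sample(phenotypes, fraction=0.2):
    """Return (lower, upper): indices of the samples whose trait values lie
    in the bottom and top `fraction` of the observed distribution.

    Hypothetical helper for illustration; `fraction` is the share of the
    cohort kept in EACH tail.
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    n = len(phenotypes)
    k = int(n * fraction)  # number of samples kept per tail
    # Indices ordered from smallest to largest trait value.
    order = sorted(range(n), key=lambda i: phenotypes[i])
    lower = order[:k]
    upper = order[n - k:]
    return lower, upper


# Simulated quantitative trait for a cohort of 1,000 individuals (assumed data).
rng = random.Random(0)
trait = [rng.gauss(0.0, 1.0) for _ in range(1000)]

lower, upper = extreme_phenotype_sample(trait, fraction=0.2)
selected = lower + upper  # only these individuals would be genotyped
```

Under this design only the `selected` individuals enter the association analysis, which is why inference has to account for the non-random selection on the trait.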
|
2 |
Application of Bayesian Hierarchical Models in Genetic Data Analysis
Zhang, Lin 14 March 2013
Genetic data analysis has attracted considerable attention as a means of understanding the mechanisms behind the development and progression of diseases such as cancer, and it is crucial for discovering genetic markers and treatment targets in medical research. This dissertation focuses on several important issues in genetic data analysis: graphical network modeling, feature selection, and covariance estimation. First, we develop a gene network modeling method for discrete gene expression data produced by technologies such as serial analysis of gene expression and RNA sequencing experiments, which generate counts of mRNA transcripts in cell samples. We propose a generalized linear model to fit the discrete gene expression data and assume that the log ratios of the mean expression levels follow a Gaussian distribution. We derive the gene network structures by selecting covariance matrices of the Gaussian distribution with a hyper-inverse Wishart prior. We incorporate prior network models based on Gene Ontology information, which makes use of existing biological information on the genes of interest. Next, we consider a variable selection problem in which the variables have natural grouping structures, with application to the analysis of chromosomal copy number data. The chromosomal copy number data are produced by molecular inversion probe experiments, which measure probe-specific copy number changes. We propose a novel Bayesian variable selection method, the hierarchical structured variable selection (HSVS) method, which accounts for the natural gene and probe-within-gene architecture to identify important genes and probes associated with clinically relevant outcomes. The HSVS model is designed for grouped variable selection, where simultaneous selection of both groups and within-group variables is of interest. It utilizes a discrete mixture prior distribution for group selection and group-specific Bayesian lasso hierarchies for variable selection within groups.
We further provide methods for accounting for serial correlations within groups that incorporate Bayesian fused lasso methods for within-group selection. Finally, we propose a Bayesian method for estimating high-dimensional covariance matrices that can be decomposed into a low-rank component and a sparse component. This covariance structure has a wide range of applications, including factor analytic models and random effects models. We model covariance matrices with this decomposition structure by representing the covariance model as a factor analytic model in which the number of latent factors is unknown. We introduce binary indicators for estimating the rank of the low-rank component, combined with a Bayesian graphical lasso method for estimating the sparse component. We further extend our method to a graphical factor analytic model in which the graphical model of the residuals is of interest. We achieve sparse estimation of the inverse covariance of the residuals in the graphical factor model by employing a hyper-inverse Wishart prior method for a decomposable graph and a Bayesian graphical lasso method for an unrestricted graph.
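As a rough illustration of the low-rank-plus-sparse covariance structure described above, a generic factor analytic model can be written as follows. The symbols are assumptions chosen for this sketch, not notation taken from the dissertation: y is a p-dimensional observation, f a vector of k latent factors, Λ the loading matrix, and Ω the residual covariance.

```latex
y = \Lambda f + \varepsilon, \qquad
f \sim \mathcal{N}(0, I_k), \qquad
\varepsilon \sim \mathcal{N}(0, \Omega), \qquad
f \perp \varepsilon
\quad\Longrightarrow\quad
\Sigma = \operatorname{Cov}(y) = \Lambda \Lambda^{\top} + \Omega,
\qquad \Lambda \in \mathbb{R}^{p \times k}.
```

Here the term ΛΛᵀ has rank at most k, so it plays the role of the low-rank component, while Ω plays the role of the sparse component. In the graphical variant mentioned in the abstract, sparsity would instead be placed on the inverse of Ω.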
|
3 |
GENETIC FEATURE SELECTION USING DIMENSIONALITY REDUCTION APPROACHES: A COMPARATIVE STUDY
NAHLAWI, Layan 16 December 2010
The past decade has witnessed great advances in microarray and genotyping technologies, which allow genome-wide single nucleotide polymorphism (SNP) data to be captured on a single chip. As a consequence, genome-wide association studies require the development of algorithms capable of manipulating ultra-large-scale SNP datasets. Towards this goal, this thesis proposes two SNP selection methods: the first uses Independent Component Analysis (ICA), and the second is based on a modified version of Fast Orthogonal Search (mFOS).
The first proposed technique, based on ICA, is a filtering technique; it reduces the number of SNPs in a dataset, without the need for any class labels. The second proposed technique, orthogonal search based SNP selection, is a multivariate regression approach; it selects the most informative features in SNP data to accurately model the entire dataset.
The proposed methods are evaluated by applying them to publicly available gene SNP datasets, and comparing the accuracies of each method in reconstructing the datasets. In addition, the selection results are compared with those of another SNP selection method based on Principal Component Analysis (PCA), which was also applied to the same datasets.
The results demonstrate the ability of orthogonal search to capture more information than the ICA-based SNP selection approach while using a smaller number of SNPs. Furthermore, the SNP reconstruction accuracies obtained with the proposed ICA methodology show that it summarizes a greater or equivalent amount of information compared with the PCA-based technique reported in the literature.
The execution time of the second developed methodology, mFOS, has paved the way for its application to large-scale genome wide datasets. / Thesis (Master, Computing) -- Queen's University, 2010-12-15 18:03:00.208
|