Global ETD Search

1	Independence Screening in High-Dimensional Data Wauters, John, Wauters, John January 2016 (has links) High-dimensional data, data in which the number of dimensions exceeds the number of observations, is increasingly common in statistics. The term "ultra-high dimensional" is defined by Fan and Lv (2008) as describing the situation where log(p) is of order O(na) for some a in the interval (0, ½). It arises in many contexts such as gene expression data, proteomic data, imaging data, tomography, and finance, as well as others. High-dimensional data present a challenge to traditional statistical techniques. In traditional statistical settings, models have a small number of features, chosen based on an assumption of what features may be relevant to the response of interest. In the high-dimensional setting, many of the techniques of traditional feature selection become computationally intractable, or does not yield unique solutions. Current research in modeling high-dimensional data is heavily focused on methods that screen the features before modeling; that is, methods that eliminate noise-features as a pre-modeling dimension reduction. Typically noise feature are identified by exploiting properties of independent random variables, thus the term "independence screening." There are methods for modeling high-dimensional data without feature screening first (e.g. LASSO or SCAD), but simulation studies show screen-first methods perform better as dimensionality increases. Many proposals for independence screening exist, but in my literature review certain themes recurred: A) The assumption of sparsity: that all the useful information in the data is actually contained in a small fraction of the features (the "active features"), the rest being essentially random noise (the "inactive" features). B) In many newer methods, initial dimension reduction by feature screening reduces the problem from the high-dimensional case to a classical case; feature selection then proceeds by a classical method. C) In the initial screening, removal of features independent of the response is highly desirable, as such features literally give no information about the response. D) For the initial screening, some statistic is applied pairwise to each feature in combination with the response; the specific statistic chosen so that in the case that the two random variables are independent, a specific known value is expected for the statistic. E) Features are ranked by the absolute difference between the calculated statistic and the expected value of that statistic in the independent case, i.e. features that are most different from the independent case are most preferred. F) Proof is typically offered that, asymptotically, the method retains the true active features with probability approaching one. G) Where possible, an iterative version of the process is explored, as iterative versions do much better at identifying features that are active in their interactions, but not active individually. feature screening high-dimensional data independence screening modeling dimension reduction
2	Applications of Sure Independence Screening Analysis for Supersaturated Designs Nicely, Lindsey 25 April 2012 (has links) Experimental design has applications in many fields, from medicine to manufacturing. Incorporating statistics into both the planning and analysis stages of the experiment will ensure that appropriate data are collected to allow for meaningful analysis and interpretation of the results. If the number of factors of interest is very large, or if the experimental runs are very expensive, then a supersaturated design (SSD) can be used for factor screening. These designs have n runs and k > n - 1 factors, so there are not enough degrees of freedom to allow estimation of all of the main effects. This paper will first review some of the current techniques for the construction and analysis of SSDs, as well as the analysis challenges inherent to SSDs. Analysis techniques of Sure Independence Screening (SIS) and Iterative Sure Independence Screening (ISIS) are discussed, and their applications for SSDs are explored using simulation, in combination with the Smoothly Clipped Absolute Deviation (SCAD) approach for down-selecting and estimating the effects. supersaturated design sure independence screening SIS ISIS SCAD design of experiments Physical Sciences and Mathematics
3	Partition Models for Variable Selection and Interaction Detection Jiang, Bo 27 September 2013 (has links) Variable selection methods play important roles in modeling high-dimensional data and are key to data-driven scientific discoveries. In this thesis, we consider the problem of variable selection with interaction detection. Instead of building a predictive model of the response given combinations of predictors, we start by modeling the conditional distribution of predictors given partitions based on responses. We use this inverse modeling perspective as motivation to propose a stepwise procedure for effectively detecting interaction with few assumptions on parametric form. The proposed procedure is able to detect pairwise interactions among p predictors with a computational time of \(O(p)\) instead of \(O(p^2)\) under moderate conditions. We establish consistency of the proposed procedure in variable selection under a diverging number of predictors and sample size. We demonstrate its excellent empirical performance in comparison with some existing methods through simulation studies as well as real data examples. Next, we combine the forward and inverse modeling perspectives under the Bayesian framework to detect pleiotropic and epistatic effects in effects in expression quantitative loci (eQTLs) studies. We augment the Bayesian partition model proposed by Zhang et al. (2010) to capture complex dependence structure among gene expression and genetic markers. In particular, we propose a sequential partition prior to model the asymmetric roles played by the response and the predictors, and we develop an efficient dynamic programming algorithm for sampling latent individual partitions. The augmented partition model significantly improves the power in detecting eQTLs compared to previous methods in both simulations and real data examples pertaining to yeast. Finally, we study the application of Bayesian partition models in the unsupervised learning of transcription factor (TF) families based on protein binding microarray (PBM). The problem of TF subclass identification can be viewed as the clustering of TFs with variable selection on their binding DNA sequences. Our model provides simultaneous identification of TF families and their shared sequence preferences, as well as DNA sequences bound preferentially by individual members of TF families. Our analysis may aid in deciphering cis regulatory codes and determinants of protein-DNA binding specificity. / Statistics Statistics Hierarchical models Inverse models Quantitative trait loci Sliced inverse regression Sure independence screening Transcriptional regulation

1

Page generated in 0.1061 seconds