High-dimensional data, data in which the number of dimensions exceeds the number of observations, is increasingly common in statistics. The term "ultra-high dimensional" is defined by Fan and Lv (2008) as describing the situation where log(p) is of order O(na) for some a in the interval (0, ½). It arises in many contexts such as gene expression data, proteomic data, imaging data, tomography, and finance, as well as others. High-dimensional data present a challenge to traditional statistical techniques. In traditional statistical settings, models have a small number of features, chosen based on an assumption of what features may be relevant to the response of interest. In the high-dimensional setting, many of the techniques of traditional feature selection become computationally intractable, or does not yield unique solutions. Current research in modeling high-dimensional data is heavily focused on methods that screen the features before modeling; that is, methods that eliminate noise-features as a pre-modeling dimension reduction. Typically noise feature are identified by exploiting properties of independent random variables, thus the term "independence screening." There are methods for modeling high-dimensional data without feature screening first (e.g. LASSO or SCAD), but simulation studies show screen-first methods perform better as dimensionality increases. Many proposals for independence screening exist, but in my literature review certain themes recurred: A) The assumption of sparsity: that all the useful information in the data is actually contained in a small fraction of the features (the "active features"), the rest being essentially random noise (the "inactive" features). B) In many newer methods, initial dimension reduction by feature screening reduces the problem from the high-dimensional case to a classical case; feature selection then proceeds by a classical method. C) In the initial screening, removal of features independent of the response is highly desirable, as such features literally give no information about the response. D) For the initial screening, some statistic is applied pairwise to each feature in combination with the response; the specific statistic chosen so that in the case that the two random variables are independent, a specific known value is expected for the statistic. E) Features are ranked by the absolute difference between the calculated statistic and the expected value of that statistic in the independent case, i.e. features that are most different from the independent case are most preferred. F) Proof is typically offered that, asymptotically, the method retains the true active features with probability approaching one. G) Where possible, an iterative version of the process is explored, as iterative versions do much better at identifying features that are active in their interactions, but not active individually.
Identifer | oai:union.ndltd.org:arizona.edu/oai:arizona.openrepository.com:10150/623083 |
Date | January 2016 |
Creators | Wauters, John, Wauters, John |
Contributors | Niu, Yue, Niu, Yue, Billheimer, David D., Hurwitz, Bonnie L. |
Publisher | The University of Arizona. |
Source Sets | University of Arizona |
Language | en_US |
Detected Language | English |
Type | text, Electronic Thesis |
Rights | Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author. |
Page generated in 0.0021 seconds