41

Feature Screening for High-Dimensional Variable Selection In Generalized Linear Models

Jiang, Jinzhu 02 September 2021 (has links)
No description available.
42

High Dimensional Data Methods in Industrial Organization Type Discrete Choice Models

Lopez Gomez, Daniel Felipe 11 August 2022 (has links)
No description available.
43

Sparse Ridge Fusion For Linear Regression

Mahmood, Nozad 01 January 2013 (has links)
In linear regression, the traditional technique deals with the case where the number of observations n exceeds the number of predictor variables p (n > p). When n < p, the classical method fails to estimate the coefficients. This thesis provides a solution for the case of correlated predictors. A new regularization and variable selection method is proposed under the name of Sparse Ridge Fusion (SRF). With highly correlated predictors, simulated examples and a real data set show that the SRF consistently outperforms the lasso, the elastic net, and the S-Lasso, and that the SRF can select more predictor variables than the sample size n, whereas the lasso can select at most n variables.
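The abstract contrasts SRF with the lasso and elastic net in the n < p, correlated-predictor regime. SRF itself is not specified here, so the sketch below only reproduces that comparison setup with scikit-learn's lasso and elastic net as baselines; the sample sizes, correlation level, and cross-validation settings are illustrative assumptions rather than values from the thesis.

```python
# Baseline comparison in the n < p regime with correlated predictors.
# This is NOT the Sparse Ridge Fusion estimator; it only sets up the
# lasso / elastic-net baselines the abstract compares against.
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV

rng = np.random.default_rng(0)
n, p, rho = 50, 200, 0.9            # assumed sizes: n < p, strong correlation

# Correlated design: columns share a common latent factor (illustrative).
z = rng.standard_normal((n, 1))
X = np.sqrt(rho) * z + np.sqrt(1 - rho) * rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 2.0                      # 10 truly active, correlated predictors
y = X @ beta + rng.standard_normal(n)

lasso = LassoCV(cv=5).fit(X, y)
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X, y)

# The lasso can select at most n predictors; count what each method keeps.
print("lasso selected:", np.sum(lasso.coef_ != 0))
print("elastic net selected:", np.sum(enet.coef_ != 0))
```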
44

LEARNING FROM INCOMPLETE HIGH-DIMENSIONAL DATA

Lou, Qiang January 2013 (has links)
Data sets with irrelevant and redundant features and a large fraction of missing values are common in real-life applications. Learning from such data usually requires some preprocessing, such as selecting informative features and imputing missing values based on the observed data. These steps can provide more accurate and more efficient prediction as well as a better understanding of the data distribution. In my dissertation I describe my work on both of these aspects, as well as my follow-up work on feature selection in incomplete data sets without imputing missing values. In the last part of my dissertation, I present my current work on the more challenging situation where the high-dimensional data evolve over time. The first two parts of my dissertation consist of methods that handle such data in a straightforward way: imputing missing values first, and then applying a traditional feature selection method to select informative features. We propose two novel methods, one for imputing missing values and the other for selecting informative features. The imputation method fills in missing attributes by exploiting temporal correlation of attributes, correlations among multiple attributes collected at the same time and location, and spatial correlations among attributes from multiple sources. The proposed feature selection method aims to find a minimal subset of the most informative variables for classification/regression by efficiently approximating the Markov blanket, a set of variables that can shield a certain variable from the target. In the third part, I show how to perform feature selection in incomplete high-dimensional data without imputation, since imputation methods only work well when data are missing completely at random, when the fraction of missing values is small, or when there is prior knowledge about the data distribution. We define the objective function of the uncertainty-margin-based feature selection method to maximize each instance's uncertainty margin in its own relevant subspace; the optimization takes into account the uncertainty of each instance due to the missing values. Experimental results on synthetic data and six benchmark data sets with few missing values (less than 25%) provide evidence that our method selects features as accurate as those chosen by alternative methods that apply an imputation method first. However, when there is a large fraction of missing values (more than 25%), our feature selection method outperforms these alternatives. In the fourth part, I introduce a method for the more challenging situation where the high-dimensional data vary over time. The existing way to handle such data is to flatten the temporal data into a single static data matrix and then apply a traditional feature selection method. To preserve the dynamics in the time series, our method avoids flattening the data in advance. We propose a way to measure the distance between the multivariate temporal data of two instances. Based on this distance, we define a new objective function built on the temporal margin of each data instance. A fixed-point gradient descent method is proposed to solve the resulting objective and learn the optimal feature weights. Experimental results on real temporal microarray data provide evidence that the proposed method identifies more informative features than alternatives that flatten the temporal data in advance. / Computer and Information Science
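As a point of reference for the "impute first, then select" baseline the abstract describes (not the dissertation's own imputation or Markov-blanket methods), here is a minimal scikit-learn sketch; the choice of KNN imputation, the mutual-information filter, and the data shapes are illustrative assumptions.

```python
# Minimal "impute first, then select features" baseline pipeline.
# A generic stand-in, not the dissertation's proposed methods.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
n, p = 200, 500                              # assumed sizes
X = rng.standard_normal((n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # only two informative features
X[rng.random((n, p)) < 0.2] = np.nan         # ~20% of values missing at random

pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),               # fill missing values
    ("select", SelectKBest(mutual_info_classif, k=10)),  # keep 10 features
])
X_selected = pipe.fit_transform(X, y)
kept = pipe.named_steps["select"].get_support(indices=True)
print("selected feature indices:", kept)
```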
45

Variable Selection and Supervised Dimension Reduction for Large-Scale Genomic Data with Censored Survival Outcomes

Spirko, Lauren Nicole January 2017 (has links)
One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes, providing insight into the disease's process. With the rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of thousands of genes and proteins resulting in enormous data sets where the number of genomic variables (covariates) is far greater than the number of subjects. It is also typical for such data sets to have a high proportion of censored observations. Methods based on univariate Cox regression are often used to select genes related to survival outcome. However, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each gene. When applied to genes exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. In this thesis, we develop methods that will directly address t / Statistics
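The gene-screening baseline the abstract refers to, univariate Cox regression per gene, can be sketched with the lifelines package as below; the data layout, column names, and the p-value ranking are illustrative assumptions, and the sketch does not address the non-proportional-hazards issue the thesis targets.

```python
# Univariate Cox screening: fit one Cox model per gene and rank by p-value.
# A generic baseline only; it assumes proportional hazards for every gene.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n, n_genes = 150, 1000                         # assumed sizes
expr = pd.DataFrame(rng.standard_normal((n, n_genes)),
                    columns=[f"gene_{j}" for j in range(n_genes)])
time = rng.exponential(10, size=n)             # toy survival times
event = rng.binomial(1, 0.6, size=n)           # 1 = event observed, 0 = censored

pvals = {}
for g in expr.columns[:50]:                    # screen a subset for brevity
    df = pd.DataFrame({"T": time, "E": event, g: expr[g]})
    cph = CoxPHFitter().fit(df, duration_col="T", event_col="E")
    pvals[g] = cph.summary.loc[g, "p"]         # Wald p-value for this gene

top = sorted(pvals, key=pvals.get)[:10]
print("top-ranked genes:", top)
```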
46

Inference of nonparametric hypothesis testing on high dimensional longitudinal data and its application in DNA copy number variation and micro array data analysis

Zhang, Ke January 1900 (has links)
Doctor of Philosophy / Department of Statistics / Haiyan Wang / High throughput screening technologies have generated a huge amount of biological data in the last ten years. With the easy availability of array technology, researchers started to investigate biological mechanisms using experiments with more sophisticated designs that pose novel challenges to statistical analysis. We provide theory for robust statistical tests in three flexible models. In the first model, we consider the hypothesis testing problems when there are a large number of variables observed repeatedly over time. A potential application is in tumor genomics where an array comparative genome hybridization (aCGH) study will be used to detect progressive DNA copy number changes in tumor development. In the second model, we consider hypothesis testing theory in a longitudinal microarray study when there are multiple treatments or experimental conditions. The tests developed can be used to detect treatment effects for a large group of genes and discover genes that respond to treatment over time. In the third model, we address a hypothesis testing problem that could arise when array data from different sources are to be integrated. We perform statistical tests by assuming a nested design. In all models, robust test statistics were constructed based on moment methods allowing unbalanced design and arbitrary heteroscedasticity. The limiting distributions were derived under the nonclassical setting when the number of probes is large. The test statistics are not targeted at a single probe. Instead, we are interested in testing for a selected set of probes simultaneously. Simulation studies were carried out to compare the proposed methods with some traditional tests using linear mixed-effects models and generalized estimating equations. Interesting results obtained with the proposed theory in two cancer genomic studies suggest that the new methods are promising for a wide range of biological applications with longitudinal arrays.
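The abstract's second model tests treatment effects for a selected group of genes measured over time. The thesis's moment-based statistics are not given here, so the sketch below shows only a generic permutation test for a treatment effect aggregated over a set of probes in a longitudinal design; the array shapes, the aggregate statistic, and the number of permutations are illustrative assumptions.

```python
# Generic permutation test for a treatment effect over a set of probes,
# each measured repeatedly over time. Not the thesis's moment-based tests.
import numpy as np

rng = np.random.default_rng(3)
n_probes, n_subj, n_time = 20, 30, 5          # assumed dimensions
treat = np.array([0] * 15 + [1] * 15)         # two treatment groups
data = rng.standard_normal((n_probes, n_subj, n_time))
data[:5, treat == 1, :] += 0.8                # 5 probes respond to treatment

def group_stat(x, labels):
    """Sum over probes of the squared mean difference between groups,
    summarizing each subject by its average over time."""
    subj_means = x.mean(axis=2)               # shape (n_probes, n_subj)
    diff = (subj_means[:, labels == 1].mean(axis=1)
            - subj_means[:, labels == 0].mean(axis=1))
    return np.sum(diff ** 2)

observed = group_stat(data, treat)
n_perm = 1000
null = np.array([group_stat(data, rng.permutation(treat)) for _ in range(n_perm)])
p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
print(f"permutation p-value for the probe set: {p_value:.4f}")
```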
47

Applications of stochastic control and statistical inference in macroeconomics and high-dimensional data

Han, Zhi 07 January 2016 (has links)
This dissertation is dedicated to studying the modeling of drift control in foreign exchange reserves management and to designing fast algorithms for statistical inference, with applications in high-dimensional data analysis. The thesis has two parts. The first topic involves modeling foreign exchange reserve management as a drift control problem. We show that, under certain conditions, control band policies are optimal for the discounted-cost drift control problem, and we develop an algorithm to calculate the optimal thresholds of the optimal control band policy. The second topic involves a fast algorithm for computing the partial distance covariance statistic, with an application to feature screening in high-dimensional data. We show that an O(n log n) algorithm exists for a version of the partial distance covariance, compared with the O(n^2) algorithm implemented directly according to its definition. We further propose an iterative feature screening procedure for high-dimensional data based on the partial distance covariance. This procedure enjoys two advantages over correlation learning. First, an important predictor that is marginally uncorrelated but jointly correlated with the response can be picked up by our procedure and thus enter the estimation model. Second, our procedure is robust to model misspecification.
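For reference, the quantity whose faster computation the thesis develops can be computed directly in O(n^2) by double-centering the pairwise distance matrices; the sketch below shows plain (not partial) sample distance covariance used for screening, and the function names and toy data are illustrative assumptions.

```python
# Direct O(n^2) sample distance covariance (the partial version and the
# O(n log n) algorithm from the thesis are not shown here).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def double_center(d):
    """Double-center a distance matrix: subtract row/column means, add grand mean."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def dcov2(x, y):
    """Squared sample distance covariance via double-centered distance matrices."""
    a = double_center(squareform(pdist(x.reshape(-1, 1))))
    b = double_center(squareform(pdist(y.reshape(-1, 1))))
    return (a * b).mean()

rng = np.random.default_rng(4)
n, p = 200, 100                                   # assumed sizes
X = rng.standard_normal((n, p))
y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(n)   # nonlinear, marginally ~uncorrelated

# Screen features by distance covariance with the response.
scores = np.array([dcov2(X[:, j], y) for j in range(p)])
print("top features by dCov^2:", np.argsort(scores)[::-1][:5])
```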
48

Regularization Methods for Predicting an Ordinal Response using Longitudinal High-dimensional Genomic Data

Hou, Jiayi 25 November 2013 (has links)
Ordinal scales are commonly used to measure health status and disease-related outcomes in hospital settings as well as in translational medical research. Notable examples include cancer staging, a five-category ordinal scale indicating tumor size, node involvement, and likelihood of metastasizing. The Glasgow Coma Scale (GCS), which gives a reliable and objective assessment of a patient's conscious status, is another ordinal-scaled measure. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical ordinal modeling methods based on the likelihood approach have contributed to the analysis of data in which the response categories are ordered and the number of covariates (p) is smaller than the sample size (n). With genomic technologies increasingly applied to obtain more accurate diagnoses and prognoses, a novel type of data, known as high-dimensional data, in which the number of covariates (p) is much larger than the number of samples (n), is generated. However, corresponding statistical methodologies as well as computational software are lacking for analyzing high-dimensional data with an ordinal or a longitudinal ordinal response. In this thesis, we develop a regularization algorithm to build a parsimonious model for predicting an ordinal response. In addition, we utilize the classical ordinal model with longitudinal measurements and incorporate cutting-edge data mining tools for a comprehensive understanding of the causes of complex disease at both the molecular and environmental levels. Moreover, we develop a corresponding R package for general use. The algorithm was applied to several real data sets as well as to simulated data to demonstrate its efficiency in variable selection and its precision in prediction and classification. The four real data sets are from: 1) the National Institute of Mental Health Schizophrenia Collaborative Study; 2) the San Diego Health Services Research Example; 3) a gene expression experiment to understand `Decreased Expression of Intelectin 1 in The Human Airway Epithelium of Smokers Compared to Nonsmokers' by Weill Cornell Medical College; and 4) the National Institute of General Medical Sciences Inflammation and the Host Response to Burn Injury Collaborative Study.
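The thesis's regularization algorithm and R package are not described in detail here; as a generic illustration of penalized ordinal regression, the sketch below fits an L1-penalized proportional-odds (cumulative logit) model by direct optimization of the penalized likelihood. The data, penalty level, threshold parameterization, and optimizer are all illustrative assumptions.

```python
# Minimal L1-penalized proportional-odds (cumulative logit) fit via scipy.
# A generic illustration, not the thesis's algorithm or its R package.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(5)
n, p, K = 120, 20, 4                       # assumed: n samples, p features, K ordered levels
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:3] = 1.5
latent = X @ beta_true + rng.logistic(size=n)
y = np.digitize(latent, [-1.0, 1.0, 3.0])  # ordinal response coded 0..K-1

def neg_penalized_loglik(params, lam=0.5):
    # params = [raw thresholds (K-1), beta (p)]; thresholds kept increasing
    raw, beta = params[:K - 1], params[K - 1:]
    theta = np.cumsum(np.concatenate([[raw[0]], np.exp(raw[1:])]))
    eta = X @ beta
    # P(y <= k) = expit(theta_k - eta); pad with 0 and 1 for the extremes
    cum = np.column_stack([np.zeros(n)] + [expit(t - eta) for t in theta] + [np.ones(n)])
    probs = np.clip(cum[np.arange(n), y + 1] - cum[np.arange(n), y], 1e-12, 1.0)
    return -np.sum(np.log(probs)) + lam * np.sum(np.abs(beta))

x0 = np.concatenate([np.array([-1.0] + [0.0] * (K - 2)), np.zeros(p)])
fit = minimize(neg_penalized_loglik, x0, method="Powell")  # derivative-free, tolerates L1
beta_hat = fit.x[K - 1:]
print("coefficients above 0.1 in magnitude:", np.where(np.abs(beta_hat) > 0.1)[0])
```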
49

Etude des projections de données comme support interactif de l’analyse visuelle de la structure de données de grande dimension / Study of multidimensional scaling as an interactive visualization to help the visual analysis of high dimensional data

Heulot, Nicolas 04 July 2014 (has links)
Acquérir et traiter des données est de moins en moins coûteux, à la fois en matériel et en temps, mais encore faut-il pouvoir les analyser et les interpréter malgré leur complexité. La dimensionnalité est un des aspects de cette complexité intrinsèque. Pour aider à interpréter et à appréhender ces données le recours à la visualisation est indispensable au cours du processus d’analyse. La projection représente les données sous forme d’un nuage de points 2D, indépendamment du nombre de dimensions. Cependant cette technique de visualisation souffre de distorsions dues à la réduction de dimension, ce qui pose des problèmes d’interprétation et de confiance. Peu d’études ont été consacrées à la considération de l’impact de ces artefacts, ainsi qu’à la façon dont des utilisateurs non-familiers de ces techniques peuvent analyser visuellement une projection. L’approche soutenue dans cette thèse repose sur la prise en compte interactive des artefacts, afin de permettre à des analystes de données ou des non-experts de réaliser de manière fiable les tâches d’analyse visuelle des projections. La visualisation interactive des proximités colore la projection en fonction des proximités d’origine par rapport à une donnée de référence dans l’espace des données. Cette technique permet interactivement de révéler les artefacts de projection pour aider à appréhender les détails de la structure sous-jacente aux données. Dans cette thèse, nous revisitons la conception de cette technique et présentons ses apports au travers de deux expérimentations contrôlées qui étudient l’impact des artefacts sur l’analyse visuelle des projections. Nous présentons également une étude de l’espace de conception d’une technique basée sur la métaphore de lentille et visant à s’affranchir localement des problématiques d’artefacts de projection. / The cost of data acquisition and processing has radically decreased, in both material and time, but we still need to analyze and interpret the large amounts of complex data that are stored. Dimensionality is one aspect of their intrinsic complexity. Visualization is essential during the analysis process to help interpret and understand these data. Projection represents data as a 2D scatterplot, regardless of the number of dimensions. However, this visualization technique suffers from artifacts due to the dimensionality reduction, and its lack of reliability raises issues of interpretation and trust. Few studies have considered the impact of these artifacts, or how users unfamiliar with these techniques can visually analyze a projection. The main approach of this thesis relies on taking these artifacts into account through interactive techniques, in order to allow data scientists or non-expert users to perform a trustworthy visual analysis of projections. The interactive visualization of proximities colors the projection according to the original proximities relative to a reference item in the data space. This interactive technique reveals projection artifacts and helps in grasping details of the underlying data structure. In this thesis, we redesign this technique and demonstrate its potential through two controlled experiments studying the impact of artifacts on the visual analysis of projections. We also present a study of the design space of a technique based on the lens metaphor, aimed at locally overcoming projection-artifact issues.
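The core idea, coloring a 2D projection by the original high-dimensional proximities to a chosen reference item so that projection artifacts become visible, can be sketched non-interactively as follows; the MDS projection, the reference index, and the toy data are illustrative assumptions, not the thesis's interactive system.

```python
# Static sketch of proximity coloring: project high-dimensional data to 2D,
# then color each point by its ORIGINAL-space distance to a reference item,
# so points placed nearby despite being far in the data space stand out.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from scipy.spatial.distance import cdist

rng = np.random.default_rng(6)
X = rng.standard_normal((300, 50))              # assumed high-dimensional data
ref = 0                                         # assumed reference item

proj = MDS(n_components=2, random_state=0).fit_transform(X)
orig_dist = cdist(X, X[ref:ref + 1]).ravel()    # proximities in the data space

sc = plt.scatter(proj[:, 0], proj[:, 1], c=orig_dist, cmap="viridis", s=15)
plt.scatter(*proj[ref], color="red", marker="*", s=200, label="reference")
plt.colorbar(sc, label="original-space distance to reference")
plt.legend()
plt.title("Projection colored by original proximities (artifact check)")
plt.show()
```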
50

THE FAMILY OF CONDITIONAL PENALIZED METHODS WITH THEIR APPLICATION IN SUFFICIENT VARIABLE SELECTION

Xie, Jin 01 January 2018 (has links)
When scientists know in advance that some features (variables) are important for modeling the data, those important features should be kept in the model. How can we utilize this prior information to effectively find other important features? This dissertation provides a solution that uses such prior information. We propose the Conditional Adaptive Lasso (CAL) estimates to exploit this knowledge. By choosing a meaningful conditioning set, namely the prior information, CAL shows better performance in both variable selection and model estimation. We also propose the Sufficient Conditional Adaptive Lasso Variable Screening (SCAL-VS) and Conditioning Set Sufficient Conditional Adaptive Lasso Variable Screening (CS-SCAL-VS) algorithms based on CAL. The asymptotic and oracle properties are proved. Simulations, especially for large-p, small-n problems, are performed with comparisons to other existing methods. We further extend the linear model setup to generalized linear models (GLMs). Instead of least squares, we consider the likelihood function with an L1 penalty, that is, penalized likelihood methods. We propose the Generalized Conditional Adaptive Lasso (GCAL) for generalized linear models. We then further extend the method to any penalty term that satisfies certain regularity conditions, namely the Conditionally Penalized Estimate (CPE). Asymptotic and oracle properties are shown. Four corresponding sufficient variable screening algorithms are proposed. Simulation examples are evaluated for our method with comparisons to existing methods. GCAL is also evaluated on a real leukemia data set.
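CAL's exact formulation is not given in the abstract; the sketch below only illustrates the general adaptive-lasso idea it builds on, with data-driven penalty weights from an initial fit and a known-important conditioning set left effectively unpenalized via near-zero weights, using the standard column-rescaling trick with scikit-learn's lasso. The ridge initializer, the weight choices, and all names are illustrative assumptions, not the dissertation's estimator.

```python
# Adaptive-lasso-style fit with a conditioning set of known-important features.
# Trick: min ||y - Xb||^2 + lam * sum_j w_j |b_j| is equivalent to an ordinary
# lasso on the rescaled design X_j / w_j, with b_j recovered as c_j / w_j.
import numpy as np
from sklearn.linear_model import Ridge, LassoCV

rng = np.random.default_rng(7)
n, p = 100, 200                                  # assumed sizes (p > n)
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[[0, 1, 2, 3]] = [3, 2, 1.5, 1]
y = X @ beta + rng.standard_normal(n)

conditioning_set = [0, 1]                        # features known a priori to matter

# Adaptive weights from an initial ridge fit: small |coef| -> heavy penalty.
init = Ridge(alpha=1.0).fit(X, y)
w = 1.0 / (np.abs(init.coef_) + 1e-6)
w[conditioning_set] = 1e-4                       # ~no penalty on the conditioning set

X_scaled = X / w                                 # rescale columns by the weights
lasso = LassoCV(cv=5).fit(X_scaled, y)
beta_hat = lasso.coef_ / w                       # map back to the original scale

selected = np.where(np.abs(beta_hat) > 1e-8)[0]
print("selected features:", selected)
```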
