1 |
Model selection and estimation in high dimensional settings
Ngueyep Tzoumpe, Rodrigue 08 June 2015
Several statistical problems can be described as estimation problems, where the goal is to learn a set of parameters from data by maximizing a criterion. Problems of this type are typically encountered in supervised learning, where we want to relate one or more outputs to multiple inputs. The relationship between outputs and inputs can be complex, and this complexity can be attributed to three sources: the high dimensionality of the space containing the inputs and the outputs; structural prior knowledge about the inputs or the outputs that, if ignored, may lead to inefficient parameter estimates; and a non-trivial noise structure in the data. In this thesis we propose new statistical methods for model selection and estimation when there are more predictors than observations. We also design a new set of algorithms to fit the proposed statistical models efficiently. We apply the implemented methods to genetic data sets of cancer patients and to economic data.
|
2 |
Learning with high-dimensional noisy data
Chen, Yudong 25 September 2013
Learning an unknown parameter from data is a problem of fundamental importance across many fields of engineering and science. Rapid development in information technology allows a large amount of data to be collected. The data is often highly non-uniform and noisy, sometimes subject to gross errors and even direct manipulation. The data explosion also highlights the importance of the so-called high-dimensional regime, where the number of variables might exceed the number of samples. Extracting useful information from the data requires high-dimensional learning algorithms that are robust to noise. However, standard algorithms for the high-dimensional regime are often brittle to noise, and the suite of techniques developed in Robust Statistics is often inapplicable to large and high-dimensional data. In this thesis, we study the problem of robust statistical learning in high dimensions from noisy data. Our goal is to better understand the behavior and effect of noise in high-dimensional problems, and to develop algorithms that are statistically efficient, computationally tractable, and robust to various types of noise. We forge into this territory by considering three important sub-problems. We first look at the problem of recovering a sparse vector from a few linear measurements, where both the response vector and the covariate matrix are subject to noise. Both stochastic and arbitrary noise are considered. We show that standard approaches are inadequate in these settings. We then develop robust, efficient algorithms that provably recover the support and values of the sparse vector under different noise models and require minimal knowledge of the nature of the noise. Next, we study the problem of recovering a low-rank matrix from partially observed entries, with some of the observations arbitrarily corrupted.
We consider the entry-wise corruption setting where no row or column has too many entries corrupted, and provide performance guarantees for a natural convex relaxation approach. Our unified guarantees cover both randomly and deterministically located corruptions, and improve upon existing results. We then turn to the column-wise corruption case where all observations from some columns are arbitrarily contaminated. We propose a new convex optimization approach and show that it simultaneously identifies the corrupted columns and recovers unobserved entries in the uncorrupted columns. Lastly, we consider the graph clustering problem, i.e., arranging the nodes of a graph into clusters such that there are relatively dense connections inside the clusters and sparse connections across different clusters. We propose a semi-random Generalized Stochastic Blockmodel for clustered graphs and develop a new algorithm based on convexified maximum likelihood estimators. We provide theoretical performance guarantees which recover, and sometimes improve on, all existing results for the classical stochastic blockmodel, the planted k-clique model and the planted coloring models. We extend our algorithm to the case where the clusters are allowed to overlap with each other, and provide a theoretical characterization of the performance of the algorithm. A further extension is studied for the case where the graph may change over time. We develop new approaches to incorporate the time dynamics and show that they can identify stable overlapping communities in real-world time-evolving graphs.
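To make the graph clustering setting concrete, the following is a minimal sketch of sampling from the classical stochastic blockmodel, in which edges are dense inside clusters and sparse across them. It is only an illustration of the data model, not the thesis's semi-random generalized model or its convex algorithm; the function names `sample_sbm` and `edge_density` and their parameters are hypothetical.

```python
import random


def sample_sbm(sizes, p_in, q_out, seed=0):
    """Sample an undirected graph from a classical stochastic blockmodel.

    sizes  -- list of cluster sizes (hypothetical parameter name)
    p_in   -- edge probability for two nodes in the same cluster
    q_out  -- edge probability for two nodes in different clusters
    Returns (labels, adjacency) with a symmetric 0/1 adjacency matrix.
    """
    rng = random.Random(seed)
    labels = [k for k, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            prob = p_in if labels[i] == labels[j] else q_out
            if rng.random() < prob:
                adj[i][j] = adj[j][i] = 1
    return labels, adj


def edge_density(labels, adj, same_cluster):
    """Fraction of node pairs joined by an edge, within or across clusters."""
    edges = pairs = 0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if (labels[i] == labels[j]) == same_cluster:
                pairs += 1
                edges += adj[i][j]
    return edges / pairs if pairs else 0.0


labels, adj = sample_sbm([30, 30], p_in=0.6, q_out=0.05, seed=1)
print("within-cluster density:", edge_density(labels, adj, True))
print("across-cluster density:", edge_density(labels, adj, False))
```

A clustering algorithm receives only `adj` and tries to recover `labels`; the gap between `p_in` and `q_out` governs how hard that is.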
|
3 |
Random Subspace Analysis on Canonical Correlation of High Dimensional Data
Yamazaki, Ryo January 2016
High-dimensional, low-sample-size data have singular sample covariance matrices, rendering them impossible to analyse by regular canonical correlation (CC). By using the random subspace method (RSM), calculation of canonical correlation becomes possible, and a Monte Carlo analysis shows that the resulting maximal CC can reliably distinguish between data with true correlation (above 0.5) and data without. Statistics gathered from RSMCCA can be used to model the true population correlation by beta regression, given certain characteristics of the data set. RSMCCA applied to real biological data, however, shows that the method can be sensitive to deviation from normality and to high degrees of multicollinearity.
|
4 |
Random Matrix Theory: Selected Applications from Statistical Signal Processing and Machine Learning
Elkhalil, Khalil 06 1900
Random matrix theory is an outstanding mathematical tool that has demonstrated its usefulness in many areas ranging from wireless communication to finance and economics. The main motivation behind its use comes from the fundamental role that random matrices play in modeling unknown and unpredictable physical quantities. In many situations, meaningful metrics expressed as scalar functionals of these random matrices arise naturally. Along this line, the present work leverages tools from random matrix theory in an attempt to answer fundamental questions related to applications from statistical signal processing and machine learning. In the first part, this thesis addresses the development of analytical tools for the computation of the inverse moments of random Gram matrices with one-sided correlation. Such a question is mainly driven by applications in signal processing and wireless communications wherein such matrices naturally arise. In particular, we derive closed-form expressions for the inverse moments and show that the obtained results can help approximate several performance metrics of common estimation techniques. Then, we carry out a large dimensional study of discriminant analysis classifiers. Under mild assumptions, we show that the asymptotic classification error approaches a deterministic quantity that depends only on the means and covariances associated with each class as well as the problem dimensions. Such a result permits a better understanding of the underlying classifiers in practical large but finite dimensions, and can be used to optimize the performance. Finally, we revisit kernel ridge regression and study a centered version of it that we call centered kernel ridge regression, or CKRR for short. Relying on recent advances on the asymptotic properties of random kernel matrices, we carry out a large dimensional analysis of CKRR under the assumption that both the data dimension and the training size grow simultaneously large at the same rate.
In particular, we show that both the empirical and prediction risks converge to a limiting risk that relates the performance to the data statistics and the parameters involved. Such a result is important as it permits a better understanding of kernel ridge regression and allows its performance to be optimized efficiently.
|
5 |
A-Optimal Subsampling For Big Data General Estimating Equations
Cheung, Chung Ching 08 1900
Indiana University-Purdue University Indianapolis (IUPUI) / A significant hurdle in analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with a large sample size is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for the linear model. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and of the estimator with resampling. We also give the asymptotic representation of the bias of both estimators. We show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method called A-optimal, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on the A-optimal subsampling method. We conduct extensive simulations on large-sample, high-dimensional data to evaluate the performance of our proposed methods, using MSE as the criterion. High-dimensional data are further investigated, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE, because the bias is not negligible. We apply our proposed subsampling method to a real data set, gas sensor data with more than four million data points. In both simulations and real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.
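The general idea of subsampling with sampling probabilities can be sketched as follows. This is a deliberately simplified illustration, assuming a one-parameter linear model without intercept rather than GEE, and generic user-supplied probabilities rather than the A-optimal ones derived in the dissertation; the function name `subsample_slope` and its parameters are hypothetical.

```python
import random


def subsample_slope(x, y, probs, r, seed=0):
    """Estimate the slope of y = beta * x from r points drawn with replacement.

    probs -- sampling probabilities, one per data point (summing to 1)
    Each drawn point is weighted by 1 / probs[i] so that non-uniform
    sampling does not distort the weighted least-squares criterion.
    """
    rng = random.Random(seed)
    idx = rng.choices(range(len(x)), weights=probs, k=r)
    num = sum(x[i] * y[i] / probs[i] for i in idx)
    den = sum(x[i] * x[i] / probs[i] for i in idx)
    return num / den


# Synthetic "big" data with true slope 2.0 (values chosen for illustration).
n = 10_000
data_rng = random.Random(42)
x = [data_rng.uniform(1.0, 2.0) for _ in range(n)]
y = [2.0 * xi + data_rng.gauss(0.0, 0.1) for xi in x]

uniform = [1.0 / n] * n  # the traditional uniform subsampling baseline
estimate = subsample_slope(x, y, uniform, r=500)
print("slope estimated from 500 of", n, "points:", estimate)
```

Replacing `uniform` with data-dependent probabilities is where optimality criteria such as A-optimality enter; the inverse-probability weights keep the estimating equation targeted at the full-data solution.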
|
6 |
High-dimensional statistics: model specification and elementary estimators
Yang, Eunho 16 January 2015
Modern statistics typically deals with complex data, in particular where the ambient dimension of the problem p may be of the same order as, or even substantially larger than, the sample size n. It is now well understood that even in this type of high-dimensional scaling, statistically consistent estimators can be achieved provided one imposes structural constraints on the statistical models. In spite of great success over the last few decades, we are still experiencing bottlenecks of two distinct kinds: (I) in multivariate modeling, the data modeling assumptions are typically limited to instances such as Gaussian or Ising models, and hence the handling of varied types of random variables is still restricted, and (II) in terms of computation, the learning or estimation process is not efficient, especially when p is extremely large, since in the current paradigm for high-dimensional statistics, regularization terms induce non-differentiable optimization problems, which do not have closed-form solutions in general. The thesis addresses these two distinct but highly complementary problems: (I) statistical model specification beyond the standard Gaussian or Ising models for data of varied types, and (II) computationally efficient elementary estimators for high-dimensional statistical models.
|
7 |
Scalable sparse machine learning methods for big data
Zeng, Yaohui 15 December 2017
Sparse machine learning models have become increasingly popular in analyzing high-dimensional data. With the evolving era of Big Data, ultrahigh-dimensional, large-scale data sets are constantly collected in many areas such as genetics, genomics, biomedical imaging, social media analysis, and high-frequency finance. Mining valuable information efficiently from these massive data sets requires not only novel statistical models but also advanced computational techniques. This thesis focuses on the development of scalable sparse machine learning methods to facilitate Big Data analytics.
Built upon the feature screening technique, the first part of this thesis proposes a family of hybrid safe-strong rules (HSSR) that incorporate safe screening rules into the sequential strong rule to remove unnecessary computational burden when solving lasso-type models. We present two instances of HSSR, namely SSR-Dome and SSR-BEDPP, for the standard lasso problem. We further extend SSR-BEDPP to the elastic net and group lasso problems to demonstrate the generalizability of the hybrid screening idea. In the second part, we design and implement an R package called biglasso to extend lasso model fitting to Big Data in R. Our package biglasso utilizes memory-mapped files to store the massive data on disk, only reading data into memory when necessary during model fitting, and is thus able to handle data-larger-than-RAM cases seamlessly. Moreover, it is built upon our redesigned algorithm that incorporates the proposed HSSR screening, making it much more memory- and computation-efficient than existing R packages. Extensive numerical experiments with synthetic and real data sets are conducted in both parts to show the effectiveness of the proposed methods.
In the third part, we consider a novel statistical model, namely the overlapping group logistic regression model, that allows for selecting important groups of features that are associated with binary outcomes in the setting where the features belong to overlapping groups. We conduct systematic simulations and real-data studies to show its advantages in the application of genetic pathway selection. We implement an R package called grpregOverlap that has HSSR screening built in for fitting overlapping group lasso models.
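As background for the screening idea in the first part, here is a minimal sketch of the basic (non-sequential) strong rule for the lasso. It is not the HSSR family proposed in the thesis and not code from biglasso; it assumes the objective 0.5 * ||y - X b||^2 + lam * ||b||_1, and the function name `basic_strong_rule` is hypothetical.

```python
def basic_strong_rule(columns, y, lam):
    """Return indices of predictors kept by the basic strong rule.

    columns -- list of predictor columns, each a list of n numbers
    y       -- response vector of length n
    lam     -- lasso penalty for 0.5 * ||y - X b||^2 + lam * ||b||_1
    """
    scores = [abs(sum(xi * yi for xi, yi in zip(col, y))) for col in columns]
    lam_max = max(scores)  # smallest penalty at which every coefficient is zero
    threshold = 2.0 * lam - lam_max
    # The rule is a heuristic: discarded predictors should be re-checked
    # against the KKT conditions after fitting.
    return [j for j, s in enumerate(scores) if s >= threshold]


# Toy orthonormal design, so |x_j^T y| is easy to read off.
columns = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y = [3.0, 1.0, 0.2]
print(basic_strong_rule(columns, y, lam=2.5))
print(basic_strong_rule(columns, y, lam=1.8))
```

Only the kept predictors need to enter the solver, which is where the computational saving comes from. Because a strong rule can occasionally discard an active predictor, it is paired with a KKT check; safe rules never make that mistake but typically discard fewer predictors, which motivates combining the two.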
|
8 |
Application de la théorie des matrices aléatoires pour les statistiques en grande dimension / Application of Random Matrix Theory to High Dimensional Statistics
Bun, Joël 06 September 2016
Nowadays, it is easy to obtain large amounts of quantitative or qualitative data in many different fields. This access to new data has brought new challenges for data processing, and there are now many numerical tools for exploiting very large databases. From a theoretical standpoint, this framework calls for new or refined results to deal with this amount of data. Indeed, it appears that most results of classical multivariate statistics become inaccurate in this era of "Big Data". The aim of this thesis is twofold. The first aim is to understand theoretically the so-called curse of dimensionality, which describes phenomena that arise in high-dimensional spaces. The second is to see how we can use these tools to extract signals that are consistent with the dimension of the problem. We study the statistics of the eigenvalues and especially the eigenvectors of large symmetric matrices. We highlight that some universal properties of these eigenvectors can be extracted, and that these properties help us construct estimators that are optimal, observable and consistent with the high-dimensional framework.
|
9 |
Inferential GANs and Deep Feature Selection with Applications
Yao Chen (8892395) 15 June 2020
Deep neural networks (DNNs) have become popular due to their predictive power and flexibility in model fitting. In unsupervised learning, variational autoencoders (VAEs) and generative adversarial networks (GANs) are the two most popular and successful generative models. How to provide a unifying framework combining the best of VAEs and GANs in a principled way is a challenging task. In supervised learning, the demand for high-dimensional data analysis has grown significantly, especially in applications to social networking, bioinformatics, and neuroscience. How to simultaneously approximate the true underlying nonlinear system and identify relevant features based on high-dimensional data (typically with the sample size smaller than the dimension, a.k.a. small-n-large-p) is another challenging task.
In this dissertation, we provide satisfactory answers to these two challenges. In addition, we illustrate some promising applications of modern machine learning methods.
In the first chapter, we introduce a novel inferential Wasserstein GAN (iWGAN) model, which is a principled framework to fuse auto-encoders and WGANs. GANs have been impactful on many problems and applications but suffer from unstable training. The Wasserstein GAN (WGAN) leverages the Wasserstein distance to avoid the caveats in the min-max two-player training of GANs but has other defects, such as mode collapse and the lack of a metric to detect convergence. The iWGAN model jointly learns an encoder network and a generator network motivated by the iterative primal-dual optimization process. The encoder network maps the observed samples to the latent space and the generator network maps the samples from the latent space to the data space. We establish the generalization error bound of iWGANs to theoretically justify their performance. We further provide a rigorous probabilistic interpretation of our model under the framework of maximum likelihood estimation. The iWGAN, with a clear stopping criterion, has many advantages over other autoencoder GANs. The empirical experiments show that the iWGAN greatly mitigates the symptom of mode collapse, speeds up convergence, and is able to provide a quality-check measurement for each individual sample. We illustrate the ability of iWGANs by obtaining stable performance that is competitive with the state of the art on benchmark datasets.
In the second chapter, we present a general framework for high-dimensional nonlinear variable selection using deep neural networks under the framework of supervised learning. The network architecture includes both a selection layer and approximation layers. The problem can be cast as a sparsity-constrained optimization with a sparse parameter in the selection layer and other parameters in the approximation layers. This problem is challenging due to the sparsity constraint and the nonconvex optimization. We propose a novel algorithm, called Deep Feature Selection, to estimate both the sparse parameter and the other parameters. Theoretically, we establish the convergence of the algorithm and the selection consistency when the objective function has a Generalized Stable Restricted Hessian. This result provides theoretical justification for our method and generalizes known results for high-dimensional linear variable selection. Simulations and real data analysis are conducted to demonstrate the superior performance of our method.
In the third chapter, we develop a novel methodology to classify electrocardiograms (ECGs) as normal, atrial fibrillation or other cardiac dysrhythmias, as defined by the PhysioNet Challenge 2017. More specifically, we use piecewise linear splines for the feature selection and a gradient boosting algorithm for the classifier. In the algorithm, the ECG waveform is fitted by a piecewise linear spline, and morphological features related to the piecewise linear spline coefficients are extracted. XGBoost is used to classify the morphological coefficients and heart rate variability features. The performance of the algorithm was evaluated on the PhysioNet Challenge database (3658 ECGs classified by experts). Our algorithm achieves an average F1 score of 81% under 10-fold cross validation and also achieves an F1 score of 81% on the independent testing set. This score is comparable to the ninth-best score (81%) in the official phase of the PhysioNet Challenge 2017.
In the fourth chapter, we introduce a novel region-selection penalty in the framework of image-on-scalar regression to impose sparsity of pixel values and extract active regions simultaneously. This method helps identify regions of interest (ROI) associated with certain diseases, which has a great impact on public health. Our penalty combines the Smoothly Clipped Absolute Deviation (SCAD) regularization, enforcing sparsity, and the SCAD of total variation (TV) regularization, enforcing spatial contiguity, into one group, which segments contiguous spatial regions against a zero-valued background. An efficient algorithm is based on the alternating direction method of multipliers (ADMM), which decomposes the non-convex problem into two iterative optimization problems with explicit solutions. Another virtue of the proposed method is that a divide-and-conquer learning algorithm is developed, thereby allowing scaling to large images. Several examples are presented and the experimental results are compared with other state-of-the-art approaches.
|
10 |
Some statistical results in high-dimensional dependence modeling / Contributions à l'analyse statistique des modèles de dépendance en grande dimension
Derumigny, Alexis 15 May 2019
This thesis can be divided into three parts. In the first part, we study adaptivity to the noise level in the high-dimensional linear regression framework. We prove that two square-root estimators attain the minimax rates of estimation and prediction. We show that a corresponding median-of-means version can still attain the same optimal rates while being robust to outliers in the data. The second part is devoted to the analysis of several conditional dependence models. We propose some tests of the simplifying assumption that a conditional copula is constant with respect to its conditioning event, and prove the consistency of a semiparametric bootstrap scheme. If the conditional copula is not constant with respect to the conditioning event, then it can be modelled using the corresponding conditional Kendall's tau. We study the estimation of this conditional dependence parameter using three different approaches: kernel techniques, regression-type models and classification algorithms. The last part groups two different topics in inference. We review and propose estimators for regular conditional functionals using U-statistics. Finally, we study the construction and the theoretical properties of confidence intervals for ratios of means under different sets of assumptions and paradigms.
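The median-of-means principle mentioned in the first part can be illustrated on the simplest possible task, estimating a mean in the presence of an outlier. This sketch shows only the basic principle, not the thesis's median-of-means version of the square-root regression estimators; the function name `median_of_means` and the block-splitting scheme are assumptions made for illustration.

```python
import random
import statistics


def median_of_means(values, n_blocks, seed=0):
    """Robust mean estimate: shuffle, split into blocks, take the median of block means."""
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and len(values)")
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    # Block k takes every n_blocks-th element starting at k,
    # so block sizes differ by at most one.
    blocks = [shuffled[k::n_blocks] for k in range(n_blocks)]
    return statistics.median(statistics.mean(block) for block in blocks)


# 99 clean observations equal to 1.0 plus one gross outlier.
sample = [1.0] * 99 + [1000.0]
print("plain mean:     ", statistics.mean(sample))
print("median of means:", median_of_means(sample, n_blocks=5))
```

The single outlier can contaminate only one of the five blocks, so the median of the block means ignores it, whereas the plain mean is pulled far from 1.0. More blocks tolerate more outliers at the cost of noisier block means.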
|