81 |
Inferential GANs and Deep Feature Selection with Applications. Yao Chen (8892395). 15 June 2020 (has links)
Deep neural networks (DNNs) have become popular due to their predictive power and flexibility in model fitting. In unsupervised learning, variational autoencoders (VAEs) and generative adversarial networks (GANs) are the two most popular and successful generative models, and providing a unifying framework that combines the best of both in a principled way is a challenging task. In supervised learning, the demand for high-dimensional data analysis has grown significantly, especially in applications to social networking, bioinformatics, and neuroscience. Simultaneously approximating the true underlying nonlinear system and identifying relevant features from high-dimensional data (typically with the sample size smaller than the dimension, a.k.a. small-n-large-p) is another challenging task.

In this dissertation, we provide satisfactory answers to these two challenges. In addition, we illustrate some promising applications of modern machine learning methods.

In the first chapter, we introduce a novel inferential Wasserstein GAN (iWGAN) model, a principled framework that fuses auto-encoders and WGANs. GANs have been impactful on many problems and applications but suffer from unstable training. The Wasserstein GAN (WGAN) leverages the Wasserstein distance to avoid the caveats of the min-max two-player training of GANs, but it has other defects, such as mode collapse and the lack of a metric to detect convergence. The iWGAN model jointly learns an encoder network and a generator network, motivated by the iterative primal-dual optimization process: the encoder maps observed samples to the latent space, and the generator maps samples from the latent space to the data space. We establish a generalization error bound for iWGANs to theoretically justify their performance, and we provide a rigorous probabilistic interpretation of the model under the framework of maximum likelihood estimation. The iWGAN, with a clear stopping criterion, has many advantages over other autoencoder GANs. Empirical experiments show that the iWGAN greatly mitigates mode collapse, speeds up convergence, and provides a quality-check measurement for each individual sample. We illustrate the ability of iWGANs by obtaining competitive and stable performance relative to the state of the art on benchmark datasets.

In the second chapter, we present a general framework for high-dimensional nonlinear variable selection using deep neural networks in the supervised learning setting. The network architecture includes both a selection layer and approximation layers, and the problem can be cast as a sparsity-constrained optimization with a sparse parameter in the selection layer and other parameters in the approximation layers. This problem is challenging because of the sparsity constraint and the nonconvex optimization. We propose a novel algorithm, called Deep Feature Selection, to estimate both the sparse parameter and the other parameters. Theoretically, we establish algorithm convergence and selection consistency when the objective function has a Generalized Stable Restricted Hessian. This result provides theoretical justification for our method and generalizes known results for high-dimensional linear variable selection. Simulations and real data analysis demonstrate the superior performance of our method.

In the third chapter, we develop a novel methodology to classify electrocardiograms (ECGs) into normal, atrial fibrillation, and other cardiac dysrhythmias, as defined by the PhysioNet Challenge 2017. More specifically, we use piecewise linear splines for the feature selection and a gradient boosting algorithm for the classifier. In the algorithm, the ECG waveform is fitted by a piecewise linear spline, and morphological features related to the spline coefficients are extracted. XGBoost is used to classify the morphological coefficients together with heart rate variability features. The performance of the algorithm was evaluated on the PhysioNet Challenge database (3,658 ECGs classified by experts). Our algorithm achieves an average F1 score of 81% under 10-fold cross-validation and also achieved an F1 score of 81% on the independent test set, similar to the ninth-best score (81%) in the official phase of the PhysioNet Challenge 2017.

In the fourth chapter, we introduce a novel region-selection penalty in the framework of image-on-scalar regression to impose sparsity on pixel values and extract active regions simultaneously. This method helps identify regions of interest (ROIs) associated with a disease, which has a great impact on public health. Our penalty combines the Smoothly Clipped Absolute Deviation (SCAD) regularization, which enforces sparsity, with the SCAD of total variation (TV) regularization, which enforces spatial contiguity, into one group penalty that segments contiguous spatial regions against a zero-valued background. An efficient algorithm is based on the alternating direction method of multipliers (ADMM), which decomposes the nonconvex problem into two iterative optimization problems with explicit solutions. Another virtue of the proposed method is a divide-and-conquer learning algorithm that allows scaling to large images. Several examples are presented, and the experimental results are compared with other state-of-the-art approaches.
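To make the selection-layer idea of the second chapter concrete, here is a minimal sketch of a network with a sparse selection (gating) layer in front of approximation layers, trained with a hard-thresholding projection after each gradient step. The architecture, names, and thresholding rule are illustrative assumptions, not the dissertation's actual Deep Feature Selection algorithm.

```python
import torch
import torch.nn as nn

class SelectionNet(nn.Module):
    """Toy network: a sparse selection layer followed by approximation layers."""
    def __init__(self, p, hidden=32):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(p))      # selection layer: one weight per input feature
        self.approx = nn.Sequential(                 # approximation layers
            nn.Linear(p, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.approx(x * self.gate)            # element-wise gating of the inputs

def project_sparse(gate, s):
    """Keep the s largest-magnitude gates and zero out the rest (hard-thresholding projection)."""
    with torch.no_grad():
        idx = torch.topk(gate.abs(), s).indices
        mask = torch.zeros_like(gate)
        mask[idx] = 1.0
        gate.mul_(mask)

# small-n-large-p toy data: only the first three of 100 features matter
torch.manual_seed(0)
n, p, s = 50, 100, 3
X = torch.randn(n, p)
y = torch.sin(X[:, 0]) + X[:, 1] * X[:, 2] + 0.1 * torch.randn(n)

model = SelectionNet(p)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X).squeeze(-1), y)
    loss.backward()
    opt.step()
    project_sparse(model.gate, s)                    # enforce the sparsity constraint at every step

print("selected features:", torch.nonzero(model.gate).squeeze(-1).tolist())
```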
|
82 |
Building nonlinear data models with self-organizing maps. Der, Ralf; Balzuweit, Gerd; Herrmann, Michael. 10 December 2018 (has links)
We study the extraction of nonlinear data models in high-dimensional spaces with modified self-organizing maps. Our algorithm maps a lower-dimensional lattice into a high-dimensional space without topology violations by tuning the neighborhood widths locally. The approach is based on a new principle that exploits the specific dynamical properties of the first-order phase transition induced by the noise of the data. The performance of the algorithm is demonstrated for one- and two-dimensional principal manifolds and for sparse data sets.
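A minimal sketch of a one-dimensional self-organizing map with per-node neighborhood widths is given below. The data, the update rule, and in particular the width-adaptation step are simplified illustrative assumptions; the authors' algorithm tunes the widths according to a different, phase-transition-based principle.

```python
import numpy as np

rng = np.random.default_rng(0)

# noisy data around a curve in 3-D (a one-dimensional principal manifold)
t = rng.uniform(0, 2 * np.pi, 1000)
data = np.c_[np.cos(t), np.sin(t), 0.3 * t] + 0.05 * rng.normal(size=(1000, 3))

n_nodes = 20
w = rng.normal(scale=0.1, size=(n_nodes, 3))       # codebook vectors of a 1-D lattice
sigma = np.full(n_nodes, 3.0)                       # per-node neighborhood widths

eps, eps_sigma = 0.05, 0.01
for it in range(20000):
    x = data[rng.integers(len(data))]
    win = np.argmin(np.sum((w - x) ** 2, axis=1))   # best-matching node
    # lattice distance from the winner; neighborhood function uses the winner's local width
    d = np.abs(np.arange(n_nodes) - win)
    h = np.exp(-d ** 2 / (2 * sigma[win] ** 2))
    w += eps * h[:, None] * (x - w)                 # standard Kohonen update
    # slowly shrink the winner's width toward a small value (illustrative width tuning only)
    sigma[win] += eps_sigma * (0.5 - sigma[win])

print("learned codebook shape:", w.shape)
```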
|
83 |
High-Dimensional Analysis of Regularized Convex Optimization Problems with Application to Massive MIMO Wireless Communication Systems. Alrashdi, Ayed. 03 1900 (has links)
In the past couple of decades, the amount of available data has dramatically increased. Thus, in modern large-scale inference problems, the dimension of the signal to be estimated is comparable to or even larger than the number of available observations. Yet the desired properties of the signal typically lie in some low-dimensional structure, such as sparsity, low-rankness, or a finite alphabet. Recently, non-smooth regularized convex optimization has risen as a powerful tool for the recovery of such structured signals from noisy linear measurements in an assortment of applications in signal processing, wireless communications, machine learning, computer vision, and beyond. With the advent of Compressed Sensing (CS), a large number of theoretical results have considered the estimation performance of non-smooth convex optimization in such a high-dimensional setting.

In this thesis, we focus on precisely analyzing the high-dimensional error performance of such regularized convex optimization problems in the presence of impairments (such as uncertainties) in the measurement matrix, which has independent Gaussian entries. The precise nature of our analysis allows performance comparison between different types of these estimators and enables us to optimally tune the involved hyper-parameters. In particular, we study the performance of some of the most popular estimators in linear inverse problems, such as the LASSO, Elastic Net, Least Squares (LS), Regularized Least Squares (RLS), and their box-constrained variants. In each context, we define appropriate performance measures and analyze them sharply in the high-dimensional statistical regime. We use our results in a concrete application: designing efficient decoders for modern massive multi-input multi-output (MIMO) wireless communication systems and optimally allocating their power.

The framework used for the analysis is based on Gaussian process methods, in particular a recently developed strong and tight version of the classical Gordon Comparison Inequality called the Convex Gaussian Min-max Theorem (CGMT). We also use some results from Random Matrix Theory (RMT) in our analysis.
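As a toy illustration of the kind of estimator whose error the thesis characterizes, the following sketch solves the LASSO on noisy linear measurements taken with an i.i.d. Gaussian matrix and sweeps the regularization parameter. The simulation setup and parameter values are assumptions chosen for illustration; the thesis derives precise asymptotic error expressions rather than estimating them empirically.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# high-dimensional setting: m noisy linear measurements of a k-sparse signal of dimension n
n, m, k = 400, 200, 15
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n)) / np.sqrt(m)          # i.i.d. Gaussian measurement matrix
y = A @ x0 + 0.05 * rng.normal(size=m)

# sweep the regularization parameter and record the squared estimation error
for alpha in [1e-4, 1e-3, 1e-2, 1e-1]:
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=50000)
    lasso.fit(A, y)
    err = np.mean((lasso.coef_ - x0) ** 2)
    print(f"alpha={alpha:.0e}  mean squared estimation error={err:.5f}")
```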
|
84 |
Efficient Uncertainty Quantification with High Dimensionality. Jianhua Yin (12456819). 25 April 2022 (has links)
Uncertainty exists everywhere in scientific and engineering applications. To avoid potential risk, it is critical to understand the impact of uncertainty on a system by performing uncertainty quantification (UQ) and reliability analysis (RA). However, the computational cost of current UQ methods may be unaffordable when the input is high-dimensional. Moreover, current UQ methods are not applicable when numerical data and image data coexist.

To decrease the computational cost to an affordable level and to enable UQ with special high-dimensional data (e.g., images), this dissertation develops three UQ methodologies for high-dimensional input spaces. The first two methods focus on high-dimensional numerical input. The core strategy of Methodology 1 is to fix the unimportant variables at their first-step most probable point (MPP) so that the dimensionality is reduced; an accurate RA method is then used in the reduced space, and the final reliability is obtained by accounting for the contributions of both important and unimportant variables. Methodology 2 addresses the situation where the dimensionality cannot be reduced because most variables are important or because the variables contribute to the system almost equally. It develops an efficient surrogate modeling method for high-dimensional UQ using Generalized Sliced Inverse Regression (GSIR), Gaussian Process (GP)-based active learning, and importance sampling: a cost-efficient GP model is built in the latent space obtained by GSIR, and the failure boundary is identified through active learning that iteratively adds optimal training points. In Methodology 3, a Convolutional Neural Network (CNN)-based surrogate model (CNN-GP) is constructed to deal with mixed numerical and image data. The numerical data are first converted into images, which are then merged with the existing image data, and the merged images are fed to the CNN for training. The latent variables of the CNN model are then used to integrate the CNN with a GP in order to quantify the model error as epistemic uncertainty. Both epistemic and aleatory uncertainty are considered in uncertainty propagation.

The simulation results indicate that the first two methodologies not only improve efficiency but also maintain adequate accuracy for problems with high-dimensional numerical input. GSIR with active learning can handle situations where the dimensionality cannot be reduced because most variables are important or their contributions are comparable, and the two methodologies can be combined into a two-stage dimension reduction for high-dimensional numerical input. The third method, CNN-GP, is capable of dealing with the special case of mixed numerical and image input, achieving satisfactory regression accuracy while providing an estimate of the model error. Uncertainty propagation that considers both epistemic and aleatory uncertainty provides better accuracy. The proposed methods could potentially be applied to engineering design and decision making.
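The sketch below illustrates only the GP-based active-learning ingredient of Methodology 2 on a toy two-dimensional limit state, using a common U-type acquisition (pick the candidate whose predicted mean is closest to the failure boundary relative to its predictive uncertainty). The performance function, acquisition rule, and sample sizes are illustrative assumptions; the actual methodology also involves GSIR dimension reduction and importance sampling.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)

def limit_state(x):
    """Toy performance function: failure occurs when g(x) < 0."""
    return 3.0 - x[:, 0] ** 2 - 0.5 * x[:, 1]

# candidate pool drawn from the input distribution, plus a small initial design
pool = rng.normal(size=(5000, 2))
X = rng.normal(size=(10, 2))
y = limit_state(X)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
for it in range(20):
    gp.fit(X, y)
    mu, sd = gp.predict(pool, return_std=True)
    u = np.abs(mu) / np.maximum(sd, 1e-12)       # small when near g = 0 and uncertain
    new = pool[np.argmin(u)]                      # add the most informative candidate
    X = np.vstack([X, new])
    y = np.append(y, limit_state(new[None, :]))

mu, _ = gp.predict(pool, return_std=True)
print("plug-in estimate of the failure probability:", np.mean(mu < 0))
```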
|
85 |
Distributed Bootstrap for Massive Data. Yang Yu (12466911). 27 April 2022 (has links)
Modern massive data, with enormous sample sizes and tremendous dimensionality, are usually stored and processed using a cluster of nodes in a master-worker architecture. A shortcoming of this architecture is that inter-node communication can be over a thousand times slower than intra-node computation, which makes communication efficiency a desirable feature when developing distributed learning algorithms. In this dissertation, we tackle this challenge and propose communication-efficient bootstrap methods for simultaneous inference in the distributed computational framework.

First, we propose two generic distributed bootstrap methods, \texttt{k-grad} and \texttt{n+k-1-grad}, which apply a multiplier bootstrap at the master node to the gradients communicated across nodes. Based on them, we develop a communication-efficient method for producing an $\ell_\infty$-norm confidence region using distributed data whose dimensionality does not exceed the local sample size. Our theory establishes the communication efficiency by providing a lower bound on the number of communication rounds $\tau_{\min}$ that warrants statistical accuracy and efficiency, and by showing that $\tau_{\min}$ increases only logarithmically with the number of workers and the dimensionality. Our simulation studies validate the theory.

Then, we extend \texttt{k-grad} and \texttt{n+k-1-grad} to the high-dimensional regime and propose a distributed bootstrap method for simultaneous inference on high-dimensional distributed data. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ increases only logarithmically with the number of workers and the intrinsic dimensionality, while being nearly invariant to the nominal dimensionality. We test our theory through extensive simulation studies and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset.
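The core building block of both methods is a multiplier bootstrap applied at the master node to gradients communicated by the workers. The sketch below illustrates that mechanism for a sup-norm statistic on a toy linear model; the exact centering, weighting, and scaling used by \texttt{k-grad} and \texttt{n+k-1-grad} differ, so this should be read only as an assumed illustration of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate k worker nodes, each holding n local observations of a linear model
k, n, d = 20, 200, 5
beta = np.zeros(d)
theta_hat = np.zeros(d)                     # current estimate held by the master

# each worker communicates only the gradient of its local least-squares loss at theta_hat
grads = []
for j in range(k):
    X = rng.normal(size=(n, d))
    y = X @ beta + rng.normal(size=n)
    grads.append(X.T @ (X @ theta_hat - y) / n)
grads = np.array(grads)                     # shape (k, d)
gbar = grads.mean(axis=0)

# multiplier bootstrap at the master: perturb the centered gradients with N(0, 1) weights
B = 2000
stats = np.empty(B)
for b in range(B):
    eps = rng.normal(size=k)
    boot = (eps[:, None] * (grads - gbar)).mean(axis=0) * np.sqrt(k)
    stats[b] = np.max(np.abs(boot))         # sup-norm statistic for simultaneous inference
c95 = np.quantile(stats, 0.95)              # simultaneous critical value
print("bootstrap 95% critical value for the max statistic:", round(c95, 4))
```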
|
86 |
Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data. Lenz, Lauren Holt. 01 December 2018 (has links)
The goal of genetic-based cancer research is often to identify which genes behave differently in cancerous and healthy tissue. This difference in behavior, referred to as differential expression, may lead researchers to more targeted preventative care and treatment. One way to measure the expression of genes is through a process called RNA-Seq, which takes physical tissue samples and maps gene products and fragments in the sample back to the genes that created them, resulting in a large read-count matrix with genes in the rows and a column for each sample. The read counts for tumor and normal samples are then compared in a process called differential expression analysis. However, normalization of these read counts is a necessary pre-processing step in order to account for differences in read-count values due to variables unrelated to expression. Recent RNA-Seq normalization methods commonly also account for gene-level covariates, namely gene length in base pairs and GC-content, the proportion of bases in the gene that are guanine or cytosine.

Here, a colorectal cancer RNA-Seq read-count data set comprising 30,220 genes and 378 samples is examined. Two of the normalization methods that account for gene length and GC-content, CQN and EDASeq, are extended to account for protein-coding status as a third gene-level covariate. The binary nature of protein-coding status leads to unique computational issues. The read counts normalized by CQN, EDASeq, and four new normalization methods are used for differential expression analysis via the nonparametric Wilcoxon rank-sum test as well as the lme4 pipeline, which produces per-gene models based on a negative binomial distribution. The resulting differential expression results are compared for two genes of interest in colorectal cancer, APC and CTNNB1, both in the WNT signaling pathway.
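For the downstream differential expression step, a minimal sketch of the per-gene Wilcoxon rank-sum (Mann-Whitney) test on already-normalized counts might look as follows. The simulated counts and effect sizes are illustrative assumptions; the actual analysis applies this test (and the lme4 negative-binomial pipeline) to CQN- and EDASeq-style normalized data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# toy normalized read-count matrix: genes in rows, tumor then normal samples in columns
n_genes, n_tumor, n_normal = 500, 30, 30
counts = rng.negative_binomial(n=5, p=0.3, size=(n_genes, n_tumor + n_normal)).astype(float)
counts[:25, :n_tumor] *= 3.0               # make the first 25 genes differentially expressed

# per-gene Wilcoxon rank-sum test comparing tumor against normal samples
pvals = np.empty(n_genes)
for g in range(n_genes):
    _, pvals[g] = mannwhitneyu(counts[g, :n_tumor], counts[g, n_tumor:],
                               alternative="two-sided")

print("genes with p < 0.05:", int(np.sum(pvals < 0.05)))
```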
|
87 |
Contributions à l'analyse de données multivoie : algorithmes et applications / Contributions to multiway analysis: algorithms and applications. Lechuga Lopez, Olga. 03 July 2017 (has links)
In this thesis we develop a framework for extending commonly used linear statistical methods (Fisher discriminant analysis, logistic regression, Cox regression, and regularized canonical correlation analysis) to the multiway context, in which each individual is described by several instances of the same variable and the data therefore have a natural tensor structure. In contrast to the standard formulations, the multiway generalizations rely on a structural constraint imposed on the weight vectors, which integrates the original tensor structure of the data into the optimization process. This constraint has a twofold benefit: it allows the influence of the variables and the influence of the modalities to be studied separately, which eases interpretation of the models, and it restricts the number of coefficients to estimate, which limits both the computational complexity and overfitting. Strategies for dealing with high-dimensional data are also discussed. The application of these algorithms is illustrated on two real datasets: (i) spectroscopy data, on which all methods were tested for discrimination, and (ii) multi-modal brain magnetic resonance imaging data, used to predict the long-term recovery of patients after traumatic brain injury. On both datasets the proposed methods yield valuable results compared with the standard approaches.
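A minimal sketch of the structural constraint for multiway regression is given below: instead of fitting a free coefficient for every (variable, modality) pair, the weight array is constrained to the outer product of a variable-weight vector and a modality-weight vector, estimated by alternating least squares. The rank-one constraint and the ALS scheme are illustrative assumptions standing in for the constrained discriminant, logistic, Cox, and regularized canonical correlation formulations developed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# each subject is described by a J x K array: J variables observed under K modalities
n, J, K = 200, 8, 5
X = rng.normal(size=(n, J, K))
w_var_true, w_mod_true = rng.normal(size=J), rng.normal(size=K)
y = np.einsum("njk,j,k->n", X, w_var_true, w_mod_true) + 0.1 * rng.normal(size=n)

# rank-one (Kronecker) constraint on the weights: W = w_var (outer) w_mod,
# estimated by alternating least squares instead of fitting all J*K coefficients freely
w_var, w_mod = np.ones(J), np.ones(K)
for it in range(50):
    Zv = np.einsum("njk,k->nj", X, w_mod)           # design matrix for the variable weights
    w_var, *_ = np.linalg.lstsq(Zv, y, rcond=None)
    Zm = np.einsum("njk,j->nk", X, w_var)           # design matrix for the modality weights
    w_mod, *_ = np.linalg.lstsq(Zm, y, rcond=None)

W_hat = np.outer(w_var, w_mod)                      # the outer product is identifiable
W_true = np.outer(w_var_true, w_mod_true)
print("relative weight error:", round(np.linalg.norm(W_hat - W_true) / np.linalg.norm(W_true), 4))
```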
|
88 |
Some statistical results in high-dimensional dependence modeling / Contributions à l'analyse statistique des modèles de dépendance en grande dimension. Derumigny, Alexis. 15 May 2019 (has links)
This thesis can be divided into three parts.

In the first part, we study adaptivity to the noise level in the high-dimensional linear regression framework. We prove that two square-root estimators attain the minimax rates of estimation and prediction, and we show that a corresponding median-of-means version can still attain the same optimal rates while being robust to the possible presence of outliers in the data.

The second part is devoted to the analysis of several conditional dependence models. We propose tests of the simplifying assumption that a conditional copula is constant with respect to its conditioning event, and we prove the consistency of a semiparametric bootstrap scheme. If the conditional copula is not constant with respect to the conditioning variable, it can be modelled through the corresponding conditional Kendall's tau. We therefore study the estimation of this conditional dependence parameter using three different approaches: kernel techniques, regression-type models, and classification algorithms.

The last part gathers two contributions to inference. We review and propose estimators of regular conditional functionals using U-statistics. Finally, we study the construction and the theoretical properties of confidence intervals for ratios of means under different sets of assumptions and paradigms.
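As an illustration of the kernel approach to the conditional Kendall's tau, the sketch below computes a kernel-weighted pairwise-concordance estimate at a few values of the conditioning variable. The Gaussian kernel, the bandwidth, and the absence of finite-sample bias corrections are simplifying assumptions, not the exact estimators studied in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# bivariate data whose dependence strength increases with a conditioning variable Z
n = 2000
Z = rng.uniform(0, 1, n)
rho = 0.9 * Z
eps = rng.normal(size=(n, 2))
X1 = eps[:, 0]
X2 = rho * eps[:, 0] + np.sqrt(1 - rho ** 2) * eps[:, 1]

def cond_kendall_tau(z0, h=0.1):
    """Kernel-weighted conditional Kendall's tau at Z = z0 (simplified, no bias correction)."""
    w = np.exp(-0.5 * ((Z - z0) / h) ** 2)          # Gaussian kernel weights
    w /= w.sum()
    s1 = np.sign(X1[:, None] - X1[None, :])         # pairwise concordance signs
    s2 = np.sign(X2[:, None] - X2[None, :])
    ww = w[:, None] * w[None, :]
    np.fill_diagonal(ww, 0.0)                       # exclude i = j pairs
    return np.sum(ww * s1 * s2) / np.sum(ww)

for z0 in [0.1, 0.5, 0.9]:
    print(f"estimated tau given Z = {z0:.1f}: {cond_kendall_tau(z0):.3f}")
```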
|
89 |
Interpretable machine learning approaches to high-dimensional data and their applications to biomedical engineering problems / 高次元データへの解釈可能な機械学習アプローチとその医用工学問題への適用. Yoshida, Kosuke. 26 March 2018 (has links)
Kyoto University / 0048 / New-system doctoral program / Doctor of Informatics / Degree No. 甲第21215号 / 情博第668号 / 新制||情||115 (University Library) / Department of Systems Science, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Shin Ishii, Professor Hidetoshi Shimodaira, Professor Manabu Kano, Kenji Doya / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
|
90 |
A Unified Exposure Prediction Approach for Multivariate Spatial Data: From Predictions to Health Analysis. Zhu, Zheng. 18 June 2019 (has links)
No description available.
|