11 |
Sparse Latent-Space Learning for High-Dimensional Data: Extensions and Applications
White, Alexander James, 05 1900
Indiana University-Purdue University Indianapolis (IUPUI) / The successful treatment and potential eradication of many complex diseases,
such as cancer, begins with elucidating the convoluted mapping of molecular profiles
to phenotypical manifestation. Our observed molecular profiles (e.g., genomics,
transcriptomics, epigenomics) are often high-dimensional and are collected from patient
samples falling into heterogeneous disease subtypes. Interpretable learning from
such data calls for sparsity-driven models. This dissertation addresses the high dimensionality,
sparsity, and heterogeneity issues when analyzing multiple-omics data,
where each method is implemented with a concomitant R package. First, we examine
challenges in submatrix identification, which aims to find subgroups of samples
that behave similarly across a subset of features. We resolve issues such as two-way
sparsity, non-orthogonality, and parameter tuning with an adaptive thresholding procedure
on the singular vectors computed via orthogonal iteration. We validate the
method with simulation analysis and apply it to an Alzheimer’s disease dataset.
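To make the idea concrete, the sketch below illustrates the two ingredients named above: leading singular vectors computed by orthogonal iteration, followed by an entrywise threshold that sparsifies them so the surviving rows and columns point to a candidate submatrix. The thresholding rule and all constants here are illustrative assumptions, not the dissertation's adaptive procedure.

```python
import numpy as np

def orthogonal_iteration(X, k, n_iter=100, seed=0):
    """Leading k left/right singular vectors of X via orthogonal (subspace) iteration."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = np.linalg.qr(rng.standard_normal((p, k)))[0]
    for _ in range(n_iter):
        U = np.linalg.qr(X @ V)[0]        # update left subspace
        V = np.linalg.qr(X.T @ U)[0]      # update right subspace
    return U, V

def threshold_vector(v, c=1.0):
    """Zero out entries below a data-driven cutoff; this cutoff rule is illustrative."""
    cut = c * np.median(np.abs(v)) / 0.6745 * np.sqrt(2 * np.log(len(v)))
    return np.where(np.abs(v) >= cut, v, 0.0)

# Toy example: a hidden 20 x 15 elevated block inside a 100 x 80 noise matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 80))
X[:20, :15] += 3.0

U, V = orthogonal_iteration(X, k=1)
rows = np.flatnonzero(threshold_vector(U[:, 0]))
cols = np.flatnonzero(threshold_vector(V[:, 0]))
print("candidate submatrix rows:", rows)
print("candidate submatrix cols:", cols)
```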
The second project focuses on modeling relationships between large, matched
datasets. Exploring regression structures between large data sets can provide insights
such as the effect of long-range epigenetic influences on gene expression. We
present a high-dimensional version of mixture multivariate regression to detect patient
clusters, each with different correlation structures of matched-omics datasets.
Results are validated via simulation and applied to matched-omics data sets.
In the third project, we introduce a novel approach to modeling spatial transcriptomics
(ST) data with a spatially penalized multinomial model of the expression
counts. This method recovers the low-rank structure of zero-inflated ST data under spatial smoothness constraints. We validate the model using manual cell structure annotations of human brain samples. We then apply this technique to additional ST datasets.
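As a rough illustration of the kind of objective such a model might use (the factorisation, neighbourhood rule, and penalty form below are assumptions made for the sketch, not the dissertation's implementation), one can combine a multinomial log-likelihood over per-spot expression counts, low-rank logits, and a penalty that pulls the logits of neighbouring spots together:

```python
import numpy as np

def penalized_multinomial_nll(counts, A, B, coords, lam=1.0, radius=1.5):
    """
    counts : (n_spots, n_genes) expression counts
    A, B   : low-rank factors; the per-spot logits are A @ B.T
    coords : (n_spots, 2) spatial locations of the spots
    Returns the negative multinomial log-likelihood plus an illustrative
    spatial smoothness penalty on the logits of neighbouring spots.
    """
    logits = A @ B.T
    log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)  # log-softmax per spot
    nll = -np.sum(counts * log_probs)

    # Spatial penalty: squared differences of logits between neighbouring spots.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.nonzero((d > 0) & (d <= radius))
    penalty = 0.5 * np.sum((logits[i] - logits[j]) ** 2)
    return nll + lam * penalty

# Toy usage with random data (shapes only; an actual fit would minimise this over A and B).
rng = np.random.default_rng(0)
n_spots, n_genes, r = 50, 30, 3
counts = rng.poisson(1.0, size=(n_spots, n_genes))
coords = rng.uniform(0, 10, size=(n_spots, 2))
A = rng.standard_normal((n_spots, r))
B = rng.standard_normal((n_genes, r))
print(penalized_multinomial_nll(counts, A, B, coords))
```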
|
12 |
Efektivní implementace metod pro redukci dimenze v mnohorozměrné statistice / Efficient implementation of dimension reduction methods for high-dimensional statistics
Pekař, Vojtěch, January 2015
The main goal of our thesis is to make the implementation of a classification method called linear discriminant analysis more efficient. It is a model of multivariate statistics which, given samples and their membership in given groups, attempts to determine the group of a new sample. We focus especially on the high-dimensional case, meaning that the number of variables is higher than the number of samples and the problem leads to a singular covariance matrix. If the number of variables is too high, it can be practically impossible to use the common methods because of the high computational cost. Therefore, we look at the topic from the perspective of numerical linear algebra and recast the resulting tasks into equivalent formulations of much lower dimension. We offer new solution approaches, provide examples of particular algorithms, and discuss their efficiency.
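The thesis derives exact equivalent low-dimensional formulations; the sketch below only shows the elementary trick this builds on: when the number of variables exceeds the number of samples, the centred data live in a subspace of dimension at most n - 1, so one can project onto that subspace (here via a thin SVD) and run ordinary two-class LDA there. The pseudo-inverse used below is a generic remedy for the remaining singularity, not the thesis's method.

```python
import numpy as np

def highdim_lda_fit(X, y):
    """Two-class LDA sketch for p >> n: project onto the span of the centred data, then fit there."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V = Vt[s > 1e-10].T                         # basis of the data span, shape (p, k)
    Z = (X - mu) @ V                            # reduced coordinates, shape (n, k)

    c0, c1 = np.unique(y)
    Z0, Z1 = Z[y == c0], Z[y == c1]
    m0, m1 = Z0.mean(axis=0), Z1.mean(axis=0)
    Sw = ((Z0 - m0).T @ (Z0 - m0) + (Z1 - m1).T @ (Z1 - m1)) / (len(y) - 2)
    w = np.linalg.pinv(Sw) @ (m1 - m0)          # pseudo-inverse handles any remaining singularity
    b = -0.5 * w @ (m0 + m1)
    return mu, V, w, b, (c0, c1)

def highdim_lda_predict(X, mu, V, w, b, classes):
    Z = (np.asarray(X, dtype=float) - mu) @ V
    return np.where(Z @ w + b > 0, classes[1], classes[0])

# Toy usage: 40 samples, 500 variables, a mean shift in the first 5 variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 500))
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 1.0
model = highdim_lda_fit(X, y)
print("training accuracy:", (highdim_lda_predict(X, *model) == y).mean())
```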
|
13 |
Bilinear Gaussian Radial Basis Function Networks for classification of repeated measurements
Sjödin Hällstrand, Andreas, January 2020
The Growth Curve Model is a bilinear statistical model which can be used to analyse several groups of repeated measurements. Normally, the Growth Curve Model is defined in such a way that the permitted sampling frequency of the repeated measurements is limited by the number of observed individuals in the data set.

In this thesis, we examine the possibilities of utilizing measurements sampled at high frequency to increase classification accuracy for real-world data. That is, we look at the case where the regular Growth Curve Model is not defined because of the relationship between the sampling frequency and the number of observed individuals. When working with this high-frequency data, we develop a new method of basis selection for the regression analysis which yields what we call a Bilinear Gaussian Radial Basis Function Network (BGRBFN), which we then compare to more conventional polynomial and trigonometric functional bases. Finally, we examine whether Tikhonov regularization can be used to further increase the classification accuracy in the high-frequency data case.

Our findings suggest that the BGRBFN performs better than the conventional methods in both classification accuracy and functional approximability. The results also suggest that both high-frequency data and Tikhonov regularization can be used to increase classification accuracy.
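A minimal sketch of the two named ingredients, a Gaussian radial basis function design over the time points and a Tikhonov (ridge) regularised fit, is given below; the centres, bandwidth, and penalty level are illustrative assumptions, and the full bilinear Growth Curve machinery of the thesis is not reproduced.

```python
import numpy as np

def gaussian_rbf_design(t, centres, bandwidth):
    """Design matrix Phi[i, j] = exp(-(t_i - c_j)^2 / (2 * bandwidth^2))."""
    t = np.asarray(t, dtype=float)[:, None]
    centres = np.asarray(centres, dtype=float)[None, :]
    return np.exp(-(t - centres) ** 2 / (2.0 * bandwidth ** 2))

def tikhonov_fit(Phi, y, lam):
    """Tikhonov / ridge regularised least squares: solve (Phi'Phi + lam I) beta = Phi'y."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)

# Toy repeated-measurements curve sampled at high frequency.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)                       # 200 time points per individual
y = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(t.size)

centres = np.linspace(0, 1, 15)                  # illustrative basis centres
Phi = gaussian_rbf_design(t, centres, bandwidth=0.08)
beta = tikhonov_fit(Phi, y, lam=1e-2)
print("residual RMS:", np.sqrt(np.mean((y - Phi @ beta) ** 2)))
```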
|
14 |
Three Essays in Functional Time Series and Factor Analysis
Nisol, Gilles, 20 December 2018
The thesis is dedicated to time series analysis for functional data and contains three original parts. In the first part, we derive statistical tests for the presence of a periodic component in a time series of functions. We consider both the traditional setting in which the periodic functional signal is contaminated by functional white noise, and a more general setting in which the contaminating process is weakly dependent. Several forms of the periodic component are considered. Our tests are motivated by the likelihood principle and fall into two broad categories, which we term multivariate and fully functional. Overall, for the functional series that motivate this research, the fully functional tests exhibit a superior balance of size and power. Asymptotic null distributions of all tests are derived and their consistency is established. Their finite-sample performance is examined and compared by numerical studies and an application to pollution data.

In the second part, we consider vector autoregressive processes (VARs) with innovations having a singular covariance matrix (in short, singular VARs). These objects appear naturally in the context of dynamic factor models. The Yule-Walker estimator of such a VAR is problematic, because the solution of the corresponding equation system tends to be numerically rather unstable. For example, if we overestimate the order of the VAR, then the singularity of the innovations renders the Yule-Walker equation system singular as well. Moreover, even with a correctly selected order, the Yule-Walker system tends to be close to singular in finite samples. We show that this has a severe impact on predictions. While the asymptotic rate of the mean square prediction error (MSPE) can be just like in the regular (non-singular) case, the finite-sample behavior suffers. This effect turns out to be particularly dramatic in the context of dynamic factor models, where we do not directly observe the so-called common components which we aim to predict. Then, when the data are sampled with some additional error, the MSPE often gets severely inflated. We explain the reason for this phenomenon and show how to overcome the problem. Our numerical results underline that it is very important to adapt prediction algorithms accordingly.

In the third part, we set up theoretical foundations and a practical method to forecast multiple functional time series (FTS). In order to do so, we generalize the static factor model to the case where the cross-section units are FTS. We first derive a representation result: we show that if the first r eigenvalues of the covariance operator of the cross-section of n FTS are unbounded as n diverges and the (r+1)th eigenvalue is bounded, then each FTS can be represented as the sum of a common component driven by r factors and an idiosyncratic component. We suggest a method of estimation and prediction for such a model and assess its performance through a simulation study. Finally, we show that by applying our method to a cross-section of volatility curves of the S&P 100 stocks, we obtain better prediction accuracy than by limiting the analysis to individual FTS.
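The following self-contained simulation illustrates the Yule-Walker instability described above (it is not the thesis's remedy): a VAR(1) with rank-one innovations, observed with a tiny additional sampling error, is deliberately overfitted as a VAR(2); the Yule-Walker matrix then becomes numerically singular, a plain inverse explodes, and a truncated pseudo-inverse stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a VAR(1) in R^3 whose innovations have rank-1 (i.e. singular) covariance.
d, n = 3, 400
A = np.array([[0.5, 0.1, 0.0],
              [0.0, 0.4, 0.1],
              [0.1, 0.0, 0.3]])
load = np.array([1.0, 0.5, -0.5])                 # innovations live on a 1-d subspace
X = np.zeros((n, d))
for t in range(1, n):
    X[t] = A @ X[t - 1] + load * rng.standard_normal()
X_obs = X + 1e-6 * rng.standard_normal(X.shape)   # tiny additional sampling error

# Deliberately overfit a VAR(2): Yule-Walker amounts to regressing X_t on (X_{t-1}, X_{t-2}).
Y = X_obs[2:]
Z = np.hstack([X_obs[1:-1], X_obs[:-2]])          # regressor matrix, shape (n-2, 2d)
S_zz = Z.T @ Z / len(Z)                           # near-singular because of the singular innovations
S_yz = Y.T @ Z / len(Z)
print("condition number of the Yule-Walker matrix:", np.linalg.cond(S_zz))

B_plain = S_yz @ np.linalg.inv(S_zz)              # plain solve: typically blows up
B_pinv = S_yz @ np.linalg.pinv(S_zz, rcond=1e-6)  # truncated pseudo-inverse: stays bounded
true_B = np.hstack([A, np.zeros((d, d))])         # the data-generating "VAR(2)" has a zero second lag
print("plain inverse,  error vs [A, 0]:", np.linalg.norm(B_plain - true_B))
print("pseudo-inverse, error vs [A, 0]:", np.linalg.norm(B_pinv - true_B))
# The pseudo-inverse need not recover [A, 0] exactly (the overfitted VAR(2) is not identified),
# but it avoids the explosion caused by inverting a numerically singular matrix.
```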
|
15 |
Dirty statistical models
Jalali, Ali, 1982-, 11 July 2012
In fields across science and engineering, we are increasingly faced with problems where the number of variables or features we need to estimate is much larger than the number of observations. Under such high-dimensional scaling, for any hope of statistically consistent estimation, it becomes vital to leverage any potential structure in the problem, such as sparsity, low-rank structure, or block sparsity. However, data may deviate significantly from any one such statistical model. The motivating question of this thesis is: can we simultaneously leverage more than one such structural model to obtain consistency in a larger number of problems, and with fewer samples, than is possible with single models? Our approach combines structures via simple linear superposition, a technique we term dirty models. The idea is very simple: while any one structure might not capture the data, a superposition of structural classes might. A dirty model thus searches for a parameter that can be decomposed into a number of simpler structures, such as (a) sparse plus block-sparse, (b) sparse plus low-rank, and (c) low-rank plus block-sparse. In this thesis, we propose dirty-model-based algorithms for different problems such as multi-task learning, graph clustering, and time-series analysis with latent factors. We analyze these algorithms in terms of the number of observations needed to estimate the variables. These algorithms are based on convex optimization and can be relatively slow. We provide a class of low-complexity greedy algorithms that not only solve these optimization problems faster but also come with guarantees on the solution. Beyond the theoretical results, in each case we provide experimental results to illustrate the power of dirty models.
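A toy sketch of the superposition idea in the sparse-plus-low-rank case is given below; the alternating soft-threshold / singular-value-threshold updates and all penalty levels are illustrative assumptions, not the estimators or guarantees developed in the thesis.

```python
import numpy as np

def soft_threshold(M, tau):
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_threshold(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def sparse_plus_lowrank(Y, lam_sparse, lam_rank, n_iter=200):
    """
    Decompose Y into S + L by alternating exact minimisation of
        0.5 * ||Y - S - L||_F^2 + lam_sparse * ||S||_1 + lam_rank * ||L||_*,
    i.e. a soft-threshold step for S and a singular-value-threshold step for L.
    """
    S = np.zeros_like(Y)
    L = np.zeros_like(Y)
    for _ in range(n_iter):
        S = soft_threshold(Y - L, lam_sparse)
        L = svd_threshold(Y - S, lam_rank)
    return S, L

# Toy data: a rank-2 matrix plus a few large sparse corruptions plus small noise.
rng = np.random.default_rng(0)
n, p, r = 60, 50, 2
L_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))
S_true = np.zeros((n, p))
idx = rng.choice(n * p, size=50, replace=False)
S_true.flat[idx] = 10.0 * rng.standard_normal(50)
Y = L_true + S_true + 0.05 * rng.standard_normal((n, p))

S_hat, L_hat = sparse_plus_lowrank(Y, lam_sparse=1.0, lam_rank=5.0)
print("rank of L_hat:", np.linalg.matrix_rank(L_hat, tol=1e-6))
print("nonzeros in S_hat:", int(np.sum(S_hat != 0)))
```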
|
16 |
Policy evaluation, high-dimension and machine learning / Évaluation des politiques publiques, grande dimension et machine learning
L'Hour, Jérémy, 13 December 2019
This dissertation comprises three essays that apply machine learning and high-dimensional statistics to causal inference. The first essay proposes a parametric alternative to the synthetic control method (Abadie and Gardeazabal, 2003; Abadie et al., 2010) that relies on a Lasso-type first step. We show that the resulting estimator is doubly robust, asymptotically Gaussian, and "immunized" against first-step selection mistakes. The second essay studies a penalized version of the synthetic control method that is especially useful in the presence of micro-economic data. The penalization parameter trades off pairwise matching discrepancies with respect to the characteristics of each unit in the synthetic control against matching discrepancies with respect to the characteristics of the synthetic control unit as a whole. We study the properties of the resulting estimator, propose data-driven choices of the penalization parameter, and discuss randomization-based inference procedures. The last essay applies the Generic Machine Learning framework (Chernozhukov et al., 2018) to study heterogeneity of treatment effects in a randomized experiment designed to compare public and private provision of job counselling. From a methodological perspective, we discuss the extension of the Generic Machine Learning framework to experiments with imperfect compliance.
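As a rough sketch of the penalised objective described above (the variable names, solver choice, and penalty level are assumptions made for illustration), the synthetic control weights can be written as the solution of a constrained problem that trades off the aggregate fit against pairwise matching discrepancies:

```python
import numpy as np
from scipy.optimize import minimize

def penalized_synthetic_control(x1, X0, lam):
    """
    Weights w >= 0 with sum(w) = 1 minimising
        ||x1 - X0 @ w||^2 + lam * sum_j w_j * ||x1 - X0[:, j]||^2,
    i.e. an aggregate fit term plus a penalty on pairwise matching discrepancies.
    Generic-solver sketch of the objective, not the essay's implementation.
    """
    n_controls = X0.shape[1]
    pairwise = np.sum((x1[:, None] - X0) ** 2, axis=0)       # ||x1 - X0_j||^2 for each control j

    def objective(w):
        resid = x1 - X0 @ w
        return resid @ resid + lam * (pairwise @ w)

    w0 = np.full(n_controls, 1.0 / n_controls)
    result = minimize(
        objective,
        w0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_controls,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x

# Toy example: one treated unit, 20 control units, 5 pre-treatment characteristics.
rng = np.random.default_rng(0)
X0 = rng.standard_normal((5, 20))
x1 = X0[:, :3] @ np.array([0.5, 0.3, 0.2]) + 0.05 * rng.standard_normal(5)
print(np.round(penalized_synthetic_control(x1, X0, lam=0.1), 3))
```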
|
17 |
Statistical Design of Sequential Decision Making Algorithms
Chi-hua Wang (12469251), 27 April 2022
Sequential decision-making is a fundamental class of problems that motivates algorithm design in online machine learning and reinforcement learning. Arguably, the resulting online algorithms have supported modern online service industries in their data-driven, real-time, automated decision making. Applications span different industries, including dynamic pricing (marketing), recommendation (advertising), and dosage finding (clinical trials). In this dissertation, we contribute fundamental statistical design advances for sequential decision-making algorithms, advancing the theory and application of online learning and sequential decision making under uncertainty, including online sparse learning, finite-armed bandits, and high-dimensional online decision making. Our work lies at the intersection of decision-making algorithm design, online statistical machine learning, and operations research, contributing new algorithms, theory, and insights to diverse fields including optimization, statistics, and machine learning.
In part I, we contribute a theoretical framework of continuous risk monitoring for regularized online statistical learning. Such a framework is desirable for modern online service industries that need to monitor the performance of deployed online machine learning models. In the first project (Chapter 1), we develop continuous risk monitoring for the online Lasso procedure and provide an always-valid algorithm for high-dimensional dynamic pricing problems. In the second project (Chapter 2), we develop continuous risk monitoring for online matrix regression and provide new algorithms for rank-constrained online matrix completion problems. These theoretical advances rest on the interplay between non-asymptotic martingale concentration theory and regularized online statistical machine learning.
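The monitoring machinery itself is not reproduced here; the sketch below only illustrates the kind of regularised online learner such a framework would track, an online Lasso updated one observation at a time by a proximal (soft-thresholding) gradient step. The step size and penalty level are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def online_lasso(stream, p, lam=0.2, step=0.01):
    """
    Online Lasso sketch: one proximal stochastic-gradient step per observation,
        beta <- soft_threshold(beta - step * (x'beta - y) * x, step * lam).
    A monitoring procedure of the kind described above would track the running
    risk of these iterates; that part is not implemented here.
    """
    beta = np.zeros(p)
    for x, y in stream:
        grad = (x @ beta - y) * x                  # gradient of the squared loss at (x, y)
        beta = soft_threshold(beta - step * grad, step * lam)
        yield beta.copy()

# Toy stream: sparse 50-dimensional linear model observed one sample at a time.
rng = np.random.default_rng(0)
p = 50
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]

def stream(n):
    for _ in range(n):
        x = rng.standard_normal(p)
        yield x, x @ beta_true + 0.1 * rng.standard_normal()

estimates = list(online_lasso(stream(2000), p))
print("nonzeros in final estimate:", int(np.sum(estimates[-1] != 0)))
print("estimation error:", np.linalg.norm(estimates[-1] - beta_true))
```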
In part II, we contribute a bootstrap-based methodology for finite-armed bandit problems, termed Residual Bootstrap exploration. Such a method opens the possibility of designing model-agnostic bandit algorithms without problem-adaptive optimism engineering or instance-specific prior tuning. In the first project (Chapter 3), we develop residual bootstrap exploration for multi-armed bandit algorithms and show that it generalizes easily to bandit problems with complex or ambiguous reward structures. In the second project (Chapter 4), we develop a theoretical framework for residual bootstrap exploration in linear bandits with a fixed action set. These methodological advances build on our development of non-asymptotic theory for the bootstrap procedure.
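The sketch below conveys the flavour of bootstrap-based exploration in a multi-armed bandit: each round, every arm's sample mean is perturbed by the mean of a bootstrap resample of its own residuals, and the arm with the largest perturbed index is pulled. This is a simplified illustration (the forced initial pulls stand in for the pseudo-residual devices an actual implementation would use), not the dissertation's exact algorithm or theory.

```python
import numpy as np

def residual_bootstrap_bandit(arm_means, n_rounds, n_init=3, seed=0):
    """
    Simplified residual-bootstrap exploration for a K-armed Gaussian bandit:
    perturb each arm's sample mean by the mean of a bootstrap resample of its
    own residuals and pull the arm with the largest perturbed index.
    """
    rng = np.random.default_rng(seed)
    K = len(arm_means)
    rewards = [[] for _ in range(K)]
    total = 0.0

    # Pull every arm a few times so each has residuals to resample
    # (a simplification; pseudo-residuals would normally guarantee exploration).
    for a in range(K):
        for _ in range(n_init):
            r = arm_means[a] + rng.standard_normal()
            rewards[a].append(r)
            total += r

    for _ in range(n_rounds - n_init * K):
        index = np.empty(K)
        for a in range(K):
            obs = np.asarray(rewards[a])
            residuals = obs - obs.mean()
            boot = rng.choice(residuals, size=len(residuals), replace=True)
            index[a] = obs.mean() + boot.mean()      # bootstrap-perturbed estimate
        a_star = int(np.argmax(index))
        r = arm_means[a_star] + rng.standard_normal()
        rewards[a_star].append(r)
        total += r

    print("average regret per round (approx.):", max(arm_means) - total / n_rounds)

residual_bootstrap_bandit(arm_means=[0.0, 0.2, 0.5, 1.0], n_rounds=2000)
```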
In part III, we contribute application-driven insights on the exploration-exploitation dilemma in high-dimensional online decision-making problems. Such insights help practitioners implement effective high-dimensional statistical methods for online decision-making problems. In the first project (Chapter 5), we develop a bandit sampling scheme for online batch high-dimensional decision making, a practical scenario in interactive marketing and sequential clinical trials. In the second project (Chapter 6), we develop a bandit sampling scheme for federated online high-dimensional decision making to maintain data decentralization and support collaborative decisions. These insights stem from new bandit sampling designs that address application-driven exploration-exploitation trade-offs effectively.
|