191

Policy evaluation, high-dimension and machine learning / Évaluation des politiques publiques, grande dimension et machine learning

L'Hour, Jérémy 13 December 2019
This dissertation comprises three essays that apply machine learning and high-dimensional statistics to causal inference. The first essay proposes a parametric alternative to the synthetic control method (Abadie and Gardeazabal, 2003; Abadie et al., 2010) that relies on a Lasso-type first step. We show that the resulting estimator is doubly robust, asymptotically Gaussian and "immunized" against first-step selection mistakes. The second essay studies a penalized version of the synthetic control method that is especially useful with micro-economic data. The penalization parameter trades off pairwise matching discrepancies with respect to the characteristics of each unit in the synthetic control against matching discrepancies with respect to the characteristics of the synthetic control unit as a whole. We study the properties of the resulting estimator, propose data-driven choices of the penalization parameter and discuss randomization-based inference procedures. The last essay applies the Generic Machine Learning framework (Chernozhukov et al., 2018) to study treatment-effect heterogeneity in a randomized experiment designed to compare public and private provision of job counselling. From a methodological perspective, we discuss the extension of the Generic Machine Learning framework to experiments with imperfect compliance.
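To make the second essay's penalization concrete: the penalized synthetic control chooses simplex weights that trade off aggregate fit against pairwise matching discrepancies. Below is a minimal numerical sketch of that program; the function name, the toy data and the choice of lambda are illustrative assumptions, not the author's code.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_synthetic_control(x1, X0, lam=0.1):
    """Weights for a penalized synthetic control (a sketch, not the author's
    code). x1: (p,) treated unit; X0: (p, n0) donor pool; lam trades off
    aggregate fit against pairwise matching discrepancies."""
    n0 = X0.shape[1]
    pair_disc = np.sum((X0 - x1[:, None]) ** 2, axis=0)  # ||x1 - X0_j||^2

    def objective(w):
        return np.sum((x1 - X0 @ w) ** 2) + lam * (w @ pair_disc)

    cons = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)  # weights sum to 1
    bounds = [(0.0, 1.0)] * n0                                  # weights nonnegative
    w0 = np.full(n0, 1.0 / n0)
    res = minimize(objective, w0, bounds=bounds, constraints=cons)
    return res.x

rng = np.random.default_rng(0)
X0 = rng.normal(size=(5, 20))                  # 20 untreated units, 5 covariates
x1 = X0[:, :3] @ np.array([0.5, 0.3, 0.2])     # treated unit inside the convex hull
w = penalized_synthetic_control(x1, X0, lam=0.5)
print(np.round(w, 3))  # larger lam concentrates weight on close donor units
```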
192

Identification de biomarqueurs prédictifs de la survie et de l'effet du traitement dans un contexte de données de grande dimension / Identification of biomarkers predicting the outcome and the treatment effect in presence of high-dimensional data

Ternes, Nils 05 October 2016
With the recent revolution in genomics and in stratified medicine, the development of molecular signatures is becoming more and more important for predicting the prognosis (prognostic biomarkers) and the treatment effect (predictive biomarkers) of each patient. However, the large quantity of available information has made false positives more and more frequent in biomedical research. The high-dimensional setting (number of biomarkers ≫ sample size) raises several statistical challenges, such as the identifiability of the models, the instability of the selected coefficients and the multiple testing issue.
The aim of this thesis was to propose and evaluate statistical methods for the identification of these biomarkers and for the individual prediction of survival probabilities for new patients, in the context of the Cox regression model. For variable selection in a high-dimensional setting, the lasso penalty is commonly used. In the prognostic setting, an empirical extension of the lasso penalty is proposed that is more stringent on the estimation of the tuning parameter λ in order to select fewer false positives. In the predictive setting, the focus is on biomarker-by-treatment interactions in the setting of a randomized clinical trial.
Twelve approaches for selecting these interactions are evaluated, including the lasso (standard, adaptive, grouped or ridge+lasso), boosting, dimension reduction of the main effects and a model incorporating arm-specific biomarker effects. Finally, several strategies are studied to obtain an individual survival prediction with a corresponding confidence interval for a future patient from a penalized regression model, while limiting potential overfitting. The performance of the approaches was evaluated through simulation studies combining null and alternative scenarios. The methods are also illustrated on several data sets containing gene expression data in breast cancer.
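One of the twelve approaches, the standard lasso on biomarker-by-treatment interactions in a penalized Cox model, can be sketched as follows. This is a generic illustration assuming the lifelines package; the toy data, penalty value and column names are our own, and the thesis' empirical rule for tightening λ is not reproduced.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter  # assumed available; any lasso-Cox solver works

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
treat = rng.integers(0, 2, size=n)
# in this toy design, only biomarker 0 modifies the treatment effect
hazard = np.exp(0.5 * X[:, 0] * treat)
time = rng.exponential(1.0 / hazard)
event = rng.uniform(size=n) < 0.8

df = pd.DataFrame(X, columns=[f"bm{j}" for j in range(p)])
df["treat"] = treat
for j in range(p):                       # biomarker-by-treatment interactions
    df[f"treat_x_bm{j}"] = treat * X[:, j]
df["time"], df["event"] = time, event

# l1_ratio=1.0 gives a pure lasso penalty on the Cox partial likelihood
cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph.fit(df, duration_col="time", event_col="event")
print(cph.params_[cph.params_.abs() > 1e-4])  # surviving (selected) terms
```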
193

Vysoce výkonné prohledávání a dotazování ve vybraných mnohadimenzionálních prostorech v přírodních vědách / High-performance exploration and querying of selected multi-dimensional spaces in life sciences

Kratochvíl, Miroslav January 2020
This thesis studies, implements and experiments with application-oriented approaches for exploring and querying multi-dimensional datasets. The first part scrutinizes indexing of the complex space of chemical compounds and details the design of a high-performance retrieval system for small molecules. The resulting system is then utilized within the wider context of federated search in heterogeneous data and metadata related to chemical datasets. In the second part, the thesis focuses on fast visualization and exploration of many-dimensional data originating from single-cell cytometry. Self-organizing maps are used to derive fast methods for analysis of the datasets and serve as the basis for a novel data visualization algorithm. Finally, a similar approach is utilized for highly interactive exploration of multimedia datasets. The main contributions of the thesis comprise the advances in optimization and methods for querying chemical data implemented in the Sachem database cartridge, the federated, SPARQL-based interface to Sachem that provides the heterogeneous search support, the dimensionality-reduction algorithm EmbedSOM, the design and implementation of the EmbedSOM-backed analysis tool for flow and mass cytometry, and the design and implementation of the multimedia...
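The self-organizing map machinery underlying the cytometry part can be illustrated with a compact training loop. This is a generic toy SOM, not the EmbedSOM or Sachem code; the grid size, learning rate and neighborhood schedules are arbitrary assumptions.

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=10, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal online SOM training (illustrative sketch only).
    data: (n, d) array; returns a (gx*gy, d) codebook of prototypes."""
    rng = np.random.default_rng(seed)
    gx, gy = grid
    codes = rng.normal(size=(gx * gy, data.shape[1]))
    # 2-d grid coordinates, used by the neighborhood function
    coords = np.array([(i, j) for i in range(gx) for j in range(gy)], float)
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            x, frac = data[idx], step / total
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
            bmu = np.argmin(np.sum((codes - x) ** 2, axis=1))  # best-matching unit
            d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)   # grid distances
            h = np.exp(-d2 / (2 * sigma ** 2))                 # neighborhood kernel
            codes += lr * h[:, None] * (x - codes)             # pull codes toward x
            step += 1
    return codes

# toy 'cytometry-like' data: two populations in 8 dimensions
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (500, 8)), rng.normal(4, 1, (500, 8))])
codebook = train_som(data)
print(codebook.shape)  # (100, 8): prototypes summarizing the dataset
```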
194

False Discovery Rates, Higher Criticism and Related Methods in High-Dimensional Multiple Testing

Klaus, Bernd 09 January 2013
The technical advances in genomics, functional magnetic resonance imaging and other areas of scientific research seen in the last two decades have led to a burst of interest in multiple testing procedures. A driving factor for innovations in the field has been the problem of large-scale simultaneous testing, where the goal is to uncover lower-dimensional signals in high-dimensional data. Mathematically speaking, this means that the dimension d is usually in the thousands while the sample size n is relatively small (at most about 100, often due to cost constraints), a characteristic commonly abbreviated as d >> n.
In my thesis I look at several multiple testing problems and corresponding procedures from a false discovery rate (FDR) perspective, a methodology originally introduced in the seminal paper by Benjamini and Hochberg (1995). FDR analysis starts by fitting a two-component mixture model to the observed test statistics: a null model density and an alternative component density from which the interesting cases are assumed to be drawn. In the thesis I propose a new approach, called log-FDR, to the estimation of false discovery rates. Specifically, a truncated maximum likelihood approach yields accurate null model estimates; this is complemented by constrained maximum likelihood estimation of the alternative density using log-concave density estimation.
A recent competitor to the FDR is the method of Higher Criticism. It has been strongly advocated in the context of variable selection in classification, which is deeply linked to multiple comparisons. Hence, I also look at variable selection in class prediction, which can be viewed as a special signal identification problem. Both FDR methods and Higher Criticism can be highly useful for signal identification; this is discussed in the context of variable selection in linear discriminant analysis (LDA), a popular classification method.
FDR methods are not only useful for multiple testing in the strict sense; they are also applicable to related problems. I look at several applications of FDR in linear classification, present and extend statistical techniques for effect size estimation based on false discovery rates, and show how to use these for variable selection. The resulting fdr-effect method for effect size estimation is shown to work as well as competing approaches while being conceptually simple and computationally inexpensive. Additionally, I apply the fdr-effect method to variable selection by minimizing the misclassification rate and show that it works very well and leads to compact and interpretable feature sets.
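The two-component mixture at the heart of FDR analysis admits a short empirical sketch. The version below uses a theoretical N(0,1) null, a kernel estimate of the marginal density and a crude null-proportion heuristic; the thesis' log-FDR approach with truncated and log-concave maximum likelihood is more refined, so every choice here is an assumption made for illustration.

```python
import numpy as np
from scipy import stats

def local_fdr(z):
    """Toy local false discovery rate: fdr(z) = pi0 * f0(z) / f(z),
    with a standard normal null f0 and a kernel estimate of the
    marginal f. Illustrative only; not the thesis' log-FDR estimator."""
    f = stats.gaussian_kde(z)(z)            # marginal density at each z
    f0 = stats.norm.pdf(z)                  # theoretical null density
    # crude pi0: fraction of central z-values relative to the null mass there
    central = stats.norm.cdf(1) - stats.norm.cdf(-1)
    pi0 = min(1.0, np.mean(np.abs(z) < 1.0) / central)
    return np.clip(pi0 * f0 / f, 0.0, 1.0)

rng = np.random.default_rng(2)
z = np.concatenate([rng.normal(0, 1, 9000),   # 90% true nulls
                    rng.normal(3, 1, 1000)])  # 10% signals
fdr = local_fdr(z)
print("cases with fdr < 0.2:", int(np.sum(fdr < 0.2)))
```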
195

Efficient multivariate approximation with transformed rank-1 lattices

Nasdala, Robert 17 May 2022
We study the approximation of functions defined on different domains by trigonometric and transformed trigonometric functions. We investigate which of the many results known from approximation theory on the d-dimensional torus can be transferred to other domains. We define invertible parameterized transformations and prove conditions under which functions from a weighted Sobolev space can be transformed into functions defined on the torus that still have a certain degree of Sobolev smoothness and for which we know worst-case upper error bounds. By reverting the initial change of variables we transfer the fast algorithms based on rank-1 lattices used to approximate functions on the torus efficiently over to other domains and obtain adapted FFT algorithms.
Contents: 1 Introduction; 2 Preliminaries and notations; 3 Fourier approximation on the torus; 4 Torus-to-R^d transformation mappings; 5 Torus-to-cube transformation mappings; 6 Conclusion; Alphabetical Index.
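The efficiency of rank-1 lattice methods stems from a standard identity: restricted to the lattice nodes, a multivariate trigonometric polynomial collapses to a one-dimensional FFT over the aliasing classes k·z mod M. The sketch below illustrates this on the untransformed torus with an arbitrary generating vector and frequency set; it is our own toy, not the adapted algorithms of the thesis.

```python
import numpy as np

def eval_on_rank1_lattice(freqs, coeffs, z, M):
    """Evaluate f(x) = sum_k coeffs[k] * exp(2*pi*1j*<k, x>) at the rank-1
    lattice nodes x_j = (j*z/M) mod 1, j = 0..M-1, with one length-M FFT.
    Illustrative sketch of the standard trick, not the thesis' code."""
    ghat = np.zeros(M, dtype=complex)
    idx = (freqs @ z) % M                  # k.z mod M picks the 1-d frequency bin
    np.add.at(ghat, idx, coeffs)           # aggregate aliasing classes
    return np.fft.ifft(ghat) * M           # f(x_j) = sum_m ghat[m] e^{2 pi i j m / M}

# two-dimensional example with a small frequency set
freqs = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [-1, 2]])
coeffs = np.array([1.0, 0.5, 0.5, 0.25, 0.1])
z, M = np.array([1, 7]), 53                # generating vector and lattice size
vals = eval_on_rank1_lattice(freqs, coeffs, z, M)

# cross-check one node against direct evaluation
j = 11
x = (j * z / M) % 1.0
direct = np.sum(coeffs * np.exp(2j * np.pi * (freqs @ x)))
print(np.allclose(vals[j], direct))        # True
```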
196

Nouvelles méthodes pour l’apprentissage non-supervisé en grandes dimensions. / New methods for large-scale unsupervised learning.

Tiomoko Ali, Hafiz 24 September 2018
Spurred by recent advances in the theoretical analysis of the performance of machine learning algorithms, this thesis tackles the performance analysis and improvement of high-dimensional data and graph clustering. In the first, larger part of the thesis, using advanced tools from random matrix theory, we analyze the performance of spectral methods on dense realistic graph models and on high-dimensional kernel random matrices through the study of the eigenvalues and eigenvectors of the similarity matrices characterizing those data. New improved methods are proposed on the basis of this theoretical analysis and are shown, through numerous simulations, to outperform state-of-the-art approaches. In the second part, a new algorithm is proposed for the detection of heterogeneous communities across the layers of a multi-layer graph with several interaction types, using a variational Bayes approach to approximate the posterior distribution of the latent variables. All the proposed methods are applied to synthetic benchmarks as well as real-world datasets and are shown to outperform standard clustering approaches in these specific contexts.
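As a baseline for the spectral methods analyzed in the first part, here is plain kernel spectral clustering in a few lines. It is a generic textbook version under arbitrary toy assumptions (Gaussian kernel, bandwidth gamma, k-means on the top eigenvectors); the corrected, random-matrix-improved variants of the thesis are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(X, k, gamma=1.0, seed=0):
    """Plain kernel spectral clustering (toy sketch only)."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # Gaussian kernel
    d = K.sum(axis=1)
    L = K / np.sqrt(d[:, None] * d[None, :])        # normalized affinity matrix
    _, vecs = np.linalg.eigh(L)                     # eigenvalues in ascending order
    U = vecs[:, -k:]                                # top-k eigenvectors
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # row-normalize the embedding
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (100, 5)), rng.normal(2, 1, (100, 5))])
labels = spectral_clusters(X, k=2, gamma=0.1)
print(np.bincount(labels))  # roughly 100 / 100 on this easy two-cluster toy
```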
197

Statistical Inference for Change Points in High-Dimensional Offline and Online Data

Li, Lingjun 07 April 2020
No description available.
198

Fast, exact and stable reconstruction of multivariate algebraic polynomials in Chebyshev form

Potts, Daniel, Volkmer, Toni 16 February 2015
We describe a fast method for the evaluation of an arbitrary high-dimensional multivariate algebraic polynomial in Chebyshev form at the nodes of an arbitrary rank-1 Chebyshev lattice. Our main focus is on conditions on rank-1 Chebyshev lattices allowing for the exact reconstruction of such polynomials from samples along such lattices and we present an algorithm for constructing suitable rank-1 Chebyshev lattices based on a component-by-component approach. Moreover, we give a method for the fast, exact and stable reconstruction.
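The evaluation half of the method can be illustrated via the identity T_k(cos t) = cos(kt): along a rank-1 Chebyshev lattice, every Chebyshev basis product reduces to a product of cosines. The sketch below is a naive direct evaluation used only to show the node structure; the generating vector, lattice size and frequency set are arbitrary assumptions, and the paper's fast reconstruction and component-by-component construction are not shown.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def eval_cheb_on_lattice(freqs, coeffs, z, M):
    """Evaluate p(x) = sum_k coeffs[k] * prod_i T_{k_i}(x_i) at the rank-1
    Chebyshev lattice nodes x_j = cos(pi * j * z / M), j = 0..M, using
    T_k(cos t) = cos(k t). Naive toy evaluation, not the fast algorithm."""
    j = np.arange(M + 1)
    theta = np.pi * np.outer(j, z) / M              # (M+1, d) angles per node
    out = np.zeros(M + 1)
    for k, a in zip(freqs, coeffs):
        out += a * np.prod(np.cos(theta * k), axis=1)
    return out, np.cos(theta)                       # values and the nodes

freqs = np.array([[0, 0], [1, 0], [0, 2], [1, 1]])  # Chebyshev frequency index set
coeffs = np.array([1.0, 0.5, 0.25, 0.1])
z, M = np.array([1, 3]), 16                         # generating vector, lattice size
vals, nodes = eval_cheb_on_lattice(freqs, coeffs, z, M)

def T(k, x):  # single Chebyshev polynomial via numpy's chebval
    c = np.zeros(k + 1); c[k] = 1.0
    return C.chebval(x, c)

x = nodes[5]                                        # cross-check one node directly
direct = sum(a * T(k[0], x[0]) * T(k[1], x[1]) for k, a in zip(freqs, coeffs))
print(np.isclose(vals[5], direct))                  # True
```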
199

Geometry of high dimensional Gaussian data

Mossberg, Olof Samuel January 2024
Collected data may simultaneously have a small sample size and high dimension. Such data exhibit geometric regularities: a single observation behaves like a random rotation on a sphere, and a pair of observations is nearly orthogonal. This thesis investigates these geometric properties in some detail. Background is provided and various approaches to the result are discussed. An approach based on the mean value theorem is eventually chosen, being the only candidate investigated that gives explicit convergence bounds. The bounds are tested using Monte Carlo simulation and found to be adequate.
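The two regularities are easy to check by simulation, in the spirit of the thesis' Monte Carlo experiments; this toy check is our own illustration, not the thesis' code.

```python
import numpy as np

rng = np.random.default_rng(4)
for d in (10, 1000, 100_000):
    X = rng.normal(size=(100, d))                   # 100 observations in R^d
    norms = np.linalg.norm(X, axis=1) / np.sqrt(d)  # should concentrate near 1
    G = (X @ X.T) / d                               # scaled pairwise inner products
    off = np.abs(G[~np.eye(100, dtype=bool)])       # off-diagonal entries
    print(f"d={d:>6}: norm/sqrt(d) sd={norms.std():.4f}, "
          f"max |<x_i,x_j>|/d={off.max():.4f}")
# both quantities shrink as d grows: each point sits on a sphere of radius
# about sqrt(d), and distinct observations become nearly orthogonal
```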
200

ESSAYS ON SCALABLE BAYESIAN NONPARAMETRIC AND SEMIPARAMETRIC MODELS

Chenzhong Wu (18275839) 29 March 2024
<p dir="ltr">In this thesis, we delve into the exploration of several nonparametric and semiparametric econometric models within the Bayesian framework, highlighting their applicability across a broad spectrum of microeconomic and macroeconomic issues. Positioned in the big data era, where data collection and storage expand at an unprecedented rate, the complexity of economic questions we aim to address is similarly escalating. This dual challenge ne- cessitates leveraging increasingly large datasets, thereby underscoring the critical need for designing flexible Bayesian priors and developing scalable, efficient algorithms tailored for high-dimensional datasets.</p><p dir="ltr">The initial two chapters, Chapter 2 and 3, are dedicated to crafting Bayesian priors suited for environments laden with a vast array of variables. These priors, alongside their corresponding algorithms, are optimized for computational efficiency, scalability to extensive datasets, and, ideally, distributability. We aim for these priors to accommodate varying levels of dataset sparsity. Chapter 2 assesses nonparametric additive models, employing a smoothing prior alongside a band matrix for each additive component. Utilizing the Bayesian backfitting algorithm significantly alleviates the computational load. In Chapter 3, we address multiple linear regression settings by adopting a flexible scale mixture of normal priors for coefficient parameters, thus allowing data-driven determination of the necessary amount of shrinkage. The use of a conjugate prior enables a closed-form solution for the posterior, markedly enhancing computational speed.</p><p dir="ltr">The subsequent chapters, Chapter 4 and 5, pivot towards time series dataset model- ing and Bayesian algorithms. A semiparametric modeling approach dissects the stochastic volatility in macro time series into persistent and transitory components, the latter addi- tional component addressing outliers. Utilizing a Dirichlet process mixture prior for the transitory part and a collapsed Gibbs sampling algorithm, we devise a method capable of efficiently processing over 10,000 observations and 200 variables. Chapter 4 introduces a simple univariate model, while Chapter 5 presents comprehensive Bayesian VARs. Our al- gorithms, more efficient and effective in managing outliers than existing ones, are adept at handling extensive macro datasets with hundreds of variables.</p>
