11

COMPOSITE NONPARAMETRIC TESTS IN HIGH DIMENSION

Villasante Tezanos, Alejandro G. 01 January 2019 (has links)
This dissertation focuses on the problem of making high-dimensional inference for two or more groups. High-dimensional means both the sample size (n) and the dimension (p) tend to infinity, possibly at different rates. Classical approaches for group comparisons fail in the high-dimensional situation, in the sense that they have incorrect sizes and low power. Much has been done in recent years to overcome these problems. However, these recent works make restrictive assumptions in terms of the number of treatments to be compared and/or the distribution of the data. This research aims to (1) propose and investigate refined small-sample approaches for high-dimensional data in the multi-group setting, (2) propose and study a fully nonparametric approach, and (3) conduct an extensive simulation comparison of the proposed methods with some existing ones. When treatment effects can meaningfully be formulated in terms of means, a semiparametric approach under equal and unequal covariance assumptions is investigated. Composites of F-type statistics are used to construct two tests. One test is a moderate-p version, in which the test statistic is centered by its asymptotic mean, and the other is a large-p version that uses an asymptotic-expansion-based finite-sample correction for the mean of the test statistic. These tests do not make any distributional assumptions and are therefore nonparametric in that sense. The theory for the tests requires only mild assumptions to regulate the dependence. Simulation results show that, for moderately small samples, the large-p version yields a substantial gain in size accuracy with a small power tradeoff. In some situations mean-based inference is not appropriate, for example for data that are on an ordinal scale or heavy-tailed. For these situations, a high-dimensional fully nonparametric test is proposed. In the two-sample situation, a composite of a Wilcoxon-Mann-Whitney type test is investigated. The assumptions needed are weaker than those in the semiparametric approach. Numerical comparisons with the moderate-p version of the semiparametric approach show that the nonparametric test has very similar size but achieves superior power, especially for skewed data with some amount of dependence between variables. Finally, we conduct an extensive simulation to compare our proposed methods with other nonparametric tests and rank transformation methods. A wide spectrum of simulation settings is considered, including a variety of heavy-tailed and skewed data distributions, homoscedastic and heteroscedastic covariance structures, various amounts of dependence, and choices of the tuning (smoothing window) parameter for the asymptotic variance estimators. The fully nonparametric and the rank transformation methods behave similarly in terms of type I and type II errors. However, the two approaches fundamentally differ in their hypotheses. Although there are no formal mathematical proofs for the rank transformations, they tend to provide immunity against the effects of outliers. From a theoretical standpoint, our nonparametric method essentially uses variable-by-variable ranking, which naturally arises from estimating the nonparametric effect of interest. As a result, our method is invariant under any monotone marginal transformations. For a more practical comparison, real data from an electroencephalogram (EEG) experiment are analyzed.
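The two-sample fully nonparametric idea can be illustrated with a toy composite of variable-wise Wilcoxon-Mann-Whitney statistics. The sketch below is only a simplified illustration with a naive independence-style scaling; the dissertation's actual composite statistic, its dependence-adjusted variance estimator, and the smoothing-window tuning are not reproduced, and the function name is invented for the example.

```python
import numpy as np
from scipy.stats import rankdata, norm

def composite_wmw(X, Y):
    """Toy composite Wilcoxon-Mann-Whitney statistic for two groups.

    X: (n1, p) sample from group 1; Y: (n2, p) sample from group 2.
    Each variable's WMW statistic is standardized under the null and
    the p standardized statistics are averaged.
    """
    n1, p = X.shape
    n2 = Y.shape[0]
    mu = n1 * n2 / 2.0
    sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)   # null SD, assuming no ties
    z = np.empty(p)
    for j in range(p):
        ranks = rankdata(np.concatenate([X[:, j], Y[:, j]]))
        U = ranks[:n1].sum() - n1 * (n1 + 1) / 2.0     # WMW U-statistic for variable j
        z[j] = (U - mu) / sigma
    # naive sqrt(p) scaling: valid only if the variables were independent;
    # the dissertation instead estimates the variance of the composite under dependence
    T = np.sqrt(p) * z.mean()
    return T, 2 * norm.sf(abs(T))
```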
12

The effects of high dimensional covariance matrix estimation on asset pricing and generalized least squares

Kim, Soo-Hyun 23 June 2010 (has links)
High-dimensional covariance matrix estimation is considered in the context of empirical asset pricing. In order to see the effects of covariance matrix estimation on asset pricing, three problems are explored: parameter estimation, model specification testing, and misspecification. Along with existing techniques, a heuristic diagonal variance matrix, which has not yet been tested in applications, is simulated to evaluate performance on these problems. We found that the modified Stein-type estimator outperforms all the other methods in all three cases. In addition, it turned out that the heuristic diagonal variance matrix works far better than existing methods in the Hansen-Jagannathan distance test. The high-dimensional covariance matrix as a transformation matrix in generalized least squares (GLS) is also studied. Since the feasible generalized least squares estimator requires ex ante knowledge of the covariance structure, it is not applicable in general cases. We propose a full-banding strategy for the new estimation technique. First we look into the sparsity of the covariance matrix and the performance of GLS. Then we move on to a discussion of the diagonals of the covariance matrix and the column sums of the inverse covariance matrix, to see their effects on GLS estimation. In addition, factor analysis is employed to model the covariance matrix, and it turned out that communality truly matters for the efficiency of GLS estimation.
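As a rough illustration of how a simple diagonal covariance estimate can be plugged into GLS, the sketch below runs a two-step feasible GLS in which squared OLS residuals serve as diagonal weights. This is a generic textbook-style construction under assumed heteroscedastic errors; the thesis's diagonal-variance heuristic, banding strategy, and asset-pricing application are not reproduced, and the function name is illustrative.

```python
import numpy as np

def diagonal_fgls(X, y):
    """Two-step feasible GLS with a diagonal covariance estimate (illustrative sketch).

    A first-pass OLS fit provides residuals; their squared values serve as
    heteroscedastic variance estimates, whose inverses weight a second, GLS, pass.
    """
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    w = 1.0 / np.maximum(resid**2, 1e-8)      # inverse of estimated error variances
    Xw = X * w[:, None]                        # row-weighted design
    beta_gls = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta_gls
```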
13

Copulas for High Dimensions: Models, Estimation, Inference, and Applications

Oh, Dong Hwan January 2014 (has links)
The dissertation consists of four chapters that concern topics on copulas for high dimensions. Chapter 1 proposes a new general model for high-dimension joint distributions of asset returns that utilizes high-frequency data and copulas. The dependence between returns is decomposed into linear and nonlinear components, which enables the use of high-frequency data to accurately measure and forecast linear dependence, and the use of a new class of copulas designed to capture nonlinear dependence among the resulting linearly uncorrelated residuals. Estimation of the new class of copulas is conducted using a composite likelihood, making the model feasible even for hundreds of variables. A realistic simulation study verifies that multistage estimation with composite likelihood results in a small loss in efficiency and a large gain in computation speed. Chapter 2, which is co-authored with Professor Andrew Patton, presents new models for the dependence structure, or copula, of economic variables based on a factor structure. The proposed models are particularly attractive for high-dimensional applications involving fifty or more variables. This class of models generally lacks a closed-form density, but analytical results for the implied tail dependence can be obtained using extreme value theory, and estimation via a simulation-based method using rank statistics is simple and fast. We study the finite-sample properties of the estimation method for applications involving up to 100 variables, and apply the model to daily returns on all 100 constituents of the S&P 100 index. We find significant evidence of tail dependence, heterogeneous dependence, and asymmetric dependence, with dependence being stronger in crashes than in booms. Chapter 3, which is co-authored with Professor Andrew Patton, considers the estimation of the parameters of a copula via a simulated method of moments type approach. This approach is attractive when the likelihood of the copula model is not known in closed form, or when the researcher has a set of dependence measures or other functionals of the copula that are of particular interest. The proposed approach naturally also nests method of moments and generalized method of moments estimators. Drawing on results for simulation-based estimation and on recent work in empirical copula process theory, we show the consistency and asymptotic normality of the proposed estimator, and obtain a simple test of over-identifying restrictions as a goodness-of-fit test. The results apply to both iid and time series data. We analyze the finite-sample behavior of these estimators in an extensive simulation study. Chapter 4, which is co-authored with Professor Andrew Patton, proposes a new class of copula-based dynamic models for high-dimension conditional distributions, facilitating the estimation of a wide variety of measures of systemic risk. Our proposed models draw on successful ideas from the literature on modelling high-dimension covariance matrices and on recent work on models for general time-varying distributions. Our use of copula-based models enables the estimation of the joint model in stages, greatly reducing the computational burden. We use the proposed new models to study a collection of daily credit default swap (CDS) spreads on 100 U.S. firms over the period 2006 to 2012. We find that while the probability of distress for individual firms has greatly reduced since the financial crisis of 2008-09, the joint probability of distress (a measure of systemic risk) is substantially higher now than in the pre-crisis period. / Dissertation
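To illustrate the flavor of simulation-based, method-of-moments-type copula estimation discussed in Chapter 3, the toy sketch below fits the dependence parameter of a bivariate Clayton copula by matching simulated and empirical Spearman's rho. The copula family, the single-moment choice, and the function names are assumptions made for this example; the chapter's estimator, weighting, and asymptotic theory are not reproduced.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.optimize import minimize_scalar

def clayton_sample(theta, n, rng):
    """Simulate n pairs from a bivariate Clayton copula via a gamma frailty."""
    v = rng.gamma(1.0 / theta, 1.0, size=n)            # frailty V ~ Gamma(1/theta)
    e = rng.exponential(1.0, size=(n, 2))
    return (1.0 + e / v[:, None]) ** (-1.0 / theta)    # U_j = (1 + E_j / V)^(-1/theta)

def smm_fit_clayton(u_data, n_sim=50_000, seed=0):
    """Toy simulated-method-of-moments fit: match Spearman's rho only."""
    rho_hat = spearmanr(u_data[:, 0], u_data[:, 1]).correlation
    def loss(theta):
        # re-seeding keeps the simulated moment deterministic across theta values
        u_sim = clayton_sample(theta, n_sim, np.random.default_rng(seed))
        rho_sim = spearmanr(u_sim[:, 0], u_sim[:, 1]).correlation
        return (rho_sim - rho_hat) ** 2
    return minimize_scalar(loss, bounds=(0.05, 20.0), method="bounded").x
```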
14

Some problems in high dimensional data analysis

Pham, Tung Huy January 2010 (has links)
The boom in economics and technology has had an enormous impact on society. Along with these developments, human activities nowadays produce massive amounts of data that can be collected easily and at relatively low cost with the aid of new technologies. Many examples can be mentioned here, including web term-document data, sensor arrays, gene expression, financial data, imaging, and hyperspectral analysis. Because of the enormous amount of data from various different and new sources, more and more challenging scientific problems appear. These problems have changed the types of problems on which mathematical scientists work. / In traditional statistics, the dimension of the data, p say, is low, with many observations, n say. In this case, classical rules such as the Central Limit Theorem are often applied to obtain some understanding from the data. A new challenge to statisticians today is dealing with a different setting, in which the data dimension is very large and the number of observations is small. The mathematical assumption now could be p > n, or even p going to infinity with n fixed, as in many cases where, for example, there are few patients but many genes. In these cases, classical methods fail to produce a good understanding of the nature of the problem. Hence, new methods need to be found to solve these problems, and mathematical explanations are also needed to generalize these cases. / The research presented in this thesis addresses two problems, variable selection and classification, in the case where the dimension is very large. The work on variable selection problems, in particular the Adaptive Lasso, was completed by June 2007, and the research on classification was carried out throughout 2008 and 2009. The research on the Dantzig selector and the Lasso was finished in July 2009. Therefore, this thesis is divided into two parts. In the first part of the thesis we study the Adaptive Lasso, the Lasso, and the Dantzig selector. In particular, in Chapter 2 we present some results for the Adaptive Lasso. Chapter 3 provides two examples showing that neither the Dantzig selector nor the Lasso is definitively better than the other. The second part of the thesis is organized as follows. In Chapter 5, we construct the model setting. In Chapter 6, we summarize the results on the scaled centroid-based classifier, and we also prove some results on this classifier. Because there are similarities between the Support Vector Machine (SVM) and Distance Weighted Discrimination (DWD) classifiers, Chapter 8 introduces a class of distance-based classifiers that can be considered a generalization of the SVM and DWD classifiers. Chapters 9 and 10 are about the SVM and DWD classifiers. Chapter 11 demonstrates the performance of these classifiers on simulated data sets and some cancer data sets.
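A minimal version of a scaled centroid-based classifier, sketched below, assigns each test point to the class whose centroid is closest after every variable is scaled by its pooled within-class standard deviation. This is only a generic illustration of the idea; the thesis's exact scaling, its high-dimensional asymptotics, and the SVM/DWD comparisons are not reproduced, and the function name is invented for the example.

```python
import numpy as np

def scaled_centroid_classify(X_train, y_train, X_test):
    """Assign each test point to the nearest class centroid in a scaled metric."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # pooled within-class standard deviation per variable (small offset for stability)
    resid = np.concatenate([X_train[y_train == c] - centroids[i]
                            for i, c in enumerate(classes)], axis=0)
    s = resid.std(axis=0) + 1e-8
    # scaled squared distances: shape (n_test, n_classes)
    d2 = (((X_test[:, None, :] - centroids[None, :, :]) / s) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]
```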
16

Time Efficient and Quality Effective K Nearest Neighbor Search in High Dimension Space

January 2011 (has links)
K-Nearest-Neighbors (KNN) search is a fundamental problem in many application domains, such as databases and data mining, information retrieval, machine learning, pattern recognition, and plagiarism detection. Locality-sensitive hashing (LSH) is so far the most practical approximate KNN search algorithm for high-dimensional data. Algorithms such as Multi-Probe LSH and LSH-Forest improve upon the basic LSH algorithm by varying hash bucket size dynamically at query time, so these two algorithms can answer different KNN queries adaptively. However, both need a data-access post-processing step after candidate collection in order to get the final answer to the KNN query. In this thesis, the Multi-Probe LSH with data access post-processing (Multi-Probe LSH with DAPP) algorithm and the LSH-Forest with data access post-processing (LSH-Forest with DAPP) algorithm are improved by replacing the costly data access post-processing (DAPP) step with a much faster histogram-based post-processing (HBPP). Two HBPP algorithms, LSH-Forest with HBPP and Multi-Probe LSH with HBPP, are presented in this thesis; both achieve the three goals for KNN search in large-scale high-dimensional data sets: high search quality, high time efficiency, and high space efficiency. None of the previous KNN algorithms achieves all three goals. More specifically, it is shown that the HBPP algorithms always achieve high search quality (as good as LSH-Forest with DAPP and Multi-Probe LSH with DAPP) with much lower time cost (one to several orders of magnitude speedup) and the same memory usage. It is also shown that, with almost the same time cost and memory usage, the HBPP algorithms always achieve better search quality than LSH-Forest with random pick (LSH-Forest with RP) and Multi-Probe LSH with random pick (Multi-Probe LSH with RP). Moreover, to achieve a very high search quality, Multi-Probe LSH with HBPP is always a better choice than LSH-Forest with HBPP, regardless of the distribution, size, and dimensionality of the data set. / Dissertation/Thesis / M.S. Computer Science 2011
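For context, a bare-bones random-hyperplane LSH query (without multi-probing, forests, or the histogram-based post-processing introduced in the thesis) can be sketched as follows; the function name and parameter choices are purely illustrative.

```python
import numpy as np

def lsh_knn_query(data, query, k=5, n_bits=16, n_tables=8, seed=0):
    """Approximate KNN via random-hyperplane LSH: collect bucket collisions, rank by exact distance."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    candidates = set()
    for _ in range(n_tables):
        planes = rng.standard_normal((d, n_bits))
        codes = data @ planes > 0                 # (n, n_bits) boolean hash codes
        q_code = query @ planes > 0
        hits = np.where((codes == q_code).all(axis=1))[0]
        candidates.update(hits.tolist())
    if not candidates:
        candidates = set(range(n))                # fall back to exact search
    cand = np.fromiter(candidates, dtype=int)
    dist = np.linalg.norm(data[cand] - query, axis=1)
    return cand[np.argsort(dist)[:k]]
```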
17

An empirical analysis of systemic risk in commodity futures markets / Une analyse empirique du risque systémique sur les marchés futures de matières premières

Ling, Julien 26 September 2018 (has links)
This thesis aims to analyze systemic risk in commodity futures markets. Indeed, several research works highlight the importance of these futures in the determination of the physical price of commodities. Their incorporation into traditional finance as a diversifying asset has caused their prices to evolve similarly to those of various financial assets since around 2004. The question motivating this thesis was therefore to quantify this systemic risk (since it affects commodities, which are directly involved in the real economy), to identify precisely its transmission channels (which markets affect which other markets), and finally to make it possible to evaluate its consequences, for example through scenarios (stress tests). The thesis thus develops market-monitoring tools and could therefore contribute to the regulation of these markets. / This thesis aims at studying systemic risk in commodity futures markets. A whole strand of the literature is dedicated to the "financialization of commodity markets", but also to the influence of the existence of futures markets on the spot price of their underlying asset. Indeed, since these commodity futures have been largely used in asset management as diversifying assets, their financialization has raised concerns, especially seeing the evolution of their price, which seems to be similar to that of financial assets. My interest here is thus to quantify this systemic risk, to provide a toolbox to assess the consequences of various scenarios (stress tests), and to assess which markets should be monitored more closely (because they could threaten the real economy or the whole system).
18

Heritability Estimation in High-dimensional Mixed Models : Theory and Applications. / Estimation de l'héritabilité dans les modèles mixtes en grande dimension : théorie et applications.

Bonnet, Anna 05 December 2016 (has links)
We are interested in statistical methods to estimate the heritability of a biological trait, which corresponds to the share of the variation of this trait that can be attributed to genetic factors. We first propose to study the heritability of continuous biological traits using sparse high-dimensional linear mixed models. We investigated the theoretical properties of the maximum likelihood estimator of heritability: we showed that this estimator is consistent and satisfies a central limit theorem with an asymptotic variance that we computed explicitly. This result, supported by numerical simulations on finite samples, allowed us to observe that the variance of our estimator is strongly influenced by the ratio between the number of observations and the size of the genetic effects. More precisely, when the number of observations is small compared with the size of the genetic effects (which is very often the case in genetic studies), the variance of the estimator is very large. This observation motivated the development of a variable selection method so as to keep only the genetic variants most involved in the phenotypic variation and to improve the accuracy of the heritability estimates. The last part of this thesis is devoted to heritability estimation for binary data, with the aim of studying the share of genetic factors involved in complex diseases. We propose to study the theoretical properties of the method developed by Golan et al. (2014) for case-control data, which is very efficient in practice. In particular, we prove the consistency of the heritability estimator proposed by Golan et al. (2014). / We study statistical methods to estimate the heritability of a biological trait, which is the proportion of variation of this trait that can be explained by genetic factors. First, we propose to study the heritability of quantitative traits using high-dimensional sparse linear mixed models. We investigate the theoretical properties of the maximum likelihood estimator for the heritability and we show that it is a consistent estimator and that it satisfies a central limit theorem with a closed-form expression for the asymptotic variance. This result, supported by an extended numerical study, shows that the variance of our estimator is strongly affected by the ratio between the number of observations and the size of the random genetic effects. More precisely, when the number of observations is small compared to the size of the genetic effects (which is often the case in genetic studies), the variance of our estimator is very large. This motivated the development of a variable selection method in order to capture the genetic variants which are involved the most in the phenotypic variations and provide more accurate heritability estimations. We then propose a variable selection method adapted to high-dimensional settings and we show that, depending on the number of genetic variants actually involved in the phenotypic variations, called causal variants, it may or may not be beneficial to include a variable selection step before estimating heritability. The last part of this thesis is dedicated to heritability estimation for binary data, in order to study the proportion of genetic factors involved in complex diseases. We propose to study the theoretical properties of the method developed by Golan et al. (2014) for case-control data, which is very efficient in practice. Our main result is the proof of the consistency of their heritability estimator.
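As a rough illustration of heritability estimation in a linear mixed model, the sketch below maximizes the profile log-likelihood of h2 = sigma_g^2 / (sigma_g^2 + sigma_e^2) over the spectrum of an empirical relatedness matrix K = Z Z^T / p. This is a generic spectral implementation under simplified assumptions (no fixed effects beyond an intercept, standardized markers); it is not the estimator or the variable-selection procedure developed in the thesis, and the function name is invented.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ml_heritability(y, Z):
    """Toy ML heritability estimate in y = Z u + e with Cov(y) = sigma2 * (h2*K + (1-h2)*I)."""
    n, p = Z.shape
    Zs = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)   # standardize markers
    K = Zs @ Zs.T / p                                      # empirical relatedness matrix
    lam, U = np.linalg.eigh(K)
    yt = U.T @ (y - y.mean())                              # rotate into the eigenbasis
    def neg_profile_loglik(h2):
        v = h2 * lam + (1.0 - h2)                          # eigenvalues of h2*K + (1-h2)*I
        sigma2 = np.mean(yt**2 / v)                        # total variance profiled out
        return np.sum(np.log(v)) + n * np.log(sigma2)
    return minimize_scalar(neg_profile_loglik,
                           bounds=(1e-4, 1 - 1e-4), method="bounded").x
```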
19

Méthodes régularisées pour l’analyse de données multivariées en grande dimension : théorie et applications. / Regularized methods to study multivariate data in high dimensional settings : theory and applications.

Perrot-Dockès, Marie 08 October 2019 (has links)
In this thesis we are interested in the general linear model (multivariate linear model) in high dimension. We propose a new sparse estimator of the coefficients of this model that takes into account the dependence that may exist between the different responses. This estimator is obtained by first estimating the covariance matrix of the responses and then including this covariance matrix in a Lasso criterion. The theoretical properties of this estimator are studied when the number of responses can tend to infinity faster than the sample size. More precisely, we propose general conditions that the estimators of the covariance matrix and of its inverse must satisfy in order to obtain sign consistency of the coefficients. We then developed methods, adapted to the high-dimensional setting, for estimating covariance matrices that are assumed to be Toeplitz matrices or matrices with a block structure, not necessarily diagonal. These methods were finally applied to problems in metabolomics, proteomics, and immunology. / In this PhD thesis we study the general linear model (multivariate linear model) in high-dimensional settings. We propose a novel variable selection approach in the framework of multivariate linear models taking into account the dependence that may exist between the responses. It consists in estimating beforehand the covariance matrix of the responses and plugging this estimator into a Lasso criterion, in order to obtain a sparse estimator of the coefficient matrix. The properties of our approach are investigated both from a theoretical and a numerical point of view. More precisely, we give general conditions that the estimators of the covariance matrix and its inverse have to satisfy in order to recover the positions of the zero and non-zero entries of the coefficient matrix when the number of responses is not fixed and can tend to infinity. We also propose novel, efficient and fully data-driven approaches for estimating Toeplitz and large block-structured sparse covariance matrices in the case where the number of variables is much larger than the number of samples, without limiting ourselves to block-diagonal matrices. These approaches are applied to different biological issues in metabolomics, proteomics and immunology.
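The "estimate the response covariance, then plug it into a Lasso criterion" idea can be sketched roughly as below: an OLS pilot fit supplies a residual covariance estimate that whitens the responses, and an ordinary Lasso is then run on the vectorized whitened model. The pilot step, the ridge regularization, and the function name are assumptions for illustration; the thesis's Toeplitz and block-structured covariance estimators and its sign-consistency theory are not reproduced.

```python
import numpy as np
from sklearn.linear_model import Lasso

def whitened_multivariate_lasso(X, Y, alpha=0.1):
    """Sparse estimate of B in Y = X B + E with correlated response columns."""
    n, k = X.shape
    q = Y.shape[1]
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)          # pilot fit
    R = Y - X @ B_ols
    Sigma = R.T @ R / n + 1e-6 * np.eye(q)                  # residual covariance + ridge
    w, V = np.linalg.eigh(Sigma)
    Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T           # symmetric inverse square root
    Yw = Y @ Sigma_inv_half                                 # decorrelated responses
    # vec(X B S) = (S^T kron X) vec(B) with column-major vec
    design = np.kron(Sigma_inv_half.T, X)
    target = Yw.flatten(order="F")
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(design, target)
    return fit.coef_.reshape(k, q, order="F")               # sparse estimate of B
```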
20

Contributions to variable selection, clustering and statistical estimation in high dimension / Quelques contributions à la sélection de variables, au clustering et à l’estimation statistique en grande dimension

Ndaoud, Mohamed 03 July 2019 (has links)
This thesis deals with the following statistical problems: variable selection in the high-dimensional linear regression model, clustering in the Gaussian mixture model, some effects of adaptivity under the sparsity assumption, and the simulation of Gaussian processes. Under the sparsity assumption, variable selection corresponds to recovering the "small" set of significant variables. We study the non-asymptotic properties of this problem in high-dimensional linear regression. Moreover, we characterize the optimal necessary and sufficient conditions for variable selection in this model. We also study certain effects of adaptation under the same assumption. In the sparse vector model, we analyze the changes in the estimation rates of some of the model parameters when the noise level or its nominal law are unknown. Clustering is an unsupervised statistical learning task aiming to group observations that are close to each other in some sense. We study the problem of community detection in the two-component Gaussian mixture model and characterize precisely the optimal separation between the groups needed to recover them exactly. We also provide a polynomial-time procedure achieving optimal recovery of the communities. Gaussian processes are extremely useful in practice, for example when modelling price fluctuations. Nevertheless, their simulation is not easy in general. We propose and study a new rate-optimal series expansion to simulate a large class of Gaussian processes. / This PhD thesis deals with the following statistical problems: variable selection in high-dimensional linear regression, clustering in the Gaussian mixture model, some effects of adaptivity under sparsity, and simulation of Gaussian processes. Under the sparsity assumption, variable selection corresponds to recovering the "small" set of significant variables. We study non-asymptotic properties of this problem in the high-dimensional linear regression. Moreover, we recover optimal necessary and sufficient conditions for variable selection in this model. We also study some effects of adaptation under sparsity. Namely, in the sparse vector model, we investigate the changes in the estimation rates of some of the model parameters when the noise level or its nominal law are unknown. Clustering is a non-supervised machine learning task aiming to group observations that are close to each other in some sense. We study the problem of community detection in the Gaussian Mixture Model with two components, and characterize precisely the sharp separation between clusters in order to recover exactly the clusters. We also provide a fast polynomial-time procedure achieving optimal recovery. Gaussian processes are extremely useful in practice, when it comes to modelling price fluctuations for instance. Nevertheless, their simulation is not easy in general. We propose and study a new rate-optimal series expansion to simulate a large class of Gaussian processes.
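As background for the series-expansion approach to simulating Gaussian processes, the sketch below uses the classical Karhunen-Loève expansion of Brownian motion on [0, 1], truncated to finitely many terms. This textbook example only illustrates the general idea of series-based simulation; the rate-optimal expansion constructed in the thesis, and the larger class of processes it covers, are not reproduced here.

```python
import numpy as np

def brownian_motion_kl(t, n_terms=500, rng=None):
    """Simulate Brownian motion on [0, 1] by its truncated Karhunen-Loeve series.

    B(t) ~= sqrt(2) * sum_k Z_k * sin((k - 1/2) * pi * t) / ((k - 1/2) * pi),
    with Z_k iid standard normal coefficients.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, n_terms + 1)
    freq = (k - 0.5) * np.pi
    z = rng.standard_normal(n_terms)
    return np.sqrt(2.0) * np.sin(np.outer(t, freq)) @ (z / freq)

# example: one sample path on a grid of 256 time points
# path = brownian_motion_kl(np.linspace(0.0, 1.0, 256))
```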
