About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
71

False Discovery Rates, Higher Criticism and Related Methods in High-Dimensional Multiple Testing

Klaus, Bernd 09 January 2013 (has links)
The technical advances in genomics, functional magnetic resonance imaging and other areas of scientific research seen in the last two decades have led to a burst of interest in multiple testing procedures. A driving factor for innovations in the field of multiple testing has been the problem of large-scale simultaneous testing, where the goal is to uncover low-dimensional signals in high-dimensional data. Mathematically speaking, this means that the dimension d is usually in the thousands while the sample size n is relatively small (generally at most around 100, often due to cost constraints), a setting commonly abbreviated as d >> n. In this thesis I look at several multiple testing problems and corresponding procedures from a false discovery rate (FDR) perspective, a methodology originally introduced in the seminal paper by Benjamini and Hochberg (1995). FDR analysis starts by fitting a two-component mixture model to the observed test statistics; the mixture consists of a null model density and an alternative component density from which the interesting cases are assumed to be drawn. I propose a new approach, called log-FDR, to the estimation of false discovery rates: a new truncated maximum likelihood procedure yields accurate null model estimates, and this is complemented by constrained maximum likelihood estimation of the alternative density using log-concave density estimation. A recent competitor to the FDR is the method of "Higher Criticism". It has been strongly advocated in the context of variable selection in classification, which is deeply linked to multiple comparisons. Hence, I also look at variable selection in class prediction, which can be viewed as a special signal identification problem. Both FDR methods and Higher Criticism can be highly useful for signal identification; this is discussed in the context of variable selection in linear discriminant analysis (LDA), a popular classification method. FDR methods are not only useful for multiple testing situations in the strict sense, they are also applicable to related problems. I look at several applications of FDR in linear classification, presenting and extending statistical techniques for effect size estimation based on false discovery rates and showing how to use them for variable selection. The resulting fdr-effect method for effect size estimation is shown to work as well as competing approaches while being conceptually simple and computationally inexpensive. Additionally, I apply the fdr-effect method to variable selection by minimizing the misclassification rate and show that it works very well and leads to compact and interpretable feature sets.
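The two-component mixture at the heart of this approach can be sketched in a few lines. The snippet below is a generic illustration with a theoretical N(0,1) null, simulated z-scores and a kernel estimate of the marginal density; it is not the thesis's truncated-ML / log-concave (log-FDR) estimator, and the 0.2 cutoff is an arbitrary illustration value.

```python
# Illustrative two-component mixture view of FDR:
#   f(z) = pi0 * f0(z) + (1 - pi0) * f1(z),  local fdr(z) = pi0 * f0(z) / f(z).
# Theoretical N(0,1) null plus a kernel estimate of the marginal density;
# NOT the truncated-ML / log-concave estimator proposed in the thesis.
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
m, m1 = 5000, 250                                  # total tests, true signals
z = np.concatenate([rng.normal(0.0, 1.0, m - m1),  # null cases
                    rng.normal(3.0, 1.0, m1)])     # alternative cases

# Storey-type estimate of the null proportion pi0 from p-values above 0.5
p = 2 * norm.sf(np.abs(z))
pi0 = min(1.0, np.mean(p > 0.5) / 0.5)

f = gaussian_kde(z)                                # marginal density estimate
local_fdr = np.clip(pi0 * norm.pdf(z) / f(z), 0.0, 1.0)

print(f"estimated pi0: {pi0:.3f}")
print("cases called at local fdr < 0.2:", int(np.sum(local_fdr < 0.2)))
```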
72

Nouvelles méthodes pour l’apprentissage non-supervisé en grandes dimensions. / New methods for large-scale unsupervised learning.

Tiomoko Ali, Hafiz 24 September 2018 (has links)
Motivée par les récentes avancées dans l'analyse théorique des performances des algorithmes d'apprentissage automatisé, cette thèse s'intéresse à l'analyse de performances et à l'amélioration de la classification non supervisée de données et graphes en grande dimension. Spécifiquement, dans la première grande partie de cette thèse, en s'appuyant sur des outils avancés de la théorie des grandes matrices aléatoires, nous analysons les performances de méthodes spectrales sur des modèles de graphes réalistes et denses ainsi que sur des données en grandes dimensions en étudiant notamment les valeurs propres et vecteurs propres des matrices d'affinités de ces données. De nouvelles méthodes améliorées sont proposées sur la base de cette analyse théorique et démontrent à travers de nombreuses simulations que leurs performances sont meilleures comparées aux méthodes de l'état de l'art. Dans la seconde partie de la thèse, nous proposons un nouvel algorithme pour la détection de communautés hétérogènes entre plusieurs couches d'un graphe à plusieurs types d'interaction. Une approche bayésienne variationnelle est utilisée pour approximer la distribution a posteriori des variables latentes du modèle. Toutes les méthodes proposées dans cette thèse sont utilisées sur des bases de données synthétiques et sur des données réelles et présentent de meilleures performances en comparaison aux approches standard de classification dans les contextes susmentionnés. / Spurred by recent advances in the theoretical analysis of the performance of data-driven machine learning algorithms, this thesis tackles the performance analysis and improvement of high-dimensional data and graph clustering. Specifically, in the first, larger part of the thesis, using advanced tools from random matrix theory, the performance of spectral methods on dense realistic graph models and on high-dimensional kernel random matrices is analyzed through the study of the eigenvalues and eigenvectors of the similarity matrices characterizing those data. New improved methods are proposed on the basis of this theoretical analysis and are shown to outperform state-of-the-art approaches. In a second part, a new algorithm is proposed for the detection of heterogeneous communities from multi-layer graphs, using variational Bayes approaches to approximate the posterior distribution of the sought latent variables. The proposed methods are successfully applied to synthetic benchmarks as well as real-world datasets and are shown to outperform standard approaches to clustering in those specific contexts.
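As a concrete illustration of the kind of spectral methods whose eigenvalue and eigenvector behaviour the first part analyzes, the sketch below clusters simulated high-dimensional data from the leading eigenvectors of a Gaussian-kernel affinity matrix. It is a generic baseline, not one of the improved methods proposed in the thesis; the cluster means, kernel bandwidth and sample sizes are arbitrary illustration choices.

```python
# Baseline spectral clustering on a Gaussian-kernel affinity matrix:
# form the affinity, normalize it, take its leading eigenvectors and run
# k-means on them. Illustrates the class of methods analyzed in the thesis,
# not the improved variants it proposes.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, p = 200, 400                                   # n samples, p > n dimensions
X = np.vstack([rng.normal(-0.3, 1.0, (n // 2, p)),
               rng.normal(+0.3, 1.0, (n // 2, p))])
truth = np.repeat([0, 1], n // 2)

sq = np.sum(X ** 2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # squared pairwise distances
K = np.exp(-d2 / (2.0 * p))                       # kernel affinity matrix
Dinv = np.diag(1.0 / np.sqrt(K.sum(axis=1)))
A = Dinv @ K @ Dinv                               # normalized affinity

_, vecs = np.linalg.eigh(A)                       # eigenvectors, ascending order
U = vecs[:, -2:]                                  # two leading eigenvectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(U)
accuracy = max(np.mean(labels == truth), np.mean(labels != truth))
print(f"clustering accuracy: {accuracy:.2f}")
```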
73

Statistical Inference for Change Points in High-Dimensional Offline and Online Data

Li, Lingjun 07 April 2020 (has links)
No description available.
74

Geometry of high dimensional Gaussian data

Mossberg, Olof Samuel January 2024 (has links)
Collected data may simultaneously be of low sample size and high dimension. Such data exhibit certain geometric regularities: a single observation behaves like a random rotation on a sphere, and a pair of observations is nearly orthogonal. This thesis investigates these geometric properties in some detail. Background is provided and various approaches to the result are discussed. An approach based on the mean value theorem is eventually chosen, being the only candidate investigated that gives explicit convergence bounds. The bounds are tested employing Monte Carlo simulation and found to be adequate. / Data som insamlas kan samtidigt ha en liten stickprovsstorlek men vara högdimensionell. Sådan data uppvisar vissa geometriska mönster som består av att en enskild observation är en rotation på en sfär, och att ett par av observationer är rätvinkliga. Den här uppsatsen undersöker dessa geometriska egenskaper mer detaljerat. En bakgrund ges och olika typer av angreppssätt diskuteras. Till slut väljs en metod som baseras på medelvärdessatsen eftersom detta är den enda av de undersökta metoderna som ger explicita konvergensgränser. Gränserna testas sedermera med Monte Carlo-simulering och visar sig stämma.
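The two regularities can be checked directly by simulation. The sketch below (an illustration only, not the thesis's mean-value-theorem argument) draws standard Gaussian vectors and reports how the scaled norm concentrates around 1 and how the angle between independent draws concentrates around 90 degrees as the dimension grows.

```python
# Monte Carlo check of the geometric regularities of high-dimensional
# Gaussian data: for x, y ~ N(0, I_d), ||x||/sqrt(d) concentrates around 1
# and the cosine of the angle between x and y concentrates around 0.
import numpy as np

rng = np.random.default_rng(2024)
for d in (10, 100, 10_000):
    x = rng.standard_normal((500, d))
    y = rng.standard_normal((500, d))
    scaled_norm = np.linalg.norm(x, axis=1) / np.sqrt(d)
    cosine = np.sum(x * y, axis=1) / (np.linalg.norm(x, axis=1)
                                      * np.linalg.norm(y, axis=1))
    print(f"d={d:6d}  ||x||/sqrt(d): mean={scaled_norm.mean():.3f}, "
          f"sd={scaled_norm.std():.3f}   |cos(angle)|: mean={np.abs(cosine).mean():.3f}")
```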
75

Adaptive Mixture Estimation and Subsampling PCA

Liu, Peng January 2009 (has links)
No description available.
76

Partial EM Procedure for Big-Data Linear Mixed Effects Model, and Generalized PPE for High-Dimensional Data in Julia

Cho, Jang Ik 31 August 2018 (has links)
No description available.
77

Sparse Principal Component Analysis for High-Dimensional Data: A Comparative Study

Bonner, Ashley J. 10 1900 (has links)
Background: Through unprecedented advances in technology, high-dimensional datasets have exploded into many fields of observational research. For example, it is now common to expect thousands or millions of genetic variables (p) with only a limited number of study participants (n). Determining the important features proves statistically difficult, as multivariate analysis techniques become overwhelmed and mathematically insufficient when n < p. Principal Component Analysis (PCA) is a commonly used multivariate method for dimension reduction and data visualization but suffers from these issues. A collection of Sparse PCA methods have been proposed to counter these flaws but have not been tested in comparative detail. Methods: The performance of three Sparse PCA methods was evaluated through simulations. Data were generated for 56 different data structures, varying p, the number of underlying groups and the variance structure within them. Estimation and interpretability of the principal components (PCs) were rigorously tested. Sparse PCA methods were also applied to a real gene expression dataset. Results: All Sparse PCA methods showed improvements upon classical PCA. Some methods were best at obtaining an accurate leading PC only, whereas others were better for subsequent PCs. Different Sparse PCA methods were optimal depending on the within-group correlation and across-group variances; thankfully, one method repeatedly worked well under the most difficult scenarios. When applying the methods to real data, concise groups of gene expressions were detected with the sparsest methods. Conclusions: Sparse PCA methods provide a new insightful way to detect important features amidst complex high-dimensional data. / Master of Science (MSc)
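As a flavour of the comparison, the sketch below contrasts classical PCA with one off-the-shelf sparse PCA implementation (scikit-learn's SparsePCA, which is not necessarily among the three methods compared in the thesis) on simulated data in which only a small block of variables carries the leading component. The sample size, dimension and penalty value are arbitrary illustration choices.

```python
# Classical PCA vs. one sparse PCA implementation on simulated n << p data
# where only the first k variables share a common component. SparsePCA from
# scikit-learn is used purely as a convenient example implementation.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
n, p, k = 40, 500, 20                       # samples, dimension, signal block
scores = rng.normal(size=(n, 1))
X = rng.normal(size=(n, p))
X[:, :k] += 3.0 * scores                    # first k variables load on one PC

pca_load = PCA(n_components=1).fit(X).components_[0]
spca_load = SparsePCA(n_components=1, alpha=2.0, random_state=0).fit(X).components_[0]

print("nonzero loadings, classical PCA:", int(np.sum(np.abs(pca_load) > 1e-8)))
print("nonzero loadings, sparse PCA:   ", int(np.sum(np.abs(spca_load) > 1e-8)))
print("sparse nonzeros inside the true signal block:",
      int(np.sum(np.abs(spca_load[:k]) > 1e-8)))
```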
78

The Growth Curve Model for High Dimensional Data and its Application in Genomics

Jana, Sayantee 04 1900 (has links)
Recent advances in technology have allowed researchers to collect high-dimensional biological data simultaneously. In genomic studies, for instance, measurements from tens of thousands of genes are taken from individuals across several experimental groups. In time course microarray experiments, gene expression is measured at several time points for each individual across the whole genome, resulting in massive amounts of data. In such experiments, researchers are faced with two types of high-dimensionality. The first is global high-dimensionality, which is common to all genomic experiments and arises because inference is being done on tens of thousands of genes, resulting in multiplicity. This challenge is often dealt with using statistical methods for multiple comparison, such as the Bonferroni correction or the false discovery rate (FDR). We refer to the second type of high-dimensionality as gene-specific high-dimensionality, which arises in time course microarray experiments because the sample size is often smaller than the number of time points.

In this thesis, we use the growth curve model (GCM), which is a generalized multivariate analysis of variance (GMANOVA) model, and propose a moderated test statistic for testing a special case of the general linear hypothesis, which is especially useful for identifying genes that are expressed. We use the trace test for the GCM and modify it so that it can be used in high-dimensional situations. We consider two types of moderation: the Moore-Penrose generalized inverse and Stein's shrinkage estimator of $S$. We performed extensive simulations to show the performance of the moderated test and compared the results with the original trace test, calculating the empirical level and power of the test under many scenarios. Although the focus is on hypothesis testing, we also provide a moderated maximum likelihood estimator for the parameter matrix and assess its performance by investigating the bias and mean squared error of the estimator, comparing the results with those of the maximum likelihood estimators. Since the parameters are matrices, we use distance measures in both power and level comparisons as well as when investigating bias and mean squared error. We also illustrate our approach using time course microarray data taken from a study on lung cancer. We were able to filter out 1053 genes as non-noise genes from a pool of 22,277 genes, which is approximately 5% of the total number of genes. This is in line with results from most biological experiments, where around 5% of genes are found to be differentially expressed. / Master of Science (MSc)
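For readers unfamiliar with the GMANOVA formulation, one common way of writing the growth curve model and its classical estimator is sketched below; the notation and conventions may differ from those of the thesis. With p time points exceeding the sample size n, the matrix S becomes singular, which is exactly where the Moore-Penrose and Stein-shrinkage moderations described above come in.

```latex
% Growth curve model (one common convention; the thesis's notation may differ)
% Y : n x p (individuals x time points),  X : n x m (between-individual design),
% B : m x q (parameters),  Z : q x p (within-individual, e.g. time, design)
Y = X B Z + E, \qquad \text{rows of } E \sim N_p(0, \Sigma).

% Classical maximum likelihood estimator of B, valid only when S is invertible:
\hat{B} = (X^{\top}X)^{-1} X^{\top} Y S^{-1} Z^{\top} \left(Z S^{-1} Z^{\top}\right)^{-1},
\qquad S = Y^{\top}\!\left(I_n - X (X^{\top}X)^{-1} X^{\top}\right) Y.
```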
79

Canonical Correlation and Clustering for High Dimensional Data

Ouyang, Qing January 2019 (has links)
Multi-view datasets arise naturally in statistical genetics when the genetic and trait profile of an individual is portrayed by two feature vectors. A motivating problem concerning the Skin Intrinsic Fluorescence (SIF) study on the Diabetes Control and Complications Trial (DCCT) subjects is presented. A widely applied quantitative method to explore the correlation structure between the two domains of a multi-view dataset is Canonical Correlation Analysis (CCA), which seeks the canonical loading vectors such that the transformed canonical covariates are maximally correlated. In the high-dimensional case, regularization of the dataset is required before CCA can be applied. Furthermore, the nature of genetic research suggests that sparse output is more desirable. In this thesis, two regularized CCA (rCCA) methods and a sparse CCA (sCCA) method are presented. When a correlation sub-structure exists, a stand-alone CCA method will not perform well. To tackle this limitation, a mixture of local CCA models can be employed. In this thesis, I review a correlation clustering algorithm proposed by Fern, Brodley and Friedl (2005), which seeks to group subjects into clusters such that features are identically correlated within each cluster. An evaluation study is performed to assess the effectiveness of the CCA and correlation clustering algorithms using artificial multi-view datasets. Both sCCA and sCCA-based correlation clustering exhibited superior performance compared to rCCA and rCCA-based correlation clustering. The sCCA and sCCA-based clustering methods are then applied to the multi-view dataset consisting of PrediXcan-imputed gene expression and SIF measurements of DCCT subjects. The stand-alone sparse CCA method identified 193 among 11,538 genes as being correlated with SIF#7. Further investigation of these 193 genes with simple linear regression and t-tests revealed that only two genes, ENSG00000100281.9 and ENSG00000112787.8, were significantly associated with SIF#7. No plausible clustering scheme was detected by the sCCA-based correlation clustering method. / Thesis / Master of Science (MSc)
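For orientation, the sketch below runs plain (unregularized) CCA on a simulated two-view dataset with a single shared latent factor, using scikit-learn's CCA; the regularized and sparse variants studied in the thesis add penalties on the loading vectors and are not shown here. The view dimensions and noise level are arbitrary illustration choices.

```python
# Plain CCA on a simulated two-view dataset that shares one latent factor.
# scikit-learn's CCA is unregularized; the rCCA/sCCA variants discussed in
# the thesis add ridge or sparsity penalties on the canonical loadings.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(7)
n = 300
latent = rng.normal(size=(n, 1))                       # shared signal
X = latent + rng.normal(scale=1.0, size=(n, 10))       # view 1
Y = latent + rng.normal(scale=1.0, size=(n, 8))        # view 2

cca = CCA(n_components=1).fit(X, Y)
u, v = cca.transform(X, Y)                             # canonical covariates
r = np.corrcoef(u[:, 0], v[:, 0])[0, 1]
print(f"leading canonical correlation: {r:.2f}")
```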
80

[pt] ENSAIOS SOBRE VOLATILIDADE E PREVISIBILIDADE DE RETORNOS / [en] ESSAYS ON VOLATILITY AND RETURNS PREDICTABILITY

IURI HONDA FERREIRA 18 August 2022 (has links)
[pt] Essa tese é composta por três artigos em econometria financeira. Os dois primeiros artigos exploram a relação entre retornos intradiários do mercado de equities e a implied volatility, representada pelo Índice de Volatilidade da CBOE (VIX). Nos dois artigos, estimamos previsões um minuto à frente utilizando janelas rolantes para cada dia. No primeiro artigo, as estimativas indicam que nossos modelos de fatores de volatilidade têm uma performance superior a benchmarks tradicionais em uma análise de séries de tempo em alta frequência, mesmo ao excluirmos períodos de crise da amostra. Os resultados também indicam uma performance fora da amostra maior para dias em que não ocorrem anúncios macroeconômicos. A performance é ainda maior quando removemos períodos de crise. O segundo artigo propõe uma abordagem de aprendizado de máquinas para modelar esse exercício de previsão. Implementamos um método de estimação intradiário minuto a minuto com janelas móveis, utilizando dois tipos de modelos não lineares: redes neurais com Long Short-Term Memory (LSTM) e Random Forests (RF). Nossas estimativas mostram que o VIX é o melhor previsor de retornos de mercado intradiários entre os candidatos na nossa análise, especialmente quando implementadas através do modelo LSTM. Esse modelo também melhora significativamente a performance quando utilizamos o retorno de mercado defasado como variável preditiva. Finalmente, o último artigo explora uma extensão multivariada do método FarmPredict, combinando modelos vetoriais autoregressivos aumentados em fatores (FAVAR) e modelos esparsos em um ambiente de alta dimensão. Utilizando um procedimento de três estágios, somos capazes de estimar e prever fatores e seus loadings, que podem ser observados, não observados ou ambos, assim como uma estrutura idiossincrática fracamente esparsa. Realizamos uma aplicação dessa metodologia em um painel de volatilidades realizadas e os resultados de performance do método em etapas indicam melhorias quando comparado a benchmarks consolidados. / [en] This thesis is composed of three papers on financial econometrics. The first two papers explore the relation between intraday equity market returns and implied volatility, represented by the CBOE Volatility Index (VIX). In both papers, we estimate one-minute-ahead forecasts using rolling windows within a day. In the first paper, the estimates indicate that our volatility factor models outperform traditional benchmarks in high-frequency time-series analysis, even when crisis periods are excluded. We also find that the model has better out-of-sample performance on days without macroeconomic announcements; interestingly, these results are amplified when we remove the crisis period. The second paper proposes a machine learning approach to this forecasting exercise. We implement a minute-by-minute rolling-window intraday estimation method using two nonlinear models: Long Short-Term Memory (LSTM) neural networks and Random Forests (RF). Our estimations show that the VIX is the strongest candidate predictor of intraday market returns in our analysis, especially when implemented through the LSTM model, which also significantly improves the performance of the lagged market return as a predictive variable. Finally, the third paper explores a multivariate extension of the FarmPredict method, combining factor-augmented vector autoregressive (FAVAR) and sparse models in a high-dimensional environment. Using a three-stage procedure, we estimate and forecast factors and their loadings, which can be observed, unobserved, or both, as well as a weakly sparse idiosyncratic structure. We provide an application of this methodology to a panel of daily realized volatilities, and the accuracy of the stepwise method indicates improvements over consolidated benchmarks.
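A stripped-down version of the first two papers' forecasting exercise is sketched below: on simulated minute data within a single trading day, a small linear model is refit on a rolling window and used to predict the next minute's return from the lagged return and the lagged change in an implied-volatility index. It is only a schematic stand-in for the volatility-factor, LSTM and random-forest models actually used; the window length, data-generating process and parameter values are illustrative assumptions.

```python
# Rolling-window, one-minute-ahead forecasting on simulated intraday data.
# A linear model is refit every minute on the previous `window` observations
# and used to predict the next return from the lagged return and the lagged
# change in a VIX-like index. Schematic only; not the models of the essays.
import numpy as np

rng = np.random.default_rng(42)
T, window = 390, 60                         # minutes in a trading day, window size
vix_change = rng.normal(scale=0.2, size=T)  # simulated changes in implied vol
noise = rng.normal(scale=0.1, size=T)
ret = np.empty(T)
ret[0] = noise[0]
ret[1:] = -0.5 * vix_change[:-1] + noise[1:]   # returns respond to lagged vol moves

preds, actual = [], []
for t in range(window + 1, T):
    Xw = np.column_stack([np.ones(window),
                          ret[t - window - 1:t - 1],          # lagged return
                          vix_change[t - window - 1:t - 1]])  # lagged VIX change
    yw = ret[t - window:t]
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    preds.append(beta @ np.array([1.0, ret[t - 1], vix_change[t - 1]]))
    actual.append(ret[t])

preds, actual = np.array(preds), np.array(actual)
oos_r2 = 1.0 - np.sum((actual - preds) ** 2) / np.sum((actual - actual.mean()) ** 2)
print(f"out-of-sample R^2 over the day: {oos_r2:.3f}")
```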
