71 |
Locality Sensitive Indexing for Efficient High-Dimensional Query Answering in the Presence of Excluded Regions
January 2016
abstract: Similarity search in high-dimensional spaces is common in applications such as image
processing, time series analysis, and genome data. In higher dimensions, the
curse of dimensionality undermines the effectiveness of most index structures, giving
way to approximate methods such as Locality Sensitive Hashing (LSH) for answering similarity
searches. In addition to range searches and k-nearest neighbor searches, there
is a need to answer negative queries formed by excluded regions in high-dimensional
data. Although many variants of LSH have been proposed to improve efficiency, reduce
storage, and provide better accuracy, none of these techniques is capable of
answering queries in the presence of excluded regions.
This thesis provides a novel approach to handling such negative queries. This is
achieved by creating a prefix-based hierarchical index structure. First, the
high-dimensional space is projected to a lower-dimensional space. Then, a one-dimensional
ordering is developed that retains the hierarchical traits. The algorithm
prunes irrelevant candidates while answering queries in the presence of
excluded regions. Whereas naive LSH would need to filter the negative query results
out of the main results, the new algorithm minimizes the need to fetch the redundant
results in the first place. Experimental results show that this reduces post-processing
cost, thereby reducing query processing time. / Dissertation/Thesis / Masters Thesis Computer Science 2016
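For orientation, the naive baseline that this abstract contrasts against (fetch LSH candidates, then filter out those inside the excluded region) can be sketched as below. This is not the thesis's prefix-based index: the sign-of-random-projection hash, the single hash table, the spherical excluded region, and all function names are assumptions made for illustration only.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (hypothetical hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(point, planes):
    """One bit per hyperplane: which side of the plane the point falls on."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, point)) >= 0.0 else 0
        for plane in planes
    )


def build_index(points, planes):
    """Group point ids by their hash signature (a single hash table)."""
    index = defaultdict(list)
    for i, p in enumerate(points):
        index[signature(p, planes)].append(i)
    return index


def query_with_exclusion(q, radius, excl_center, excl_radius, points, index, planes):
    """Naive approach: fetch the query's bucket, then post-filter.

    Candidates inside the excluded ball are fetched and only then thrown
    away, which is the post-processing cost the abstract wants to avoid.
    """
    hits = []
    for i in index.get(signature(q, planes), []):
        p = points[i]
        if math.dist(p, q) <= radius and math.dist(p, excl_center) > excl_radius:
            hits.append(i)
    return hits
```

Every candidate in the bucket is touched before the exclusion test runs, so the cost of the filter grows with the number of redundant candidates fetched.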
|
72 |
New Directions in Sparse Models for Image Analysis and Restoration
January 2013
abstract: Effective modeling of high dimensional data is crucial in information processing and machine learning. Classical subspace methods have been very effective in such applications. However, over the past few decades, there has been considerable research towards the development of new modeling paradigms that go beyond subspace methods. This dissertation focuses on the study of sparse models and their interplay with modern machine learning techniques such as manifold, ensemble and graph-based methods, along with their applications in image analysis and recovery. By considering graph relations between data samples while learning sparse models, graph-embedded codes can be obtained for use in unsupervised, supervised and semi-supervised problems. Using experiments on standard datasets, it is demonstrated that the codes obtained from the proposed methods outperform several baseline algorithms. In order to facilitate sparse learning with large scale data, the paradigm of ensemble sparse coding is proposed, and different strategies for constructing weak base models are developed. Experiments with image recovery and clustering demonstrate that these ensemble models perform better when compared to conventional sparse coding frameworks. When examples from the data manifold are available, manifold constraints can be incorporated with sparse models and two approaches are proposed to combine sparse coding with manifold projection. The improved performance of the proposed techniques in comparison to sparse coding approaches is demonstrated using several image recovery experiments. In addition to these approaches, it might be required in some applications to combine multiple sparse models with different regularizations. In particular, combining an unconstrained sparse model with non-negative sparse coding is important in image analysis, and it poses several algorithmic and theoretical challenges. 
A convex and an efficient greedy algorithm for recovering combined representations are proposed. Theoretical guarantees on sparsity thresholds for exact recovery using these algorithms are derived and recovery performance is also demonstrated using simulations on synthetic data. Finally, the problem of non-linear compressive sensing, where the measurement process is carried out in feature space obtained using non-linear transformations, is considered. An optimized non-linear measurement system is proposed, and improvements in recovery performance are demonstrated in comparison to using random measurements as well as optimized linear measurements. / Dissertation/Thesis / Ph.D. Electrical Engineering 2013
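As background for the greedy recovery idea mentioned in this abstract, a minimal sketch of plain matching pursuit is given below. It is a textbook greedy sparse-coding routine, not the combined-representation algorithm proposed in the dissertation; the function names and the assumption that every dictionary atom has unit norm are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, atoms, n_nonzero):
    """Greedy sparse coding with plain matching pursuit.

    Assumes every atom has unit Euclidean norm. At each step the atom most
    correlated with the residual is selected and its contribution removed.
    """
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        scores = [dot(residual, a) for a in atoms]
        best = max(range(len(atoms)), key=lambda j: abs(scores[j]))
        if scores[best] == 0.0:
            break  # residual is orthogonal to every atom
        coeffs[best] += scores[best]
        residual = [r - scores[best] * a for r, a in zip(residual, atoms[best])]
    return coeffs, residual
```

With an orthonormal dictionary each step recovers one coefficient exactly; with a redundant dictionary the same atom may be selected more than once.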
|
73 |
Apprentissage de données génomiques multiples pour le diagnostic et le pronostic du cancer / Learning from multiple genomic information in cancer for diagnosis and prognosis
Moarii, Matahi 26 June 2015
Several initiatives have been launched recently to characterise large cohorts of human cancers at the molecular level using various high-throughput technologies, in order to understand the major biological alterations related to tumorigenesis. The measured data include gene expression, mutations, and copy-number variations, as well as epigenetic signals such as DNA methylation. Large consortia such as “The Cancer Genome Atlas” (TCGA) have already made thousands of cancerous and non-cancerous samples publicly available. In this thesis, we contribute to the statistical analysis of the relationships between the different biological sources, and to the validation and/or large-scale generalisation of biological phenomena, using an integrative analysis of genetic and epigenetic data.
Firstly, we show the role of DNA methylation as a surrogate biomarker of clonality between cells, which would allow for a powerful clinical tool to elaborate appropriate treatments for specific patients with breast cancer relapses.
In addition, we developed systematic statistical analyses to assess the significance of DNA methylation variations on gene expression regulation. We highlight the importance of adding prior knowledge to tackle the small number of samples in comparison with the number of variables. In return, we show the potential of bioinformatics to infer new interesting biological hypotheses.
Finally, we tackle the existence of a universal biological phenomenon related to the hypermethylator phenotype. Here, we adapt regression techniques using the similarity between the different prediction tasks to obtain robust genetic predictive signatures that are common to all cancers and allow for better prediction accuracy.
In conclusion, we highlight the importance of collaboration between biology and computation in order to establish methods suited to current issues in bioinformatics, which will in turn provide new biological insights.
|
74 |
Partition clustering of High Dimensional Low Sample Size data based on P-Values
Von Borries, George Freitas January 1900
Doctor of Philosophy / Department of Statistics / Haiyan Wang / This thesis introduces a new partitioning algorithm to cluster variables in high dimensional low sample size (HDLSS) data and high dimensional longitudinal low sample size (HDLLSS) data. HDLSS data contain a large number of variables with a small number of replications per variable, and HDLLSS data refer to HDLSS data observed over time.
Clustering techniques play an important role in analyzing high dimensional low sample size data, as commonly seen in microarray experiments, mass spectrometry data, and pattern recognition. Most current clustering algorithms for HDLSS and HDLLSS data are adaptations from traditional multivariate analysis, where the number of variables is not high and sample sizes are relatively large. Current algorithms show poor performance when applied to high dimensional data, especially in small sample size cases. In addition, available algorithms often exhibit poor clustering accuracy and stability for non-normal data. Simulations show that traditional clustering algorithms used in high dimensional data are not robust to monotone transformations.
The proposed clustering algorithm PPCLUST is a powerful tool for clustering HDLSS data, which uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity between groups of variables. Inheriting the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of data. PPCLUSTEL is an extension of PPCLUST for clustering of HDLLSS data. A nonparametric test of no simple effect of group is developed and the p-value from the test is used as a measure of similarity between groups of variables.
PPCLUST and PPCLUSTEL are able to cluster a large number of variables in the presence of very few replications and, in the case of PPCLUSTEL, the algorithm requires neither a large number of time points nor equally spaced ones. PPCLUST and PPCLUSTEL do not suffer from loss of power due to distributional assumptions, general multiple comparison problems, or difficulty in controlling heteroscedastic variances. Applications with available data from previous microarray studies show promising results, and simulation studies reveal that the algorithm outperforms a series of benchmark algorithms applied to HDLSS data, exhibiting high clustering accuracy and stability.
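To illustrate why a rank-based p-value is invariant to monotone transformations, here is a small sketch using a two-sample Wilcoxon rank-sum test with a normal approximation. This is a stand-in, not the PPCLUST test itself: the choice of test, the omission of tie and continuity corrections, and the function names are all assumptions.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_p_value(x, y):
    """Two-sided rank-sum p-value via the normal approximation.

    No tie or continuity correction is applied; this is a sketch only.
    """
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2.0
    var = n1 * n2 * (n1 + n2 + 1) / 12.0
    z = (w - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))
```

Because the statistic depends on the data only through ranks, applying a strictly increasing function such as `math.exp` to every observation leaves the p-value unchanged, so a similarity measure built on it treats raw and transformed data identically.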
|
75 |
The Generalized Monotone Incremental Forward Stagewise Method for Modeling Longitudinal, Clustered, and Overdispersed Count Data: Application Predicting Nuclear Bud and Micronuclei Frequencies
Lehman, Rebecca 01 January 2017
With the influx of high-dimensional data there is an immediate need for statistical methods that are able to handle situations when the number of predictors greatly exceeds the number of samples. One such area of growth is in examining how environmental exposures to toxins impact the body long term. The cytokinesis-block micronucleus assay can measure the genotoxic effect of exposure as a count outcome. To investigate potential biomarkers, high-throughput assays that assess gene expression and methylation have been developed. It is of interest to identify biomarkers or molecular features that are associated with elevated micronuclei (MN) or nuclear bud (Nbud) frequency, measures of exposure to environmental toxins.
Given our desire to model a count outcome (MN and Nbud frequency) using high-throughput genomic features as predictors, novel methods that can handle over-parameterized models need development. Overdispersion, when the variance of a count outcome is larger than its mean, is frequently observed with count response data. For situations where overdispersion is present, the negative binomial distribution is more appropriate than the Poisson distribution. The method we have chosen to extend is the Generalized Monotone Incremental Forward Stagewise (GMIFS) method. We extend the GMIFS to the negative binomial distribution so that it may be used to analyze a count outcome when both a high-dimensional predictor space and overdispersion are present. Our methods were compared to glmpath. We also extend the GMIFS to the longitudinal Poisson and longitudinal negative binomial distributions for modeling a longitudinal or clustered outcome under both equidispersion and overdispersion. Our methods were compared to glmmLasso and GLMMLasso. The developed methods were used to analyze two datasets, one from the Norwegian Mother and Child Cohort study and one from the breast cancer epigenomic study conducted by researchers at Virginia Commonwealth University. In both studies a count outcome measured exposure to potential genotoxins, and either gene expression or high-throughput methylation data formed a high dimensional predictor space. Further, the breast cancer study was longitudinal, such that outcomes and high-dimensional genomic features were collected at multiple time points during the study for each patient. Our goal is to identify biomarkers that are associated with elevated MN or Nbud frequency.
From the development of these methods, we hope to make available more comprehensive statistical models for analyzing count outcomes with high dimensional predictor spaces and either cross-sectional or longitudinal study designs.
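The notion of overdispersion used in this abstract can be checked on a sample of counts with a few lines of code. The sketch below is not part of the GMIFS method; it assumes the common NB2 parameterization of the negative binomial, in which the variance is mu + mu^2 / theta, and uses a simple method-of-moments estimate of theta. The function name and return layout are illustrative.

```python
import statistics


def overdispersion_summary(counts):
    """Return (mean, variance, variance/mean ratio, moment estimate of theta).

    Assumes the NB2 form var = mean + mean**2 / theta. theta is None when the
    sample shows no overdispersion (variance <= mean), where a Poisson model
    would be the natural starting point.
    """
    mean = statistics.mean(counts)
    var = statistics.variance(counts)  # sample variance, n - 1 denominator
    ratio = var / mean if mean > 0 else float("nan")
    theta = mean ** 2 / (var - mean) if var > mean else None
    return mean, var, ratio, theta
```

A ratio well above 1 signals overdispersion; a smaller theta corresponds to a heavier departure from the Poisson variance.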
|
76 |
Scalable sparse machine learning methods for big data
Zeng, Yaohui 15 December 2017
Sparse machine learning models have become increasingly popular in analyzing high-dimensional data. With the evolving era of Big Data, ultrahigh-dimensional, large-scale data sets are constantly collected in many areas such as genetics, genomics, biomedical imaging, social media analysis, and high-frequency finance. Mining valuable information efficiently from these massive data sets requires not only novel statistical models but also advanced computational techniques. This thesis focuses on the development of scalable sparse machine learning methods to facilitate Big Data analytics.
Built upon the feature screening technique, the first part of this thesis proposes a family of hybrid safe-strong rules (HSSR) that incorporate safe screening rules into the sequential strong rule to remove unnecessary computational burden when solving lasso-type models. We present two instances of HSSR, namely SSR-Dome and SSR-BEDPP, for the standard lasso problem. We further extend SSR-BEDPP to the elastic net and group lasso problems to demonstrate the generalizability of the hybrid screening idea. In the second part, we design and implement an R package called biglasso to extend lasso model fitting to Big Data in R. Our package biglasso utilizes memory-mapped files to store the massive data on disk, only reading data into memory when necessary during model fitting, and is thus able to handle data-larger-than-RAM cases seamlessly. Moreover, it is built upon our redesigned algorithm, which incorporates the proposed HSSR screening, making it much more memory- and computation-efficient than existing R packages. Extensive numerical experiments with synthetic and real data sets are conducted in both parts to show the effectiveness of the proposed methods.
In the third part, we consider a novel statistical model, namely the overlapping group logistic regression model, that allows for selecting important groups of features that are associated with binary outcomes in the setting where the features belong to overlapping groups. We conduct systematic simulations and real-data studies to show its advantages in the application of genetic pathway selection. We implement an R package called grpregOverlap that has HSSR screening built in for fitting overlapping group lasso models.
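To give a feel for what feature screening does, the following sketch applies the basic (non-sequential) strong rule for the lasso. It is neither HSSR nor the biglasso implementation: it assumes the objective 0.5 * ||y - X b||^2 + lam * ||b||_1, stores the design matrix as a list of columns, and uses hypothetical function names. The strong rule is a heuristic, so in practice discarded features must be re-checked against the KKT conditions, which is the gap that safe rules are meant to close.

```python
def inner(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def basic_strong_rule(columns, y, lam):
    """Return (indices of features kept, lam_max) under the basic strong rule.

    Feature j is discarded when |x_j' y| < 2 * lam - lam_max, where
    lam_max = max_j |x_j' y| is the smallest penalty giving an all-zero fit.
    """
    corr = [abs(inner(col, y)) for col in columns]
    lam_max = max(corr)
    threshold = 2.0 * lam - lam_max
    kept = [j for j, c in enumerate(corr) if c >= threshold]
    return kept, lam_max
```

Only the kept columns need to enter the solver at this value of lam, which is where the computational saving comes from.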
|
77 |
Application de la théorie des matrices aléatoires pour les statistiques en grande dimension / Application of Random Matrix Theory to High Dimensional Statistics
Bun, Joël 06 September 2016
Nowadays, it is easy to obtain large amounts of quantitative or qualitative data in many different fields. This access to new data has brought new challenges for data processing, and there are now many different numerical tools for exploiting very large databases. From a theoretical standpoint, this framework calls for new or refined results to deal with this amount of data. Indeed, it appears that most results of classical multivariate statistics become inaccurate in this era of “Big Data”. The aim of this thesis is twofold: first, to understand theoretically the so-called curse of dimensionality, which describes phenomena that arise in high-dimensional spaces; second, to propose statistical tools that explicitly exploit the dimension of the problem and extract signals that are consistent with it. We study the statistics of the eigenvalues, and especially the eigenvectors, of large symmetric matrices. We highlight that some universal properties of these eigenvectors can be extracted, which helps us construct estimators that are optimal, observable, and consistent with the high-dimensional framework.
|
78 |
ESSAYS IN HIGH-DIMENSIONAL ECONOMETRICS
Haiqing Zhao (9174302) 27 July 2020
My thesis consists of three chapters. The first chapter uses the Factor-augmented Error Correction Model in model averaging for predictive regressions, which provides significant improvements with large datasets in areas where the individual methods have not. I allow the candidate models to vary by the number of dependent variable lags, the number of factors, and the number of cointegration ranks. I show that the leave-h-out cross-validation criterion is an asymptotically unbiased estimator of the optimal mean squared forecast error, using either the estimated cointegration vectors or the nonstationary regressors. Empirical results demonstrate that including cointegration relationships significantly improves long-run forecasts of a standard set of macroeconomic variables. I also estimate simulation-based prediction intervals for six real and nominal macroeconomic variables. The results are consistent with the point estimates, which further supports the usefulness of cointegration in long-run forecasts.
The second chapter is a Monte Carlo study comparing the finite sample performance of six recently proposed estimation methods designed for large-dimensional regressions with endogeneity. The methods are based on combining shrinkage estimation with two-stage least squares (2SLS) or generalized method of moments (GMM), where both the number of regressors and the number of instruments can be large. The methods are evaluated in terms of bias and mean squared error of the estimators. I consider a variety of designs with practically relevant features such as weak instruments and heteroskedasticity, as well as cases where the number of observations is smaller/larger than the number of regressors/instruments. The consistency results show that the methods using GMM with shrinkage provide smaller estimation errors than the methods using 2SLS with shrinkage. Moreover, the results support the use of cross-validation to select tuning parameters if theoretically derived parameters are unavailable. Lastly, the results indicate that all instruments should correlate with at least one endogenous regressor to ensure estimation consistency.
The third chapter is coauthored with Mohitosh Kejriwal. We present new evidence on the nexus between democracy and growth employing the dynamic common correlated effects (DCCE) approach advanced by Chudik and Pesaran (2015), which is robust to both parameter heterogeneity and cross-section dependence. The DCCE results indicate a positive and statistically significant effect of democracy on economic growth, with a point estimate of approximately 1.5-2% depending on the specification. We complement our estimates with a battery of diagnostic tests for heterogeneity and cross-section dependence that corroborate the use of the DCCE approach.
|
79 |
Testing uniformity against rotationally symmetric alternatives on high-dimensional spheres
Cutting, Christine 04 June 2020
In this thesis we are interested in testing uniformity in high dimensions on the unit sphere $S^{p_n-1}$ (the dimension of the observations, $p_n$, depends on their number, $n$, and high-dimensional data are such that $p_n$ diverges to infinity with $n$). We first consider ``monotone'' alternatives whose density increases along an axis ${\pmb \theta}_n\in S^{p_n-1}$ and depends on a concentration parameter $\kappa_n>0$. We start by identifying the rate at which these alternatives are contiguous to uniformity; we then show, using local asymptotic normality results, that the most classical test of uniformity, the Rayleigh test, is not optimal when ${\pmb \theta}_n$ is specified but becomes optimal when $p$ is fixed and in the high-dimensional FvML case when ${\pmb \theta}_n$ is unspecified.
We next consider ``axial'' alternatives, which assign the same probability to antipodal points. They also depend on a location parameter ${\pmb \theta}_n\in S^{p_n-1}$ and a concentration parameter $\kappa_n\in\mathbb{R}$. The contiguity rate proves to be higher in that case, which suggests that the problem is more difficult than in the monotone case. Indeed, the Bingham test, the classical test when dealing with axial data, is not optimal when $p$ is fixed and ${\pmb \theta}_n$ is not specified, and is blind to the contiguous alternatives in high dimensions. This is why we turn to tests based on the extreme eigenvalues of the covariance matrix and establish their fixed-$p$ asymptotic distributions under contiguous alternatives.
Finally, using a martingale central limit theorem, we show that, under some assumptions and after standardisation, the Rayleigh and Bingham test statistics are asymptotically normal under general rotationally symmetric distributions. This enables us to identify the rate at which the Bingham test detects axial alternatives and also monotone alternatives. / Doctorat en Sciences / info:eu-repo/semantics/nonPublished
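For readers unfamiliar with the Rayleigh test named in this abstract, its classical fixed-$p$ statistic is $n\,p\,\|\bar{X}\|^2$, which is asymptotically chi-square with $p$ degrees of freedom under uniformity. The sketch below computes only this textbook statistic; it does not implement the high-dimensional standardisation studied in the thesis, and the function name is an assumption.

```python
def rayleigh_statistic(points):
    """Classical Rayleigh statistic n * p * ||mean||^2.

    `points` is a list of n unit vectors in dimension p. Large values speak
    against uniformity on the sphere; for fixed p the reference distribution
    under uniformity is chi-square with p degrees of freedom.
    """
    n = len(points)
    p = len(points[0])
    mean = [sum(x[k] for x in points) / n for k in range(p)]
    return n * p * sum(m * m for m in mean)
```

The statistic depends on the sample only through its mean vector, so any sample whose points cancel in antipodal pairs yields zero. That is consistent with the abstract's turn to the Bingham test and eigenvalue-based tests for axial alternatives.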
|
80 |
Advancing Bechhofer's Ranking Procedures to High-dimensional Variable Selection
Gu, Chao 01 September 2021
No description available.
|