1 |
High-dimensional statistical data integration. January 2019
Modern biomedical studies often collect multiple types of high-dimensional data on a common set of objects. A representative model for the integrative analysis of multiple data types decomposes each data matrix into a low-rank common-source matrix generated by latent factors shared across all data types, a low-rank distinctive-source matrix specific to each data type, and an additive noise matrix. We propose a novel decomposition method, called decomposition-based generalized canonical correlation analysis, which defines these matrices by imposing a desirable orthogonality constraint on the distinctive latent factors so that the common latent factors are sufficiently captured. To further delineate the common and distinctive patterns between two data types, we propose a second decomposition method, called common and distinctive pattern analysis, which takes into account the common and distinctive information between the coefficient matrices of the common latent factors. We develop consistent estimation approaches for both proposed decompositions under high-dimensional settings, and demonstrate their finite-sample performance via extensive simulations. We illustrate the superiority of the proposed methods over the state of the art on real-world data from The Cancer Genome Atlas and the Human Connectome Project. / Zhe Qu
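The three-part decomposition described above (each data matrix split into a common-source part driven by shared latent factors, a distinctive part orthogonal to them, and noise) can be sketched as follows. This is a simplified SVD-based illustration under assumed ranks, not the estimator proposed in the dissertation; the function name and rank parameters are hypothetical.

```python
import numpy as np

def split_common_distinctive(X_list, r_common, r_distinct):
    """Toy sketch: split each (n x p_k) data matrix into a common-source
    part driven by shared latent factors, a distinctive part orthogonal
    to those factors, and residual noise. Illustrative only."""
    # Stack all blocks and take leading left singular vectors as a
    # crude estimate of the shared latent factors.
    X_all = np.hstack(X_list)
    U, _, _ = np.linalg.svd(X_all, full_matrices=False)
    F = U[:, :r_common]                        # shared factors (n x r_common)
    parts = []
    for X in X_list:
        common = F @ (F.T @ X)                 # projection onto shared factors
        resid = X - common                     # lies in the orthogonal complement
        # Distinctive part: leading low-rank piece of the residual,
        # orthogonal to the shared factors by construction.
        Ur, sr, Vr = np.linalg.svd(resid, full_matrices=False)
        distinct = (Ur[:, :r_distinct] * sr[:r_distinct]) @ Vr[:r_distinct]
        noise = resid - distinct
        parts.append((common, distinct, noise))
    return parts
```

The orthogonality constraint mentioned in the abstract appears here in simplified form: the distinctive part is built from the residual after projecting out the shared factors, so it carries no common-factor signal.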
|
2 |
Bayesian and Information-Theoretic Learning of High Dimensional Data. Chen, Minhua. January 2012
The concept of sparseness is harnessed to learn a low-dimensional representation of high-dimensional data. This sparseness assumption is exploited in multiple ways. In the Bayesian Elastic Net, a small number of correlated features are identified for the response variable. In the sparse Factor Analysis for biomarker trajectories, high-dimensional gene expression data are reduced to a small number of latent factors, each with a prototypical dynamic trajectory. In the Bayesian Graphical LASSO, the inverse covariance matrix of the data distribution is assumed to be sparse, inducing a sparsely connected Gaussian graph. In the nonparametric Mixture of Factor Analyzers, the covariance matrices in the Gaussian Mixture Model are forced to be low-rank, which is closely related to the concept of block sparsity. Finally, in the information-theoretic projection design, a linear projection matrix is explicitly sought for information-preserving dimensionality reduction. All of these methods prove effective on both simulated and real high-dimensional datasets. / Dissertation
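As a toy illustration of the elastic-net style of sparsity mentioned above, the following is a minimal coordinate-descent sketch of the classical (non-Bayesian) elastic net: an L1 penalty selects a small set of features while an L2 penalty stabilizes correlated ones. The function name and penalty values are illustrative assumptions, not the dissertation's Bayesian formulation.

```python
import numpy as np

def elastic_net(X, y, lam1=0.05, lam2=0.01, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam1*||b||_1
    + (lam2/2)*||b||^2. Simplified sketch, illustrative only."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with the j-th contribution added back.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n + lam2
            # Soft-thresholding yields exact zeros (feature selection).
            beta[j] = np.sign(rho) * max(abs(rho) - lam1, 0.0) / z
    return beta
```

On data generated from a sparse linear model, the irrelevant coefficients are driven exactly to zero by the soft-thresholding step, which is the behavior the Bayesian Elastic Net encodes through its prior.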
|
3 |
A NEW INDEPENDENCE MEASURE AND ITS APPLICATIONS IN HIGH DIMENSIONAL DATA ANALYSIS. Ke, Chenlu. 01 January 2019
This dissertation comprises three consecutive topics. First, we propose a novel class of independence measures for testing independence between two random vectors, based on the discrepancy between the conditional and the marginal characteristic functions. If one of the variables is categorical, our asymmetric index extends classical ANOVA to a kernel ANOVA that can test the more general hypothesis of equal distributions among groups. The index is also applicable when both variables are continuous. Second, we develop a sufficient variable selection procedure based on the new measure in a large-p-small-n setting. Our approach incorporates marginal information between each predictor and the response as well as joint information among predictors; as a result, it is more capable of selecting all truly active variables than marginal selection methods. Furthermore, our procedure can handle both continuous and discrete responses with mixed-type predictors. We establish the sure screening property of the proposed approach under mild conditions. Third, we focus on a model-free sufficient dimension reduction approach using the new measure. Our method does not require strong assumptions on predictors and responses. An algorithm is developed to find dimension reduction directions using sequential quadratic programming. We illustrate the advantages of the new measure and its two applications in high-dimensional data analysis through numerical studies across a variety of settings.
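The idea of measuring dependence through characteristic-function discrepancies can be illustrated with sample distance covariance (Székely et al.), a related and simpler measure built on the gap between the joint and the product of marginal characteristic functions. This is not the dissertation's conditional-characteristic-function index, only a sketch of the same family of ideas.

```python
import numpy as np

def distance_covariance(x, y):
    """Sample distance covariance via double-centered pairwise
    distance matrices; zero (in population) iff x and y are
    independent. Illustrative sketch of a CF-based dependence measure."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)

    def centered(z):
        # Pairwise Euclidean distances, then double-centering.
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

    A, B = centered(x), centered(y)
    return np.sqrt(max((A * B).mean(), 0.0))
```

Note that a nonlinear relation such as y = x² is invisible to Pearson correlation (which is near zero for symmetric x) but produces a clearly larger distance covariance than an independent pair.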
|
4 |
Automated flow cytometric analysis across a large number of samples. Chen, Xiaoyi. 06 October 2015
In the course of my Ph.D. work, I developed and applied two new computational approaches for the automatic identification of cell populations in multi-parameter flow cytometry across a large number of samples, each sample drawn from a different donor. Both approaches were motivated by the LabEx "Milieu Intérieur" study (hereafter the MI study). In this project, ten 8-color flow cytometry panels were standardized for assessment of the major and minor cell populations present in peripheral whole blood, and data were collected and analyzed from a cohort of 1,000 healthy donors. First, we aim at a robust characterization of the major cellular components of the immune system. We report a computational pipeline, called FlowGM, which minimizes operator input, is insensitive to compensation settings, and can be adapted to different analytic panels. A Gaussian Mixture Model (GMM)-based approach was used for initial clustering, with the number of clusters determined by the Bayesian Information Criterion (BIC). Meta-clustering in a reference donor, by which we mean labeling clusters and merging those with the same label in a pre-selected representative donor, permitted automated identification of 24 cell populations across four panels. Cluster labels were then integrated into Flow Cytometry Standard (FCS) files, permitting comparison with manual analysis by human experts. We show that cell numbers and coefficients of variation (CV) are similar between FlowGM and conventional manual analysis for lymphocyte populations, and that FlowGM notably provides improved discrimination of "hard-to-gate" monocyte and dendritic cell (DC) subsets.
FlowGM thus provides rapid, high-dimensional analysis of cell phenotypes and is amenable to cohort studies. Once cell counts are available across a large number of cohort donors, further analyses (for example, agreement with other methods, and age and gender effects) are naturally required for comprehensive evaluation, diagnosis, and discovery. In the context of the MI project, the 1,000 healthy donors were stratified by gender (50% women and 50% men) and age (20-69 years). Analysis was streamlined using our established FlowGM approach, and the results were highly concordant with the gold standard of manual gating. More importantly, FlowGM achieved further precision for the CD16+ monocyte and cDC1 populations: CD14loCD16hi monocytes and HLADRhi cDC1 cells were consistently identified. We demonstrate that the counts of these two populations correlate significantly with age. For cell populations that are well known to be related to age, a multiple linear regression model was considered, and our results yielded a higher regression coefficient. These findings establish a strong foundation for comprehensive evaluation of our previous work. When extending FlowGM to the detailed characterization of certain subpopulations that show greater variation across samples, for example T cells, we found that the conventional EM algorithm initialized with the reference clustering is insufficient to guarantee alignment of clusters across all samples, owing to technical and biological variation. We therefore improved FlowGM into the FlowGMP pipeline for this specific panel: we introduce a Bayesian mixture model by placing a prior distribution on the component parameters, and derive a penalized EM algorithm. Finally, evaluation of FlowGMP on this difficult T cell panel, with a comparison between automated and manual analysis, shows that our method provides reliable and efficient identification of eleven T cell subpopulations across a large number of samples.
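The core FlowGM step, fitting Gaussian mixtures and choosing the number of clusters by BIC, can be sketched with scikit-learn. The meta-clustering and labeling stages are omitted, and the helper name and candidate range are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_bic(X, k_range=range(1, 7)):
    """Fit a GMM for each candidate number of components and keep
    the model minimizing BIC. Sketch of FlowGM's initial clustering
    step only; meta-clustering across donors is not shown."""
    return min(
        (GaussianMixture(n_components=k, random_state=0).fit(X)
         for k in k_range),
        key=lambda m: m.bic(X),
    )
```

On data with two well-separated populations, BIC selects two components: the likelihood gain from extra clusters fails to offset the parameter-count penalty.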
|
5 |
Adaptive risk management. Chen, Ying. 13 February 2007
Over recent years, the study of risk management has been prompted by the Basel Committee's requirements for regular banking supervision. Many risk management methods, however, have limitations: 1) covariance estimation relies on a time-invariant form, 2) models are based on unrealistic distributional assumptions, and 3) numerical problems appear when the methods are applied to high-dimensional portfolios. The primary aim of this dissertation is to propose adaptive methods that overcome these limitations and can measure the risk exposures of multivariate portfolios accurately and quickly.
The basic idea is to first retrieve stochastically independent components (ICs) from the high-dimensional time series and then identify the distributional behavior of each resulting IC in univariate space. More specifically, two local parametric approaches, the local moving window average (MWA) method and the local exponential smoothing (ES) method, are used to estimate the volatility process of each IC under a heavy-tailed distributional assumption, namely that the ICs are generalized hyperbolic (GH) distributed. This speeds up the computation of risk measures and achieves much better accuracy than many popular risk management methods.
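The univariate smoothing step of the pipeline above can be sketched with a standard exponentially weighted moving average (EWMA) volatility recursion applied to one IC's return series. The IC extraction and the GH likelihood are omitted, and lam=0.94 is the common RiskMetrics choice, assumed here for illustration rather than taken from the dissertation.

```python
import numpy as np

def ewma_volatility(returns, lam=0.94):
    """Exponential-smoothing volatility for one independent component:
    sigma_t^2 = lam * sigma_{t-1}^2 + (1 - lam) * r_{t-1}^2.
    Sketch of the univariate smoothing step only."""
    var = np.empty(len(returns), dtype=float)
    var[0] = returns[0] ** 2          # crude initialization
    for t in range(1, len(returns)):
        var[t] = lam * var[t - 1] + (1 - lam) * returns[t - 1] ** 2
    return np.sqrt(var)
```

Because the weights decay geometrically, the estimate adapts to regime changes: after a shift from calm to turbulent returns, the estimated volatility rises within a few dozen observations, which is the local, time-varying behavior the adaptive methods aim for.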
|