21

Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods

Minnier, Jessica 06 August 2012 (has links)
Analysis of high dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Furthermore, the ultimate goal is often to build a prediction model with these features that accurately assesses risk for future subjects. Such statistical challenges arise in the study of genetic associations with health outcomes. However, accurate inference and prediction with genetic information remains challenging, in part due to the complexity in the genetic architecture of human health and disease. A valuable approach for improving prediction models with a large number of potential predictors is to build a parsimonious model that includes only important variables. Regularized regression methods are useful, though they often pose challenges for inference due to nonstandard limiting distributions or finite sample distributions that are difficult to approximate. In Chapter 1 we propose and theoretically justify a perturbation-resampling method to derive confidence regions and covariance estimates for marker effects estimated from regularized procedures with a general class of objective functions and concave penalties. Our methods outperform their asymptotic-based counterparts, even when effects are estimated as zero. In Chapters 2 and 3 we focus on genetic risk prediction. The difficulty of accurate risk assessment with genetic studies can in part be attributed to several potential obstacles: sparsity in marker effects, a large number of weak signals, and non-linear effects. Single marker analyses often lack power to select informative markers and typically do not account for non-linearity. One approach to gain predictive power and efficiency is to group markers based on biological knowledge such as genetic pathways or gene structure. In Chapter 2 we propose and theoretically justify a multi-stage method for risk assessment that imposes a naive Bayes kernel machine (KM) model to estimate gene-set-specific risk models, and then aggregates information across all gene-sets by adaptively estimating gene-set weights via a regularization procedure. In Chapter 3 we extend these methods to meta-analyses by introducing sampling-based weights in the KM model. This permits building risk prediction models with multiple studies that have heterogeneous sampling schemes.
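A minimal sketch of the general perturbation-resampling idea (illustrative only, not the thesis's exact procedure): refit a penalized regression under random mean-one perturbation weights and read confidence intervals off the resampled coefficients. The lasso, the synthetic data, and all parameter values below are assumptions for the example.

```python
# Minimal sketch of perturbation resampling for a penalized regression
# (illustrative only; the thesis covers a general class of objectives
# and concave penalties, not just the lasso used here).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: n subjects, p candidate markers, 3 true signals.
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.8, 0.5]
y = X @ beta_true + rng.standard_normal(n)

def weighted_lasso(X, y, w, alpha=0.05):
    """Refit the lasso with subject-level weights w by row-rescaling:
    sum_i w_i (y_i - x_i'b)^2 equals the unweighted loss on sqrt(w)-scaled rows."""
    s = np.sqrt(w)
    return Lasso(alpha=alpha).fit(X * s[:, None], y * s).coef_

# Perturbation resampling: i.i.d. mean-1, variance-1 weights, e.g. Exp(1).
B = 500
draws = np.empty((B, p))
for b in range(B):
    w = rng.exponential(1.0, size=n)
    draws[b] = weighted_lasso(X, y, w)

# Percentile confidence intervals for each coefficient.
lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)
print("95% CI for beta_1:", (lo[0], hi[0]))
```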
22

Learning the Structure of High-Dimensional Manifolds with Self-Organizing Maps for Accurate Information Extraction

Zhang, Lili January 2011 (has links)
This work aims to improve the capability of accurate information extraction from high-dimensional data with a specific neural learning paradigm, the Self-Organizing Map (SOM). The SOM is an unsupervised learning algorithm that can faithfully sense the manifold structure and support supervised learning of relevant information from the data. Yet open problems regarding SOM learning remain; we focus on two issues. 1. Evaluation of topology preservation. Topology preservation is essential for SOMs to faithfully represent manifold structure. However, in practice, topology violations are not unusual, especially when the data have complicated structure. Measures capable of accurately quantifying and informatively expressing topology violations are lacking. One contribution of this work is a new measure, the Weighted Differential Topographic Function (WDTF), which differentiates an existing measure, the Topographic Function (TF), and incorporates the detailed data distribution as an importance weighting of violations, distinguishing severe violations from insignificant ones. Another contribution is an interactive visual tool, TopoView, which facilitates the visual inspection of violations on the SOM lattice. We show the effectiveness of the combined use of the WDTF and TopoView through a simple two-dimensional data set and two hyperspectral images. 2. Learning multiple latent variables from high-dimensional data. We use an existing two-layer SOM-hybrid supervised architecture, which captures the manifold structure in its SOM hidden layer and then uses its output layer to perform the supervised learning of latent variables. Customarily, the output layer uses only the single strongest response of the SOM neurons, which severely limits the learning capability. We instead allow the k strongest responses of the SOM neurons to contribute to the supervised learning. Moreover, the fact that different latent variables can be best learned with different values of k motivates a new neural architecture, the Conjoined Twins, which extends the existing architecture with additional copies of the output layer for preferential use of different values of k in the learning of different latent variables. We also automate the customization of k for different variables with statistics derived from the SOM. The Conjoined Twins shows its effectiveness in the inference of two physical parameters from near-infrared spectra of planetary ices.
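For readers unfamiliar with the ingredients, here is a minimal sketch of a SOM training loop together with the classical topographic error (the fraction of samples whose two best-matching units are not lattice neighbours); the WDTF described above refines this kind of measure. The toy data, lattice size, and decay schedules are assumptions for the example.

```python
# Minimal SOM on a 2-D lattice, plus the classical topographic error:
# the fraction of samples whose two best-matching units are not lattice
# neighbours. The thesis's WDTF refines this idea; this is only the
# standard baseline measure.
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 3))           # toy 3-D data
rows, cols, dim = 10, 10, data.shape[1]
W = rng.standard_normal((rows * cols, dim))     # codebook vectors
grid = np.array([(r, c) for r in range(rows) for c in range(cols)])

def bmu(x):
    """Index of the best-matching unit for sample x."""
    return np.argmin(((W - x) ** 2).sum(axis=1))

# Standard online SOM training with shrinking neighbourhood/learning rate.
T = 20000
for t in range(T):
    x = data[rng.integers(len(data))]
    win = bmu(x)
    sigma = 3.0 * (0.05 / 3.0) ** (t / T)       # neighbourhood radius decay
    lr = 0.5 * (0.01 / 0.5) ** (t / T)          # learning-rate decay
    d2 = ((grid - grid[win]) ** 2).sum(axis=1)  # lattice distances to winner
    h = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian neighbourhood
    W += lr * h[:, None] * (x - W)

def topographic_error(samples):
    """1st and 2nd BMUs should be adjacent (4-neighbours) on the lattice."""
    errs = 0
    for x in samples:
        d = ((W - x) ** 2).sum(axis=1)
        first, second = np.argsort(d)[:2]
        if np.abs(grid[first] - grid[second]).sum() > 1:
            errs += 1
    return errs / len(samples)

print("topographic error:", topographic_error(data))
```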
23

Algorithmically Guided Information Visualization : Explorative Approaches for High Dimensional, Mixed and Categorical Data / Algoritmiskt vägledd informationsvisualisering för högdimensionell och kategorisk data

Johansson Fernstad, Sara January 2011 (has links)
Facilitated by the technological advances of the last decades, increasing amounts of complex data are being collected within fields such as biology, chemistry and social sciences. The major challenge today is not to gather data, but to extract useful information and gain insights from it. Information visualization provides methods for visual analysis of complex data but, as the amounts of gathered data increase, the challenges of visual analysis become more complex. This thesis presents work utilizing algorithmically extracted patterns as guidance during interactive data exploration processes, employing information visualization techniques. It provides efficient analysis by taking advantage of fast pattern identification techniques as well as making use of the domain expertise of the analyst. In particular, the presented research is concerned with the issues of analysing categorical data, where the values are names without any inherent order or distance; mixed data, including a combination of categorical and numerical data; and high dimensional data, including hundreds or even thousands of variables. The contributions of the thesis include a quantification method, assigning numerical values to categorical data, which utilizes an automated method to define category similarities based on underlying data structures, and integrates relationships within numerical variables into the quantification when dealing with mixed data sets. The quantification is incorporated in an interactive analysis pipeline where it provides suggestions for numerical representations, which may interactively be adjusted by the analyst. The interactive quantification enables exploration using commonly available visualization methods for numerical data. Within the context of categorical data analysis, this thesis also contributes the first user study evaluating the performance of what are currently the two main visualization approaches for categorical data analysis. Furthermore, this thesis contributes two dimensionality reduction approaches, which aim at preserving structure while reducing dimensionality, and provide flexible and user-controlled dimensionality reduction. Through algorithmic quality metric analysis, where each metric represents a structure of interest, potentially interesting variables are extracted from the high dimensional data. The automatically identified structures are visually displayed, using various visualization methods, and act as guidance in the selection of interesting variable subsets for further analysis. The visual representations furthermore provide overview of structures within the high dimensional data set and may, through this, aid in focusing subsequent analysis, as well as enabling interactive exploration of the full high dimensional data set and selected variable subsets. The thesis also contributes the application of algorithmically guided approaches for high dimensional data exploration in the rapidly growing field of microbiology, through the design and development of a quality-guided interactive system in collaboration with microbiologists.
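As a toy illustration of quantification (see the thesis for the structure-based method actually proposed), the sketch below scores each category by the mean of a numerical variable it co-occurs with, so that ordinary numerical visualizations become applicable. The data frame and column names are invented for the example.

```python
# A deliberately simple quantification of a categorical variable for
# mixed data: score each category by the mean of a related numerical
# variable, so that ordinary numerical visualizations (scatter plots,
# parallel coordinates) become applicable. The thesis derives category
# similarities from underlying data structures and keeps the result
# interactively adjustable; this is only a toy stand-in.
import pandas as pd

df = pd.DataFrame({
    "tissue": ["liver", "brain", "liver", "kidney", "brain", "kidney"],
    "expression": [3.1, 7.4, 2.8, 5.0, 7.9, 4.6],
})

# Map each category to the mean of the numerical variable it co-occurs with.
scores = df.groupby("tissue")["expression"].mean()
df["tissue_quantified"] = df["tissue"].map(scores)
print(df)
```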
24

Bayesian networks for high-dimensional data with complex mean structure.

Kasza, Jessica Eleonore January 2010 (has links)
In a microarray experiment, it is expected that there will be correlations between the expression levels of different genes under study. These correlation structures are of great interest from both biological and statistical points of view. From a biological perspective, the identification of correlation structures can lead to an understanding of genetic pathways involving several genes, while the statistical interest, and the emphasis of this thesis, lies in the development of statistical methods to identify such structures. However, the data arising from microarray studies is typically very high-dimensional, with an order of magnitude more genes being considered than there are samples of each gene. This leads to difficulties in the estimation of the dependence structure of all genes under study. Graphical models and Bayesian networks are often used in these situations, providing flexible frameworks in which dependence structures for high-dimensional data sets can be considered. The current methods for the estimation of dependence structures for high-dimensional data sets typically assume the presence of independent and identically distributed samples of gene expression values. However, often the data available will have a complex mean structure and additional components of variance. Given such data, the application of methods that assume independent and identically distributed samples may result in incorrect biological conclusions being drawn. In this thesis, methods for the estimation of Bayesian networks for gene expression data sets that contain additional complexities are developed and implemented. The focus is on the development of score metrics that take account of these complexities for use in conjunction with score-based methods for the estimation of Bayesian networks, in particular the High-dimensional Bayesian Covariance Selection algorithm. The necessary theory relating to Gaussian graphical models and Bayesian networks is reviewed, as are the methods currently available for the estimation of dependence structures for high-dimensional data sets consisting of independent and identically distributed samples. Score metrics for the estimation of Bayesian networks when data sets are not independent and identically distributed are then developed and explored, and the utility and necessity of these metrics is demonstrated. Finally, the developed metrics are applied to a data set consisting of samples of grape genes taken from several different vineyards. / Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 2010
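For context, here is a sketch of the standard i.i.d. baseline that such score metrics generalize: a node-wise Gaussian BIC score for a candidate DAG, computed by regressing each node on its parents. This is generic textbook material, not the metrics developed in the thesis.

```python
# Generic score metric for a Gaussian Bayesian network under the usual
# i.i.d. assumption: each node is regressed on its parents and scored
# by BIC. The thesis develops score metrics that relax the i.i.d.
# assumption; this baseline is what those metrics generalize.
import numpy as np

def node_bic(data, child, parents):
    """BIC contribution of one node given its parent set (columns of data)."""
    n = data.shape[0]
    y = data[:, child]
    if parents:
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    else:
        X = np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1                          # regression coefs + variance
    return loglik - 0.5 * k * np.log(n)

def network_bic(data, parent_sets):
    """Score a DAG given as {child: [parents]}; higher is better."""
    return sum(node_bic(data, c, ps) for c, ps in parent_sets.items())

rng = np.random.default_rng(2)
x0 = rng.standard_normal(500)
x1 = 0.8 * x0 + rng.standard_normal(500)
data = np.column_stack([x0, x1])
print(network_bic(data, {0: [], 1: [0]}))       # true structure
print(network_bic(data, {0: [], 1: []}))        # empty graph scores lower
```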
25

New Directions in Sparse Models for Image Analysis and Restoration

January 2013 (has links)
Effective modeling of high dimensional data is crucial in information processing and machine learning. Classical subspace methods have been very effective in such applications. However, over the past few decades, there has been considerable research towards the development of new modeling paradigms that go beyond subspace methods. This dissertation focuses on the study of sparse models and their interplay with modern machine learning techniques such as manifold, ensemble and graph-based methods, along with their applications in image analysis and recovery. By considering graph relations between data samples while learning sparse models, graph-embedded codes can be obtained for use in unsupervised, supervised and semi-supervised problems. Using experiments on standard datasets, it is demonstrated that the codes obtained from the proposed methods outperform several baseline algorithms. In order to facilitate sparse learning with large-scale data, the paradigm of ensemble sparse coding is proposed, and different strategies for constructing weak base models are developed. Experiments with image recovery and clustering demonstrate that these ensemble models perform better than conventional sparse coding frameworks. When examples from the data manifold are available, manifold constraints can be incorporated with sparse models, and two approaches are proposed to combine sparse coding with manifold projection. The improved performance of the proposed techniques in comparison to sparse coding approaches is demonstrated using several image recovery experiments. In addition to these approaches, some applications require combining multiple sparse models with different regularizations. In particular, combining an unconstrained sparse model with non-negative sparse coding is important in image analysis, and it poses several algorithmic and theoretical challenges. A convex algorithm and an efficient greedy algorithm for recovering combined representations are proposed. Theoretical guarantees on sparsity thresholds for exact recovery using these algorithms are derived, and recovery performance is demonstrated using simulations on synthetic data. Finally, the problem of non-linear compressive sensing, where the measurement process is carried out in a feature space obtained using non-linear transformations, is considered. An optimized non-linear measurement system is proposed, and improvements in recovery performance are demonstrated in comparison to random measurements as well as optimized linear measurements. / Dissertation/Thesis / Ph.D. Electrical Engineering 2013
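A sketch of the basic sparse-coding primitive that the graph-embedded, ensemble and manifold-constrained variants build on: ISTA applied to a fixed random dictionary. The dictionary, penalty level, and iteration count are assumptions for the example, not the dissertation's algorithms.

```python
# The basic sparse-coding primitive behind the variants above: solve
#   min_a 0.5 * ||x - D a||^2 + lam * ||a||_1
# by ISTA (iterative soft thresholding). Sketch only.
import numpy as np

def ista(D, x, lam=0.1, n_iter=200):
    L = np.linalg.norm(D, 2) ** 2               # Lipschitz constant of gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(3)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms
a_true = np.zeros(128)
a_true[[5, 40, 99]] = [1.5, -2.0, 1.0]
x = D @ a_true + 0.01 * rng.standard_normal(64)
a_hat = ista(D, x)
print("nonzeros recovered:", np.flatnonzero(np.abs(a_hat) > 0.1))
```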
26

Apprentissage de données génomiques multiples pour le diagnostic et le pronostic du cancer / Learning from multiple genomic information in cancer for diagnosis and prognosis

Moarii, Matahi 26 June 2015 (has links)
Several initiatives have been launched to characterize large cohorts of human cancers at the molecular level, using various high-throughput technologies, in the hope of understanding the major alterations involved in tumorigenesis. The measured data include gene expression, mutations and copy-number variations, as well as epigenetic signals such as DNA methylation. Large consortia such as "The Cancer Genome Atlas" (TCGA) have already made thousands of cancer samples publicly available. In this thesis we contribute to the mathematical analysis of the relationships between the different biological sources, and to the validation and/or large-scale generalization of biological phenomena, through an integrative analysis of genetic and epigenetic data. First, we show that DNA methylation is a useful surrogate marker for assessing clonality between two cells, enabling a clinical tool for breast cancer recurrence that is more precise and more stable than current tools, and thereby better patient care. Second, we statistically quantify the impact of DNA methylation on transcription, and highlight the importance of incorporating biological hypotheses to compensate for the small number of samples relative to the number of variables. In return, we show the potential of bioinformatics to suggest new and interesting biological hypotheses. Finally, we study a biological phenomenon, the emergence of a hypermethylator phenotype, across several cancers. To do so, we adapt regression methods, exploiting the similarity between the different prediction tasks, to obtain more accurate common genetic signatures predictive of the phenotype. In conclusion, we demonstrate the importance of collaboration between biology and statistics in order to develop methods suited to current problems in bioinformatics, which in turn provide new biological insights.
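As a hedged stand-in for "regression adapted across related prediction tasks", the sketch below uses a multi-task lasso, which couples tasks through a joint row-sparsity penalty so that the same features are selected for every cancer type; the thesis's actual method may differ, and the data here are synthetic.

```python
# Multi-task lasso as an illustration of shared predictive signatures
# across related tasks (e.g. the same methylation features predicting a
# phenotype in several cancer types). Sketch with synthetic data only.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(4)
n, p, tasks = 150, 100, 3                       # samples, features, cancers
X = rng.standard_normal((n, p))
B = np.zeros((p, tasks))
B[:5] = rng.standard_normal((5, tasks))         # 5 features shared by all tasks
Y = X @ B + 0.5 * rng.standard_normal((n, tasks))

model = MultiTaskLasso(alpha=0.1).fit(X, Y)     # coef_ has shape (tasks, p)
shared = np.flatnonzero(np.abs(model.coef_).sum(axis=0) > 1e-8)
print("features selected jointly for all tasks:", shared)
```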
27

Testing uniformity against rotationally symmetric alternatives on high-dimensional spheres

Cutting, Christine 04 June 2020 (has links) (PDF)
In this thesis we are interested in testing uniformity in high dimensions on the unit sphere $S^{p_n-1}$ (the dimension of the observations, $p_n$, depends on their number, $n$, and high-dimensional data are such that $p_n$ diverges to infinity with $n$). We first consider "monotone" alternatives whose density increases along an axis ${\pmb \theta}_n\in S^{p_n-1}$ and depends on a concentration parameter $\kappa_n>0$. We start by identifying the rate at which these alternatives are contiguous to uniformity; then, thanks to local asymptotic normality results, we show that the most classical test of uniformity, the Rayleigh test, is not optimal when ${\pmb \theta}_n$ is specified, but becomes optimal, for fixed $p$ and in the high-dimensional FvML case, when ${\pmb \theta}_n$ is unspecified. We next consider "axial" alternatives, assigning the same probability to antipodal points. They also depend on a location parameter ${\pmb \theta}_n\in S^{p_n-1}$ and a concentration parameter $\kappa_n\in\mathbb{R}$. The contiguity rate proves to be higher in that case, implying that the problem is harder than in the monotone case. Indeed, the Bingham test, the classical test for axial data, is not optimal for fixed $p$ when ${\pmb \theta}_n$ is unspecified, and is blind to the contiguous alternatives in high dimensions. This is why we turn to tests based on the extreme eigenvalues of the covariance matrix and establish their fixed-$p$ asymptotic distributions under contiguous alternatives. Finally, thanks to a martingale central limit theorem, we show that, under some assumptions and after standardisation, the Rayleigh and Bingham test statistics are asymptotically normal under general rotationally symmetric distributions. This result enables us to identify the rates at which the Bingham test detects both axial and monotone alternatives. / Doctorat en Sciences / info:eu-repo/semantics/nonPublished
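A sketch of the two classical statistics discussed above in their fixed-$p$ forms, with their usual null distributions; the thesis studies their behaviour as $p_n$ grows with $n$, and the sample sizes below are illustrative.

```python
# Rayleigh and Bingham statistics for testing uniformity on the sphere,
# in their classical fixed-p forms (sketch only).
import numpy as np
from scipy import stats

def rayleigh_stat(X):
    """R = n * p * ||mean||^2; asymptotically chi2(p) under uniformity."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    return n * p * (xbar @ xbar)

def bingham_stat(X):
    """Q = n*p*(p+2)/2 * (tr(S^2) - 1/p), S the scatter matrix;
    asymptotically chi2((p-1)(p+2)/2) under uniformity."""
    n, p = X.shape
    S = X.T @ X / n
    return n * p * (p + 2) / 2 * (np.trace(S @ S) - 1.0 / p)

# Uniform sample on S^{p-1}: normalize Gaussian vectors.
rng = np.random.default_rng(5)
n, p = 500, 10
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=1, keepdims=True)

R, Q = rayleigh_stat(X), bingham_stat(X)
print("Rayleigh p-value:", stats.chi2.sf(R, df=p))
print("Bingham  p-value:", stats.chi2.sf(Q, df=(p - 1) * (p + 2) // 2))
```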
28

Efficient Uncertainty Quantification with High Dimensionality

Jianhua Yin (12456819) 25 April 2022 (has links)
Uncertainty exists everywhere in scientific and engineering applications. To avoid potential risk, it is critical to understand the impact of uncertainty on a system by performing uncertainty quantification (UQ) and reliability analysis (RA). However, the computational cost of current UQ methods may be unaffordable with high-dimensional input. Moreover, current UQ methods are not applicable when numerical data and image data coexist.

To decrease the computational cost to an affordable level and to enable UQ with special high-dimensional data (e.g. images), this dissertation develops three UQ methodologies for high-dimensional input spaces. The first two methods focus on high-dimensional numerical input. The core strategy of Methodology 1 is to fix the unimportant variables at their first-step most probable point (MPP) so that the dimensionality is reduced; an accurate RA method is used in the reduced space, and the final reliability is obtained by accounting for the contributions of both important and unimportant variables. Methodology 2 addresses the case where the dimensionality cannot be reduced because most of the variables are important or all variables contribute roughly equally to the system. It develops an efficient surrogate modeling method for high-dimensional UQ using Generalized Sliced Inverse Regression (GSIR), Gaussian Process (GP)-based active learning, and importance sampling: a cost-efficient GP model is built in the latent space obtained by GSIR dimension reduction, and the failure boundary is identified through active learning that iteratively adds optimal training points. In Methodology 3, a Convolutional Neural Network (CNN)-based surrogate model (CNN-GP) is constructed for mixed numerical and image data. The numerical data are first converted into images, which are merged with the existing image data and fed to the CNN for training. The latent variables of the CNN model are then used to integrate the CNN with a GP that quantifies the model error as epistemic uncertainty. Both epistemic and aleatory uncertainty are considered in uncertainty propagation.

The simulation results indicate that the first two methodologies not only improve efficiency but also maintain adequate accuracy for problems with high-dimensional numerical input. GSIR with active learning can handle situations in which the dimensionality cannot be reduced because most of the variables are important or contribute almost equally, and the two methodologies can be combined as a two-stage dimension reduction for high-dimensional numerical input. The third method, CNN-GP, can deal with special high-dimensional input of mixed numerical and image data, with satisfactory regression accuracy and an estimate of the model error. Uncertainty propagation considering both epistemic and aleatory uncertainty provides better accuracy. The proposed methods could potentially be applied to engineering design and decision making.
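A sketch of the GP-based active-learning loop at the heart of Methodology 2, in the spirit of AK-MCS: iteratively enrich the design with the Monte Carlo point whose failure/safe classification is most uncertain. The toy limit-state function, kernel, and stopping threshold are assumptions; GSIR dimension reduction and importance sampling are omitted.

```python
# GP-based active learning for reliability analysis (AK-MCS-style "U"
# criterion): add the candidate whose sign of the limit-state prediction
# is most uncertain, until all candidates are classified confidently.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def limit_state(x):                         # toy performance function g(x);
    return 3.0 - x[:, 0] ** 2 - x[:, 1]     # failure when g <= 0

rng = np.random.default_rng(6)
pool = rng.standard_normal((5000, 2))       # Monte Carlo candidate pool

# Small initial design, then active enrichment.
idx = list(rng.choice(len(pool), size=8, replace=False))
for _ in range(20):
    Xtr = pool[idx]
    ytr = limit_state(Xtr)
    gp = GaussianProcessRegressor(kernel=RBF(1.0), normalize_y=True).fit(Xtr, ytr)
    mu, sd = gp.predict(pool, return_std=True)
    U = np.abs(mu) / np.maximum(sd, 1e-12)  # low U = uncertain sign of g
    cand = int(np.argmin(U))
    if U[cand] >= 2.0:                      # standard AK-MCS stopping rule
        break
    idx.append(cand)

pf = np.mean(mu <= 0)                       # failure probability estimate
print(f"estimated P(failure) = {pf:.4f} with {len(idx)} model runs")
```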
29

Interpretable machine learning approaches to high-dimensional data and their applications to biomedical engineering problems / 高次元データへの解釈可能な機械学習アプローチとその医用工学問題への適用

Yoshida, Kosuke 26 March 2018 (has links)
Kyoto University / 0048 / New-system doctoral program / Doctor of Informatics / Degree No. Kō 21215 / Jōhaku No. 668 / 新制||情||115 (University Library) / Department of Systems Science, Graduate School of Informatics, Kyoto University / (Chief examiner) Prof. Shin Ishii, Prof. Hidetoshi Shimodaira, Prof. Manabu Kano, Kenji Doya / Qualifies under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
30

A Unified Exposure Prediction Approach for Multivariate Spatial Data: From Predictions to Health Analysis

Zhu, Zheng 18 June 2019 (has links)
No description available.
