  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
31

Bayesian networks for high-dimensional data with complex mean structure.

Kasza, Jessica Eleonore January 2010 (has links)
In a microarray experiment, it is expected that there will be correlations between the expression levels of different genes under study. These correlation structures are of great interest from both biological and statistical points of view. From a biological perspective, the identification of correlation structures can lead to an understanding of genetic pathways involving several genes, while the statistical interest, and the emphasis of this thesis, lies in the development of statistical methods to identify such structures. However, the data arising from microarray studies is typically very high-dimensional, with an order of magnitude more genes being considered than there are samples of each gene. This leads to difficulties in the estimation of the dependence structure of all genes under study. Graphical models and Bayesian networks are often used in these situations, providing flexible frameworks in which dependence structures for high-dimensional data sets can be considered. The current methods for the estimation of dependence structures for high-dimensional data sets typically assume the presence of independent and identically distributed samples of gene expression values. However, often the data available will have a complex mean structure and additional components of variance. Given such data, the application of methods that assume independent and identically distributed samples may result in incorrect biological conclusions being drawn. In this thesis, methods for the estimation of Bayesian networks for gene expression data sets that contain additional complexities are developed and implemented. The focus is on the development of score metrics that take account of these complexities for use in conjunction with score-based methods for the estimation of Bayesian networks, in particular the High-dimensional Bayesian Covariance Selection algorithm. 
The necessary theory relating to Gaussian graphical models and Bayesian networks is reviewed, as are the methods currently available for the estimation of dependence structures for high-dimensional data sets consisting of independent and identically distributed samples. Score metrics for the estimation of Bayesian networks when data sets are not independent and identically distributed are then developed and explored, and the utility and necessity of these metrics is demonstrated. Finally, the developed metrics are applied to a data set consisting of samples of grape genes taken from several different vineyards. / Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 2010
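As a toy illustration of the score-based approach described above, the sketch below computes a Gaussian BIC score for a candidate parent set of a node and shows that the score prefers the true parent on synthetic data. The function `node_bic` and the two-gene example are illustrative assumptions; this is a generic score of the kind used by score-based structure learning, not the thesis's High-dimensional Bayesian Covariance Selection metric.

```python
import numpy as np

def node_bic(X, j, parents):
    """BIC score of node j given a parent set, under a Gaussian
    linear model (higher is better).  A generic score-based
    structure-learning ingredient, not the thesis's HdBCS metric."""
    n = X.shape[0]
    y = X[:, j]
    if parents:
        A = np.column_stack([np.ones(n), X[:, parents]])
    else:
        A = np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = max(resid @ resid / n, 1e-12)          # MLE of residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * A.shape[1] * np.log(n)    # BIC penalty on parameters

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = 2.0 * x0 + rng.normal(size=500)   # true edge x0 -> x1
X = np.column_stack([x0, x1])

with_parent = node_bic(X, 1, [0])
without_parent = node_bic(X, 1, [])
```

A score-based search would evaluate such scores over many candidate parent sets; here the score with the true parent beats the empty parent set.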
32

New Directions in Sparse Models for Image Analysis and Restoration

January 2013 (has links)
abstract: Effective modeling of high dimensional data is crucial in information processing and machine learning. Classical subspace methods have been very effective in such applications. However, over the past few decades, there has been considerable research towards the development of new modeling paradigms that go beyond subspace methods. This dissertation focuses on the study of sparse models and their interplay with modern machine learning techniques such as manifold, ensemble and graph-based methods, along with their applications in image analysis and recovery. By considering graph relations between data samples while learning sparse models, graph-embedded codes can be obtained for use in unsupervised, supervised and semi-supervised problems. Using experiments on standard datasets, it is demonstrated that the codes obtained from the proposed methods outperform several baseline algorithms. In order to facilitate sparse learning with large scale data, the paradigm of ensemble sparse coding is proposed, and different strategies for constructing weak base models are developed. Experiments with image recovery and clustering demonstrate that these ensemble models perform better when compared to conventional sparse coding frameworks. When examples from the data manifold are available, manifold constraints can be incorporated with sparse models and two approaches are proposed to combine sparse coding with manifold projection. The improved performance of the proposed techniques in comparison to sparse coding approaches is demonstrated using several image recovery experiments. In addition to these approaches, it might be required in some applications to combine multiple sparse models with different regularizations. In particular, combining an unconstrained sparse model with non-negative sparse coding is important in image analysis, and it poses several algorithmic and theoretical challenges. 
A convex and an efficient greedy algorithm for recovering combined representations are proposed. Theoretical guarantees on sparsity thresholds for exact recovery using these algorithms are derived and recovery performance is also demonstrated using simulations on synthetic data. Finally, the problem of non-linear compressive sensing, where the measurement process is carried out in feature space obtained using non-linear transformations, is considered. An optimized non-linear measurement system is proposed, and improvements in recovery performance are demonstrated in comparison to using random measurements as well as optimized linear measurements. / Dissertation/Thesis / Ph.D. Electrical Engineering 2013
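The sparse models discussed above are typically posed as l1-regularised least-squares ("sparse coding") problems. Below is a minimal iterative soft-thresholding (ISTA) sketch on a synthetic dictionary; the helper name `ista` and all problem sizes are illustrative assumptions, and the dissertation's graph-embedded and ensemble variants build on this basic formulation rather than being reproduced here.

```python
import numpy as np

def ista(D, y, lam=0.1, n_iter=200):
    """Iterative soft-thresholding for the sparse coding problem
        min_x 0.5*||y - D x||^2 + lam*||x||_1.
    A generic sketch, not the dissertation's ensemble or
    graph-embedded variants."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = x - (D.T @ (D @ x - y)) / L    # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(1)
D = rng.normal(size=(64, 128))
D /= np.linalg.norm(D, axis=0)             # unit-norm dictionary atoms
x_true = np.zeros(128)
x_true[[5, 40, 99]] = [1.5, -2.0, 1.0]     # 3-sparse ground-truth code
y = D @ x_true
x_hat = ista(D, y, lam=0.01, n_iter=500)
```

On noiseless data with a small penalty, the recovered code concentrates on the true support, which is the behaviour the recovery guarantees mentioned above formalise.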
33

Learning from multiple genomic information in cancer for diagnosis and prognosis

Moarii, Matahi 26 June 2015 (has links)
Several initiatives have been launched to characterise large cohorts of human cancers at the molecular level with various high-throughput technologies, in the hope of understanding the major biological alterations involved in tumorigenesis. The measured data include gene expression, mutations, copy-number variations, and epigenetic signals such as DNA methylation. Large consortia such as "The Cancer Genome Atlas" (TCGA) have already made thousands of cancerous and non-cancerous samples publicly available. In this thesis, we contribute to the statistical analysis of the relationships between these biological sources, and to the validation and/or large-scale generalisation of biological phenomena, through an integrative analysis of genetic and epigenetic data. First, we show that DNA methylation is a useful surrogate biomarker for assessing clonality between cells, which leads to a clinical tool for breast cancer relapse that is more accurate and more stable than current tools, allowing better patient care. Second, we develop systematic statistical analyses to quantify the impact of DNA methylation on gene expression. We highlight the importance of incorporating prior biological knowledge to compensate for the small number of samples relative to the number of variables, and show in return the potential of bioinformatics to suggest new biological hypotheses. Finally, we examine a biological phenomenon related to the appearance of a hypermethylator phenotype in several cancers. 
We adapt regression techniques that exploit the similarity between the different prediction tasks to obtain robust genetic signatures, common to all cancers, that predict the phenotype more accurately. In conclusion, we highlight the importance of collaboration between biology and statistics in order to establish methods suited to current problems in bioinformatics, which in turn provide new biological insights.
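As a hedged sketch of coupling related prediction tasks, the snippet below implements mean-regularised multi-task ridge regression, in which each task's weights are shrunk toward a shared mean vector. This is a generic stand-in for similarity-based multi-task learning, not the thesis's actual method; the function name `multitask_ridge` and the toy "five cancer types" data are illustrative assumptions.

```python
import numpy as np

def multitask_ridge(Xs, ys, lam=1.0, n_iter=20):
    """Mean-regularised multi-task ridge: each task's weights are
    shrunk toward a shared mean, a simple way to couple related
    prediction tasks.  A generic stand-in, not the thesis's method."""
    p = Xs[0].shape[1]
    ws = [np.zeros(p) for _ in Xs]
    w_bar = np.zeros(p)
    for _ in range(n_iter):
        for t, (X, y) in enumerate(zip(Xs, ys)):
            # closed-form ridge solution shrunk toward the shared mean
            A = X.T @ X + lam * np.eye(p)
            ws[t] = np.linalg.solve(A, X.T @ y + lam * w_bar)
        w_bar = np.mean(ws, axis=0)        # update the shared mean
    return ws, w_bar

rng = np.random.default_rng(2)
w_common = rng.normal(size=10)             # structure shared across tasks
Xs, ys = [], []
for _ in range(5):                         # five related "cancer types"
    X = rng.normal(size=(30, 10))
    w_t = w_common + 0.1 * rng.normal(size=10)
    Xs.append(X)
    ys.append(X @ w_t)
ws, w_bar = multitask_ridge(Xs, ys, lam=5.0)
```

The recovered shared vector approximates the common structure, which is the mechanism by which such couplings yield signatures common to all tasks.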
34

Partition clustering of High Dimensional Low Sample Size data based on P-Values

Von Borries, George Freitas January 1900 (has links)
Doctor of Philosophy / Department of Statistics / Haiyan Wang / This thesis introduces a new partitioning algorithm to cluster variables in high dimensional low sample size (HDLSS) data and high dimensional longitudinal low sample size (HDLLSS) data. HDLSS data contain a large number of variables with a small number of replications per variable, and HDLLSS data refer to HDLSS data observed over time. Clustering techniques play an important role in analyzing high dimensional low sample size data, as commonly seen in microarray experiments, mass spectrometry data, and pattern recognition. Most current clustering algorithms for HDLSS and HDLLSS data are adaptations from traditional multivariate analysis, where the number of variables is not high and sample sizes are relatively large. Current algorithms show poor performance when applied to high dimensional data, especially in small sample size cases. In addition, available algorithms often exhibit poor clustering accuracy and stability for non-normal data. Simulations show that traditional clustering algorithms used on high dimensional data are not robust to monotone transformations. The proposed clustering algorithm, PPCLUST, is a powerful tool for clustering HDLSS data; it uses p-values from nonparametric rank tests of homogeneity of distributions as a measure of similarity between groups of variables. Inheriting the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of the data. PPCLUSTEL is an extension of PPCLUST for clustering HDLLSS data. A nonparametric test of no simple effect of group is developed, and the p-value from this test is used as a measure of similarity between groups of variables. PPCLUST and PPCLUSTEL are able to cluster a large number of variables in the presence of very few replications, and in the case of PPCLUSTEL the algorithm requires neither a large number of time points nor equally spaced ones. 
PPCLUST and PPCLUSTEL do not suffer from loss of power due to distributional assumptions, general multiple comparison problems, or difficulty in controlling heteroscedastic variances. Applications to data from previous microarray studies show promising results, and simulation studies reveal that the algorithms outperform a series of benchmark algorithms applied to HDLSS data, exhibiting high clustering accuracy and stability.
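A minimal sketch of the p-value-as-similarity idea described above, assuming a two-group Kruskal-Wallis rank test with a chi-square approximation: a large p-value means two variables look homogeneous and are candidates for merging. The function `rank_pvalue` and the four-"gene" toy data are illustrative assumptions, not the PPCLUST algorithm itself.

```python
import math
import numpy as np

def rank_pvalue(x, y):
    """Two-group Kruskal-Wallis p-value (chi-square approximation,
    1 degree of freedom, no tie correction), used as a similarity:
    large p-value = the two samples look homogeneous."""
    n, m = len(x), len(y)
    N = n + m
    ranks = np.argsort(np.argsort(np.concatenate([x, y]))) + 1.0
    rx, ry = ranks[:n].mean(), ranks[n:].mean()
    H = 12.0 / (N * (N + 1)) * (n * (rx - (N + 1) / 2) ** 2
                                + m * (ry - (N + 1) / 2) ** 2)
    return math.erfc(math.sqrt(H / 2.0))   # chi2(1) upper-tail probability

rng = np.random.default_rng(3)
# four "genes", 8 replications each: two per underlying distribution
genes = np.vstack([rng.normal(0, 1, 8), rng.normal(0, 1, 8),
                   rng.normal(5, 1, 8), rng.normal(5, 1, 8)])
pmat = np.array([[rank_pvalue(genes[i], genes[j]) for j in range(4)]
                 for i in range(4)])
```

A partitioning algorithm would then merge the groups with the largest p-values; here genes from different distributions get near-zero p-values while genes from the same distribution do not.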
35

Testing uniformity against rotationally symmetric alternatives on high-dimensional spheres

Cutting, Christine 04 June 2020 (has links) (PDF)
In this thesis we are interested in testing uniformity in high dimensions on the unit sphere $S^{p_n-1}$ (the dimension of the observations, $p_n$, depends on their number, $n$, and high-dimensional data are such that $p_n$ diverges to infinity with $n$). We first consider "monotone" alternatives whose density increases along a direction ${\pmb \theta}_n\in S^{p_n-1}$ and depends on a concentration parameter $\kappa_n>0$. We start by identifying the rate $\kappa_n$ at which these alternatives are contiguous to uniformity; we then show, thanks to local asymptotic normality results, that the most classical test of uniformity, the Rayleigh test, is not optimal when ${\pmb \theta}_n$ is specified, but becomes optimal for fixed $p$, and in the high-dimensional FvML case, when ${\pmb \theta}_n$ is unspecified. We next consider "axial" alternatives, which assign the same probability to antipodal points. They also depend on a location parameter ${\pmb \theta}_n\in S^{p_n-1}$ and a concentration parameter $\kappa_n\in\mathbb{R}$. The contiguity rate proves to be higher in this case, suggesting a harder problem than in the monotone case. 
Indeed, the Bingham test, the classical test for axial data, is not optimal for fixed $p$ when ${\pmb \theta}_n$ is unspecified, and is blind to the contiguous alternatives in high dimensions. This is why we turn to tests based on the largest and smallest eigenvalues of the covariance matrix and establish their fixed-$p$ asymptotic distributions under contiguous alternatives. Finally, using a martingale central limit theorem, we show that, under suitable assumptions and after standardisation, the Rayleigh and Bingham test statistics are asymptotically normal under general rotationally symmetric distributions. This enables us to identify the rate at which the Bingham test detects axial alternatives, and also the rate at which it detects monotone alternatives. / Doctorat en Sciences / info:eu-repo/semantics/nonPublished
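The classical Rayleigh statistic mentioned above is easy to state: it measures the squared norm of the sample mean direction, and under uniformity it is asymptotically chi-square with $p$ degrees of freedom. The sketch below computes it for a uniform sample and for a concentrated one; the shifted-Gaussian construction of the concentrated sample is a crude von Mises-Fisher-like illustration, an assumption of this sketch rather than anything from the thesis.

```python
import numpy as np

def rayleigh_statistic(X):
    """Rayleigh statistic n * p * ||mean||^2 for n observations on
    the unit sphere S^{p-1}; asymptotically chi2(p) under uniformity."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    return n * p * (xbar @ xbar)

rng = np.random.default_rng(4)
n, p = 500, 10
# uniform sample on the sphere: normalised Gaussian vectors
U = rng.normal(size=(n, p))
U /= np.linalg.norm(U, axis=1, keepdims=True)
# concentrated sample: Gaussians shifted along e_1, then normalised
V = rng.normal(size=(n, p)) + 2.0 * np.eye(p)[0]
V /= np.linalg.norm(V, axis=1, keepdims=True)

s_unif = rayleigh_statistic(U)
s_conc = rayleigh_statistic(V)
```

The concentrated sample produces a far larger statistic, which is exactly the monotone-alternative behaviour whose detection rate the thesis quantifies.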
36

Efficient Uncertainty quantification with high dimensionality

Jianhua Yin (12456819) 25 April 2022 (has links)
Uncertainty exists everywhere in scientific and engineering applications. To avoid potential risk, it is critical to understand the impact of uncertainty on a system by performing uncertainty quantification (UQ) and reliability analysis (RA). However, the computational cost may be unaffordable using current UQ methods with high-dimensional input. Moreover, current UQ methods are not applicable when numerical data and image data coexist. To decrease the computational cost to an affordable level and to enable UQ with special high-dimensional data (e.g. images), this dissertation develops three UQ methodologies for high-dimensional input spaces. The first two methods focus on high-dimensional numerical input. The core strategy of Methodology 1 is to fix the unimportant variables at their first-step most probable point (MPP) so that the dimensionality is reduced. An accurate RA method is then used in the reduced space, and the final reliability is obtained by accounting for the contributions of both important and unimportant variables. Methodology 2 addresses the case where the dimensionality cannot be reduced because most of the variables are important, or because the variables contribute equally to the system. It develops an efficient surrogate modeling method for high-dimensional UQ using Generalized Sliced Inverse Regression (GSIR), Gaussian process (GP)-based active learning, and importance sampling. A cost-efficient GP model is built in the latent space after dimension reduction by GSIR, and the failure boundary is identified through active learning that iteratively adds optimal training points. In Methodology 3, a Convolutional Neural Network (CNN)-based surrogate model (CNN-GP) is constructed to deal with mixed numerical and image data. The numerical data are first converted into images, the converted images are merged with the existing image data, and the merged images are fed to the CNN for training. 
The latent variables of the CNN model are then used to integrate the CNN with a GP, quantifying the model error as epistemic uncertainty. Both epistemic and aleatory uncertainty are considered in uncertainty propagation. The simulation results indicate that the first two methodologies not only improve efficiency but also maintain adequate accuracy for problems with high-dimensional numerical input. GSIR with active learning can handle situations where the dimensionality cannot be reduced because most of the variables are important or their importance is similar. The two methodologies can be combined as a two-stage dimension reduction for high-dimensional numerical input. The third method, CNN-GP, is capable of dealing with special high-dimensional input, mixed numerical and image data, with satisfactory regression accuracy, and provides an estimate of the model error. Uncertainty propagation considering both epistemic and aleatory uncertainty provides better accuracy. The proposed methods could potentially be applied to engineering design and decision making.
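To make the reliability-analysis setting concrete, the sketch below estimates a failure probability $P(g(X)<0)$ by crude Monte Carlo in a 20-dimensional toy problem where only two inputs matter, and compares it with the estimate obtained after fixing the 18 unimportant variables at their means, loosely in the spirit of Methodology 1's important/unimportant split. The limit state `g`, the sample sizes, and the variable split are all illustrative assumptions, not the dissertation's algorithms.

```python
import numpy as np

def failure_prob(limit_state, sampler, n=100_000, rng=None):
    """Crude Monte Carlo estimate of P(g(X) < 0) -- the baseline
    whose cost the dissertation's methodologies aim to reduce."""
    rng = rng or np.random.default_rng(0)
    X = sampler(n, rng)
    return np.mean(limit_state(X) < 0.0)

# toy limit state in 20 dimensions: only the first two inputs matter,
# mimicking the important-vs-unimportant variable split
def g(X):
    return 6.0 - X[:, 0] - 2.0 * X[:, 1] + 0.001 * X[:, 2:].sum(axis=1)

sampler_full = lambda n, rng: rng.normal(size=(n, 20))
pf_full = failure_prob(g, sampler_full)

# reduced-space version: fix the 18 unimportant variables at their mean (0)
def g_reduced(X2):
    X = np.zeros((X2.shape[0], 20))
    X[:, :2] = X2
    return g(X)

sampler_reduced = lambda n, rng: rng.normal(size=(n, 2))
pf_reduced = failure_prob(g_reduced, sampler_reduced)
```

Because the dropped variables contribute almost nothing to the limit state, the two estimates agree closely while the reduced problem samples only 2 of 20 dimensions, which is the intuition behind fixing unimportant variables before running an accurate RA method.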
37

Interpretable machine learning approaches to high-dimensional data and their applications to biomedical engineering problems

Yoshida, Kosuke 26 March 2018 (has links)
Kyoto University / 0048 / New system, doctoral program / Doctor of Informatics / Degree No. 21215 / 情博第668号 / 新制||情||115 (University Library) / Department of Systems Science, Graduate School of Informatics, Kyoto University / Examiners: Prof. Shin Ishii, Prof. Hidetoshi Shimodaira, Prof. Manabu Kano, Kenji Doya / Qualified under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
38

A Unified Exposure Prediction Approach for Multivariate Spatial Data: From Predictions to Health Analysis

Zhu, Zheng 18 June 2019 (has links)
No description available.
39

Sequential Change-point Detection in Linear Regression and Linear Quantile Regression Models Under High Dimensionality

Ratnasingam, Suthakaran 06 August 2020 (has links)
No description available.
40

Nonlocal Priors in Generalized Linear Models and Gaussian Graphical Models

Yang, Fang 23 August 2022 (has links)
No description available.
