61

High-dimensional statistics : model specification and elementary estimators

Yang, Eunho 16 January 2015 (has links)
Modern statistics typically deals with complex data, in particular where the ambient dimension of the problem p may be of the same order as, or even substantially larger than, the sample size n. It has now become well understood that even in this type of high-dimensional scaling, statistically consistent estimators can be achieved provided one imposes structural constraints on the statistical models. In spite of great success over the last few decades, we are still experiencing two distinct kinds of bottlenecks: (I) in multivariate modeling, the modeling assumptions are typically limited to instances such as Gaussian or Ising models, so handling random variables of varied types remains restricted, and (II) in terms of computation, the learning or estimation process is not efficient, especially when p is extremely large, since in the current paradigm for high-dimensional statistics, regularization terms induce non-differentiable optimization problems, which in general do not have closed-form solutions. The thesis addresses these two distinct but highly complementary problems: (I) statistical model specification beyond the standard Gaussian or Ising models for data of varied types, and (II) computationally efficient elementary estimators for high-dimensional statistical models. / text
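As a rough illustration of the "elementary estimator" idea in the linear regression case, the sketch below (Python, with an illustrative ridge-type initial estimate and a hypothetical tuning parameter lam) replaces iterative regularized optimization with a single closed-form soft-thresholding step; the thesis's actual estimators and their initial mappings may differ.

    import numpy as np

    def soft_threshold(v, lam):
        # Elementwise soft-thresholding: the closed-form proximal map of the L1 norm.
        return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

    def elementary_linear_estimator(X, y, lam, eps=1e-2):
        # A crude closed-form sparse estimator: start from a simple (here ridge-regularized)
        # initial estimate and soft-threshold it, avoiding iterative non-differentiable
        # optimization altogether. Only a sketch of the general idea.
        n, p = X.shape
        beta_init = np.linalg.solve(X.T @ X / n + eps * np.eye(p), X.T @ y / n)
        return soft_threshold(beta_init, lam)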
62

Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods

Minnier, Jessica 06 August 2012 (has links)
Analysis of high dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Furthermore, the ultimate goal is often to build a prediction model with these features that accurately assesses risk for future subjects. Such statistical challenges arise in the study of genetic associations with health outcomes. However, accurate inference and prediction with genetic information remains challenging, in part due to the complexity in the genetic architecture of human health and disease. A valuable approach for improving prediction models with a large number of potential predictors is to build a parsimonious model that includes only important variables. Regularized regression methods are useful, though they often pose challenges for inference due to nonstandard limiting distributions or finite sample distributions that are difficult to approximate. In Chapter 1 we propose and theoretically justify a perturbation-resampling method to derive confidence regions and covariance estimates for marker effects estimated from regularized procedures with a general class of objective functions and concave penalties. Our methods outperform their asymptotic-based counterparts, even when effects are estimated as zero. In Chapters 2 and 3 we focus on genetic risk prediction. The difficulty in accurate risk assessment with genetic studies can in part be attributed to several potential obstacles: sparsity in marker effects, a large number of weak signals, and non-linear effects. Single marker analyses often lack power to select informative markers and typically do not account for non-linearity. One approach to gain predictive power and efficiency is to group markers based on biological knowledge such as genetic pathways or gene structure. In Chapter 2 we propose and theoretically justify a multi-stage method for risk assessment that imposes a naive Bayes kernel machine (KM) model to estimate gene-set specific risk models, and then aggregates information across all gene-sets by adaptively estimating gene-set weights via a regularization procedure. In Chapter 3 we extend these methods to meta-analyses by introducing sampling-based weights in the KM model. This permits building risk prediction models with multiple studies that have heterogeneous sampling schemes.
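To make the perturbation-resampling idea of Chapter 1 concrete, here is a minimal sketch assuming a lasso-penalized linear model fitted with scikit-learn and mean-one exponential perturbation weights; the thesis covers a much more general class of objective functions and penalties.

    import numpy as np
    from sklearn.linear_model import Lasso

    def perturbation_resample(X, y, lam, B=500, seed=0):
        # Re-fit the penalized model B times with i.i.d. exponential weights on the
        # observations; the spread of the re-fitted coefficients approximates the
        # sampling variability of the original estimator.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        draws = np.empty((B, p))
        for b in range(B):
            w = rng.exponential(1.0, size=n)                     # mean-one perturbation weights
            sw = np.sqrt(w)[:, None]
            fit = Lasso(alpha=lam).fit(X * sw, y * sw.ravel())   # weighted squared loss via row scaling
            draws[b] = fit.coef_
        return draws  # e.g. np.percentile(draws, [2.5, 97.5], axis=0) for marginal intervals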
63

Analyzing the Combination of Polymorphisms Associating with Antidepressant Response by Exact Conditional Test

Ma, Baofu 08 August 2005 (has links)
Genetic factors have been shown to be involved in the etiology of a poor response to antidepressant treatment of sufficient dosage and duration. Our goal was to identify the role of polymorphisms in the poor response to treatment. To this end, 5 functional polymorphisms in 109 patients diagnosed with unipolar major depressive disorder are analyzed. Due to the small sample size, exact conditional tests are used to analyze the contingency table. The data analysis involves: (1) an exact test for conditional independence in a high dimensional contingency table; (2) a marginal independence test; (3) an exact test for three-way interactions. Program efficiency always limits the application of exact tests, and appropriate methods for enumerating the exact tables are the key to improving it. An algorithm for enumerating the exact tables is also introduced.
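For intuition, an exact marginal independence test for a single polymorphism reduces to a Fisher exact test on a 2x2 table; the counts below are purely hypothetical, and the thesis's conditional and three-way tests operate on higher-dimensional tables.

    from scipy.stats import fisher_exact

    # Hypothetical 2x2 table: rows = carriers / non-carriers of one polymorphism,
    # columns = responders / non-responders to the antidepressant treatment.
    table = [[18, 32],
             [41, 18]]
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(odds_ratio, p_value)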
64

Scalable Nonparametric Bayes Learning

Banerjee, Anjishnu January 2013 (has links)
Capturing high dimensional complex ensembles of data is becoming commonplace in a variety of application areas. Some examples include biological studies exploring relationships between genetic mutations and diseases, atmospheric and spatial data, and internet usage and online behavioral data. These large complex data present many challenges in their modeling and statistical analysis. Motivated by high dimensional data applications, in this thesis, we focus on building scalable Bayesian nonparametric regression algorithms and on developing models for joint distributions of complex object ensembles.

We begin with a scalable method for Gaussian process regression, a commonly used tool for nonparametric regression, prediction and spatial modeling. A very common bottleneck for large data sets is the need for repeated inversions of a big covariance matrix, which is required for likelihood evaluation and inference. Such inversion can be practically infeasible and, even if implemented, highly numerically unstable. We propose an algorithm utilizing random projection ideas to construct flexible, computationally efficient and easy-to-implement approaches for generic scenarios. We then further improve the algorithm, incorporating structure and blocking ideas in our random projections, and demonstrate their applicability in other contexts requiring inversion of large covariance matrices. We show theoretical guarantees for performance as well as substantial improvements over existing methods with simulated and real data. A by-product of the work is the discovery of hitherto unknown equivalences between approaches in machine learning, randomized linear algebra and Bayesian statistics. We finally connect random projection methods for large dimensional predictors and large sample size under a unifying theoretical framework.

The other focus of this thesis is joint modeling of complex ensembles of data from different domains. This goes beyond traditional relational modeling of ensembles of one type of data and relies on probability mixing measures over tensors. These models have added flexibility over some existing product mixture model approaches in letting each component of the ensemble have its own dependent cluster structure. We further investigate the question of measuring dependence between variables of different types and propose a very general novel scaled measure based on divergences between the joint and marginal distributions of the objects. Once again, we show excellent performance in both simulated and real data scenarios. / Dissertation
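As a very rough sketch of how random projections can sidestep the covariance-inversion bottleneck in Gaussian process regression (not the thesis's exact algorithm), the snippet below forms a rank-m approximation of the n x n kernel matrix from a Gaussian projection and applies the Woodbury identity, so the solve costs O(n m^2) rather than O(n^3).

    import numpy as np

    def rp_gp_solve(K, y, sigma2, m, seed=0):
        # Approximate K by a rank-m factor built from a random projection, then use the
        # Woodbury identity to apply (sigma2*I + K)^{-1} to y without an n x n inversion.
        rng = np.random.default_rng(seed)
        n = K.shape[0]
        Phi = rng.standard_normal((n, m)) / np.sqrt(m)       # random projection directions
        KP = K @ Phi                                         # n x m
        M = Phi.T @ KP                                       # m x m, assumed well-conditioned
        L = np.linalg.cholesky(np.linalg.inv(M) + 1e-10 * np.eye(m))
        U = KP @ L                                           # K is approximated by U @ U.T
        small = np.linalg.solve(sigma2 * np.eye(m) + U.T @ U, U.T @ y)
        return (y - U @ small) / sigma2                      # roughly (sigma2*I + K)^{-1} @ y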
65

Learning the Structure of High-Dimensional Manifolds with Self-Organizing Maps for Accurate Information Extraction

Zhang, Lili January 2011 (has links)
This work aims to improve the capability of accurate information extraction from high-dimensional data, with a specific neural learning paradigm, the Self-Organizing Map (SOM). The SOM is an unsupervised learning algorithm that can faithfully sense the manifold structure and support supervised learning of relevant information from the data. Yet open problems regarding SOM learning exist. We focus on the following two issues. 1. Evaluation of topology preservation. Topology preservation is essential for SOMs in faithful representation of manifold structure. However, in reality, topology violations are not unusual, especially when the data have complicated structure. Measures capable of accurately quantifying and informatively expressing topology violations are lacking. One contribution of this work is a new measure, the Weighted Differential Topographic Function (WDTF), which differentiates an existing measure, the Topographic Function (TF), and incorporates detailed data distribution as an importance weighting of violations to distinguish severe violations from insignificant ones. Another contribution is an interactive visual tool, TopoView, which facilitates the visual inspection of violations on the SOM lattice. We show the effectiveness of the combined use of the WDTF and TopoView through a simple two-dimensional data set and two hyperspectral images. 2. Learning multiple latent variables from high-dimensional data. We use an existing two-layer SOM-hybrid supervised architecture, which captures the manifold structure in its SOM hidden layer, and then, uses its output layer to perform the supervised learning of latent variables. In the customary way, the output layer only uses the strongest output of the SOM neurons. This severely limits the learning capability. We allow multiple, k, strongest responses of the SOM neurons for the supervised learning. Moreover, the fact that different latent variables can be best learned with different values of k motivates a new neural architecture, the Conjoined Twins, which extends the existing architecture with additional copies of the output layer, for preferential use of different values of k in the learning of different latent variables. We also automate the customization of k for different variables with the statistics derived from the SOM. The Conjoined Twins shows its effectiveness in the inference of two physical parameters from Near-Infrared spectra of planetary ices.
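A minimal sketch of the "k strongest responses" idea described above, assuming the SOM prototypes are already trained and stored as rows of a matrix: instead of passing only the best-matching unit to the supervised output layer, the k closest prototypes are activated with inverse-distance weights. Function and variable names here are illustrative, not the thesis's.

    import numpy as np

    def som_k_response_features(x, prototypes, k=3):
        # Distance of the input to every SOM prototype, then a sparse activation
        # vector over the k best-matching units, weighted by inverse distance.
        d = np.linalg.norm(prototypes - x, axis=1)
        idx = np.argsort(d)[:k]
        act = np.zeros(len(prototypes))
        act[idx] = 1.0 / (d[idx] + 1e-12)
        return act / act.sum()   # fed to the supervised output layer instead of a single winner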
66

Spatiotemporal Gene Networks from ISH Images

Puniyani, Kriti 01 September 2013 (has links)
As large-scale techniques for studying and measuring gene expression have been developed, automatically inferring gene interaction networks from expression data has emerged as a popular technique to advance our understanding of cellular systems. Accurate prediction of gene interactions, especially in multicellular organisms such as Drosophila or humans, requires temporal and spatial analysis of gene expression, which is not easily obtainable from microarray data. New image-based techniques using in-situ hybridization (ISH) have recently been developed to allow large-scale spatial-temporal profiling of whole-body mRNA expression. However, analysis of such data for discovering new gene interactions still remains an open challenge. This thesis studies the question of predicting gene interaction networks from ISH data in three parts. First, we present SPEX2, a computer vision pipeline to extract informative features from ISH data. Next, we present an algorithm, GINI, for learning spatial gene interaction networks from embryonic ISH images at a single time step. GINI combines multi-instance kernels with recent work in learning sparse undirected graphical models to predict interactions between genes. Finally, we propose NP-MuScL (nonparanormal multi-source learning) to estimate a gene interaction network that is consistent with multiple sources of data having the same underlying relationships between the nodes. NP-MuScL casts the network estimation problem as estimating the structure of a sparse undirected graphical model. We use the semiparametric Gaussian copula to model the distribution of the different data sources, with the different copulas sharing the same covariance matrix, and show how to estimate such a model in the high dimensional scenario. We apply our algorithms on more than 100,000 Drosophila embryonic ISH images from the Berkeley Drosophila Genome Project. Each of the 6 time steps in Drosophila embryonic development is treated as a separate data source. With spatial gene interactions predicted via GINI, and temporal predictions combined via NP-MuScL, we are finally able to predict spatiotemporal gene networks from these images.
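The core estimation step behind GINI and NP-MuScL, learning a sparse undirected graphical model, can be sketched as follows in a simplified single-source form, assuming a matrix X of image-derived gene expression features: Gaussianize each variable with a rank-based transform in the spirit of the nonparanormal, fit a graphical lasso, and read edges off the precision matrix.

    import numpy as np
    from scipy.stats import norm, rankdata
    from sklearn.covariance import GraphicalLasso

    def nonparanormal_network(X, alpha=0.05):
        # Rank-based marginal Gaussianization followed by sparse precision estimation;
        # nonzero off-diagonal entries of the precision matrix are treated as edges.
        n, p = X.shape
        Z = norm.ppf(rankdata(X, axis=0) / (n + 1.0))
        model = GraphicalLasso(alpha=alpha).fit(Z)
        edges = (np.abs(model.precision_) > 1e-8) & ~np.eye(p, dtype=bool)
        return edges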
67

Efficient Computational Methods for Structural Reliability and Global Sensitivity Analyses

Zhang, Xufang 25 April 2013 (has links)
Uncertainty analysis of a system response is an important part of engineering probabilistic analysis. Uncertainty analysis includes: (a) evaluating moments of the response; (b) evaluating the reliability of the system; (c) assessing the complete probability distribution of the response; (d) conducting parametric sensitivity analysis of the output. The actual model of the system response is usually a high-dimensional function of the input variables. Although Monte Carlo simulation is a quite general approach for this purpose, it may require an inordinate amount of resources to achieve an acceptable level of accuracy. Development of a computationally efficient method is hence of great importance. First of all, the study proposed a moment method for uncertainty quantification of structural systems. A key departure, however, is the use of fractional moments of the response function, as opposed to the integer moments used so far in the literature. The advantage of using fractional moments over integer moments was illustrated by the relation of one fractional moment to a couple of integer moments. With a small number of samples to compute the fractional moments, the system output distribution was estimated with the principle of maximum entropy (MaxEnt) in conjunction with constraints specified in terms of fractional moments. Compared to classical MaxEnt, a novel feature of the proposed method is that the fractional exponents of the MaxEnt distribution are determined through the entropy maximization process, instead of being assigned by the analyst a priori. To further minimize the computational cost of the simulation-based entropy method, a multiplicative dimensional reduction method (M-DRM) was proposed to compute the fractional (integer) moments of a generic function with multiple input variables. The M-DRM can accurately approximate a high-dimensional function as the product of a series of low-dimensional functions. Together with the principle of maximum entropy, a novel computational approach was proposed to assess the complete probability distribution of a system output. The accuracy and efficiency of the proposed method for structural reliability analysis were verified by crude Monte Carlo simulation of several examples. Application of the M-DRM was further extended to the variance-based global sensitivity analysis of a system. Compared to local sensitivity analysis, the variance-based sensitivity index can provide information about the significance of an input random variable. Since each component variance is defined as a conditional expectation with respect to the system model function, the separable nature of the M-DRM approximation can simplify the high-dimensional integrations in sensitivity analysis. Several examples were presented to illustrate the numerical accuracy and efficiency of the proposed method in comparison to the Monte Carlo simulation method. The last contribution of the study is the development of a computationally efficient method for polynomial chaos expansion (PCE) of a system's response. The PCE model can later be used for uncertainty analysis. However, evaluation of the coefficients of a PCE meta-model is a computationally demanding task due to the high-dimensional integrations involved. With the proposed M-DRM, this computational cost can be remarkably reduced compared to the classical methods in the literature (simulation or tensor Gauss quadrature methods). The accuracy and efficiency of the proposed method for polynomial chaos expansion were verified on several practical examples.
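For reference, the multiplicative approximation underlying an M-DRM is typically written as follows (sketched here in LaTeX around an anchor point c, under independent inputs; the notation is ours and not necessarily the thesis's):

    % Approximate the response as a product of univariate "cuts" through the anchor point c:
    h(\mathbf{x}) \;\approx\; \bigl[h(\mathbf{c})\bigr]^{1-n}
        \prod_{i=1}^{n} h(c_1,\dots,c_{i-1},\,x_i,\,c_{i+1},\dots,c_n)
    % so that a fractional moment factorizes into one-dimensional expectations,
    % each computable with a few quadrature points:
    \mathrm{E}\bigl[h(\mathbf{X})^{\alpha}\bigr] \;\approx\;
        \bigl[h(\mathbf{c})\bigr]^{(1-n)\alpha}
        \prod_{i=1}^{n} \mathrm{E}\bigl[h_i(X_i)^{\alpha}\bigr]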
68

Algorithmically Guided Information Visualization : Explorative Approaches for High Dimensional, Mixed and Categorical Data / Algoritmiskt vägledd informationsvisualisering för högdimensionell och kategorisk data

Johansson Fernstad, Sara January 2011 (has links)
Facilitated by the technological advances of the last decades, increasing amounts of complex data are being collected within fields such as biology, chemistry and social sciences. The major challenge today is not to gather data, but to extract useful information and gain insights from it. Information visualization provides methods for visual analysis of complex data but, as the amounts of gathered data increase, the challenges of visual analysis become more complex. This thesis presents work utilizing algorithmically extracted patterns as guidance during interactive data exploration processes, employing information visualization techniques. It provides efficient analysis by taking advantage of fast pattern identification techniques as well as making use of the domain expertise of the analyst. In particular, the presented research is concerned with the issues of analysing categorical data, where the values are names without any inherent order or distance; mixed data, including a combination of categorical and numerical data; and high dimensional data, including hundreds or even thousands of variables. The contributions of the thesis include a quantification method, assigning numerical values to categorical data, which utilizes an automated method to define category similarities based on underlying data structures, and integrates relationships within numerical variables into the quantification when dealing with mixed data sets. The quantification is incorporated in an interactive analysis pipeline where it provides suggestions for numerical representations, which may interactively be adjusted by the analyst. The interactive quantification enables exploration using commonly available visualization methods for numerical data. Within the context of categorical data analysis, this thesis also contributes the first user study evaluating the performance of what are currently the two main visualization approaches for categorical data analysis. Furthermore, this thesis contributes two dimensionality reduction approaches, which aim at preserving structure while reducing dimensionality, and provide flexible and user-controlled dimensionality reduction. Through algorithmic quality metric analysis, where each metric represents a structure of interest, potentially interesting variables are extracted from the high dimensional data. The automatically identified structures are visually displayed, using various visualization methods, and act as guidance in the selection of interesting variable subsets for further analysis. The visual representations furthermore provide overview of structures within the high dimensional data set and may, through this, aid in focusing subsequent analysis, as well as enabling interactive exploration of the full high dimensional data set and selected variable subsets. The thesis also contributes the application of algorithmically guided approaches for high dimensional data exploration in the rapidly growing field of microbiology, through the design and development of a quality-guided interactive system in collaboration with microbiologists.
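As a toy illustration of data-driven quantification of categorical variables (not the thesis's published algorithm), one can describe each category by its co-occurrence profile with the other variables and place the categories on a numeric axis via the leading principal component of those profiles; the column and function names below are hypothetical.

    import numpy as np
    import pandas as pd

    def quantify_category(df, col):
        # Represent each category of `col` by its conditional distribution over the
        # other columns, then embed the categories in 1-D with the first principal
        # component, giving the interactive tool a numeric starting suggestion.
        others = [c for c in df.columns if c != col]
        profiles = pd.concat(
            [pd.crosstab(df[col], df[c], normalize="index") for c in others], axis=1
        )
        centered = profiles - profiles.mean(axis=0)
        _, _, vt = np.linalg.svd(centered.values, full_matrices=False)
        return dict(zip(profiles.index, centered.values @ vt[0]))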
69

Bayesian networks for high-dimensional data with complex mean structure.

Kasza, Jessica Eleonore January 2010 (has links)
In a microarray experiment, it is expected that there will be correlations between the expression levels of different genes under study. These correlation structures are of great interest from both biological and statistical points of view. From a biological perspective, the identification of correlation structures can lead to an understanding of genetic pathways involving several genes, while the statistical interest, and the emphasis of this thesis, lies in the development of statistical methods to identify such structures. However, the data arising from microarray studies is typically very high-dimensional, with an order of magnitude more genes being considered than there are samples of each gene. This leads to difficulties in the estimation of the dependence structure of all genes under study. Graphical models and Bayesian networks are often used in these situations, providing flexible frameworks in which dependence structures for high-dimensional data sets can be considered. The current methods for the estimation of dependence structures for high-dimensional data sets typically assume the presence of independent and identically distributed samples of gene expression values. However, often the data available will have a complex mean structure and additional components of variance. Given such data, the application of methods that assume independent and identically distributed samples may result in incorrect biological conclusions being drawn. In this thesis, methods for the estimation of Bayesian networks for gene expression data sets that contain additional complexities are developed and implemented. The focus is on the development of score metrics that take account of these complexities for use in conjunction with score-based methods for the estimation of Bayesian networks, in particular the High-dimensional Bayesian Covariance Selection algorithm. The necessary theory relating to Gaussian graphical models and Bayesian networks is reviewed, as are the methods currently available for the estimation of dependence structures for high-dimensional data sets consisting of independent and identically distributed samples. Score metrics for the estimation of Bayesian networks when data sets are not independent and identically distributed are then developed and explored, and the utility and necessity of these metrics is demonstrated. Finally, the developed metrics are applied to a data set consisting of samples of grape genes taken from several different vineyards. / Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 2010
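For orientation, a standard decomposable score metric for Gaussian data is sketched below: each node is regressed on its candidate parents and penalized with BIC, and the network score is the sum over nodes. The thesis's contribution is precisely to replace such i.i.d.-sample scores with metrics that account for complex mean structure and additional variance components, which this toy version ignores.

    import numpy as np

    def gaussian_bic_node_score(data, child, parents):
        # BIC-penalized Gaussian log-likelihood of one node given its parents;
        # summing over all nodes gives a score usable by score-based structure search.
        n = data.shape[0]
        y = data[:, child]
        X = np.column_stack([np.ones(n)] + [data[:, j] for j in parents])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        k = X.shape[1] + 1                    # regression coefficients plus residual variance
        return -0.5 * n * np.log(rss / n) - 0.5 * k * np.log(n)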
70

Locality Sensitive Indexing for Efficient High-Dimensional Query Answering in the Presence of Excluded Regions

January 2016 (has links)
abstract: Similarity search in high-dimensional spaces is popular for applications like image processing, time series, and genome data. In higher dimensions, the curse of dimensionality undermines the effectiveness of most index structures, giving way to approximate methods like Locality Sensitive Hashing (LSH) to answer similarity searches. In addition to range searches and k-nearest neighbor searches, there is a need to answer negative queries formed by excluded regions in high-dimensional data. Though there has been a slew of LSH variants to improve efficiency, reduce storage, and provide better accuracy, none of these techniques are capable of answering queries in the presence of excluded regions. This thesis provides a novel approach to handle such negative queries. This is achieved by creating a prefix-based hierarchical index structure. First, the higher-dimensional space is projected to a lower-dimensional space. Then, a one-dimensional ordering is developed while retaining the hierarchical traits. The algorithm intelligently prunes the irrelevant candidates while answering queries in the presence of excluded regions. While naive LSH would need to filter out the negative query results from the main results, the new algorithm minimizes the need to fetch the redundant results in the first place. Experimental results show that this reduces post-processing cost, thereby reducing the query processing time. / Dissertation/Thesis / Masters Thesis Computer Science 2016
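For background, a bare-bones signed-random-projection LSH index looks like the sketch below; it is not the thesis's prefix-based hierarchical structure, and its `exclude` argument merely post-filters results, which is exactly the redundant work the proposed method is designed to avoid by pruning excluded regions inside the index.

    import numpy as np
    from collections import defaultdict

    class RandomProjectionLSH:
        # Points with the same bit signature under random hyperplanes share a bucket,
        # so a query scans only its own bucket instead of the whole data set.
        def __init__(self, dim, n_bits=16, seed=0):
            self.planes = np.random.default_rng(seed).standard_normal((n_bits, dim))
            self.buckets = defaultdict(list)

        def _key(self, x):
            return tuple((self.planes @ x > 0).astype(int))

        def index(self, points):
            for i, x in enumerate(points):
                self.buckets[self._key(x)].append(i)

        def query(self, q, points, exclude=None):
            # Naive handling of excluded points: fetch candidates first, filter afterwards.
            cand = [i for i in self.buckets[self._key(q)]
                    if not exclude or i not in exclude]
            return sorted(cand, key=lambda i: np.linalg.norm(points[i] - q))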
