1

Probabilistic Modeling of Multi-relational and Multivariate Discrete Data

Wu, Hao 07 February 2017
Modeling and discovering knowledge from multi-relational and multivariate discrete data is a crucial task that arises in many research and application domains, e.g., text mining, intelligence analysis, epidemiology, and social science. In this dissertation, we study and address three problems involving the modeling of multi-relational discrete data and multivariate multi-response count data, viz. (1) discovering surprising patterns from multi-relational data, (2) constructing a generative model for multivariate categorical data, and (3) simultaneously modeling multivariate multi-response count data and estimating covariance structures between multiple responses. To discover surprising multi-relational patterns, we first study the "where do I start?" problem originating from intelligence analysis. By studying nine methods with origins in association analysis, graph metrics, and probabilistic modeling, we identify several classes of algorithmic strategies that can supply starting points to analysts, and thus help to discover interesting multi-relational patterns in datasets. To actually mine for interesting multi-relational patterns, we represent them as dense and well-connected chains of biclusters over multiple relations, and model the discrete data by the maximum entropy principle, so that we can gauge, in a statistically well-founded way, the surprisingness of a discovered bicluster chain with respect to what we already know. We design an algorithm for approximating the most informative multi-relational patterns, and provide strategies to incrementally incorporate discovered patterns into the background model. We illustrate how our method is adept at discovering the hidden plot in multiple synthetic and real-world intelligence analysis datasets. Our approach naturally generalizes traditional attribute-based maximum entropy models for single relations, and further supports iterative, human-in-the-loop knowledge discovery.
To build a generative model for multivariate categorical data, we apply the maximum entropy principle to propose a categorical maximum entropy model, so that we can, in a statistically well-founded way, make optimal use of given prior information about the data while remaining unbiased otherwise. In general, inferring the maximum entropy model can be infeasible in practice. Here, we leverage the structure of the categorical data space to design an efficient inference algorithm for the categorical maximum entropy model, and we demonstrate how the proposed model is adept at estimating underlying data distributions. We evaluate this approach against both simulated data and US census datasets, and demonstrate its feasibility using an epidemic simulation application.
Modeling data with multivariate count responses is a challenging problem due to the discrete nature of the responses. Existing methods for univariate count responses cannot be easily extended to the multivariate case, since the dependency among multiple responses needs to be properly accounted for. To model multivariate data with multiple count responses, we propose a novel multivariate Poisson log-normal model (MVPLN). By simultaneously estimating the regression coefficients and the inverse covariance matrix over the latent variables with an efficient Monte Carlo EM algorithm, the proposed model takes advantage of the association among multiple count responses to improve prediction accuracy. Simulation studies and applications to real-world data are conducted to systematically evaluate the performance of the proposed method in comparison with conventional methods.
Ph. D.
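
As a rough illustration of the multivariate Poisson log-normal idea mentioned in the abstract above, the sketch below simulates counts from a common textbook formulation: each count is Poisson with rate exp(theta), and the latent vector theta is Gaussian with a mean given by a linear regression. This is an assumption about the general model family, not the dissertation's exact MVPLN specification, and it does not implement the Monte Carlo EM estimation step. The function name, coefficients, and covariance values are all hypothetical.

```python
import numpy as np


def simulate_poisson_lognormal(X, B, Sigma, rng):
    """Simulate counts from an assumed Poisson log-normal formulation.

    theta[i] ~ Normal(X[i] @ B, Sigma)      (latent log-rates, K per row)
    Y[i, k]  ~ Poisson(exp(theta[i, k]))    (observed count responses)
    """
    n, K = X.shape[0], Sigma.shape[0]
    latent_mean = X @ B  # shape (n, K)
    latent_noise = rng.multivariate_normal(np.zeros(K), Sigma, size=n)
    theta = latent_mean + latent_noise
    return rng.poisson(np.exp(theta))


rng = np.random.default_rng(0)
n = 5000
# Design matrix: intercept plus one standard-normal covariate (illustrative).
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# Hypothetical regression coefficients, one column per count response.
B = np.array([[1.0, 0.5],
              [0.3, 0.2]])
# Hypothetical latent covariance with positive association between responses.
Sigma = np.array([[0.5, 0.4],
                  [0.4, 0.5]])

Y = simulate_poisson_lognormal(X, B, Sigma, rng)
corr = np.corrcoef(Y.T)[0, 1]
print("sample correlation between the two count responses:", corr)
```

The off-diagonal entry of `Sigma` is what couples the two responses here; setting it to zero would leave only the shared covariate as a source of association between the counts.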
2

Some problems in the theory & application of graphical models

Roddam, Andrew Wilfred January 1999
A graphical model is simply a representation of the results of an analysis of relationships between sets of variables. It can include the study of the dependence of one variable, or a set of variables, on another variable or set of variables, and can be extended to include variables which could be considered as intermediate to the others. This leads to the concept of representing these chains of relationships by means of a graph, where variables are represented by vertices and relationships between the variables are represented by edges. These edges can be either directed or undirected, depending upon the type of relationship being represented. The thesis investigates a number of outstanding problems in the area of statistical modelling, with particular emphasis on representing the results in terms of a graph. The thesis studies models for multivariate discrete data and, in the case of binary responses, gives some theoretical results on the relationship between two common models. In the more general setting of multivariate discrete responses, a general class of models is studied and an approximation to the maximum likelihood estimates in these models is proposed. The thesis also addresses the problem of measurement errors. An investigation into the effect that measurement error has on sample size calculations is given with respect to a general measurement error specification in both linear and binary regression models. Finally, the thesis presents, in terms of a graphical model, a re-analysis of a set of childhood growth data collected in South Wales during the 1970s. Within this analysis, a new technique is proposed that allows the calculation of derived variables under the assumption that the joint relationships between the variables are constant at each of the time points.
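
The graph representation described in the abstract above (variables as vertices, relationships as directed or undirected edges) can be sketched as a small data structure. This is only an illustrative assumption about how such a graph might be stored; the class name, method names, and variable names are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class MixedGraph:
    """A graph with both directed and undirected edges (illustrative only)."""
    vertices: set = field(default_factory=set)
    directed: set = field(default_factory=set)    # (parent, child) tuples
    undirected: set = field(default_factory=set)  # frozenset({a, b}) pairs

    def add_directed(self, parent, child):
        """Record that `child` depends on `parent`."""
        self.vertices.update((parent, child))
        self.directed.add((parent, child))

    def add_undirected(self, a, b):
        """Record a symmetric association between `a` and `b`."""
        if a == b:
            raise ValueError("self-loops are not supported in this sketch")
        self.vertices.update((a, b))
        self.undirected.add(frozenset((a, b)))

    def parents(self, v):
        """Vertices with a directed edge pointing into `v`."""
        return {p for p, c in self.directed if c == v}

    def neighbours(self, v):
        """Vertices joined to `v` by an undirected edge."""
        return {next(iter(e - {v})) for e in self.undirected if v in e}


# Hypothetical chain: an explanatory variable acts on a response through an
# intermediate variable, and two responses are symmetrically associated.
g = MixedGraph()
g.add_directed("explanatory", "intermediate")
g.add_directed("intermediate", "response_a")
g.add_undirected("response_a", "response_b")

print(g.parents("response_a"))
print(g.neighbours("response_a"))
```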
