Many machine learning (ML) techniques rely on probability, random variables, and stochastic modeling. Although statistics pervades this field, there is a large disconnect between the copula modeling and the machine learning communities. Copulas are stochastic models that capture the full dependence structure between random variables and allow flexible modeling of multivariate joint distributions. Elidan was the first to recognize this disconnect, and introduced copula-based models to the ML community that demonstrated orders of magnitude better performance than non-copula-based models [Elidan, 2013]. However, these models apply only to continuous random variables, whereas real-world data is often naturally modeled as jointly continuous and discrete. This report details our work in bridging this gap by using copulas to model and analyze data that is jointly continuous and discrete.
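For reference, the sense in which a copula captures the dependence structure is usually stated through Sklar's theorem. The notation below (joint distribution function H, marginals F_1, ..., F_d, copula C) is the standard textbook one and is an assumption here, not taken from the abstract.

```latex
% Sklar's theorem: any joint CDF H with marginal CDFs F_1, ..., F_d
% can be written through a copula C on the unit hypercube.
H(x_1, \dots, x_d) = C\bigl(F_1(x_1), \dots, F_d(x_d)\bigr),
\qquad C : [0,1]^d \to [0,1].
% If all marginals are continuous, C is unique. Otherwise C is uniquely
% determined only on \operatorname{Ran}(F_1) \times \dots \times \operatorname{Ran}(F_d).
```

The loss of uniqueness for non-continuous marginals is the standard reason discrete variables need separate treatment in copula models.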
Our first research contribution details modeling of jointly continuous and discrete random variables using the copula framework with Bayesian networks, termed Hybrid Copula Bayesian Networks (HCBN) [Karra and Mili, 2016], a continuation of Elidan’s work on Copula Bayesian Networks [Elidan, 2010]. In this work, we extend the theorems proved by Nešlehová [2007] from bivariate to multivariate copulas with discrete and continuous marginal distributions. Using the multivariate copula with discrete and continuous marginal distributions as a theoretical basis, we construct an HCBN that can model all possible permutations of discrete and continuous random variables for parent and child nodes, unlike the popular conditional linear Gaussian network model. Finally, we demonstrate on numerous synthetic datasets and a real-life dataset that our HCBN compares favorably, from a modeling and flexibility viewpoint, to other hybrid models including the conditional linear Gaussian and the mixture of truncated exponentials models.
Our second research contribution deals with the analysis side, and discusses how one may use copulas for exploratory data analysis. To this end, we introduce a nonparametric copula-based index for detecting the strength and monotonicity structure of linear and nonlinear statistical dependence between pairs of random variables or stochastic signals. Our index, termed Copula Index for Detecting Dependence and Monotonicity (CIM), satisfies several desirable properties of measures of association, including Rényi’s properties, the data processing inequality (DPI), and consequently self-equitability. Synthetic data simulations reveal that the statistical power of CIM compares favorably to other state-of-the-art measures of association that are proven to satisfy the DPI. Simulation results with real-world data reveal CIM’s unique ability to detect the monotonicity structure among stochastic signals and thereby find interesting dependencies in large datasets. Additionally, simulations show that CIM compares favorably to estimators of mutual information when discovering Markov network structure.
Our third research contribution deals with how to assess an estimator’s performance in the scenario where multiple estimates of the strength of association between random variables need to be rank ordered. More specifically, we introduce a new property of estimators of the strength of statistical association, which helps characterize how well an estimator will perform in scenarios where dependencies between continuous and discrete random variables need to be rank ordered. The new property, termed the estimator response curve, is easily computable and provides a marginal-distribution-agnostic way to assess an estimator’s performance. It overcomes notable drawbacks of current metrics of assessment, including statistical power, bias, and consistency. We utilize the estimator response curve to test various measures of the strength of association that satisfy the data processing inequality (DPI), and show that the CIM estimator’s performance compares favorably to the kNN, vME, AP, and HMI estimators of mutual information. The estimators that the estimator response curve identifies as suboptimal perform worse than the better-ranked estimators when tested with real-world data from four different areas of science, all with varying dimensionalities and sizes.

Ph. D.

Many machine learning (ML) techniques rely on probability, random variables, and stochastic modeling. Although statistics pervades this field, many traditional machine learning techniques rely on linear statistical techniques and models. For example, the correlation coefficient, a widely used construct in modern data analysis, is only a measure of linear dependence and cannot fully capture non-linear interactions. In this dissertation, we aim to address some of these gaps, and how they affect machine learning performance, using the mathematical construct of copulas.
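The point about the correlation coefficient can be illustrated with a minimal sketch. This is not taken from the dissertation: the data, the helper name `pearson`, and the choice of a quadratic relationship are assumptions made purely for illustration. It compares the Pearson correlation of a linear relationship with that of a deterministic but non-linear one.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two 1-D arrays (hypothetical helper)."""
    return float(np.corrcoef(x, y)[0, 1])


# Illustrative data: a grid that is symmetric about zero.
x = np.linspace(-1.0, 1.0, 201)

# y is fully determined by x in both cases; only the shape differs.
y_linear = 2.0 * x + 1.0
y_quadratic = x ** 2

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)

print(f"linear relationship:    r = {r_linear:.3f}")
print(f"quadratic relationship: r = {r_quadratic:.3f}")
```

In this sketch the quadratic case yields a correlation of essentially zero even though y is a deterministic function of x, which is the kind of non-linear dependence a purely linear measure misses.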
Our first contribution deals with accurate probabilistic modeling of real-world data, where the underlying data is both continuous and discrete. We show that even though the copula construct has some limitations with respect to discrete data, it is still amenable to modeling large real-world datasets probabilistically. Our second contribution deals with analysis of non-linear datasets. Here, we develop a new measure of statistical association that can handle discrete, continuous, or combinations of such random variables that are related by any general association pattern. We show that our new metric satisfies several desirable properties and compare its performance to other measures of statistical association. Our final contribution attempts to provide a framework for understanding how an estimator of statistical association will affect end-to-end machine learning performance. Here, we develop a new way to characterize the performance of an estimator of statistical association, termed the estimator response curve. We then show that the estimator response curve can help predict how well an estimator performs in algorithms which require statistical associations to be rank ordered.
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/85110 |
Date | 21 September 2018 |
Creators | Karra, Kiran |
Contributors | Electrical Engineering, Mili, Lamine M., Clancy, Thomas Charles III, Ramakrishnan, Naren, Yu, Guoqiang, Raman, Sanjay |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Dissertation |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |