Advances in Machine Learning for Compositional Data

Compositional data refers to simplex-valued data, or equivalently, nonnegative vectors whose totals are uninformative. This data modality is relevant across several scientific domains. A classical example of compositional data is the chemical composition of geological samples, e.g., major-oxide concentrations. A more modern example arises from the microbial populations recorded using high-throughput genetic sequencing technologies, e.g., the gut microbiome. This dissertation presents a set of methodological and theoretical contributions that advance the state of the art in the analysis of compositional data.
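
As a minimal illustration of the definition above (a sketch, not taken from the dissertation), the following Python snippet rescales nonnegative vectors onto the simplex. The function name `closure` and the sample counts are hypothetical. Two vectors that differ only in their totals map to the same composition, which is the sense in which the totals are uninformative.

```python
def closure(x):
    """Rescale a nonnegative vector so that its entries sum to 1."""
    if any(v < 0 for v in x):
        raise ValueError("entries must be nonnegative")
    total = sum(x)
    if total == 0:
        raise ValueError("total must be positive")
    return [v / total for v in x]


# Hypothetical read counts for three taxa at two sequencing depths.
shallow = [10, 30, 60]
deep = [100, 300, 600]

comp_shallow = closure(shallow)
comp_deep = closure(deep)

# Both samples describe the same point on the simplex.
same_composition = all(
    abs(a - b) < 1e-12 for a, b in zip(comp_shallow, comp_deep)
)
```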

Our work falls into two categories: problems in which compositional data represents the input to a predictive model, and problems in which it represents the output of the model. For the first class of problems, we build on the popular log-ratio framework to develop an efficient learning algorithm for high-dimensional compositional data. Our algorithm runs orders of magnitude faster than competing alternatives, without sacrificing model quality. For the second class of problems, we define a novel exponential family of probability distributions supported on the simplex. This family enjoys attractive mathematical properties and provides a performant probability model for simplex-valued outcomes. Taken together, our results constitute a broad contribution to the toolkit of researchers and practitioners studying compositional data.
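
For readers unfamiliar with the log-ratio framework mentioned above, the sketch below shows the standard centered log-ratio (clr) transform, one common member of that framework. It is not the learning algorithm developed in the dissertation, and the function name `clr` and the example composition are assumptions made for illustration.

```python
import math


def clr(x):
    """Centered log-ratio: log of each part minus the mean of the logs,
    i.e. the log of each part relative to the geometric mean."""
    if any(v <= 0 for v in x):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical three-part composition and an unnormalized copy of it.
composition = [0.1, 0.3, 0.6]
unnormalized = [1.0, 3.0, 6.0]

features = clr(composition)
features_unnormalized = clr(unnormalized)
```

The transformed coordinates sum to zero and do not change when the input is rescaled, so they can be passed to a downstream predictive model as ordinary real-valued features. Zero parts, which are common in sequencing data, must be handled before taking logarithms; this sketch simply rejects them.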

https://doi.org/10.7916/vztk-yc59

Statistics

Machine learning--Statistical methods

Geochemistry

Bacteriology

Identifier	oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/vztk-yc59
Date	January 2022
Creators	Gordon Rodriguez, Elliott
Source Sets	Columbia University
Language	English
Detected Language	English
Type	Theses
