Probabilistic Modeling for Whole Metagenome Profiling

To address the shortcomings of existing Markov model implementations in handling large amounts of metagenomic data with comparable or better classification accuracy, we developed a new algorithm based on a pseudo-count supplemented standard Markov model (SMM), which leverages the power of higher-order models to classify reads more robustly at different taxonomic levels. Assessment on simulated metagenomic datasets demonstrated that, overall, SMM was more accurate than the interpolated methods in classifying reads to their respective taxa at all ranks. Higher-order SMMs (9th order or greater) also outperformed BLAST alignments in assigning taxonomic labels to metagenomic reads at different taxonomic ranks (genus and higher) in tests that masked the species from which each read originated (its genome model) in the database. Similar results were obtained by masking at other taxonomic ranks to simulate plausible scenarios in which the source of a read is not represented at different taxonomic levels in the genome database. The performance gap became more pronounced at higher taxonomic levels. To eliminate contamination in datasets and to further improve our alignment-free approach, we developed a new framework based on a genome segmentation and clustering algorithm. This framework allowed removal of adapter sequences and contaminant DNA, as well as generation of clusters of similar segments, which were then used to sample representative read fragments to constitute training datasets. The parameters of a logistic regression model were learned from these training datasets using a Bayesian optimization procedure, which allowed us to establish thresholds for classifying metagenomic reads by SMM. This work led to the development of a Python-based frontend, named POSMM (Python Optimized Standard Markov Model), that combines our SMM algorithm with the logistic regression optimization. POSMM provides a much-needed alternative to existing metagenome profiling programs.
Our algorithm builds the genome models on the fly and thus obviates the need to build a database; it complements alignment-based classification and can be used in concert with alignment-based classifiers to raise the bar in metagenome profiling.
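
As an illustration only, the general idea of a pseudo-count supplemented Markov model classifier can be sketched in Python as follows. This is not the POSMM implementation: the function names, the pseudo-count value of 1.0, the toy genomes, and the choice to ignore the probability of the first k-mer are all assumptions made for this sketch, and reads are assumed to contain only A, C, G, and T.

```python
import math

ALPHABET = "ACGT"


def count_transitions(genome, order):
    """Count how often each base follows each length-`order` context."""
    counts = {}
    for i in range(len(genome) - order):
        context = genome[i:i + order]
        base = genome[i + order]
        # Skip windows containing ambiguous symbols such as N.
        if base not in ALPHABET or any(c not in ALPHABET for c in context):
            continue
        per_context = counts.setdefault(context, {})
        per_context[base] = per_context.get(base, 0) + 1
    return counts


def log_transition_prob(counts, context, base, pseudocount=1.0):
    """Log P(base | context) with a pseudo-count added to every base count.

    Unseen contexts fall back to a uniform distribution instead of
    producing a zero probability.
    """
    seen = counts.get(context, {})
    total = sum(seen.values()) + pseudocount * len(ALPHABET)
    return math.log((seen.get(base, 0) + pseudocount) / total)


def score_read(read, counts, order, pseudocount=1.0):
    """Sum of log transition probabilities along the read.

    Assumption of this sketch: the probability of the initial
    k-mer is ignored.
    """
    return sum(
        log_transition_prob(counts, read[i - order:i], read[i], pseudocount)
        for i in range(order, len(read))
    )


def classify_read(read, models, order, pseudocount=1.0):
    """Return the name of the genome model that scores the read highest."""
    return max(
        models,
        key=lambda name: score_read(read, models[name], order, pseudocount),
    )


# Toy usage: models are built directly from the genome sequences.
order = 2
genomes = {"genome_a": "ACGT" * 50, "genome_b": "AATT" * 50}
models = {name: count_transitions(seq, order) for name, seq in genomes.items()}
print(classify_read("ACGTACGTACGT", models, order))
```

In this sketch the models are computed directly from the supplied sequences at run time, which loosely mirrors the on-the-fly model building described above. A higher order (the abstract reports 9th order or greater performing well) would simply use longer contexts, at the cost of many more unseen contexts that rely on the pseudo-count.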

Markovian Jensen-Shannon Divergence

Segmentation

Clustering

Identifier	oai:union.ndltd.org:unt.edu/info:ark/67531/metadc1808381
Date	05 1900
Creators	Burks, David
Contributors	Azad, Rajeev, Allen, Michael, 1971-, Antunes, Mauricio S., Padilla, Pamela Anne, Shulaev, Vladimir
Publisher	University of North Texas
Source Sets	University of North Texas
Language	English
Detected Language	English
Type	Thesis or Dissertation
Format	xiv, 115 pages : illustrations (some color), Text
Rights	Public, Burks, David, Copyright, Copyright is held by the author, unless otherwise noted. All rights Reserved.
