Return to search

Reference-free identification of genetic variation in metagenomic sequence data using a probabilistic model

Microorganisms are an indispensable part of our ecosystem, yet the natural metabolic and ecological diversity of these organisms is poorly understood due to a historical reliance of microbiology on laboratory grown cultures. The awareness that this diversity cannot be studied by laboratory isolation, together with recent advances in low cost scalable sequencing technology, have enabled the foundation of culture-independent microbiology, or metagenomics. The study of environmental microbial samples with metagenomics has led to many advances, but a number of technological and methodological challenges still remain. A potentially diverse set of taxa may be represented in anyone environmental sample. Existing tools for representing the genetic composition of such samples sequenced with short-read data, and tools for identifying variation amongst them, are still in their infancy. This thesis makes the case that a new framework based on a joint-genome graph can constitute a powerful tool for representing and manipulating the joint genomes of population samples. I present the development of a collection of methods, called SCRAPS, to construct these efficient graphs in small communities without the availability or bias of a reference genome. A key novelty is that genetic variation is identified from the data structure using a probabilistic algorithm that can provide a measure of the confidence in each call. SCRAPS is first tested on simulated short read data for accuracy and efficiency. At least 95% of non-repetitive small-scale genetic variation with a minor allele read depth greater than 10x is correctly identified; the number false positives per conserved nucleotide is consistently better than 1 part in 333 x 103. SCRAPS is then applied to artificially pooled experimental datasets. As part of this study, SCRAPS is used to identify genetic variation in an epidemiological 11 sample Neisseria meningitidis dataset collected from the African meningitis belt". In total 14,000 sites of genetic variation are identified from 48 million Illumina/Solexa reads. The results clearly show the genetic differences between two waves of infection that has plagued northern Ghana and Burkina Faso.

Identiferoai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:561121
Date January 2012
CreatorsAhiska, Bartu
ContributorsMcVean, Gilean
PublisherUniversity of Oxford
Source SetsEthos UK
Detected LanguageEnglish
TypeElectronic Thesis or Dissertation

Page generated in 0.0024 seconds