In high-diversity populations, a complete accounting of de novo mutations can be difficult to obtain. Most analyses involve identifying such mutations by sequencing pedigrees on second-generation sequencing platforms and aligning the short reads to a reference assembly, the genomic sequence of a canonical member (or members) of a species. Often, large regions of the genomes under study may be greatly diverged from the reference sequence, or not represented at all (e.g. the HLA, antigenic genes, or other regions under balancing selective pressure). If the haplotypic background upon which a mutation occurs is absent, events can easily be missed (as reads have nowhere to align) and false-positives may abound (as the software forces the reads to align elsewhere). This thesis presents a novel method for de novo mutation discovery and allele identification. Rather than relying on alignment, our method is based on the de novo assembly of short-read sequence data using a multi-color de Bruijn graph. In this data structure, each sample is assigned a unique index (or "color"), reads from each sample are decomposed into smaller subsequences of length k (or "kmers"), and color-specific adjacency information between kmers is recorded. Mutations can be discovered in the graph itself by searching for characteristic motifs (e.g. a "bubble motifs", indicative of a SNP or indel, and "linear motifs" indicative of allelic and non-allelic recombination). De novo mutations differ from inherited mutations in that the kmers spanning the variant allele are absent in the parents; in a sense, they facilitate their own discovery by generating "novel" sequence. We exploit this fact to limit processing of the graph to only those regions containing these novel kmers. We verified our approach using simulations, validation, and visualization. On the simulations, we developed genome and read generation software driven by empirical distributions computed from real data to emit genomes with realistic features: recombinations, de novo variants, read fragment sizes, sequencing errors, and coverage profiles. In 20 artifical samples, we determined our sensitivity and specificity for novel kmer recovery to be approximately 98% and 100% at worst, respectively. Not every novel stretch can be reconstituted as a variant, owing to errors and homology in the graph. In simulations, our false discovery rate was 10% for "bubble" events and 12% for "linear" events. On validation, we obtained a high-quality draft assembly for a single P. falciparum child using a third-generation sequencing platform. We discovered three de novo events in the draft assembly, all three of which are recapitulated in our calls on the second-generation sequencing data for the same sample; no false-positives are present. On visualization, we developed an interactive web application capable of rendering a multi-color subgraph that assists in visually distinguishing between true variation and sequencing artifacts. We applied our caller to real datasets: 115 progeny across four previously analyzed experimental crosses of Plasmodium falciparum. We demonstrate our ability to access subtelomeric compartments of the genome, regions harboring antigenic genes under tremendous selective pressure, thus highly divergent between geographically distinct isolates and routinely masked and ignored in reference-based analyses. We also show our caller's ability to recover an important form of structural de novo variation: non-allelic homologous recombination (NAHR) events, an important mechanism for the pathogen to diversify its own antigenic repertoire. We demonstrate our ability to recover the few events in these samples known to exist, and overturn some previous findings indicating exchanges between "core" (non-subtelomeric) genes. We compute the SNP mutation rate to be approximately 2.91 per sample, insertion and deletion mutation rates to be 0.55 and 1.04 per sample, respectively, multi-nucleotide polymorphisms to be 0.72 per sample, and NAHR events to be 0.33 per sample. These findings are consistent across crosses. Finally, we investigated our method's scaling capabilities by processing a quintet of previously analyzed Pan troglodytes verus (western chimpanzee) samples. The genome of the chimpanzee is two orders of magnitude larger than the malaria parasite's (3, 300 Mbp versus 23 Mbp), diploid rather than haploid, poorly assembled, and the read dataset is lower coverage (20x versus 120x). Comparing to Sequenom validation data as well as visual validation, our sensitivity is expectedly low. However, this can be attributed to overaggressiveness in data cleaning applied by the de novo assembler atop which our software is built. We discuss the precise changes that would likely need to be made in future work to adapt our method to low-coverage samples.
Identifer | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:757671 |
Date | January 2016 |
Creators | Garimella, Kiran |
Contributors | McVean, Gil |
Publisher | University of Oxford |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://ora.ox.ac.uk/objects/uuid:5153dbfa-bf2b-4bcc-a5fd-71492b2b2138 |
Page generated in 0.002 seconds