Next-generation sequencing is revolutionising genetics: base-by-base information for the whole genome is now available for large samples of individuals. This type of data is becoming commonly used and will continue to be in the near future. One of the first questions that arises is the identification of novel variants and, subsequently, genotype calling for the individuals in the sample. However, given the cost of sequencing, most projects so far have sequenced individuals at low to medium coverage. In this thesis, we present two distinct methods for SNP and genotype calling from low-coverage sequencing data, TreeCall and MVNcall, that combine sequencing and linkage disequilibrium (LD) information. We begin by describing the pipeline for next-generation sequencing analysis and existing methods for SNP and genotype calling using low-coverage sequencing information. Subsequently, we present the two novel LD-based methods. Both methods assume a study design in which the individuals are both genotyped and sequenced at low coverage. The genotypes are used to construct a haplotype scaffold, from which the LD information is extracted either by constructing genealogical trees (TreeCall) or by approximating a window of contiguous SNPs of the scaffold with a multivariate normal distribution (MVNcall). Both methods have been applied to real datasets from the 1000 Genomes Project and compared with other LD-based methods applied to the same datasets, mainly in terms of genotype calling and phasing. Whereas TreeCall gives lower genotype concordance rates than the other methods, MVNcall provides the highest genotype concordance rates for a dataset with a small sample size (the low-coverage pilot of the 1000 Genomes Project).
When applied to a larger dataset (Phase 1 of the 1000 Genomes Project), MVNcall achieves an overall genotype discordance rate of 0.58%, compared with 0.57% for SNPTools, 0.56% for Thunder, and 0.61% for BEAGLE (comparison based on the Axiom chip). The main advantage of MVNcall is phasing accuracy: by using a haplotype scaffold, especially one phased using pedigree information, it provides accurate haplotypes. MVNcall is also extended to incorporate trio information for genotype calling. An experiment on a deeply sequenced trio yields an accurate set of haplotypes for the trio, with switch error rates as low as ~0.28 for the parents and ~0.12 for the offspring.
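The two evaluation metrics quoted in the abstract, genotype discordance and switch error rate, can be illustrated with a minimal Python sketch. It uses common textbook definitions and is not taken from the thesis: the function names, the 0/1/2 genotype coding, and the treatment of missing or mismatched sites are all assumptions made for illustration, and the thesis's own evaluation procedure may differ.

```python
from typing import Optional, Sequence, Tuple


def genotype_discordance(called: Sequence[Optional[int]],
                         truth: Sequence[Optional[int]]) -> float:
    """Fraction of sites whose called genotype differs from the reference call.

    Genotypes are assumed to be coded as 0, 1 or 2 copies of the alternate
    allele; sites missing (None) in either input are skipped.
    """
    pairs = [(c, t) for c, t in zip(called, truth)
             if c is not None and t is not None]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(c != t for c, t in pairs) / len(pairs)


def switch_error_rate(inferred: Sequence[Tuple[int, int]],
                      truth: Sequence[Tuple[int, int]]) -> float:
    """Switches between consecutive heterozygous sites, divided by the number
    of such consecutive pairs.

    Each site is a phased pair of alleles (haplotype 1, haplotype 2).
    Homozygous sites carry no phase information and are skipped, as are
    sites where the inferred genotype does not match the true genotype
    (a simplifying assumption of this sketch).
    """
    orientations = []
    for (ia, ib), (ta, tb) in zip(inferred, truth):
        if ta == tb:
            continue  # homozygous in the truth set: no phase to compare
        if (ia, ib) == (ta, tb):
            orientations.append(True)   # same orientation as the truth
        elif (ia, ib) == (tb, ta):
            orientations.append(False)  # flipped relative to the truth
        # otherwise the genotypes disagree, so the site is ignored
    if len(orientations) < 2:
        raise ValueError("need at least two comparable heterozygous sites")
    switches = sum(a != b for a, b in zip(orientations, orientations[1:]))
    return switches / (len(orientations) - 1)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    print(genotype_discordance([0, 1, 2, 1], [0, 1, 1, 1]))
    true_haps = [(0, 1), (1, 0), (0, 0), (0, 1), (1, 0)]
    inferred_haps = [(0, 1), (1, 0), (0, 0), (1, 0), (0, 1)]
    print(switch_error_rate(inferred_haps, true_haps))
```

Both functions return a proportion between 0 and 1; multiplying by 100 gives a percentage comparable in form to the discordance figures quoted above.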
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:669933 |
Date | January 2012 |
Creators | Menelaou, Androniki |
Contributors | Marchini, Jonathan |
Publisher | University of Oxford |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://ora.ox.ac.uk/objects/uuid:2093d498-3e7f-4648-9fde-fcdb311849de |