The discovery of DNA has been one of the biggest catalysts for genomic research. Sequencing has enabled us to access the wealth of information encoded in DNA and has provided the basis for ground-breaking achievements such as the first complete human genome sequence. Furthermore, it has tremendously advanced our understanding of life-threatening genetic disorders and of bacterial and viral infections. With the recent advent of next-generation sequencing (NGS) technologies, sequencing has become accessible to the majority of researchers, and metagenomic sequencing has become widely available. However, realising its true potential requires sophisticated, tailor-made bioinformatic programs to translate the collected data into meaningful information. My thesis explored the potential of resolving fine-scale variation in NGS data. The identification and correction of artificial fine-scale variation, in the form of biases and errors, is imperative to draw valid conclusions. Furthermore, resolving natural fine-scale variation, in the form of single nucleotide polymorphisms (SNPs) and closely related species or strains, is critical for the development of effective treatments and the characterisation of diseases. In recent years, Illumina has emerged as the global market leader in DNA sequencing. However, the biases and errors associated with this high-throughput sequencing technology are still poorly understood, which has precluded the development of effective noise removal algorithms. In addition, many existing programs were not designed for Illumina data or metagenomic sequencing. Therefore, a better understanding of the idiosyncrasies encountered in Illumina data is essential, and programs must be tested and benchmarked on realistic and reliable in silico data sets to reveal not only their true capacities but also their limitations. I conducted the largest in vivo study to date of Illumina error profiles in combination with state-of-the-art library preparation methods. For the first time, a direct connection between experimental design factors and systematic errors was established, providing detailed insight into the nature of Illumina errors. Further, I tested various error removal techniques and developed a sophisticated Illumina amplicon noise removal algorithm, enabling researchers to choose optimal processing strategies for their particular data sets. In addition, I devised several simulation tools that accurately reflect artificial and natural fine-scale variation. These include a flexible and efficient read simulation program, which is the only program that can directly reflect the impact of experimental design factors. Furthermore, I developed a program that simulates the evolution of a virus into a quasi-species. These programs formed the basis for two comprehensive benchmarking studies that revealed the capacities and limitations of viral haplotype reconstruction programs and taxonomic classification programs, respectively. My work furthers our knowledge of Illumina sequencing errors and will facilitate more accurate and effective analyses of sequencing data sets.
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:631005
Date | January 2014
Creators | Schirmer, Melanie |
Publisher | University of Glasgow |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://theses.gla.ac.uk/5627/ |