1 |
Targeted long-read sequencing of a locus under long-term balancing selection in Capsella
Bachmann, J.A., Tedder, Andrew, Laenen, B., Steige, K.A., Slotte, T. 13 September 2019
Rapid advances in short-read DNA sequencing technologies have revolutionized population genomic studies, but there are genomic regions where this technology reaches its limits. These limitations arise mostly from difficulties in assembling or aligning reads in genomic regions of high sequence divergence and high repeat content, which are typical characteristics of loci under strong long-term balancing selection. Studying genetic diversity at such loci therefore remains challenging. Here, we investigate the feasibility and error rates associated with targeted long-read sequencing of a locus under balancing selection. For this purpose, we generated bacterial artificial chromosomes (BACs) containing the Brassicaceae S-locus, a region under strong negative frequency-dependent selection which has previously proven difficult to assemble in its entirety using short reads. We sequence S-locus BACs with single-molecule long-read sequencing technology and conduct de novo assembly of these S-locus haplotypes. Comparing repeated assemblies from independent long-read sequencing runs on the same BAC clone, we do not detect any structural errors, which suggests that the assemblies are reliable, but we estimate an indel error rate of 5.7 × 10⁻⁵. A similar error rate was estimated from a comparison of Illumina short-read sequences and BAC assemblies. Our results show that, until de novo assembly of multiple individuals using long-read sequencing becomes feasible, targeted long-read sequencing of loci under balancing selection is a viable option with low error rates for single nucleotide polymorphisms and structural variation. We further find that short-read sequencing is a valuable complement, allowing correction of the relatively high rate of indel errors that results from this approach.
This study was supported by a grant from the Swedish Research Council to T.S.
|
2 |
The Systematic Design and Application of Robust DNA Barcodes
Buschmann, Tilo 19 September 2016
High-throughput sequencing technologies are improving in quality, capacity, and cost, providing versatile applications in DNA and RNA research. For small genomes or fractions of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multiplexing approach relies on a specific DNA tag, index, or barcode that is attached to the sequencing or amplification primer and hence accompanies every read. After sequencing, each read is assigned to its sample on the basis of the respective barcode sequence.
Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is detected and corrected. This can be accomplished by implementing error-correcting algorithms and codes. Such a barcoding strategy increases the total number of correctly identified reads, thus improving overall sequencing efficiency. Two popular families of error-correcting codes are Hamming codes and codes based on the Levenshtein distance.
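The two distances named above, and the way a code uses them to recover a damaged barcode, can be illustrated with a minimal sketch. The barcode set, the function names, and the `max_errors` parameter are hypothetical and chosen only for illustration; they are not taken from the thesis software.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length words")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # match or substitute
        prev = curr
    return prev[-1]


def decode(read_tag, barcodes, distance, max_errors):
    """Return the unique closest barcode within max_errors, else None."""
    scored = sorted((distance(read_tag, bc), bc) for bc in barcodes)
    best_distance, best = scored[0]
    tie = len(scored) > 1 and scored[1][0] == best_distance
    if best_distance > max_errors or tie:
        return None
    return best


# Hypothetical barcode set; every pair differs in all five positions.
BARCODES = ["ACGTA", "CATGC", "GTCAG"]

# One substitution (last base A -> T) is still assigned to the right sample.
print(decode("ACGTT", BARCODES, hamming, max_errors=1))
```

Hamming distance only counts substitutions, while Levenshtein distance also covers insertions and deletions, which is why the choice between them depends on the error profile of the platform.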
Levenshtein-based codes operate only on words of known length. Since a DNA sequence with an embedded barcode is essentially one continuous long word, application of the classical Levenshtein algorithm is problematic. In this thesis we demonstrate the decreased error correction capability of Levenshtein-based codes in a DNA context and suggest an adaptation of Levenshtein-based codes that is proven to correct nucleotide errors in DNA sequences efficiently. In our adaptation, we take any DNA context into account and impose stricter rules for the selection of barcode sets. In simulations we show the superior error correction capability of the new method compared to traditional Levenshtein-based and Hamming-based codes in the presence of multiple errors.
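Why the DNA context matters can be seen in a small worked example. The sketch below assumes a barcode read from a fixed-length window at the start of the read; a single deletion inside the barcode pulls one base of the adjacent insert into the window, so the classical Levenshtein distance reports two edits instead of one. The second function is one plausible way to make the comparison insensitive to this end effect (taking the minimum over the last row and last column of the edit matrix). It is an assumption made for illustration, not necessarily the exact definition used in the thesis, and the sequences are hypothetical.

```python
def edit_matrix(a: str, b: str):
    """Full dynamic-programming matrix of edit distances between prefixes."""
    rows = [list(range(len(b) + 1))]
    for i, x in enumerate(a, start=1):
        row = [i]
        for j, y in enumerate(b, start=1):
            row.append(min(rows[i - 1][j] + 1,              # deletion
                           row[j - 1] + 1,                  # insertion
                           rows[i - 1][j - 1] + (x != y)))  # match or substitution
        rows.append(row)
    return rows


def levenshtein(a: str, b: str) -> int:
    """Classical Levenshtein distance between two complete words."""
    return edit_matrix(a, b)[-1][-1]


def end_free_distance(a: str, b: str) -> int:
    """Edit distance that ignores surplus or missing bases at the 3' end.

    Assumed variant: the minimum over the last row and the last column of
    the edit matrix, so trailing truncation or elongation costs nothing.
    """
    m = edit_matrix(a, b)
    return min(min(m[-1]), min(row[-1] for row in m))


barcode = "ACGTA"                # hypothetical barcode
insert = "TTGCA"                 # hypothetical start of the sample DNA
read = "ACTA" + insert           # barcode with its G deleted, then the insert
window = read[:len(barcode)]     # fixed-length window: "ACTAT"

print(levenshtein(barcode, window))        # the single deletion costs 2 here
print(end_free_distance(barcode, window))  # the end-free variant reports 1
```

Under the classical distance a code designed to correct one error can therefore fail on a single deletion, which is the loss of error correction capability described above.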
We present an adaptation of Levenshtein-based codes to DNA contexts that guarantees correction of a pre-defined number of insertion, deletion, and substitution mutations. Our improved method is additionally capable of correcting, on average, more random mutations than traditional Levenshtein-based or Hamming codes. As part of this work we prepared software for the flexible generation of DNA codes based on our new approach. To adapt codes to specific experimental conditions, the user can customize sequence filtering, the number of correctable mutations, and the barcode length for highest performance.
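The three settings mentioned above (sequence filters, number of correctable mutations, barcode length) can be sketched as a simple greedy generator. This is not the thesis software: the function names, the filter thresholds, and the greedy search order are assumptions. The sketch relies on the standard coding-theory fact that a minimum pairwise distance of 2k + 1 allows k errors to be corrected, and it uses the classical Levenshtein distance as a stand-in; a context-aware distance can be passed through the `distance` parameter instead.

```python
from itertools import combinations, product


def levenshtein(a: str, b: str) -> int:
    """Classical Levenshtein distance (stand-in for an adapted distance)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def gc_ok(seq: str, low: float = 0.4, high: float = 0.6) -> bool:
    """Hypothetical filter: keep sequences with balanced GC content."""
    gc = sum(base in "GC" for base in seq) / len(seq)
    return low <= gc <= high


def no_long_homopolymer(seq: str, max_run: int = 2) -> bool:
    """Hypothetical filter: reject runs of identical bases longer than max_run."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if prev == cur else 1
        if run > max_run:
            return False
    return True


def greedy_code(length, correctable, distance=levenshtein,
                filters=(gc_ok, no_long_homopolymer)):
    """Greedily collect barcodes whose pairwise distance is >= 2k + 1."""
    min_distance = 2 * correctable + 1
    chosen = []
    for letters in product("ACGT", repeat=length):
        candidate = "".join(letters)
        if not all(check(candidate) for check in filters):
            continue
        if all(distance(candidate, kept) >= min_distance for kept in chosen):
            chosen.append(candidate)
    return chosen


code = greedy_code(length=6, correctable=1)
print(len(code), code[:5])
```

A greedy search is easy to follow but does not generally find the largest possible code; longer barcodes or fewer correctable errors yield more barcodes, which is the trade-off the user tunes.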
However, not every platform is susceptible to large numbers of both indel and substitution errors. The Illumina “Sequencing by Synthesis” platform shows a very large number of substitution errors, together with a very specific shift of the read that results in inserted and deleted bases at the 5′ end and the 3′ end (which we call phaseshifts). We argue that in this scenario Sequence-Levenshtein-based codes are inefficient: they target a category of errors that barely occurs on this platform, which needlessly reduces the code size. As a solution, we propose the “Phaseshift distance”, which exclusively supports the correction of substitutions and phaseshifts. Additionally, we enable the correction of arbitrary combinations of substitution and phaseshift errors. In this way we account for the fact that substitutions far outnumber phaseshifts on the Illumina platform.
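A phaseshift, as described above, can be pictured as the barcode window sliding by one base. The sketch below models this under the assumption that a shift towards the 3′ end inserts one base at the 5′ end and pushes the last base out of the window, and that a shift towards the 5′ end does the opposite. The helper names and sequences are hypothetical, and the Phaseshift distance itself is not reproduced here. The example only shows why a substitution-only distance handles such shifts poorly.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length words."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length words")
    return sum(x != y for x, y in zip(a, b))


def shift_right(window: str, inserted_base: str) -> str:
    """Extra base at the 5' end; the last base drops out of the window."""
    return inserted_base + window[:-1]


def shift_left(window: str, next_base: str) -> str:
    """First base lost at the 5' end; the following base enters at the 3' end."""
    return window[1:] + next_base


barcode = "ACGTA"                         # hypothetical barcode
shifted = shift_right(barcode, "T")       # "TACGT"

# A single phaseshift event looks like five substitutions to Hamming distance.
print(shifted, hamming(barcode, shifted))
```

A distance that treats one shift as a single event keeps such reads close to their true barcode without paying for general indel correction in the middle of the barcode.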
To compare codes based on the Phaseshift distance to Hamming codes as well as to codes based on the Sequence-Levenshtein distance, we simulated an experimental scenario based on the error pattern we identified on the Illumina platform. Furthermore, we generated a large number of different sets of DNA barcodes using the Phaseshift distance and compared codes of different lengths and error correction capabilities. We found that codes based on the Phaseshift distance can correct a number of errors comparable to codes based on the Sequence-Levenshtein distance while offering a number of DNA barcodes comparable to Hamming codes. Thus, codes based on the Phaseshift distance show a higher efficiency in the targeted scenario.
In some cases (e.g., with PacBio SMRT in Continuous Long Read mode), the position of the barcode and its DNA context is not well defined. Many reads start inside the genomic insert, so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required to detect barcoded reads while avoiding a large number of false positives or false negatives.
For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads and suggest possible improvements.
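For readers unfamiliar with FDR control, the following sketch shows the standard Benjamini-Hochberg procedure. It is given only as background: the abstract states that existing FDR methods cannot be applied directly to barcode detection, so this is not the adapted method of the thesis. The p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Standard step-up procedure: find the largest rank k such that the
    k-th smallest p-value is <= alpha * k / m, then reject the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected


# Made-up p-values, one per candidate read.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(p_values, alpha=0.05)
print(sum(calls), "of", len(calls), "candidates accepted")
```

The procedure bounds the expected fraction of false positives among the accepted calls, which is the balance between false positives and false negatives that the paragraph above refers to.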
|
3 |
The Systematic Design and Application of Robust DNA Barcodes
Buschmann, Tilo 02 September 2016
|