High-throughput sequencing technologies are improving in quality, capacity, and costs, providing versatile applications in DNA and RNA research. For small genomes or fraction of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multiplexing approach relies on a specific DNA tag, index, or barcode that is attached to the sequencing or amplification primer and hence accompanies every read. After sequencing, each sample read is identified on the basis of the respective barcode sequence.
Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is revealed and corrected. This can be accomplished by implementing error correcting algorithms and codes. This barcoding strategy increases the total number of correctly identified samples, thus improving overall sequencing efficiency. Two popular sets of error-correcting codes are Hamming codes and codes based on the Levenshtein distance.
Levenshtein-based codes operate only on words of known length. Since a DNA sequence with an embedded barcode is essentially one continuous long word, application of the classical Levenshtein algorithm is problematic. In this thesis we demonstrate the decreased error correction capability of Levenshtein-based codes in a DNA context and suggest an adaptation of Levenshtein-based codes that is proven of efficiently correcting nucleotide errors in DNA sequences. In our adaptation, we take any DNA context into account and impose more strict rules for the selection of barcode sets. In simulations we show the superior error correction capability of the new method compared to traditional Levenshtein and Hamming based codes in the presence of multiple errors.
We present an adaptation of Levenshtein-based codes to DNA contexts capable of guaranteed correction of a pre-defined number of insertion, deletion, and substitution mutations. Our improved method is additionally capable of correcting on average more random mutations than traditional Levenshtein-based or Hamming codes. As part of this work we prepared software for the flexible generation of DNA codes based on our new approach. To adapt codes to specific experimental conditions, the user can customize sequence filtering, the number of correctable mutations and barcode length for highest performance.
However, not every platform is susceptible to a large number of both indel and substitution errors. The Illumina “Sequencing by Synthesis” platform shows a very large number of substitution errors as well as a very specific shift of the read that results in inserted and deleted bases at the 5’-end and the 3’-end (which we call phaseshifts). We argue in this scenario that the application of Sequence-Levenshtein-based codes is not efficient because it aims for a category of errors that barely occurs on this platform, which reduces the code size needlessly. As a solution, we propose the “Phaseshift distance” that exclusively supports the correction of substitutions and phaseshifts. Additionally, we enable the correction of arbitrary combinations of substitution and phaseshift errors. Thus, we address the lopsided number of substitutions compared to phaseshifts on the Illumina platform.
To compare codes based on the Phaseshift distance to Hamming Codes as well as codes based on the Sequence-Levenshtein distance, we simulated an experimental scenario based on the error pattern we identified on the Illumina platform. Furthermore, we generated a large number of different sets of DNA barcodes using the Phaseshift distance and compared codes of different lengths and error correction capabilities. We found that codes based on the Phaseshift distance can correct a number of errors comparable to codes based on the Sequence-Levenshtein distance while offering the number of DNA barcodes comparable to Hamming codes. Thus, codes based on the Phaseshift distance show a higher efficiency in the targeted scenario. In some cases (e.g., with PacBio SMRT in Continuous Long Read mode), the position of the barcode and DNA context is not well defined. Many reads start inside the genomic insert so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required in order to detect barcoded reads and avoid a large number of false positives or negatives.
For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads as well as suggest possible improvements.
Identifer | oai:union.ndltd.org:DRESDEN/oai:qucosa.de:bsz:15-qucosa-209812 |
Date | 19 September 2016 |
Creators | Buschmann, Tilo |
Contributors | Universität Leipzig, Fakultät für Mathematik und Informatik, Dr. Leonid V. Bystrykh, Prof. Dr. Ivo Große, Prof. Dr. Peter Stadler |
Publisher | Universitätsbibliothek Leipzig |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | doc-type:doctoralThesis |
Format | application/pdf |
Page generated in 0.0023 seconds