Mapping Bisulfite-Treated Short DNA Reads
Porter, Jacob Stuart, 23 April 2018
Epigenetic traits are stable, heritable traits that do not result from the DNA sequence. Epigenetic modification of cytosine plays a role in development and disease. The covalent bonding of a methyl group or a hydroxymethyl group to the 5-carbon of cytosine produces 5-methylcytosine or 5-hydroxymethylcytosine. Bisulfite treatment of DNA followed by PCR amplification converts unmethylated cytosine to thymine, while 5-methylcytosine, 5-hydroxymethylcytosine, and the other bases remain unchanged. The resulting sequences can be mapped to a reference genome; however, this can be challenging because of sequencing technology complexity, low sequence complexity, and the biases and errors introduced by bisulfite treatment. Once a short read is mapped, the positions of 5-methylcytosine or 5-hydroxymethylcytosine can be determined by comparing the mapped read to the aligned reference genome. Bisulfite read mapping performance can be as low as 40%. This research improves bisulfite short read mapping quality. First, reads generated with the bisulfite hairpin PCR protocol are used to study mapping failure and possible solutions. A read may fail to map to the genome, it may map uniquely, or it may map to multiple locations. Sequence complexity correlates with these mapping categories. The hairpin protocol allows the original untreated read to be recovered in some cases, and mapping this recovered read with the regular read mapper Bowtie2 improved mapper performance by 10%. New bisulfite read mapping software called BisPin was created that calls BFAST (BLAT-like Fast Accurate Search Tool) for mapping. BisPin resolves ambiguously mapped reads with a rescoring strategy, which yields a statistically significant improvement. BFAST-Gap was developed for Ion Torrent reads because Ion Torrent machines are less expensive than Illumina machines, Ion Torrent reads are longer, and few mappers exist for Ion Torrent data.
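As a minimal illustration of the conversion and comparison steps described above, the following Python sketch simulates bisulfite treatment on a single strand and then calls cytosine status from a gap-free alignment. The function names `bisulfite_convert` and `call_methylation` are hypothetical and are not part of BisPin or BFAST. The sketch ignores reverse-strand reads, indels, and sequencing errors, and because both modified forms remain unchanged under bisulfite treatment, it cannot tell 5-methylcytosine from 5-hydroxymethylcytosine.

```python
def bisulfite_convert(seq, protected_positions=frozenset()):
    """Simulate bisulfite treatment plus PCR on one strand.

    Every C outside protected_positions is read as T; a protected
    (methylated or hydroxymethylated) C stays C. Illustrative only.
    """
    return "".join(
        "T" if base == "C" and i not in protected_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read):
    """Compare a gap-free aligned read with the reference.

    Returns {position: bool} for each reference C covered by the read:
    True if the read still shows C (protected from conversion),
    False if it shows T (converted, hence unmodified).
    Any other read base is treated as an error and yields no call.
    """
    if len(reference) != len(read):
        raise ValueError("this sketch assumes an ungapped, equal-length alignment")
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True
        elif read_base == "T":
            calls[i] = False
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, protected_positions={4})
print(read)                               # ATGTCTGA
print(call_methylation(reference, read))  # {1: False, 4: True, 5: False}
```

A real bisulfite mapper must also handle reads from the reverse strand, where the conversion appears as G to A, which this sketch leaves out.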
BFAST-Gap uses homopolymer run length in contextual gap penalty functions, since homopolymer runs cause errors in Ion Torrent reads. In conjunction with BisPin, this software performed well on real and simulated bisulfite Ion Torrent data and Illumina data. InfoTrim, a read trimmer with an entropy term, was developed and gave competitive results.
Ph. D.
DNA, deoxyribonucleic acid, is a large molecule composed of four molecular bases: adenine, cytosine, thymine, and guanine. It determines heritable traits in living organisms. Sequencing DNA determines the sequential arrangement of its bases. A read is a short sequence of DNA bases. Epigenetic traits are stable, heritable traits that do not result from the DNA sequence. Chemical groups called methyl and hydroxymethyl can be attached to cytosine. These groups are an epigenetic modification of cytosine, and they play a role in disease and development. The chemical bisulfite is used to detect these groups. In bisulfite sequencing, DNA is treated with bisulfite and then sequenced. Bisulfite treatment converts cytosines that lack the methyl and hydroxymethyl groups into thymine. Software is then used to align the resulting DNA reads to a large reference DNA sequence, called a reference genome, to distinguish cytosines that carry these chemical groups from those that do not. This process is called mapping or alignment, and its performance can be as low as 40% for bisulfite data. This research improves this performance. The hairpin protocol is a known bisulfite sequencing method that sequences two opposing DNA strands, from which the original untreated strand can sometimes be recovered. Mapping the recovered strands improved performance by 10%. In the hairpin data, sequence complexity, a measure of DNA sequence randomness, correlated with mapping performance. BisPin mapping software was created that implements the hairpin recovery approach.
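One common way to quantify the sequence randomness mentioned above is the Shannon entropy of a read's base composition. The sketch below is an assumption made for illustration: the source does not state which complexity measure was used or how the InfoTrim entropy term is defined, and `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy, in bits per base, of the base composition of seq.

    0.0 means a single repeated base; 2.0 is the maximum for a
    four-letter alphabet with all bases equally frequent.
    """
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


for read in ("AAAAAAAA", "ATATATAT", "ACGTACGT"):
    print(read, shannon_entropy(read))
```

Under this measure a fully converted, low-complexity read scores near 0, while a read that uses all four bases evenly scores near 2. Bisulfite conversion removes most cytosines, which pushes reads toward lower values.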
BisPin rescores DNA reads that map to multiple locations on the reference genome, and it supports multiple sequencing technologies. BFAST-Gap, a modified mapping program callable by BisPin, uses a context-sensitive function to better align Ion Torrent reads, which tend to have errors in regions of repeated bases. BFAST-Gap was developed because Ion Torrent sequencing machines are less expensive than Illumina machines and because Ion Torrent reads tend to be longer and carry more information. The read trimmer InfoTrim was developed to trim short DNA reads and so improve the quality of alignments. These programs were validated on real and simulated DNA data and performed well.
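The idea of a context-sensitive gap penalty can be sketched as follows: a gap inside a long run of repeated bases is penalized less than a gap elsewhere, since such runs are where Ion Torrent errors concentrate. The functional form, the default values, and the names `homopolymer_run_length` and `contextual_gap_penalty` are all assumptions for illustration; the actual penalty functions in BFAST-Gap are not given in this abstract.

```python
def homopolymer_run_length(seq, pos):
    """Length of the run of identical bases in seq that contains index pos."""
    base = seq[pos]
    start = pos
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    while end + 1 < len(seq) and seq[end + 1] == base:
        end += 1
    return end - start + 1


def contextual_gap_penalty(seq, pos, base_penalty=-5.0, mildest_penalty=-1.0):
    """Hypothetical gap penalty that shrinks with homopolymer run length.

    The base penalty is divided by the run length at pos, but is never
    allowed to become milder than mildest_penalty. Scores are negative,
    so min() keeps the harsher of the two values.
    """
    run = homopolymer_run_length(seq, pos)
    return min(base_penalty / run, mildest_penalty)


seq = "ACGTTTTTAC"
print(contextual_gap_penalty(seq, 0))  # -5.0 (run of length 1)
print(contextual_gap_penalty(seq, 5))  # -1.0 (run of five T's)
```

An aligner would consult such a function when scoring an insertion or deletion, so that a miscounted homopolymer costs less than an indel in a non-repetitive context.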