A pattern is a relatively short sequence that represents a phenomenon in a set of sequences. Not all short sequences are patterns; only those that are statistically significant are referred to as patterns or motifs. Pattern discovery methods analyze sequences and attempt to identify and characterize meaningful patterns. This thesis extends the application of pattern discovery algorithms to a new problem domain: Single Nucleotide Polymorphism (SNP) classification.
SNPs are single base-pair (bp) variations in the genome and are probably the most common form of genetic variation. On average, about one in every thousand bp may be an SNP. The function of most SNPs, especially those not associated with protein sequence changes, remains unclear. However, genome-wide linkage analyses have associated many SNPs with conditions ranging from Crohn’s disease and cancer to quantitative traits such as height or hair color. As a result, many groups are working to predict the functional effects of individual SNPs. In contrast, very little research has examined the causes of SNPs: why do SNPs occur where they do?
This thesis addresses this question by using pattern discovery algorithms to study non-coding DNA sequences. The hypothesis is that short DNA patterns can be used to predict SNPs. For example, such patterns in the sequence surrounding an SNP might block the DNA repair mechanism at that site, thus causing the SNP to occur. To test the hypothesis, a model was developed to predict SNPs using pattern discovery methods. The results show that SNP prediction with pattern discovery methods is weak (50 ± 2% accuracy), whereas machine learning classification algorithms can achieve prediction accuracy as high as 68%. To determine whether the poor performance of pattern discovery is due to data characteristics (such as sequence length or pattern length) or to the specific biological problem (SNP prediction), a survey was conducted by profiling eight representative pattern discovery methods at multiple parameter settings on 6,754 real biological datasets. This is the first systematic review of pattern discovery methods with assessments of prediction accuracy, CPU usage and memory consumption. It was found that current pattern discovery methods do not consider positional information and perform poorly on short sequences (<150 bp), which include SNP sequences.
Therefore, this thesis proposes a new supervised pattern discovery classification algorithm, referred to as Weighted-Position Pattern Discovery and Classification (WPPDC). WPPDC is able to exploit positional information to identify positionally-enriched motifs, and to select motifs with a high information content for further classification. A tree structure is applied to WPPDC (the resulting algorithm is referred to as T-WPPDC) in order to reduce algorithmic complexity. Compared to existing pattern discovery methods, T-WPPDC not only showed consistently superior prediction accuracy but also generated patterns with positional information. Machine-learning classification methods (such as Random Forests) showed comparable prediction accuracy. However, unlike T-WPPDC, they are purely classification methods and are unable to generate SNP-associated patterns.
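The idea of positionally-enriched motifs and information content can be illustrated with a minimal Python sketch. This is not the WPPDC or T-WPPDC algorithm, whose details are not given here; the function names, the pseudocount, and the enrichment-ratio threshold are all assumptions made for illustration. It counts each k-mer separately at each offset in two sets of equal-length sequences (for example, SNP-centered versus control windows), reports the (offset, k-mer) pairs that are over-represented in the first set, and computes the standard information content of a single alignment column.

```python
import math
from collections import Counter


def positional_kmer_counts(seqs, k):
    """Count every (offset, k-mer) pair across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[(i, s[i:i + k])] += 1
    return counts


def column_information(column):
    """Information content (bits) of one DNA alignment column: 2 - entropy."""
    total = len(column)
    freqs = Counter(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in freqs.values())
    return 2.0 - entropy


def enriched_positional_kmers(pos_seqs, neg_seqs, k, min_ratio=2.0, pseudo=1.0):
    """Return (offset, k-mer) pairs over-represented in pos_seqs.

    The ratio of pseudocount-smoothed frequencies is an illustrative
    enrichment score (an assumption, not the thesis's weighting scheme).
    """
    pos = positional_kmer_counts(pos_seqs, k)
    neg = positional_kmer_counts(neg_seqs, k)
    hits = []
    for key, n_pos in pos.items():
        f_pos = (n_pos + pseudo) / (len(pos_seqs) + pseudo)
        f_neg = (neg.get(key, 0) + pseudo) / (len(neg_seqs) + pseudo)
        ratio = f_pos / f_neg
        if ratio >= min_ratio:
            hits.append((key, ratio))
    # Highest enrichment first.
    return sorted(hits, key=lambda h: -h[1])


# Toy data (hypothetical): "ACG" occurs at offset 0 only in the first set.
snp_like = ["ACGTAC", "ACGTTT", "ACGAAA"]
control = ["TTTTAC", "GGGTTT", "CCCAAA"]
top_hits = enriched_positional_kmers(snp_like, control, k=3)
```

Keeping the offset as part of the counting key is what distinguishes this from position-agnostic k-mer counting: the same k-mer at a different offset is treated as a different feature, which is one simple way to retain positional information in short sequences.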
Identifier | oai:union.ndltd.org:TORONTO/oai:tspace.library.utoronto.ca:1807/44090 |
Date | 20 March 2014 |
Creators | Yan, Rui |
Contributors | Jurisica, Igor |
Source Sets | University of Toronto |
Language | en_ca |
Detected Language | English |
Type | Thesis |