DNA can be represented abstractly as a language with only four nucleotides represented by the letters A,
C, G, and T, yet the arrangement of those four letters plays a major role in determining the development of
an organism. Understanding the significance of certain arrangements of nucleotides can unlock the secrets of
how the genome achieves its essential functionality. Regions of DNA particularly enriched with cytosine (C
nucleotides) and guanine (G nucleotides), especially the CpG di-nucleotide, are frequently associated with
biological function related to gene expression, and concentrations of CpGs referred to as "CpG islands" are
known to collocate with regions upstream from gene coding sequences within the promoter region. The
pattern of occurrence of these nucleotides, relative to adenine (A nucleotides) and thymine (T nucleotides),
lends itself to analysis by machine-learning techniques such as Hidden Markov Models (HMMs) to predict
the areas of greater enrichment. HMMs have been applied to CpG island prediction before, but often without
an awareness of how the outcomes are affected by the manner in which the HMM is applied.
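
As a rough illustration of HMM-based prediction of enriched regions, the sketch below decodes a sequence with a two-state, first-order HMM using the Viterbi algorithm. The state names, every probability value, and the function name `viterbi` are assumptions chosen for illustration; they are not taken from the thesis or from CpGID 2.0, and the input is assumed to be a non-empty string over A, C, G, and T.

```python
import math

STATES = ("island", "background")

# Illustrative parameters only; none of these values come from the thesis.
START = {"island": 0.5, "background": 0.5}
TRANS = {
    "island": {"island": 0.99, "background": 0.01},
    "background": {"island": 0.01, "background": 0.99},
}
EMIT = {
    "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty seq (log space)."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for ch in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][ch])
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi("AT" * 20 + "CG" * 20 + "AT" * 20)` labels the C/G-rich middle stretch as `island` and the flanks as `background`. Changing the values in `START`, `TRANS`, or `EMIT` changes where the boundaries fall, which is one simple way to see how strongly the outcome depends on the chosen parameters.
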
Two main findings of this study are:
1. The outcome of an HMM is highly sensitive to the setting of the initial probability estimates.
2. Without the appropriate software techniques, HMMs cannot be applied effectively to large data such
as whole eukaryotic chromosomes.
Both of these factors are rarely considered by users of HMMs, but are critical to a successful application of
HMMs to large DNA sequences. In fact, these shortcomings were discovered through a close examination
of published results of CpG island prediction using HMMs and, if left unaddressed, can lead to an
incorrect implementation and application of HMM theory.
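
The abstract does not say which software techniques are meant. One well-known obstacle when HMMs are run over very long sequences, offered here only as an assumed example, is floating-point underflow: the product of many per-position probabilities becomes smaller than a double can represent. The sketch below, with hypothetical function names, contrasts direct multiplication with accumulation in log space.

```python
import math


def path_probability(per_step_prob, n_steps):
    """Multiply n_steps identical probabilities directly."""
    p = 1.0
    for _ in range(n_steps):
        p *= per_step_prob
    return p


def path_log_probability(per_step_prob, n_steps):
    """Accumulate the same quantity as a sum of logarithms."""
    total = 0.0
    for _ in range(n_steps):
        total += math.log(per_step_prob)
    return total
```

With a per-position probability of 0.25 and 10,000 positions, `path_probability` returns exactly `0.0` in double precision, while `path_log_probability` returns a finite negative number. A chromosome-scale sequence is several orders of magnitude longer, so working in log space or rescaling at each step is a common remedy.
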
A first-order HMM is developed and its performance compared to two other historical methods, the
Takai and Jones method and the UCSC method from the University of California Santa Cruz. The HMM
is then extended to a second-order model to acknowledge that pairs of nucleotides define CpG islands rather than
single nucleotides alone, and the second-order HMM is evaluated in comparison to the other methods. The
UCSC method is found to be based on properties that are not related to CpG islands, and thus is not a
fair comparison to the other methods. Of the other methods, the first-order HMM method and the Takai
and Jones method are comparable in the tests conducted, but the second-order HMM method demonstrates
superior predictive capabilities. However, these results hold only once the high sensitivity of the
outcomes to the initial estimates is taken into account and a suitable set of estimates is found that
provides the most appropriate results.
The first-order HMM is applied to the problem of producing synthetic data that simulates the characteristics
of a DNA sequence, including the specified presence of CpG islands, based on the model parameters of
a trained HMM. HMM analysis is applied to the synthetic data to explore its fidelity in generating data with
similar characteristics, as well as to validate the predictive ability of an HMM. Although this test fails to
meet expectations, a second test using a second-order HMM to produce simulated DNA data using frequency
distributions of CpG island profiles exhibits highly accurate predictions of the pre-specified CpG islands, confirming
that when the synthetic data are appropriately structured, an HMM can be an accurate predictive
tool.
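
To make the idea of synthetic data with pre-specified islands concrete, the following is a minimal sketch that draws each base from an island or a background emission distribution, depending on whether the position lies inside a requested island. The emission values, the half-open `(start, end)` interval convention, and the function name `synthesize` are assumptions for illustration, not the thesis procedure or its trained parameters.

```python
import random

# Illustrative emission probabilities only; not trained HMM parameters.
EMIT = {
    "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def synthesize(length, islands, seed=0):
    """Generate a sequence whose islands cover the given half-open intervals."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    in_island = [False] * length
    for start, end in islands:
        for i in range(max(start, 0), min(end, length)):
            in_island[i] = True
    bases = []
    for flag in in_island:
        dist = EMIT["island" if flag else "background"]
        # Draw one base according to the emission weights of the current state.
        bases.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return "".join(bases)
```

For example, `synthesize(1000, [(400, 600)])` yields a 1,000-base sequence with a C/G-enriched stretch at positions 400 to 599. Because each base is drawn independently, this sketch reproduces C+G enrichment but does not control how often C is immediately followed by G; a model conditioned on the preceding base would be needed for that.
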
One outcome of this thesis is a set of software components (CpGID 2.0 and TrackMap) capable of efficient
and accurate application of an HMM to genomic sequences, together with visualization that allows
quantitative CpG island results to be viewed in conjunction with other genomic data. CpGID 2.0 is an
adaptation of a previously published software component that has been extensively revised, and TrackMap
is a companion product that works with the results produced by the CpGID 2.0 program. Executing these
components allows one to monitor output aspects of the computational model such as the number and size of the
predicted CpG islands, including their CG content percentage and level of CpG frequency. These outcomes
can then be related to the input values used to parameterize the HMM.
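
The two per-island quantities mentioned above can be computed directly from a sequence. The sketch below assumes that "level of CpG frequency" refers to the widely used observed/expected CpG ratio, (number of CpG × length) / (number of C × number of G); the abstract does not state the exact definition used by CpGID 2.0, and the function name `cpg_stats` is hypothetical.

```python
def cpg_stats(seq):
    """Return (CG content in percent, observed/expected CpG ratio) for seq."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # occurrences of C immediately followed by G
    gc_percent = 100.0 * (c + g) / n if n else 0.0
    # The ratio is undefined when C or G is absent; report 0.0 in that case.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_percent, obs_exp
```

For instance, `cpg_stats("CGCGCGCG")` gives a CG content of 100.0 and a ratio of 2.0, while `cpg_stats("ATAT")` gives 0.0 for both.
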
Identifier | oai:union.ndltd.org:USASK/oai:ecommons.usask.ca:10388/ETD-2013-04-1030 |
Date | 2013 April 1900 |
Contributors | Kusalik, Tony, Harkness, Troy |
Source Sets | University of Saskatchewan Library |
Language | English |
Detected Language | English |
Type | text, thesis |