Return to search

Generalized pattern matching applied to genetic analysis. / 通用性模式匹配在基因序列分析中的應用 / CUHK electronic theses & dissertations collection / Digital dissertation consortium / Tong yong xing mo shi pi pei zai ji yin xu lie fen xi zhong de ying yong

Approximate pattern matching problem is, given a reference sequence T, a pattern (query) Q, and a maximum allowed error e, to find all the substrings in the reference, such that the edit distance between the substrings and the pattern is smaller than or equal to the maximum allowed error. Though it is a well-studied problem in Computer Science, it gains a resurrection in Bioinformatics in recent years, largely due to the emergence of the next-generation high-throughput sequencing technologies. This thesis contributes in a novel generalized pattern matching framework, and applies it to solve pattern matching problems in general and alternative splicing detection (AS) in particular. AS is to map a large amount of next-generation sequencing short reads data to a reference human genome, which is the first and an important step in analyzing the sequenced data for further Biological analysis. The four parts of my research are as follows. / In the first part of my research work, we propose a novel deterministic pattern matching algorithm which applies Agrep, a well-known bit-parallel matching algorithm, to a truncated suffix array. Due to the linear cost of Agrep, the cost of our approach is linear to the number of characters processed in the truncated suffix array. We analyze the matching cost theoretically, and .obtain empirical costs from experiments. We carry out experiments using both synthetic and real DNA sequence data (queries) and search them in Chromosome-X of a reference human genome. The experimental results show that our approach achieves a speed-up of several magnitudes over standard Agrep algorithm. / In the fourth part, we focus on the seeding strategies for alternative splicing detection. We review the history of seeding-and-extending (SAE), and assess both theoretically and empirically the seeding strategies adopted in existing splicing detection tools, including Bowtie's heuristic and ABMapper's exact seedings, against the novel complementary quad-seeding strategy we proposed and the corresponding novel splice detection tool called CS4splice, which can handle inexact seeding (with errors) and all 3 types of errors including mismatch (substitution), insertion, and deletion. We carry out experiments using short reads (queries) of length 105bp comprised of several data sets consisting of various levels of errors, and align them back to a reference human genome (hg18). On average, CS4splice can align 88. 44% (recall rate) of 427,786 short reads perfectly back to the reference; while the other existing tools achieve much smaller recall rates: SpliceMap 48.72%, MapSplice 58.41%, and ABMapper 51.39%. The accuracies of CS4splice are also the highest or very close to the highest in all the experiments carried out. But due to the complementary quad-seeding that CS4splice use, it takes more computational resources, about twice (or more) of the other alternative splicing detection tools, which we think is practicable and worthy. / In the second part, we define a novel generalized pattern (query) and a framework of generalized pattern matching, for which we propose a heuristic matching algorithm. Simply speaking, a generalized pattern is Q 1G1Q2 ... Qc--1Gc--1 Qc, which consists of several substrings Q i and gaps Gi occurring in-between two substrings. The prototypes of the generalized pattern come from several real Biological problems that can all be modeled as generalized pattern matching problems. Based on a well-known seeding-and-extending heuristic, we propose a dual-seeding strategy, with which we solve the matching problem effectively and efficiently. We also develop a specialized matching tool called Gpattern-match. We carry out experiments using 10,000 generalized patterns and search them in a reference human genome (hg18). Over 98.74% of them can be recovered from the reference. It takes 1--2 seconds on average to recover a pattern, and memory peak goes to a little bit more than 1G. / In the third part, a natural extension of the second part, we model a real biological problem, alternative splicing detection, into a generalized pattern matching problem, and solve it using a proposed bi-directional seeding-and-extending algorithm. Different from all the other tools which depend on third-party tools, our mapping tool, ABMapper, is not only stand-alone but performs unbiased alignments. We carry out experiments using 427,786 real next-generation sequencing short reads data (queries) and align them back to a reference human genome (hg18). ABMapper achieves 98.92% accuracy and 98.17% recall rate, and is much better than the other state-of-the-art tools: SpliceMap achieves 94.28% accuracy and 78.13% recall rate;while TopHat 88.99% accuracy and 76.33% recall rate. When the seed length is set to 12 in ABMapper, the whole searching and alignment process takes about 20 minutes, and memory peak goes to a little bit more than 2G. / Ni, Bing. / Adviser: Kwong-Sak Leung. / Source: Dissertation Abstracts International, Volume: 73-06, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical referencesTexture mapping (leaves 151-161). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.

Identiferoai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_344817
Date January 2011
ContributorsNi, Bing., Chinese University of Hong Kong Graduate School. Division of Computer Science and Engineering.
Source SetsThe Chinese University of Hong Kong
LanguageEnglish, Chinese
Detected LanguageEnglish
TypeText, theses
Formatelectronic resource, microform, microfiche, 1 online resource (xxi, 161 leaves : ill. (some col.))
RightsUse of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Page generated in 0.0021 seconds