Computing the similarity of two strings or sequences is one of the most important fundamental in computer field, and it has been widely studied for several decades.
In the last decade, it gained the researchers' attentions again because of the improvements of the hardware computation ability and the presence of huge amount of data in biotechnology.
In this dissertation, we pay attention to computing the edit distance between two sequences where the block-edit operations are involved in addition to the character-edit operations.
Previous researches show that this problem is NP-hard if recursive block moves are allowed.
Since we are interested in solving the editing problems by the polynomial-time optimization algorithms, we consider the simplified version of the edit distance problem.
We first focus on the longest common subsequence (LCS) of run-length encoded (RLE) strings, where the runs can be seen as a class of simplified blocks.
Then, we apply constraints to the problem, i.e. to find the constrained LCS (CLCS) of RLE strings.
Besides, we show that the problems which involve block-edit operations can still be solved by the polynomial-time optimization algorithms if some restrictions are applied.
Let X and Y be two sequences of lengths n and m, respectively.
Also, let N and M, be the numbers of runs in the corresponding RLE forms of X and Y, respectively.
In this dissertation, first, we propose a simple algorithm for computing the LCS of X and Y in O(NM + min{ p_1, p_2 }) time, where p_1 and p_2 denote the numbers of elements in the bottom and right boundaries of the matched blocks, respectively.
This new algorithm improves the previously known time bound O(min{nM, Nm}) and outperforms the time bounds O(NM log NM) or O((N+M+q) log (N+M+q)) for some cases, where q denotes the number of matched blocks.
Next, we give an efficient algorithm for solving the CLCS problem, which is to find a common subsequences Z of X and Y such that a given constrained sequence P is a subsequence of Z and the length of Z is maximized.
Suppose X, Y and P are all in RLE format, and the lengths of X, Y and P are n, m and r, respectively.
Let N, M and R be the numbers of runs in X, Y, and P, respectively.
We show that by RLE, the CLCS problem can be solved in O(NMr + min{q_1 r + q_4, q_2 r + q_5 }) time, where q_1 and q_2 denote the numbers of elements in the south and east boundaries of the partially matched blocks on the first layer, respectively, and q_4 and q_5 denote the numbers of elements of the west and north pillars in the bottom boundaries of all fully matched cuboids in the DP lattice, respectively.
When the input strings have good compression ratios, our work obviously outperforms the previously known DP algorithms and the Hunt-Szymanski-like algorithms.
Finally, we consider variations of the block edit distance problem that involve character insertions, character deletions, block copies and block deletions, for two given sequences X and Y.
In this dissertation, three variations are defined with different measuring functions, which are P(EIS, C), P(EI, L) and P(EI, N).
Then we show that with some preprocessing, the minimum block edit distances of these three variations can be obtained by dynamic programming in O(nm), O(nm log m) and O(nm^2) time, respectively, where n and m are the lengths of X and Y.
Identifer | oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0518110-225424 |
Date | 18 May 2010 |
Creators | Ann, Hsing-Yen |
Contributors | Chin-Wen Ho, M. S. Chang, Tzung-Pei Hong, Chang-Biau Yang, Yue-Li Wang, Jung-Hsien Chiang, Kun-Mao Chao, R. C. T. Lee |
Publisher | NSYSU |
Source Sets | NSYSU Electronic Thesis and Dissertation Archive |
Language | English |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0518110-225424 |
Rights | off_campus_withheld, Copyright information available at source archive |
Page generated in 0.0023 seconds