1 |
A Survey of the Longest Common Subsequence Problem and Its Related Problems
Chen, Jian-bein, 12 September 2005
ABSTRACT
The longest common subsequence (LCS) problem is a classical problem in both combinatorial optimization and computational biology. During the past few decades, a considerable number of studies have focused on the LCS problem and its related problems. In this thesis, we shall present a complete survey of the LCS problem and its related problems, and review some efficient algorithms for solving these problems. We shall also give the definition of each problem, present some theorems, and state the time complexity and space complexity of the algorithms for solving these problems.
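For reference, the baseline that most of the surveyed algorithms build on is the textbook dynamic program for two sequences; a minimal sketch (not code from the thesis) is shown below.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y.

    Classic O(mn) dynamic program: dp[i][j] is the LCS length
    of the prefixes x[:i] and y[:j].
    """
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

assert lcs_length("ABCBDAB", "BDCABA") == 4  # e.g. "BCBA"
```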
|
2 |
Algorithms for Scaled String Indexing and LCS Variants
Peng, Yung-Hsing, 20 July 2010
Problems related to string indexing and sequence analysis have been widely studied for a long time. Recently, researchers have turned to extended versions of these problems, which model more realistic applications. In this dissertation, we focus on three problems of recent interest: (1) the indexing problem for scaled strings, (2) the merged longest common subsequence problem and its variant with blocks, and (3) the sequence alignment problem with weighted constraints.
The indexing problem for scaled strings asks one to preprocess a text string T, so that the matched positions of a pattern string P in T, with some scale α applied to P, can be reported efficiently. In this dissertation, we propose efficient algorithms for indexing real-scaled strings, discretely scaled strings, and proportionally scaled strings. Our indexing algorithms either significantly improve previous results or match the best known results.

The merged longest common subsequence (merged LCS) problem aims to detect the interleaving relationship between sequences, which has important applications in genomic and signal comparison. In this dissertation, we propose improved algorithms for finding the merged LCS. Our algorithms for finding the merged LCS are also more efficient than previous results, especially for large alphabets.

Finally, the sequence alignment problem with weighted constraints is newly proposed in this dissertation. For this new problem, we first propose an efficient solution, and then show that the concept of weighted constraints can be used to solve many other constraint-related problems on sequences. Therefore, the results in this dissertation make significant contributions to the field of string indexing and sequence analysis.
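Returning to the first of the three problems above, here is a minimal sketch of discrete scaling together with a naive search; the dissertation's index structures answer such queries far more efficiently, and the helper names are illustrative only.

```python
def scale(pattern: str, s: int) -> str:
    """Discretely scale a pattern by an integer factor s: each
    character is repeated s times, e.g. scale("ab", 3) == "aaabbb"."""
    return "".join(ch * s for ch in pattern)

def scaled_occurrences(text: str, pattern: str, s: int) -> list[int]:
    """Naive search for the scaled pattern; an index structure would
    preprocess text so that such queries avoid rescanning it."""
    p = scale(pattern, s)
    return [i for i in range(len(text) - len(p) + 1)
            if text[i:i + len(p)] == p]

print(scaled_occurrences("xaabbyaabb", "ab", 2))  # [1, 6]
```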
|
3 |
A Genetic Algorithm for the Longest Common Subsequence of Multiple Sequences
Chiang, Chung-Han, 06 January 2009
Various approaches have been proposed for finding the longest common subsequence (LCS) of two sequences. The time complexities of these algorithms are usually $O(n^2)$ in the worst case, where $n$ is the length of the input sequences. However, these algorithms become infeasible when the input length $n$ is very large. Recently, the $k$-LCS $(k \geq 2)$ problem has attracted increasing attention. Some algorithms have been proposed for solving it, but the execution time they require is still too long to be practical. In this thesis, we propose a genetic algorithm for solving the $k$-LCS problem with time complexity $O(Gpk(n + |P_j|))$, where $G$ is the number of generations, $p$ is the number of template patterns, $k$ is the number of input sequences, and $n$ and $|P_j|$ are the length of the input sequences and the length of the template patterns, respectively. As our experimental results show, when $k$ is 20 and $n$ is 1000, the performance ratio ($|CS|/|LCS|$) of our algorithm is greater than 0.8, where $|CS|$ denotes the length of the solution we find and $|LCS|$ denotes the length of the real (optimal) LCS. Comparing performance ratios, our algorithm is much better than the Expansion Algorithm and the BNMAS Algorithm when the number of input sequences varies from 2 to 20 and the length of the input sequences varies from 100 to 2000.
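As an illustration of the evaluation step such a genetic algorithm needs, the sketch below checks whether a candidate string is a common subsequence of all $k$ inputs and scores it by its length; the chromosome encoding and the template patterns of the thesis are not reproduced here.

```python
def is_subsequence(cand: str, seq: str) -> bool:
    """True if cand is a subsequence of seq (greedy scan, O(len(seq)))."""
    it = iter(seq)
    return all(ch in it for ch in cand)

def fitness(cand: str, seqs: list[str]) -> int:
    """A candidate scores its own length if it is a common
    subsequence of every input sequence, and 0 otherwise."""
    return len(cand) if all(is_subsequence(cand, s) for s in seqs) else 0

seqs = ["ACGGTAC", "AGGCTAC", "ACGCTTC"]
print(fitness("AGTC", seqs))  # 4
```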
|
4 |
An Approach for Solving the Constrained Longest Common Subsequence Problem
Peng, Chao-Li, 28 August 2003
A subsequence is obtained by deleting zero or more symbols from a given sequence. Given two sequences, the longest common subsequence (LCS) problem is to find a common subsequence whose length is the longest. The constrained longest common subsequence (CLCS) problem is to find a longest common subsequence that contains a specific constraint sequence as a subsequence. Note that a CLCS may be shorter than an LCS.
In this thesis, we propose a dynamic programming algorithm for solving the CLCS problem. The time complexity is O(pmn), where m and n are the lengths of the given sequences and p is the length of the constraint sequence. Our algorithm
can also be applied to solve the constrained sequence alignment problem, which is a sequence alignment problem with the requirement that some specific symbols must be aligned together.
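A minimal sketch of an O(pmn) dynamic program for the CLCS, in the spirit of the approach described above (the thesis' exact recurrence may differ in its details):

```python
def clcs_length(x: str, y: str, p: str) -> float:
    """Longest common subsequence of x and y that contains p as a
    subsequence.  dp[k][i][j] is the best length over the prefixes
    x[:i], y[:j] with p[:k] already embedded; -inf marks states in
    which p[:k] cannot yet be embedded.  O(|p| * |x| * |y|) time.
    """
    m, n, r = len(x), len(y), len(p)
    NEG = float("-inf")
    dp = [[[0 if k == 0 else NEG for _ in range(n + 1)]
           for _ in range(m + 1)] for k in range(r + 1)]
    for k in range(r + 1):
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                best = max(dp[k][i - 1][j], dp[k][i][j - 1])
                if x[i - 1] == y[j - 1]:
                    best = max(best, dp[k][i - 1][j - 1] + 1)  # match; constraint unchanged
                    if k > 0 and x[i - 1] == p[k - 1]:
                        best = max(best, dp[k - 1][i - 1][j - 1] + 1)  # match covers p[k-1]
                dp[k][i][j] = best
    return dp[r][m][n]  # -inf if no common subsequence contains p

print(clcs_length("ab", "ab", "b"))  # 2
```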
|
5 |
Some Common Subsequence Problems of Multiple Sequences and Their Applications
Huang, Kuo-Si, 14 July 2007
The longest common subsequence (LCS) problem is a famous and classical problem in computer science and molecular biology. The common subsequence of multiple sequences reveals the identical and similar parts of these sequences. This dissertation focuses on approximate algorithms for finding the LCS of $k$ input sequences (the $k$-LCS problem), the merged LCS problem, and the mosaic LCS problem. These three problems seek, respectively, the identical relationship among $k$ sequences, the interleaving relationship between a target sequence and the merge of a pair of sequences, and the mosaic relationship between a target sequence and a set of sequences.
Given $k$ input sequences, the $k$-LCS problem is to find the LCS common to all sequences. We first propose two $\sigma$-approximate algorithms for the $k$-LCS problem with time complexities $O(\sigma k n)$ and $O(\sigma^{2} k n + \sigma^{3} n)$, respectively, where $\sigma$ denotes the alphabet size and $n$ the length of the sequences. Experimental results show that our algorithms for the 2-LCS problem can serve as a good filter for selecting candidate sequences in database searching.
Given a target sequence $T$ and a pair of merging sequences $A$ and $B$, the merged LCS problem is to find the LCS of $T$ and the optimal sequence obtained by merging $A$ and $B$ alternately. Its goal is to find a merging order that reveals the interleaving relationship of the sequences. We first propose an algorithm with $O(n^{3})$ time for solving the problem, where $n$ is the sequence length. We further incorporate the block information of the input sequences into the blocked merged LCS problem. To solve the latter problem, we propose an algorithm with time complexity $O(n^{2}m_{b})$, where $m_{b}$ is the number of blocks. Based on the S-table technique, we then design an improved algorithm with $O(n^{2} + nm_{b}^{2})$ time.
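A minimal sketch of a cubic-time dynamic program for the (unblocked) merged LCS, written from the problem statement above rather than from the dissertation:

```python
from functools import lru_cache

def merged_lcs(t: str, a: str, b: str) -> int:
    """LCS of t against the best interleaving of a and b.
    f(i, j, k) is the answer for t[:i], a[:j], b[:k];
    O(|t| * |a| * |b|) states with O(1) work per state."""
    @lru_cache(maxsize=None)
    def f(i: int, j: int, k: int) -> int:
        if i == 0 or (j == 0 and k == 0):
            return 0
        best = f(i - 1, j, k)                       # leave t[i-1] unmatched
        if j > 0:
            best = max(best, f(i, j - 1, k))        # skip a[j-1]
            if t[i - 1] == a[j - 1]:
                best = max(best, f(i - 1, j - 1, k) + 1)
        if k > 0:
            best = max(best, f(i, j, k - 1))        # skip b[k-1]
            if t[i - 1] == b[k - 1]:
                best = max(best, f(i - 1, j, k - 1) + 1)
        return best
    return f(len(t), len(a), len(b))

print(merged_lcs("abcab", "ab", "ca"))  # 4, via the merge "abca"
```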
Additionally, we wish to characterize the relationship between one sequence and a set of sequences. Given a target sequence $T$ and a set $S$ of source sequences, the mosaic LCS problem is to find the LCS of $T$ and a mosaic sequence $C$ composed of $k$ sequences from $S$, with repetition allowed. Based on the concept of break points in $T$, a divide-and-conquer algorithm is proposed with $O(n^{2}m|S| + n^{3}\log k)$ time, where $n$ is the length of $T$ and $m$ is the maximal length of the sequences in $S$. Again, based on the S-table technique, an improved algorithm with $O(n(m+k)|S|)$ time is obtained by applying an efficient preprocessing step.
|
6 |
A Hybrid Algorithm for the Longest Common Subsequence of Multiple Sequences
Weng, Hsiang-yi, 19 August 2009
The k-LCS problem is to find the longest common subsequence (LCS) of k input sequences. It becomes difficult when the number of input sequences is large. In the past, researchers focused on finding the LCS of two sequences (2-LCS); to date, however, no good algorithm is known for finding the optimal solution of the k-LCS problem. For solving the k-LCS problem, in this thesis we first propose a hybrid algorithm that combines a heuristic algorithm, a genetic algorithm (GA) and an ant colony optimization (ACO) algorithm.
Then, we propose an enhanced ACO (EACO) algorithm, composed of the heuristic algorithm and the matching pair algorithm (MPA). In our experiments, we compare our algorithms with the Expansion Algorithm, the Best Next for Maximal Available Symbol (BNMAS) Algorithm, the GA and the ACO algorithm. The experimental results on several sets of DNA and protein sequences show that our EACO algorithm outperforms the other algorithms in the lengths of the solutions found.
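For context, a generic ant-colony skeleton for the k-LCS is sketched below: ants greedily extend a common subsequence, choosing among feasible symbols with probability proportional to pheromone, which then evaporates and is reinforced by the best solution found. This is a bare-bones illustration, not the EACO or MPA construction of the thesis, and all parameter values are assumptions.

```python
import random

def common_next(seqs, pos, alphabet):
    """Symbols that can extend the current common subsequence: those
    occurring in every sequence at or after its current pointer."""
    out = []
    for ch in alphabet:
        idxs = [s.find(ch, p) for s, p in zip(seqs, pos)]
        if all(i >= 0 for i in idxs):
            out.append((ch, idxs))
    return out

def ant_build(seqs, alphabet, tau, rng):
    """One ant builds a common subsequence, guided by pheromone tau."""
    pos, sol = [0] * len(seqs), []
    while True:
        choices = common_next(seqs, pos, alphabet)
        if not choices:
            return "".join(sol)
        ch, idxs = rng.choices(choices, weights=[tau[c] for c, _ in choices])[0]
        sol.append(ch)
        pos = [i + 1 for i in idxs]

def aco_klcs(seqs, n_ants=20, n_iter=50, rho=0.1, seed=0):
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))
    tau = {ch: 1.0 for ch in alphabet}   # pheromone per symbol (toy model)
    best = ""
    for _ in range(n_iter):
        for _ in range(n_ants):
            cand = ant_build(seqs, alphabet, tau, rng)
            if len(cand) > len(best):
                best = cand
        for ch in tau:                   # evaporation
            tau[ch] *= 1 - rho
        for ch in best:                  # reinforce the best-so-far solution
            tau[ch] += 1.0 / len(best)
    return best

print(aco_klcs(["ACGGTAC", "AGGCTAC", "ACGCTTC"]))
```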
|
7 |
Topics in sequence analysis
Ma, Jinyong, 12 November 2012
This thesis studies two topics in sequence analysis. In the first part, we investigate the large deviations of the shape of the random RSK Young diagrams associated with a random word of size $n$ whose letters are independently drawn from an alphabet of size $m = m(n)$. When the letters are drawn uniformly and when both $n$ and $m$ converge together to infinity, with $m$ not growing too fast with respect to $n$, the large deviations of the shape of the Young diagrams are shown to be the same as those of the spectrum of the traceless GUE. Since the length of the top row of the Young diagrams is the length of the longest (weakly) increasing subsequence of the random word, the corresponding large deviations follow. When the letters are drawn with non-uniform probability, a control of the two highest probabilities ensures that the length of the top row of the diagrams satisfies a large deviation principle. In either case, both the speeds and the rate functions are identified. To complete the study, non-asymptotic concentration bounds for the length of the top row of the diagrams are obtained for both models.

In the second part, we investigate the order of the $r$-th, $1 \le r < \infty$, central moment of the length of the longest common subsequence of two independent random words of size $n$ whose letters are identically distributed and independently drawn from a finite alphabet. When all but one of the letters are drawn with small probabilities, which depend on the size of the alphabet, the $r$-th central moment is shown to be of order $n^{r/2}$. In particular, for $r = 2$, we obtain the order of the variance of the length of the longest common subsequence.
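In symbols, with $LC_n$ denoting the length of the longest common subsequence of the two random words, the second part's main estimate can be stated as follows (a paraphrase of the abstract's claim):

```latex
% Under the stated condition that all but one of the letters are
% drawn with suitably small (alphabet-dependent) probabilities,
% for every fixed 1 <= r < infinity,
\[
  \mathbb{E}\left[\, \bigl| LC_n - \mathbb{E}[LC_n] \bigr|^{r} \,\right]
  \;=\; \Theta\!\left( n^{r/2} \right),
\]
% so in particular Var(LC_n) = Theta(n) when r = 2.
```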
|
8 |
Implementation and evaluation of a text extraction tool for adverse drug reaction information
Dahlberg, Gunnar, January 2010
Within the framework of the World Health Organization's (WHO) international drug monitoring programme, healthcare professionals and patients report suspected adverse drug reactions in the form of spontaneous reports, which are forwarded via national authorities to the Uppsala Monitoring Centre (UMC). At the UMC, the reports are stored in VigiBase, the WHO adverse reaction database. The reports in VigiBase are analyzed with statistical methods to find potential associations between drugs and adverse reactions. Detected associations are evaluated in several steps; an early step is to review the medical literature to see whether the association is already known (previously known associations are filtered out from further analysis). Manually searching for associations between a given drug and a given adverse reaction is time consuming. In this study, we have developed a tool that automatically searches medical literature for adverse reaction terms and stores the associations found in a structured format. In the tool, we have implemented and integrated functionality for matching adverse reaction terms in different ways (use of synonyms, stemming of word endings, removal of meaningless stop words, arbitrary word order, and spelling errors). The tool's performance has been evaluated on medical terms manually extracted from SPC texts (texts from the package leaflets of medicinal products) and on adverse effects texts from Martindale (a medical reference work with information on drugs and substances), with the WHO-ART and MedDRA terminologies used as sources of adverse reaction terms. The study shows that sophisticated text extraction can considerably improve the identification of adverse reaction terms in adverse effects texts compared with verbatim extraction. / Background: Initial review of potential safety issues related to the use of medicines involves reading and searching existing medical literature sources for known associations of drugs and adverse drug reactions (ADRs), so that they can be excluded from further analysis. The task is labor demanding and time consuming. Objective: To develop a text extraction tool to automatically identify ADR information from medical adverse effects texts, to evaluate the performance of the tool's underlying text extraction algorithm, and to identify which parts of the algorithm contributed to that performance. Method: A text extraction tool was implemented on the .NET platform with functionality for preprocessing text (removal of stop words, Porter stemming and use of synonyms) and matching medical terms using permutations of words and spelling variations (Soundex, Levenshtein distance and longest common subsequence distance). Its performance was evaluated both on manually extracted medical terms (semi-structured texts) from summary of product characteristics (SPC) texts and on unstructured adverse effects texts from Martindale (a medical reference for information about drugs and medicines), using the WHO-ART and MedDRA medical term dictionaries. Results: For the SPC data set, verbatim matching identified 72% of the SPC terms. The text extraction tool correctly matched 87% of the SPC terms while producing one false positive match, using removal of stop words, Porter stemming, synonyms and permutations. The use of the full MedDRA hierarchy contributed the most to the performance; the sophisticated text algorithms together contributed roughly equally. Phonetic coding (i.e. Soundex) proved inferior to the string distance measures (i.e. Levenshtein distance and longest common subsequence distance) for fuzzy matching in our implementation. The string distance measures increased the number of matched SPC terms, but at the expense of generating false positive matches. Results from Martindale show that 90% of the identified medical terms were correct; the majority of the false positive matches were caused by extracting medical terms that do not describe ADRs. Conclusion: Sophisticated text extraction can considerably improve the identification of ADR information from adverse effects texts compared to a verbatim extraction.
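A minimal sketch of one of the string distance measures named above, Levenshtein distance, as it might be used to fuzzy-match extracted terms against a term dictionary; the tool itself is implemented on .NET, so this Python rendering and its distance threshold are illustrative assumptions only.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions turning s into t (O(len(s) * len(t)) DP)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute
        prev = curr
    return prev[-1]

def fuzzy_match(term: str, dictionary: list[str], max_dist: int = 2) -> list[str]:
    """Dictionary terms within an (assumed) edit-distance threshold."""
    return [d for d in dictionary
            if levenshtein(term.lower(), d.lower()) <= max_dist]

print(fuzzy_match("anaemia", ["anemia", "alopecia", "anaemia nos"]))  # ['anemia']
```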
|
9 |
Efficient Algorithms for the Block Edit Distance and Related Problems
Ann, Hsing-Yen, 18 May 2010
Computing the similarity of two strings or sequences is one of the most fundamental problems in computer science, and it has been widely studied for several decades. In the last decade, it has attracted researchers' attention again because of improvements in hardware computation ability and the huge amount of data arising in biotechnology. In this dissertation, we focus on computing the edit distance between two sequences where block-edit operations are involved in addition to character-edit operations. Previous research shows that this problem is NP-hard if recursive block moves are allowed. Since we are interested in solving editing problems with polynomial-time optimization algorithms, we consider simplified versions of the edit distance problem. We first focus on the longest common subsequence (LCS) of run-length encoded (RLE) strings, where the runs can be seen as a class of simplified blocks. Then, we add constraints to the problem, i.e., we find the constrained LCS (CLCS) of RLE strings. Besides, we show that problems involving block-edit operations can still be solved by polynomial-time optimization algorithms if some restrictions are applied.
Let X and Y be two sequences of lengths n and m, respectively, and let N and M be the numbers of runs in the corresponding RLE forms of X and Y, respectively. In this dissertation, we first propose a simple algorithm for computing the LCS of X and Y in O(NM + min{p_1, p_2}) time, where p_1 and p_2 denote the numbers of elements in the bottom and right boundaries of the matched blocks, respectively. This new algorithm improves the previously known time bound O(min{nM, Nm}) and outperforms the time bounds O(NM log NM) and O((N+M+q) log (N+M+q)) in some cases, where q denotes the number of matched blocks.
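To make the run notation concrete, here is a small sketch of run-length encoding; for strings with good compression ratios the number of runs N is much smaller than the length n, which is where NM-type bounds pay off.

```python
from itertools import groupby

def rle(s: str) -> list[tuple[str, int]]:
    """Run-length encoding: maximal runs of equal characters.
    The number of runs plays the role of N (or M) above."""
    return [(ch, sum(1 for _ in grp)) for ch, grp in groupby(s)]

x = "aaabbbbccd"
print(rle(x))               # [('a', 3), ('b', 4), ('c', 2), ('d', 1)]
print(len(rle(x)), len(x))  # N = 4 runs versus n = 10 characters
```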
Next, we give an efficient algorithm for solving the CLCS problem, which is to find a common subsequence Z of X and Y such that a given constraint sequence P is a subsequence of Z and the length of Z is maximized. Suppose X, Y and P are all in RLE format, with lengths n, m and r, respectively, and let N, M and R be the numbers of runs in X, Y and P, respectively. We show that with RLE, the CLCS problem can be solved in O(NMr + min{q_1 r + q_4, q_2 r + q_5}) time, where q_1 and q_2 denote the numbers of elements in the south and east boundaries of the partially matched blocks on the first layer, respectively, and q_4 and q_5 denote the numbers of elements of the west and north pillars in the bottom boundaries of all fully matched cuboids in the DP lattice, respectively. When the input strings have good compression ratios, our work clearly outperforms the previously known DP algorithms and the Hunt-Szymanski-like algorithms.
Finally, we consider variations of the block edit distance problem that involve character insertions, character deletions, block copies and block deletions, for two given sequences X and Y. Three variations are defined with different measuring functions: P(EIS, C), P(EI, L) and P(EI, N). We then show that, with some preprocessing, the minimum block edit distances of these three variations can be obtained by dynamic programming in O(nm), O(nm log m) and O(nm^2) time, respectively, where n and m are the lengths of X and Y.
|
10 |
Automatically determine route and mode of transport using a GPS enabled phone
Gilani, Himanshu, 01 June 2005
A system consisting of a GPS-enabled phone and a database has been designed and implemented. This system is capable of recording the route traveled by a user and determining the mode of transport (walk, bicycle, car or bus) used. The Java code running in the GPS-enabled phone automatically records location data, determines critical locations for the trip, and transmits the locations to a central database using the wireless capabilities of the phone. As the route information arrives at the database, it is processed by the mode detection algorithm that determines the mode of transportation being used by the individual for this route. The mode detection algorithm uses travel time, speed, location of bus stops and knowledge of bus routes. The system was tested on experimental data collected from 100 trips (25 trips for each mode of transportation). The correct mode of transport was detected on 97% of the trips.
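A minimal sketch of the speed-and-route flavor of such a mode detection rule; the thresholds and the bus-route test below are illustrative assumptions, not the parameters used in the thesis.

```python
def detect_mode(avg_kmh: float, max_kmh: float, on_bus_route: bool) -> str:
    """Classify a trip from simple speed statistics, refined by
    knowledge of bus routes.  Threshold values are assumed."""
    if max_kmh < 8:                       # walking pace
        return "walk"
    if max_kmh < 25 and avg_kmh < 15:     # cycling range
        return "bicycle"
    return "bus" if on_bus_route else "car"

print(detect_mode(avg_kmh=4.0, max_kmh=6.5, on_bus_route=False))   # walk
print(detect_mode(avg_kmh=30.0, max_kmh=60.0, on_bus_route=True))  # bus
```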
|