11 |
Topics in sequence analysis / Ma, Jinyong, 12 November 2012
This thesis studies two topics in sequence analysis. In the first part, we investigate the large deviations of the shape of the random RSK Young diagrams associated with a random word of size n whose letters are independently drawn from an alphabet of size m=m(n). When the letters are drawn uniformly and when both n and m converge together to infinity, m not growing too fast with respect to n, the large deviations of the shape of the Young diagrams are shown to be the same as those of the spectrum of the traceless GUE. Since the length of the top row of the Young diagrams is the length of the longest (weakly) increasing subsequence of the random word, the corresponding large deviations follow. When the letters are drawn with non-uniform probability, a control on the two highest probabilities will ensure that the length of the top row of the diagrams satisfies a large deviation principle. In either case, both speeds and rate functions are identified. To complete our study, non-asymptotic concentration bounds for the length of the top row of the diagrams are obtained for both models. In the second part, we investigate the order of the r-th, 1 ≤ r < +∞, central moment of the length of the longest common subsequence of two independent random words of size n whose letters are identically distributed and independently drawn from a finite alphabet. When all but one of the letters are drawn with small probabilities, which depend on the size of the alphabet, the r-th central moment is shown to be of order n^{r/2}. In particular, when r=2, we get the order of the variance of the longest common subsequence.
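The quantity behind the top row of the RSK diagram, the length of the longest weakly increasing subsequence of the random word, can be simulated directly. The sketch below is an illustration only, not part of the thesis: it draws a uniform random word over an alphabet of size m and computes that length with a standard patience-sorting argument; the values of n and m are arbitrary.

```python
import random
from bisect import bisect_right

def longest_weakly_increasing(word):
    """Length of the longest weakly increasing (non-decreasing) subsequence.

    tails[k] holds the smallest possible last letter of a weakly increasing
    subsequence of length k+1 seen so far (patience-sorting idea).
    """
    tails = []
    for letter in word:
        # bisect_right admits equal letters, giving the *weakly* increasing variant
        pos = bisect_right(tails, letter)
        if pos == len(tails):
            tails.append(letter)
        else:
            tails[pos] = letter
    return len(tails)

# Illustrative sizes only, not the asymptotic regime studied in the thesis:
n, m = 10_000, 50
word = [random.randrange(m) for _ in range(n)]  # letters drawn uniformly
print(longest_weakly_increasing(word))  # length of the top row of the RSK diagram
```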
|
12 |
Implementation and evaluation of a text extraction tool for adverse drug reaction information / Dahlberg, Gunnar, January 2010
Within the World Health Organization's (WHO) international drug monitoring programme, healthcare professionals and patients report suspected adverse drug reactions in the form of spontaneous reports, which are sent via national authorities to the Uppsala Monitoring Centre (UMC). At UMC the reports are stored in VigiBase, the WHO adverse reaction database. The reports in VigiBase are analyzed with statistical methods to find potential associations between drugs and adverse reactions. Associations found are evaluated in several steps, and an early step in the evaluation is to study the medical literature to see whether the association is already known (previously known associations are filtered out from further analysis). Manually searching for associations between a particular drug and a particular adverse reaction is time consuming. In this study, we have developed a tool that automatically searches for adverse reaction terms in the medical literature and stores the associations found in a structured format. In the tool, we have implemented and integrated functionality for searching for adverse reactions in different ways (using synonyms, removing word endings, removing words that carry no meaning, arbitrary word order, and spelling errors). The tool's performance has been evaluated on manually extracted medical terms from SPC texts (texts from medicinal package leaflets) and on adverse effects texts from Martindale (a medical reference for information about drugs and substances), where the WHO-ART and MedDRA terminologies were used as the source of adverse reaction terms. The study shows that sophisticated text extraction can considerably improve the identification of adverse reaction terms in adverse effects texts compared to a verbatim extraction. / Background: Initial review of potential safety issues related to the use of medicines involves reading and searching existing medical literature sources for known associations of drugs and adverse drug reactions (ADRs), so that they can be excluded from further analysis. The task is labor demanding and time consuming. Objective: To develop a text extraction tool to automatically identify ADR information from medical adverse effects texts, evaluate the performance of the tool's underlying text extraction algorithm, and identify which parts of the algorithm contributed to the performance. Method: A text extraction tool was implemented on the .NET platform with functionality for preprocessing text (removal of stop words, Porter stemming and use of synonyms) and matching medical terms using permutations of words and spelling variations (Soundex, Levenshtein distance and longest common subsequence distance). Its performance was evaluated on both manually extracted medical terms (semi-structured texts) from summary of product characteristics (SPC) texts and unstructured adverse effects texts from Martindale (i.e. a medical reference for information about drugs and medicines) using the WHO-ART and MedDRA medical term dictionaries. Results: For the SPC data set, a verbatim match identified 72% of the SPC terms. The text extraction tool correctly matched 87% of the SPC terms while producing one false positive match, using removal of stop words, Porter stemming, synonyms and permutations. The use of the full MedDRA hierarchy contributed the most to performance. The sophisticated text algorithms together contributed roughly equally to the performance. Phonetic codes (i.e. Soundex) are evidently inferior to string distance measures (i.e. Levenshtein distance and longest common subsequence distance) for fuzzy matching in our implementation. The string distance measures increased the number of matched SPC terms, but at the expense of generating false positive matches. Results from Martindale show that 90% of the identified medical terms were correct. The majority of false positive matches were caused by extracting medical terms not describing ADRs. Conclusion: Sophisticated text extraction can considerably improve the identification of ADR information from adverse effects texts compared to a verbatim extraction.
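As an illustration of the kind of string distance measure mentioned above, the sketch below implements the Levenshtein distance and uses it for a simple fuzzy match of a candidate term against a small dictionary. It is a generic sketch only: the thesis tool's actual pipeline (stop-word removal, Porter stemming, synonyms, permutations, Soundex, LCS distance) is not reproduced, and the example terms and distance threshold are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(term, dictionary, max_dist=2):
    """Return dictionary terms within max_dist edits of term (hypothetical threshold)."""
    return [t for t in dictionary if levenshtein(term.lower(), t.lower()) <= max_dist]

# Made-up example terms, for illustration only:
print(fuzzy_match("nauesa", ["nausea", "headache", "dizziness"]))  # ['nausea']
```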
|
13 |
Efficient Algorithms for the Block Edit Distance and Related Problems / Ann, Hsing-Yen, 18 May 2010
Computing the similarity of two strings or sequences is one of the most fundamental problems in computer science, and it has been widely studied for several decades.
In the last decade, it has attracted researchers' attention again because of improvements in hardware computing power and the presence of huge amounts of data in biotechnology.
In this dissertation, we focus on computing the edit distance between two sequences when block-edit operations are allowed in addition to character-edit operations.
Previous research shows that this problem is NP-hard if recursive block moves are allowed.
Since we are interested in solving editing problems with polynomial-time optimization algorithms, we consider simplified versions of the edit distance problem.
We first focus on the longest common subsequence (LCS) of run-length encoded (RLE) strings, where the runs can be seen as a class of simplified blocks.
Then, we apply constraints to the problem, i.e. to find the constrained LCS (CLCS) of RLE strings.
In addition, we show that problems involving block-edit operations can still be solved by polynomial-time optimization algorithms if some restrictions are applied.
Let X and Y be two sequences of lengths n and m, respectively.
Also, let N and M be the numbers of runs in the corresponding RLE forms of X and Y, respectively.
In this dissertation, first, we propose a simple algorithm for computing the LCS of X and Y in O(NM + min{ p_1, p_2 }) time, where p_1 and p_2 denote the numbers of elements in the bottom and right boundaries of the matched blocks, respectively.
This new algorithm improves the previously known time bound O(min{nM, Nm}) and outperforms the time bounds O(NM log NM) or O((N+M+q) log (N+M+q)) for some cases, where q denotes the number of matched blocks.
Next, we give an efficient algorithm for solving the CLCS problem, which is to find a common subsequence Z of X and Y such that a given constrained sequence P is a subsequence of Z and the length of Z is maximized.
Suppose X, Y and P are all in RLE format, and the lengths of X, Y and P are n, m and r, respectively.
Let N, M and R be the numbers of runs in X, Y, and P, respectively.
We show that by RLE, the CLCS problem can be solved in O(NMr + min{q_1 r + q_4, q_2 r + q_5 }) time, where q_1 and q_2 denote the numbers of elements in the south and east boundaries of the partially matched blocks on the first layer, respectively, and q_4 and q_5 denote the numbers of elements of the west and north pillars in the bottom boundaries of all fully matched cuboids in the DP lattice, respectively.
When the input strings have good compression ratios, our work obviously outperforms the previously known DP algorithms and the Hunt-Szymanski-like algorithms.
Finally, we consider variations of the block edit distance problem that involve character insertions, character deletions, block copies and block deletions, for two given sequences X and Y.
In this dissertation, three variations are defined with different measuring functions, which are P(EIS, C), P(EI, L) and P(EI, N).
Then we show that with some preprocessing, the minimum block edit distances of these three variations can be obtained by dynamic programming in O(nm), O(nm log m) and O(nm^2) time, respectively, where n and m are the lengths of X and Y.
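For reference, the sketch below shows run-length encoding together with the textbook O(nm) dynamic program for the longest common subsequence, the baseline that the RLE-based algorithms above improve upon. It is only an illustrative baseline and does not implement the block-based or CLCS algorithms of the dissertation.

```python
from itertools import groupby

def rle(s):
    """Run-length encode a string: 'aaabcc' -> [('a', 3), ('b', 1), ('c', 2)]."""
    return [(ch, sum(1 for _ in grp)) for ch, grp in groupby(s)]

def lcs_length(x, y):
    """Textbook O(nm) dynamic program for the longest common subsequence."""
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

x, y = "aaabbbccc", "aabbccdd"
print(rle(x))            # [('a', 3), ('b', 3), ('c', 3)]
print(lcs_length(x, y))  # 6, e.g. 'aabbcc'
```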
|
14 |
Subsequence Feature Maps For Protein Function Annotation / Sarac, Omer Sinan, 01 August 2008
With the advances in sequencing technologies, the number of protein sequences with unknown function increases rapidly. Hence, computational methods for functional annotation of these protein sequences become of the utmost importance. In this thesis, we first defined a feature space mapping of protein primary sequences to fixed-dimensional numerical vectors. This mapping, which is called the Subsequence Profile Map (SPMap), takes into account the models of the subsequences of protein sequences. The resulting vectors were used as an input to support vector machines (SVM) for functional classification of proteins. Second, we defined the protein functional annotation problem as a classification problem and constructed a classification framework defined on Gene Ontology (GO) terms. Different classification methods as well as their combinations were assessed on this framework, which is based on 300 GO molecular function terms. The results showed that combination enhances the classification accuracy. The resultant system is made publicly available as an online function annotation tool.
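A much simplified view of such a subsequence-based feature mapping is sketched below: each protein sequence is mapped to a fixed-dimensional vector of fixed-length subsequence (here, 3-mer) counts over the amino-acid alphabet, so that variable-length sequences can be fed to a standard classifier such as an SVM. This is only a rough illustration of the idea; it does not reproduce SPMap's probabilistic subsequence profiles.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter alphabet

def kmer_feature_vector(sequence, k=3):
    """Map a protein sequence to a fixed-dimensional vector of k-mer counts.

    The dimension is 20**k regardless of sequence length, so the vectors
    have a fixed size suitable for SVM-style classifiers.
    """
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in index:          # skip k-mers containing non-standard letters
            vec[index[kmer]] += 1
    return vec

v = kmer_feature_vector("MKTAYIAKQR")
print(len(v), sum(v))  # 8000 dimensions, 8 k-mers counted
```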
|
15 |
Prediction Of Enzyme Classes In A Hierarchical Approach By Using Spmap / Yaman, Ayse Gul, 01 September 2009
Enzymes are proteins that play an important role in biochemical reactions as catalysts. They are classified, based on the reactions they catalyze, in a hierarchical scheme by the International Enzyme Commission (EC). This hierarchical scheme is expressed as a four-level tree structure, and a unique number is assigned to each enzyme class. There are six major classes at the top level, according to the reactions they carry out, and sub-classes at the lower levels are further specific reactions of these classes. The aim of this thesis is to build a three-level classification model based on the hierarchical structure of EC classes. The ENZYME database is used to extract the information on EC classes, and enzymes are assigned to these EC classes. Primary sequences of enzymes extracted from the UniProtKB/Swiss-Prot database are used to extract features. A subsequence-based feature extraction method, the Subsequence Profile Map (SPMap), is used in this study. SPMap explicitly models the differences between positive and negative examples and pays attention to the conserved subsequences of protein sequences in the same class. SPMap generates the feature vector of each sample protein as a probability of fixed-length subsequences of this protein with respect to a probabilistic profile matrix calculated by clustering similar subsequences in the training dataset. In our case, positive and negative training datasets are prepared for each class, at each level of the tree structure. SPMap is used for feature extraction and Support Vector Machines (SVMs) are used for classification. Five-fold cross validation is used to test the performance of the system. The overall sensitivity, specificity and AUC values for the six major EC classes are 93.08%, 98.95% and 0.993, respectively. The results at the second and third levels are also promising.
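The three-level scheme can be pictured as a cascade of classifiers, one per node of the EC tree, where a protein is first assigned a top-level class and then passed down that branch. The sketch below is a structural illustration only, with hypothetical classifier objects (anything exposing a scikit-learn-style predict method would do); it does not reproduce the trained SPMap/SVM models of the thesis.

```python
def predict_ec(feature_vector, classifiers, levels=3):
    """Descend a hypothetical EC hierarchy one level at a time.

    classifiers maps an EC prefix ('' for the root, '1', '1.1', ...) to an
    object whose predict() returns the next digit; returns e.g. '1.1.3'.
    """
    prefix = ""
    for _ in range(levels):
        clf = classifiers.get(prefix)
        if clf is None:                      # no trained model for this branch
            break
        digit = clf.predict([feature_vector])[0]
        prefix = f"{prefix}.{digit}" if prefix else str(digit)
    return prefix

class Constant:
    """Stand-in classifier that always predicts the same digit (demo only)."""
    def __init__(self, digit):
        self.digit = digit
    def predict(self, X):
        return [self.digit] * len(X)

demo = {"": Constant(1), "1": Constant(1), "1.1": Constant(3)}
print(predict_ec([0.0] * 10, demo))  # '1.1.3'
```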
|
16 |
Automatically determine route and mode of transport using a GPS-enabled phone / Gilani, Himanshu, 01 June 2005
A system consisting of a GPS-enabled phone and a database has been designed and implemented. This system is capable of recording the route traveled by a user and determining the mode of transport (walk, bicycle, car or bus) used. The Java code running in the GPS-enabled phone automatically records location data, determines critical locations for the trip, and transmits the locations to a central database using the wireless capabilities of the phone. As the route information arrives at the database, it is processed by the mode detection algorithm that determines the mode of transportation being used by the individual for this route. The mode detection algorithm uses travel time, speed, location of bus stops and knowledge of bus routes. The system was tested on experimental data collected from 100 trips (25 trips for each mode of transportation). The correct mode of transport was detected on 97% of the trips.
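A toy version of the rule-based idea, using only speed samples, is sketched below. The thresholds are made-up illustrations; the actual algorithm described above also uses travel time, bus-stop locations, and knowledge of bus routes, which are not modeled here.

```python
def detect_mode(speeds_kmh):
    """Guess the transport mode from a list of GPS speed samples (km/h).

    Purely illustrative thresholds; a bus cannot be separated from a car
    without the bus-stop and route information used by the real algorithm.
    """
    top = max(speeds_kmh)
    if top < 8:
        return "walk"
    if top < 25:
        return "bicycle"
    return "car or bus"

print(detect_mode([3, 4, 5, 4]))       # walk
print(detect_mode([10, 15, 18, 12]))   # bicycle
print(detect_mode([30, 50, 65, 40]))   # car or bus
```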
|
17 |
The Expected Number of Patterns in a Random Generated Permutation on [n] = {1,2,...,n} / Fokuoh, Evelyn, 01 August 2018
Previous work by Flaxman (2004) and Biers-Ariel et al. (2018) focused on the number of distinct words embedded in a string of words of length n. In this thesis, we extend this work to permutations, focusing on the maximum number of distinct permutations contained in a permutation on [n] = {1,2,...,n} and on the expected number of distinct permutations contained in a random permutation on [n]. We further consider the problem in which repetitions of subsequences arise as a result of the occurrence of (Type A and/or Type B) replications. Our method of enumerating the Type A replications causes double counting and, as a result, causes the count of the number of distinct sequences to go down.
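For very small n, the quantity in question can be computed by brute force: enumerate every subsequence of the permutation, reduce it to its pattern (the order-isomorphic permutation on {1,...,k}, as suggested by the thesis title), and count the distinct patterns. The sketch below only illustrates this definition and is not the enumeration method of the thesis.

```python
from itertools import combinations

def pattern(subseq):
    """Reduce a sequence of distinct numbers to its pattern on {1,...,k}."""
    ranks = {v: r for r, v in enumerate(sorted(subseq), start=1)}
    return tuple(ranks[v] for v in subseq)

def count_distinct_patterns(perm):
    """Brute-force count of distinct patterns contained in perm (exponential!)."""
    seen = set()
    for k in range(1, len(perm) + 1):
        for idx in combinations(range(len(perm)), k):
            seen.add(pattern([perm[i] for i in idx]))
    return len(seen)

print(count_distinct_patterns([2, 1, 3]))  # 4 distinct patterns: 1, 12, 21, 213
```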
|
18 |
O problema da subsequência comum máxima sem repetições / The repetition-free longest common subsequence problem / Tjandraatmadja, Christian, 26 July 2010
We explore the following problem: given two sequences X and Y over a finite alphabet, find a longest common subsequence of X and Y without repeated symbols. We study the structure of this problem, particularly from the point of view of graphs and polyhedral combinatorics. We develop approximation algorithms and heuristics for this problem. The focus of this work is the construction of an algorithm based on the branch-and-cut technique, taking advantage of an efficient separation algorithm and of heuristics and techniques to find an optimal solution earlier. We also study an easier problem on which this problem is based: given two sequences X and Y over a finite alphabet, find a longest common subsequence of X and Y. We explore this problem from the point of view of polyhedral combinatorics and describe several known algorithms to solve it.
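A naive way to see the problem on very small inputs is to enumerate the subsequences of X that have no repeated symbols and keep the longest one that is also a subsequence of Y. The exponential sketch below only illustrates the definition; the thesis develops approximation algorithms, heuristics, and a branch-and-cut approach for realistic instances.

```python
from itertools import combinations

def is_subsequence(s, t):
    """True if s is a subsequence of t."""
    it = iter(t)
    return all(ch in it for ch in s)

def repetition_free_lcs(x, y):
    """Brute force: longest common subsequence of x and y with no repeated symbol."""
    for k in range(min(len(x), len(set(x)), len(y)), 0, -1):
        for idx in combinations(range(len(x)), k):
            cand = "".join(x[i] for i in idx)
            if len(set(cand)) == k and is_subsequence(cand, y):
                return cand
    return ""

print(repetition_free_lcs("abcabc", "acbacb"))  # 'abc', one optimal answer of length 3
```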
|
20 |
The doctrine of subsequence in the pentecostal and neo-pentecostal movements / Elkington, Robert Lionel, 01 1900
The Pentecostal and Neo-Pentecostal movements propose a Spirit baptism subsequent to salvation. This baptism is viewed as an experience in which the Spirit either confers or awakens gifts within the life of the believer. The thesis of this paper is that Spirit baptism occurs at conversion. Spirit filling, on the other hand, is one of many metaphors describing the work of the eschatological Spirit subsequent to salvation. This distinction between Spirit baptism and Spirit filling differs from the Pentecostal and Neo-Pentecostal idea that Spirit baptism and Spirit filling are synonymous experiences that occur at some point subsequent to salvation. / Christian Spirituality, Church History and Missiology / Th. M. (Systematic Theology)
|