21

Fast and accurate estimation of large-scale phylogenetic alignments and trees

Liu, Kevin Jensen 06 July 2011
Phylogenetics is the study of evolutionary relationships. Phylogenetic trees and alignments play important roles in a wide range of biological research, including reconstruction of the Tree of Life - the evolutionary history of all organisms on Earth - and the development of vaccines and antibiotics. Today's phylogenetic studies seek to reconstruct trees and alignments on a greater number and variety of organisms than ever before, primarily due to exponential growth in affordable sequencing and computing power. The importance of phylogenetic trees and alignments motivates the need for methods to reconstruct them accurately and efficiently on large-scale datasets. Traditionally, phylogenetic studies proceed in two phases: first, an alignment is produced from biomolecular sequences of differing lengths, and, second, a tree is produced using the alignment. My dissertation presents the first empirical performance study of leading two-phase methods on datasets with up to hundreds of thousands of sequences. Relatively accurate alignments and trees were obtained using computationally demanding methods on datasets with a few hundred sequences, but as datasets grew past 1,000 sequences and up to tens of thousands of sequences, the set of methods capable of analyzing a dataset diminished, and only the methods with the lowest computational requirements and lowest accuracy remained. Alternatively, methods have been developed to estimate phylogenetic alignments and trees simultaneously. Methods based on treelength optimization - the most widely used approach for simultaneous estimation - had not previously been shown to return more accurate trees and alignments than two-phase approaches; I demonstrate that treelength optimization under a particular class of optimization criteria represents a promising means for inferring accurate trees and alignments. The other methods for simultaneous estimation are not known to support analyses of datasets with even a few hundred sequences, due to their high computational requirements. The main contribution of my dissertation is SATe, the first fast and accurate method for simultaneous estimation of alignments and trees on datasets with up to several thousand nucleotide sequences. SATe improves upon the alignment and topological accuracy of all existing methods, especially on the most difficult-to-align datasets, while retaining reasonable computational requirements.
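
To make the treelength criterion concrete, here is a minimal Python sketch (an illustration only, not code from the dissertation; the tree, sequences, and cost values are all invented): it scores a fixed alignment-plus-tree pair by summing per-edge substitution and gap costs, the quantity that simultaneous estimation methods minimize over candidate alignments and trees.

```python
# Toy treelength calculation: a hypothetical sketch, not SATe itself.
# Assumes every node of the tree (leaves and ancestors) carries an
# aligned sequence of equal length; real methods infer these jointly.

def edge_cost(seq_a: str, seq_b: str, sub_cost: int = 1, gap_cost: int = 2) -> int:
    """Cost of one tree edge: per-column substitution/gap penalties."""
    cost = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        cost += gap_cost if "-" in (a, b) else sub_cost
    return cost

def treelength(edges, sequences) -> int:
    """Sum of edge costs over the tree; lower is better."""
    return sum(edge_cost(sequences[u], sequences[v]) for u, v in edges)

# Example: a 3-leaf star tree with one ancestral node 'anc'.
seqs = {"A": "ACGT-", "B": "ACG-T", "C": "AC-TT", "anc": "ACGTT"}
edges = [("anc", "A"), ("anc", "B"), ("anc", "C")]
print(treelength(edges, seqs))  # 6 with the default costs
```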
22

Multiple Versions and Overlap in Digital Text

Desmond Schmidt Unknown Date
This thesis is unusual in that it tries to solve a problem that lies between two widely separated disciplines: the humanities (and to some extent also linguistics) on the one hand and information science on the other. Chapter 1 explains why it is essential to strike a balance between study of the solution and problem domains. Chapter 2 surveys the various models of cultural heritage text, starting in the remote past, through the coming of the digital era, to the present. It establishes why current models are outdated and need to be revised, and what significance such a revision would have. Chapter 3 examines the history of markup in an attempt to trace how inadequacies of representation arose. It then examines two major problems in digital texts for cultural heritage and linguistics: overlapping hierarchies and textual variation. It assesses previously proposed solutions to both problems and explains why they are all inadequate. It argues that the problem of overlapping hierarchies is a subset of the textual variation problem, and shows why markup cannot be the solution to either problem. Chapter 4 develops a new data model for representing cultural heritage and linguistics texts, called a ‘variant graph’, which separates the natural overlapping structures from the content. It develops a simplified list form of the graph that scales well as the number of versions increases. It also describes the main operations that need to be performed on the graph and explores their algorithmic complexities. Chapter 5 draws on research in bioinformatics and text processing to develop a greedy algorithm that aligns n versions with non-overlapping block transpositions in O(MN) time in the worst case, where M is the size of the graph and N is the length of the new version being added or updated. It shows how this algorithm can be applied to texts in corpus linguistics and the humanities, and tests an implementation of the algorithm on a variety of real-world texts.
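
The list form of the variant graph can be pictured with a small sketch (a hypothetical reconstruction with invented fragments and version labels, not the thesis author's implementation): shared text is stored once, and each fragment records which versions contain it.

```python
# A minimal sketch of a 'variant graph' in list form: each pair holds
# a text fragment and the set of versions that share it.
from typing import List, Set, Tuple

Pair = Tuple[str, Set[str]]  # (fragment, versions containing it)

def merge_two_versions() -> List[Pair]:
    # Versions: A = "the quick brown fox", B = "the quick red fox"
    return [
        ("the quick ", {"A", "B"}),  # shared by both versions
        ("brown ",     {"A"}),       # variant reading in A
        ("red ",       {"B"}),       # variant reading in B
        ("fox",        {"A", "B"}),
    ]

def extract(pairs: List[Pair], version: str) -> str:
    """Reconstruct one version by concatenating its fragments in list order."""
    return "".join(frag for frag, vs in pairs if version in vs)

pairs = merge_two_versions()
assert extract(pairs, "A") == "the quick brown fox"
assert extract(pairs, "B") == "the quick red fox"
```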
23

Multiple Biological Sequence Alignment: Scoring Functions, Algorithms, and Evaluations

Nguyen, Ken D 14 December 2011
Aligning multiple biological sequences, such as protein sequences or DNA/RNA sequences, is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences' structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes. In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereochemical properties of sequence residues and their substitution probabilities in a tree-structured scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity. In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm for use in our three new multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamically weighted guidance tree to perform multiple sequence alignment in a progressive fashion; the dynamically weighted tree allows errors made in the early alignment stages to be corrected in subsequent stages. The other two algorithms utilize sequence knowledge bases and sequence consistency to produce biologically meaningful sequence alignments. To improve the speed of multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm.
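
For context, the following sketch shows the standard sum-of-pairs column score that many MSA tools optimize - a baseline the tree-structured scheme above is designed to improve on. The substitution values are invented toy numbers, not a real matrix such as BLOSUM62, and the code is illustrative rather than taken from the dissertation.

```python
# Standard sum-of-pairs alignment score: sum over columns of all
# pairwise residue scores within the column.
from itertools import combinations

TOY_SCORES = {("A", "A"): 2, ("A", "G"): -1, ("G", "G"): 2}
GAP = -2

def pair_score(a: str, b: str) -> int:
    if "-" in (a, b):
        return GAP
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), -1))

def sum_of_pairs(alignment: list[str]) -> int:
    """Score an alignment as the sum over columns of all pairwise scores."""
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total

print(sum_of_pairs(["AG-", "AGG", "A-G"]))  # 2 with the toy values
```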
24

Role of mutual information for predicting contact residues in proteins

Gomes, Mireille January 2012
Mutual Information (MI) based methods are used to predict contact residues within proteins and between interacting proteins. There have been many high-impact papers citing the successful use of MI for determining contact residues in a particular protein of interest, or in certain types of proteins, such as homotrimers. In this dissertation we have carried out a systematic study to assess whether this popularly employed contact prediction tool is useful on a global scale. After testing original MI and leading MI-based methods on large, cross-species datasets, we found that in general the performance of these methods for predicting contact residues both within (intra-protein) and between proteins (inter-protein) is weak. We observe that all MI variants have a bias towards surface residues, and therefore predict surface residues instead of contact residues. This finding is in contrast to the relatively good performance of i-Patch (Hamer et al. [2010]), a statistical scoring tool for inter-protein contact prediction. i-Patch uses as input surface residues only, groups amino acids by physicochemical properties, and assumes the existence of patches of contact residues on interacting proteins. We examine whether using these ideas would improve the performance of MI. Since inter-protein contact residues are only on the surface of each protein, to disentangle surface from contact prediction we filtered out the confounding buried residues. We observed that considering surface residues only does indeed improve the inter-protein contact prediction ability of all tested MI methods. We examined a specific "successful" case study in the literature and demonstrated that here, even when considering surface residues only, the most accurate MI-based inter-protein contact predictor, MIc, performs no better than random. We have developed two novel MI variants: the first groups amino acids by their physicochemical properties, and the second considers patches of residues on the interacting proteins. In our analyses these new variants highlight the delicate trade-off between signal and noise that must be achieved when using MI for inter-protein contact prediction. The input for all tested MI methods is a multiple sequence alignment of homologous proteins. In a further attempt to understand why the MI methods perform poorly, we have investigated the influence of gaps in the alignment on intra-protein contact prediction. Our results suggest that, depending on the evaluation criteria and the alignment construction algorithm employed, a gap cutoff of around 10% would maximise the performance of MI methods, whereas the popularly employed 0% gap cutoff may lead to predictions that are no better than random guesses. Based on the insight we have gained through our analyses, we end this dissertation by identifying a number of ways in which the contact residue prediction ability of MI variants may be improved, including direct coupling analysis.
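
The core MI quantity that the tested methods build on can be computed directly from two alignment columns, as in this illustrative sketch (toy columns; real pipelines add refinements such as the average product correction and the gap filtering discussed above):

```python
# Mutual information (in nats) between two MSA columns, treating the
# aligned sequences as paired samples: MI = sum p(a,b) log(p(a,b)/(p(a)p(b))).
from collections import Counter
from math import log

def mutual_information(col_i: str, col_j: str) -> float:
    n = len(col_i)
    p_i = Counter(col_i)
    p_j = Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in p_ij.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) simplifies to c * n / (count_a * count_b)
        mi += p_ab * log(c * n / (p_i[a] * p_j[b]))
    return mi

# Columns taken across 6 homologous sequences (one character per row).
print(mutual_information("AAAGGG", "TTTCCC"))  # perfectly coupled: ln 2
print(mutual_information("AAAGGG", "TCTCTC"))  # near-independent: ~0.06
```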
25

Correction de données de séquençage de troisième génération / Error correction of third-generation sequencing data

Morisse, Pierre 26 September 2019
The aims of this thesis fall within the broad area of high-throughput sequencing data analysis, and more specifically the processing of long reads produced by third-generation sequencing technologies. The work focuses mainly on the correction of sequencing errors and on the impact of correction on downstream analyses, in particular de novo assembly. The first objective of this thesis is to allow the quality of the correction provided by the various state-of-the-art hybrid methods (which use complementary short reads) and self-correction methods (which rely only on the information contained in the long reads) to be evaluated and compared. Such an evaluation makes it easy to identify which correction method is best suited to a given case, according to the complexity of the genome under study, the sequencing depth, or the error rate of the reads. Moreover, developers can thereby identify the limitations of existing methods, to guide their work and propose new solutions that overcome these limitations. A new evaluation tool, offering many additional metrics compared to the only tool previously available, was therefore developed. This tool combines a multiple sequence alignment approach with a segmentation strategy, which also considerably reduces the time needed for evaluation. Using this tool, a benchmark of all available correction methods is presented, on a wide variety of datasets of varying sequencing depth, error rate, and complexity, from the bacterium A. baylyi to human. This benchmark notably identified two important limitations of existing tools: reads with error rates above 30%, and reads longer than 50,000 base pairs. The second objective of this thesis is thus the correction of extremely noisy reads. To this end, a hybrid correction tool combining different state-of-the-art approaches was developed to overcome the limitations of existing methods. In particular, this tool combines a strategy of aligning the short reads to the long reads with the use of a de Bruijn graph whose particularity is to be of variable order. The graph is used to link the aligned short reads and thereby correct the regions of the long reads they do not cover. This method can correct reads with error rates as high as 44%, while scaling better to large genomes and reducing processing time compared with the most efficient state-of-the-art methods. Finally, the third objective of this thesis is the correction of extremely long reads. To this end, a self-correction tool was developed, again combining different state-of-the-art methodologies. More precisely, overlaps between the reads are computed, followed by a two-stage correction: first by multiple alignment, then using local de Bruijn graphs. To allow this method to scale efficiently to extremely long reads, the segmentation strategy mentioned above was generalized. This self-correction method can correct reads of up to 340,000 base pairs, while scaling very well to more complex genomes, such as the human genome.
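
As a rough illustration of the graph structure underlying the hybrid corrector, here is a minimal fixed-order de Bruijn graph built from short reads (invented reads and k; the actual tool uses a variable-order graph, which this sketch does not implement):

```python
# Fixed-order de Bruijn graph: map each (k-1)-mer prefix to the set of
# (k-1)-mer suffixes that follow it in some read.
from collections import defaultdict

def build_de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

reads = ["ACGTAC", "CGTACG", "GTACGT"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
# A walk through this graph spells out a consensus that can patch
# low-quality or uncovered stretches of a long read.
```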
26

Techniky pro zarovnávání skupin biologických sekvencí / Techniques for Multiple Sequence Alignments

Hrazdil, Jiří January 2009
This thesis summarizes ways of representing biological sequences and the file formats used for sequence exchange and storage. The next part deals with techniques for pairwise sequence alignment, followed by the extension of these techniques to the problem of multiple sequence alignment. Additional methods are introduced that are suboptimal but are able to compute results in reasonable time. The practical part of this thesis consists of implementing a multiple sequence alignment application in the Java programming language.
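
As a pointer to the pairwise techniques the thesis surveys, here is a compact sketch of the Needleman-Wunsch global alignment score (the thesis application itself is in Java; the Python below and its scoring values are illustrative only):

```python
# Needleman-Wunsch global alignment score with linear gap costs,
# using a rolling row so only O(len(t)) memory is needed.

def nw_score(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Cell (i, j) holds the best score aligning s[:i] against t[:j]."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(nw_score("GATTACA", "GCATGCU"))
```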
27

Ensemble Methods for Historical Machine-Printed Document Recognition

Lund, William B. 03 April 2014
The usefulness of digitized documents is directly related to the quality of the extracted text. Optical Character Recognition (OCR) has reached a point where well-formatted and clean machine-printed documents are easily recognizable by current commercial OCR products; however, older or degraded machine-printed documents present problems to OCR engines, resulting in word error rates (WER) that severely limit either automated or manual use of the extracted text. Major archives of historical machine-printed documents are being assembled around the globe, requiring an accurate transcription of the text for the automated creation of descriptive metadata, full-text searching, and information extraction. Given document images to be transcribed, ensemble recognition methods with multiple sources of evidence, drawn from the original document image and from information sources external to the document, have been shown in this and related work to improve output. This research introduces new methods of evidence extraction, feature engineering, and evidence combination to correct errors from state-of-the-art OCR engines. This work also investigates the success and failure of ensemble methods in the OCR error correction task, as well as the conditions under which these ensemble recognition methods reduce the WER, improving the quality of the OCR transcription: the average document WER can be reduced below that of a state-of-the-art commercial OCR system by between 7.4% and 28.6%, depending on the test corpus and methods. This research on OCR error correction contributes to the larger field of ensemble methods as follows. Four unique corpora for OCR error correction are introduced: the Eisenhower Communiqués, a collection of typewritten documents from 1944 to 1945; the Nineteenth Century Mormon Articles Newspaper Index, from 1831 to 1900; and two synthetic corpora based on the Enron (2001) and Reuters (1997) datasets. The Reverse Dijkstra Heuristic is introduced as a novel admissible heuristic for the A* exact alignment algorithm; its impact is a dramatic reduction in the number of nodes processed during text alignment compared to the baseline method. From the aligned text, the method developed here creates a lattice of competing hypotheses for word tokens. In contrast to much of the work in this field, the word token lattice is created from a character alignment, preserving split and merged tokens within the hypothesis columns of the lattice. This alignment method more explicitly identifies competing word hypotheses which might otherwise have been split apart by a word alignment. Lastly, this research explores, in order of increasing contribution to word error rate reduction: voting among hypotheses, decision lists based on an in-domain training set, ensemble recognition methods with novel feature sets, multiple binarizations of the same document image, and training on synthetic document images.
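
The simplest ensemble step listed above, voting among hypotheses, can be sketched as column-wise majority voting over character-aligned OCR outputs (the aligned strings are invented; in the research the alignment itself comes from the A* method described above):

```python
# Majority vote per aligned column across OCR hypotheses; '-' marks an
# insertion/deletion placeholder and is dropped from the winning output.
from collections import Counter

def vote(hypotheses: list[str]) -> str:
    out = []
    for column in zip(*hypotheses):
        best, _ = Counter(column).most_common(1)[0]
        if best != "-":
            out.append(best)
    return "".join(out)

aligned = [
    "the qu1ck brown fox",
    "the quick br0wn fox",
    "the quick brown f-x",
]
print(vote(aligned))  # "the quick brown fox"
```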
28

Modules réactionnels : un nouveau concept pour étudier l'évolution des voies métaboliques / Reaction modules : a new concept to study the evolution of metabolic pathways

Barba, Matthieu 16 December 2011
I designed a methodology to annotate enzyme superfamilies, describe their history, and place them in the context of the evolution of their metabolic pathways. Three superfamilies were studied. (1) The cyclic amidohydrolases, including the DHOases (dihydroorotases, the third step of pyrimidine biosynthesis), for which I proposed a new classification; the phylogenetic tree also includes the dihydropyrimidinases (DHPases) and allantoinases (ALNases), which catalyze similar reactions in other pathways (pyrimidine and purine degradation, respectively). (2) The DHODase superfamily (acting after the DHOases), which shows a phylogeny similar to that of the DHOases and likewise includes enzymes from other pathways, in particular the DHPDases (acting after the DHPases). This observation gave rise to the concept of the reaction module: a conserved series of similar reactions occurring in different metabolic pathways. (3) This concept was used to study the carbamoyltransferases (TCases), which include the ATCases (acting before the DHOases). I first demonstrated the existence of a new kind of TCase, potentially involved in purine degradation, and proposed a new role for it in the light of reaction modules (its reaction chains with that of ALNase). In these three superfamilies I also identified three groups of unidentified paralogs that are nevertheless found in the same genetic context, called “Yge”, and would thus form a reaction module belonging to a new, hypothetical pathway. Applied to diverse pathways, the concept of reaction modules may therefore reflect the ancestral metabolic pathways of which the modules would be the basic elements.
29

Predikce škodlivosti aminokyselinových mutací s využitím metody MAPP / Predicting the Effect of Amino Acid Substitutions on Protein Function Using MAPP Method

Pelikán, Ondřej January 2014
This thesis addresses the problem of predicting the effect of amino acid substitutions on protein function using the MAPP method. The method requires a multiple sequence alignment and a phylogenetic tree constructed by third-party tools. The main goal of this thesis is to find a combination of suitable tools and parameters for generating the inputs of the MAPP method, based on an analysis of one massively mutated protein. The MAPP method is then tested with the chosen combination of tools and parameters on two large independent datasets, and compared with other tools for predicting the effect of mutations. In addition, a web interface for the MAPP method was created; it simplifies use of the method, since the user does not need to install any tools or set any parameters.
30

Multiple sequence analysis in the presence of alignment uncertainty

Herman, Joseph L. January 2014
Sequence alignment is one of the most intensely studied problems in bioinformatics, and is an important step in a wide range of analyses. An issue that has gained much attention in recent years is the fact that downstream analyses are often highly sensitive to the specific choice of alignment. One way to address this is to jointly sample alignments along with other parameters of interest. In order to extend the range of applicability of this approach, the first chapter of this thesis introduces a probabilistic evolutionary model for protein structures on a phylogenetic tree; since protein structures typically diverge much more slowly than sequences, this allows for more reliable detection of remote homologies, improving the accuracy of the resulting alignments and trees, and reducing sensitivity of the results to the choice of dataset. In order to carry out inference under such a model, a number of new Markov chain Monte Carlo approaches are developed, allowing for more efficient convergence and mixing on the high-dimensional parameter space. The second part of the thesis presents a directed acyclic graph (DAG)-based approach for representing a collection of sampled alignments. This DAG representation allows the initial collection of samples to be used to generate a larger set of alignments under the same approximate distribution, enabling posterior alignment probabilities to be estimated reliably from a reasonable number of samples. If desired, summary alignments can then be generated as maximum-weight paths through the DAG, under various types of loss or scoring functions. The acyclic nature of the graph also permits various other types of algorithms to be easily adapted to operate on the entire set of alignments in the DAG. In the final part of this work, methodology is introduced for alignment-DAG-based sequence annotation using hidden Markov models, and RNA secondary structure prediction using stochastic context-free grammars. Results on test datasets indicate that the additional information contained within the DAG allows for improved predictions, resulting in substantial gains over simply analysing a set of alignments one by one.
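
As an illustration of the summary-alignment step, the following sketch extracts a maximum-weight path from a small DAG by dynamic programming over a topological order (the graph, weights, and node labels are invented; in the thesis the weights would come from posterior estimates over the sampled alignments):

```python
# Maximum-weight source-to-sink path in a DAG: one forward DP pass over
# a topological order, then a backtrack through the predecessor map.

def max_weight_path(edges: dict[str, list[tuple[str, float]]],
                    topo_order: list[str], source: str, sink: str) -> list[str]:
    best = {source: 0.0}
    back = {}
    for u in topo_order:
        if u not in best:
            continue  # unreachable from the source
        for v, w in edges.get(u, []):
            if best[u] + w > best.get(v, float("-inf")):
                best[v] = best[u] + w
                back[v] = u
    path, node = [sink], sink
    while node != source:
        node = back[node]
        path.append(node)
    return path[::-1]

# Nodes stand for alignment columns; weights could be posterior
# probabilities estimated from the collection of sampled alignments.
edges = {"S": [("c1", 0.9), ("c2", 0.4)],
         "c1": [("c3", 0.5)], "c2": [("c3", 0.8)],
         "c3": [("T", 1.0)]}
print(max_weight_path(edges, ["S", "c1", "c2", "c3", "T"], "S", "T"))
# ['S', 'c1', 'c3', 'T']
```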
