11

Modeling Evolutionary Constraints and Improving Multiple Sequence Alignments using Residue Couplings

Hossain, K.S.M. Tozammel 16 November 2016 (has links)
Residue coupling in protein families has received much attention as an important indicator for predicting protein structures and revealing functional insight into proteins. Existing coupling methods largely identify pairwise couplings and express them over amino acid combinations, which does not yield a mechanistic explanation. Most of these methods use a multiple protein sequence alignment, often a hand-refined alignment that better exposes couplings, obtained by manually tweaking an alignment constructed by a classical alignment algorithm. Classical alignment algorithms focus primarily on capturing conservation and may not fully reveal couplings in the alignment. In this dissertation, we propose methods for capturing both pairwise and higher-order couplings in protein families. Our methods provide mechanistic explanations for couplings using physicochemical properties of amino acids and discriminate between coupling orders. We also investigate a method for mining frequent episodes, called coupled patterns, in an alignment produced by a classical algorithm for proteins, and for exploiting these coupled patterns to improve the alignment quality in terms of exposing couplings. We demonstrate the effectiveness of the proposed methods on a large collection of sequence datasets for protein families. / Ph. D.
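To make the notion of a pairwise residue coupling concrete, one common (though not necessarily this dissertation's) way to score how strongly two alignment columns co-vary is their mutual information; the toy alignment and column indices below are purely illustrative.

```python
from collections import Counter
from math import log2

def column_mutual_information(alignment, i, j):
    """Mutual information between columns i and j of a multiple sequence
    alignment (a list of equal-length strings). Higher values indicate that
    the residues at the two positions co-vary, i.e. a pairwise coupling."""
    pairs = [(seq[i], seq[j]) for seq in alignment]
    n = len(pairs)
    joint = Counter(pairs)                     # joint counts of (a, b)
    marg_i = Counter(a for a, _ in pairs)      # marginal counts at column i
    marg_j = Counter(b for _, b in pairs)      # marginal counts at column j
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((marg_i[a] / n) * (marg_j[b] / n)))
    return mi

# Hypothetical toy alignment: columns 1 and 3 co-vary perfectly.
aln = ["ARNDC", "AKNEC", "GRNDC", "GKNEC"]
print(column_mutual_information(aln, 1, 3))   # high MI -> coupled columns
```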
12

"Multiple Sequence Alignment Using External Sources Of Information"

Yasin, Layal 28 January 2016 (has links)
No description available.
13

Lagrangian Relaxation - Solving NP-hard Problems in Computational Biology via Combinatorial Optimization

Canzar, Stefan 14 November 2008 (has links) (PDF)
This thesis is devoted to two $\mathcal{NP}$-complete combinatorial optimization problems arising in computational biology, the well-studied \emph{multiple sequence alignment} problem and the newly formulated \emph{interval constraint coloring} problem. It shows that advanced mathematical programming techniques are capable of solving large-scale real-world instances from biology to optimality. Furthermore, it presents alternative methods that provide approximate solutions. In the first part of the thesis, we present a \emph{Lagrangian relaxation} approach for the multiple sequence alignment (MSA) problem. A multiple alignment is one common mathematical abstraction of the comparison of multiple biological sequences, such as DNA, RNA, or protein sequences. If the weight of a multiple alignment is measured by the sum of the projected pairwise weights of all pairs of sequences in the alignment, then finding a multiple alignment of maximum weight is $\mathcal{NP}$-complete when the number of sequences is not fixed. The majority of the available tools for aligning multiple sequences implement heuristic algorithms; no current exact method is able to solve moderately large instances or instances involving sequences with a low degree of similarity. We present a branch-and-bound (B\&B) algorithm for the MSA problem. We approximate the optimal integer solution in the nodes of the B\&B tree by a Lagrangian relaxation of an ILP formulation for MSA relative to an exponentially large class of inequalities that ensure that all pairwise alignments can be incorporated into a multiple alignment. By lifting these constraints prior to dualization, the Lagrangian subproblem becomes an \emph{extended pairwise alignment} (EPA) problem: compute a longest path in an acyclic graph that is penalized by a charge for entering ``obstacles''. We describe an efficient algorithm that solves the EPA problem repeatedly to determine near-optimal \emph{Lagrangian multipliers} via subgradient optimization. Reformulating the dualized constraints with respect to additionally introduced variables improves the convergence rate dramatically. We account for the exponential number of dualized constraints by starting with an empty \emph{constraint pool} in the first iteration and adding, in each iteration, the cuts that are most violated by the convex combination of a small number of preceding Lagrangian solutions (including the current solution). In this \emph{relax-and-cut} scheme, only inequalities from the constraint pool are dualized. The interval constraint coloring problem appears in the interpretation of experimental data in biochemistry. Monitoring hydrogen-deuterium exchange rates via mass spectrometry is a method used to obtain information about protein tertiary structure. The output of these experiments provides aggregate data about the exchange rate of residues in overlapping fragments of the protein backbone. These fragments must be re-assembled in order to obtain a global picture of the protein structure. The interval constraint coloring problem is the mathematical abstraction of this re-assembly process. The objective of the interval constraint coloring problem is to assign a color (exchange rate) to a set of integers (protein residues) such that a set of constraints is satisfied.
Each constraint is made up of a closed interval (protein fragment) and requirements on the number of elements in the interval that belong to each color class (exchange rates observed in the experiments). We introduce a polyhedral description of the interval constraint coloring problem, which serves as a basis for attacking the problem with integer linear programming (ILP) methods and tools, which perform well in practice. Since the goal is to provide biochemists with all possible candidate solutions, we combine related solutions into equivalence classes in an improved ILP formulation in order to reduce the running time of our enumeration algorithm. Moreover, we establish the polynomial-time solvability of the two-color case via the integrality of the linear programming relaxation polytope $\mathcal{P}$, and we also present a combinatorial polynomial-time algorithm for this case. We apply this algorithm as a subroutine to approximate solutions to instances with an arbitrary but fixed number of colors and achieve an order-of-magnitude improvement in running time over the (exact) ILP approach. We show that the problem is $\mathcal{NP}$-complete for an arbitrary number of colors, and we provide algorithms that, given an instance with $\mathcal{P}\neq\emptyset$, find a coloring that satisfies all the coloring requirements within $\pm 1$ of the prescribed value. In light of our $\mathcal{NP}$-completeness result, this is essentially the best one can hope for. Our approach is based on polyhedral theory and randomized rounding techniques. In practice, data emanating from the experiments are noisy, which normally causes the instance to be infeasible and, in some cases, even forces $\mathcal{P}$ to be empty. To deal with this problem, the objective of the ILP is to minimize the total sum of absolute deviations from the coloring requirements over all intervals. The combinatorial approach for the two-color case optimizes the same objective function. Furthermore, we use this combinatorial method to compute, in a Lagrangian way, a bound on the minimum total error, which is exploited in a branch-and-bound manner to determine all optimal colorings. Alternatively, we study a variant of the problem in which we want to maximize the number of requirements that are satisfied. We prove that this variant is $\mathcal{APX}$-hard even in the two-color case and thus does not admit a polynomial-time approximation scheme (PTAS) unless $\mathcal{P}=\mathcal{NP}$. Therefore, we slightly relax (by a factor of $(1+\epsilon)$) the condition on when a requirement is satisfied and propose a \emph{quasi-polynomial time approximation scheme} (QPTAS) that finds a coloring ``satisfying'' the requirements of as many intervals as possible.
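A minimal sketch of the generic subgradient step used to adjust Lagrangian multipliers in a relax-and-cut scheme of this kind; the constraint matrix A, right-hand side b, step-size rule, and subproblem oracle below are illustrative assumptions, not the thesis's actual formulation.

```python
import numpy as np

def subgradient_step(lambdas, A, b, x, step):
    """One subgradient update for multipliers of dualized constraints A x <= b.
    The subgradient at the current relaxed solution x is (A x - b); multipliers
    increase on violated constraints and stay non-negative."""
    g = A @ x - b
    return np.maximum(0.0, lambdas + step * g)

def solve_lagrangian_dual(A, b, relaxed_oracle, iters=100, step0=1.0):
    """Toy relax-and-cut loop: `relaxed_oracle(lambdas)` is assumed to solve the
    Lagrangian subproblem (the extended-pairwise-alignment analogue) for fixed
    multipliers, returning a subproblem solution x and the resulting dual bound."""
    lambdas = np.zeros(A.shape[0])
    best_bound = float("inf")
    for k in range(iters):
        x, bound = relaxed_oracle(lambdas)
        best_bound = min(best_bound, bound)          # keep the tightest bound seen
        lambdas = subgradient_step(lambdas, A, b, x, step0 / (k + 1))
    return best_bound, lambdas
```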
14

Core column prediction for protein multiple sequence alignments

DeBlasio, Dan, Kececioglu, John 19 April 2017 (has links)
Background: In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted. Results: We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment's accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner's scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
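The coreness definition quoted above can be made concrete with a short sketch: given a computed alignment, a reference alignment over the same sequences, and the set of core column indices in the reference, the coreness of a computed column is the fraction of its residue pairs that also co-occur in a core column. The gapped-string representation and function names here are assumptions for illustration, not the paper's code.

```python
from itertools import combinations

def residue_positions(alignment):
    """For each sequence and alignment column, record which residue
    (sequence index, ungapped offset) occupies that column, or None for a gap."""
    pos = []
    for s, seq in enumerate(alignment):
        offset, row = 0, []
        for ch in seq:
            if ch == '-':
                row.append(None)
            else:
                row.append((s, offset))
                offset += 1
        pos.append(row)
    return pos

def column_coreness(computed, reference, core_columns, col):
    """Fraction of residue pairs placed together in column `col` of the computed
    alignment that are also aligned together in a core column of the reference.
    Assumes both alignments contain the same sequences in the same order."""
    comp_pos = residue_positions(computed)
    ref_pos = residue_positions(reference)
    core_pairs = set()
    for c in core_columns:
        residues = [ref_pos[s][c] for s in range(len(reference)) if ref_pos[s][c] is not None]
        core_pairs.update(frozenset(p) for p in combinations(residues, 2))
    residues = [comp_pos[s][col] for s in range(len(computed)) if comp_pos[s][col] is not None]
    pairs = [frozenset(p) for p in combinations(residues, 2)]
    if not pairs:
        return 0.0
    return sum(p in core_pairs for p in pairs) / len(pairs)
```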
15

Aplicação de estratégias híbridas em algoritmos de alinhamento múltiplo de sequências para ambientes de computação paralela e distribuída. / Application of hybrid strategies in multiple sequence alignments for parallel and distributed computing environments.

Zafalon, Geraldo Francisco Donegá 11 November 2014 (has links)
Bioinformatics has developed rapidly in recent years. The need to process large sequence sets, whether of nucleotides or amino acids, has stimulated the development of many algorithmic techniques that make this problem tractable. Multiple sequence alignment algorithms have played a central role: with the reduced computational complexity they provide, it becomes feasible to align sets of more than two sequences. However, with the rapid growth in both the number and the length of the sequences in a set, using these multiple alignment algorithms without coupling them to new optimization strategies has become impractical. Consequently, high-performance computing has emerged as one of the resources to be exploited, through the parallelization of several strategies for execution on large computational systems. Moreover, with the continued expansion of sequence sets, other optimization strategies have been coupled with parallel multiple alignment algorithms. As a result, multiple sequence alignment tools based on hybrid approaches currently stand out as the best-accepted solution. In this work, we present the development of a hybrid strategy for progressive multiple alignment algorithms, whose use is widespread in Bioinformatics. In this approach, we combine parallelization and partitioning of the sequence sets in the score matrix construction phase, and we optimize the phylogenetic tree construction and multiple alignment phases through ant colony and parallel simulated annealing algorithms, respectively.
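To illustrate the partitioning-plus-parallelization idea for the score-matrix phase, here is a generic sketch; the pool size, the toy scoring function, and the use of Python's multiprocessing are illustrative assumptions, not the tool developed in this thesis.

```python
from itertools import combinations
from multiprocessing import Pool

def pair_score(task):
    """Score one pair of sequences. A real aligner would run a pairwise
    alignment here; this toy scorer just counts identical positions."""
    (i, si), (j, sj) = task
    return i, j, sum(a == b for a, b in zip(si, sj))

def score_matrix(sequences, workers=4):
    """Build the pairwise score matrix by partitioning the sequence pairs
    across worker processes."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    tasks = list(combinations(enumerate(sequences), 2))
    with Pool(workers) as pool:
        for i, j, s in pool.map(pair_score, tasks):
            matrix[i][j] = matrix[j][i] = s
    return matrix

if __name__ == "__main__":
    print(score_matrix(["ACGTAC", "ACGTTC", "AGGTAC"], workers=2))
```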
17

Parallel Algorithm for Memory Efficient Pairwise and Multiple Genome Alignment in Distributed Environment

Ahmed, Nova 20 December 2004 (has links)
Genome sequence alignment problems are important from the computational biology perspective. They deal with large amounts of data and are both memory-intensive and computation-intensive. Two separate algorithms from the literature have been studied and improved: a pairwise sequence alignment algorithm that aligns pairs of genome sequences with reduced memory usage and parallelized computation, and a multiple sequence alignment algorithm that aligns multiple genome sequences and is parallelized efficiently so that the workload of the alignment program is well distributed. Parallel applications can be launched in different environments, and shared memory is well suited to these kinds of applications; however, the shared-memory environment is limited in memory usage and scalability, and such machines are costly. A better approach is to use a cluster of computers, and the cluster environment can be further extended to a grid environment so that scalability can be improved by introducing multiple clusters. Here the grid environment is studied, along with the shared-memory and cluster environments, for the two applications. For carefully designed algorithms, the grid environment is comparable in performance to the other distributed environments and sometimes outperforms them, given the resource limitations those environments have.
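The abstract does not spell out its memory-reduction technique, but a standard way to cut pairwise alignment memory from O(mn) to O(n) is to keep only two rows of the dynamic-programming matrix. The sketch below shows this idea with a plain Needleman-Wunsch score computation; the scoring parameters are illustrative assumptions.

```python
def nw_score_linear_space(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of sequences a and b using O(len(b)) memory:
    only the previous and current rows of the DP matrix are kept."""
    prev = [j * gap for j in range(len(b) + 1)]      # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,          # substitution
                          prev[j] + gap,              # gap in b
                          curr[j - 1] + gap)          # gap in a
        prev = curr
    return prev[-1]

print(nw_score_linear_space("GATTACA", "GCATGCA"))   # toy example
```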
18

Efficient Characterization of Short Anelloviruses Fragments Found in Metagenomic Samples

Al-Absi, Thabit January 2012 (has links)
Some viral metagenomic serum samples contain a large amount of Anellovirus, a genetically diverse family with few conserved regions, which makes it hard to characterize efficiently. A multiple sequence alignment of the Anelloviruses found in a sample must be constructed to get a clear picture of Anellovirus diversity and to identify stable regions. Running available multiple sequence alignment software directly on these fragments produces an MSA of very poor quality due to their diversity and the misaligned and low-quality regions present in the sequences. An efficient MSA must therefore be constructed in order to characterize the Anelloviruses present in the samples. Pairwise alignment is used to align one fragment at a time to the database sequences. The fragments are then aligned to the database sequences using the start and end positions from the pairwise alignment results. The algorithm also excludes non-aligned portions of the fragments, as these are hard to handle properly and are often products of misassembly or chimeric sequenced fragments. Other tools to aid further analysis were developed, such as finding a non-overlapping window that contains the most fragments, computing the consensus of the alignment, and extracting arbitrary regions from the MSA for further analysis. An MSA was constructed with a high percentage of correctly aligned bases compared to an MSA constructed with standard MSA software. The minimal number of genomes present in the sampled sequence was determined, as was the distribution of the fragments along the database sequence. Moreover, a highly conserved region and the window containing the most fragments were extracted from the MSA, and phylogenetic trees were constructed for these regions.
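One of the auxiliary tools mentioned above, finding the window that contains the most fragments, can be sketched as a simple sweep over fragment coordinates. The fixed window width and the (start, end) fragment representation are illustrative assumptions, and the sketch ignores the non-overlapping aspect of the actual tool.

```python
def window_with_most_fragments(fragments, width):
    """fragments: list of (start, end) positions on the reference coordinate.
    Returns the start of the width-long window that fully contains the most
    fragments, together with that count."""
    best_start, best_count = 0, 0
    # It suffices to consider windows anchored at fragment start positions.
    for ws in sorted(s for s, _ in fragments):
        count = sum(1 for s, e in fragments if ws <= s and e <= ws + width)
        if count > best_count:
            best_start, best_count = ws, count
    return best_start, best_count

# Hypothetical fragment coordinates from the pairwise-alignment step.
frags = [(10, 120), (50, 160), (55, 170), (300, 420)]
print(window_with_most_fragments(frags, width=130))   # -> (50, 2) here
```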
19

Speeding Up the Convergence of Online Heuristic Search and Scaling Up Offline Heuristic Search

Furcy, David Andre 25 November 2004 (has links)
The most popular methods for solving the shortest-path problem in Artificial Intelligence are heuristic search algorithms. The main contributions of this research are new heuristic search algorithms that are either faster or scale up to larger problems than existing algorithms. Our contributions apply to both online and offline tasks. For online tasks, existing real-time heuristic search algorithms learn better informed heuristic values and in some cases eventually converge to a shortest path by repeatedly executing the action leading to a successor state with a minimum cost-to-goal estimate. In contrast, we claim that real-time heuristic search converges faster to a shortest path when it always selects an action leading to a state with a minimum f-value, where the f-value of a state is an estimate of the cost of a shortest path from start to goal via the state, just like in the offline A* search algorithm. We support this claim by implementing this new non-trivial action-selection rule in FALCONS and by showing empirically that FALCONS significantly reduces the number of actions to convergence of a state-of-the-art real-time search algorithm. For offline tasks, we improve on two existing ways of scaling up best-first search to larger problems. First, it is known that the WA* algorithm (a greedy variant of A*) solves larger problems when it is either diversified (i.e., when it performs expansions in parallel) or committed (i.e., when it chooses the state to expand next among a fixed-size subset of the set of generated but unexpanded states). We claim that WA* solves even larger problems when it is enhanced with both diversity and commitment. We support this claim with our MSC-KWA* algorithm. Second, it is known that breadth-first search solves larger problems when it prunes unpromising states, resulting in the beam search algorithm. We claim that beam search quickly solves even larger problems when it is enhanced with backtracking based on limited discrepancy search. We support this claim with our BULB algorithm. We show that both MSC-KWA* and BULB scale up to larger problems than several state-of-the-art offline search algorithms in three standard benchmark domains. Finally, we present an anytime variant of BULB and apply it to the multiple sequence alignment problem in biology.
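The action-selection rule claimed above can be sketched directly: among the successors of the current state, choose the one minimizing f(s) = g(s) + h(s) rather than h(s) alone. The successor-generator interface below is an illustrative assumption, not the FALCONS implementation.

```python
def select_action(state, successors, g, h):
    """Choose the action whose successor minimizes f(s') = g(s') + h(s'),
    where g estimates the cost of a shortest path from the start to s' and
    h estimates the cost from s' to the goal, as in offline A*.
    `successors(state)` is assumed to yield (action, next_state) pairs."""
    best_action, best_f = None, float("inf")
    for action, s_next in successors(state):
        f = g(s_next) + h(s_next)
        if f < best_f:
            best_action, best_f = action, f
    return best_action
```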
20

Novel scalable approaches for multiple sequence alignment and phylogenomic reconstruction

Mir arabbaygi, Siavash 18 September 2015 (has links)
The amount of biological sequence data is increasing rapidly, a promising development that could transform biology if we can develop methods that analyze large-scale data efficiently and accurately. A fundamental question in evolutionary biology is building the tree of life: a reconstruction of the relationships between organisms in evolutionary time. Reconstructing phylogenetic trees from molecular data is an optimization problem that involves many steps. In this dissertation, we argue that to answer long-standing phylogenetic questions with large-scale data, several challenges need to be addressed in various steps of the pipeline. One challenge is aligning a large number of sequences so that evolutionarily related positions in all sequences are placed in the same column. Constructing alignments is necessary for phylogenetic reconstruction, but also for many other types of evolutionary analyses. In response to this challenge, we introduce PASTA, a scalable and accurate algorithm that can align datasets with up to a million sequences. A second challenge is related to the interesting fact that various parts of the genome can have different evolutionary histories. Reconstructing a species tree from genome-scale data needs to account for these differences. A main approach for species tree reconstruction is to first reconstruct a set of ``gene trees'' from different parts of the genome, and to then summarize these gene trees into a single species tree. We argue that this approach can suffer from two challenges: reconstruction of individual gene trees from limited data can be plagued by estimation error, which translates into errors in the species tree; and methods that summarize gene trees are not scalable or accurate enough under some conditions. To address the first challenge, we introduce statistical binning, a method that re-estimates gene trees by grouping them into bins. We show that binning improves gene tree accuracy, and consequently the species tree accuracy. To address the second challenge, we introduce ASTRAL, a new summary method that can run on a thousand genes and a thousand species in a day and has outstanding accuracy. We show that the development of these methods has enabled biological analyses that were otherwise not possible.
