1

Computational problems in evolution : Multiple alignment, genome rearrangements, and tree reconstruction

Elias, Isaac January 2006 (has links)
Reconstructing the evolutionary history of a set of species is a fundamental problem in biology. This thesis concerns computational problems that arise in different settings and stages of phylogenetic tree reconstruction, but also in other contexts. The contributions include:
• A new distance-based tree reconstruction method with optimal reconstruction radius and optimal runtime complexity. Included in the result is a greatly simplified proof that the NJ algorithm also has optimal reconstruction radius. (co-author Jens Lagergren)
• NP-hardness results for the most common variations of Multiple Alignment. In particular, it is shown that SP-score, Star Alignment, and Tree Alignment are NP-hard for all metric symbol distances over all binary or larger alphabets.
• A 1.375-approximation algorithm for Sorting By Transpositions (SBT). SBT is the problem of sorting a permutation using as few block-transpositions as possible. The complexity of this problem is still open, and it was a ten-year-old open problem to improve the best known 1.5-approximation ratio. The 1.375-approximation algorithm is based on a new upper bound on the diameter of 3-permutations. Moreover, a new lower bound on the transposition diameter of the symmetric group is presented and the exact transposition diameter of simple permutations is determined. (co-author Tzvika Hartman)
• Approximation, fixed-parameter tractable, and fast heuristic algorithms for two variants of the Ancestral Maximum Likelihood (AML) problem: when the phylogenetic tree is known and when it is unknown. AML is the problem of reconstructing the most likely genetic sequences of extinct ancestors, along with the most likely mutation probabilities on the edges, given the phylogenetic tree and the sequences at the leaves. (co-author Tamir Tuller)
• An algorithm for computing the number of mutational events between aligned DNA sequences that is several hundred times faster than the well-known Phylip package. Since pairwise distance estimation is a bottleneck in distance-based phylogeny reconstruction, the new algorithm improves the overall running time of many distance-based methods by a factor of several hundred. (co-author Jens Lagergren)
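The last contribution above concerns estimating the number of mutational events between two aligned DNA sequences. Purely as an illustration of what such a pairwise distance estimate computes (not the thesis' algorithm), a minimal Jukes–Cantor sketch might look like the following; the function name and example sequences are invented for the example:

```python
import math

def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Expected substitutions per site between two aligned DNA sequences
    under the Jukes-Cantor model (illustrative sketch only)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only sites where both sequences have an unambiguous base.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    mismatches = sum(1 for a, b in pairs if a != b)
    p = mismatches / len(pairs)      # observed proportion of differing sites
    if p >= 0.75:                    # correction is undefined at/above saturation
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(jukes_cantor_distance("ACGTACGTAC", "ACGTTCGTAA"))  # ~0.23 for 2/10 mismatches
```

The speed-up reported in the thesis concerns how such pairwise estimates are computed at scale across many sequence pairs, which this sketch does not attempt to reproduce.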
2

COPIA: A New Software for Finding Consensus Patterns in Unaligned Protein Sequences

Liang, Chengzhi January 2001 (has links)
The consensus pattern problem (CPP) aims at finding conserved regions, or motifs, in unaligned sequences. This problem is NP-hard under various scoring schemes. To solve it more efficiently for protein sequences, a new scoring scheme and a randomized algorithm based on a substitution matrix are proposed here. Any practical solution to a bioinformatics problem must observe two principles: (1) the problem that it solves accurately describes the real problem; in CPP, this requires the scoring scheme to distinguish a real motif from background; (2) it provides an efficient algorithm to solve the mathematical problem. A key question in protein motif finding is how to determine the motif length. One problem in EM algorithms for CPP is how to find good starting points to reach the global optimum. Both questions are well addressed under this scoring scheme, which makes the randomized algorithm both fast and accurate in practice. A software package, COPIA (COnsensus Pattern Identification and Analysis), has been developed implementing this algorithm. Experiments using sequences from the von Willebrand factor (vWF) family showed that it works well at finding multiple motifs and repeats. COPIA's ability to find repeats also makes it useful for illustrating the internal structure of multidomain proteins. Comparative studies using several groups of protein sequences demonstrated that COPIA performed better than the commonly used motif-finding programs.
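COPIA's actual scoring scheme and randomized algorithm are not given in this abstract. The sketch below only illustrates the generic idea behind consensus-pattern search: pick one ungapped window per sequence, score the set with a substitution-style matrix, and improve the choice by coordinate ascent. The toy match/mismatch scores and all function names are assumptions for illustration, not COPIA's method.

```python
from itertools import combinations

# Toy substitution scores (an assumption; a real tool would use e.g. BLOSUM62).
def sub_score(a: str, b: str) -> int:
    return 2 if a == b else -1

def window_sp_score(windows: list[str]) -> int:
    """Sum-of-pairs score of one ungapped window per sequence (equal lengths)."""
    length = len(windows[0])
    return sum(sub_score(w1[i], w2[i])
               for i in range(length)
               for w1, w2 in combinations(windows, 2))

def best_motif_windows(seqs: list[str], k: int) -> tuple[int, list[str]]:
    """Pick one window of length k per sequence by greedy coordinate ascent
    (illustrative only; not COPIA's randomized algorithm)."""
    starts = [0] * len(seqs)          # begin with the leftmost window everywhere
    improved = True
    while improved:
        improved = False
        for i, seq in enumerate(seqs):
            best_s, best_val = starts[i], None
            for s in range(len(seq) - k + 1):
                trial = [seqs[j][starts[j]:starts[j] + k] if j != i
                         else seq[s:s + k] for j in range(len(seqs))]
                val = window_sp_score(trial)
                if best_val is None or val > best_val:
                    best_s, best_val = s, val
            if best_s != starts[i]:
                starts[i], improved = best_s, True
    windows = [seq[s:s + k] for seq, s in zip(seqs, starts)]
    return window_sp_score(windows), windows

score, motif = best_motif_windows(["MKVLAAGG", "TTKVLAAC", "GGKVLAAT"], 5)
print(score, motif)  # converges on windows covering the shared KVLA core
```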
3

Machine Learning Approaches for Modeling and Correction of Confounding Effects in Complex Biological Data

Wu, Chiung Ting 09 June 2021 (has links)
With the huge volume of biological data generated by new technologies and the booming of new machine-learning-based analytical tools, we expect to advance life science and human health at an unprecedented pace. Unfortunately, there is a significant gap between the complex raw biological data from real life and the data required by mathematical and statistical tools. This gap stems from two fundamental and universal problems in biological data, both related to confounding effects. The first is the intrinsic complexity of the data: an observed sample may be a mixture of multiple underlying sources, and we may be interested in only one or some of those sources. The second type of complexity comes from the acquisition process: different samples may be gathered at different times and/or from different locations, so each sample carries a specific distortion that must be carefully addressed. These confounding effects obscure the signals of interest in the acquired data. Specifically, this dissertation addresses the two major challenges in removing confounding effects: alignment and deconvolution.
Liquid chromatography–mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention time (RT) of the same compound in different samples, which must subsequently be corrected (aligned) during data processing. Classic alignment methods, such as those in the popular XCMS package, often assume a single time-warping function for each sample; the potentially varying RT drift of compounds with different masses within a sample is neglected, and the systematic change in RT drift across run order is often not considered. Therefore, these methods cannot effectively correct all misalignments. To utilize this information, we develop an integrated, reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on the warping functions of neighboring samples. We applied ncGTW to two large-scale metabolomics LC-MS datasets, where it identified many misaligned features and successfully realigned them; these features would otherwise be discarded or left uncorrected by existing methods.
When the desired signal is buried in a mixture, deconvolution is needed to recover the pure sources, since many biological questions can be better addressed when the data is in the form of individual sources instead of mixtures. Although there are some promising supervised deconvolution methods, unsupervised deconvolution is still needed when no a priori information is available. Among current unsupervised methods, Convex Analysis of Mixtures (CAM) is the most theoretically solid and best-performing one, but it has major limitations: the overall time complexity can be very high, especially when analyzing a large dataset or one with many sources, and because some steps are stochastic and heuristic, the deconvolution result is not accurate enough. To address these problems, we redesigned the modules of CAM. In the feature clustering step, we propose radius-fixed clustering, which not only controls the spatial extent of each cluster but also identifies outliers at the same time, thereby avoiding the disadvantages of K-means clustering such as instability and the need to specify the number of clusters. When identifying the convex hull, we replace Quickhull with linear programming, which decreases the computation time significantly. To avoid the heuristic and approximate step in optimal simplex identification, we propose a greedy search strategy instead. The experimental results demonstrate a large improvement in computation time, and the accuracy of the deconvolution is also higher than that of the original CAM. / Doctor of Philosophy / Due to the complexity of biological data, there are two major pre-processing steps: alignment and deconvolution. The alignment step corrects time- and location-related data acquisition distortions by aligning the detected signals to a reference signal. Although many alignment methods have been proposed for biological data, most of them fail to consider the relationships among samples carefully; this structural information can help alignment when the data is noisy and/or irregular. To utilize it, we develop a new method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), inspired by graph theory. This new alignment method not only exploits the structural information but also provides a reference-free solution, and we show that it performs better than other methods on both simulations and real datasets. When the signal comes from a mixture, deconvolution is needed to recover the pure sources, since many biological questions can be better addressed when the data is in the form of single sources instead of mixtures. A classic unsupervised deconvolution method is Convex Analysis of Mixtures (CAM), but it has some limitations: the time complexity of some steps is very high, so for a large dataset or one with many sources the computation time becomes extremely long, and because some steps are stochastic and heuristic, the result may not be accurate enough. We improved CAM, and the experimental results show that both the speed and the accuracy of the deconvolution are significantly improved.
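The radius-fixed clustering step is only summarized in this abstract. The sketch below illustrates the general idea under assumptions of ours (a greedy seed-and-absorb loop with a fixed radius, where small leftover groups are flagged as outliers); it is not the dissertation's implementation, and the function name, radius, and minimum cluster size are invented for the example.

```python
import numpy as np

def radius_clustering(points: np.ndarray, radius: float, min_size: int = 3):
    """Greedy radius-based clustering sketch: repeatedly seed a cluster at an
    unassigned point and absorb all unassigned points within `radius` of it.
    Clusters smaller than `min_size` are reported as outliers."""
    unassigned = set(range(len(points)))
    clusters, outliers = [], []
    while unassigned:
        seed = unassigned.pop()
        dists = np.linalg.norm(points - points[seed], axis=1)
        members = {i for i in unassigned if dists[i] <= radius} | {seed}
        unassigned -= members
        if len(members) >= min_size:
            clusters.append(sorted(members))
        else:
            outliers.extend(sorted(members))
    return clusters, outliers

# Two tight synthetic blobs plus one stray point far away.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.2, size=(20, 2))
blob_b = rng.normal(loc=3.0, scale=0.2, size=(20, 2))
stray = np.array([[10.0, 10.0]])
clusters, outliers = radius_clustering(np.vstack([blob_a, blob_b, stray]), radius=2.0)
print(len(clusters), "clusters;", outliers, "flagged as outliers")
```

Unlike K-means, nothing here requires the number of clusters in advance, and points that end up in undersized groups are surfaced as outliers rather than forced into a cluster, which is the property the dissertation highlights.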
4

Identification and Expression Analysis of Zebrafish Glypicans during Embryonic Development

Brand, Michael, Gupta, Mansi 02 December 2015 (has links)
Heparan sulfate proteoglycans (HSPGs) are ubiquitous molecules with indispensable functions in various biological processes. Glypicans are a family of HSPGs characterized by a GPI anchor, which directs them to the cell surface and/or extracellular matrix, where they regulate growth factor signaling during development and disease. We report the identification and expression pattern of glypican genes from zebrafish. The zebrafish genome contains 10 glypican homologs, as opposed to six in mammals; these are highly conserved and phylogenetically related to the mammalian genes. Some of the fish glypicans, such as Gpc1a, Gpc3, Gpc4, Gpc6a and Gpc6b, show conserved synteny with their mammalian cognate genes. Many glypicans are expressed during gastrulation, but their expression becomes more tissue-specific and defined during somitogenesis stages, particularly in the developing central nervous system. The existence of multiple glypican orthologs in fish with diverse expression patterns suggests highly specialized and/or redundant functions of these genes during embryonic development.
5

Alinhamento múltiplo de seqüências através de técnicas de agrupamento / Multiple alignment of sequences through clustering techniques

Peres, Patrícia Silva 24 February 2006 (has links)
The simultaneous alignment of many DNA or protein sequences is one of the commonest tasks in computational molecular biology. Multiple alignments are important in many applications, such as predicting the structure of new sequences, demonstrating the relationship between new sequences and existing families of sequences, inferring the evolutionary history of a family of sequences, finding the characteristic motifs (core blocks) shared by biological sequences, assembling fragments in DNA sequencing, and many others. Currently, the most popular strategy for solving the multiple sequence alignment problem is progressive alignment. Each step of this strategy may introduce an error, which is expected to be low for closely related sequences but increases as sequences diverge. Therefore, determining the order in which the sequences are aligned is a key step in the progressive alignment strategy.
Traditional approaches take into account, in each iteration of the progressive alignment, only the closest pair or group of sequences to be aligned. Such a strategy minimizes the error introduced at each step, but may not be the best option for minimizing the final error. Based on that hypothesis, this work studies and applies a global clustering technique to perform a prior analysis of all sequences, separating them into groups according to their similarities. These groups then guide the traditional progressive alignment, in an attempt to minimize the overall error introduced by the steps of the progressive alignment and improve the final result. To assess the reliability of this new strategy, three well-known methods were modified to incorporate the new sequence-clustering stage, and the accuracy of the new versions was tested on three different reference collections and compared against the original versions. The experimental results show that the new versions with the global clustering stage obtained better alignments than their original versions on all three reference collections and improved over the main methods found in the literature, with an increase of only 3% on average in running time.
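The thesis' global clustering stage is not specified in this abstract. As a generic sketch of grouping sequences by similarity before ordering a progressive alignment, the code below uses a k-mer Jaccard distance and single-linkage grouping; the function names, k-mer length, and distance threshold are assumptions for illustration, not the method evaluated in the thesis.

```python
from itertools import combinations

def kmer_set(seq: str, k: int = 3) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """1 - Jaccard similarity of k-mer sets: a cheap proxy for sequence distance."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    return 1.0 - len(sa & sb) / len(sa | sb)

def group_sequences(seqs: list[str], threshold: float = 0.7) -> list[list[int]]:
    """Single-linkage grouping: sequences closer than `threshold` share a group."""
    groups = [[i] for i in range(len(seqs))]
    for i, j in combinations(range(len(seqs)), 2):
        if kmer_distance(seqs[i], seqs[j]) < threshold:
            gi = next(g for g in groups if i in g)
            gj = next(g for g in groups if j in g)
            if gi is not gj:
                gi.extend(gj)
                groups.remove(gj)
    return groups

def alignment_order(seqs: list[str]) -> list[int]:
    """Emit an ordering that keeps members of the same group adjacent, so a
    progressive aligner would merge similar sequences first."""
    order = []
    for group in group_sequences(seqs):
        order.extend(sorted(group))
    return order

seqs = ["ACGTACGTGG", "ACGTACGTGA", "TTTTGGGGCC", "TTTTGGGGCA", "ACGTACGAGG"]
print(alignment_order(seqs))  # similar sequences end up adjacent in the order
```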
6

Modélisation et techniques d'optimisation en bio-informatique et fouille de données / Modelling and techniques of optimization in bioinformatics and data mining

Belghiti, Moulay Tayeb 01 February 2008 (has links)
This thesis addresses two types of problems: clustering and multiple sequence alignment. Our objective is to solve these global problems satisfactorily and to test the DC programming and DCA approach on real datasets. The thesis consists of three parts. The first part is devoted to new approaches in nonconvex optimization and global optimization; it presents an in-depth study of the algorithm used throughout the thesis, namely DC programming and the DC Algorithm (DCA). In the second part, the clustering problem is modeled as three nonconvex subproblems: the first two differ in the choice of norm (clustering via the 1-norm and the 2-norm), and the third uses a kernel method (kernel clustering). The third part is devoted to bioinformatics, focusing on the modeling and solution of two subproblems: multiple sequence alignment and structure-based RNA sequence alignment. All chapters except the first end with numerical experiments.
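DC programming expresses an objective as a difference of convex functions, f = g − h, and DCA iterates by linearizing h at the current point and minimizing the resulting convex surrogate. The following toy example, unrelated to the clustering and alignment models of the thesis, sketches that loop for f(x) = x⁴ − x² (g(x) = x⁴, h(x) = x²); the function name and iteration count are arbitrary.

```python
import numpy as np

def dca_quartic(x0: float, iters: int = 50) -> float:
    """Minimize f(x) = x**4 - x**2 with DCA, using g(x) = x**4 and h(x) = x**2.
    Each step linearizes h at x_k (gradient y_k = 2*x_k) and solves
    argmin_x g(x) - y_k*x; the stationarity condition 4*x**3 = y_k gives the
    closed-form update x_{k+1} = cbrt(y_k / 4)."""
    x = x0
    for _ in range(iters):
        y = 2.0 * x              # gradient of h at the current iterate
        x = np.cbrt(y / 4.0)     # exact minimizer of the convex surrogate
    return x

# From either side, the iterates converge to a stationary point of f,
# here one of the two global minimizers x = ±1/sqrt(2) ≈ ±0.7071.
print(dca_quartic(2.0), dca_quartic(-1.5))
```

The appeal of DCA, which the thesis exploits on much larger nonconvex models, is that each iteration only requires solving a convex problem, here available in closed form.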
7

Implementace algoritmu pro hledání podobností DNA řetězců v FPGA / Approximate String Matching Algorithm Implementation in FPGA

Pařenica, Martin January 2007 (has links)
This thesis describes sequence alignment algorithms for nucleotide sequences. Pairwise alignment algorithms based on database search or dynamic programming are described first, followed by dynamic programming for multiple sequences and an algorithm that builds phylogenetic trees. The first part closes with a description of FPGA technology. The second, more practical part describes the implementation of the chosen algorithm and includes examples of several multiple alignments.
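For reference, the classic global pairwise dynamic-programming alignment mentioned in the abstract is the Needleman–Wunsch recurrence. The sketch below uses arbitrary match/mismatch/gap scores and says nothing about the scoring or the FPGA architecture chosen in the thesis.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global pairwise alignment score via the Needleman-Wunsch recurrence:
    dp[i][j] = best score of aligning a[:i] with b[:j]."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap                       # a[:i] aligned against gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap                       # b[:j] aligned against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # substitution / match
                           dp[i - 1][j] + gap,       # gap in b
                           dp[i][j - 1] + gap)       # gap in a
    return dp[n][m]

print(needleman_wunsch("GATTACA", "GATCACA"))  # 5: six matches, one mismatch
```

The anti-diagonal independence of this recurrence is what makes such alignment kernels attractive for parallel hardware like FPGAs.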
