Global ETD Search

1	Alinhamento múltiplo de genomas de eucariotos com montagens altamente fragmentadas / Multiple alignment of large eukaryotic genomes with highly fragmented assemblies
Epamino, George Willian Condomitti, 04 August 2017
The advent of Next Generation Sequencing (NGS) in recent years has led to a marked increase in the number of genomic projects. Put simply, sequencing machines generate DNA fragments that are consumed by genome assembler software. These programs try to join the fragments to obtain a complete representation of the genomic sequence (for example, a chromosome) of the species being sequenced. For organisms with small genomes (e.g. bacteria with a genome of approximately 5 Mbp), assembly can often be performed fairly easily through pipelines that automate most of the task. A trickier scenario arises when the species has a very large genome (above 1 Gbp) containing repeated elements, as in some eukaryotes. In those cases the assembly usually consists of thousands of fragments (called contigs), a number far greater than the number of chromosomes estimated for the organism (usually a two-digit number), giving rise to a highly fragmented assembly. A common activity in these projects is to compare the assembly with that of another genome, both as a form of validation and to identify regions conserved between the organisms.
Although pairwise alignment of large genomes is handled well by existing approaches, multiple alignment of large genomes with highly fragmented assemblies remains difficult because of its time and computational requirements. This work presents a methodology for multiple alignment of large eukaryotic genomes with highly fragmented assemblies, a problem that few solutions are able to cope with. Our implementation, based on star alignment, was able to produce multiple alignments of groups of assemblies with different levels of fragmentation. The largest of them, a set of 5 reptilian genomes in which the B. jararaca assembly (800,000 contigs, N50 of 3.1 Kbp) was used as anchor, took 14 hours of execution time to provide a map of conserved regions among the participating species. The algorithm was implemented in open-source software named FROG (FRagment Overlap multiple Genome alignment), available under the terms of the General Public License v3 (GPLv3).
Keywords: Bioinformatics; Comparative genomics; Genome alignment
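The abstract above characterises fragmentation by contig count and N50. As a minimal sketch of how N50 is commonly computed (the function name `n50` and the sample lengths are illustrative assumptions, not taken from FROG): N50 is the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly size.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [10, 20, 30, 40]
example_n50 = n50(example_lengths)
```

A low N50 combined with a very large contig count, as reported for the anchor assembly, indicates that most of the sequence sits in short contigs.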
3	Implementierung des Genom-Alignments auf modernen hochparallelen Plattformen / Implementing Genome Alignment Algorithms on Highly Parallel Platforms
Knodel, Oliver, 28 June 2011
As DNA sequencing has grown in importance, sequencing devices have been developed further and their throughput increased to the point where they deliver millions of short nucleotide sequences within a few days. Finding the positions of these sequences in databases of known sequences is one of the most important tasks in modern molecular biology. Current algorithms and programs that can process the resulting volume of data in acceptable time, however, find only a fraction of these positions. This thesis investigates how modern genome alignment programs can be ported to massively parallel platforms such as FPGAs and GPUs. The programs and algorithms currently adapted to the problem are reviewed and analysed with regard to their parallelizability on both platforms. After an evaluation of the alternatives, one algorithm is selected, and its port to both platforms is designed and implemented. The focus is on search speed, the number of positions found, and usability. The reduced Smith & Waterman algorithm implemented on the GPU is adapted efficiently to the problem and, for short sequences, is faster than previous GPU implementations. A comparable FPGA implementation has a much lower runtime than the GPU version, likewise finds every position in the database, and reaches speeds similar to those of widely used high-performance software mappers, which however work heuristically. The number of positions found on FPGA and GPU is therefore more than twice that found by any comparable program.
Classification: ddc:004
4	Implementierung des Genom-Alignments auf modernen hochparallelen Plattformen / Implementing Genome Alignment Algorithms on Highly Parallel Platforms
Knodel, Oliver, 26 March 2014
As DNA sequencing has grown in importance, sequencing devices have been developed further and their throughput increased to the point where they deliver millions of short nucleotide sequences within a few days. Finding the positions of these sequences in databases of known sequences is one of the most important tasks in modern molecular biology. Current algorithms and programs that can process the resulting volume of data in acceptable time, however, find only a fraction of these positions. This thesis investigates how modern genome alignment programs can be ported to massively parallel platforms such as FPGAs and GPUs. The programs and algorithms currently adapted to the problem are reviewed and analysed with regard to their parallelizability on both platforms. After an evaluation of the alternatives, one algorithm is selected, and its port to both platforms is designed and implemented. The focus is on search speed, the number of positions found, and usability. The reduced Smith & Waterman algorithm implemented on the GPU is adapted efficiently to the problem and, for short sequences, is faster than previous GPU implementations. A comparable FPGA implementation has a much lower runtime than the GPU version, likewise finds every position in the database, and reaches speeds similar to those of widely used high-performance software mappers, which however work heuristically. The number of positions found on FPGA and GPU is therefore more than twice that found by any comparable program.
Keywords: Genome alignment; FPGA; HPC; GPU
Classification: ddc:004; rvk:ST 640
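For orientation, the textbook Smith & Waterman recurrence that the abstract builds on can be sketched as follows. This is the standard local-alignment score with a linear gap penalty, not the reduced variant implemented in the thesis; the function name and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty).

    Only two rows of the dynamic programming matrix are kept, since each
    cell depends on its left, upper, and upper-left neighbours only.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores are clamped at 0 so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical short read and reference fragment, for illustration only.
score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

Every cell on one anti-diagonal of the matrix depends only on the two previous anti-diagonals, which is the property that parallel implementations of this recurrence typically exploit.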
5	Celogenomové zarovnání pomocí suffixových stromů / Whole genome alignment using suffix trees
Klouba, Lukáš, January 2017
The aim of this thesis is to create an algorithm that aligns the genomes of two organisms by means of suffix structures and to implement it in the R programming environment. The thesis describes the construction of suffix structures and methods of whole genome alignment. The result is a working whole genome alignment algorithm based on suffix structures, implemented in R, together with a comparison against similar whole genome alignment programs.
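To illustrate how a suffix structure exposes shared sequence between two genomes, here is a minimal sketch that finds the longest common substring of two strings with a naively built suffix array. It is written in Python rather than R and is not the thesis's algorithm; the function names and the separator character `#` (assumed to occur in neither input) are illustrative assumptions, and the construction is quadratic, so it is suitable for small examples only.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def longest_common_substring(a, b, sep="#"):
    """Longest substring shared by a and b; sep must occur in neither."""
    text = a + sep + b
    sa = suffix_array(text)
    best = ""
    # The longest shared substring is a common prefix of two suffixes that
    # are adjacent in sorted order and start in different input strings.
    for x, y in zip(sa, sa[1:]):
        if (x < len(a)) == (y < len(a)):
            continue  # both suffixes start in the same input string
        k = 0
        while (x + k < len(text) and y + k < len(text)
               and text[x + k] == text[y + k] and text[x + k] != sep):
            k += 1
        if k > len(best):
            best = text[x:x + k]
    return best


# Hypothetical toy sequences, for illustration only.
shared = longest_common_substring("GATTACA", "TTACGG")
```

Exact matches found this way are the kind of seed that alignment methods based on suffix structures typically extend or chain into longer alignments.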
6	Graph-Based Whole Genome Phylogenomics
Fujimoto, Masaki Stanley, 01 June 2020
Understanding others is a deeply human urge, basic to our existential quest. It requires knowing where someone has come from and where they sit amongst peers. Phylogenetic analysis and genome-wide association studies seek to tell us where we have come from and where we are relative to one another through evolutionary history and genetic makeup. Current methods do not address the computational complexity caused by new forms of genomic data, namely long-read DNA sequencing and the growing abundance of assembled genomes. To address this, we explore specialized data structures for storing and comparing genomic information. This work resulted in the creation of novel data structures for storing multiple genomes that can be used for identifying structural variations and other types of polymorphisms. Using these methods we illuminate the genetic history of organisms in our efforts to understand the world around us.
Keywords: Genomics; Next-Gen Sequencing; Parallel Programming; Data Structures; Phylogenetics; Phylogenomics; de Bruijn Graph; NGS; Read Mapping; Whole Genome Alignment; Whole Genome Analysis; Physical Sciences and Mathematics
7	Two Problems in Computational Genomics
Belal, Nahla Ahmed, 22 March 2011
This work addresses two novel problems in the field of computational genomics. The first is whole genome alignment and the second is inferring horizontal gene transfer using posets. We define these two problems and present algorithmic approaches for solving them. For whole genome alignment, we define alignment graphs for representing different evolutionary events, and define a scoring function for those graphs. The problem as defined is proven to be NP-complete. Two heuristics are presented to solve it. One is a dynamic programming approach that is optimal for a class of sequences that we define in this work as breakable arrangements. The other is a greedy approach that is not necessarily optimal but, unlike the dynamic programming approach, allows for reversals. For inferring horizontal gene transfer, we define partially ordered sets (posets) among species with respect to different genes, and infer the genes involved in horizontal gene transfer by comparing the posets for different genes. The posets are used to construct a tree for each gene. Those trees are then compared and tested for contradiction, where contradictory trees correspond to genes that are candidates for horizontal gene transfer.
Degree: Ph. D.
Keywords: horizontal gene transfer; whole genome alignment; dynamic programming; graph theory; biology and genetics; graph algorithms; partial order sets
8	Algorithmes de comparaison de génomes appliqués aux génomes bactériens / Algorithms for the comparisons of genomic sequences applied to bacterial genomes
Uricaru, Raluca, 14 December 2010
With more than 1000 complete genomes available (the vast majority of them bacterial), comparative genomic analyses have become essential for the functional annotation of genomes and for understanding their structure and evolution, with applications in, for example, phylogenomics and vaccine design. One of the main approaches to comparing genomes is to align their DNA sequences, i.e. whole genome alignment (WGA), which means identifying regions of similarity without any prior annotation. Despite significant improvements in recent years, reliable tools for WGA and methods for estimating the quality of its results, in particular for bacterial genomes, still need to be developed. Besides the sheer length of the sequences, which makes classical dynamic programming alignment methods unusable, aligning whole genomes involves additional difficulties due to the mechanisms through which genomes evolve: divergence, which lets sequence similarity fade over time; the reordering of genomic segments (rearrangements); and the acquisition of external genetic material, which produces regions that cannot be aligned between sequences (e.g. horizontal gene transfer, phages). Whole genome alignment tools are therefore heuristics, the most common of which is the anchor based strategy. It starts by detecting an initial set of similarity regions (phase 1). A chaining phase (phase 2) then selects a maximum-weight subset of non-overlapping, usually collinear, similarities, called anchors. Phases 1 and 2 are applied recursively to the regions that are still unaligned (phase 3). The last phase (phase 4) systematically applies classical alignment tools to all short regions left unaligned.
This thesis addresses several problems related to whole genome alignment: evaluating the quality of the results produced by WGA tools and improving the classical anchor based strategy. We first designed a protocol for evaluating the quality of alignment results, based on both computational and biological measures. An evaluation of the results given by two state-of-the-art WGA tools on pairs of intra-species bacterial genomes revealed their shortcomings: some similarities between the sequences go undetected and some regions are misaligned. Based on these results, which point to a lack of both sensitivity and specificity, we propose a novel pairwise whole genome alignment tool, YOC, which implements a simplified two-phase version of the anchor based strategy. In phase 1, YOC improves sensitivity by using as anchors, for the first time in this strategy, local similarities based on spaced seeds, which can detect longer similarity regions in more divergent sequences. This phase is followed by a chaining method adapted to local similarities, a novel type of collinear chaining that allows proportional overlaps. We give a formulation of this new problem and provide the first algorithm for it. The algorithm, a dynamic programming approach based on the sweep-line paradigm, is exact and runs in quadratic time. We show that, compared to classical collinear chaining, chaining with overlaps improves the results on real bacterial data while remaining almost as efficient in practice. YOC is evaluated together with four other WGA tools on a dataset of 694 pairs of intra-species bacterial genomes. The results show that YOC improves on divergent cases by detecting more distant similarities and by avoiding misaligned regions.
In conclusion, YOC should be easier to apply automatically and systematically to incoming genomes, because it does not require a post-filtering step to detect misalignments and is less complex to calibrate.
Keywords: Comparative genomics; Whole genome alignment; Anchor based strategy; Spaced seeds; Fragment chaining; Trapezoid graphs
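The chaining phase described above can be sketched with the classical formulation, in which anchors must not overlap. This is a simple quadratic dynamic program, not YOC's overlap-tolerant sweep-line algorithm; the `Anchor` type, its field names, and the sample coordinates are assumptions made for illustration.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    start_a: int   # interval in genome A, half-open [start_a, end_a)
    end_a: int
    start_b: int   # interval in genome B, half-open [start_b, end_b)
    end_b: int
    weight: float


def chain_anchors(anchors):
    """Maximum-weight chain of non-overlapping, collinear anchors."""
    # Assumes non-empty intervals, so a predecessor in a chain always
    # sorts before its successor.
    order = sorted(anchors, key=lambda s: (s.start_a, s.start_b))
    n = len(order)
    if n == 0:
        return []
    score = [s.weight for s in order]   # best chain weight ending at each anchor
    back = [-1] * n                     # predecessor index in that best chain
    for j in range(n):
        for i in range(j):
            fits = (order[i].end_a <= order[j].start_a
                    and order[i].end_b <= order[j].start_b)
            if fits and score[i] + order[j].weight > score[j]:
                score[j] = score[i] + order[j].weight
                back[j] = i
    j = max(range(n), key=score.__getitem__)
    chain = []
    while j != -1:
        chain.append(order[j])
        j = back[j]
    return chain[::-1]


# Hypothetical similarity regions, for illustration only.
candidates = [
    Anchor(0, 10, 0, 10, 10),
    Anchor(12, 20, 50, 58, 8),    # breaks collinearity with the later anchors
    Anchor(15, 30, 12, 27, 15),
    Anchor(35, 40, 30, 35, 5),
]
best_chain = chain_anchors(candidates)
```

In this toy input the second candidate is dropped because keeping it would prevent the later anchors from following it in genome B; the regions between consecutive anchors of the returned chain are what the recursive and final phases would then process.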
