Global ETD Search

1	Comparison of DNA sequence assembly algorithms using mixed data sources Bamidele-Abegunde, Tejumoluwa 15 April 2010 DNA sequence assembly is one of the fundamental areas of bioinformatics. It involves the correct formation of a genome sequence from its DNA fragments ("reads") by aligning and merging the fragments. Sequencing technologies differ: some produce long DNA reads and others produce shorter reads, and there are sequence assembly programs specifically designed for each of these types of raw sequencing data. This work explores and experiments with these different types of assembly software in order to compare their performance on the type of data for which they were designed, as well as their performance on data for which they were not designed, and on mixed data. Such results are useful for establishing good procedures and tools for sequence assembly in the current genomic environment, where read data of different lengths are available. This work also investigates the effect of the presence or absence of quality information on the results produced by sequence assemblers. Five strategies were used in this research for assembling mixed data sets, and the testing was done using a collection of real and artificial data sets for six bacterial organisms. The results show that there is a broad range in the ability of some DNA sequence assemblers to handle data from various sequencing technologies, especially data other than the kind they were designed for. For example, the long-read assemblers PHRAP and MIRA produced good results from assembling 454 data. The results also show the importance of having an effective methodology for assembling mixed data sets. It was found that combining contiguous sequences obtained from short-read assemblers with long DNA reads, and then assembling this combination using long-read assemblers, was the most appropriate approach for assembling mixed short and long reads.
It was found that the results from assembling the mixed data sets were better than the results obtained from separately assembling individual data from the different sequencing technologies. DNA sequence assemblers that do not depend on the availability of quality information were used to test the effect of the presence of quality values when assembling data. The results show that, regardless of the availability of quality information, good results were produced in most of the assemblies. In more general terms, this work shows that the methodology used to assemble DNA sequences from mixed data sources makes a substantial difference in the results obtained, and that a good choice of methodology can help reduce the amount of effort spent on a DNA sequence assembly project. Sanger sequencing Next generation sequencing technologies DNA sequence assembly
3	Ant Colony Optimization Algorithms for Sequence Assembly with Haplotyping Wei, Liang-Tai 24 August 2005 The Human Genome Project was completed in 2003, and a draft of the human genome sequence was produced. It is known that any two human genomes are almost identical, and that only very small differences account for human diversity. A single nucleotide polymorphism (SNP) is a change of a single nucleotide base in DNA. A SNP sequence from one of a pair of chromosomes is called a haplotype. In this thesis, we study how to reconstruct a pair of chromosomes from a given set of fragments obtained by DNA sequencing of an individual. We define a new problem, the chromosome pair assembly problem, for this reconstruction. The goal is to find a pair of output sequences that has the minimum mismatch with the input fragments and whose lengths are minimum. We first transform the problem instance into a directed multigraph and then propose an efficient algorithm to solve the problem. We apply the ant colony optimization (ACO) algorithm to optimize the ordering of the input fragments and use dynamic programming to determine the SNP sites. After the chromosome pair is reconstructed, the two haplotypes can also be determined. We run our algorithm on artificial test data. The experiments show that our results are near the optimal solutions for the test data. Haplotype Dynamic Programming Sequence Assembly Ant Colony Optimization Algorithms
4	PARSES: A Pipeline for Analysis of RNA-Sequencing Exogenous Sequences Coco, Joseph 20 May 2011 RNA-Sequencing (RNA-Seq) has become one of the most widely used techniques to interrogate the transcriptome of an organism since the advent of next generation sequencing technologies [1]. A plethora of tools have been developed to analyze and visualize the transcriptome data from RNA-Seq experiments, solving the problem of mapping reads back to the host organism's genome [2] [3]. This allows for analysis of most reads produced by the experiments, but these tools typically discard reads that do not match well with the reference genome. This additional information could reveal important insight into the experiment and possible contributing factors to the condition under consideration. We introduce PARSES, a pipeline constructed from existing sequence analysis tools, which allows the user to interrogate RNA-Sequencing experiments for possible biological contamination or the presence of exogenous sequences that may shed light on other factors influencing an organism's condition. exogenous agents RNA-Seq contamination sequence alignment cancer etiology sequence assembly taxonomical classification cancer treatment
5	Algorithms and analysis for next generation biosensing and sequencing systems Shamaiah, Manohar 19 November 2012 Recent advancements in massively parallel biosensing and sequencing technologies have revolutionized the field of molecular biology and paved the way to novel and exciting innovations in medicine, biology, and environmental monitoring. Among them, biosensor arrays (e.g., DNA and protein microarrays) have gained a lot of attention. DNA microarrays are parallel affinity biosensors that can detect the presence and quantify the amounts of nucleic acid molecules of interest. They rely on chemical attraction between target nucleic acid sequences and their Watson-Crick complements, which serve as probes and capture the targets. The molecular binding between the probes and targets is a stochastic process, and hence the number of captured targets at any time is a random variable. Detection in conventional DNA microarrays is based on a single measurement taken in the steady state of the binding process. Recently developed real-time DNA microarrays, on the other hand, acquire multiple temporal measurements, which allow more precise characterization of the reaction and enable faster detection based on the early dynamics of the binding process. In this thesis, I study target estimation and limits of performance of real-time affinity biosensors. Target estimation is mapped to the problem of estimating parameters of discretely observed nonlinear diffusion processes. Performance of the estimators is characterized analytically via the Cramér-Rao lower bound on the mean-square error. The proposed algorithms are verified on both simulated and experimental data, demonstrating significant gains over state-of-the-art techniques. In addition to biosensor arrays, in this thesis I present studies of the signal processing aspects of next-generation sequencing systems.
Novel sequencing technologies will provide significant improvements in many aspects of the human condition, ultimately leading towards the understanding, diagnosis, treatment and prevention of diseases. Reliable decision-making in such downstream applications is predicated upon accurate base-calling, i.e., identification of the order of nucleotides from noisy sequencing data. Base-calling error rates are nonuniform and typically deteriorate with the length of the reads. I have studied performance limits of base-calling, characterizing it by means of an upper bound on the error rates. Moreover, in the context of shotgun sequencing, I analyzed how the accuracy of an assembled sequence depends on coverage, i.e., on the average number of times each base in a target sequence is represented in different reads. These analytical results are verified using experimental data. Among many downstream applications of high-throughput biosensing and sequencing technologies, reconstruction of gene regulatory networks is of particular importance. In this thesis, I consider the gene network inference problem and propose a probabilistic graphical approach for solving it. Specifically, I develop graphical models and design message passing algorithms, which are then verified using experimental data provided by the Dialogue for Reverse Engineering Assessment and Methods (DREAM) initiative. Affinity biosensors DNA sequencing Stochastic differential equations Sequence assembly Viterbi algorithm
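Illustrative note (not part of the thesis): the coverage defined in the abstract above, the average number of times each base of the target is represented in reads, can be sketched as total sequenced bases divided by target length. The function name `mean_coverage` and the example numbers are hypothetical and chosen only for illustration.

```python
def mean_coverage(read_lengths, target_length):
    """Average number of times each base of the target appears in reads.

    Assumes reads are drawn from the target only; this is the usual
    back-of-the-envelope estimate, not the thesis's analytical model.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return sum(read_lengths) / target_length


# Hypothetical example: 50,000 reads of 100 bases over a 1,000,000-base target.
reads = [100] * 50_000
print(mean_coverage(reads, 1_000_000))  # 5.0
```

This estimate ignores read placement, so individual bases can be covered far more or far less often than the average.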
6	MR-CUDASW - GPU accelerated Smith-Waterman algorithm for medium-length (meta)genomic data 2014 November 1900 The idea of using a graphics processing unit (GPU) for more than simply graphic output purposes has been around for quite some time in scientific communities. However, it is only recently that its benefits for a range of compute-intensive bioinformatics and life sciences tasks have been recognized. This thesis investigates the possibility of improving the performance of the overlap determination stage of an Overlap Layout Consensus (OLC)-based assembler by using a GPU-based implementation of the Smith-Waterman algorithm. In this thesis, an existing GPU-accelerated sequence alignment algorithm is adapted and expanded to reduce its completion time. A number of improvements and changes are made to the original software. The workload distribution, query profile construction, and thread scheduling techniques implemented by the original program are replaced by custom methods specifically designed to handle medium-length reads. Accordingly, this algorithm is the first highly parallel solution that has been specifically optimized to process medium-length nucleotide reads (DNA/RNA) from modern sequencing machines (e.g., Ion Torrent). Results show that the software reaches up to 82 GCUPS (Giga Cell Updates Per Second) on a single-GPU graphics card running on commodity desktop hardware. As a result, it is the fastest GPU-based implementation of the Smith-Waterman algorithm tailored for processing medium-length nucleotide reads. Despite being designed for performing the Smith-Waterman algorithm on medium-length nucleotide sequences, this program also shows great potential for improving heterogeneous computing with CUDA-enabled GPUs in general and is expected to contribute to other research problems that require sensitive pairwise alignment to be applied to a large number of reads.
Our results show that it is possible to improve the performance of bioinformatics algorithms by taking full advantage of the compute resources of the underlying commodity hardware. These results are especially encouraging since GPU performance grows faster than that of multi-core CPUs. Bioinformatics Sequence Alignment Smith-Waterman Algorithm GPU Computing CUDA Sequence Assembly Metagenomics Next-Generation-Sequencing
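Illustrative note (not part of the thesis): a minimal CPU reference sketch of the Smith-Waterman local alignment score, together with the GCUPS metric quoted in the abstract above. This is not the GPU implementation the thesis describes. The scoring values (`match=2`, `mismatch=-1`, linear `gap=-1`) and both function names are assumptions made for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def gcups(query_len, target_len, seconds):
    """Giga cell updates per second: matrix cells filled per second / 1e9."""
    return query_len * target_len / seconds / 1e9


print(smith_waterman_score("GATTACA", "TTAC"))  # 8 (four matches at +2 each)
print(gcups(1000, 1_000_000, 1.0))              # 1.0
```

Each matrix cell depends only on its left, upper, and upper-left neighbours, which is what makes the computation amenable to the kind of parallelization the abstract refers to.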
7	Identification et évolution des séquences orthologues par séquençage massif chez les polyploïdes / Identification and evolution of orthologous sequences in polyploid species by next-gen sequencing Boutte, Julien 03 December 2015 Next generation sequencing (NGS) technologies offer new opportunities to explore polyploid genomes and their corresponding transcriptomes. However, transcriptome assembly and the identification of homoeologous gene copies (duplicated by allopolyploidy) remain challenging, particularly in the context of recurrent polyploidy and the absence of a diploid reference genome. Spartina species (Poaceae, Chloridoideae) represent an excellent system to study the short-term consequences of hybridization and polyploidization in natural populations. The European S. maritima (hexaploid) hybridized twice with the American S. alterniflora (hexaploid) following the latter's recent introduction to Europe, which resulted in the formation of two homoploid hybrids (S. x townsendii and S. x neyrautii). Whole genome duplication of S. x townsendii resulted in the fertile new allododecaploid species S. anglica (in the late 19th century), which has since invaded salt marshes on several continents. Identification of duplicated genes in S. anglica and its parental species is critical to understanding its evolutionary success, but their high ploidy levels and the absence of a diploid reference species in Spartina require the development of adapted tools.
In this context, we developed and validated different bioinformatics tools to detect polymorphisms and identify the different haplotypes in NGS datasets. These approaches enabled the study of the heterogeneity of the highly repeated 45S rDNA in S. maritima and revealed the loss of homoeologous copies as a consequence of ongoing diploidization. In order to develop transcriptomic resources for these species, five new reference transcriptomes (110,423 annotated contigs for the five species, including 37,867 non-redundant contigs) were assembled and annotated. Co-alignments of parental and hybrid/allopolyploid haplotypes allowed the identification of homoeoSNPs discriminating homoeologs. The divergence between duplicated genes was evaluated and used to identify and confirm the recent duplication events in Spartina. Phylogenomic approaches on Spartina were also initiated in this thesis, with a view to clarifying the evolutionary origin of the duplicated copies. Homoeology SNPs De novo sequence assembly NGS Polyploidy Spartina
8	Assembly, Annotation and Optical Mapping of the A Subgenome of Avena Lee, Rebekah Ann 01 December 2017 Common oat (Avena) has held a significant place within the global crop community for centuries; although its cultivation has decreased over the past century, its nutritional benefits have recently garnered increased interest for human consumption. No published reference sequences are available for any of the three oat subgenomes. Here we report a quality sequence assembly, annotation and hybrid optical map of the A-genome diploid Avena atlantica Baum and Fedak. The assembly is composed of a total of 3,417 contigs with an N50 of 11.86 Mb and an estimated completeness of 97.6%. This genome sequence will be a valuable research tool within the oat community. Avena oats annotation genome sequence assembly BioNano physical mapping Life Sciences Plant Sciences
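Illustrative note (not part of the thesis): the N50 statistic quoted in the abstract above is the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembly length. A minimal sketch follows; the function name `n50` and the toy contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Toy example: total length 32, and the two longest contigs (8 + 8) reach 16.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

A higher N50 indicates that more of the assembly sits in long contigs, but it says nothing about whether those contigs are correct.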
10	The Genome Sequence of Gossypium herbaceum (A1), a Domesticated Diploid Cotton Freeman, Alex J 01 April 2018 Gossypium herbaceum is a species of cotton native to Africa and Asia. As part of a larger effort to investigate structural variation in assorted diploid and polyploid cotton genomes, we have sequenced and assembled the genome of G. herbaceum. Cultivated G. herbaceum is an A1-genome diploid from the Old World (Africa) with a genome size of approximately 1.7 Gb. Long-range information is essential in constructing a high-quality assembly, especially when the genome is expected to be highly repetitive. Here we present a quality draft genome of G. herbaceum (cv. Wagad) produced with a multi-platform sequencing strategy (PacBio RS II, Dovetail Genomics, Phase Genomics, BioNano Genomics). PacBio RS II (60X) long reads were de novo assembled using the CANU assembler. Illumina sequence reads generated with the PROXIMO library method from Phase Genomics and BioNano high-fidelity whole-genome maps were used for further scaffolding. Finally, the assembly was polished using PILON. This multi-platform long-range sequencing strategy will help greatly in attaining high-quality de novo reconstructions of genomes. This assembly will be used for comparative analysis with G. arboreum, a domesticated A2-genome diploid. Not only will this provide a quality reference genome for G. herbaceum, it also provides an opportunity to assess recent technologies such as Dovetail Genomics, Phase Genomics, and BioNano Genomics. The G. herbaceum genome sequence serves as an example for members of the plant genomics community who are interested in using multi-platform sequencing technologies for de novo genome sequencing. Gossypium G. herbaceum cotton Pacific Biosciences draft sequence assembly proximity guided assembly Life Sciences Plant Sciences
