Spelling suggestions: "subject:"dequence alignment."" "subject:"dequence lignment.""
21 |
GPU accelerated sequence alignment /Zhao Kaiyong.Zhao, Kaiyong 15 November 2016 (has links)
DNA sequence alignment is a fundamental task in gene information processing, which is about searching the location of a string (usually based on newly collected DNA data) in the existing huge DNA sequence databases. Due to the huge amount of newly generated DNA data and the complexity of approximate string match, sequence alignment becomes a time-consuming process. Hence how to reduce the alignment time becomes a significant research problem. Some algorithms of string alignment based on HASH comparison, suffix array and BWT, which have been proposed for DNA sequence alignment. Although these algorithms have reached the speed of O(N), they still cannot meet the increasing demand if they are running on traditional CPUs. Recently, GPUs have been widely accepted as an efficient accelerator for many scientific and commercial applications. A typical GPU has thousands of processing cores which can speed up repetitive computations significantly as compared to multi-core CPUs. However, sequence alignment is one kind of computation procedure with intensive data access, i.e., it is memory-bounded. The access to GPU memory and IO has more significant influence in performance when compared to the computing capabilities of GPU cores. By analyzing GPU memory and IO characteristics, this thesis produces novel parallel algorithms for DNA sequence alignment applications. This thesis consists of six parts. The first two parts explain some basic knowledge of DNA sequence alignment and GPU computing. The third part investigates the performance of data access on different types of GPU memory. The fourth part describes a parallel method to accelerate short-read sequence alignment based on BWT algorithm. The fifth part proposes the parallel algorithm for accelerating BLASTN, one of the most popular sequence alignment software. It shows how multi-threaded control and multiple GPU cards can accelerate the BLASTN algorithm significantly. The sixth part concludes the whole thesis. To summarize, through analyzing the layout of GPU memory and comparing data under the mode of multithread access, this thesis analyzes and concludes a perfect optimization method to achieve sequence alignment on GPU. The outcomes can help practitioners in bioinformatics to improve their working efficiency by significantly reducing the sequence alignment time.
|
22 |
Parallel Algorithm for Memory Efficient Pairwise and Multiple Genome Alignment in Distributed EnvironmentAhmed, Nova 20 December 2004 (has links)
The genome sequence alignment problems are very important ones from the computational biology perspective. These problems deal with large amount of data which is memory intensive as well as computation intensive. In the literature, two separate algorithms have been studied and improved – one is a Pairwise sequence alignment algorithm which aligns pairs of genome sequences with memory reduction and parallelism for the computation and the other one is the multiple sequence alignment algorithm that aligns multiple genome sequences and this algorithm is also parallelized efficiently so that the workload of the alignment program is well distributed. The parallel applications can be launched on different environments where shared memory is very well suited for these kinds of applications. But shared memory environment has the limitation of memory usage as well as scalability also these machines are very costly. A better approach is to use the cluster of computers and the cluster environment can be further enhanced to a grid environment so that the scalability can be improved introducing multiple clusters. Here the grid environment is studied as well as the shared memory and cluster environment for the two applications. It can be stated that for carefully designed algorithms the grid environment is comparable for its performance to other distributed environments and it sometimes outperforms the others in terms of the limitations of resources the other distributed environments have.
|
23 |
Measuring deviation from a deeply conserved consensus in protein multiple sequence alignmentsMokin, Sergey. January 1900 (has links)
Thesis (M.Sc.). / Written for the Dept. of Biology. Title from title page of PDF (viewed 2008/12/07). Includes bibliographical references.
|
24 |
Family of Hidden Markov Models and its applications to phylogenetics and metagenomicsNguyen, Nam-phuong Duc 24 October 2014 (has links)
A Profile Hidden Markov Model (HMM) is a statistical model for representing a multiple sequence alignment (MSA). Profile HMMs are important tools for sequence homology detection and have been used in wide a range of bioinformatics applications including protein structure prediction, remote homology detection, and sequence alignment. Profile HMM methods result in accurate alignments on datasets with evolutionarily similar sequences; however, I will show that on datasets with evolutionarily divergent sequences, the accuracy of HMM-based methods degrade. My dissertation presents a new statistical model for representing an MSA by using a set of HMMs. The family of HMM (fHMM) approach uses multiple HMMs instead of a single HMM to represent an MSA. I present a new algorithm for sequence alignment using the fHMM technique. I show that using the fHMM technique for sequence alignment results in more accurate alignments than the single HMM approach. As sequence alignment is a fundamental step in many bioinformatics pipelines, improvements to sequence alignment result in improvements across many different fields. I show the applicability of fHMM to three specific problems: phylogenetic placement, taxonomic profiling and identification, and MSA estimation. In phylogenetic placement, the problem addressed is how to insert a query sequence into an existing tree. In taxonomic identification and profiling, the problems addressed are how to taxonomically classify a query sequence, and how to estimate a taxonomic profile on a set of sequences. Finally, both profile HMM and fHMM require a backbone MSA as input in order to align the query sequences. In MSA estimation, the problem addressed is how to estimate a ``de novo'' MSA without the use of an existing backbone alignment. For each problem, I present a software pipeline that implements the fHMM specifically for that domain: SEPP for phylogenetic placement, TIPP for taxonomic profiling and identification, and UPP for MSA estimation. I show that SEPP has improved accuracy compared to the single HMM approach. I also show that SEPP results in more accurate phylogenetic placements compared to existing placement methods, and SEPP is more computationally efficient, both in peak memory usage and running time. I show that TIPP more accurately classifies novel sequences compared to the single HMM approach, and TIPP estimates more accurate taxonomic profiles than leading methods on simulated metagenomic datasets. I show how UPP can estimate ``de novo'' alignments using fHMM. I present results that show UPP is more accurate and efficient than existing alignment methods, and estimates accurate alignments and trees on datasets containing both full-length and fragmentary sequences. Finally, I show that UPP can estimate a very accurate alignment on a dataset with 1,000,000 sequences in less than 2 days without the need of a supercomputer. / Computer Sciences / text
|
25 |
Evidence Combination in Hidden Markov Models for Gene PredictionBrejova, Bronislava January 2005 (has links)
This thesis introduces new techniques for finding genes in genomic sequences. Genes are regions of a genome encoding proteins of an organism. Identification of genes in a genome is an important step in the annotation process after a new genome is sequenced. The prediction accuracy of gene finding can be greatly improved by using experimental evidence. This evidence includes homologies between the genome and databases of known proteins, or evolutionary conservation of genomic sequence in different species. <br /><br /> We propose a flexible framework to incorporate several different sources of such evidence into a gene finder based on a hidden Markov model. Various sources of evidence are expressed as partial probabilistic statements about the annotation of positions in the sequence, and these are combined with the hidden Markov model to obtain the final gene prediction. The opportunity to use partial statements allows us to handle missing information transparently and to cope with the heterogeneous character of individual sources of evidence. On the other hand, this feature makes the combination step more difficult. We present a new method for combining partial probabilistic statements and prove that it is an extension of existing methods for combining complete probability statements. We evaluate the performance of our system and its individual components on data from the human and fruit fly genomes. <br /><br /> The use of sequence evolutionary conservation as a source of evidence in gene finding requires efficient and sensitive tools for finding similar regions in very long sequences. We present a method for improving the sensitivity of existing tools for this task by careful modeling of sequence properties. In particular, we build a hidden Markov model representing a typical homology between two protein coding regions and then use this model to optimize a component of a heuristic algorithm called a spaced seed. The seeds that we discover significantly improve the accuracy and running time of similarity search in protein coding regions, and are directly applicable to our gene finder.
|
26 |
Advancing Loop Prediction to Ultra-High Resolution SamplingMiller, Edward Blake January 2014 (has links)
Homology modeling is integral to structure-based drug discovery. Robust homology modeling to atomic-level accuracy requires in the general case successful prediction of protein loops containing small segments of secondary structure. For loops identified to possess α-helical segments, an alternative dihedral library is employed composed of (phi,psi) angles commonly found in helices. Even with imperfect knowledge coming from sequence-based secondary structure, helix or hairpin embedded loops, up to 17 residues in length, are successfully predicted to median sub-angstrom RMSD. Having demonstrated success with these cases, performance costs for these and other similar long loop predictions will be discussed. Dramatic improvements in both speed and accuracy are possible through the development of a Cβ-based scoring function, applicable to hydrophobic residues, that can be applied as early as half-loop buildup. With this scoring function, up to a 30-fold reduction in the cost to produce competitive sub-2 A loops are observed. Through the use of this scoring function, an efficient method will be presented to achieve ultra-high resolution buildup that restrains combinatorial explosion and offers an alternative to the current approach to full-loop buildup. This novel method is designed to be inherently suitable for homology model refinement.
|
27 |
Développement d'une méthode automatique fiable de modélisation de la structure tridimensionnelle des protéines par homologie et application au protéome de Brucella melitensisLambert, Christophe GF 26 September 2003 (has links)
La connaissance de la structure tridimensionnelle (3D) des protéines est une information capitale. Néanmoins, le nombre de protéines dont la structure 3D a été déterminée expérimentalement est cent fois plus faible que le nombre de protéines connues aujourd'hui. Cet écart ne pourra pas être comblé, car les techniques expérimentales de détermination de structure (diffraction de rayons X et résonance magnétique nucléaire) sont coûteuses et lentes (un an de travail en moyenne pour une seule protéine).
Un moyen d'obtenir plus rapidement la structure 3D de protéines est de la prédire par des moyens bioinformatiques. La technique de prédiction la plus précise actuellement est la modélisation par homologie. Celle-ci est basée sur la similitude de structure entre deux protéines de séquences similaires. L'étape critique de cette méthode est l'étape d'alignement entre la séquence à modéliser et une séquence similaire de structure connue.
Notre travail a consisté tout d'abord en la conception d'une nouvelle méthode d'alignement pairé très fiable. Cette méthode a ensuite été incluse dans un système automatique de modélisation par homologie: la bonne qualité des structures prédites par le système trouve en partie son origine dans le programme d'alignement utilisé.
Enfin, nous avons appliqué notre système de modélisation automatique à la modélisation de toutes les protéines déduites du génome d'une bactérie pathogène étudiée dans notre unité de recherche: Brucella melitensis. Cela nous a conduit à créer une banque de données structurales et fonctionnelles consacrée au génome de cette bactérie. Cette banque de données est devenue un outil de travail indispensable pour plusieurs équipes de recherche européennes qui étudient Brucella melitensis.
|
28 |
RNA Homology Searches Using Pair SeedingDarbha, Sriram January 2005 (has links)
Due to increasing numbers of non-coding RNA (ncRNA) being discovered recently, there is interest in identifying homologs of a given structured RNA sequence. Exhaustive homology searching for structured RNA molecules using covariance models is infeasible on genome-length sequences. Hence, heuristic methods are employed, but they largely ignore structural information in the query. We present a novel method, which uses secondary structure information, to perform homology searches for a structured RNA molecule. We define the concept of a <em>pair seed</em> and theoretically model alignments of random and related paired regions to compute expected sensitivity and specificity. We show that our method gives theoretical gains in sensitivity and specificity compared to a BLAST-based heuristic approach. We provide experimental verification of this gain. <br /><br /> We also show that pair seeds can be effectively combined with the spaced seeds approach to nucleotide homology search. The hybrid search method has theoretical specificity superior to that of the BLAST seed. We provide experimental evaluation of our hypotheses. Finally, we note that our method is easily modified to process pseudo-knotted regions in the query, something outside the scope of covariance model based methods.
|
29 |
RNA Homology Searches Using Pair SeedingDarbha, Sriram January 2005 (has links)
Due to increasing numbers of non-coding RNA (ncRNA) being discovered recently, there is interest in identifying homologs of a given structured RNA sequence. Exhaustive homology searching for structured RNA molecules using covariance models is infeasible on genome-length sequences. Hence, heuristic methods are employed, but they largely ignore structural information in the query. We present a novel method, which uses secondary structure information, to perform homology searches for a structured RNA molecule. We define the concept of a <em>pair seed</em> and theoretically model alignments of random and related paired regions to compute expected sensitivity and specificity. We show that our method gives theoretical gains in sensitivity and specificity compared to a BLAST-based heuristic approach. We provide experimental verification of this gain. <br /><br /> We also show that pair seeds can be effectively combined with the spaced seeds approach to nucleotide homology search. The hybrid search method has theoretical specificity superior to that of the BLAST seed. We provide experimental evaluation of our hypotheses. Finally, we note that our method is easily modified to process pseudo-knotted regions in the query, something outside the scope of covariance model based methods.
|
30 |
Evidence Combination in Hidden Markov Models for Gene PredictionBrejova, Bronislava January 2005 (has links)
This thesis introduces new techniques for finding genes in genomic sequences. Genes are regions of a genome encoding proteins of an organism. Identification of genes in a genome is an important step in the annotation process after a new genome is sequenced. The prediction accuracy of gene finding can be greatly improved by using experimental evidence. This evidence includes homologies between the genome and databases of known proteins, or evolutionary conservation of genomic sequence in different species. <br /><br /> We propose a flexible framework to incorporate several different sources of such evidence into a gene finder based on a hidden Markov model. Various sources of evidence are expressed as partial probabilistic statements about the annotation of positions in the sequence, and these are combined with the hidden Markov model to obtain the final gene prediction. The opportunity to use partial statements allows us to handle missing information transparently and to cope with the heterogeneous character of individual sources of evidence. On the other hand, this feature makes the combination step more difficult. We present a new method for combining partial probabilistic statements and prove that it is an extension of existing methods for combining complete probability statements. We evaluate the performance of our system and its individual components on data from the human and fruit fly genomes. <br /><br /> The use of sequence evolutionary conservation as a source of evidence in gene finding requires efficient and sensitive tools for finding similar regions in very long sequences. We present a method for improving the sensitivity of existing tools for this task by careful modeling of sequence properties. In particular, we build a hidden Markov model representing a typical homology between two protein coding regions and then use this model to optimize a component of a heuristic algorithm called a spaced seed. The seeds that we discover significantly improve the accuracy and running time of similarity search in protein coding regions, and are directly applicable to our gene finder.
|
Page generated in 0.0791 seconds