Global ETD Search

1	Aligning multiple sequences adaptively Ye, Yongtao, 叶永滔 January 2014 (has links) With the rapid development of genome sequencing, an ever-increasing number of molecular biology analyses, such as motif detection, phylogeny inference and structure prediction, rely on the construction of an accurate multiple sequence alignment (MSA). Although many methods have been developed during the last two decades, most of them may perform poorly on some types of inputs, in particular when families of sequences fall below thirty percent similarity. Therefore, this thesis introduces two effective approaches to improve the overall quality of multiple sequence alignment. First, by considering the similarity of the input sequences, we propose an adaptive approach to compute better substitution matrices for each pair of sequences, and then apply the progressive alignment method to align them. For example, for inputs with high similarity, we consider the whole sequences and align them with a global pair-Hidden Markov model, while for those with moderately low similarity, we may ignore the flank regions and use local pair-Hidden Markov models to align them. To test the effectiveness of this approach, we implemented a multiple sequence alignment tool called GLProbs and compared its performance with a dozen leading tools on three benchmark alignment databases; GLProbs' alignments have the best scores in almost all tests. We also evaluated the practicability of GLProbs' alignments by applying the tool to three biological applications, namely phylogenetic tree reconstruction, protein secondary structure prediction and the detection of high-risk members for cervical cancer in the HPV-E6 family, and the results are very encouraging. Second, based on our previous study, we propose another new tool, PnpProbs, which constructs better multiple sequence alignments through better handling of guide trees.
It classifies input sequences into two types: normally related sequences and distantly related sequences. For normally related sequences, it uses an adaptive approach to construct the guide tree and, based on this guide tree, aligns the sequences progressively. To be more precise, it first estimates the input's discrepancy by computing the standard deviation of the sequences' percent identities, and based on this estimate, it chooses the best method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree; instead it uses the non-progressive sequence annealing method to construct the multiple sequence alignment. By combining the strengths of the progressive and non-progressive methods, and with a better way to construct the guide tree, PnpProbs significantly improves the quality of multiple sequence alignments not only for general input sequences but also for very distantly related ones. With these encouraging empirical results, our software tools have gradually gained recognition in the community. For example, GLProbs has been incorporated, by invitation, into the JAva Bioinformatics Analysis Web Services system (JABAWS). / published_or_final_version / Computer Science / Master / Master of Philosophy Sequence alignment (Bioinformatics)
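The discrepancy estimate mentioned in the PnpProbs abstract (the standard deviation of pairwise percent identities) can be illustrated with a minimal sketch. This is not PnpProbs' code: the function names are hypothetical, the sequences are assumed to be pre-aligned to equal length with '-' for gaps, and percent identity is taken as matches divided by columns where neither sequence has a gap, which is only one of several definitions in use. How the resulting value maps to a guide-tree method is not stated in the abstract and is left out.

```python
from itertools import combinations
from statistics import pstdev


def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences.

    Assumption: columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_spread(sequences):
    """Population standard deviation of all pairwise percent identities."""
    pids = [percent_identity(a, b) for a, b in combinations(sequences, 2)]
    return pstdev(pids) if len(pids) > 1 else 0.0


# Toy input (hypothetical): three short pre-aligned sequences.
toy = ["ACGT", "ACGA", "TCGA"]
print(round(identity_spread(toy), 3))
```

A small spread suggests the sequences are roughly equally related to one another, while a large spread suggests a mix of close and distant pairs; any cutoff used to pick a guide-tree method would be a tuning choice outside this sketch.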
2	Biological sequence analyses theory, algorithms, and applications / Ma, Fangrui. January 2009 (has links) Thesis (Ph.D.)--University of Nebraska-Lincoln, 2009. / Title from title screen (site viewed October 13, 2009). PDF text: xv, 233 p. : ill. ; 4 Mb. UMI publication number: AAT 3360173. Includes bibliographical references. Also available in microfilm and microfiche formats.
3	A two-pronged approach to improve distant homology detection Lee, Marianne M. January 2009 (has links) Thesis (Ph. D.)--Ohio State University, 2009. / Title from first page of PDF file. Includes bibliographical references (p. 91-100).
4	Sequence alignment : algorithm development and applications / Jiang, Tianwei. January 2009 (has links) Includes bibliographical references (p. 64-71).
5	The limits of progressive multiple sequence alignment / Sheneman, Lucas James. January 1900 (has links) Thesis (Ph. D., Bioinformatics and Computational Biology)--University of Idaho, August 2008. / Major professor: James A. Foster. Includes bibliographical references (leaves 90-94). Also available online (PDF file) by subscription or by purchasing the individual file.
6	Exploring microbial community structures and functions of activated sludge by high-throughput sequencing Ye, Lin, 叶林 January 2012 (has links) The major objectives of this study were to investigate the diversities and abundances of nitrifiers and to apply high-throughput sequencing technologies to analyze the overall microbial community structures and functions in wastewater treatment bioreactors. Specifically, this study was conducted: (1) to investigate the diversities and abundances of AOA, AOB and NOB in bioreactors, (2) to explore the bacterial communities in bioreactors using 454 pyrosequencing, and (3) to analyze the metagenomes of activated sludge using Illumina sequencing. A lab-scale nitrification bioreactor was operated for 342 days under low DO (0.15~0.5 mg/L) and high nitrogen loading (0.26~0.52 kg-N/(m³·d)). T-RFLP and cloning analysis showed that there was only one dominant species each of AOA, AOB and NOB in the bioreactor. The amoA gene of the dominant AOA had a similarity of 89.3% with the isolated AOA species Nitrosopumilus maritimus SCM1. The AOB species detected in the bioreactor belonged to the Nitrosomonas genus. The abundance of AOB was more than 40 times larger than that of AOA. The percentage of NOB in total bacteria increased from not detectable to 30% when DO changed from 0.15 to 0.5 mg/L. Compared with traditional methods, pyrosequencing analysis of the bacteria in this bioreactor provided unprecedented information: 494 bacterial OTUs were obtained at a 3% distance cutoff. Furthermore, 454 pyrosequencing was applied to investigate the bacterial communities of activated sludge samples from 14 WWTPs in Asia (mainland China, Hong Kong, and Singapore) and North America (Canada and the United States). The results revealed large numbers of OTUs in activated sludge, i.e. 1183~3567 OTUs in one sludge sample at a 3% distance cutoff. Clear geographical differences among these samples were observed.
The AOB amoA genes in different WWTPs were found to be quite diverse, while the 16S rRNA genes were relatively conserved. To explore microbial community structures and functions in the above-mentioned lab-scale bioreactor and a full-scale bioreactor, over six gigabases of metagenomic sequence data and 150,000 paired-end reads of PCR amplicons were generated from the activated sludge in the two bioreactors on the Illumina HiSeq2000 platform. Three kinds of sequences (16S rRNA amplicons, 16S rRNA gene tags and predicted genes) were used to conduct taxonomic assignment, and their applicability and reliability were compared. Specifically, based on 16S rRNA and amoA gene sequences, AOB were found to be more abundant than AOA in the two bioreactors. Furthermore, the analysis of the metabolic profiles and pathways indicated that the overall pathways in the two bioreactors were quite similar. However, the abundances of some specific genes in the two bioreactors were different. In addition, 454 pyrosequencing was also used to detect potentially pathogenic bacteria in environmental samples. The most abundant potentially pathogenic bacteria in the WWTPs were found to be affiliated with Aeromonas and Clostridium. Aeromonas veronii, Aeromonas hydrophila and Clostridium perfringens were the species most similar to the potentially pathogenic bacteria found in this study. Overall, the percentage of the sequences closely related to known pathogenic bacteria sequences was about 0.16% of the total sequences. Additionally, a Java application (BAND) was developed for graphical visualization of microbial abundance data. / published_or_final_version / Civil Engineering / Doctoral / Doctor of Philosophy Sequence alignment (Bioinformatics) Microbial genomics - Data processing.
7	Iterative de Bruijn graph assemblers for second-generation sequencing reads Peng, Yu, 彭煜 January 2012 (has links) The recent advance of second-generation sequencing technologies has made it possible to generate a vast amount of short read sequences from a DNA (cDNA) sample. Current short read assemblers make use of the de Bruijn graph, in which each vertex is a k-mer and each edge connecting vertex u and vertex v represents u and v appearing in a read consecutively, to produce contigs. There are three major problems for de Bruijn graph assemblers: (1) the branch problem, due to errors and repeats; (2) the gap problem, due to low or uneven sequencing depth; and (3) the error problem, due to sequencing errors. A proper choice of k value is a crucial tradeoff in de Bruijn graph assemblers: a low k value leads to fewer gaps but more branches; a high k value leads to fewer branches but more gaps. In this thesis, I first analyze the fundamental genome assembly problem and then propose an iterative de Bruijn graph assembler (IDBA), which iterates from low to high k values, to construct a de Bruijn graph with fewer branches and fewer gaps than any other de Bruijn graph assembler using a fixed k value. Then, the second-generation sequencing data from metagenomic, single-cell and transcriptome samples is investigated. IDBA is then tailored with special treatments to handle the specific issues for each kind of data. For metagenomic sequencing data, a graph partition algorithm is proposed to separate the de Bruijn graph into dense components, which represent similar regions in subspecies from the same species, and multiple sequence alignment is used to produce a consensus of each component. For sequencing data with highly uneven depth such as single-cell and metagenomic sequencing data, a method called local assembly is designed to reconstruct missing k-mers in low-depth regions.
Then, based on the observation that short and relatively low-depth contigs are more likely to be erroneous, progressive depth on contigs is used to remove errors in both low-depth and high-depth regions iteratively. For transcriptome sequencing data, a variant of the progressive depth method is adopted to decompose the de Bruijn graph into components corresponding to transcripts from the same gene, and then the transcripts are found in each component by considering the support from reads and paired-end reads. Extensive experiments on both simulated and real data show that IDBA assemblers outperform the existing assemblers by constructing longer contigs with higher completeness and similar or better accuracy. The running time of IDBA assemblers is comparable to existing algorithms, while the memory cost is usually less than the others. / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy Nucleotide sequence - Data processing. Sequence alignment (Bioinformatics)
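The de Bruijn graph definition given in the IDBA abstract (each vertex a k-mer, each edge linking two k-mers that appear consecutively in a read) and the tradeoff in choosing k can be illustrated with a minimal sketch. This is not IDBA's implementation: the function names and the toy read are hypothetical, and error handling, reverse complements and the iteration over k values are all omitted.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for u, v in zip(kmers, kmers[1:]):
            graph[u].add(v)
        for kmer in kmers:
            graph.setdefault(kmer, set())  # keep vertices without successors
    return dict(graph)


def count_branches(graph):
    """Number of vertices with more than one successor."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Toy read (hypothetical) containing the repeat "ACG".
reads = ["ACGTACGA"]
for k in (3, 5):
    g = build_de_bruijn(reads, k)
    print(k, len(g), count_branches(g))
```

With k = 3 the repeated k-mer ACG has two successors and creates a branch; with k = 5 the graph for this read is a simple path. The other side of the tradeoff also shows up in the sketch: a read shorter than k contributes no k-mers at all, which is one way gaps arise at high k.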
8	Combinatorial optimization and application to DNA sequence analysis Gupta, Kapil. January 2008 (has links) Thesis (Ph.D)--Industrial and Systems Engineering, Georgia Institute of Technology, 2009. / Committee Chair: Lee, Eva K.; Committee Member: Barnes, Earl; Committee Member: Fan, Yuhong; Committee Member: Johnson, Ellis; Committee Member: Yuan, Ming. Part of the SMARTech Electronic Thesis and Dissertation Collection.
9	GPU accelerated sequence alignment / Zhao, Kaiyong 15 November 2016 (has links) DNA sequence alignment is a fundamental task in gene information processing: it searches for the location of a string (usually based on newly collected DNA data) in existing, very large DNA sequence databases. Due to the huge amount of newly generated DNA data and the complexity of approximate string matching, sequence alignment is a time-consuming process. Hence, how to reduce the alignment time is a significant research problem. Several string alignment algorithms based on hash comparison, suffix arrays and the BWT have been proposed for DNA sequence alignment. Although these algorithms run in O(N) time, they still cannot meet the increasing demand when running on traditional CPUs. Recently, GPUs have been widely accepted as an efficient accelerator for many scientific and commercial applications. A typical GPU has thousands of processing cores, which can speed up repetitive computations significantly compared to multi-core CPUs. However, sequence alignment is a computation with intensive data access, i.e., it is memory-bound. Access to GPU memory and IO has a more significant influence on performance than the computing capabilities of the GPU cores. By analyzing GPU memory and IO characteristics, this thesis produces novel parallel algorithms for DNA sequence alignment applications. This thesis consists of six parts. The first two parts explain some basic knowledge of DNA sequence alignment and GPU computing. The third part investigates the performance of data access on different types of GPU memory. The fourth part describes a parallel method to accelerate short-read sequence alignment based on the BWT algorithm. The fifth part proposes a parallel algorithm for accelerating BLASTN, one of the most popular sequence alignment software tools.
It shows how multi-threaded control and multiple GPU cards can accelerate the BLASTN algorithm significantly. The sixth part concludes the whole thesis. To summarize, by analyzing the layout of GPU memory and comparing data access under multithreaded access modes, this thesis arrives at an optimized method for sequence alignment on the GPU. The outcomes can help practitioners in bioinformatics improve their working efficiency by significantly reducing the sequence alignment time.
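For readers unfamiliar with the BWT-based matching that the GPU thesis above builds on, the following minimal sketch shows exact pattern counting by backward search over a Burrows-Wheeler transform. It is a plain single-threaded illustration, not the thesis's GPU algorithm: the function names are hypothetical, the transform is built by naively sorting rotations, and occurrence counts are recomputed by scanning rather than read from a precomputed table, so it is only suitable for toy inputs.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; '$' marks the end."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c] = index of the first row starting with c in the sorted rotations
    first = {}
    for i, c in enumerate(sorted(bwt_str)):
        first.setdefault(c, i)
    lo, hi = 0, len(bwt_str)  # half-open range of matching rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + bwt_str[:lo].count(c)
        hi = first[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
print(transformed)                            # annb$aa
print(count_occurrences(transformed, "ana"))  # 2
```

Each step of the search reads occurrence counts at two positions of the transformed string that depend on the previous step, which gives a feel for why this kind of matching is dominated by memory access rather than arithmetic.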
10	Advancing Loop Prediction to Ultra-High Resolution Sampling Miller, Edward Blake January 2014 (has links) Homology modeling is integral to structure-based drug discovery. Robust homology modeling to atomic-level accuracy requires, in the general case, successful prediction of protein loops containing small segments of secondary structure. For loops identified as possessing α-helical segments, an alternative dihedral library is employed, composed of (phi, psi) angles commonly found in helices. Even with imperfect knowledge coming from sequence-based secondary structure prediction, helix- or hairpin-embedded loops up to 17 residues in length are successfully predicted to median sub-angstrom RMSD. Having demonstrated success with these cases, performance costs for these and other similar long loop predictions will be discussed. Dramatic improvements in both speed and accuracy are possible through the development of a Cβ-based scoring function, applicable to hydrophobic residues, that can be applied as early as half-loop buildup. With this scoring function, up to a 30-fold reduction in the cost to produce competitive sub-2 Å loops is observed. Through the use of this scoring function, an efficient method will be presented to achieve ultra-high resolution buildup that restrains combinatorial explosion and offers an alternative to the current approach to full-loop buildup. This novel method is designed to be inherently suitable for homology model refinement. Sequence alignment (Bioinformatics) Proteins--Mathematical models Chemistry, Physical and theoretical Chemistry
