641

Improving the quality of multiple sequence alignment

Lu, Yue 15 May 2009 (has links)
Multiple sequence alignment is an important bioinformatics problem, with applications in diverse types of biological analysis such as structure prediction, phylogenetic analysis and critical site identification. Although newly developed methods have substantially improved alignment quality in recent years, constructing accurate alignments remains difficult, especially for divergent sequences. In this dissertation, we propose three new methods (PSAlign, ISPAlign, and NRAlign) for further improving the quality of multiple sequence alignment. In PSAlign, we propose an alternative formulation of multiple sequence alignment based on finding a multiple alignment that preserves all the pairwise alignments specified by the edges of a given tree. In contrast with traditional NP-hard formulations, this preserving-alignment formulation can be solved in polynomial time without resorting to heuristics, while still retaining very good performance compared to traditional heuristics. In ISPAlign, additional hits from a database search of the input sequences are exploited through several strategies that significantly improve alignment accuracy: constructing profiles from the hits and performing profile alignment, including high-scoring hits in the input sequences, using intermediate sequence search to link distant homologs, and using secondary structure information. In NRAlign, we observe that alignment accuracy can be further improved by taking the alignment of neighboring residues into account when aligning two residues, thus making better use of horizontal information. By modifying existing multiple alignment algorithms to use this horizontal information, we show that this strategy consistently improves over existing algorithms on all the benchmarks commonly used to measure alignment accuracy.
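The PSAlign formulation described above assembles a multiple alignment by preserving pairwise alignments along the edges of a guide tree. As a rough, hedged illustration of the pairwise building block involved (not code from the thesis), the following Python sketch implements standard Needleman-Wunsch global alignment; the scoring parameters are illustrative placeholders.

```python
# Minimal sketch, assuming simple linear gap costs: Needleman-Wunsch
# global pairwise alignment, the kind of building block a tree-based
# "preserving" multiple alignment could merge along guide-tree edges.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Traceback to recover the aligned strings
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i-1][j-1] +
                (match if a[i-1] == b[j-1] else mismatch)):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j-1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))

print(needleman_wunsch("GATTACA", "GCATGCU"))
```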
642

Algorithms for Gene Clustering Analysis on Genomes

Yi, Gang Man May 2011 (has links)
The increased availability of data in biological databases provides many opportunities for understanding biological processes. As attention has shifted from sequence analysis to higher-level analysis of genes across multiple genomes, efficient algorithms are needed for these large-scale applications that can help us understand the functions of genes. The overall objective of my research was to develop improved methods that automatically identify groups of functionally related genes in large-scale data sets by applying new gene clustering algorithms. The proposed algorithms, which can help us understand gene function and genome evolution, include new algorithms for protein family classification, a window-based strategy for gene clustering on chromosomes, and an exhaustive strategy that enumerates all clusters of small size. I investigate the problems of gene clustering in multiple genomes, define them mathematically, and solve them by developing efficient and effective algorithms. For protein family classification, I developed two supervised classification algorithms that assign proteins to existing protein families in public databases and, by taking into account similarities between the unclassified proteins, allow progressive construction of new families from proteins that cannot be assigned. This approach is useful for rapid assignment of protein sequences from genome sequencing projects to protein families. A comparative analysis shows that the algorithm has a higher accuracy rate and a lower misclassification rate than previously developed methods based on multiple sequence alignments and hidden Markov models, and it performs well even on families with very few proteins or low sequence similarity. Apart from the analysis of individual sequences, identifying genomic regions that descended from a common ancestor helps us study gene function and genome evolution. In distantly related genomes, clusters of homologous gene pairs serve as evidence for function prediction, operon detection, and related tasks, so reliable identification of gene clusters is critical to functional annotation and analysis of genes. I developed an efficient gene clustering algorithm that can be applied to hundreds of genomes at the same time, allowing large-scale study of the evolutionary relationships of gene clusters and of operon formation and destruction. By placing a stricter limit on the maximum cluster size, I developed another algorithm that uses a different formulation based on constraining the overall size of a cluster, with statistical estimates that allow direct comparison of clusters of different sizes. A comparative analysis of the proposed algorithms shows that more biological insight can be obtained by analyzing gene clusters across hundreds of genomes, which can help us understand operon occurrences, gene orientations and gene rearrangements.
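As a hedged sketch of the window-based clustering idea mentioned above (not the thesis' actual algorithm), the following Python fragment slides a fixed-size window along the gene orders of two genomes and reports window pairs sharing enough homologous gene families; the window size, threshold and gene-family IDs are illustrative assumptions.

```python
# Toy window-based gene clustering: report pairs of windows (one per
# genome) that share at least `min_shared` gene families.
def window_clusters(genome_a, genome_b, window=5, min_shared=3):
    """genome_a, genome_b: lists of gene-family IDs in chromosomal order."""
    hits = []
    for i in range(len(genome_a) - window + 1):
        wa = set(genome_a[i:i + window])
        for j in range(len(genome_b) - window + 1):
            shared = wa & set(genome_b[j:j + window])
            if len(shared) >= min_shared:
                hits.append((i, j, sorted(shared)))
    return hits

# Example: two toy genomes whose gene orders differ by rearrangement
a = ["f1", "f2", "f3", "f9", "f4", "f5", "f6"]
b = ["f7", "f3", "f2", "f1", "f8", "f5", "f6"]
print(window_clusters(a, b))  # e.g. windows at (0, 0) share f1, f2, f3
```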
643

Use of bioinformatics to investigate and analyze transposable element insertions in the genomes of Caenorhabditis elegans and Drosophila melanogaster, and into the target plasmid pGDV1

Julian, Andrea Marian 17 February 2005 (has links)
Transposable elements (TEs) are utilized for the creation of a wide range of transgenic organisms. However, in some systems this technique is not very efficient, owing to low transposition frequencies and integration into unstable or transcriptionally inactive genomic regions. One approach to ameliorating this problem is to increase knowledge of how transposons move and where they integrate into target genomes. Most transposons do not insert randomly into their host genome; class II TEs utilize target sequences of between 2 and 8 bp in length, which are duplicated upon insertion. Furthermore, certain insertion sites are preferred and hence classified as hot spots, while sites not targeted by TEs are referred to as cold spots. The hypothesis tested in this analysis is that, in addition to the primary consensus target sequence, secondary and tertiary DNA structures have a significant influence on TE target site preference. Bioinformatics was used to predict and analyze the structure of the DNA flanking known insertion sites and cold spots for various TEs, to understand why insertion sites are used preferentially over cold spots for element integration. Hidden Markov models were built and trained to analyze datasets of insertions of the P element in the Drosophila melanogaster genome, the Tc1 element in the Caenorhabditis elegans genome, and insertions of the Mos1, piggyBac and Hermes transposons into the target plasmid pGDV1. Analysis of the DNA structural profiles of the insertion sites revealed that the P element and Hermes transposons both targeted regions of DNA with a relatively high degree of bendability/flexibility at the insertion site. However, similar trends were not observed for the Tc1, Mos1 or piggyBac transposons. Hence, it appears that the secondary structural features of DNA can contribute to target site preference for some, but not all, transposable elements.
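The thesis trains hidden Markov models on insertion-site data; as a simpler stand-in that illustrates the same idea of learning target-site preferences from known insertions, the following Python sketch builds a position weight matrix (PWM) from toy flanking sequences and scores a candidate site. All sequences and the pseudocount are illustrative, not taken from the thesis.

```python
# PWM sketch: learn per-position base preferences from known insertion
# flanks, then score candidates as log-odds vs. a uniform background.
import math

def build_pwm(sites, alphabet="ACGT", pseudo=0.5):
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {b: pseudo for b in alphabet}  # pseudocount smoothing
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        # log-odds against a uniform 0.25 background frequency
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in alphabet})
    return pwm

def score(pwm, seq):
    return sum(col[base] for col, base in zip(pwm, seq))

known_flanks = ["GGCCAGAC", "GGTCAGAC", "GGCCAGAT", "GACCAGAC"]  # toy data
pwm = build_pwm(known_flanks)
print(score(pwm, "GGCCAGAC"), score(pwm, "TTTTTTTT"))  # hot vs. cold
```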
644

Biological sequence analyses: theory, algorithms, and applications

Ma, Fangrui. January 2009 (has links)
Thesis (Ph.D.)--University of Nebraska-Lincoln, 2009. / Title from title screen (site viewed October 13, 2009). PDF text: xv, 233 p. : ill. ; 4 Mb. UMI publication number: AAT 3360173. Includes bibliographical references. Also available in microfilm and microfiche formats.
645

Subcellular structure modeling and tracking for cell dynamics study

Wen, Quan. January 2008 (has links)
Thesis (Ph.D.)--University of Texas at Arlington, 2008.
646

A two-pronged approach to improve distant homology detection

Lee, Marianne M. January 2009 (has links)
Thesis (Ph.D.)--Ohio State University, 2009. / Title from first page of PDF file. Includes bibliographical references (p. 91-100).
647

Developing a bioinformatics utility belt to eliminate search redundancy from the ever-growing databases

Taylor, Misha. Engelen, Robert A. van. January 2003 (has links)
Thesis (M.S.)--Florida State University, 2003. / Advisor: Dr. Robert van Engelen, Florida State University, College of Arts and Sciences, Department of Computer Science. Title and description from dissertation home page (viewed Oct. 1, 2003). Includes bibliographical references.
648

Sequence alignment: algorithm development and applications

Jiang, Tianwei. January 2009 (has links)
Includes bibliographical references (p. 64-71).
649

Identification of cancer subtypes and subtype-specific drivers using high-throughput data with application to medulloblastoma

Chen, Peikai., 陈培凯. January 2012 (has links)
Cancer is a deadly disease for which there is still no general cure, largely because its mechanisms remain poorly understood; this, in turn, reflects the complex molecular activities that underlie cancer processes. Some variables of these processes, such as gene expression, copy number profiles and point mutations, have recently become measurable in high throughput. However, these data are massive and difficult to interpret even for experts, and many efforts are being made to develop engineering tools for their analysis and interpretation. In this thesis, we focus on the problem of individuality in cancer. More specifically, we are interested in identifying the subgroups of processes in a cancer, called subtypes. This problem has both theoretical and practical implications: theoretically, classification of cancer patients represents an understanding of the disease and may help speed up drug development; practically, subgroups of patients can be treated with different protocols for optimal outcomes. Toward this end, we propose an approach with two specific aims: identifying subtypes from a given set of high-throughput data, and identifying candidate genes (called drivers) that drive the subtype-specific processes. First, we assume that a subtype has a distinctive process, compared not just with normal controls but also with other cases of the same cancer. The process is characterized by a set of differentially expressed genes uniquely found in the corresponding subtype. Based on this assumption, we develop a signature-based subtyping algorithm, which divides a set of cases into as many subtypes as possible while merging subtypes whose signature sets are too small. We applied this algorithm to datasets of the pediatric brain tumor medulloblastoma, and found that no more than three subtypes can meet the above criteria. Second, we explore subtype patterns in the copy number profiles. By regarding all events on a chromosome arm as a single event, we quantize the copy number profiles into event profiles. An unsupervised decision tree training algorithm is specifically designed for detecting subtypes on these profiles. The trained decision tree is intuitive, predictive, easy to implement and deterministic. Its application to datasets of medulloblastoma reveals interesting subtype patterns characterized by co-occurrence of CNA events. / published_or_final_version / Electrical and Electronic Engineering / Doctoral / Doctor of Philosophy
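As a minimal sketch of the arm-level quantization step described above (the thresholds and input format are assumptions, not the thesis' own), the following Python fragment collapses probe-level copy number log-ratios on each chromosome arm into a single gain/loss/neutral event.

```python
# Quantize a copy number profile into an arm-level event profile.
def quantize_arm(log_ratios, gain_cut=0.2, loss_cut=-0.2):
    """log_ratios: copy number log2 ratios of all probes on one arm.
    Cutoffs are illustrative placeholders."""
    mean = sum(log_ratios) / len(log_ratios)
    if mean >= gain_cut:
        return "gain"
    if mean <= loss_cut:
        return "loss"
    return "neutral"

# profile: chromosome arm -> probe-level log2 ratios for one tumor
profile = {"17p": [-0.5, -0.4, -0.6], "7q": [0.3, 0.4, 0.25], "2p": [0.0, 0.05, -0.1]}
event_profile = {arm: quantize_arm(vals) for arm, vals in profile.items()}
print(event_profile)  # {'17p': 'loss', '7q': 'gain', '2p': 'neutral'}
```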
650

Deciphering the mechanisms of genetic disorders by high throughput genomic data

Bao, Suying, 鲍素莹 January 2013 (has links)
A new generation of non-Sanger-based sequencing technologies, so-called “next-generation” sequencing (NGS), has been changing the landscape of genetics at unprecedented speed. In particular, our capacity to decipher the genotypes underlying phenotypes, such as diseases, has never been greater. However, before NGS can be fully applied in medical genetics, researchers have to bridge the widening gap between the generation of massively parallel sequencing output and the capacity to analyze the resulting data. In addition, even when an effective NGS analysis yields a list of candidate genes with potential causal variants, pinpointing the disease genes on that list remains a challenge, especially when the molecular basis of the disease is not fully elucidated. New NGS users are often bewildered by the plethora of mapping, assembly, variant calling and filtering programs, with little guidance on how to compare these tools and choose the “right” ones. To get an overview of the various bioinformatics approaches to mapping and assembly, a series of performance evaluations was conducted using both real and simulated NGS short reads. For NGS variant detection, the performance of the two most widely used toolkits, SAMtools and GATK, was assessed. Based on the results of this systematic evaluation, an NGS data processing and analysis pipeline was constructed. The pipeline proved successful with the identification of a mutation (a frameshift deletion in Hnrnpa1, p.Leu181Valfs*6) related to congenital heart defect (CHD) in procollagen type IIA deficient mice. To prioritize risk genes for diseases, especially those with limited prior knowledge, a network-based gene prioritization model was constructed. It consists of two parts: network analysis on known disease genes (the seed-based network strategy) and network analysis on differential expression (the DE-based network strategy). Case studies of various complex diseases and traits demonstrated that the DE-based network strategy can greatly outperform traditional gene expression analysis in predicting disease-causing genes. A series of simulations indicated that the DE-based strategy is especially valuable for diseases with limited prior knowledge, and that the model's performance can be further improved by integration with the seed-based network strategy. Moreover, a successful application of the network-based gene prioritization model in a study of influenza host genetics further demonstrated the model's capacity to identify promising candidates and to mine new risk genes and pathways not biased toward current knowledge. In conclusion, an efficient NGS analysis framework spanning quality control and variant detection through result analysis and gene prioritization has been constructed for medical genetics. The novelty of this framework is an encouraging attempt to prioritize risk genes for poorly characterized diseases by network analysis on known disease genes and differential expression data. The successful applications in detecting genetic factors associated with CHD and with influenza host resistance demonstrate the efficacy of this framework, and may further stimulate applications of high-throughput genomic data in dissecting the genetic components of human disorders. / published_or_final_version / Biochemistry / Doctoral / Doctor of Philosophy
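As a hedged illustration of the seed-based network strategy (a generic random walk with restart, not the thesis' exact model), the following Python sketch scores genes in a toy interaction network by their propagated proximity to known disease genes; the network, restart probability and convergence tolerance are illustrative assumptions.

```python
# Random walk with restart (RWR) from seed genes on a toy network.
import numpy as np

def rwr(adj, seeds, restart=0.5, tol=1e-8):
    """adj: symmetric adjacency matrix; seeds: indices of known disease genes."""
    # Column-normalize so each step distributes a node's score to neighbors
    w = adj / adj.sum(axis=0, keepdims=True)
    p0 = np.zeros(adj.shape[0])
    p0[seeds] = 1.0 / len(seeds)
    p = p0.copy()
    while True:
        p_next = (1 - restart) * (w @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy 5-gene interaction network; gene 0 is the known disease gene
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 1, 0],
                [1, 1, 0, 0, 1],
                [0, 1, 0, 0, 1],
                [0, 0, 1, 1, 0]], dtype=float)
scores = rwr(adj, seeds=[0])
print(np.argsort(-scores))  # genes ranked by proximity to the seed
```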
