  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
131

Improved Analysis of Nanopore Sequence Data and Scanning Nanopore Techniques

Szalay, Tamas 25 July 2017 (has links)
The field of nanopore research has been driven by the need to inexpensively and rapidly sequence DNA. In order to help realize this goal, this thesis describes the PoreSeq algorithm that identifies and corrects errors in real-world nanopore sequencing data and improves the accuracy of de novo genome assembly with increasing coverage depth. The approach relies on modeling the possible sources of uncertainty that occur as DNA advances through the nanopore and then using this model to find the sequence that best explains multiple reads of the same region of DNA. PoreSeq increases nanopore sequencing read accuracy of M13 bacteriophage DNA from 85% to 99% at 100X coverage. We also use the algorithm to assemble E. coli with 30X coverage and the λ genome at a range of coverages from 3X to 50X. Additionally, we classify sequence variants at an order of magnitude lower coverage than is possible with existing methods. This thesis also reports preliminary progress towards controlling the motion of DNA using two nanopores instead of one. The speed at which the DNA travels through the nanopore needs to be carefully controlled to facilitate the detection of individual bases. A second nanopore in close proximity to the first could be used to slow or stop the motion of the DNA in order to enable a more accurate readout. The fabrication process for a new pyramidal nanopore geometry was developed in order to facilitate the positioning of the nanopores. This thesis demonstrates that two of them can be placed close enough to interact with a single molecule of DNA, which is a prerequisite for being able to use the driving force of the pores to exert fine control over the motion of the DNA. Another strategy for reading the DNA is to trap it completely with one pore and to move the second nanopore instead. 
To that end, this thesis also shows that a single strand of immobilized DNA can be captured in a scanning nanopore and examined for a full hour, with data from many scans at many different voltages obtained in order to detect a bound protein placed partway along the molecule. / Engineering and Applied Sciences - Applied Physics
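The multi-read consensus idea behind PoreSeq can be illustrated with a toy sketch. This is not the actual algorithm, which models nanopore current levels and uses iterative refinement; here a brute-force search simply picks the candidate sequence that best agrees with several noisy reads of the same region, with a made-up per-base agreement score.

```python
from itertools import product

def agreement_score(candidate, read):
    # Toy per-base score: count matching positions (a real nanopore
    # model scores current levels, not base identity).
    return sum(c == r for c, r in zip(candidate, read))

def consensus(reads, alphabet="ACGT"):
    # Exhaustively search all sequences of the read length and return
    # the one maximizing total agreement across all reads. Exponential
    # in length -- illustration only.
    length = len(reads[0])
    return max(
        ("".join(p) for p in product(alphabet, repeat=length)),
        key=lambda cand: sum(agreement_score(cand, r) for r in reads),
    )

# Three noisy reads of the same short region; the winner takes the
# majority base at each position.
reads = ["ACGT", "ACGA", "TCGT"]
print(consensus(reads))  # "ACGT"
```

The key property this toy shares with the thesis's approach is that errors uncorrelated across reads are outvoted when all reads are explained jointly, which is why accuracy improves with coverage depth.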
132

Statistical Methods for Large-Scale Integrative Genomics

Li, Yang 25 July 2017 (has links)
In the past 20 years, we have witnessed significant advances in high-throughput genetic and genomic technologies. With the massively generated genomics data, there is a pressing need for statistical methods that can use them to make quantitative inferences on substantive scientific questions. My research has focused on statistical methods for large-scale integrative genomics. The human genome encodes more than 20,000 genes, while the functions of about 50% (>10,000) of them remain unknown to date. Determining the functions of the poorly characterized genes is crucial for understanding biological processes and human diseases. In the era of Big Data, the availability of massive genomic data provides an unprecedented opportunity to identify associations between genes and predict their biological functions. Genome sequencing data and mRNA expression data are the two most important classes of genomic data. This thesis presents three research projects in self-contained chapters: (1) a statistical framework for inferring the evolutionary history of human genes and identifying gene modules with shared evolutionary history from genome sequencing data, (2) a statistical method to predict frequent and specific gene co-expression by integrating a large number of mRNA expression datasets, and (3) robust variable and interaction selection for high-dimensional classification problems under the discriminant analysis and logistic regression models. Chapter 1. Humans have more than 20,000 genes, but the functions of most of them remain uncharacterized. Determining the function of poorly characterized genes is crucial for understanding biological processes and studying human diseases. Functionally associated genes tend to be gained and lost simultaneously during evolution; identifying gene co-evolution therefore predicts gene-gene associations. 
In this chapter, we propose a mixture of tree-structured hidden Markov models for the gene evolution process, and a Bayesian model-based clustering algorithm to detect gene modules with shared evolutionary history (termed evolutionarily conserved modules, ECMs). A Dirichlet process prior is adopted to estimate the number of gene clusters, and an efficient Gibbs sampler is developed to compute the posterior distribution. Through simulation studies and benchmarks on real data sets, we show that our algorithm outperforms traditional methods that use simple metrics (e.g., Hamming distance, Pearson correlation) to measure the similarity between gene presence/absence patterns. We applied our methods to 1,025 canonical human pathway gene sets and found that a large portion of the detected gene associations are substantiated by other sources of evidence. The remaining genes have predicted functions that are high-priority candidates for verification by further biological experiments. Chapter 2. The availability of gene expression measurements across thousands of experimental conditions provides the opportunity to predict gene function based on shared mRNA expression. While many biological complexes and pathways are coordinately expressed, their genes may be organized into co-expression modules with distinct patterns in certain tissues or conditions, which can provide insight into pathway organization and function. We developed the algorithm CLIC (clustering by inferred co-expression, www.gene-clic.org), which clusters a set of functionally related genes into co-expressed modules, highlights the most relevant datasets, and predicts additional co-expressed genes. Using a statistical Bayesian partition model, CLIC simultaneously partitions the input gene set into disjoint co-expression modules and weights the most relevant datasets for each module. CLIC then expands each module with additional members that co-express with the module's genes more strongly than the background model in the weighted datasets. 
We applied CLIC to (i) model the background correlation in each of 3,662 mouse and human microarray datasets from the Gene Expression Omnibus (GEO), (ii) partition each of 900 annotated complexes/pathways into co-expression modules, and (iii) expand each co-expression module with additional genes showing frequent and specific co-expression over multiple GEO datasets. CLIC provided very strong functional predictions for many completely uncharacterized genes, including a link between the protein C7orf55 and the mitochondrial ATP synthase complex that we experimentally validated via CRISPR knockout. CLIC software is freely available and should become increasingly powerful with the growing wealth of transcriptomic datasets. Chapter 3. Discriminant analysis and logistic regression are fundamental tools for classification problems. Quadratic discriminant analysis can exploit interaction effects of predictors, but the selection of interaction terms is non-trivial and the Gaussian assumption is often too restrictive for many real problems. Under the logistic regression framework, we propose a forward-backward method, SODA, for variable selection with both main and quadratic interaction terms: in the forward stage, a stepwise procedure screens for important predictors with both main and interaction effects, and in the backward stage SODA removes insignificant terms so as to optimize the extended BIC (EBIC) criterion. Compared with existing methods for quadratic discriminant analysis variable selection (e.g., Murphy et al., 2010; Zhang and Wang, 2011; Maugis et al., 2011), SODA can deal with high-dimensional data in which the number of predictors is much larger than the sample size, and it does not require the joint normality assumption on predictors, leading to much enhanced robustness. Theoretical analysis establishes the consistency of SODA in the high-dimensional setting. 
The empirical performance of SODA is assessed on both simulated and real data and is found to be superior to all existing methods we have tested. For all three real datasets studied, SODA selected more parsimonious models while achieving higher classification accuracy than the other tested methods. / Statistics
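The extended BIC criterion that SODA optimizes has a standard closed form: EBIC = -2·loglik + |S|·log(n) + 2γ·|S|·log(p), where |S| is the number of selected terms, n the sample size, p the number of candidate predictors, and γ a tuning constant. A minimal sketch (the log-likelihood values and γ below are illustrative placeholders, not numbers from the thesis):

```python
import math

def ebic(loglik, n_selected, n_samples, n_predictors, gamma=0.5):
    """Extended BIC: the usual BIC penalty plus an extra term that grows
    with the total number of candidate predictors, guarding against
    false selections when p >> n."""
    bic = -2.0 * loglik + n_selected * math.log(n_samples)
    extra = 2.0 * gamma * n_selected * math.log(n_predictors)
    return bic + extra

# A sparser model with a slightly worse fit can still win under EBIC:
rich = ebic(loglik=-100.0, n_selected=10, n_samples=200, n_predictors=5000)
sparse = ebic(loglik=-105.0, n_selected=3, n_samples=200, n_predictors=5000)
print(sparse < rich)  # True
```

The extra 2γ·|S|·log(p) term is what lets a forward-backward search remain conservative when the candidate pool includes all quadratic interactions, since p then grows roughly as the square of the number of main effects.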
133

Sequence features affecting translation initiation in eukaryotes: A bioinformatic approach

Yao, Xiaoquan January 2008 (has links)
Sequence features play an important role in the regulation of translation initiation. This thesis focuses on the sequence features affecting eukaryotic initiation. The characteristics of the 5' untranslated region in Saccharomyces cerevisiae were explored, and the 40 nucleotides upstream of the start codon were found to form the critical region for translation initiation in yeast. Moreover, this thesis attempted to resolve some controversies related to the start codon context. Two key nucleotides in the start codon context are the third nucleotide upstream of the start codon (the -3 site) and the nucleotide immediately following the start codon (the +4 site). Two hypotheses regarding +4G (G at the +4 site) in the Kozak consensus, the translation initiation hypothesis and the amino acid constraint hypothesis, were tested. The relationship between the -3 and +4 sites in seven eukaryotic species does not support the translation initiation hypothesis. The amino acid usage at the position after the initiator (the P1' position) was compared to that at other positions in the coding sequences of seven eukaryotic species; the result is consistent with the amino acid constraint hypothesis. In addition, this thesis explored the relationship between the +4 nucleotide and translation efficiency in yeast. The result shows that the +4 nucleotide is not important for translation efficiency, which does not support the translation initiation hypothesis. This work improves our current understanding of the eukaryotic translation initiation process.
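The positional analysis described above reduces to tallying nucleotide frequencies at fixed offsets around annotated start codons. A minimal sketch with made-up toy sequences (the thesis works from genome-wide annotations):

```python
from collections import Counter

def context_counts(sequences, start_index):
    """Count nucleotides at the -3 site (3 bases upstream of the ATG)
    and the +4 site (first base after the ATG) across mRNA sequences
    whose start codon begins at start_index."""
    minus3 = Counter(seq[start_index - 3] for seq in sequences)
    plus4 = Counter(seq[start_index + 3] for seq in sequences)
    return minus3, plus4

# Toy mRNAs, each with "ATG" beginning at index 5.
mrnas = ["GGACCATGGC", "TTAACATGAC", "CCAGGATGGT"]
m3, p4 = context_counts(mrnas, start_index=5)
print(m3, p4)  # -3 site is all A; +4 site is two G, one A
```

Comparing such counts between highly and weakly translated transcripts is one way the hypotheses above can be tested against each other.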
134

An algorithm for the stochastic simulation of gene expression and cell population dynamics

Charlebois, Daniel A January 2010 (has links)
Over the past few years, it has been increasingly recognized that stochastic mechanisms play a key role in the dynamics of biological systems. Genetic networks are one example where molecular-level fluctuations are of particular importance. Here stochasticity in the expression of gene products can result in genetically identical cells in the same environment displaying significant variation in biochemical or physical attributes. This variation can influence individual and population-level fitness. In this thesis we first explore the background required to obtain analytical solutions and perform simulations of stochastic models of gene expression. Then we develop an algorithm for the stochastic simulation of gene expression and heterogeneous cell population dynamics. The algorithm combines an exact method to simulate molecular-level fluctuations in single cells and a constant-number Monte Carlo approach to simulate the statistical characteristics of growing cell populations. This approach permits biologically realistic and computationally feasible simulations of environment- and time-dependent cell population dynamics. The algorithm is benchmarked against steady-state and time-dependent analytical solutions of gene expression models, including scenarios where cell growth, division, and DNA replication are incorporated into the modelling framework. Furthermore, using the algorithm we compare the steady-state cell size distribution of a large cell population, grown from a small initial cell population undergoing stochastic and asymmetric division, to the size distribution of a small representative sample of this population simulated to steady state. These comparisons demonstrate that the algorithm provides an accurate and efficient approach to modelling the effects of complex biological features on gene expression dynamics. The algorithm is also employed to simulate expression dynamics within 'bet-hedging' cell populations during their adaptation to environmental stress. 
These simulations indicate that the cell population dynamics algorithm provides a framework suitable for simulating and analyzing realistic models of heterogeneous population dynamics combining molecular-level stochastic reaction kinetics, relevant physiological details, and phenotypic variability and fitness.
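The exact single-cell component such an algorithm builds on is the Gillespie stochastic simulation algorithm. A minimal sketch for a constitutive birth-death gene expression model, with illustrative rate constants that are not taken from the thesis:

```python
import random

def gillespie_birth_death(k_prod=10.0, k_deg=1.0, t_end=50.0, seed=1):
    """Exact SSA for protein production (constant rate k_prod) and
    first-order degradation (rate k_deg * n); returns the trajectory
    of (time, copy number) pairs."""
    rng = random.Random(seed)
    t, n = 0.0, 0
    traj = [(t, n)]
    while t < t_end:
        a_prod, a_deg = k_prod, k_deg * n
        a_total = a_prod + a_deg
        # The waiting time to the next reaction is exponential with
        # rate equal to the total propensity.
        t += rng.expovariate(a_total)
        # Pick which reaction fires, proportionally to its propensity.
        n += 1 if rng.random() < a_prod / a_total else -1
        traj.append((t, n))
    return traj

traj = gillespie_birth_death()
# At steady state the mean copy number approaches k_prod / k_deg = 10.
tail = [n for t, n in traj if t > 25.0]
print(sum(tail) / len(tail))
```

A constant-number Monte Carlo layer, as described above, would run many such single-cell simulations in parallel and resample the ensemble at division events to keep the simulated population size fixed.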
135

Hierarchical text categorization and its application to bioinformatics

Kiritchenko, Svetlana January 2006 (has links)
In a hierarchical categorization problem, categories are partially ordered to form a hierarchy. In this dissertation, we explore two main aspects of hierarchical categorization: learning algorithms and performance evaluation. We introduce the notion of consistent hierarchical classification, which makes classification results more comprehensible and easily interpretable for end-users. Among the previously introduced hierarchical learning algorithms, only a local top-down approach produces consistent classification. The present work extends this algorithm to the general case of DAG class hierarchies and possible internal class assignments. In addition, a new global hierarchical approach aimed at performing consistent classification is proposed. This is a general framework for converting a conventional "flat" learning algorithm into a hierarchical one. An extensive set of experiments on real and synthetic data indicates that the proposed approach significantly outperforms both the corresponding "flat" method and the local top-down method. For evaluation purposes, we use a novel hierarchical evaluation measure that is superior to the existing hierarchical and non-hierarchical evaluation techniques according to a number of formal criteria. Also, this dissertation presents the first endeavor to apply hierarchical text categorization techniques to the tasks of bioinformatics. Three bioinformatics problems are addressed. The objective of the first task, indexing biomedical articles with Medical Subject Headings (MeSH), is to associate documents with biomedical concepts from the specialized vocabulary of MeSH. In the second application, we tackle the challenging problem of gene functional annotation from biomedical literature. Our experiments demonstrate a considerable advantage of hierarchical text categorization techniques over the "flat" method on these two tasks. 
In the third application, our goal is to enrich the analysis of plain experimental data with biological knowledge. In particular, we incorporate the functional information on genes directly into the clustering process of microarray data, with the outcome of improved biological relevance and value of the clustering results.
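The consistency notion described above, where a class may be predicted only if all of its ancestors are also predicted, can be sketched as a post-processing step over per-class scores. The hierarchy, scores, and threshold below are made up for illustration:

```python
def consistent_predictions(scores, parents, threshold=0.5):
    """Keep a class only when it passes the threshold and every
    ancestor up the hierarchy does too, so predictions never skip a
    level of the class hierarchy."""
    def ancestors_pass(c):
        while c in parents:
            c = parents[c]
            if scores.get(c, 0.0) < threshold:
                return False
        return True
    return {c for c, s in scores.items()
            if s >= threshold and ancestors_pass(c)}

# Toy tree: root -> animal -> dog. "dog" scores well, but its parent
# does not, so a consistent classifier must not predict it.
parents = {"animal": "root", "dog": "animal"}
scores = {"root": 0.9, "animal": 0.4, "dog": 0.8}
print(consistent_predictions(scores, parents))  # {'root'}
```

This handles a tree via a single parent map; the DAG case treated in the dissertation would track a set of parents per class and require all paths to the root to pass.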
136

Novel methods and strategies for microarray data analysis

Xiong, Huiling January 2008 (has links)
Microarray technology has been used as a routine high-throughput tool in biological research to characterize gene expression, and overwhelming volumes of data are generated in every microarray experiment as a consequence. However, there are many kinds of non-biological variation and systematic bias in microarray data which can confound the extraction of the true signals of gene expression. Thus, comprehensive bioinformatic and statistical analyses are crucially required, typically including normalization, regulated gene identification, clustering, and meta-analysis. The main purpose of my study is to develop robust analytical methods and programs for spotted cDNA-type microarray data. First, I established a novel normalization method based on the Generalized Procrustes Analysis (GPA) algorithm. I compared the GPA-based method with six other popular normalization methods, including Global, Lowess, Scale, Quantile, Variance Stabilization Normalization, and one boutique array-specific housekeeping gene method, using several different empirical criteria, and demonstrated that the GPA-based method was consistently better in reducing across-slide variability and removing systematic bias. In particular, being free from the biological assumption that most genes (95%) are not differentially expressed on the array, the GPA method is more robust and appropriate for diverse types of array sets, including boutique arrays where the majority of genes may be differentially expressed. Second, I utilized statistical analysis to assess the quality of a novel goldfish brain cDNA microarray, which provides statistical validation of the microarray data results. Third, I developed a new program suite as a user-friendly analytical pipeline integrating the most popular analytical methods for microarray data analysis. 
Finally, I proposed a novel analytical strategy to extract season-related gene expression information from multiple microarray data sets by using comprehensive data transformation and normalization analysis, differential gene identification, and multivariate analysis.
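Of the normalization methods compared against the GPA approach, quantile normalization is the easiest to sketch: force every array to share the same empirical distribution by averaging sorted values rank by rank. This is a sketch of that baseline only; the GPA method itself is more involved and is not reproduced here.

```python
def quantile_normalize(arrays):
    """Replace each value with the mean of the values holding the same
    rank across all arrays, so every array ends up with an identical
    distribution. Assumes equal lengths and no ties, for simplicity."""
    n = len(arrays[0])
    sorted_cols = [sorted(a) for a in arrays]
    # Mean of the i-th smallest value across arrays.
    rank_means = [sum(col[i] for col in sorted_cols) / len(arrays)
                  for i in range(n)]
    out = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        norm = [0.0] * n
        for rank, idx in enumerate(order):
            norm[idx] = rank_means[rank]
        out.append(norm)
    return out

# Two toy "slides": after normalization each holds the same values,
# redistributed according to its own original ranking.
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

Note the implicit assumption this baseline makes, and which the abstract says GPA avoids: forcing identical distributions is only sensible when most genes are not differentially expressed, which fails for boutique arrays.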
137

On Gene Duplication

Warren, Robert B January 2010 (has links)
Due to the sheer size and complexity of genomes, it is essential to develop automated methods to analyze them. To compare genomes, one proposed distance measure is the minimum number of evolutionary changes needed to transform one genome into another. In recent years, great progress has been made in this area, with efficient exact algorithms that can transform one genome into another by applying a wide range of evolutionary operations. However, gene duplications, a common occurrence and arguably the most important evolutionary operation, have proven to be one of the most difficult evolutionary operations to integrate. We examine the most successful gene duplication algorithms: a family of algorithms that we call the rearrangement-duplication algorithms. Rather than compare two genomes, these algorithms attempt to efficiently remove the duplicates from a genome using the fewest number of duplications and other evolutionary operations. In this thesis we give a complete survey of all the genome halving algorithms, a highly successful group of rearrangement-duplication algorithms that efficiently and exactly handle whole genome doubling (tetraploidization). We also introduce the genome aliquoting algorithms, a new variation on the genome halving problem that attempts to handle whole genome duplications of arbitrary scale. As this is a new and challenging problem, there are currently no efficient exact algorithms; however, early results include two approximation algorithms.
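The simplest of the genome distance measures alluded to above is the breakpoint distance: the number of gene adjacencies present in one genome but broken in the other. A sketch for unichromosomal, unsigned genomes without duplicates; the thesis treats the far harder signed, multichromosomal, and duplicated cases:

```python
def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that do not survive (in either
    orientation) in g2; both genomes are permutations of the same
    gene set, given as lists."""
    adj2 = set()
    for a, b in zip(g2, g2[1:]):
        adj2.add((a, b))
        adj2.add((b, a))
    return sum((a, b) not in adj2 for a, b in zip(g1, g1[1:]))

print(breakpoint_distance([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 0
# Swapping genes 2 and 3 breaks the 1-2 and 3-4 adjacencies:
print(breakpoint_distance([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # 2
```

Duplicated genes break the permutation assumption this sketch relies on, which is one concrete way to see why duplications are so hard to integrate into rearrangement distances.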
138

Genome Rearrangements: Structural Inference and Functional Consequences

Munoz, Adriana January 2010 (has links)
As genomes evolve over hundreds of millions of years, the chromosomes become rearranged, with segments of some chromosomes inverted, while other chromosomes reciprocally exchange chunks from their ends. These rearrangements lead to the scrambling of the elements of one genome with respect to another descended from a common ancestor. Multidisciplinary work undertakes to mathematically model these processes and to develop statistical analyses and mathematical algorithms to understand the scrambling in the chromosomes of two or more related genomes. A major focus is the reconstruction of the gene order of the ancestral genomes. There has been a trend toward increasing the phylogenetic scope of genome sequencing without finishing the sequence for each genome. With less interest in completing the sequence, an increasing number of genomes are being published in scaffold or even contig form. Rearrangement algorithms, including gene order-based phylogenetic tools, require whole genome data on gene order or syntenic block order. Then, for gene order-based comparisons or phylogeny, how can we use rearrangement algorithms to handle genomes available in contig or scaffold form only? For contig data, we develop a model for the behaviour of the genomic distance as a function of evolutionary time, and discuss how to invert this function in order to infer elapsed time. We show how to correct for the effect of chromosomal fragmentation in sets of contigs. We apply our methods to data originating mostly in the 12-genome Drosophila project [15]. We compare ten Drosophila genomes with two other dipteran genomes and two outgroup insect genomes. 
For scaffolds, our method involves optimally filling in genes missing in the scaffolds, and using the augmented scaffolds directly in the rearrangement algorithms as if they were chromosomes, while making a number of corrections; e.g., we correct for the number of extra fusion/fission operations required to make scaffolds comparable to full assemblies. We model the relationship between scaffold density and genomic distance, and estimate the parameters of this model while comparing the angiosperm genomes Ricinus communis and Vitis vinifera. A separate question arises of what the biological consequences of breakpoint creation are, rather than just their structural aspects. The question I will ask is whether proximity to the site of a breakpoint event changes the activity of a gene. I propose to investigate this by comparing the distribution of distances to the nearest breakpoint for genes whose expression changes in human relative to other primate species (e.g., macaque or chimpanzee) with the corresponding distribution for genes whose expression does not change. Keywords: chromosome rearrangement, comparative genomics, phylogenomics, phylogenetic tree, inversion, reciprocal translocation, transposition, DCJ, breakpoint, gene expression.
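Why must the distance-versus-time function be inverted rather than read off linearly? A small simulation makes the point: apply random inversions to a genome and watch the breakpoint distance to the original grow quickly at first, then saturate as later inversions increasingly re-break already-broken adjacencies. This toy is not the thesis's model, just an illustration of the saturation effect.

```python
import random

def breakpoints(g, ref):
    # Unsigned breakpoint count of g relative to ref.
    adj = set()
    for a, b in zip(ref, ref[1:]):
        adj.add((a, b))
        adj.add((b, a))
    return sum((a, b) not in adj for a, b in zip(g, g[1:]))

def random_inversion(g, rng):
    # Reverse a random contiguous segment of the gene order.
    i, j = sorted(rng.sample(range(len(g)), 2))
    return g[:i] + list(reversed(g[i:j + 1])) + g[j + 1:]

rng = random.Random(0)
ref = list(range(100))
g = ref[:]
dists = []
for _ in range(200):
    g = random_inversion(g, rng)
    dists.append(breakpoints(g, ref))
# Early on, each inversion adds up to 2 breakpoints; later the curve
# flattens near the maximum, so observed distance undercounts elapsed
# time and the time must be inferred by inverting the expected curve.
print(dists[9], dists[199])
```

The chromosomal-fragmentation correction mentioned above addresses a second bias on top of this one: contigs hide adjacencies at their ends, deflating observed distances further.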
139

Applying Evolutionary Computation and Ensemble Approaches to Protein Contact Map and Protein Function Determination

Chapman, Samuel D. 13 January 2017 (has links)
Proteins are important biological molecules that perform many different functions in an organism. They are composed of sequences of amino acids that play a large part in determining both their structure and function. In turn, the structures of proteins are related to their functions. Using computational methods for protein study is a popular approach, offering the possibility of being faster and cheaper than experimental methods. These software-based methods are able to take information such as the protein sequence and other empirical data and output predictions such as protein structure or function.
In this work, we have developed a set of computational methods that are applied to protein structure prediction and protein function prediction. For protein structure prediction, we use the evolution of logic circuits to produce logic circuit classifiers that predict the protein contact map of a protein based on high-dimensional feature data. The diversity of the evolved logic circuits allows for the creation of ensembles of classifiers, and the answers from these ensembles are combined to produce more accurate answers. We also apply a number of ensemble algorithms to our results.
Our protein function prediction work is based on six existing computational protein function prediction methods, of which four were optimized for use on a benchmark dataset, along with two others developed by collaborators. We used a similar ensemble framework, combining the answers from the six methods into an ensemble using an algorithm, CONS, that we helped develop.
Our contact map prediction study demonstrated that it was possible to evolve logic circuits for this purpose, and that ensembles of the classifiers improved performance. The results fell short of state-of-the-art methods, and additional ensemble algorithms failed to improve the performance. However, the method was also able to work as a feature detector, discovering salient features from the high-dimensional input data, an otherwise computationally intractable problem. In our protein function prediction work, the combination of methods similarly led to a robust ensemble. The CONS ensemble, while not performing as well as the best individual classifier in absolute terms, was nevertheless very close in performance. More intriguingly, there were many specific cases where it performed better than any single method, indicating that the ensemble provided valuable information not captured by any individual method.
To our knowledge, this is the first time the evolution of logic circuits has been used for any bioinformatics problem, and it is expected that as the method matures, results will improve. It is also expected that the feature-detection aspect of this method can be used in other studies. The function prediction study also marks, to our knowledge, the most comprehensive ensemble classification effort for protein function prediction. Finally, we expect that the ensemble classification methods used and developed in our protein structure and function work will pave the way towards stronger ensemble predictors in the future.
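The ensemble combination running through this work can be illustrated by the simplest such scheme, majority voting over independent classifiers. CONS is a consensus algorithm whose actual details are not reproduced here; the toy predictions below are invented:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists by taking, for each instance,
    the most frequent predicted label across the ensemble."""
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined

# Three toy classifiers predicting contact (1) / no contact (0) for
# four residue pairs; each individual classifier errs somewhere, but
# the vote recovers the pattern agreed on by the majority.
preds = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(majority_vote(preds))  # [1, 0, 1, 1]
```

This illustrates the finding above: a combined answer can beat any single member wherever the members' errors are uncorrelated, even when no member is individually best.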
140

Comparative genomics of the Mycobacterium tuberculosis complex

Mostowy, Serge. January 2005 (has links)
No description available.