1 |
The hetnet awakens: understanding complex diseases through data integration and open science
Himmelstein, Daniel S. 07 July 2016
<p> Human disease is complex. However, the explosion of biomedical data is providing new opportunities to improve our understanding. My dissertation focused on how to harness the biodata revolution. Broadly, I addressed three questions: how to integrate data, how to extract insights from data, and how to make science more open. </p><p> To integrate data, we pioneered the hetnet, a network with multiple node and relationship types. After several preludes, we released Hetionet v1.0, which contains 2,250,197 relationships of 24 types. Hetionet encodes the collective knowledge produced by millions of studies over the last half century. </p><p> To extract insights from data, we developed a machine learning approach for hetnets. To predict the probability that an unknown relationship exists, our algorithm identifies influential network patterns. We used the approach to prioritize disease-gene associations and drug repurposing opportunities. By evaluating our predictions on withheld knowledge, we demonstrated the systematic success of our method. </p><p> After encountering friction that interfered with data integration and rapid communication, I began looking at how to make science more open. The quest led me to explore real-time open notebook science and to expose publishing delays at journals as well as the problematic licensing of publicly funded research data.</p>
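<p> As a rough illustration of the hetnet idea described above, the following Python sketch stores typed nodes and typed relationships and counts the walks between a compound and a disease that follow a given sequence of relationship types. The node names, relationship types, and the count_paths helper are hypothetical placeholders; they are not taken from Hetionet or from the dissertation's algorithm, and a walk count is only one simple example of a network pattern that could serve as a predictive feature. </p>

```python
from collections import defaultdict

# Toy hetnet: every node has a type and every edge has a relationship type.
# All names below are illustrative placeholders, not Hetionet content.
node_type = {
    "compound_A": "Compound",
    "gene_1": "Gene",
    "gene_2": "Gene",
    "disease_X": "Disease",
}

# Edges are treated as directed (source, relationship, target) in this sketch.
edges = [
    ("compound_A", "binds", "gene_1"),
    ("compound_A", "binds", "gene_2"),
    ("gene_1", "associates", "disease_X"),
    ("gene_2", "associates", "disease_X"),
    ("gene_1", "interacts", "gene_2"),
]

# Index edges by (source, relationship type) for fast traversal.
adjacency = defaultdict(list)
for source, rel, target in edges:
    adjacency[(source, rel)].append(target)


def count_paths(start, end, rel_sequence):
    """Count walks from start to end whose edges follow rel_sequence in order."""
    frontier = {start: 1}  # node -> number of partial walks reaching it
    for rel in rel_sequence:
        next_frontier = defaultdict(int)
        for node, n_walks in frontier.items():
            for neighbor in adjacency.get((node, rel), []):
                next_frontier[neighbor] += n_walks
        frontier = next_frontier
    return frontier.get(end, 0)


two_step = count_paths("compound_A", "disease_X", ["binds", "associates"])
three_step = count_paths(
    "compound_A", "disease_X", ["binds", "interacts", "associates"]
)
```

<p> Each distinct relationship-type sequence yields one count per compound and disease pair, so a table of such counts could be handed to any standard classifier. </p>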
|
2 |
Positive-Unlabeled Learning in the Context of Protein Function Prediction
Youngs, Noah 19 December 2014
<p> With the recent proliferation of large, unlabeled data sets, a particular subclass of semisupervised learning problems has become more prevalent. Known as positive-unlabeled learning (PU learning), this scenario provides only positive labeled examples, usually just a small fraction of the entire dataset, with the remaining examples unknown and thus potentially belonging to either the positive or negative class. Since the vast majority of traditional machine learning classifiers require both positive and negative examples in the training set, a new class of algorithms has been developed to deal with PU learning problems.</p><p> A canonical example of this scenario is topic labeling of a large corpus of documents. Once the size of a corpus reaches into the thousands, it becomes largely infeasible to have a curator read even a sizable fraction of the documents, and annotate them with topics. In addition, the entire set of topics may not be known, or may change over time, making it impossible for a curator to annotate which documents are NOT about certain topics. Thus a machine learning algorithm needs to be able to learn from a small set of positive examples, without knowledge of the negative class, and knowing that the unlabeled training examples may contain an arbitrary number of additional but as yet unknown positive examples. </p><p> Another example of a PU learning scenario recently garnering attention is the protein function prediction problem (PFP problem). While the number of organisms with fully sequenced genomes continues to grow, the progress of annotating those sequences with the biological functions that they perform lags far behind. 
Machine learning methods have already been successfully applied to this problem, but with many organisms having a small number of positive annotated training examples, and the lack of availability of almost any labeled negative examples, PU learning algorithms have the potential to make large gains in predictive performance.</p><p> The first part of this dissertation motivates the protein function prediction problem, explores previous work, and introduces novel methods that improve upon previously reported benchmarks for a particular type of learning algorithm, known as Gaussian Random Field Label Propagation (GRFLP). In addition, we present improvements to the computational efficiency of the GRFLP algorithm, and a modification to the traditional structure of the PFP learning problem that allows for simultaneous prediction across multiple species.</p><p> The second part of the dissertation focuses specifically on the positive-unlabeled aspects of the PFP problem. Two novel algorithms are presented, and rigorously compared to existing PU learning techniques in the context of protein function prediction. Additionally, we take a step back and examine some of the theoretical considerations of the PU scenario in general, and provide an additional novel algorithm applicable in any PU context. This algorithm is tailored for situations in which the labeled positive examples are a small fraction of the set of true positive examples, and where the labeling process may be subject to some type of bias rather than being a random selection of true positives (arguably some of the most difficult PU learning scenarios).</p><p> The third and fourth sections return to the PFP problem, examining the power of tertiary structure as a predictor of protein function, as well as presenting two case studies of function prediction performance on novel benchmarks. 
Lastly, we conclude with several promising avenues of future research into both PU learning in general, and the protein function prediction problem specifically. </p>
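<p> To make the label propagation idea mentioned above more concrete, here is a minimal Python sketch of propagation on a similarity graph: labeled nodes are held fixed and every other node is repeatedly set to the weighted average of its neighbors, which converges to the harmonic solution of a Gaussian random field. This is a generic textbook version, not the dissertation's GRFLP variant. It also assumes, purely for illustration, that one negative example is available, which is exactly what the positive-unlabeled setting lacks. The function name and the toy graph are hypothetical. </p>

```python
def propagate_labels(weights, labels, n_iter=200):
    """Hold labeled nodes fixed and average neighbor scores for the rest.

    weights: symmetric matrix (list of lists) of non-negative edge weights.
    labels: dict mapping node index -> known score (1.0 positive, 0.0 negative).
    Returns a list with one score per node.
    """
    n = len(weights)
    scores = [labels.get(i, 0.5) for i in range(n)]  # 0.5 = uninformed start
    for _ in range(n_iter):
        updated = list(scores)
        for i in range(n):
            if i in labels:
                continue  # labeled nodes stay clamped to their known score
            total = sum(weights[i])
            if total > 0:
                weighted = sum(w * s for w, s in zip(weights[i], scores))
                updated[i] = weighted / total
        scores = updated
    return scores


# Chain of five proteins, 0 - 1 - 2 - 3 - 4, with unit similarity between
# neighbors. Node 0 is a known positive; node 4 is an assumed negative.
chain = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
scores = propagate_labels(chain, {0: 1.0, 4: 0.0})
# On this chain the scores approach a linear ramp: 1.0, 0.75, 0.5, 0.25, 0.0
```

<p> With positives as the only clamped nodes, every score on a connected graph would drift to 1, which is one way to see why the positive-unlabeled setting needs dedicated treatment. </p>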
|
3 |
Gene set enrichment and projection: A computational tool for knowledge discovery in transcriptomes
Stamm, Karl D. 18 August 2016
<p> Explaining the mechanism behind a genetic disease involves two phases: collecting and analyzing data associated with the disease, then interpreting those data in the context of biological systems. The objective of this dissertation was to develop a method of integrating complementary datasets surrounding any single biological process, with the goal of presenting the response to a signal in terms of a set of downstream biological effects. This dissertation specifically tests the hypothesis that computational projection methods overlaid with domain expertise can direct research towards relevant systems-level signals underlying complex genetic disease. To this end, I developed a software algorithm named Geneset Enrichment and Projection Displays (GSEPD) that visualizes multidimensional gene expression to identify the biologically relevant gene sets that are altered in response to a biological process. </p><p> This dissertation highlights a problem of data interpretation facing the medical research community, and shows how computational sciences can help. By bringing annotation and expression datasets together, a new analytical and software method was produced that helps unravel complicated experimental and biological data. </p><p> The dissertation presents four coauthored studies in which experts in their fields wanted to annotate the functional significance of a gene-centric experiment. GSEPD shows inherently high-dimensional data as a simple colored graph, and a subspace vector projection directly calculates how closely each sample behaves like the test conditions. The end-user medical researcher understands their data as a series of somewhat-independent subsystems, and GSEPD provides a dimensionality reduction for high-throughput experiments of limited sample size. Gene Ontology analyses are accessible on a sample-to-sample level, and this work highlights not just the expected biological systems, but many annotated results available in vast online databases.</p>
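<p> The abstract does not spell out how the vector projection is computed. The Python sketch below shows one plausible reading, under the assumption that each sample is projected onto the line joining the mean expression profiles of two conditions within a single gene set. The function name, the condition labels, and the expression values are all hypothetical. </p>

```python
def projection_score(sample, centroid_a, centroid_b):
    """Scalar projection of a sample onto the axis from centroid_a to centroid_b.

    Returns 0.0 at centroid_a and 1.0 at centroid_b; values in between mean
    the sample sits partway along the axis for this gene set.
    """
    axis = [b - a for a, b in zip(centroid_a, centroid_b)]
    offset = [s - a for s, a in zip(sample, centroid_a)]
    norm_sq = sum(x * x for x in axis)
    if norm_sq == 0:
        raise ValueError("centroids coincide; the axis is undefined")
    return sum(o * x for o, x in zip(offset, axis)) / norm_sq


# Expression of a hypothetical three-gene set, averaged per condition.
control = [1.0, 2.0, 3.0]
treated = [3.0, 2.0, 7.0]
sample = [2.0, 5.0, 5.0]
score = projection_score(sample, control, treated)  # 0.5: halfway along the axis
```

<p> Reducing each gene set to one number per sample is what would allow a high-dimensional experiment to be drawn as a simple colored display. </p>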
|
4 |
Enhancing Space and Time Efficiency of Genomics in Practice through Sophisticated Applications of the FM-Index
Muggli, Martin D. 22 January 2019
<p> Genomic sequence data has become so easy to obtain that the computation needed to process it has become a bottleneck in the advancement of biological science. A data structure known as the FM-Index both compresses data and allows efficient querying, and thus can be used to implement more efficient processing methods. In this work we apply advanced formulations of the FM-Index to existing problems and show that our methods exceed the performance of competing tools. </p>
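<p> For readers unfamiliar with the FM-Index, the following Python sketch shows the basic query it supports: counting the occurrences of a pattern by backward search over the Burrows-Wheeler transform. It is a deliberately naive version that keeps full occurrence tables and sorts suffixes directly, so it illustrates only the querying side and none of the compression; it is not one of the advanced formulations used in the dissertation. The function names are hypothetical. </p>

```python
def build_fm_index(text):
    """Build a naive FM-index (C table and occurrence counts) for text."""
    text += "$"  # unique terminator, assumed smaller than every other symbol
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the symbol preceding each sorted suffix (wraps around for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of symbols in text strictly smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, symbol in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == symbol else 0)
    return first, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of pattern using backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("ACGTACGTAC")
hits = count_occurrences(index, "AC")  # "AC" starts at positions 0, 4 and 8
```

<p> The search takes one step per pattern symbol regardless of the text length, which is the property that makes the structure attractive for large genomic inputs. </p>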
|
5 |
Algorithms for Reconstruction of Gene Regulatory Networks from High-Throughput Gene Expression Data
Deng, Wenping 15 February 2019
<p> Understanding gene interactions in complex living systems is one of the central tasks in systems biology. With the availability of microarray and RNA-Seq technologies, a multitude of gene expression datasets has been generated for novel biological knowledge discovery through statistical analysis and reconstruction of gene regulatory networks (GRNs). Reconstruction of GRNs can reveal the interrelationships among genes and identify the hierarchies of genes and hubs in networks. The new algorithms I developed in this dissertation focus specifically on reconstructing GRNs with increased accuracy from microarray and RNA-Seq high-throughput gene expression data sets. </p><p> The first algorithm (Chapter 2) focuses on modeling the transcriptional regulatory relationships between transcription factors (TFs) and pathway genes. Multiple linear regression and its regularized versions, such as ridge regression and LASSO, are common tools for modeling the relationship between predictor variables and a dependent variable. To deal with outliers in gene expression data, to capture the group effect of TFs in regulation, and to improve statistical efficiency, the Huber function is proposed as the loss function and the Berhu function as the penalty function for modeling the relationships between a pathway gene and many or all TFs. A proximal gradient descent algorithm was developed to solve the corresponding optimization problem. This algorithm is much faster than the general convex optimization solver CVX. This Huber-Berhu regression was then embedded into a partial least squares (PLS) framework to deal with the high dimensionality and multicollinearity of gene expression data. The results showed that this method can identify the true regulatory TFs for each pathway gene with high efficiency. </p><p> The second algorithm (Chapter 3) focuses on building multilayered hierarchical gene regulatory networks (ML-hGRNs). 
A backward elimination random forest (BWERF) algorithm was developed for constructing an ML-hGRN operating above a biological pathway or a biological process. The algorithm first divides construction of the ML-hGRN into multiple regression tasks, each involving a regression between a pathway gene and all TFs. Random forest models with backward elimination were used to determine the importance of each TF to a pathway gene. The importance of a TF to the whole pathway was then computed by aggregating the importance values of the TF to the individual pathway genes. Next, an expectation maximization algorithm was used to cut the TFs to form the first layer of direct regulatory relationships. The upper layers of the GRN were constructed in the same way, replacing the pathway genes with the newly cut TFs. Both simulated and real gene expression data were used to test the algorithm and demonstrated the accuracy and efficiency of the method. </p><p> The third algorithm (Chapter 4) focuses on Joint Reconstruction of Multiple Gene Regulatory Networks (JRmGRN) using gene expression data from multiple tissues or conditions. The formulation assumes hub genes shared across different tissues or conditions. Under the framework of the Gaussian graphical model, the JRmGRN method constructs the GRNs by maximizing a penalized log-likelihood function. The problem was formulated as a convex optimization problem and solved with an alternating direction method of multipliers (ADMM) algorithm. Both simulated and real gene expression data showed that JRmGRN performed better than existing methods.</p>
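<p> The Huber loss and Berhu penalty named in the first algorithm can be written down compactly. The Python sketch below uses one common parameterization with a positive threshold m; the abstract does not state which scaling the dissertation uses, so the exact constants, the function names, and the way the two terms are combined in the objective are assumptions made for illustration. The proximal gradient solver itself is not shown. </p>

```python
def huber(z, m):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    if abs(z) <= m:
        return z * z
    return 2 * m * abs(z) - m * m


def berhu(z, m):
    """Berhu (reverse Huber) penalty: linear near zero, quadratic beyond m."""
    if abs(z) <= m:
        return abs(z)
    return (z * z + m * m) / (2 * m)


def objective(residuals, coefficients, m_loss, m_penalty, lam):
    """Huber loss on the residuals plus a Berhu penalty on the coefficients.

    lam is the regularization weight; all thresholds are assumed positive.
    """
    loss = sum(huber(r, m_loss) for r in residuals)
    penalty = sum(berhu(b, m_penalty) for b in coefficients)
    return loss + lam * penalty


# Both functions are continuous where they switch between their two pieces.
small = (huber(0.5, 1.0), berhu(0.5, 1.0))  # (0.25, 0.5)
large = (huber(3.0, 1.0), berhu(3.0, 1.0))  # (5.0, 5.0)
```

<p> The loss grows only linearly for large residuals, which limits the influence of outliers, while the penalty behaves like LASSO for small coefficients and like ridge regression for large ones. </p>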
|
6 |
Application of Graph Theoretic Clustering on Some Biomedical Data Sets
Ahlert, Darla 11 June 2015
<p> Clustering algorithms have become a popular way to analyze biomedical data sets and, in particular, gene expression data. Since these data sets are often large, it is difficult to gather useful information from them as a whole. Clustering is a proven method to extract knowledge about the data that can eventually lead to many discoveries in the biological world. Hierarchical clustering is used frequently to interpret gene expression data, but recently, graph-theoretic clustering algorithms have started to gain some traction for analysis of this type of data. We consider five graph-theoretic clustering algorithms run over a post-mortem gene expression dataset, as well as a few different biomedical data sets, in which the ground truth, or class label, is known for each data point. We then externally evaluate the algorithms based on the accuracy of the resulting clusters against the ground truth clusters. Comparing the results of each of the algorithms run over all of the datasets, we found that the algorithms are efficient on the real biomedical datasets but have particular difficulty with gene expression data.</p>
|
7 |
A Motif Discovery and Analysis Pipeline for Heterogeneous Next-Generation Sequencing Data
Ramsay, Trevor 10 October 2015
<p> Bioinformatics has made great strides in understanding the regulation of gene expression, but many of the tools developed for this purpose depend on data from a limited number of species. Despite the unique genetic attributes of undomesticated trees, there remains a dearth of research into them. The poplar tree, <i>Populus trichocarpa</i>, has undergone multiple rounds of genome duplication during its evolution. In addition, its life cycle differs from that of the annual crop and model plants previously studied, leading to significant technical challenges in understanding the unique biology of these trees. For example, the process of secondary growth occurs as the tree stems thicken, and creates secondary xylem (wood) and phloem (inner bark) for the transport of water and products of photosynthesis, respectively. Because of this, the research group I work with studies the secondary growth of <i>P. trichocarpa</i> (Spicer, 2010) (Groover, et al., 2010) (Groover, et al., 2006) (Groover, 2005).</p><p> The genomic tools to investigate gene regulation in <i>P. trichocarpa</i> are readily available. Next-generation sequencing technologies such as RNA-Seq and ChIP-Seq can be used to understand gene expression and binding of transcription factors to specific locations in the genome. Similarly, a variety of specialized bioinformatic tools such as EdgeR, Cufflinks, and MACS can be used to analyze gene binding and expression from sequencing data provided by ChIP-Seq and RNA-Seq (Blahnik, et al., 2010) (Mortazavi, et al., 2008) (Robinson, 2010) (Robinson, 2007) (Robinson, et al., 2008) (McCarthy, 2012) (Trapnell, 2013) (Zhang, 2008). The binding and expression data these tools provide form a foundation for analyzing the regulation of gene expression in <i>P. trichocarpa</i>.</p><p> The goal of my project is to provide a motif discovery and analysis pipeline for analyses of <i>Populus</i> species. 
The motif discovery and analysis pipeline uses heterogeneous data collected from poplar and aspen mutants to elucidate the gene regulatory mechanisms involved in secondary growth. The experiments target transcription factors related to secondary growth, and through analysis of the variety of transcription factor binding experiments, I have identified the motifs involved in gene regulation of secondary growth within <i>P. trichocarpa</i> (Filkov, et al., 2008).</p>
|
8 |
Remote Homology Detection in Proteins Using Graphical Models
Daniels, Noah Manus 24 July 2013
<p> Given the amino acid sequence of a protein, researchers often infer its structure and function by finding homologous, or evolutionarily-related, proteins of known structure and function. Since structure is typically more conserved than sequence over long evolutionary distances, recognizing remote protein homologs from their sequence poses a challenge. </p><p> We first consider all proteins of known three-dimensional structure, and explore how they cluster according to different levels of homology. An automatic computational method reasonably approximates a human-curated hierarchical organization of proteins according to their degree of homology. </p><p> Next, we return to homology prediction, based only on the one-dimensional amino acid sequence of a protein. Menke, Berger, and Cowen proposed a Markov random field model to predict remote homology for beta-structural proteins, but their formulation was computationally intractable on many beta-strand topologies. </p><p> We show two different approaches to approximate this random field, both of which make it computationally tractable, for the first time, on all protein folds. One method simplifies the random field itself, while the other retains the full random field, but approximates the solution through stochastic search. Both methods achieve improvements over the state of the art in remote homology detection for beta-structural protein folds.</p>
|
9 |
A web semantic for SBML merge
Thavappiragasam, Mathialakan 05 November 2014
<p> The manipulation of XML-based relational representations of biological systems (BioML, for Bioscience Markup Language) is a big challenge in systems biology. The needs of biologists, such as the translational study of biological systems, make this challenge greater given the material produced by next-generation sequencing. Among these BioMLs, SBML is the de facto standard file format for the storage and exchange of quantitative computational models in systems biology, supported by more than 257 software packages to date. The SBML standard is used by several biological systems modeling tools and several databases for representation and knowledge sharing. Several subsystems are integrated in order to construct a complex biosystem. The issue of combining biological subsystems by merging SBML files has been addressed in several algorithms and tools, but it remains impossible to build an automatic merge system that implements reusability, flexibility, scalability and sharability. Existing algorithms rely on name-based component comparisons. This technique does not allow integration into a Workflow Management System (WMS) to build pipelines, and it does not include the mapping of quantitative data needed for a good analysis of the biological system. In this work, we present a deterministic merging algorithm that is consumable in a given WMS engine and is designed using a novel biological model similarity algorithm. This model merging system integrates four submodules: SBMLChecker, SBMLAnot, SBMLCompare, and SBMLMerge, for model quality checking, annotation, comparison, and merging, respectively. The tools are integrated into the BioExtract server, leveraging iPlant collaborative resources to support users by allowing them to process large models and design workflows. These tools are also embedded into a user-friendly online version, SW4SBMLm.</p>
|
10 |
Taxonomic assignment of gene sequences using hidden Markov models
Huang, Huanhua 16 October 2014
<p> Our ability to study communities of microorganisms has been vastly improved by the development of high-throughput DNA sequencing technologies. These technologies, however, can only sequence short fragments of organisms' genomes at a time, which introduces many challenges in translating sequencing results into biological insight. The field of bioinformatics has arisen in part to address these problems. </p><p> One bioinformatics problem is assigning a genetic sequence to a source organism. It is now common to use high-throughput, short-read sequencing technologies, such as the Illumina MiSeq, to sequence the 16S rRNA gene from a community of microorganisms. Researchers use this information to generate a profile of the different microbial organisms (i.e., the taxonomic composition) present in an environmental sample. There are a number of approaches for assigning taxonomy to genetic sequences, but all suffer from problems with accuracy. The methods that have been most widely used are pairwise alignment methods, like BLAST, UCLUST, and RTAX, and probability-based methods, such as RDP and MOTHUR. These methods can classify microbial sequences with high accuracy when sequences are long (e.g., a thousand bases); however, accuracy decreases as sequences become shorter. Current high-throughput sequencing technologies generate sequences between about 150 and 500 bases in length. </p><p> In my thesis I have developed new software for assigning taxonomy to short DNA sequences using profile Hidden Markov Models (HMMs). HMMs have been applied in related areas, such as assigning biological functions to protein sequences, and I hypothesize that they might be useful for achieving high-accuracy taxonomic assignments from 16S rRNA gene sequences. My method builds models of 16S rRNA sequences for different taxonomic groups (kingdom, phylum, class, order, family, genus, and species) using the Greengenes 16S rRNA database. 
Given a sequence with unknown taxonomic origin, my method searches each kingdom model to determine the most likely kingdom. It then searches all of the phyla within the highest scoring kingdom to determine the most likely phylum. This iterative process continues until the sequence cannot be assigned at a taxonomic level with a user-defined confidence level, or until a species-level assignment is made that meets the user-defined confidence level. </p><p> I next evaluated this method on both artificial and real microbial community data, with both qualitative and quantitative metrics of method performance. The evaluation results showed that in the qualitative analyses (specificity and sensitivity) my method is not as good as the previously existing methods. However, the accuracy in the quantitative analysis was better than some other pre-existing methods. This suggests that my current implementation is sensitive to false positives, but is better at classifying more sequences than the other methods. </p><p> I present my method, my evaluations, and suggestions for next steps that might improve the performance of my HMM-based taxonomic classifier.</p>
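<p> The top-down search described above can be sketched in a few lines of Python. The sketch assumes that the taxonomy is available as a nested dictionary and that some scoring function returns a confidence between 0 and 1 for a sequence against a taxon's model. Both are stand-ins: the assign_taxonomy and toy_score functions are hypothetical, the toy scores are invented, and no actual profile HMM is evaluated here. </p>

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def assign_taxonomy(sequence, taxonomy, score, min_confidence):
    """Walk down the taxonomy, keeping the best-scoring child at each rank.

    taxonomy: nested dict mapping taxon name -> dict of child taxa.
    score: callable (sequence, taxon) -> confidence in [0, 1]; a stand-in
        for scoring the sequence against that taxon's profile HMM.
    Returns the list of (rank, taxon) pairs that met min_confidence.
    """
    lineage = []
    candidates = taxonomy
    for rank in RANKS:
        if not candidates:
            break  # no models exist below the current taxon
        best = max(candidates, key=lambda taxon: score(sequence, taxon))
        if score(sequence, best) < min_confidence:
            break  # stop at the last rank assigned with enough confidence
        lineage.append((rank, best))
        candidates = candidates[best]
    return lineage


# Toy taxonomy and invented confidences, for illustration only.
toy_taxonomy = {
    "Bacteria": {"Firmicutes": {"Bacilli": {}}, "Proteobacteria": {}},
    "Archaea": {},
}
toy_confidence = {
    "Bacteria": 0.99,
    "Archaea": 0.10,
    "Firmicutes": 0.90,
    "Proteobacteria": 0.30,
    "Bacilli": 0.60,
}


def toy_score(sequence, taxon):
    # Placeholder lookup; a real scorer would evaluate a profile HMM.
    return toy_confidence[taxon]


lineage = assign_taxonomy("ACGT", toy_taxonomy, toy_score, min_confidence=0.8)
```

<p> In this toy run the assignment stops at the phylum level, because the best class-level score falls below the chosen threshold. Only the models under the current best taxon are searched at each rank, so an early wrong turn cannot be corrected later. </p>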
|