1 |
Significant pattern discovery in gene location and phylogeny. Riley, Michael, January 2009.
This thesis documents an investigation into the acquisition of knowledge from biological data, using computational methods for the discovery of significantly frequent patterns in gene location and phylogeny. Beginning with an initial statistical analysis of the distribution of gene locations in the flowering plant Arabidopsis thaliana, we discover unexplained elements of order. The second area of this research looks into frequent patterns in the one-dimensional linear structure of the physical locations of genes on the genome of Saccharomyces cerevisiae, an area of epigenetics which has hitherto attracted little attention. The frequent patterns are patterns of structure represented in Datalog, suitable for analysis using the logic programming language Prolog. This is used to find patterns in gene location with respect to various gene attributes, such as molecular function and the distance between genes. Here we find significant frequent patterns in neighbouring pairs of genes. We also discover very significant patterns in the molecular function of genes separated by distances of between 5,000 and 20,000 base pairs. In complete contrast to the latter result, however, we find that the distribution of genes by molecular function within a local region of ±20,000 base pairs is locationally independent. In the second part of this research we look for significantly frequent patterns of phylogenetic subtrees in a broad database of phylogenetic trees. Here we investigate the use of two types of frequent phylogenetic structures: phylogenetic pairs, used to determine relationships between organisms, and phylogenetic triple structures, used to represent subtrees. Frequent subtree mining is then used to establish phylogenetic relationships with high confidence between a small set of organisms. This exercise was invaluable for enabling these procedures to be extended in future to encompass much larger sets of organisms.
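The pattern mining described above is carried out in Datalog and Prolog; as a rough illustration only, the core idea of comparing observed neighbouring-pair counts against an expectation under locational independence can be sketched in Python. The gene positions and molecular functions below are invented:

```python
# Hypothetical sketch of the neighbouring-pair analysis: count molecular-
# function pairs among adjacent genes and compare each count to its
# expectation if functions were assigned to locations independently.
from collections import Counter

# (start position, molecular function) for genes sorted along a chromosome
genes = [(1200, "kinase"), (5600, "transporter"), (9100, "kinase"),
         (14300, "kinase"), (20100, "transcription factor"),
         (26800, "kinase"), (31000, "transporter")]

functions = [f for _, f in genes]
pair_counts = Counter(zip(functions, functions[1:]))  # adjacent pairs
func_freq = Counter(functions)
n_pairs = len(functions) - 1

for (a, b), observed in pair_counts.most_common():
    # expected pair count under locational independence of functions
    expected = n_pairs * (func_freq[a] / len(functions)) * (func_freq[b] / len(functions))
    print(f"{a} -> {b}: observed {observed}, expected {expected:.2f}")
```

A real analysis would of course apply a proper significance test to the observed/expected ratios over the full genome.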
This research has revealed effective methods for the analysis of gene locations within genomes and has discovered patterns of order in them. Research into phylogenetic tree generation based on protein structure has identified the requirements for an effective method to extract elements of phylogenetic information from a phylogenetic database and to reconstruct a single consensus tree from that information. In this way it should be possible to produce a species tree of life with a high degree of confidence and resolution.
|
2 |
The implementation of phylogenetic structural equation modeling for biological data from variance-covariance matrices, phylogenies, and comparative analyses. Santos, Juan Carlos, 05 August 2010.
One statistical approach with a long history in the social sciences is a multivariate method called Structural Equation Modeling (SEM). The development of SEM followed the evolution of factor and path analyses, multiple regression analysis and MANOVA. One of the key innovations of factor analysis and SEM is that they group a set of multivariate statistical approaches that condense the variability among a set of variables into fewer latent (unobserved) factors. Most biological systems are multivariate and not easily dissected into their component parts. However, most biologists use only univariate statistical methods, which have definite limitations in accounting for more than a few variables simultaneously. The implementation of methodologies like SEM in biological research is therefore necessary. However, SEM cannot be applied directly to most biological datasets or generalized across species because of the hierarchical pattern of evolutionary history (i.e., phylogenetic non-independence, or signal). This report lays the theoretical grounds for the development of phylogenetic SEM in preparation for the development of practical algorithms.
I have divided this report into six parts: (1) a brief introduction to factor analysis and SEM from a historical perspective, with a brief description of their utility; (2) a summary of the implications of using biological data and of the underlying hierarchical structure due to shared common ancestry, or phylogeny; (3) a summary of the two most common comparative methods for incorporating the phylogeny into univariate analyses (i.e., phylogenetic independent contrasts and phylogenetic generalized least squares); (4) a description of how some intermediate output from both comparative methods can be used to estimate a variance–covariance matrix that has been corrected for phylogenetic signal; (5) a description of how to perform an exploratory factor analysis, specifically principal component analysis, with the corrected variance–covariance matrix; and (6) a description of the development of phylogenetic confirmatory factor analysis and phylogenetic SEM. I hope that this report encourages other researchers to develop adequate multivariate analyses that incorporate evolutionary principles.
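Steps (4) and (5) can be sketched numerically. The following fragment is an illustrative assumption of how a phylogenetically corrected variance–covariance matrix might be estimated by generalized least squares and then decomposed into principal components; the phylogenetic covariance matrix and trait values are invented:

```python
# Sketch: GLS estimate of phylogenetic trait means, corrected trait
# covariance matrix, then PCA via eigendecomposition. All numbers are
# toy values for illustration only.
import numpy as np

# Phylogenetic covariance among 4 species (shared branch lengths)
C = np.array([[1.0, 0.6, 0.2, 0.2],
              [0.6, 1.0, 0.2, 0.2],
              [0.2, 0.2, 1.0, 0.7],
              [0.2, 0.2, 0.7, 1.0]])

# Two traits measured on the same 4 species
X = np.array([[2.1, 0.5],
              [1.9, 0.7],
              [3.4, 1.8],
              [3.6, 2.0]])

Cinv = np.linalg.inv(C)
ones = np.ones((4, 1))
# GLS estimate of the phylogenetic mean of each trait
mu = np.linalg.solve(ones.T @ Cinv @ ones, ones.T @ Cinv @ X)
R = X - ones @ mu
# Phylogenetically corrected variance-covariance matrix
V = R.T @ Cinv @ R / (4 - 1)

# PCA: eigendecomposition of the corrected matrix (eigvals ascending)
eigvals, eigvecs = np.linalg.eigh(V)
print("corrected covariance:\n", V)
print("principal component variances (descending):", eigvals[::-1])
```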
|
3 |
Analysis of large-scale molecular biological data using self-organizing maps. Wirth, Henry, 19 December 2012.
Modern high-throughput technologies such as microarrays, next-generation sequencing and mass spectrometry provide huge amounts of data per measurement and challenge traditional analyses. New strategies for data processing, visualization and functional analysis are indispensable. This thesis presents an approach which applies a machine learning technique known as self-organizing maps (SOMs). SOMs enable a parallel sample- and feature-centered view of molecular phenotypes, combined with strong visualization and second-level analysis capabilities.
We developed a comprehensive analysis and visualization pipeline based on SOMs. The unsupervised SOM mapping projects the initially high number of features, such as gene expression profiles, onto meta-feature clusters of similar and hence potentially co-regulated single features. This reduction of dimension is attained by re-weighting the primary information and, in contrast to simple filtering approaches, does not entail a loss of primary information. The meta-data provided by the SOM algorithm are visualized as intuitive mosaic portraits. Sample-specific properties and common properties shared between samples emerge as a handful of localized spots in the portraits, collecting groups of co-regulated and co-expressed meta-features. These characteristic color patterns reflect the data landscape of each sample and promote immediate identification of (meta-)features of interest. It will be demonstrated that SOM portraits transform large and heterogeneous sets of molecular biological data into an atlas of sample-specific texture maps which can be directly compared in terms of similarities and dissimilarities. Spot-clusters of correlated meta-features can be extracted from the SOM portraits in a subsequent aggregation step. This spot-clustering effectively reduces the dimensionality of the data in two subsequent steps towards a handful of signature modules, in an unsupervised fashion.
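A minimal sketch of the underlying SOM mapping (not the full pipeline) may clarify the projection of many features onto a small grid of meta-features; the grid size, learning schedule and toy data below are invented:

```python
# Minimal SOM: map 200 toy feature profiles onto a 4x4 grid, so each
# grid unit collects similar (potentially co-regulated) features.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples = 200, 5
data = rng.normal(size=(n_features, n_samples))   # feature profiles over samples

grid = 4                                          # 4x4 map -> 16 meta-features
weights = rng.normal(size=(grid * grid, n_samples))
coords = np.array([(i, j) for i in range(grid) for j in range(grid)], float)

for epoch in range(20):
    lr = 0.5 * (1 - epoch / 20)                    # decaying learning rate
    sigma = max(grid / 2 * (1 - epoch / 20), 0.5)  # shrinking neighbourhood
    for x in data:
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
        dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-dist2 / (2 * sigma ** 2))              # neighbourhood kernel
        weights += lr * h[:, None] * (x - weights)

# each feature is assigned to one meta-feature (map unit)
assignment = np.array([np.argmin(((weights - x) ** 2).sum(axis=1)) for x in data])
print("features per meta-feature:", np.bincount(assignment, minlength=grid * grid))
```

In the pipeline described above, the per-sample values of these map units would then be rendered as the mosaic "portraits".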
Furthermore, we demonstrate that analysis techniques provide enhanced resolution if applied to the meta-features. The improved discrimination power of meta-features in downstream analyses such as hierarchical clustering, independent component analysis or pairwise correlation analysis is ascribed to essentially two facts: firstly, the set of meta-features better represents the diversity of patterns and modes inherent in the data; secondly, it possesses better signal-to-noise characteristics than a comparable collection of single features.
In addition to the pattern-driven feature selection in the SOM portraits, we apply statistical measures to detect significantly differential features between sample classes. The implementation of scoring measures supplements the basic SOM algorithm. Further, two variants of functional enrichment analysis are introduced which link sample-specific patterns of the meta-feature landscape with biological knowledge and support functional interpretation of the data based on the ‘guilt by association’ principle.
Finally, case studies selected from different ‘OMIC’ realms are presented. In particular, molecular phenotype data derived from expression microarrays (mRNA, miRNA), sequencing (DNA methylation, histone modification patterns) or mass spectrometry (proteome), as well as genotype data (SNP microarrays), are analyzed. It is shown that the SOM analysis pipeline has strong application capabilities and covers a broad range of potential purposes, ranging from time series and treatment-vs.-control experiments to the discrimination of samples according to genotypic, phenotypic or taxonomic classifications.
|
4 |
Ontology engineering: the brain gene ontology case study. Wang, Yufei, date unknown.
The emergence of ontologies has marked another stage in the evolution of knowledge engineering. In the biomedical domain especially, a notable number of ontologies have been developed for knowledge acquisition, maintenance, sharing and reuse from large and distributed databases, in order to meet the critical requirements of biomedical analysis and application. This research aims at the development of a Brain Gene Ontology by adopting a constructive IS methodology which tightly combines the processes of ontology learning, building, reuse and evaluation. The Brain Gene Ontology is part of the BGO project being developed by KEDRI (Gottgtroy and Jain, 2005). The objective is to represent knowledge of the genes and proteins that are related to specific brain disorders such as epilepsy and schizophrenia. The current stage focuses on crucial neuronal parameters such as AMPA, GABA, CLC and SCN through their direct or indirect interactions with other genes and proteins. Here, ontological representations were able to provide both the conceptual framework and the knowledge itself to better understand the relationships among those genes and their links to brain disorders. They also provided a semantic repository of the molecules concerned, systematically ordered. The research adopts Protégé-Frames, an open-source ontology tool suite, for BGO development. Some Protégé plug-ins were also used to extend the available functions and improve knowledge representation. The research discusses the availability and the framework of the constructive Information System research methodology for ontology development; it also describes the process that bridges different notions of the brain, genes and proteins in various databases, and illustrates how to build and implement the ontology with Protégé-Frames and its plug-ins.
The results of the BGO development showed that the constructive IS methodology does help to fill the cognitive gap between domain users and ontology developers, that the extensible, component-based architecture of Protégé-Frames significantly supports the various activities of the ontology development process, and that, by explicitly specifying the meaning of fundamental concepts and their relations, an ontology can integrate knowledge from multiple biological knowledge bases.
|
5 |
Grouping Biological Data. Rundqvist, David, January 2006.
Today, scientists in various biomedical fields rely on biological data sources in their research. Large amounts of information concerning, for instance, genes, proteins and diseases are publicly available on the internet and are used daily for acquiring knowledge. Typically, biological data is spread across multiple sources, which has led to heterogeneity and redundancy. The current thesis suggests grouping as one way of computationally managing biological data. A conceptual model for this purpose is presented which takes properties specific to biological data into account. The model defines sub-tasks and key issues where multiple solutions are possible, and describes the approaches to these that have been used in earlier work. Further, an implementation of this model is described, as well as test cases which show that the model is indeed useful. Since the use of ontologies is relatively new in the management of biological data, the main focus of the thesis is on how the semantic similarity of ontological annotations can be used for grouping. The results of the test cases show, for example, that the implementation of the model, using the Gene Ontology, is capable of producing groups of data entries with similar molecular functions.
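As a rough sketch of annotation-based grouping, the fragment below groups entries by the Jaccard overlap of their GO term sets. The identifiers are invented, and a real implementation would use a hierarchy-aware semantic similarity measure rather than plain set overlap:

```python
# Greedy single-link grouping of gene entries by GO-annotation overlap.
# GO identifiers and gene names are invented for illustration.
annotations = {
    "geneA": {"GO:0016301", "GO:0005524"},
    "geneB": {"GO:0016301", "GO:0005524", "GO:0004672"},
    "geneC": {"GO:0005215"},
    "geneD": {"GO:0005215", "GO:0016020"},
}

def jaccard(a, b):
    """Set-overlap similarity between two annotation sets."""
    return len(a & b) / len(a | b)

threshold = 0.3
groups = []
for gene, terms in annotations.items():
    # join the first group containing a sufficiently similar member
    for group in groups:
        if any(jaccard(terms, annotations[m]) >= threshold for m in group):
            group.append(gene)
            break
    else:
        groups.append([gene])

print(groups)  # entries with similar annotations end up together
```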
|
7 |
Understanding disease and disease relationships using transcriptomic data. Oerton, Erin, January 2019.
As the volume of transcriptomic data continues to increase, so too does its potential to deepen our understanding of disease, for example by revealing gene expression patterns shared between diseases. However, key questions remain around the strength of the transcriptomic signal of disease and the identification of meaningful commonalities between datasets, which are addressed in this thesis as follows. The first chapter, Concordance of Microarray Studies of Parkinson's Disease, examines the agreement between differential expression signatures across 33 studies of Parkinson's disease. Comparison of these studies, which cover a range of microarray platforms, tissues, and disease models, reveals a characteristic pattern of differential expression in the most highly affected tissues in human patients. Using correlation and clustering analyses to measure how representative different study designs are of human disease, the work described acts as a guideline for the comparison of microarray studies in the following chapters. In the next chapter, Using Dysregulated Signalling Paths to Understand Disease, gene expression changes are linked on the human signalling network, enabling the identification of network regions dysregulated in disease. Applying this method across a large dataset of 141 common and rare diseases identifies dysregulated processes shared between diverse conditions, which relate to known disease- and drug-sharing relationships. The final chapter, Understanding and Predicting Disease Relationships Through Similarity Fusion, explores the integration of gene expression with other data types (in this case, ontological, phenotypic, literature co-occurrence, genetic, and drug data) to understand relationships between diseases. A similarity fusion approach is proposed to overcome the differences in data type properties between each space, resulting in the identification of novel disease relationships spanning multiple bioinformatic levels.
The similarity of disease relationships between each data type is considered, revealing that relationships in differential expression space are distinct from those in other molecular and clinical spaces. In summary, the work described in this thesis sets out a framework for the comparative analysis of transcriptomic data in disease, including the integration of biological networks and other bioinformatic data types, in order to further our knowledge of diseases and the relationships between them.
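One simple way to realize such similarity fusion, sketched here as an assumption rather than the thesis's exact procedure, is to rank-normalize each data type's disease-disease similarity matrix (so that differently scaled spaces become comparable) and average the results:

```python
# Toy similarity fusion across two data types for three diseases.
# The matrices and the rank normalization are illustrative only.
import numpy as np

def rank_normalize(S):
    """Map each similarity entry to its rank, scaled into [0, 1]."""
    flat = S.flatten()
    order = flat.argsort().argsort()            # rank of each entry
    return (order / (len(flat) - 1)).reshape(S.shape)

# Disease-by-disease similarity in two spaces (e.g. expression, phenotype)
expr = np.array([[1.0, 0.9, 0.1],
                 [0.9, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])
pheno = np.array([[1.0, 0.3, 0.8],
                  [0.3, 1.0, 0.4],
                  [0.8, 0.4, 1.0]])

fused = (rank_normalize(expr) + rank_normalize(pheno)) / 2
print(fused)  # fused relationships draw on both spaces at once
```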
|
8 |
A data cleaning and annotation framework for genome-wide studies. Ranjani Ramakrishnan, 11 1900.
M.S. / Computer Science and Engineering / Genome-wide studies are sensitive to the quality of the annotation data included for analysis, and they often involve overlaying both computationally derived and experimentally generated data onto a genomic scaffold. A framework for the successful integration of data from diverse sources needs to address, at a minimum, the conceptualization of biological identity in the data sources, the relationship between the sources in terms of the data present, the independence of the sources, and any discrepancies in the data. The outcome of the process should either resolve these discrepancies or incorporate them into downstream analyses. In this thesis we identify factors that are important in detecting errors within and between sources and present a generalized framework to detect discrepancies. An implementation of our workflow is used to demonstrate the utility of the approach in the construction of a genome-wide mouse transcription-factor binding map and in the classification of single-nucleotide polymorphisms. We also present the impact of these discrepancies on downstream analyses. The framework is extensible, and we discuss future directions, including summarization of the discrepancies in a biologically relevant manner.
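The discrepancy-detection step can be sketched as a record-by-record comparison of two sources on a shared key; the sources and field names below are invented:

```python
# Collect cross-source discrepancies (missing records and conflicting
# field values) for downstream resolution. Toy SNP records only.
source_a = {"rs001": {"chrom": "1", "allele": "A/G"},
            "rs002": {"chrom": "2", "allele": "C/T"},
            "rs003": {"chrom": "3", "allele": "G/T"}}
source_b = {"rs001": {"chrom": "1", "allele": "A/G"},
            "rs002": {"chrom": "2", "allele": "C/A"},   # allele mismatch
            "rs004": {"chrom": "4", "allele": "A/C"}}   # only in source B

discrepancies = []
for key in sorted(set(source_a) | set(source_b)):
    if key not in source_a or key not in source_b:
        discrepancies.append((key, "missing in one source"))
        continue
    for field in source_a[key]:
        if source_a[key][field] != source_b[key].get(field):
            discrepancies.append((key, f"conflicting {field}"))

print(discrepancies)
```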
|
10 |
Caractérisation logique de données : application aux données biologiques / Logical Characterization of Data: application to Biological Data. Chambon, Arthur, 13 December 2017.
Analysis of groups of binary data is now a challenge, given the amount of data collected. It can be achieved by logic-based approaches, which identify subsets of relevant Boolean attributes to characterize the observations of a group and may help the user to better understand the properties of that group.
This thesis presents an approach for characterizing groups of binary data by identifying a minimal subset of attributes that distinguishes the data of different groups. We have precisely defined the multiple characterization problem and proposed new algorithms that can be used to solve its different variants. Our data characterization approach can be extended to the search for patterns in the framework of the logical analysis of data. A pattern can be considered a partial explanation of the positive observations that can be used by practitioners, for instance for diagnostic purposes. Many patterns may exist, and several preference criteria can be added in order to focus on more restricted sets of patterns (prime patterns, strong patterns, etc.). We propose a comparison between these two methodologies, as well as algorithms for generating patterns. A further objective is to study precisely the properties of the computed solutions with regard to the topological properties of the instances. Experiments are conducted on real biological datasets.
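The multiple characterization problem can be illustrated with a greedy approximation (the thesis itself proposes exact algorithms): select Boolean attributes until every pair of observations drawn from different groups differs on at least one selected attribute:

```python
# Greedy set-cover sketch of minimal characterization on toy binary data.
# The observations are invented; greedy is an approximation, not the
# exact method described in the thesis.
group0 = [(1, 0, 1, 0), (1, 1, 1, 0)]
group1 = [(0, 0, 1, 1), (0, 1, 0, 1)]

n_attr = 4
# cross-group pairs (i, j) not yet distinguished by a selected attribute
uncovered = {(i, j) for i in range(len(group0)) for j in range(len(group1))}
selected = []

while uncovered:
    # pick the attribute distinguishing the most remaining pairs
    best = max(range(n_attr),
               key=lambda a: sum(group0[i][a] != group1[j][a] for i, j in uncovered))
    newly = {(i, j) for i, j in uncovered if group0[i][best] != group1[j][best]}
    uncovered -= newly
    selected.append(best)

print("selected attributes:", selected)  # a characterizing subset
```

Here a single attribute (the first) already separates the two groups, so the characterization is that one attribute.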
|