Genes, the fundamental building blocks of life, act together (often through their derived proteins) in modules such as protein complexes and molecular pathways to achieve a cellular function such as DNA repair and cellular transport. A current emphasis in genomics research is to identify gene modules from gene profiles, which are measurements (such as a mutant phenotype or an expression level), associated with the individual genes under conditions of interest; genes in modules often have similar gene profiles. Clustering groups of genes with similar profiles can hence deliver candidate gene modules.
Pairwise similarity measures derived from these profiles are used as input to the popular hierarchical agglomerative clustering algorithms; however, these algorithms offer little guidance on how to choose candidate modules and how to improve a clustering as new data becomes available. As an alternative, there are methods based on thresholding the similarity values to obtain a graph; such a graph can be analyzed through (probabilistic) methods developed in the social sciences. However, thresholding the data discards valuable information and choosing the threshold is difficult.
Extending binary relational analysis, we exploit ranked relational data as the basis for two distinct approaches for identifying modules from genomic data, both based on the theory of random graph processes. We propose probabilistic models for ranked relational data that allow candidate modules to be accompanied by objective confidence scores and that permit an elegant integration of external information on gene-gene relationships.
We first followed theoretical work by Ling to objectively select exceptionally isolated groups as candidate gene modules. Secondly, inspired by stochastic block models used in the social sciences, we construct a novel model for ranked relational data, where all genes have hidden module parameters which govern the strength of all gene-gene relationships. Adapting a classical likelihood often used for the analysis of horse races, clustering is performed by estimating the module parameters using standard Bayesian methods. The method allows the incorporation of prior information on gene-gene relationships; the utility of using prior information in the form of protein-protein interaction data in clustering of yeast mutant phenotype profiles is demonstrated.
Identifer | oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:BVAU.2429/688 |
Date | 05 1900 |
Creators | Brumm, Jochen |
Publisher | University of British Columbia |
Source Sets | Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada |
Language | English |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Page generated in 0.0021 seconds