Non Parametric Unsupervised Clustering of ChIP Enrichment Regions Provides Isolation Vectors for Differential Functional Analysis

Gene transcription rates are influenced by proteins, known as transcription factors (TFs), that interact with DNA. The locations of TFs on the genome directly influence gene expression and the functional characteristics of a cell. TF binding locations can be estimated for entire genomes using high-throughput chromatin immunoprecipitation sequencing (ChIP-Seq). While the analysis of ChIP-Seq binding locations is standardized for a single experiment, complications arise when data sets from different labs and experimental conditions are combined. In this thesis, I present a method for the simultaneous comparison of multiple ChIP-Seq data sets. The method extends the analysis of a single data set by adding two stages: a combination stage and an extraction stage. Typically, one of two approaches is used to combine information from multiple data sets. Either estimated binding sites are extracted from each data set and then combined (e.g. by various intersections or unions), or the “raw” genomic signals are analyzed by clustering or dimensionality reduction methods. Both approaches have strengths, but also substantial drawbacks. The method presented here relies on both estimating the binding sites and comparing the “raw” genomic signals between data sets. Once the binding locations have been found, the first step in the combination stage is to define an alternate feature space (AFS). The AFS is the union of the binding locations determined for all data sets. It represents a subset of the genome that is likely to have TF binding in any condition where the protein is active. Once the AFS is defined, the read density of each data set is determined from its “raw” genomic signal at every location in the AFS, resulting in a unified density matrix (UDM). The UDM is the final product of the combination stage of the analysis.
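The combination stage described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `build_afs` and `build_udm`, the half-open `(chrom, start, end)` peak tuples, the `(chrom, position)` read tuples, and the choice of reads per kilobase as the density measure are all assumptions made for illustration. Adjacent (book-ended) peaks are merged here, which is also an assumption.

```python
from bisect import bisect_left

def build_afs(peak_sets):
    """Union of peaks from all data sets, with overlapping regions merged.

    Each peak is a hypothetical (chrom, start, end) tuple, half-open [start, end).
    """
    intervals = sorted(peak for peaks in peak_sets for peak in peaks)
    merged = []
    for chrom, start, end in intervals:
        # Extend the previous region when the new peak touches or overlaps it.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]

def build_udm(afs, read_sets):
    """Rows are AFS regions, columns are data sets.

    Density is taken to be reads per kilobase (an assumption, not from the source).
    """
    indexed = []
    for reads in read_sets:
        by_chrom = {}
        for chrom, pos in reads:
            by_chrom.setdefault(chrom, []).append(pos)
        for positions in by_chrom.values():
            positions.sort()
        indexed.append(by_chrom)

    udm = []
    for chrom, start, end in afs:
        row = []
        for by_chrom in indexed:
            positions = by_chrom.get(chrom, [])
            # Count reads with start <= position < end via binary search.
            count = bisect_left(positions, end) - bisect_left(positions, start)
            row.append(1000.0 * count / (end - start))
        udm.append(row)
    return udm

# Toy input: two data sets with called peaks and read positions.
peaks_a = [("chr1", 100, 200), ("chr1", 500, 600)]
peaks_b = [("chr1", 150, 250), ("chr2", 10, 60)]
reads_a = [("chr1", 120), ("chr1", 180), ("chr1", 550)]
reads_b = [("chr1", 240), ("chr2", 20), ("chr2", 59), ("chr2", 60)]

afs = build_afs([peaks_a, peaks_b])
udm = build_udm(afs, [reads_a, reads_b])
for region, row in zip(afs, udm):
    print(region, row)
```

Because every data set is scored over the same AFS regions, a peak that was called in only one data set still receives a density value in all the others, which is what makes the columns of the UDM directly comparable.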
After the data sets are homogenized into the UDM, the extraction stage is applied to the matrix. The extraction stage consists of applying machine learning techniques, and other methods used to analyze the “raw” genomic signal, to help elucidate underlying similarities and differences between the data sets. I applied this method to the binding locations of the TF TAL1 across 22 ChIP-Seq data sets from the hematopoietic and endothelial lineages. Once the UDM had been generated and normalized using quantile normalization, hierarchical clustering and principal component analysis (PCA) were applied. Clusters formed by hematopoietic stem cells (HSCs), Erythroid, and T-cell acute lymphoblastic leukemia (T-ALL) were found using hierarchical clustering. The principal components (PCs) of the UDM provided weights for each peak. Using those weights, I could separate groups of cellular conditions including T-ALL, Erythroid, HSC, and Endothelial Colony Forming Cells (ECFCs). The weights also provided a quantitative measure of importance for each peak in the AFS, based on how much weight it contributed towards the group of interest. Functional analysis techniques, including de novo motif search and Gene Ontology, were applied to the peak partitions defined using the PCs. Motifs that were enriched in the T-ALL TAL1 partition, and not in the Erythroid partition, were annotated and found to be similar to those that had previously been published, including a Runx1 motif and a preference for the CC Ebox (CACCTG). In addition to finding the CC Ebox in T-ALL, I also show that it does not form a composite motif with GATA, indicating an alternative mechanism for the binding of TAL1 in T-ALL. This thesis establishes that heterogeneous collections of ChIP-Seq data sets, from multiple labs and experimental conditions, can be meaningfully combined, and provides an algorithmic template for doing so.
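Two steps of the extraction stage, quantile normalization and PCA, can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the thesis code: the function names, the toy matrix, and the orientation (data sets treated as samples and peaks as features, so that each PC carries one weight per peak) are assumptions. Ties within a column are broken arbitrarily by the sort order, and hierarchical clustering is not shown.

```python
import numpy as np

def quantile_normalize(udm):
    """Give every column (data set) the same distribution of values."""
    order = np.argsort(udm, axis=0)                  # sort order within each column
    ranks = np.argsort(order, axis=0)                # rank of each entry in its column
    reference = np.sort(udm, axis=0).mean(axis=1)    # mean value at each rank
    return reference[ranks]

def pca_peak_weights(udm):
    """PCA with data sets as samples and peaks as features (assumed orientation).

    Returns (weights, scores, explained):
      weights[k, i] is the weight of peak i on PC k,
      scores[j, k]  is the coordinate of data set j on PC k,
      explained[k]  is the fraction of variance captured by PC k.
    """
    samples = udm.T
    centered = samples - samples.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt, u * s, s**2 / np.sum(s**2)

# Toy UDM: 5 peaks (rows) x 4 data sets (columns).
# Columns 0-1 mimic one cell type, columns 2-3 another.
toy_udm = np.array([
    [9.0, 8.0, 1.0, 2.0],
    [7.0, 9.0, 2.0, 1.0],
    [1.0, 2.0, 8.0, 9.0],
    [2.0, 1.0, 9.0, 7.0],
    [5.0, 4.0, 4.0, 5.0],
])

normalized = quantile_normalize(toy_udm)
weights, scores, explained = pca_peak_weights(normalized)

# Rank peaks by the magnitude of their weight on the first PC.
ranking = np.argsort(-np.abs(weights[0]))
print("PC1 scores per data set:", scores[:, 0])
print("PC1 weights per peak:   ", weights[0])
```

In this toy example the first PC places the two groups of data sets on opposite sides of zero, and the peak with similar density in every data set receives a weight near zero. The sign of a PC is arbitrary, so only the relative signs and magnitudes of the weights are meaningful when partitioning peaks.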

Identifier oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/35084
Date January 2016
Creators Griffith, Alexander
Contributors Perkins, Theodore; Brand, Marjorie
Publisher Université d'Ottawa / University of Ottawa
Source Sets Université d’Ottawa
Language English
Detected Language English
Type Thesis
