It is of interest to know where in the genome DNA-binding proteins act in order to effect their gene regulatory function. For many sequence-specific DNA-binding proteins, we aim to predict the locations of their action from a model of their affinity for short DNA sequences. Existing and new models of protein sequence specificity are investigated, and their ability to predict genomic locations is evaluated. Public data from a microfluidic experiment is used to fit a matrix model of binding specificity for a single transcription factor. Physical association and dissociation constants from the experiment enable a biophysical interpretation of the data in this case. The matrix model is shown to provide a better fit to the experimental data than a model initially published with the data. Public data from 172 protein binding microarray experiments is used to fit a new type of model to 82 unique proteins. Each experiment provides measurements of the binding specificity of an individual protein to approximately 40000 DNA probes. Statistical `DNA word' models are assessed for their ability to predict held-out data and perform very well in many cases. Where available, ChIP-seq data from the ENCODE project is used to assess how well a selection of the DNA word models predicts ChIP-seq peaks and how they compare to matrix models in doing so. This $\textit{in vivo}$ data is the closest proxy to the true sites of the proteins' regulatory action that we have.
Identifier | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:744951 |
Date | January 2018 |
Creators | James, Daniel Peter |
Contributors | Hubbard, Tim ; Down, Thomas |
Publisher | University of Cambridge |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | https://www.repository.cam.ac.uk/handle/1810/277257 |