Global ETD Search

81	Quantitative tool for in vivo analysis of DNA-binding proteins using High Resolution Sequencing Data Filatenkova, Milana S. January 2016 (has links) DNA-binding proteins (DBPs) such as repair proteins, DNA polymerases, recombinases, transcription factors, etc. manifest diverse stochastic behaviours dependent on physiological conditions inside the cell. Now that multiple independent in vitro studies have extensively characterised different aspects of the biochemistry of DBPs, computational and mathematical tools able to integrate this information into a coherent framework are in huge demand, especially when attempting a transition to in vivo characterisation of these systems. ChIP-Seq is the method commonly used to study DBPs in vivo. This method generates high resolution sequencing data: a population-scale readout of the activity of DBPs on the DNA. The mathematical tools currently available for the analysis of this type of data are very restrictive in their ability to extract mechanistic and quantitative details on the activity of DBPs. The main difficulty that researchers experience when analysing such population-scale sequencing data is effectively disentangling the complexity in these data, since the observed output often combines diverse outcomes of multiple unsynchronised processes reflecting biomolecular variability. Although it is a static snapshot, ChIP-Seq can be effectively utilised as a readout for the dynamics of DBPs in vivo. This thesis features a new approach to ChIP-Seq analysis, namely accessing the concealed details of the dynamic behaviour of DBPs on DNA using probabilistic modelling, statistical inference and numerical optimisation. In order to achieve this I propose to integrate previously acquired assumptions about the behaviour of DBPs into a Markov-chain model, which makes it possible to take into account their intrinsic stochasticity. 
By incorporating this model into a statistical model of data acquisition, the experimentally observed output can be simulated and then compared to in vivo data to reverse engineer the stochastic activity of DBPs on the DNA. Conventional tools normally employ simple empirical models whose parameters have no link with the mechanistic reality of the process under scrutiny. This thesis marks the transition from qualitative analysis to mechanistic modelling in an attempt to make the most of the high resolution sequencing data. It is also worth noting that, from a computer science point of view, DBPs are of great interest since they are able to perform stochastic computation on DNA by responding in a probabilistic manner to the patterns encoded in the DNA. The theoretical framework proposed here makes it possible to quantitatively characterise complex responses of these molecular machines to the sequence features. 572.8
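As a rough illustration of the idea in the abstract above (a Markov-chain model whose simulated population output is compared with a static snapshot), the following minimal sketch simulates one binding site per cell. The two-state chain, the per-step probabilities `p_on` and `p_off`, and both function names are assumptions made for this example only; the abstract does not specify the thesis's actual model.

```python
import random


def stationary_occupancy(p_on, p_off):
    """Long-run probability that the site is bound in a two-state chain.

    p_on:  per-step probability unbound -> bound (hypothetical parameter)
    p_off: per-step probability bound -> unbound (hypothetical parameter)
    """
    return p_on / (p_on + p_off)


def simulate_snapshot(p_on, p_off, n_cells, n_steps, seed=0):
    """Simulate independent cells and return the fraction bound at the end.

    Each cell starts unbound and evolves for n_steps; the final fraction
    plays the role of a population-scale static snapshot.
    """
    rng = random.Random(seed)
    bound_cells = 0
    for _ in range(n_cells):
        bound = False
        for _ in range(n_steps):
            if bound:
                if rng.random() < p_off:
                    bound = False
            elif rng.random() < p_on:
                bound = True
        bound_cells += int(bound)
    return bound_cells / n_cells


if __name__ == "__main__":
    print(stationary_occupancy(0.2, 0.6))
    print(simulate_snapshot(0.2, 0.6, n_cells=2000, n_steps=50, seed=1))
```

In this toy setting the simulated snapshot should approach the analytical occupancy as the number of cells grows, which is the kind of model-versus-data comparison that parameter inference would build on.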
82	The role of Rtr1 and Rrp6 in RNAPII in transcription termination Fox, Melanie Joy 31 August 2015 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / RNA Polymerase II (RNAPII) is responsible for transcription of messenger RNA (mRNA) and many small non-coding RNAs. Progression through the RNAPII transcription cycle is orchestrated by combinatorial posttranslational modifications of the C-terminal domain (CTD) of the largest subunit of RNAPII, Rpb1, consisting of the repetitive sequence (Y1S2P3T4S5P6S7)n. Disruptions of proteins that control CTD phosphorylation, including the phosphatase Rtr1, cause defects in gene expression and transcription termination. There are two described RNAPII termination mechanisms. Most mRNAs are terminated by the polyadenylation-dependent cleavage and polyadenylation complex. Most short noncoding RNAs are terminated by the Nrd1 complex. Nrd1-dependent termination is coupled to RNA 3' end processing and/or degradation by Rrp6, a nuclear specific subunit of the exosome. The Rrp6-containing exosome forms a 3'-5' exonuclease complex that regulates diverse aspects of nuclear RNA biology including 3' end processing and degradation of a variety of noncoding RNAs (ncRNAs). It remains unclear whether Rrp6 is directly involved in termination. We discovered that deletion of RRP6 promotes extension of multiple Nrd1-dependent transcripts resulting from improperly processed 3' RNA ends and faulty transcript termination at specific target genes. Defects in RNAPII termination cause transcriptome-wide changes in mRNA expression through transcription interference and/or antisense repression, similar to previously reported effects of Nrd1 depletion from the nucleus. Our data indicate Rrp6 acts with Nrd1 globally to promote transcription termination in addition to RNA processing and/or degradation. 
Furthermore, we found that deletion of the CTD phosphatase Rtr1 shortens the distance of transcription before Nrd1-dependent termination of specific regulatory antisense transcripts (ASTs), increases Nrd1 occupancy at these sites, and increases the interaction between Nrd1 and RNAPII. The RTR1/RRP6 double deletion phenocopies an RRP6 deletion, indicating that the regulation of ASTs by Rtr1 requires Rrp6 activity and the Nrd1 termination pathway. ChIP-Seq RNAPII Transcription RNA-Sequencing Transcription Termination RNA polymerases -- Research Gene expression Protein kinases Proteins -- Metabolism Phosphorylation
83	CIS REGULATORY MODULE DISCOVERY IN TH1 CELL DEVELOPMENT Ganakammal, Satishkumar Ranganathan January 2010 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / The immune response enables the body to resist foreign invasion. The inflammatory response is an important aspect of the immune response and is mediated by elements such as cytokines, APCs, T cells, B cells, effector cells and natural killer cells. Of these elements, T cells, and especially T-helper cells, a subclass of T cells, play a pivotal role in stimulating the immune response by participating in various biological processes such as the transcriptional regulatory network. Transcriptional regulatory mechanisms are mediated by a set of transcription factors (TFs) that bind to specific regions (motifs, or transcription factor binding sites, TFBS) on the target gene(s), controlling the expression of genes involved in the T-helper-cell-mediated immune response. Eukaryotic regulatory motifs, referred to as cis regulatory modules (CRMs) or the cistrome, co-occur with the regulated gene’s transcription start site (TSS), thus providing all the essential components for building the transcriptional regulatory networks that depend on the relevant TF-TFBS interactions. Here, we study IL-12 stimulated transcriptional regulators in STAT4 mediated T helper 1 (Th1) cell development by focusing on the identification of TFBS and CRMs using a set of Stat4 ChIP-on-chip target genes. A region containing 2000 bases of Mus musculus sequence with the Stat4 binding site, derived from the ChIP-on-chip data, has been characterized for enrichment of other motifs and thus CRMs. Our experiments identify some potential motifs (such as NF-κB and PPARγ/RXR) that are enriched in the Stat4 binding sequences compared to neighboring background sequences. Furthermore, these predicted CRMs were observed to be associated with biologically relevant target genes in the ChIP-on-chip data set by meaningful gene ontology annotations. 
These analyses will enable us to comprehend the complicated transcription regulatory network and at the same time categorically analyze the IL-12 stimulated Stat4 mediated Th1 cell differentiation. Th1 cells -- Development -- Research Immune response -- Research
84	Wide Scale Analysis of Transcription Factor Biases and Specificity Awdeh, Aseel R. 23 November 2022 (has links) There are approximately 30 trillion cells in the human body, and nearly every cell has the same genomic sequence. Yet, due to differential gene expression, we have around 200 distinct cell types each with varying functionalities. The cell type specific states are maintained via the binding of multiple regulatory proteins to different locations along the genome in a process known as transcriptional regulation. Additionally, disruptions to the transcriptional regulation process may lead to the development of disease. Hence, uncovering the complex interplay of protein-DNA interactions along the genome is of critical importance. The advent of technologies probing the genomic sequence, as well as the development of powerful computational modeling techniques to relate DNA sequences to molecular phenotype, has enabled the understanding of many molecular processes genome wide. However, these computational methods require significant adaptation to biological systems - to accurately and fully account for the biology behind the molecular processes, as well as the biases associated with the data generating systems and processes. In this thesis, we address three main issues that arise from the use of omics data, more specifically ChIP-seq data, when identifying regulatory proteins along the genome. The first part of the thesis involves the study of the biases and noise associated with ChIP-seq experiments. Each experiment is prone to noise and bias, and as such we propose the use of a customized set of weighted controls, instead of equally weighted controls, for each ChIP-seq experiment in the peak calling process to mitigate the noise and bias. To do this, we implement a peak calling algorithm, called Weighted Analysis of ChIP-seq (WACS), which is an extension of the well-known peak caller MACS2, to incorporate the weighted controls in the peak calling process. 
We show that our approach assists in a better approximation of the noise distribution in controls, and fundamentally improves our understanding of ChIP-seq signals and their biases. Another aspect we explore in this thesis is the ability to uncover cell type specificity of transcription factor binding from the ChIP-seq data. A transcription factor may bind to various parts of the genome in different cell types, due to modifications in the DNA-binding preferences of the transcription factor, or other mechanisms, such as chromatin accessibility or cooperative binding, thus leading to a "DNA signature" of differential binding. We develop a deep learning approach, called SigTFB (Signatures of TF Binding), and conduct a wide scale analysis of hundreds of transcription factors to identify and quantify the varying degrees of cell type specific DNA signatures of various transcription factors across cell types. We also assess the consistency of cell type specificity for a specific transcription factor when assayed by different antibodies. We show that many transcription factors are indeed cell type specific, while others are more general with lower cell type specificity. Finally, to further explain the biology behind a transcription factor's cell type specificity, or lack thereof, we conduct a wide scale motif enrichment analysis of all transcription factors in question. We show that cell type specific transcription factors are typically associated with corresponding differences in motif enrichment and gene expression. Together, these contributions deepen our knowledge of transcription factor binding, and how experimental and cell type specific variations can be uncovered. transcription factor DNA-binding deep learning machine learning differential binding cell type specificity noise bias ChIP-seq controls
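The contrast drawn in the abstract above between equally weighted and custom-weighted controls can be pictured with a small sketch. This is not the WACS implementation: the function name `pooled_control`, the per-bin coverage lists, and the weight values are hypothetical, and how weights are actually estimated is not described in the abstract.

```python
def pooled_control(controls, weights):
    """Combine several control coverage tracks into one background track.

    controls: list of equal-length lists of per-bin read coverage
    weights:  one non-negative weight per control (hypothetical values)
    """
    if len(controls) != len(weights):
        raise ValueError("need exactly one weight per control track")
    n_bins = len(controls[0])
    if any(len(track) != n_bins for track in controls):
        raise ValueError("all control tracks must cover the same bins")
    return [
        sum(w * track[i] for w, track in zip(weights, controls))
        for i in range(n_bins)
    ]


if __name__ == "__main__":
    controls = [[2.0, 4.0], [6.0, 8.0]]
    # Equal weighting treats every control as equally informative.
    print(pooled_control(controls, [0.5, 0.5]))
    # Custom weighting lets one control dominate the background estimate.
    print(pooled_control(controls, [0.25, 0.75]))
```

A peak caller would then compare the treatment signal in each bin against this pooled background rather than against a plain average of the controls.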
85	Chromatin-associated functions of the APC tumor suppressor protein Hankey, William C., IV January 2016 (has links) No description available. Biomedical Research APC tumor suppressor colorectal cancer canonical WNT signaling chromatin transcription AP-1 transcription factor ChIP-seq beta-catenin
86	Droplet-Based Microfluidics for High-Throughput Single-Cell Omics Profiling Zhang, Qiang 06 September 2022 (has links) Droplet-based microfluidics is a powerful tool permitting massive-scale single-cell analysis in pico-/nano-liter water-in-oil droplets. It has been integrated into various library preparation techniques to accomplish high-throughput scRNA-seq, scDNA-seq, scATAC-seq, scChIP-seq, as well as scMulti-omics-seq. These advanced technologies have been providing unique and novel insights into both normal differentiation and disease development at the single-cell level. In this thesis, we develop four new droplet-based tools for single-cell omics profiling. First, the developed Drop-BS is the first droplet-based platform to construct single-cell bisulfite sequencing libraries for DNA methylome profiling and allows production of BS libraries of 2,000-10,000 single cells within 2 days. We applied the technology to separately profile mixed cell lines, mouse brain tissues, and human brain tissues to reveal cell type heterogeneity. Second, the new Drop-ChIP platform requires only two steps of droplet generation to achieve multiple reaction steps in droplets, such as single-cell lysis, chromatin fragmentation, ChIP, and barcoding. Third, we aim to establish a droplet-based platform to accomplish high-throughput full-length RNA-seq (Drop-full-seq), which neither current tube-based nor droplet-based methods can realize. Last, we constructed an in-house droplet-based tool to assist single-cell ATAC-seq library preparation (Drop-ATAC), which provides a low-cost and facile protocol to conduct scATAC-seq in laboratories without the expensive instrument. / Doctor of Philosophy / Microfluidics is a collection of techniques to manipulate fluids at the micrometer scale. One microfluidic technique is called "droplet-based microfluidics". It can manipulate (e.g., generate, merge, sort, split) pico-/nano-liter water-in-oil droplets. 
First, since the water phase is separated by the continuous oil phase, these droplets are discrete and individual reactors. Second, droplet-based microfluidics can achieve highly parallel manipulation of thousands to millions of droplets. These two advantages make droplet-based microfluidics an ideal tool to perform single-cell assays. Over the past 10 years, various droplet-based platforms have been developed to study the single-cell transcriptome, genome, epigenome, as well as multi-ome. To expand droplet-based tools for single-cell analysis, we aim to develop four novel platforms in this thesis. First, Drop-BS, by integrating droplet generation and droplet fusion techniques, can achieve high-throughput single-cell bisulfite sequencing library preparation. It can generate 10,000 single-cell BS libraries within 2 days, which is difficult to achieve with conventional library preparation in tubes/microwells. Second, we developed a novel and facile Drop-ChIP platform to prepare single-cell ChIP-seq libraries. It is easy to operate since it requires only two steps of droplet generation. It also generates higher-quality data than previous work. In addition, we are working on the development and characterization of the other two droplet-based tools to achieve full-length single-cell RNA-seq and single-cell ATAC-seq. droplet-based microfluidics single-cell analysis ChIP-seq BS-seq RNA-seq ATAC-seq next generation sequencing library preparation
87	Le complexe TFIIH dans la transcription effectuée par l'ARN polymèrase II et l'ARN polymèrase III Zadorin, Anton 28 September 2012 (has links) (PDF) Two phenomena related to TFIIH were studied: the influence of specific mutations in the XPD subunit of TFIIH on the transcriptional response of certain genes after UV irradiation, and the interaction between TFIIH and the transcription of class III genes. A detailed analysis of transcriptome dynamics was carried out for the response of XP-D/CS mutant human cells to UV. The selective dysregulation of gene expression that was observed was shown to be linked to an inability to re-initiate transcription and to subsequent heterochromatinization, in which the histone deacetylase SIRT1 was identified as the main factor. Its inhibition restored normal expression of a substantial number of the affected genes. A genome-wide study of the involvement of the TFIIH core in transcription revealed its association with active class III genes. This association was shown to be independent of Pol II. The TFIIH core was shown to participate directly in transcription carried out in vitro by Pol III. TFIIH RNA-SEQ Chromatine STRESS UV Xeroderma pigmentosum Cockayne syndrome SIRT1 ARN polymérase III CHIP-SEQ
88	Génomique fonctionnelle des cellules corticotropes hypophysaires : contrôle génétique de la gestion systémique des stress Langlais, David 08 1900 (has links) The hypothalamic-pituitary-adrenal (HPA) axis maintains the organism's homeostasis in the face of various stresses. Whether psychological, physical or inflammatory/infectious in nature, stresses trigger the synthesis and release of CRH by the hypothalamus. Pituitary corticotrope cells sense this signal and, in response, produce and secrete ACTH. This induces the synthesis of glucocorticoids (Gc) by the adrenal cortex; these steroids put the metabolic system on alert for the response to stress and aggression. Gc have the essential role of controlling the organism's defenses, in addition to exerting negative feedback on the HPA axis. ACTH is a small peptide hormone produced by cleavage of a precursor: pro-opiomelanocortin (POMC). Because of its critical position in the normalization of homeostasis, transcriptional control of the Pomc gene has been studied in depth over recent decades. We now know that the promoter region of the Pomc gene allows targeted expression in pituitary POMC cells. Studying the Pomc locus with genomic technologies allowed me to discover a new regulatory element that is conserved throughout mammalian evolution. Characterization of this enhancer showed that it drives expression restricted to the pituitary, and more specifically to corticotrope cells. Interestingly, the activity of this element depends on a new binding site that recruits a homodimer of the transcription factor Tpit, whose expression is also limited to pituitary POMC cells. The discovery of this enhancer adds an entirely new dimension to the regulation of POMC expression. 
The pro-inflammatory cytokines IL6/LIF and Gc are known for their antagonism in the inflammatory reaction and at the Pomc promoter, through the action of the transcription factors Stat3 and GR, respectively. Genomic analysis of the sites bound by these two factors revealed a complex interrelationship and made it possible to define a transcriptional code between these signalling pathways. In addition to acting through direct interaction with DNA at regulatory sequences, these factors interact directly with each other, with different transcriptional outcomes. Thus, recruitment of GR through protein:protein contact (tethering) onto DNA-bound Stat3 causes transcriptional antagonism. Conversely, tethering of Stat3 onto GR supports synergistic action, as does their co-recruitment to DNA at contiguous or composite sites. During sustained activation, this synergism between the IL6/LIF and Gc pathways induces an innate cellular defense response. Thus, during a major stress, this defense mechanism is set in motion in all cells and tissues. In summary, the work presented in this thesis defines the transcriptional mechanisms engaged in the organism's fight against stresses. More specifically, these mechanisms were described at the level of the global response of corticotropes and of the Pomc gene. It is essential for the organism to induce these mechanisms adequately in order to cope with stresses and to avoid disorders such as inflammatory and metabolic diseases. / The hypothalamo-pituitary-adrenal (HPA) axis regulates homeostasis in various conditions of stress, contributing to both the stress response and its termination. Psychological, physical or inflammatory/infectious stresses all prompt the synthesis and secretion of hypothalamic CRH. 
The pituitary corticotrope cells receive this signal and in turn secrete ACTH, which triggers the synthesis of glucocorticoids (Gc) by the adrenal cortex; these steroids induce a general state of alertness in order to fight or flee aggressions and stresses. Glucocorticoids have the critical role of restricting the stress response by exerting a negative feedback on the HPA axis. ACTH is a small peptide hormone produced after cleavage of a precursor protein: pro-opiomelanocortin (POMC). Due to its critical role in homeostasis, transcriptional control of the Pomc gene has been intensely studied during the last decades. Previous investigations identified a promoter region that is sufficient for expression of Pomc in the appropriate pituitary cells. Genome-wide studies of the Pomc locus led me to discover a novel regulatory element that is conserved throughout mammalian evolution. The activity of this enhancer is restricted to the pituitary, and more precisely to the corticotrope lineage. Interestingly, its activity depends on a novel transcription factor binding motif that binds homodimers of Tpit, a transcription factor that is only found in pituitary POMC cells. The discovery of this enhancer adds a new dimension in the control of pituitary Pomc expression. The IL6/LIF pro-inflammatory cytokines and the glucocorticoids are well known for their antagonism in control of the inflammatory response; at the Pomc promoter, their action is mediated by the transcription factors Stat3 and GR, respectively. The analysis of genomic sites bound by these two factors revealed a complex relationship and led us to define a transcription regulatory code linking these signalling pathways. In addition to their direct DNA interaction with cognate regulatory sequences, these factors interact with each other with different outcomes. Thus, the recruitment of GR on DNA-bound Stat3 through protein:protein contacts (tethering) results in transcriptional antagonism. 
Conversely, Stat3 tethering to GR produces synergism; this is also the case when the two factors are co-recruited to DNA on contiguous or composite binding sites. Prolonged activation of the IL6/LIF and Gc pathways elicits a synergistic innate cell defense response in all cells and tissues. In summary, this doctoral work has defined transcriptional mechanisms that mediate and control the stress response. In particular, pituitary components of the stress response were defined at the level of the Pomc gene and as a global response of corticotrope cells. This response is critical for appropriate organism defense during stresses such as those produced in inflammatory and metabolic diseases. POMC Stat3 GR Tethering ChIP-seq microarray Hypophyse Défense Cytokine Glucocorticoïde Pituitary Defense Glucocorticoid
89	INVESTIGATING THE FUNCTIONAL ROLE OF MED5 AND CDK8 IN ARABIDOPSIS MEDIATOR COMPLEX Xiangying Mao (6714896) 02 August 2019 (has links) <p>The Mediator (Med) complex comprises about 30 subunits and is a transcriptional co-regulator in eukaryotic systems. The core Mediator complex, consisting of the head, middle and tail modules, functions as a bridge between transcription factors and basal transcription machinery, whereas the CDK8 kinase module can attenuate Mediator’s ability to function as either a co-activator or co-repressor. Many Arabidopsis Mediator subunits have been functionally characterized, revealing critical roles of Mediator in many aspects of plant growth and development, responses to biotic and abiotic stimuli, and metabolic homeostasis. Traditional genetic and biochemical approaches laid the foundation for our understanding of Mediator function, but recent transcriptomic and metabolomic studies have provided deeper insights into how specific subunits cooperate in the regulation of plant metabolism. In Chapter 1, we highlight recent developments in the investigation of Mediator and plant metabolism, with emphasis on the large-scale biology studies of <i>med</i> mutants.</p> <p>We previously found that MED5, an Arabidopsis Mediator tail subunit, is required for maintaining phenylpropanoid homeostasis. A semi-dominant mutation (<i>reduced epidermal fluorescence 4-3</i>, <i>ref4-3</i>) that causes a single amino acid substitution in MED5b functions as a strong suppressor of the pathway, leading to <a>decreased soluble phenylpropanoid accumulation, reduced lignin content and dwarfism</a>. In contrast, loss of MED5a and MED5b (<i>med5</i>) results in increased levels of phenylpropanoids. In Chapter 2, we present our finding that <i>ref4-3</i> requires CDK8, a Mediator kinase module subunit, to repress plant growth even though the repression of phenylpropanoid metabolism in <i>ref4-3 </i>is CDK8-independent. 
Transcriptome profiling revealed that salicylic acid (SA) biosynthesis genes are up-regulated in a CDK8-dependent manner in <i>ref4-3,</i> resulting in hyper-accumulation of SA and up-regulation of SA response genes. Both growth repression and hyper-accumulation of SA in <i>ref4-3</i> require CDK8 with intact kinase activity, but these SA phenotypes are not connected with dwarfing. In contrast, mRNA-sequencing (RNA-seq) analysis revealed the up-regulation of a DNA J protein-encoding gene in <i>ref4-3</i>, the elimination of which partially suppresses dwarfing. Together, our study reveals genetic interactions between Mediator tail and kinase module subunits and enhances our understanding of dwarfing in phenylpropanoid pathway mutants.</p> <p>In Chapter 3, we characterize other phenotypes of <i>med5</i> and <i>ref4-3</i>, and find that in addition to the up-regulated phenylpropanoid metabolism, <i>med5</i> shows other interesting phenotypes including hypocotyl and petiole elongation as well as accelerated flowering, all of which are known collectively as the shade avoidance syndrome (SAS), suggesting that MED5 antagonizes shade avoidance in wild-type plants. In contrast, the constitutive <i>ref4-3 </i>mutant protein inhibits the process, and the stunted growth of <i>ref4-3 </i>mutants is substantially alleviated by the light treatment that triggers SAS. Moreover, <i>ref4-3</i> mimics the loss-of-function <i>med5</i> mutants in maintaining abscisic acid (ABA) levels under both normal and drought growth conditions. 
The phenotypic characterization of <i>med5</i> mutants extends our understanding of the role of Mediator in SAS and ABA signaling, providing further insight into the physiological and metabolic responses that require MED5.</p> <p>In Chapter 4, we explore the function of MED5 and CDK8 in gene expression regulation by investigating the effect of mutations in Mediator including <i>med5</i>, <i>ref4-3</i>, <i>cdk8-1</i> and <i>ref4-3 cdk8-1</i> on genome-wide Pol II distribution. We find that loss of MED5 results in loss of Pol II occupancy at many target genes. In contrast, many genes show enriched Pol II levels in <i>ref4-3</i>, some of which overlap with those showing reduced Pol II occupancy in <i>med5</i>. In addition, Pol II occupancy is significantly reduced when CDK8 is disrupted in <i>ref4-3</i>. Our results help to narrow down the direct gene targets of MED5 and identify genes that may be closely related to the growth deficiency observed in <i>ref4-3</i> plants, providing a critical foundation to elucidate the molecular function of Mediator in transcription regulation.</p> Molecular Biology Bioinformatics Plant Biology phenylpropanoid biosynthetic pathway Mediator complex Salicylic acid dwarfing shade avoidance syndrome Chip-Seq
90	Machine learning for epigenetics : algorithms for next generation sequencing data Mayo, Thomas Richard January 2018 (has links) The advent of Next Generation Sequencing (NGS), a little over a decade ago, has led to a vast and rapid increase in the generation of genomic data. The drastically reduced cost has in turn enabled powerful modifications that can be used to investigate not just genetic, but epigenetic, phenomena. Epigenetics refers to the study of mechanisms affecting gene expression other than the genetic code itself and thus, at the transcription level, incorporates DNA methylation, transcription factor binding and histone modifications amongst others. This thesis outlines and tackles two major challenges in the computational analysis of such data using techniques from machine learning. Firstly, I address the problem of testing for differential methylation between groups of bisulfite sequencing data sets. DNA methylation plays an important role in genomic imprinting, X-chromosome inactivation and the repression of repetitive elements, as well as being implicated in numerous diseases, such as cancer. Bisulfite sequencing provides single nucleotide resolution methylation data at the whole genome scale, but a sensitive analysis of such data is difficult. I propose a solution that uses a powerful kernel-based machine learning technique, the Maximum Mean Discrepancy, to leverage well-characterised spatial correlations in DNA methylation, and adapt the method for this particular use. I use this tailored method to analyse a novel data set from a study of ageing in three different tissues in the mouse. This study motivates further modifications to the method and highlights the utility of the underlying measure as an exploratory tool for methylation analysis. Secondly, I address the problem of predictive and explanatory modelling of chromatin immunoprecipitation sequencing data (ChIP-Seq). 
ChIP-Seq is typically used to assay the binding of a protein of interest, such as a transcription factor or histone, to the DNA, and as such is one of the most widely used sequencing assays. While peak callers are a powerful tool in identifying binding sites of sparse and clean ChIP-Seq profiles, broader signals defy analysis in this framework. Instead, generative models that explain the data in terms of the underlying sequence can help uncover mechanisms that predict binding or the lack thereof. I explore current problems with ChIP-Seq analysis, such as zero-inflation and the use of the control experiment, known as the input. I then devise a method for representing k-mers that enables the use of longer DNA sub-sequences within a flexible model development framework, such as generalised linear models, without heavy programming requirements. Finally, I use these insights to develop an appropriate Bayesian generative model that predicts ChIP-Seq count data in terms of the underlying DNA sequence, incorporating DNA methylation information where available, fitting the model with the Expectation-Maximization algorithm. The model is tested on simulated data and real data pertaining to the histone mark H3K27me3. This thesis therefore straddles the fields of bioinformatics and machine learning. Bioinformatics is both plagued and blessed by the plethora of different techniques available for gathering data and their continual innovations. Each technique presents a unique challenge, and hence out-of-the-box machine learning techniques have had little success in solving biological problems. While I have focused on NGS data, the methods developed in this thesis are likely to be applicable to future technologies, such as Third Generation Sequencing methods, and the lessons learned in their adaptation will be informative for the next wave of computational challenges.
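For readers unfamiliar with the Maximum Mean Discrepancy named in the abstract above, here is a minimal sketch of the standard biased estimator of squared MMD with a Gaussian kernel, applied to two groups of scalar values. The thesis's tailored variant, which exploits spatial correlation in methylation, is not reproduced here; the function names, the bandwidth of 1.0 and the example values are assumptions for illustration.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two scalar observations."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    # Hypothetical methylation levels (fractions) in two groups of samples.
    group_a = [0.10, 0.20, 0.30]
    group_b = [0.80, 0.90, 0.70]
    print(mmd_squared(group_a, group_a))  # identical samples: zero up to rounding
    print(mmd_squared(group_a, group_b))  # differing samples: positive
```

In practice a significance threshold for the statistic is usually obtained by permuting group labels, which this sketch leaves out.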