11 |
Applications of Machine Learning in Source Attribution and Gene Function Prediction
Chinnareddy, Sandeep 07 June 2024
This research investigates the application of machine learning techniques in computational genomics across two distinct domains: (1) predicting the source of bacterial pathogens using whole genome sequencing data, and (2) the functional annotation of genes using single-cell RNA sequencing data. This work proposes a bioinformatics pipeline tailored for identifying genomic variants, including gene presence/absence and single nucleotide polymorphisms. This methodology is applied to Salmonella enterica serovar Typhimurium and the Ralstonia solanacearum species complex. Phylogenetic analyses, along with pan-genome and positive selection studies, show that genomic variants and evolutionary patterns of S. Typhimurium vary across sources, which suggests that sources can be accurately attributed from genomic variants using machine learning. We benchmarked seven traditional machine learning algorithms, achieving an accuracy of 94.6% in host prediction for S. Typhimurium using the Random Forest model, with SHAP value analyses elucidating key predictive features. Next, the focus shifts to the prediction of Gene Ontology terms for Arabidopsis genes using single-cell RNA-seq data. This analysis offers a detailed comparison of gene expression in root versus shoot tissues, juxtaposed with insights from bulk RNA-seq data. The integration of regulatory network data from DAP-seq significantly enhances the prediction accuracy of gene functions. / Master of Science / This work applies machine learning techniques to two areas in computational biology: predicting the hosts of bacterial pathogens based on their genome data, and predicting the functions of plant genes using single-cell gene expression data. The first part develops a method to analyze genome sequences from bacterial pathogens such as Salmonella enterica serovar Typhimurium and the Ralstonia solanacearum species complex, identifying genomic variants, including gene presence/absence and single nucleotide polymorphisms, which are variations in the genetic code. By studying the evolutionary relationships and genetic diversity among different strains, the motivation for using machine learning models to predict the sources (e.g., poultry, swine) of the pathogen genomes is established. Several machine learning models are then trained on these datasets, and the most important factors contributing to the predictions are identified. The second part focuses on predicting the functions of genes in the model plant species Arabidopsis thaliana, using gene expression data measured at the single-cell level to train machine learning models that identify standardized gene function descriptions called Gene Ontology (GO) terms. By comparing results from single-cell and bulk tissue data, the study evaluates whether the higher resolution of single-cell data improves gene function prediction accuracy. Additionally, by incorporating information about gene regulation from a specialized experiment, the role of gene expression control in determining gene functions is explored.
|
12 |
Computational Prediction of Gene Function From High-throughput Data Sources
Mostafavi, Sara 31 August 2011
A large number and variety of genome-wide genomics and proteomics datasets are now available for model organisms. Each dataset on its own presents a distinct but noisy view of cellular state. However, collectively, these datasets embody a more comprehensive view of cell function. This motivates the prediction of function for uncharacterized genes by combining multiple datasets, in order to exploit the associations between such genes and genes of known function--all in a query-specific fashion.
Commonly, heterogeneous datasets are represented as networks in order to facilitate their combination. Here, I show that it is possible to accurately predict gene function in seconds by combining multiple large-scale networks. This facilitates function prediction on demand, allowing users to take advantage of the persistent improvement and proliferation of genomics and proteomics datasets and to continuously make up-to-date predictions for large genomes such as the human genome.
Our algorithm, GeneMANIA, uses constrained linear regression to combine multiple association networks and uses label propagation to make predictions from the combined network. I introduce extensions that result in improved predictions when the number of labeled examples for training is limited, or when an ontology describing a hierarchy of gene function categories is available. Further, motivated by our empirical observations on predicting node labels for general networks, I propose a new label propagation algorithm that exploits common properties of real-world networks to increase both the speed and accuracy of our predictions.
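The two steps named in this abstract, combining association networks with weights and then propagating labels over the combined network, can be illustrated with a minimal sketch. It is not the GeneMANIA implementation: the network weights are assumed to be given rather than fitted by constrained regression, the propagation step is a generic Gaussian-field formulation that solves (I + L) f = y by Jacobi iteration, and all function names and the toy data are hypothetical.

```python
def combine_networks(networks, weights):
    """Weighted sum of symmetric adjacency matrices (lists of lists).

    The weights are assumed to be non-negative and already known; fitting
    them by constrained regression is outside this sketch.
    """
    n = len(networks[0])
    combined = [[0.0] * n for _ in range(n)]
    for net, w in zip(networks, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * net[i][j]
    return combined


def propagate_labels(adjacency, bias, iterations=200):
    """Solve (I + L) f = bias by Jacobi iteration, with L = D - W.

    Each update sets f_i = (bias_i + sum_j W_ij f_j) / (1 + d_i).
    The system is strictly diagonally dominant, so the iteration converges.
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    scores = list(bias)
    for _ in range(iterations):
        scores = [
            (bias[i] + sum(adjacency[i][j] * scores[j] for j in range(n)))
            / (1.0 + degree[i])
            for i in range(n)
        ]
    return scores


# Toy data (hypothetical): four genes and two association networks.
coexpression = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
interaction = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

combined = combine_networks([coexpression, interaction], [0.5, 0.5])

# Gene 0 is a known positive (+1), gene 3 a known negative (-1),
# and genes 1 and 2 are unlabeled (0).
scores = propagate_labels(combined, [1.0, 0.0, 0.0, -1.0])
```

In this toy chain the unlabeled gene next to the positive example receives a positive score and the one next to the negative example a negative score, which is the qualitative behaviour label propagation is meant to capture.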
|
14 |
Analysis of optimal differential gene expression
Liebermeister, Wolfram 30 March 2004
This thesis is concerned with the observation that coregulation patterns in gene expression data often reflect functional structures of the cell. First, simulated gene expression data and expression data from yeast experiments are studied with independent component analysis (ICA) and with related factor models. Then, in a more theoretical approach, relations between gene expression patterns and the biological function of the genes are derived from an optimality principle. Linear factor models such as ICA decompose gene expression matrices into statistical components. The coefficients with respect to the components can be interpreted as profiles of hidden variables (called "expression modes") that assume different values in the different samples. In contrast to clusterings, such factor models account for a superposition of effects and for individual responses of the different genes: each gene profile consists of a superposition of the expression modes, which thereby account for the common variation of many genes. The components are estimated blindly from the data, that is, without further biological knowledge, and most of the methods considered here can reconstruct almost sparse components. Thresholding a component reveals genes that respond strongly to the corresponding mode, in analogy to genes showing differential expression among individual samples. In this work, different factor models are applied to simulated and experimental expression data.
To simulate expression data, it is assumed that gene expression depends on several unobserved variables ("biological expression modes") which characterise the cell state and that the genes respond to them according to nonlinear functions called "gene programs". Is there a chance to reconstruct such expression modes with a blind data analysis? The tests in this work show that the modes can be found with ICA even if the data are noisy or weakly nonlinear, or if the numbers of true and estimated components do not match. Regression models are fitted to the profiles of single genes to explain their expression by expression modes from factor models or by the expression of single transcription factors. Nonlinear gene programs are estimated by nonlinear ICA: such effective gene programs may be used for describing gene expression in large cell models. ICA and similar methods are applied to expression data from cell-cycle experiments: besides biologically interpretable modes, experimental artefacts, probably caused by hybridisation effects and contamination of the samples, are identified. It is shown for single components that the coregulated genes share biological functions and the corresponding enzymes are concentrated in particular regions of the metabolic network. Thus the expression machinery seems to portray - as an outcome of evolution - functional relationships between the genes: regarding the economy of resources, it would probably be inefficient if cooperating genes were not coregulated. To formalise this teleological view on gene expression, a mathematical model for the analysis of optimal differential expression (ANODE) is proposed in this work: the model describes regulators, such as genes or enzymes, and output variables, such as metabolic fluxes. The system's behaviour is evaluated by a fitness function, which, for instance, rates some of the metabolic fluxes in the cell and which is supposed to be optimised.
This optimality principle defines an optimal response of regulators to small external perturbations. For calculating the optimal regulation patterns, the system to be controlled needs to be known only partially: it suffices to predefine its possible behaviour around the optimal state and the local shape of the fitness function. The method is extended to time-dependent perturbations: to describe the response of metabolic systems to small oscillatory perturbations, frequency-dependent control coefficients are defined and characterised by summation and connectivity theorems. For testing the predicted relation between expression and function, control coefficients are simulated for a large-scale metabolic network and their statistical properties are studied: the structure of the control coefficients matrix portrays the network topology, that is, chemical reactions tend to have little control on distant parts of the network. Furthermore, control coefficients within subnetworks depend only weakly on the modelling of the surrounding network. Several plausible assumptions about appropriate expression patterns can be formally derived from the optimality principle: the main result is a general relation between the behaviour of regulators and their biological functions, which implies, for example, the coregulation of enzymes acting in complexes or functional modules. In this context, the functions of genes are quantified by their linear influences (called "response coefficients") on fitness-relevant cell variables. For enzymes controlling metabolism, the theorems of metabolic control theory lead to sum rules that relate the expression patterns to the structure of the metabolic network. Further predictions concern a symmetric compensation for gene deletions and a relation between gene expression and the fitness loss caused by gene deletions.
If optimal regulation is realised by feedback signals between the cell variables and the regulators, then functional relations are also portrayed in the linear feedback coefficients, so genes of similar function may be expected to share inputs from the same signalling cascades. According to the model of optimal regulation, expression profiles are linear combinations of response coefficient profiles: tests with experimental expression profiles and simulated control coefficients support this hypothesis, and the common components which are estimated from both kinds of data provide a vivid picture of the metabolic adaptations that are required in different environments. To summarise, empirical relations between gene expression and function have been confirmed in this work. Furthermore, such relations have been predicted on theoretical grounds. A main aim is to clarify teleological assertions about gene expression by deriving them from explicit assumptions, and thus to provide a theoretical framework for the integration of expression data and functional annotations. While other authors have compared expression to functional gene categories or topologically defined metabolic pathways, I propose to relate it to the response coefficients. A main result of this work is that general relations are predicted between a gene's function, its optimal expression behaviour, and its regulatory program. Where the assumption of optimality is valid, the model justifies the use of expression data for functional annotation and pathway reconstruction, and it provides a function-related interpretation for the linear components behind expression data. The methods from this work are not limited to gene expression data: the factor models are applicable to protein and metabolite data as well, and the optimality principle may also apply to other regulatory mechanisms, such as the allosteric control of enzymes.
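The linear factor model described in this abstract can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the thesis: $x_{gs}$ is the expression of gene $g$ in sample $s$, $m_{ks}$ is the value of expression mode $k$ in sample $s$, $a_{gk}$ is the response of gene $g$ to mode $k$, and $\varepsilon_{gs}$ is a noise term.

```latex
% Each gene profile is a superposition of K expression modes plus noise.
x_{gs} = \sum_{k=1}^{K} a_{gk}\, m_{ks} + \varepsilon_{gs},
\qquad\text{or in matrix form}\qquad
X = A\,M + E .
```

Under this reading, ICA estimates $A$ and $M$ blindly from $X$ by requiring the recovered components to be statistically independent, and thresholding the entries of a component then singles out the genes that respond strongly to the corresponding mode.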
|
15 |
A Functional Genomics Approach for Characterizing the Role of Six Transcription Factors in Muscle Development
Chu, Alphonse 14 May 2012
Proper development of skeletal muscle occurs through a highly complex process where activation and repression of genes are essential. Control of this process is regulated by timely and spatial expression of specific transcription factors (TFs). Six1 and Six4 are homeodomain TFs known to be essential for skeletal muscle development in mice. Using the C2C12 cell line, a model for skeletal muscle differentiation, I used a functional genomics approach, employing siRNA specific to both these TFs, to characterize their role in skeletal myogenesis. To identify the genes that are regulated by both these TFs, gene expression profiling by microarray of cells treated with siRNA against Six1 and/or Six4 was performed. The knock-down of these TFs caused lower expression of markers of terminal differentiation genes in addition to an impairment of myoblast fusion and differentiation. Interestingly, transcript profiling of cells treated with siRNA against myogenin revealed that several of the Six1 and Six4 target genes are also regulated by myogenin. Through a combination of bioinformatic analyses it was also found that specific knock-down of Six4 causes an up-regulation of genes involved in mitosis and the cell cycle. In summary, these results show that Six1 and Six4 can both independently regulate different genes, but can also cooperate together with other TFs where they play an important role in the proper regulation of skeletal myogenesis.
|
16 |
Teaching genetics - a linguistic challenge : A classroom study of secondary teachers' talk about genes, traits and proteins
Thörne, Karin January 2012
The overall aim of this thesis is to investigate how teachers talk about genetics in actual classroom situations. An understanding of how language is used in action can give detailed information about how the subject matter is presented to the students as well as insights in linguistic challenges. From the viewpoint of seeing language to be at the very core of teaching and learning, this study investigates teachers’ spoken language in the classroom in topics within genetics that are known to be both crucial and problematic. Four lower secondary school teachers in compulsory school grade 9 (15-16 years old) were observed and recorded through a whole sequence of genetic teaching. The empirical data consisted of 45 recorded lessons. The teachers’ verbal communication was analyzed using thematic pattern analysis, which is based on the framework of systemic functional linguistics (SFL). The focus of the thesis is to determine how teachers talk about the relationships between the concepts of gene, protein and trait, i.e. the functional aspects of genetics. Prior research suggests that this is a central aspect of genetics education, but at the same time it is problematic for students to understand because the concepts belong to different organizational levels. In the first study I investigated how the concepts of gene and trait were related in the context of Mendelian genetics. My results revealed that the teachers’ way of talking resulted in different meanings regarding the relationship between gene and trait: 1) the gene as an active entity causing the trait 2) the gene as a passive entity identified by the trait 3) the gene as having the trait, and 4) the gene as being the trait. Moreover it was found that the old term anlag was regularly used by the teachers as synonym for both gene and trait. In the second study I examined how teachers included proteins in their lessons, and if and how they discussed proteins as a link between different organizational levels. 
This study showed that teachers commonly did not emphasize the many functions of proteins in our body. The main message of all teachers was that proteins are built. Two of the teachers used proteins as a link between gene and trait, whereas two of them did not. None of the teachers talked explicitly about genes as exclusively coding for proteins, which implies that the gene codes for both proteins and traits. The linguistic analysis of teachers’ talk in action revealed that small nuances in language used by the teachers resulted in different meanings of the spoken language. Thus, my work identifies several linguistic challenges in the teaching of genetics. / This thesis is written within the framework of the Hasselblad Foundation Graduate School, a four-year programme financed by the Hasselblad Foundation.
|
17 |
A Functional Genomics Approach for Characterizing the Role of Six Transcription Factors in Muscle Development
Chu, Alphonse January 2012
Proper development of skeletal muscle occurs through a highly complex process where activation and repression of genes are essential. Control of this process is regulated by timely and spatial expression of specific transcription factors (TFs). Six1 and Six4 are homeodomain TFs known to be essential for skeletal muscle development in mice. Using the C2C12 cell line, a model for skeletal muscle differentiation, I used a functional genomics approach, employing siRNA specific to both these TFs, to characterize their role in skeletal myogenesis. To identify the genes that are regulated by both these TFs, gene expression profiling by microarray of cells treated with siRNA against Six1 and/or Six4 was performed. The knock-down of these TFs caused lower expression of markers of terminal differentiation genes in addition to an impairment of myoblast fusion and differentiation. Interestingly, transcript profiling of cells treated with siRNA against myogenin revealed that several of the Six1 and Six4 target genes are also regulated by myogenin. Through a combination of bioinformatic analyses it was also found that specific knock-down of Six4 causes an up-regulation of genes involved in mitosis and the cell cycle. In summary, these results show that Six1 and Six4 can both independently regulate different genes, but can also cooperate together with other TFs where they play an important role in the proper regulation of skeletal myogenesis.
|
18 |
Étude de l’influence des éléments transposables sur la régulation des gènes chez les mammifères / Study of transposable element influence on gene regulation in mammals
Mortada, Hussein 04 October 2011
Transposable elements are genomic sequences able to replicate themselves and to move within genomes. Their ability to insert near genes and to produce chromosomal rearrangements by recombination between copies makes transposable elements mutagenic agents. Moreover, transposable elements are able to alter the expression of neighboring genes through the promoter regions they carry. Transposable elements have been found in most genomes in which they have been sought: they make up 45% of the human genome and may represent up to 90% of certain plant genomes. In the first part of my thesis, I examined the factors that determine the distribution of these elements, focusing on one particular factor: the function of the genes in the vicinity of transposable element insertions. In the second part, I sought to determine how alterations of the epigenetic modifications (more precisely, histone modifications) associated with different gene components, including transposable elements, affect the variation of gene expression in tumoral conditions.
|
19 |
Knowledge management and discovery for genotype/phenotype data
Groth, Philip 02 December 2009
Die Untersuchung des Phänotyps bringt z.B. bei genetischen Krankheiten ein Verständnis der zugrunde liegenden Mechanismen mit sich. Aufgrund dessen wurden neue Technologien wie RNA-Interferenz (RNAi) entwickelt, die Genfunktionen entschlüsseln und mehr phänotypische Daten erzeugen. Interpretation der Ergebnisse solcher Versuche ist insbesondere bei heterogenen Daten eine große Herausforderung. Wenige Ansätze haben bisher Daten über die direkte Verknüpfung von Genotyp und Phänotyp hinaus interpretiert. Diese Dissertation zeigt neue Methoden, die Entdeckungen in Phänotypen über Spezies und Methodik hinweg ermöglichen. Es erfolgt eine Erfassung der verfügbaren Datenbanken und der Ansätze zur Analyse ihres Inhalts. Die Grenzen und Hürden, die noch bewältigt werden müssen, z.B. fehlende Datenintegration, lückenhafte Ontologien und der Mangel an Methoden zur Datenanalyse, werden diskutiert. Der Ansatz zur Integration von Genotyp- und Phänotypdaten, PhenomicDB 2, wird präsentiert. Diese Datenbank assoziiert Gene mit Phänotypen durch Orthologie über Spezies hinweg. Im Fokus sind die Integration von RNAi-Daten und die Einbindung von Ontologien für Phänotypen, Experimentiermethoden und Zelllinien. Ferner wird eine Studie präsentiert, in der Phänotypendaten aus PhenomicDB genutzt werden, um Genfunktionen vorherzusagen. Dazu werden Gene aufgrund ihrer Phänotypen mit Textclustering gruppiert. Die Gruppen zeigen hohe biologische Kohärenz, da sich viele gemeinsame Annotationen aus der Gen-Ontologie und viele Protein-Protein-Interaktionen innerhalb der Gruppen finden, was zur Vorhersage von Genfunktionen durch Übertragung von Annotationen von gut annotierten Genen zu Genen mit weniger Annotationen genutzt wird. Zuletzt wird der Prototyp PhenoMIX präsentiert, in dem Genotypen und Phänotypen mit geclusterten Phänotypen, PPi, Orthologien und weiteren Ähnlichkeitsmaßen integriert und deren Gruppierungen zur Vorhersage von Genfunktionen, sowie von phänotypischen Wörtern genutzt. 
/ In diseases with a genetic component, examination of the phenotype can aid in understanding the underlying genetics. Technologies that generate high-throughput phenotypes, such as RNA interference (RNAi), have been developed to decipher gene functions. This large-scale characterization of genes greatly increases the amount of phenotypic information. Interpreting the results of such functional screens is a challenge, especially with heterogeneous data sets. Consequently, there have been only a few efforts to make use of phenotype data beyond the single genotype-phenotype relationship. Here, methods are presented for knowledge discovery in phenotypes across species and screening methods. The available databases and various approaches to analyzing their content are reviewed, including a discussion of hurdles still to be overcome, e.g. lack of data integration, inadequate ontologies and a shortage of analytical tools. PhenomicDB 2 is an approach to integrating genotype and phenotype data on a large scale, using orthologies for cross-species phenotypes. The focus lies on the uptake of quantitative and descriptive RNAi data and of ontologies for phenotypes, assays and cell lines. Then, the results of a study are presented in which the large set of phenotype data from PhenomicDB is used to predict gene annotations. Text clustering is used to group genes based on their phenotype descriptions. It is shown that these clusters correlate well with indicators of biological coherence in gene groups, such as functional annotations from the Gene Ontology (GO) and protein-protein interactions. The clusters are then used to predict gene function by carrying over annotations from well-annotated genes to less well-characterized genes. Finally, the prototype PhenoMIX is presented, integrating genotype and phenotype data with clustered phenotypes, orthologies, interaction data and other similarity measures. Data grouped by these measures are evaluated for their predictiveness of gene functions and phenotype terms.
|
20 |
Evolutionary mechanisms of plant adaptation illustrated by cytochrome P450 genes under purifying or relaxed selection / Mécanismes évolutifs de l'adaptation des plantes illustrés par les gènes de P450s sous sélection purifiante ou pression de sélection relâchée
Liu, Zhenhua 21 March 2014
Les plantes produisent une remarquable diversité de métabolites pour faire face aux contraintes d’un environnement en constante fluctuation. Cependant la manière dont les plantes ont atteint un tel degré de complexité métabolique et les forces responsables de cette diversité chimique reste largement incomprise. On considère généralement que le mécanisme de duplication des gènes contribue pour une grande part à l’évolution naturelle. En absence de transfert horizontal, les gènes d’évolution récente se cantonnent généralement chez quelques espèces et sont soumis à une évolution rapide, alors que les gènes conservés et plus anciens ont une distribution beaucoup plus large et sont porteurs de fonctions essentielles. Il est donc intéressant d’étudier l’adaptation des plantes en analysant parallèlement les gènes qui présentent soit une large distribution taxonomique, soit une distribution plus restreinte, de type lignée-spécifique. Les cytochromes P450 (CYP) constituent l’une des plus vastes familles de protéines chez les plantes, présentant des phylogénies très conservées ou très branchées qui illustrent la plasticité métabolique et la diversité chimique. Pour illustrer l’évolution des fonctions des cytochromes P450 dans le métabolisme végétal, nous avons sélectionné trois gènes, l’un très conservé au cours de l’évolution, CYP715A1 et les deux autres, CYP98A8 et CYP98A9, très récemment spécialisés de manière lignée spécifique chez les Brassicaceae. Les gènes appartenant à la famille CYP715 ont évolué avant la divergence entre gymnospermes et angiospermes, et sont le plus souvent présent en copie unique dans les génomes végétaux. Ceci suggère que leur fonction est essentielle et très conservée chez les plantes à graines (spermaphytes). 
Sur la base d’une analyse transcriptionnelle et de l’expression du gène GUS sous le contrôle du promoteur de CYP715A1, il est apparu que ce gène est spécifiquement exprimé au cours du développement floral, dans les cellules tapétales des jeunes boutons floraux ainsi que dans les filaments lors de l’anthèse. CYP715A1 est également fortement induit dans les cellules du péricycle de la zone d’élongation racinaire en réponse au stress salin. L’induction par le sel nécessite une région promotrice située entre 2 et 3 kb en amont de la région codante (i.e ; codon START), ce qui suggère la présence d’un facteur cis à cet endroit. Afin de déterminer la fonction de CYP715A1 chez Arabidopsis thaliana, j’ai identifié deux mutants d’insertion de T-DNA par génotypage et complémenté ces mutants avec le gène natif. La perte de fonction de CYP715A1 n’a pas d’impact sur la croissance et la fertilité de la plante en conditions de laboratoire. Cependant, une analyse par microscopie électronique en transmission montre un phénotype d’intine ondulée. La perte de fonction du gène CYP715A1 a également entraîné une réduction de la taille des pétales et un défaut d’anthèse. [...] / Plants produce a remarkable diversity of secondary metabolites to cope with continually challenging and fluctuating environmental constraints. However, how plants have reached such a high degree of metabolic complexity, and which evolutionary forces are responsible for this chemodiversity, remain largely unclear. Gene evolution based on gene birth and extinction has been reported to reflect natural evolution well. In the absence of horizontal gene transfer, young genes are often restricted to a few species and have undergone rapid evolution, whereas old genes can be broadly distributed and are always indicative of essential housekeeping functions. It is thus of interest to study plant adaptation with a parallel focus on both taxonomically widespread and lineage-specific genes. 
P450s are one of the largest protein families in plants, featuring both conserved and branched phylogenies. Examples of P450 properties reflecting metabolic versatility, chemodiversity and thus plant adaptation have been reported. To illustrate the evolution of P450 functions in plant metabolism, we selected two P450 genes, one evolutionarily conserved, CYP715A1, and the second a recently specialized lineage-specific gene, CYP98A9, in Arabidopsis thaliana. CYP715s evolved before the divergence between gymnosperms and angiosperms and are present in single copy in most sequenced plant genomes, suggesting an essential housekeeping function highly conserved across seed plants. Based on transcriptome analysis and promoter-driven GUS expression, CYP715A1 is selectively expressed during flower development in the tapetal cells of young buds and in the filaments of open flowers. In addition, CYP715A1 is highly induced in the pericycle cells of the root elongation zone upon salt stress. The salt induction relies on the 2-3 kb region of the CYP715A1 promoter, suggesting that salt-response elements may exist in this area. To characterize the function of CYP715A1 in Arabidopsis, I identified two T-DNA insertion mutants by genotyping and confirmed them by complementation with the native CYP715A1 gene. Loss of function of CYP715A1 has no impact on plant growth and fertility in laboratory conditions. However, transmission electron microscopy (TEM) analysis showed a consistent undulated intine phenotype in the two knockout mutants, and petal growth is also significantly inhibited. These two phenotypes match the native expression pattern of CYP715A1. Gene co-expression analysis suggests involvement of CYP715A1 in gibberellin (GA) metabolism under salt treatment. GA profiling of mutant flowers also indicates reduced accumulation of specific GAs. However, no significant phenotype related to either root growth or root architecture could be observed under salt treatment. 
Recombinant expression of the CYP715A1 enzyme in yeast has so far not allowed confirmation of a role in GA metabolism. However, metabolic profiling of inflorescences in mutants and over-expression lines, together with transcriptome analysis of the loss-of-function cyp715a1 mutants, strongly supports a role for CYP715A1 in signaling, hormone homeostasis and volatile emission, in agreement with the purifying selection leading to gene conservation observed in spermatophytes. [...]
|