Global ETD Search

321	Detection and Classification of Cancer and Other Noncommunicable Diseases Using Neural Network Models Gore, Steven Lee 07 1900 (has links) Here, we show that training with multiple noncommunicable diseases (NCDs) is both feasible and beneficial to modeling this class of diseases. We first use data from the Cancer Genome Atlas (TCGA) to train a pan cancer model, and then characterize the information the model has learned about the cancers. In doing this we show that the model has learned concepts that are relevant to the task of cancer classification. We also test the model on datasets derived independently of the TCGA cohort and show that the model is robust to data outside of its training distribution such as precancerous lesions and metastatic samples. We then utilize the cancer model as the basis of a transfer learning study where we retrain it on other, non-cancer NCDs. In doing so we show that NCDs with very different underlying biology contain extractible information relevant to each other, allowing for a broader model of NCDs to be developed with existing datasets. We then test the importance of the sample's source tissue in the model and find that the NCD class and tissue source may not be independent in our model. To address this, we use the tissue encodings to create augmented samples. We test how successfully we can use these augmented samples to remove or diminish tissue source importance to NCD class through retraining the model. In doing this we make key observations about the nature of concept importance and its usefulness in future neural network explainability efforts. Cancer Neural network VAE generative augmented data methylation variational autoencoder CpG island TCGA schizophrenia asthma arthritis transfer learning TCAV Biology, Bioinformatics Computer Science
322	Inference of Gene Regulatory Networks with integration of prior knowledge Maresi, Emiliano 17 June 2024 (has links) Gene regulatory networks (GRNs) are crucial for understanding complex biological processes and disease mechanisms, particularly in cancer. However, GRN inference remains challenging due to the intricate nature of gene interactions and limitations of existing methods. Traditionally, prior knowledge in GRN inference simplifies the problem by reducing the search space, but its full potential is unrealized. This research aims to develop a method that uses prior knowledge to guide the GRN inference process, enhancing accuracy and biological plausibility of the resulting networks. We extended the Fused Sparse Structural Equation Models (FSSEM) framework to create the Fused Lasso Adaptive Prior (FLAP) method. FSSEM incorporates gene expression data and genetic variants in the form of expression quantitative trait loci (eQTLs) perturbations. FLAP enhances FSSEM by integrating prior knowledge of gene-gene interactions into the initial network estimate, guiding the selection of relevant gene interactions in the final inferred network. We evaluated FLAP using synthetic data to assess the impact of incorrect prior knowledge and real lung cancer data, using prior knowledge from various gene network databases (GIANT, TissueNexus, STRING, ENCODE, hTFtarget). Our findings demonstrate that integrating prior knowledge improves the accuracy of inferred networks, with FLAP showing tolerance for incorrect prior knowledge. Using real lung cancer data, functional enrichment analysis and literature validation confirmed the biological plausibility of the networks inferred by FLAP. Different sources of prior knowledge impacted the results, with GIANT providing the most biologically relevant networks, while other sources showed less consistent performance. 
FLAP improves GRN inference by effectively integrating prior knowledge, demonstrating robustness against incorrect prior knowledge. The method’s application to lung cancer data indicates that high-quality prior knowledge sources enhance the biological relevance of inferred networks. Future research should focus on improving the quality and integration of prior knowledge, possibly by developing consensus methods that combine multiple sources. This approach has potential applications in cancer research and drug sensitivity studies, offering a more accurate understanding of gene regulatory mechanisms and potential therapeutic targets. Settore INF/01 - Informatica
323	Phylogénomique des Archées Grenier, Jean-Christophe 07 1900 (has links) Les transferts horizontaux de gènes (THG) ont été démontrés pour jouer un rôle important dans l'évolution des procaryotes. Leur impact a été le sujet de débats intenses, ceux-ci allant même jusqu'à l'abandon de l'arbre des espèces. Selon certaines études, un signal historique dominant est présent chez les procaryotes, puisque les transmissions horizontales stables et fonctionnelles semblent beaucoup plus rares que les transmissions verticales (des dizaines contre des milliards). Cependant, l'effet cumulatif des THG est non-négligeable et peut potentiellement affecter l'inférence phylogénétique. Conséquemment, la plupart des chercheurs basent leurs inférences phylogénétiques sur un faible nombre de gènes rarement transférés, comme les protéines ribosomales. Ceux-ci n'accordent cependant pas autant d'importance au modèle d'évolution utilisé, même s'il a été démontré que celui-ci est important lorsqu'il est question de résoudre certaines divergences entre ancêtres d'espèces, comme pour les animaux par exemple. Dans ce mémoire, nous avons utilisé des simulations et analyser des jeux de données d'Archées afin d'étudier l'impact relatif des THG ainsi que l'impact des modèles d'évolution sur la précision phylogénétique. Nos simulations prouvent que (1) les THG ont un impact limité sur les phylogénies, considérant un taux de transferts réaliste et que (2) l'approche super-matrice est plus précise que l'approche super-arbre. Nous avons également observé que les modèles complexes expliquent non seulement mieux les données que les modèles standards, mais peuvent avoir un impact direct sur différents groupes phylogénétiques et sur la robustesse de l'arbre obtenu. Nos résultats contredisent une publication récente proposant que les Thaumarchaeota apparaissent à la base de l'arbre des Archées. 
/ Horizontal gene transfer (HGT) has been shown to play an important role in the evolution of prokaryotes. Its impact on phylogeny has been the subject of heated debate, with some proposing that the concept of a species tree should be abandoned. The phylogeny of prokaryotes does contain a major part of the historical signal, because stable and functional horizontal transmissions appear to be far rarer than vertical transmissions (tens versus billions). However, the cumulative effect of HGT is non-negligible and can potentially affect phylogenetic inference. Therefore, most researchers base their phylogenetic inference on a small number of rarely transferred genes such as ribosomal proteins, but they treat the choice of the model of evolution as less important, even though it has been shown to be of prime importance for much shallower divergences, such as those among animals. Here, we used a combination of simulations and real data from Archaea to study the relative impact of HGT and of the inference methods on phylogenetic accuracy. Our simulations prove that (1) HGTs have a limited impact on phylogeny, assuming a realistic rate and (2) the supermatrix approach is much more accurate than the supertree approach. We also observed that more complex models of evolution not only have a better fit to the data, but can also have a direct impact on different phylogenetic groups and on the robustness of the tree. Our results contradict a recent publication proposing that the Thaumarchaeota are at the base of the Archaeal tree. phylogénie phylogeny phylogénomique phylogenomics procaryotes prokaryotes Archées Archaea transfert horizontal de gènes horizontal gene transfer évolution moléculaire molecular evolution simulations simulation modèles évolutifs evolutionary models super-matrice supermatrix super-arbre supertree
324	Amélioration de l'exactitude de l'inférence phylogénomique Roure, Béatrice 04 1900 (has links) L’explosion du nombre de séquences permet à la phylogénomique, c’est-à-dire l’étude des liens de parenté entre espèces à partir de grands alignements multi-gènes, de prendre son essor. C’est incontestablement un moyen de pallier aux erreurs stochastiques des phylogénies simple gène, mais de nombreux problèmes demeurent malgré les progrès réalisés dans la modélisation du processus évolutif. Dans cette thèse, nous nous attachons à caractériser certains aspects du mauvais ajustement du modèle aux données, et à étudier leur impact sur l’exactitude de l’inférence. Contrairement à l’hétérotachie, la variation au cours du temps du processus de substitution en acides aminés a reçu peu d’attention jusqu’alors. Non seulement nous montrons que cette hétérogénéité est largement répandue chez les animaux, mais aussi que son existence peut nuire à la qualité de l’inférence phylogénomique. Ainsi en l’absence d’un modèle adéquat, la suppression des colonnes hétérogènes, mal gérées par le modèle, peut faire disparaître un artéfact de reconstruction. Dans un cadre phylogénomique, les techniques de séquençage utilisées impliquent souvent que tous les gènes ne sont pas présents pour toutes les espèces. La controverse sur l’impact de la quantité de cellules vides a récemment été réactualisée, mais la majorité des études sur les données manquantes sont faites sur de petits jeux de séquences simulées. Nous nous sommes donc intéressés à quantifier cet impact dans le cas d’un large alignement de données réelles. Pour un taux raisonnable de données manquantes, il appert que l’incomplétude de l’alignement affecte moins l’exactitude de l’inférence que le choix du modèle. Au contraire, l’ajout d’une séquence incomplète mais qui casse une longue branche peut restaurer, au moins partiellement, une phylogénie erronée. 
Comme les violations de modèle constituent toujours la limitation majeure dans l’exactitude de l’inférence phylogénétique, l’amélioration de l’échantillonnage des espèces et des gènes reste une alternative utile en l’absence d’un modèle adéquat. Nous avons donc développé un logiciel de sélection de séquences qui construit des jeux de données reproductibles, en se basant sur la quantité de données présentes, la vitesse d’évolution et les biais de composition. Lors de cette étude nous avons montré que l’expertise humaine apporte pour l’instant encore un savoir incontournable. Les différentes analyses réalisées pour cette thèse concluent à l’importance primordiale du modèle évolutif. / The explosion in the number of available sequences has allowed phylogenomics, the study of species relationships based on large multi-gene alignments, to flourish. Phylogenomics is undoubtedly an efficient way to reduce the stochastic errors of single-gene phylogenies, but numerous problems remain despite the progress made in modeling the evolutionary process. In this PhD thesis, we characterize some aspects of poor model fit and study their impact on the accuracy of phylogenetic inference. In contrast to heterotachy, variation in the amino acid substitution process over time has so far received little attention. We demonstrate not only that this heterogeneity is widespread in animals, but also that it can degrade the quality of phylogenomic inference. In the absence of an adequate model, removing heterogeneous columns, which are poorly handled by the model, can eliminate a reconstruction artefact. In a phylogenomic framework, the sequencing strategies used often mean that some genes are absent for some species. 
The debate over the impact of the number of empty cells was recently revived, but most studies of missing data are performed on small datasets of simulated sequences. We therefore set out to quantify this impact in the case of a large alignment of real data. With a reasonable amount of missing data, the accuracy of the inference appears to be influenced more by the choice of the model than by the incompleteness of the alignment. On the contrary, the addition of an incomplete sequence that breaks a long branch can at least partially correct an erroneous phylogeny. Because model violations remain the major limitation on the accuracy of phylogenetic inference, improving species and gene sampling remains a useful alternative in the absence of an adequate model. Therefore, we developed sequence-selection software that allows the reproducible construction of datasets, based on the quantity of data, their evolutionary rate and their compositional bias. During this study, we found that human expertise still provides indispensable knowledge. The various analyses performed in the course of this PhD thesis agree on the paramount importance of the model of sequence evolution. Phylogénomique Exactitude de l’inférence Hétéropécilie Échantillonnage des espèces Sélection des séquences Données manquantes Violation de modèle Phylogenomics Accuracy of the inference Heteropecilly Species sampling Sequence sorting Missing data Model violation
325	Dynamic epigenetic changes in immune responses to infection in human dendritic cells Pacis, Alain 05 1900 (has links) La méthylation de l'ADN est une marque épigénétique importante chez les mammifères. Malgré le fait que la méthylation de la cytosine en 5' (5mC) soit reconnue comme une modification épigénétique stable, il devient de plus en plus reconnu qu'elle soit un processus plus dynamique impliquant des voies de méthylation et de déméthylation actives. La dynamique de la méthylation de l'ADN est désormais bien caractérisée dans le développement et dans le fonctionnement cellulaire des mammifères. Très peu est cependant connu concernant les implications régulatrices dans les réponses immunitaires. Pour se faire, nous avons effectué des analyses du niveau de transcription des gènes ainsi que du profilage épigénétique de cellules dendritiques (DCs) humaines. Ceux-ci ont été faits avant et après infection par le pathogène Mycobacterium tuberculosis (MTB). Nos résultats fournissent le premier portrait génomique du remodelage épigénétique survenant dans les DCs en réponse à une infection bactérienne. Nous avons constaté que les changements dans la méthylation de l'ADN sont omniprésents, identifiant 3,926 régions différentiellement méthylées lors des infections par MTB (MTB-RDMs). Les MTB-RDMs montrent un chevauchement frappant avec les régions génomiques marquées par les histones associées avec des régions amplificatrices. De plus, nos analyses ont révélées que les MTB-RDMs sont activement liées par des facteurs de transcription associés à l'immunité avant même d'être infecté par MTB, suggérant ces domaines comme étant des éléments d'activation dans un état de dormance. Nos données suggèrent que les changements actifs dans la méthylation jouent un rôle essentiel pour contrôler la réponse cellulaire des DCs à l'infection bactérienne. / DNA methylation is an important epigenetic mark in mammals. 
Although methylation at the 5’ position of cytosine (5mC) is recognized as a stable epigenetic modification, it is becoming increasingly viewed as a more dynamic process that involves both active methylation and demethylation pathways. While the dynamics of DNA methylation have been well characterized in mammalian development and normal cellular function, little is known about its regulatory implications in immune responses. To that end, we performed comprehensive transcriptional and epigenetic profiling of primary dendritic cell (DC) samples from humans, before and after infection with Mycobacterium tuberculosis (MTB). Our results provide the first complete genomic portrait of the extensive epigenetic remodeling occurring in primary DCs in response to a bacterial infection. We found that active changes in DNA methylation are pervasive, identifying 3,926 MTB-induced differentially methylated regions (MTB-DMRs). MTB-DMRs show a striking overlap with genomic regions marked by histones associated with enhancer activity. ATAC-seq footprinting analysis revealed that regions that change methylation were actively bound by immune-related TFs prior to MTB infection, suggesting that these domains are likely to represent enhancer elements in a poised state. Our data suggest that active changes in DNA methylation play an essential and previously unappreciated role in controlling the regulatory programs engaged by DCs in response to a bacterial infection. Epigenetics DNA methylation Chromatin dynamics Enhancers Bacterial infection Inflammation Mycobacterium tuberculosis Epigénétique Méthylation de l'ADN Modifications des histones Dynamique de la chromatine Régions amplificatrices Infection bactérienne Inflammation Bacille de Koch
326	Analyse transcriptomique et applications en développement préclinique des médicaments El-Hachem, Nehme 12 1900 (has links) L’émergence des Mégadonnées (« Big Data ») en biologie moléculaire, surtout à travers la transcriptomique, a révolutionné la façon dont nous étudions diverses disciplines telles que le processus de développement du médicament ou la recherche sur le cancer. Ceci fut associé à un nouveau concept, la médecine de précision, dont le principal but est de comprendre les mécanismes moléculaires entraînant une meilleure réponse thérapeutique chez le patient. Cette thèse est à mi-chemin entre les études pharmaco — et toxicogénomiques expérimentales, et les études cliniques et translationnelles. Le but de cette thèse est surtout de montrer le potentiel et les limites de ces jeux de données et leur pertinence pour la découverte de biomarqueurs de réponse ainsi que la compréhension des mécanismes d’action/toxicité de médicaments, en vue d’utiliser ces informations à des fins thérapeutiques. L’originalité de cette thèse réside dans son approche globale pour analyser les plus larges jeux de données pharmaco/toxicogénomiques publiés à ce jour et ceci pour : 1) Aborder la notion de biomarqueurs de réponse aux médicaments en pharmacogénomique du cancer, en étudiant les facteurs discordants entre deux grandes études publiées en 2012; 2) Comprendre le mécanisme d’action des médicaments et construire une taxonomie performante en utilisant une approche intégrative; et 3) Créer un répertoire toxicogénomique à partir des hépatocytes humains, exposés à différentes classes de médicaments et composés chimiques. 
Mes contributions principales sont les suivantes : • J’ai développé une approche bioinformatique pour étudier les facteurs discordants entre deux grandes études pharmacogénomiques et suggérées que les différences observées émergeaient plutôt de l’absence de standardisation des mesures pharmacologiques qui pourrait limiter la validation de biomarqueurs de réponse aux médicaments. • J’ai implémenté une approche bioinformatique qui montre la supériorité de l’intégration tenant en compte des différents paramètres pour les médicaments (structure, cytotoxicité, perturbation du transcriptome) afin d’élucider leur mécanisme d’action (MoA). • J’ai développé un pipeline bioinformatique pour étudier le niveau de conservation des mécanismes moléculaires entre les études toxicogénomiques in vivo et in vitro démontrant que les hépatocytes humains sont un modèle fiable pour détecter les produits toxiques hépatocarcinogènes. Au total, nos études ont permis de fournir un cadre de travail original pour l’exploitation de différents types de données transcriptomiques pour comprendre l’impact des produits chimiques sur la biologie cellulaire. / The emergence of Big Data in molecular biology, especially through the study of transcriptomics, has revolutionized the way we look at various disciplines, such as drug development and cancer research. Big data analysis is an important part of the concept of precision medicine, whose primary purpose is to understand the molecular mechanisms leading to better therapeutic response in patients. This thesis sits halfway between experimental pharmaco-toxicogenomic studies and clinical and translational studies. The aim of this thesis is mainly to show the potential and limitations of these studies and their relevance, especially for the discovery of drug response biomarkers and for understanding drug mechanisms (targets, toxicities). 
This thesis is an original work since it proposes a global approach to analyzing the largest pharmaco-toxicogenomic datasets available to date. The key aims were: 1) Addressing the challenge of reproducibility for biomarker discovery in cancer pharmacogenomics, by comparing two large pharmacogenomics studies published in 2012; 2) Understanding drug mechanisms of action using an integrative approach to generate a superior drug taxonomy; and 3) Evaluating the conservation of toxicogenomic responses in primary hepatocytes vs. in vivo liver samples in order to check the feasibility of cell models in toxicology studies. My main contributions can be summarized as follows: - I developed a bioinformatics pipeline to study the factors that trigger (in)consistency between two major pharmacogenomic studies. I suggested that the observed differences emerged from the non-standardization of pharmacological measurements, which could limit the validation of drug response biomarkers. - I implemented a bioinformatics pipeline that demonstrated the superiority of the integrative approach, since it takes into account different parameters for the drug (structure, cytotoxicity, transcriptional perturbation) to elucidate the mechanism of action (MoA). - I developed a bioinformatics pipeline to study the level of conservation of toxicity mechanisms between the in vivo and in vitro systems, showing that human hepatocytes are a reliable model for hepatocarcinogen testing. Overall, our studies have provided a unique framework to leverage various types of transcriptomic data in order to understand the impact of chemicals on cell biology. Transcriptomique médicaments mécanisme d’action toxicité pharmaco-toxicogénomique biomarqueurs de réponse Transcriptomics bioinformatics chemical compounds response biomarkers mechanism of action toxicity cell lines microarrays pharmaco-toxicogenomics lignée cellulaire microarrays bioinformatique médicaments
327	Within-host evolution of HIV-1 and the analysis of transmissible diversity English, Suzanne Elizabeth January 2012 (has links) The central problem for researchers of HIV-1 evolution is explaining the apparent design of the virus for causing pandemic infection in humans: understanding how HIV-1 spreads is key to halting the pandemic. Current knowledge of how HIV-1 spreads from host to host is based upon experimental observation and indirect inferences informed by theory. The hypothesis of this thesis is that diversity of HIV-1 around the time of transmission is important for viral adaptation to a new human host, rather than intrinsic superiority of particular strains found in infectious fluids from human donor hosts, and that studying recombination is important for understanding this behaviour. To demonstrate the apparent randomness of transmission, I test the null-hypothesis that hard selection accounts for between-host viral divergence in a rare case study of contemporaneous infection. I explain how the experimental data that I have generated and the analyses I have carried out address certain basic assumptions and predictions about HIV-1 transmission and may inform current strategies for vaccine design. Specifically, my approach contributes to the current literature on HIV-1, by investigating an alternative hypothesis to the single virion theory of sexual transmission and by characterizing the role of recombination in a pseudodiploid virus following multiple-infection. 616.979201
328	MODELING HETEROTACHY IN PHYLOGENETICS Zhou, Yan 04 1900 (has links) Il a été démontré que l’hétérotachie, variation du taux de substitutions au cours du temps et entre les sites, est un phénomène fréquent au sein de données réelles. Échouer à modéliser l’hétérotachie peut potentiellement causer des artéfacts phylogénétiques. Actuellement, plusieurs modèles traitent l’hétérotachie : le modèle à mélange des longueurs de branche (MLB) ainsi que diverses formes du modèle covarion. Dans ce projet, notre but est de trouver un modèle qui prenne efficacement en compte les signaux hétérotaches présents dans les données, et ainsi améliorer l’inférence phylogénétique. Pour parvenir à nos fins, deux études ont été réalisées. Dans la première, nous comparons le modèle MLB avec le modèle covarion et le modèle homogène grâce aux test AIC et BIC, ainsi que par validation croisée. A partir de nos résultats, nous pouvons conclure que le modèle MLB n’est pas nécessaire pour les sites dont les longueurs de branche diffèrent sur l’ensemble de l’arbre, car, dans les données réelles, le signaux hétérotaches qui interfèrent avec l’inférence phylogénétique sont généralement concentrés dans une zone limitée de l’arbre. Dans la seconde étude, nous relaxons l’hypothèse que le modèle covarion est homogène entre les sites, et développons un modèle à mélanges basé sur un processus de Dirichlet. Afin d’évaluer différents modèles hétérogènes, nous définissons plusieurs tests de non-conformité par échantillonnage postérieur prédictif pour étudier divers aspects de l’évolution moléculaire à partir de cartographies stochastiques. Ces tests montrent que le modèle à mélanges covarion utilisé avec une loi gamma est capable de refléter adéquatement les variations de substitutions tant à l’intérieur d’un site qu’entre les sites. Notre recherche permet de décrire de façon détaillée l’hétérotachie dans des données réelles et donne des pistes à suivre pour de futurs modèles hétérotaches. 
Les tests de non conformité par échantillonnage postérieur prédictif fournissent des outils de diagnostic pour évaluer les modèles en détails. De plus, nos deux études révèlent la non spécificité des modèles hétérogènes et, en conséquence, la présence d’interactions entre différents modèles hétérogènes. Nos études suggèrent fortement que les données contiennent différents caractères hétérogènes qui devraient être pris en compte simultanément dans les analyses phylogénétiques. / Heterotachy, substitution rate variation across sites and time, has been shown to be a frequent phenomenon in real data. Failure to model heterotachy could potentially cause phylogenetic artefacts. Currently, several models handle heterotachy: the mixture branch length (MBL) model and several variant forms of the covarion model. In this project, our objective is to find a model that efficiently handles heterotachous signals in the data, and thereby improves phylogenetic inference. In order to achieve our goal, two individual studies were conducted. In the first study, we compare the MBL, covarion and homotachous models using AIC, BIC and cross validation. Based on our results, we conclude that the MBL model, in which sites have different branch lengths along the entire tree, is an over-parameterized model. Real data indicate that the heterotachous signals which interfere with phylogenetic inference are generally limited to a small area of the tree. In the second study, we relax the assumption of the homogeneity of the covarion parameters over sites, and develop a mixture covarion model using a Dirichlet process. In order to evaluate different heterogeneous models, we design several posterior predictive discrepancy tests to study different aspects of molecular evolution using stochastic mappings. The posterior predictive discrepancy tests demonstrate that the covarion mixture +Γ model is able to adequately model the substitution variation within and among sites. 
Our research permits a detailed view of heterotachy in real datasets and gives directions for future heterotachous models. The posterior predictive discrepancy tests provide diagnostic tools to assess models in detail. Furthermore, both of our studies reveal the non-specificity of heterogeneous models. Our studies strongly suggest that different heterogeneous features in the data should be handled simultaneously. Heterotachy Hétérotachie covarion covarion MBL MLB posterior predictive postérieur prédictif non-specificity non-spécificité discrepancy non-conformité heterogeneity hétérogénéité AIC AIC BIC BIC cross validation validation croisée
329	Approches algorithmiques pour l’inférence d’histoires de duplication en tandem avec inversions et délétions pour des familles multigéniques Lajoie, Mathieu 08 1900 (has links) [Français] Une fraction importante des génomes eucaryotes est constituée de Gènes Répétés en Tandem (GRT). Un mécanisme fondamental dans l’évolution des GRT est la recombinaison inégale durant la méiose, entrainant la duplication locale (en tandem) de segments chromosomiques contenant un ou plusieurs gènes adjacents. Différents algorithmes ont été proposés pour inférer une histoire de duplication en tandem pour un cluster de GRT. Cependant, leur utilisation est limitée dans la pratique, car ils ne tiennent pas compte d’autres événements évolutifs pourtant fréquents, comme les inversions, les duplications inversées et les délétions. Cette thèse propose différentes approches algorithmiques permettant d’intégrer ces événements dans le modèle de duplication en tandem classique. Nos contributions sont les suivantes: • Intégrer les inversions dans un modèle de duplication en tandem simple (duplication d’un gène à la fois) et proposer un algorithme exact permettant de calculer le nombre minimal d’inversions s’étant produites dans l’évolution d’un cluster de GRT. • Généraliser ce modèle pour l’étude d’un ensemble de clusters orthologues dans plusieurs espèces. • Proposer un algorithme permettant d’inférer l’histoire évolutive d’un cluster de GRT en tenant compte des duplications en tandem, duplications inversées, inversions et délétions de segments chromosomiques contenant un ou plusieurs gènes adjacents. / [English] Tandemly arrayed genes (TAGs) represent an important fraction of most genomes. A fundamental mechanism at the origin of TAG clusters is unequal crossing-over during meiosis, leading to the duplication of chromosomal segments containing one or many adjacent genes. 
Such duplications are called tandem duplications, as the duplicated segment is placed next to the original one on the chromosome. Different algorithms have been proposed to infer the tandem duplication history of a TAG cluster. However, their applicability is limited in practice since they do not take into account other frequent evolutionary events such as inversion, inverted duplication and deletion. In this thesis, we propose different algorithmic approaches that integrate these evolutionary events into the original tandem duplication model of evolution. Our contributions are summarized as follows: • We integrate inversion events in a tandem duplication model restricted to single gene duplications, and we propose an exact algorithm that computes the minimum number of inversions explaining the evolution of a TAG cluster. • We generalize this model to the study of orthologous TAG clusters in different species. • We propose an algorithm that infers the evolutionary history of a TAG cluster through tandem duplication, inverted duplication, inversion and deletion of chromosomal segments containing one or many adjacent genes. arbre de duplication arbre de gènes duplication inversée famille de gènes médiane perte de gène réarrangement génomique réconciliation duplication tree gene tree inverted duplication gene family median gene loss genomic rearrangement reconciliation
330	Statistical potentials for evolutionary studies Kleinman, Claudia L. 06 1900 (has links) Les séquences protéiques naturelles sont le résultat net de l’interaction entre les mécanismes de mutation, de sélection naturelle et de dérive stochastique au cours des temps évolutifs. Les modèles probabilistes d’évolution moléculaire qui tiennent compte de ces différents facteurs ont été substantiellement améliorés au cours des dernières années. En particulier, ont été proposés des modèles incorporant explicitement la structure des protéines et les interdépendances entre sites, ainsi que les outils statistiques pour évaluer la performance de ces modèles. Toutefois, en dépit des avancées significatives dans cette direction, seules des représentations très simplifiées de la structure protéique ont été utilisées jusqu’à présent. Dans ce contexte, le sujet général de cette thèse est la modélisation de la structure tridimensionnelle des protéines, en tenant compte des limitations pratiques imposées par l’utilisation de méthodes phylogénétiques très gourmandes en temps de calcul. Dans un premier temps, une méthode statistique générale est présentée, visant à optimiser les paramètres d’un potentiel statistique (qui est une pseudo-énergie mesurant la compatibilité séquence-structure). La forme fonctionnelle du potentiel est par la suite raffinée, en augmentant le niveau de détails dans la description structurale sans alourdir les coûts computationnels. Plusieurs éléments structuraux sont explorés : interactions entre pairs de résidus, accessibilité au solvant, conformation de la chaîne principale et flexibilité. Les potentiels sont ensuite inclus dans un modèle d’évolution et leur performance est évaluée en termes d’ajustement statistique à des données réelles, et contrastée avec des modèles d’évolution standards. 
Finalement, le nouveau modèle structurellement contraint ainsi obtenu est utilisé pour mieux comprendre les relations entre niveau d’expression des gènes et sélection et conservation de leur séquence protéique. / Protein sequences are the net result of the interplay of mutation, natural selection and stochastic variation. Probabilistic models of molecular evolution accounting for these processes have been substantially improved over the last years. In particular, models that explicitly incorporate protein structure and site interdependencies have recently been developed, as well as statistical tools for assessing their performance. Despite major advances in this direction, only simple representations of protein structure have been used so far. In this context, the main theme of this dissertation has been the modeling of three-dimensional protein structure for evolutionary studies, taking into account the limitations imposed by computationally demanding phylogenetic methods. First, a general statistical framework for optimizing the parameters of a statistical potential (an energy-like scoring system for sequence-structure compatibility) is presented. The functional form of the potential is then refined, increasing the detail of structural description without inflating computational costs. Always at the residue-level, several structural elements are investigated: pairwise distance interactions, solvent accessibility, backbone conformation and flexibility of the residues. The potentials are then included into an evolutionary model and their performance is assessed in terms of model fit, compared to standard evolutionary models. Finally, this new structurally constrained phylogenetic model is used to better understand the selective forces behind the differences in conservation found in genes of very different expression levels. 
Évolution moléculaire structure des protéines Markov chain Monte Carlo maximum de vraisemblance statistique Bayesienne potentiels statistiques molecular evolution protein structure Markov chain Monte Carlo maximum likelihood Bayesian statistics statistical potentials
