1 |
From the inside out: determining sequence conservation within the context of relative solvent accessibility
Scherrer, Michael Paul 17 October 2013
Evolutionary rates vary vastly across intraspecific genes, and the determinants of these rates are of central concern to the field of comparative genomics. Tradition has held that preservation of protein function conserves the sequence; however, mounting evidence implicates the biophysical properties of proteins themselves as the elements that constrain sequence evolution. Of these properties, the exposure of a residue to solvent is the most prevalent determinant of its evolutionary rate, owing to pressures to maintain proper synthesis and folding of the structure. In this work, we have developed a model that considers the microenvironment of a residue in the estimation of its evolutionary rate. By working within the structural context of a protein's residues, we show that our model is better able to capture the overall evolutionary trends affecting conservation of both the coding sequences and the protein structures, from a genomic level down to individual genes.
|
2 |
The origin of the Hox and ParaHox loci and animal homeobox evolution
Mendivil Ramos, Olivia January 2013
The homeobox superfamily is one of the most significant gene families in the evolution of developmental processes in animals. Within this superfamily the ANTP class has expanded exclusively in animals and, therefore, the reconstruction of its origin and diversification into the different ‘modern’ families has become a prominent question in the ‘evo-devo’ field. The current burgeoning availability of animal genome sequences is improving the resolution of these questions, putting them in a genome evolution context, as well as providing the field with a large, detailed and diverse catalogue of animal homeobox complements. Here I contribute a new hypothesis on the origin and evolution of the Hox and ParaHox loci and a new term, ghost loci, referring to homologous genome regions that have lost their homeobox genes. This hypothesis proposes that the last common ancestor of all animals had a much more complex genome (i.e. differentiated Hox, ParaHox and NK loci) that underwent a simplification in the early animal lineages of sponges and placozoans. In collaboration with the Adamska group I resolved the orthology of the first ever ParaHox genes reported in calcareous sponges. This finding serves as an independent confirmation of the ghost loci hypothesis and further resolves the events of secondary simplification within the sponge lineage. Finally, I have catalogued the homeobox complement of the newly sequenced arthropod, the myriapod Strigamia maritima, and examined the linkage and clustering of these genes. This has furthered our understanding of the evolution of the ANTP class.
The diversity of the homeobox complement in this myriapod and the retention of some homeobox genes not previously described within arthropods, in combination with the interesting phylogenetic position that this lineage occupies relative to other arthropods, make this complement an important point of reference for comparison within the arthropods and, in a broader perspective, within the ecdysozoans. These findings have provided significant further insights into the origin and evolution of the homeobox superfamily, with important implications for animal evolution and the evolution of development.
|
3 |
White dwarfs and the ages of open clusters
Jeffery, Elizabeth Jane 23 March 2011
Open clusters have long been objects of interest in astronomy. As a good approximation of essentially pure stellar populations, they have proved very useful for studies of a wide range of astrophysically interesting questions, including stellar evolution and atmospheres, the chemical and dynamical evolution of our Galaxy, and the structure of our Galaxy. Of fundamental importance to our understanding of open clusters, as well as many other questions in astrophysics, is the accurate determination of ages. Currently there are two main techniques for independently determining the ages of stellar populations: main sequence evolution theory (via cluster isochrones) and white dwarf cooling theory. Open clusters provide the ideal environment for the calibration of these two important clocks, as well as the unique opportunity to directly compare and refine our understanding of both theories. Here I present a photometric study of six open clusters, including both ground-based data and new, deep photometric data from the Hubble Space Telescope. From the former I derive main sequence turn-off ages, while the latter are used to search for faint cluster white dwarfs. From these data I measure a white dwarf age for each cluster and directly compare these ages with the main sequence turn-off ages. For this analysis I employ a new Bayesian statistical technique that has been developed by our group. Additionally, I use this new technique to explore the feasibility of a new method to determine cluster white dwarf ages from the hot (bright) white dwarfs alone, and I present its first successful application to the Hyades.
|
4 |
Characterization of an Evolutionarily Old Human Alphoid DNA
Carnahan, Susan L., Palamidis-Bourtsos, Eleni, Musich, Phillip R., Doering, Jeffrey L. 30 January 1993
A recently isolated human alphoid DNA (in plasmid pHH550) has been sequenced and found to have an exceptionally high degree of similarity to the human alphoid consensus sequence, while its component monomers are unusually heterogeneous in sequence. In contrast to other alphoid DNAs, this DNA is found in all primates tested. Thus this may be an evolutionarily old sequence similar to the one from which other human alphoid DNAs diverged. The pHH550 sequences are found on a number of human chromosomes, including 21 and 22. On chromosome 21 most members of this new sequence group are located distal to other alphoid DNAs.
|
5 |
Analysis And Predictions Of DNA Sequence Transformations On Grids
Joshi, Yadnyesh R 08 1900
Phylogenetics is the study of the evolution of organisms. Evolution occurs due to mutations of DNA sequences. The reasons behind these seemingly random mutations are largely unknown. There are many algorithms that build phylogenetic trees from DNA sequences. However, there are certain uncertainties associated with these phylogenetic trees. Fine-level analysis of these phylogenetic trees is both important and interesting for evolutionary biologists. In this thesis, we try to model the evolution of DNA sequences using cellular automata and to resolve the uncertainties associated with the phylogenetic trees. In particular, we determine the effect of neighboring DNA base pairs on the mutation of a base pair. A cellular automaton can be viewed as an array of cells that modifies itself in discrete time-steps according to a governing rule. The state of a cell at the next time-step depends on its current state and the states of its neighbors. We have used cellular automata rules for analysis and prediction of DNA sequence transformations on computational grids.
In the first part of the thesis, DNA sequence evolution is modeled as a cellular automaton with each cell having one of four possible states, corresponding to the four bases. Phylogenetic trees are explored in order to find the cellular automata rules that may have guided the evolution. A master-client paradigm is used to exploit the parallelism in the sequence transformation analysis. Load balancing and fault-tolerance techniques are developed to enable the execution of the explorations on grid resources. The analysis of the sequence transformations is used to resolve uncertainties associated with the phylogenetic trees, namely the intermediate sequences in the phylogenetic tree and the exact number of time-steps required for the evolution of a branch. The model is further used to compute various statistics, such as the most popular rules at a particular time-step in the evolutionary history of a branch in a phylogenetic tree. We have observed some interesting statistics regarding the unknown base pairs in the intermediate sequences of the phylogenetic tree and the most popular rules used for sequence transformations.
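The four-state cellular automaton described above can be illustrated with a minimal sketch. It assumes a neighborhood of one cell on each side, fixed end cells, and a rule stored as a lookup table; the abstract does not specify any of these choices, and the names `step`, `evolve`, and `toy_rule` are hypothetical.

```python
def step(seq, rule):
    """Apply one synchronous update to a DNA string.

    Each interior cell is rewritten from the triple (left, self, right).
    Triples missing from `rule` leave the cell unchanged, and the two end
    cells are kept fixed (an assumed boundary condition).
    """
    if len(seq) < 3:
        return seq
    middle = [
        rule.get((seq[i - 1], seq[i], seq[i + 1]), seq[i])
        for i in range(1, len(seq) - 1)
    ]
    return seq[0] + "".join(middle) + seq[-1]


def evolve(seq, rule, steps):
    """Return the list of sequences visited over `steps` time-steps."""
    history = [seq]
    for _ in range(steps):
        seq = step(seq, rule)
        history.append(seq)
    return history


# Hypothetical toy rule: a C flanked by A and G mutates to T.
toy_rule = {("A", "C", "G"): "T"}
history = evolve("ACGACG", toy_rule, 2)
```

Searching for the rules that may have guided a branch then amounts to finding rule tables and step counts for which `evolve` maps an ancestral sequence onto its descendant, which is the kind of independent exploration a master-client setup can distribute.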
The next part of the thesis deals with predicting future sequences from previous sequences. First, we try to identify the preserved sequences so that cellular automata rules can be applied selectively. Then, random strategies are developed as baseline benchmarks. A roulette wheel strategy is used for predicting future DNA sequences. Though the prediction strategies outperform the random benchmarks in most cases, the average performance improvement over the random strategies is not significant. The possible reasons are discussed.
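Roulette wheel selection, in its usual form, picks an item with probability proportional to a weight. The sketch below shows that generic technique only; how the thesis weights its rules is not stated in the abstract, so the rule names and counts used here are hypothetical.

```python
import random


def roulette_wheel(weights, rng):
    """Pick a key of `weights` with probability proportional to its value.

    `weights` maps items to non-negative numbers; `rng` is a random.Random
    instance so that runs can be made reproducible.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    spin = rng.random() * total
    cumulative = 0.0
    for item, weight in weights.items():
        cumulative += weight
        if spin < cumulative:
            return item
    return item  # guard against floating-point round-off on the last slot


# Hypothetical counts of how often each rule was seen in past transformations.
rule_counts = {"rule_a": 9, "rule_b": 1}
rng = random.Random(0)
draws = [roulette_wheel(rule_counts, rng) for _ in range(10000)]
share_a = draws.count("rule_a") / len(draws)
```

With these weights `rule_a` should be drawn in roughly nine out of ten spins, while a uniform random benchmark would pick each rule about half the time.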
|
6 |
Birds as a Model for Comparative Genomic Studies
Künstner, Axel January 2011
Comparative genomics provides a tool to investigate large biological datasets, i.e. genomic datasets. In my thesis I focused on inferring patterns of selection in coding and non-coding regions of avian genomes. Until recently, large comparative studies on selection were mainly restricted to model species with sequenced genomes. This limitation has been overcome with advances in sequencing technologies, and it is now possible to gather large genomic data sets for non-model species. Next-generation sequencing data were used to study patterns of nucleotide substitutions, and from this we inferred how selection has acted in the genomes of 10 non-model bird species. In general, we found evidence for a negative correlation between neutral substitution rate and chromosome size in birds. In a follow-up study, we investigated two closely related bird species to study expression levels in different tissues and patterns of selection. We found that between 2% and 18% of all genes were differentially expressed between the two species. We showed that non-coding regions adjacent to genes are under evolutionary constraint in birds, which suggests that non-coding DNA plays an important functional role in the genome. Regions downstream of genes (3’) showed a particularly high level of constraint. The level of constraint in these regions was not correlated with the length of untranslated regions, which suggests that other causes also play a role in sequence conservation. We compared the rate of nonsynonymous substitutions to the rate of synonymous substitutions in order to infer levels of selection in protein-coding sequences. Synonymous substitutions are often assumed to evolve neutrally. We studied synonymous substitutions by estimating constraint on 4-fold degenerate sites of avian genes and found significant evolutionary constraint on this category of sites (between 24% and 43%).
These results call for a reappraisal of synonymous substitution rates being used as neutral standards in molecular evolutionary analysis (e.g. the dN/dS ratio to infer positive selection). Finally, the problem of sequencing errors in next-generation sequencing data was investigated. We developed a program that removes erroneous bases from the reads. We showed that low coverage sequencing projects and large genome sequencing projects will especially gain from trimming erroneous reads.
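Removing erroneous bases from reads is often done by quality trimming. The sketch below shows one common variant, cutting low-quality bases from the 3' end of a read; it is not the program developed in the thesis, and the Phred+33 encoding, the threshold of 20, and the function names are assumptions made for illustration.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(bases, quality, min_q=20, offset=33):
    """Drop trailing bases until the last kept base has Phred >= min_q.

    Returns the trimmed (bases, quality) pair; both may be empty when the
    whole read is below the threshold.
    """
    if len(bases) != len(quality):
        raise ValueError("bases and quality must have the same length")
    scores = phred_scores(quality, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return bases[:end], quality[:end]


# 'I' encodes Phred 40 and '#' encodes Phred 2 under the Phred+33 convention.
trimmed = trim_3prime("ACGTAC", "IIII##")
```

In a low-coverage project few reads overlap each position, so an erroneous base is less likely to be outvoted by correct ones, which is one plausible reading of why such projects gain most from trimming.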
|
7 |
Modélisation des biais mutationnels et rôle de la sélection sur l’usage des codons [Modelling mutational biases and the role of selection on codon usage]
Laurin-Lemay, Simon 10 1900
The acquisition of genomic data continues to grow, as does the appetite to interpret them. But determining the processes that shaped the evolution of coding sequences (and their relative importance) is a scientific challenge that requires the development of statistical models of evolution that increasingly take into account heterogeneities in mutation and selection processes.
Identifying selection is a task that typically requires comparing two models: a null model that does not allow for an adaptive evolutionary regime and an alternative model that does. When a test between these two models rejects the null, adaptive evolution is considered to have been detected. The task is all the more difficult when the signal is weak and confounded with various heterogeneities neglected by the models.
The detection of selection on codon usage is controversial, particularly in Vertebrates. There are several reasons for this controversy: (1) there is a sociological bias towards seeing selection as the main driver of evolution, to such an extent that heterogeneities relating to mutation processes have historically been neglected; (2) according to the principles of population genetics, the small effective size of vertebrate populations limits the power of selection over synonymous mutations, which themselves confer only a minimal advantage; (3) on the other hand, selection on codon usage could be very localized along the coding sequences, at specific sites, subject to selective constraints related to DNA motifs used by the splicing machinery, for example.
Phylogenetic mutation-selection models are the preferred tools to address these issues, as they explicitly model mutation processes and selective constraints. All the heterogeneities neglected by the mutation-selection models of Yang and Nielsen [2008] can generate false positives, ranging from 20% (site-specific amino acid preference) to 100% (hypermutability of transitions in CpG context) [Laurin-Lemay et al., 2018b]. In particular, the hypermutability of transitions in the CpG context alone can explain the selection on codon usage detected by Yang and Nielsen [2008].
However, modelling phenomena that involve interdependencies in the data (e.g., hypermutability of the CpG context) greatly increases the complexity of the likelihood function. In addition, today’s sophisticated models require high-dimensional parameter vectors to model the heterogeneity of the processes studied, in our case selective constraints on the protein.
Approximate Bayesian Computation (ABC) is used to bypass the calculation of the likelihood function. This approach differs from the Markov Chain Monte Carlo (MCMC) sampling commonly used to approximate the posterior distribution. We explored the idea of combining these approaches for a specific problem involving high-dimensional parameters and new parameters that take into account dependencies between sites. Under certain conditions, when the high-dimensional parameters are weakly correlated with the new parameters of interest, it is possible to infer the high-dimensional parameters with the MCMC method, and then the parameters of interest using ABC. This new approach is called Conditional Approximate Bayesian Computation (CABC) [Laurin-Lemay et al., 2018a]. We were able to verify the effectiveness of the CABC method in a case study, namely the hypermutability of transitions in the CpG context within Eutheria [Laurin-Lemay et al., 2018a]. We find that 100% of the 137 genes tested have significant hypermutability of transitions. We have also shown that models incorporating hypermutability of transitions in CpG contexts predict a codon usage closer to that of the genes studied. This suggests that a significant part of codon usage can be explained by mutational processes alone, rather than by selection.
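The way ABC bypasses the likelihood can be seen in a minimal rejection-sampling sketch. This is plain ABC on a toy problem, not the CABC method of the thesis: it assumes a single parameter (a per-site mutation probability), a uniform prior, the number of mutated sites as the summary statistic, and hypothetical function names.

```python
import random


def simulate(p, n_sites, rng):
    """Simulate the number of mutated sites when each mutates with probability p."""
    return sum(rng.random() < p for _ in range(n_sites))


def abc_rejection(observed, n_sites, n_draws, tolerance, rng):
    """Keep prior draws whose simulated summary lands near the observed one.

    No likelihood is evaluated: a draw is accepted only if its simulated
    count is within `tolerance` of `observed`.
    """
    accepted = []
    for _ in range(n_draws):
        p = rng.random()  # draw from the assumed uniform prior on [0, 1)
        if abs(simulate(p, n_sites, rng) - observed) <= tolerance:
            accepted.append(p)
    return accepted


# Toy data: 30 mutated sites out of 100.
rng = random.Random(1)
posterior = abc_rejection(observed=30, n_sites=100, n_draws=5000, tolerance=2, rng=rng)
posterior_mean = sum(posterior) / len(posterior)
```

The accepted draws approximate the posterior, so `posterior_mean` should land close to 0.3 here. Rejection sampling of this kind becomes inefficient as the number of parameters grows, which is consistent with the motivation given above for handling the high-dimensional parameters by MCMC and reserving ABC for the few parameters of interest.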
Finally, we explore several avenues of research arising from our methodological developments: applying the detection of transition hypermutability in CpG contexts at the scale of Vertebrates; expanding the model to recognize contexts other than CpG alone (e.g., hypermutability of transitions and transversions in CpG and TpA contexts); and methodological perspectives for improving the performance of the CABC approach.
|