1 |
Comparative genomics of microsatellite abundance: a critical analysis of methods and definitions
Jentzsch, Iris Miriam Vargas, January 2009
This PhD dissertation is focused on short tandemly repeated nucleotide patterns which
occur extremely often across DNA sequences, called microsatellites. The main characteristic
of microsatellites, and probably the reason why they are so abundant across genomes, is the
extremely high frequency of specific replication errors occurring within their sequences,
which usually cause addition or deletion of one or more complete tandem repeat units. Due
to these errors, frequent fluctuations in the number of repetitive units can be observed
among cellular and organismal generations. The molecular mechanisms as well as the
consequences of these microsatellite mutations, both on a generational and on an
evolutionary scale, have sparked debate and controversy in the scientific community.
Furthermore, the bioinformatic approaches used to study microsatellites and the ways
microsatellites are referred to in the general literature are often not rigorous, leading to
misinterpretations and inconsistencies among studies. As an introduction to this complex
topic, in Chapter I, I present a review of the knowledge accumulated on microsatellites
during the past two decades. A major part of this chapter has been published in the
Encyclopedia of Life Sciences in a chapter about microsatellite evolution (see Publication 1
in Appendix II).
The ongoing controversy about the rates and patterns of microsatellite mutation was
evident to me even before I started this PhD thesis. However, the subtler problems inherent
in the computational analyses of microsatellites within genomes only became apparent
when retrieving information on microsatellite distribution and abundance for the design of
comparative genomic analyses. There are numerous publications analyzing the
microsatellite content of genomes but, in most cases, the results presented can neither be
reliably compared nor reproduced, mainly due to the lack of details on the microsatellite
search process (particularly the program’s algorithm and the search parameters used) and
because the results are expressed in terms that are relative to the search process (i.e.
measures based on the absolute number of microsatellites). Therefore, in Chapter II, I
present a critical review of all available software tools designed to scan DNA sequences for
microsatellites. My aim in undertaking this review was to assess the comparability of search
results among microsatellite programs, and to identify the programs most suitable for the
generation of microsatellite datasets for a thorough and reproducible comparative analysis
of microsatellite content among genomic sequences. Using sequence data where the
number and types of microsatellites were empirically known, I compared the ability of 19
programs to accurately identify and report microsatellites. I then chose the two programs
which, based on the algorithm and its parameters as well as the informativeness of the output,
offered the information most suitable for biological interpretation, while also reflecting as
closely as possible the microsatellite content of the test files.
From the analysis of microsatellite search results generated by the various programs
available, it became apparent that the program’s search parameters, which are specified by
the user in order to define the microsatellite characteristics to the program, dramatically
influence the resulting datasets. This is especially true for programs that allow
imperfections within tandem repeats, because imperfect repetitions cannot be defined
as accurately as perfect ones, and because several different algorithms have
been proposed to address this problem. The detection of approximate microsatellites is,
however, essential for the study of microsatellite evolution and for comparative analyses
based on microsatellites. It is now well accepted that small deviations from perfect tandem
repeat structure are common within microsatellites and larger repeats, and a number of
different algorithms have been developed to confront the challenge of finding and
registering microsatellites with all expectable kinds of imperfection. However, biologists
have yet to apply these tools to their full potential. In biological analyses, single tandem
repeat hits are consistently interpreted as isolated and independent repeats. This
interpretation also depends on the search strategy used to report the microsatellites in DNA
sequences and, therefore, I was particularly interested in the capacity of repeat finding
programs to report imperfect microsatellites in a way that allows biologically meaningful
interpretation. After analyzing a series of tandem repeat finding programs, I optimized my
microsatellite searches to yield the best possible datasets for assessing and comparing the
degree of imperfection of microsatellites among different genomes (Chapter III).
In the program comparisons performed in Chapter II, I show that the most critical
search parameter influencing microsatellite search results is the minimum length threshold.
Biologically speaking, there is no consensus with respect to the minimum length beyond
which a short tandem repeat is expected to become prone to microsatellite-like mutations.
Usually, a single absolute value of ~12 nucleotides is assigned irrespective of motif length.
In other cases, thresholds are assigned in terms of number of repeat units (i.e. 3 to 5 repeats
or more), which are better applied individually for each motif. The variation in these
thresholds is considerable and not always justifiable. In addition, any current minimum
length measures are likely naïve because it is clear that different microsatellite motifs
undergo replication slippage at different length thresholds. Therefore, in Chapter III, I apply
two probabilistic models to predict the minimum length at which microsatellites of varying
motif types become overrepresented in different genomes based on the individual
oligonucleotide frequency data of these genomes.
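As a minimal illustration of how strongly the minimum length threshold shapes a microsatellite dataset, the following Python sketch scans a sequence for perfect tandem repeats only. It is not one of the programs reviewed in the dissertation; the function name, the parameter names (`max_motif_len`, `min_units`, `min_len`) and the test sequence are assumptions made for this example, and imperfect repeats are deliberately ignored.

```python
import re

def find_perfect_microsatellites(seq, max_motif_len=6, min_units=3, min_len=12):
    """Return sorted (start, end, motif, units) tuples for perfect tandem repeats.

    Hypothetical helper for illustration: a hit needs at least `min_units`
    copies of its motif and must span at least `min_len` nucleotides.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif_len + 1):
        # One captured motif of length k followed by at least (min_units - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_units - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ACAC" is
            # already found as "AC" when k = 2).
            if any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)):
                continue
            length = m.end() - m.start()
            if length >= min_len:
                hits.append((m.start(), m.end(), motif, length // k))
    return sorted(hits)

# Invented test sequence: (AC)6, an A8 run and (CAG)3 separated by spacers.
test_seq = "GGT" + "AC" * 6 + "TTG" + "A" * 8 + "CGT" + "CAG" * 3 + "TC"

strict = find_perfect_microsatellites(test_seq, min_len=12)
relaxed = find_perfect_microsatellites(test_seq, min_len=8)
print(len(strict), len(relaxed))
```

On this toy sequence, lowering the absolute threshold from 12 to 8 nucleotides changes the number of reported loci, which is the kind of parameter dependence that makes raw microsatellite counts hard to compare among studies.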
Finally, after a range of optimizations and critical analyses, I performed a preliminary
analysis of microsatellite abundance among 24 high-quality complete eukaryotic genomes,
also including 8 prokaryotic and 5 archaeal genomes for contrast. The availability of the
methodologies and the microsatellite datasets generated in this project will allow informed
formulation of questions for more specific genome research, either about microsatellites, or
about other genomic features microsatellites could influence. These datasets are what I
would have needed at the beginning of my PhD to support my experimental design, and are
essential for the adequate interpretation of microsatellite data in the context of the
major evolutionary units: chromosomes and genomes.
|
2 |
Evolution and applications of pine microsatellites
Karhu, A. (Auli), 27 February 2001
Abstract
The evolution of microsatellites was studied within and between pine species. Sequences showed that microsatellites
do not necessarily mutate in a stepwise fashion and that size homoplasy is common due to flanking sequence and repeat area
changes within and between the species. Thus, some assumptions of statistical methods based on changes in repeat numbers may
not hold.
Sequences from cross-species amplifications revealed evidence of duplications of microsatellite loci in pines. On two
independent occasions, the repeat area of the microsatellite had undergone a rapid expansion during the last 10-25 million
years.
Microsatellite markers were used together with other molecular markers (allozymes, RFLPs, RAPDs, rDNA RFLPs) and an
adaptive trait (date of bud set) to study patterns of genetic variation in Scots pine (Pinus sylvestris)
in Finland. All molecular markers showed a high level of within-population variation, while differentiation among populations
was low (FST = 0.02). Of the total variation in bud set, 36.4 % was found among the populations which
experience a steep climatic gradient. Thus, the markers applied were poor predictors of population differentiation of the
quantitative trait studied.
The distribution of genetic variation was studied in five natural populations of radiata pine (Pinus
radiata), a species which has gone through bottlenecks in the past. Null allele frequencies were estimated and used
in later analyses. Microsatellites showed a high level of variability within populations (He =
0.68-0.77). Allele length distributions and average number of alleles per locus showed some traces of bottlenecks. In contrast,
comparison of observed genetic diversities and expected diversities suggested post-bottleneck expansion of populations.
Genetic differentiation (FST and RST) among populations was over 10 %,
reflecting the situation in the isolated radiata pine populations.
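For readers unfamiliar with the diversity measure quoted above, the sketch below shows one common way to compute it, assuming that He denotes expected heterozygosity (gene diversity), i.e. one minus the sum of squared allele frequencies, optionally with the usual small-sample correction n/(n-1). The function name and the allele sizes are invented for illustration and are not data from the thesis.

```python
from collections import Counter

def expected_heterozygosity(alleles, unbiased=True):
    """Expected heterozygosity at one locus from a list of sampled allele sizes.

    Hypothetical helper: He = 1 - sum(p_i^2), multiplied by n / (n - 1)
    when `unbiased` is True and more than one gene copy was sampled.
    """
    counts = Counter(alleles)
    n = sum(counts.values())
    he = 1.0 - sum((c / n) ** 2 for c in counts.values())
    if unbiased and n > 1:
        he *= n / (n - 1)
    return he

# Four sampled gene copies with allele sizes in base pairs (made-up values).
sample = [150, 150, 152, 154]
print(expected_heterozygosity(sample, unbiased=False))
```

Averaging this quantity over loci gives a per-population value of the kind reported in the abstract; null alleles, which the abstract mentions, are not handled here.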
Using microsatellites and a newly developed Bayesian method, individual inbreeding coefficients were estimated in five
populations of radiata pine. Most individuals were outbred while some were selfed. Presumably, in ancestral radiata pine
populations the recessive deleterious alleles have been eliminated after bottlenecks and the mating system has changed as a
consequence.
|
3 |
An Investigation of Links Between Simple Sequences and Meiotic Recombination Hotspots
Bagshaw, Andrew Tobias Matthew, January 2008
Previous evidence has shown that the simple sequences microsatellites and poly-purine/poly-pyrimidine tracts (PPTs) could be both a cause, and an effect, of meiotic recombination. The causal link between simple sequences and recombination has not been much explored, however, probably because other evidence has cast doubt on its generality, though this evidence has never been conclusive. Several questions have remained unanswered in the literature, and I have addressed aspects of three of them in my thesis. First, what is the scale and magnitude of the association between simple sequences and recombination? I found that microsatellites and PPTs are strongly associated with meiotic double-strand break (DSB) hotspots in yeast, and that PPTs are generally more common in human recombination hotspots, particularly in close proximity to hotspot central regions, in which recombination events are markedly more frequent. I also showed that these associations can't be explained by coincidental mutual associations between simple sequences, recombination and other factors previously shown to correlate with both. A second question not conclusively answered in the literature is whether simple sequences, or their high levels of polymorphism, are an effect of recombination. I used three methods to address this question. Firstly, I investigated the distributions of two-copy tandem repeats and short PPTs in relation to yeast DSB hotspots in order to look for evidence of an involvement of recombination in simple sequence formation. I found no significant associations. Secondly, I compared the fraction of simple sequences containing polymorphic sites between human recombination hotspots and coldspots. The third method I used was generalized linear model analysis, with which I investigated the correlation between simple sequence variation and recombination rate, and the influence on the correlation of additional factors with potential relevance including GC-content and gene density. 
Both the direct comparison and correlation methods showed a very weak and inconsistent effect of recombination on simple sequence polymorphism in the human genome. Whether simple sequences are an important cause of recombination events is a third question that has received relatively little previous attention, and I have explored one aspect of it. Simple sequences of the types I studied have previously been shown to form non-B-DNA structures, which can be recombinagenic in model systems. Using a previously described sodium bisulphite modification assay, I tested for the presence of these structures in sequences amplified from the central regions of hotspots and cloned into supercoiled plasmids. I found significantly higher sensitivity to sodium bisulphite in humans than in chimpanzees in three out of six genomic regions in which there is a hotspot in humans but none in chimpanzees. In the DNA2 hotspot, this correlated with a clear difference in numbers of molecules showing long contiguous strings of converted cytosines, which are present in previously described intramolecular quadruplex and triplex structures. Two out of the five other hotspots tested showed evidence for secondary structure comparable to a known intramolecular triplex, though with similar patterns in humans and chimpanzees. In conclusion, my results clearly motivate further investigation of a functional link between simple sequences and meiotic recombination, including the putative role of non-B-DNA structures.
|
5 |
Statistical inference in population genetics using microsatellites
Csilléry, Katalin, January 2009
Statistical inference from molecular population genetic data is currently a very active area of research for two main reasons. First, in the past two decades an enormous amount of molecular genetic data has been produced and the amount of data is expected to grow even more in the future. Second, drawing inferences about complex population genetics problems, for example understanding the demographic and genetic factors that shaped modern populations, poses a serious statistical challenge. Amongst the many different kinds of genetic data that have appeared in the past two decades, the highly polymorphic microsatellites have played an important role. Microsatellites revolutionized the population genetics of natural populations, and were the initial tool for linkage mapping in humans and other model organisms. Despite their important role, and extensive use, the evolutionary dynamics of microsatellites are still not fully understood, and the statistical methods applied to them are often underdeveloped and do not adequately model microsatellite evolution. In this thesis, I address some aspects of this problem by assessing the performance of existing statistical tools, and developing some new ones. My work encompasses a range of statistical methods from simple hypothesis testing to more recent, complex computational statistical tools. This thesis consists of four main topics. First, I review the statistical methods that have been developed for microsatellites in population genetics applications. I review the different models of the microsatellite mutation process, and ask which models are the most supported by data, and how models were incorporated into statistical methods. I also present estimates of mutation parameters for several species based on published data. Second, I evaluate the performance of estimators of genetic relatedness using real data from five vertebrate populations. 
I demonstrate that the overall performance of marker-based pairwise relatedness estimators mainly depends on the population relatedness composition and may only be improved by the marker data quality within the limits of the population relatedness composition. Third, I investigate the different null hypotheses that may be used to test for independence between loci. Using simulations I show that testing for statistical independence (i.e. zero linkage disequilibrium, LD) is difficult to interpret in most cases, and that a null hypothesis which accounts for the “background LD” due to finite population size should be tested instead. I investigate the utility of a novel approximate testing procedure to circumvent this problem, and illustrate its use on a real data set from red deer. Fourth, I explore the utility of Approximate Bayesian Computation, inference based on summary statistics, to estimate demographic parameters from admixed populations. Assuming a simple demographic model, I show that the choice of summary statistics greatly influences the quality of the estimation, and that different parameters are better estimated with different summary statistics. Most importantly, I show how the estimation of most admixture parameters can be considerably improved via the use of linkage disequilibrium statistics from microsatellite data.
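The idea of Approximate Bayesian Computation mentioned in the fourth topic can be sketched with a rejection sampler on a deliberately simplified toy problem. The model below is an assumption made for this example, not the demographic model of the thesis: an admixed population draws each gene copy from source 1 with probability `a` and from source 2 otherwise, and the only summary statistic is the frequency of one allele. All names and numbers are hypothetical.

```python
import random

def abc_admixture(obs_freq, p1, p2, n_genes, n_sims=5000, tol=0.02, seed=1):
    """Rejection ABC for an admixture proportion under a toy one-locus model.

    Draw `a` from a uniform prior, simulate the allele frequency in a sample
    of `n_genes` copies, and keep `a` when the simulated summary statistic
    lies within `tol` of the observed one.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_sims):
        a = rng.random()                      # prior draw, uniform on [0, 1)
        p = a * p1 + (1.0 - a) * p2           # allele frequency after admixture
        sim = sum(rng.random() < p for _ in range(n_genes)) / n_genes
        if abs(sim - obs_freq) <= tol:        # compare summary statistics
            accepted.append(a)
    return accepted

# Source frequencies 0.9 and 0.1; an observed frequency of 0.66 corresponds
# to a = 0.7 in the absence of sampling noise.
posterior = abc_admixture(obs_freq=0.66, p1=0.9, p2=0.1, n_genes=100)
print(len(posterior), sum(posterior) / len(posterior))
```

With a single summary statistic the accepted values remain fairly spread out; adding further statistics, such as the linkage disequilibrium measures the abstract refers to, is the kind of choice that can sharpen the approximation.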
|