Paralog reduction, the loss of duplicate genes after whole genome duplication (WGD),
is a pervasive process. Whether this loss proceeds gene by gene or through deletion
of multi-gene DNA segments is controversial, as is the question of fractionation bias,
namely whether one homeologous chromosome is more vulnerable to gene deletion
than the other. As a null hypothesis, we first assume deletion events, on one homeolog
only, excise a geometrically distributed number of genes with unknown mean mu, and
these events combine to produce deleted runs of length l, distributed approximately
as a negative binomial with unknown parameter r, itself a random variable with
distribution pi(.). A biologically more realistic model requires deletion events on both
homeologs distributed as a truncated geometric. We simulate the distribution of run
lengths l in both models, as well as the underlying pi(r), as a function of mu, and
show how sampling l allows us to estimate mu. We apply this to data on a total of 15
genomes descended from 6 distinct WGD events and show how to correct the bias
towards shorter runs caused by genome rearrangements. Because of the difficulty in
deriving pi(.) analytically, we develop a deterministic recurrence to calculate each pi(r)
as a function of mu and the proportion of unreduced paralog pairs. This is based on a
computing formula containing nested sums. The parameter mu can be estimated based
on run lengths of single-copy regions. We then reduce the computing formulae, at least
in the one-sided case, to closed form. This virtually eliminates computing time due
to highly nested summations. We formulate a continuous version of the fractionation
process, deleting line segments of exponentially distributed lengths in analogy to
geometrically distributed numbers of genes. We derive nested integrals and discover that
the number of previously deleted regions to be skipped by a new deletion event is
exactly geometrically distributed. We undertook a large simulation experiment to
show how to discriminate between the gene-by-gene duplicate deletion model and the
deletion of a geometrically distributed number of genes. This revealed the importance
of the effects of genome size N, the mean of the geometric distribution, the progress
towards completion of the fractionation process, and whether the data are based on
runs of deleted genes or undeleted genes.
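
The one-sided null model described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the procedure used in the thesis: the function names (`geometric`, `simulate_fractionation`, `run_lengths`) are hypothetical, the chromosome is treated as linear, each event starts at a uniformly chosen retained gene, and an event skips over genes that were already deleted until it has removed its full quota or reaches the end of the chromosome. Setting `mu` to 1 gives the gene-by-gene model.

```python
import random


def geometric(mean, rng):
    """Sample k >= 1 with P(k) = p * (1 - p)**(k - 1), where p = 1 / mean."""
    p = 1.0 / mean
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def simulate_fractionation(n_genes, mu, target_retained, seed=0):
    """Delete genes on one homeolog until a target fraction remains.

    Assumptions of this sketch (not taken from the source):
    - events start at a uniformly chosen gene that is still present;
    - an event skips previously deleted genes while removing its quota;
    - the chromosome is linear, so an event is cut off at the right end.
    Returns a list of booleans, True where the gene is still present.
    """
    rng = random.Random(seed)
    present = [True] * n_genes
    remaining = n_genes
    stop_at = int(target_retained * n_genes)
    while remaining > stop_at:
        start = rng.randrange(n_genes)
        if not present[start]:
            continue  # redraw until the event starts on a retained gene
        quota = geometric(mu, rng)
        pos = start
        while quota > 0 and pos < n_genes and remaining > stop_at:
            if present[pos]:
                present[pos] = False
                remaining -= 1
                quota -= 1
            pos += 1
    return present


def run_lengths(present, of_deleted=True):
    """Lengths of maximal runs of deleted (or retained) genes, left to right."""
    wanted = not of_deleted
    runs, current = [], 0
    for state in present:
        if state == wanted:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


if __name__ == "__main__":
    for mu in (1.0, 3.0):
        genome = simulate_fractionation(10_000, mu, target_retained=0.5, seed=1)
        deleted = run_lengths(genome, of_deleted=True)
        print(mu, sum(deleted) / len(deleted))
```

Comparing the mean deleted run length for `mu = 1.0` and `mu = 3.0` at the same retained fraction gives a rough feel for how the run-length distribution carries information about mu. Passing `of_deleted=False` yields runs of undeleted genes instead.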
Identifier | oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OOU.#10393/31001 |
Date | 01 May 2014 |
Creators | Wang, Baoyong |
Source Sets | Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada |
Language | English |
Detected Language | English |
Type | Thèse / Thesis |