Orthology detection is an important task within the context of genome an-
notation, gene nomenclature, and the understanding of gene evolution. With
the rapidly accelerating pace at which new genomes become available, highly
efficient methods are urgently required. As demonstrated in a large body of
literature, reciprocal best match (RBH) methods are reasonably accurate and
scale to large data sets. Nevertheless, they are far from perfect and prone to
both, false positive and false negative, orthology calls.
This work gives a complete characterization of best match as well as reciprocal
best match graphs (BMGs and RBMGs) that arise at the first step of RBH
methods. While BMGs as well as RBMGs with at most three species can be
recognized in polynomial time, RBMGs with more than three species have a
surprisingly complicated structure and it remains an open problem whether
there exist polynomial time algorithms for the recognition of these RBMGs.
In contrast to RBMGs, for which many (often mutually inconsistent) least re-
solved trees may exist, there is a unique least resolved tree for BMGs. This
tree is a homeomorphic image of the true, but typically unknown, gene tree.
Furthermore, in the absence of horizontal gene transfer (HGT), the reciprocal
best match graph contains the orthology relation suggesting that RBMGs can
only contain false positive but no false negative orthology assignments. Simu-
lation scenarios reveal that so-called good quartets, a certain graph pattern on
four vertices in BMGs, can be used to successfully identify almost all false pos-
itive edges in RBMGs. Together with the existence of a unique least resolved
tree, this suggests that BMGs contain a lot of valuable information for orthol-
ogy inference that would be lost by exclusively considering RBMGs. These
insights motivate to include additional BMG and RBMG editing steps in or-
thology detection pipelines based on the presented theoretical insights.
Moreover, a workflow is introduced to infer best matches from sequence data by
retrieving quartet structures from local information instead of reconstructing
the whole gene tree. A crucial prerequisite for this pipeline is the choice of
suitable outgroups.
However, the empirical simulations also reveal that HGT events cause strong
deviations of the orthology relation from the RBMG as well as good quartets
that are no longer associated with false positive orthologs, suggesting the need
for further investigation of the xenology relation.
The directed Fitch’s xenology relation is characterized in terms of forbidden
3-vertex subgraphs and moreover, a polynomial time algorithm for the recog-
nition and the reconstruction of a unique least resolved tree is presented. The
undirected Fitch relation, in contrast, is shown to be a complete multipartite
graph, which does not provide any interesting phylogenetic information.
In summary, the results of this work can be used to develop new methods for
inferring orthology, paralogy, and HGT. They promise major improvements in
the accuracy and the computational performance of RBH-based approaches.
Identifer | oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:36032 |
Date | 11 November 2019 |
Creators | Geiß, Manuela |
Contributors | Universität Leipzig |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/acceptedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0019 seconds