Phylogenetic analysis can produce easily tens of thousands of equally plausible evolutionary trees. Consensus trees and topological distance matrices are often used to summarize the evolutionary relationships among the trees of interest. However,
current approaches are not designed to analyze very large tree collections. In this dissertation, we present two fast algorithms— HashCS and HashRF —for analyzing large
collections of evolutionary trees based on a novel hash table data structure, which provides a convenient and fast approach to store and access the bipartition information collected from the tree collections.
Our HashCS algorithm is a fast ( ) technique for constructing consensus trees, where is the number of taxa and is the number of trees. By reprocessing the bipartition
information in our hash table, HashCS constructs strict and majority consensus trees. In addition to a consensus algorithm, we design a fast topological distance algorithm called HashRF to compute the × Robinson-Foulds distance matrix, which
requires ( ^ 2) running time. A RF distance matrix provides plenty of data-mining opportunities to help researchers understand the evolutionary relationships contained
in their collection of trees. We also introduce a series of extensions based on HashRF to provide researchers with more convenient set of tools for analyzing their trees. We
provide extensive experimentation regarding the practical performance of our hash-based algorithms across a diverse collection of biological and artificial trees. Our
results show that both algorithms easily outperform existing consensus and RF matrix implementations. For example, on our biological trees, HashCS and HashRF are 1.8 and 100 times faster than PAUP*, respectively.
We show two real-world applications of our fast hashing algorithms: (i) comparing
phylogenetic heuristic implementations, and (ii) clustering and visualizing
trees. In our first application, we design novel methods to compare the PaupRat
and Rec-I-DCM3, two popular phylogenetic heuristics that use the Maximum Parsimony
criterion, and show that RF distances are more effective than parsimony scores
at identifying heterogeneity within a collection of trees. In our second application,
we empirically show how to determine the distinct clusters of trees within large tree
collections. We use two different techniques to identify distinct tree groups. Both
techniques show that partitioning the trees into distinct groups and summarizing
each group separately is a better representation of the data. Additional benefits of
our approach are better consensus trees as well as insightful information regarding
the convergence behavior of phylogenetic heuristics.
Our fast hash-based algorithms provide scientists with a very powerful tools for
analyzing the relationships within their large phylogenetic tree collections in new and
exciting ways. Our work has many opportunities for future work including detecting
convergence and designing better heuristics. Furthermore, our hash tables have lots of
potential future extensions. For example, we can also use our novel hashing structure
to design algorithms for computing other distance metrics such as Nearest Neighbor
Interchange (NNI), Subtree Pruning and Regrafting (SPR), and Tree Bisection and
Reconnection (TBR) distances.
Identifer | oai:union.ndltd.org:tamu.edu/oai:repository.tamu.edu:1969.1/ETD-TAMU-2009-12-7591 |
Date | 2009 December 1900 |
Creators | Sul, Seung Jin |
Contributors | Williams, Tiffani L. |
Source Sets | Texas A and M University |
Language | en_US |
Detected Language | English |
Type | Book, Thesis, Electronic Dissertation, text |
Format | application/pdf |
Page generated in 0.0018 seconds