On the Neutralome of Great Apes and Nearest Neighbor Search in Metric Spaces

Problems of population genetics are magnified by problems of big data. My dissertation spans the disciplines of computer science and population genetics, leveraging computational approaches to biological problems to address issues in genomics research. In this dissertation I develop more efficient metric search algorithms. I also show that the vast majority of the genomes of great apes are impacted by the forces of natural selection. Finally, I introduce a heuristic to identify neutralomes (regions that are evolving with minimal selective pressures) and use these neutralomes for inferences on effective population size in great apes.

We begin with a formal and far-reaching problem that impacts a broad array of disciplines including biology and computer science: the 𝑘-nearest neighbors problem in generalized metric spaces. The 𝑘-nearest neighbors (𝑘-NN) problem is deceptively simple. The problem is as follows: given a query q and a dataset D of size 𝑛, find the 𝑘 points in D that are closest to q. This problem can be easily solved by algorithms that compute 𝑘th order statistics in O(𝑛) time and space. This suggests that if D can be ordered in advance, then it may be possible to solve 𝑘-NN queries in sublinear time. While this is not possible for an arbitrary distance function on the points in D, I show that if the points are constrained by the triangle inequality (as in metric spaces), then the dataset can be properly organized into a dispersion tree (Appendix A). Dispersion trees are a hierarchical data structure built around a large, dispersed set of points. Dispersion trees have sub-quadratic construction times (O(𝑛¹·⁵ log 𝑛)) and use O(𝑛) space, and they use a provably optimal search strategy that minimizes the number of times the distance function is invoked.
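The role of the triangle inequality described above can be illustrated with a minimal Python sketch. It is not the dispersion tree of Appendix A: the function names, the single precomputed pivot, and the visiting order are all assumptions made for illustration. The first function is the linear-scan baseline (here using a heap, so O(𝑛 log 𝑘) rather than a true O(𝑛) selection). The second uses the generic bound |d(q,p) − d(p,x)| ≤ d(q,x) to skip distance evaluations that cannot change the answer.

```python
import heapq
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Dist = Callable[[T, T], float]


def knn_linear_scan(query: T, data: Sequence[T], k: int,
                    dist: Dist) -> List[Tuple[float, T]]:
    """Baseline: evaluate dist once per point and keep the k smallest."""
    pairs = ((dist(query, x), x) for x in data)
    return heapq.nsmallest(k, pairs, key=lambda pair: pair[0])


def knn_pivot_pruned(query: T, data: Sequence[T], k: int, dist: Dist,
                     pivot: T, pivot_dists: Sequence[float]
                     ) -> Tuple[List[Tuple[float, T]], int]:
    """k-NN with one hypothetical pivot; pivot_dists[i] == dist(pivot, data[i]).

    Returns the neighbors and the number of distance evaluations used.
    Correct only if dist satisfies the triangle inequality.
    """
    d_qp = dist(query, pivot)
    calls = 1
    # Lower bound on dist(query, data[i]) that needs no new distance call.
    lower = [abs(d_qp - dp) for dp in pivot_dists]
    # Visit candidates from the most to the least promising lower bound.
    order = sorted(range(len(data)), key=lambda i: lower[i])
    best: List[Tuple[float, int]] = []  # max-heap of (-distance, index)
    for i in order:
        if len(best) == k and lower[i] >= -best[0][0]:
            break  # every remaining point is at least this far away
        d = dist(query, data[i])
        calls += 1
        if len(best) < k:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i))
    neighbors = sorted(((-nd, data[i]) for nd, i in best),
                       key=lambda pair: pair[0])
    return neighbors, calls
```

The sort over lower bounds still costs O(𝑛 log 𝑛) per query, so this sketch reduces only the number of distance evaluations, not the total running time; a hierarchical structure is what avoids touching every point.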
While all metric data structures have worst-case O(𝑛) search times, dispersion trees have average-case search times that are substantially faster than a large sampling of comparable data structures in the vast majority of spaces sampled. Exceptions include extremely high-dimensional spaces (d>20), where searches devolve into near-linear scans of the dataset, and unstructured low-dimensional (d<6) Euclidean spaces. Dispersion trees have empirical search times that appear to scale as O(𝑛ᶜ) for 0<c<1. As solutions to the 𝑘-NN problem are in general too slow to be used effectively in the arena of big data in genomics, it is my hope that dispersion trees may help lift this barrier. With source code that is freely available for academic use, dispersion trees may be useful for nearest neighbor classification problems in machine learning, fast read-mapping against a reference genome, and as a general computational tool for problems such as clustering.

Next, I turn to problems in population genomics. Genomic patterns of diversity are a complex function of the interplay between demographics, natural selection and mechanistic forces. A central tenet of population genetics is the neutral theory of molecular evolution, which states that the vast majority of changes at the molecular level are (relatively) selectively neutral; that is, they do not affect fitness. A corollary of the neutral theory is that the frequencies of most alleles in populations are dictated by neutral processes and not selective processes. The forces of natural selection impact not just the site of selection, but linked neutral sites as well. I propose an empirical assessment of the extent of linked selection in the human genome (Appendix B). Recombination decouples sites of selection from the genomic background, and thus serves to mitigate the effects of linked selection.
I use two measures of recombination, the minimum genetic distance to genes and the local rate of recombination, to parse the effects of linked selection into selection from genic and nongenic sources in the human genome. My empirical assessment shows profound linked selective effects from nongenic sources, with these effects being greater than those of genic sources on the autosomes, as well as generally greater effects on the X chromosome than on the autosomes. I quantify these trends using multiple linear regression, and then I model the effects of linked selection to conserved elements across the whole of the genome. Places predicted to be neutral by my model do not, unlike the vast majority of the genome, show these linked selective effects. This demonstrates that linkage to these regulatory elements, and not some other mechanistic force, accounts for our findings. Further, neutrally evolving regions are extremely rare (~1%) in the genome, and despite generally larger linked selective effects on the X chromosome, the size of this “neutralome” is proportionally larger on the X chromosome than on the autosomes.

To account for this and to extend my findings to other great apes, I improve on my procedure to find neutralomes and apply this procedure to the genomes of humans, Nigerian chimpanzees, bonobos, and western lowland gorillas (Appendix C). In doing so I show that, like humans, these other apes are also enormously impacted by linked selection, with their neutralomes being substantially smaller than the neutralomes of humans. I then use my genomic predictions on neutrality to see how the landscape of linked selection changes across the X chromosome and the autosomes in regions close to, and far from, genes.
While I had previously demonstrated that the linked selective forces near genes are stronger on the X chromosome than on the autosomes in these taxa, I show that regions far from genes show the opposite: they show more selection from noncoding targets on the autosomes than on the X chromosome. This finding is replicated across our great ape samples. Further, inferences on the relative effective population size of the X chromosome and the autosomes, both near and far from genes, can be biased as a result.
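For context on the last point, a minimal Python sketch of the textbook neutral expectation for the X-to-autosome ratio of effective population size follows. It uses the standard idealized formulas for a population with n_m breeding males and n_f breeding females; it is not the inference procedure used in the dissertation, and the function names are assumptions. With equal numbers of breeding males and females the expected ratio is 3/4, so estimates that depart from the neutral expectation may reflect linked selection rather than demography alone.

```python
def ne_autosome(n_m: float, n_f: float) -> float:
    """Idealized autosomal effective size: 4*Nm*Nf / (Nm + Nf)."""
    return 4.0 * n_m * n_f / (n_m + n_f)


def ne_x(n_m: float, n_f: float) -> float:
    """Idealized X-linked effective size: 9*Nm*Nf / (4*Nm + 2*Nf)."""
    return 9.0 * n_m * n_f / (4.0 * n_m + 2.0 * n_f)


def x_to_a_ratio(n_m: float, n_f: float) -> float:
    """Expected X/A ratio of effective size under strict neutrality."""
    return ne_x(n_m, n_f) / ne_autosome(n_m, n_f)
```

Under these formulas the ratio is bounded between 9/16 (far more breeding males than females) and 9/8 (far more breeding females than males).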

Identifier oai:union.ndltd.org:arizona.edu/oai:arizona.openrepository.com:10150/621578
Date January 2016
Creators Woerner, August Eric
Contributors Hammer, Michael; Kececioglu, John; Watkins, Joseph; Gutenkunst, Ryan
Publisher The University of Arizona.
Source Sets University of Arizona
Language en_US
Detected Language English
Type text, Electronic Dissertation
Rights Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.