• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 1
  • Tagged with
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Efficient Haplotype Matching on Biobank-Scale Reference Graphs

Villalobos, Seba 01 January 2023 (has links) (PDF)
The positional Burrows-Wheeler transform (PBWT) is a foundational data structure for representing haplotype matches of biobank scale. Once the PBWT panel of a set of haplotypes are constructed, efficient algorithms are available for “All vs. All” positional substring matching, finding exact matches of substrings in pre-aligned strings, for haplotypes within the panel, and “One vs. All” positional substring match query for an out-of-panel haplotype against all haplotypes in the panel. While the original PBWT was designed from linear reference genomes, GBWT was proposed to extend PBWT to genome graphs that allow large insertions and deletions. However, there are no GBWT algorithms for haplotype matching. In this work, we develop the efficient algorithms for “All vs. All” and “One vs. All” haplotype set-maximal and long matching algorithms for GBWT. For a GBWT containing a panel of paths P, we show algorithms similar to the matching algorithms of PBWT. Our algorithms achieves theoretically optimal time complexity to output all “All vs. All” matches in time linear to the size of the input panel (O(∑|Pi| + |out put|)), and quasilinear time to the length of the query path for “One vs. All” path match queries (O(|Q| log σ + |out put| log σ ), where σ is the maximum out- degree in the GBWT and out put is the set of discovered path matches). Under the constant σ assumption made by gPBWT and GBWT, these algorithms are in fact linear. Our algorithms open the possibilities for applications of efficient positional substring matching in pangenome references such as identical-by-descent (IBD) segment identification and genotype imputation.

Page generated in 0.0737 seconds