Return to search

Efficient Haplotype Matching on Biobank-Scale Reference Graphs

The positional Burrows-Wheeler transform (PBWT) is a foundational data structure for representing haplotype matches of biobank scale. Once the PBWT panel of a set of haplotypes are constructed, efficient algorithms are available for “All vs. All” positional substring matching, finding exact matches of substrings in pre-aligned strings, for haplotypes within the panel, and “One vs. All” positional substring match query for an out-of-panel haplotype against all haplotypes in the panel. While the original PBWT was designed from linear reference genomes, GBWT was proposed to extend PBWT to genome graphs that allow large insertions and deletions. However, there are no GBWT algorithms for haplotype matching. In this work, we develop the efficient algorithms for “All vs. All” and “One vs. All” haplotype set-maximal and long matching algorithms for GBWT. For a GBWT containing a panel of paths P, we show algorithms similar to the matching algorithms of PBWT. Our algorithms achieves theoretically optimal time complexity to output all “All vs. All” matches in time linear to the size of the input panel (O(∑|Pi| + |out put|)), and quasilinear time to the length of the query path for “One vs. All” path match queries (O(|Q| log σ + |out put| log σ ), where σ is the maximum out- degree in the GBWT and out put is the set of discovered path matches). Under the constant σ assumption made by gPBWT and GBWT, these algorithms are in fact linear. Our algorithms open the possibilities for applications of efficient positional substring matching in pangenome references such as identical-by-descent (IBD) segment identification and genotype imputation.

Identiferoai:union.ndltd.org:ucf.edu/oai:stars.library.ucf.edu:honorstheses-2608
Date01 January 2023
CreatorsVillalobos, Seba
PublisherSTARS
Source SetsUniversity of Central Florida
LanguageEnglish
Detected LanguageEnglish
Typetext
Formatapplication/pdf
SourceHonors Undergraduate Theses

Page generated in 0.0024 seconds