Computational modeling for identification of low-frequency single nucleotide variants

Indiana University-Purdue University Indianapolis (IUPUI) / Reliable detection of low-frequency single nucleotide variants (SNVs) is important in many applications. In cancer genetics, the frequencies of somatic variants from tumor biopsies tend to be low due to contamination with normal tissue and tumor heterogeneity. Circulating tumor DNA monitoring also faces the challenge of detecting low-frequency variants because tumor DNA makes up only a small percentage of the DNA in blood. Moreover, in population genetics, although pooled sequencing is cost-effective compared with individual sequencing, pooling dilutes the signals of variants from any individual. Detection of low-frequency variants is difficult and can be confounded by multiple sources of error, especially next-generation sequencing artifacts. Existing methods are limited in sensitivity and mainly focus on frequencies around 5%; most fail to consider differential, context-specific sequencing artifacts. To address this challenge, we developed a computational and experimental framework, RareVar, to reliably identify low-frequency SNVs from high-throughput sequencing data. For optimized performance, RareVar utilized a supervised learning framework to model artifacts originating from different components of a specific sequencing pipeline. This is enabled by a customized, comprehensive benchmark dataset enriched with known low-frequency SNVs from the sequencing pipeline of interest. A genomic-context-specific sequencing error model was trained on the benchmark data to characterize the systematic sequencing artifacts and to derive the position-specific detection limit for sensitive low-frequency SNV detection. Further, a machine-learning algorithm utilized sequencing quality features to refine SNV candidates for higher specificity. RareVar outperformed existing approaches, especially at 0.5% to 5% frequency.
We further explored the influence of statistical modeling on position-specific error modeling and showed that the zero-inflated negative binomial was the best-performing statistical distribution. When we replicated the analyses on an Illumina MiSeq benchmark dataset, our method adapted seamlessly to technologies with different biochemistries. RareVar enables sensitive detection of low-frequency SNVs across different sequencing platforms and will facilitate research and clinical applications such as pooled sequencing, early cancer detection, prognostic assessment, metastatic monitoring, and identification of relapse or acquired resistance.
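
To make the idea of a position-specific detection limit concrete, the following is a minimal sketch rather than the RareVar implementation. It assumes the number of artifact reads at a position follows a zero-inflated negative binomial (ZINB) distribution and takes the detection limit to be the smallest alternate-read count whose tail probability under that error model falls below a chosen significance level. The function names, the mean/size parameterization, the `alpha` threshold, and all numeric parameter values are hypothetical and are not taken from the thesis.

```python
import math


def nb_pmf(k, size, mu):
    """Negative binomial pmf with mean mu > 0 and dispersion size > 0."""
    p = size / (size + mu)  # success probability, strictly between 0 and 1
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def zinb_pmf(k, pi, size, mu):
    """ZINB pmf: with probability pi the count is a structural zero,
    otherwise it is drawn from the negative binomial."""
    base = (1.0 - pi) * nb_pmf(k, size, mu)
    return pi + base if k == 0 else base


def detection_limit(pi, size, mu, alpha=1e-3, max_count=100_000):
    """Smallest count k such that P(X >= k) <= alpha under the error model.

    Assumption for illustration: a candidate SNV is called only if its
    alternate-read count reaches this limit.
    """
    cdf = 0.0  # running P(X <= k - 1)
    for k in range(max_count + 1):
        if 1.0 - cdf <= alpha:
            return k
        cdf += zinb_pmf(k, pi, size, mu)
    raise ValueError("no detection limit found below max_count")


# Hypothetical error-model parameters for two genomic contexts.
clean_limit = detection_limit(pi=0.6, size=2.0, mu=0.5)
noisy_limit = detection_limit(pi=0.1, size=2.0, mu=5.0)

# Convert the read-count limits to allele frequencies at an assumed depth.
depth = 10_000
print(f"clean context: {clean_limit} reads ({clean_limit / depth:.4%})")
print(f"noisy context: {noisy_limit} reads ({noisy_limit / depth:.4%})")
```

Under these assumed parameters, the noisier context needs more supporting reads before a variant can be distinguished from artifacts, which is why a single global frequency cutoff either loses sensitivity at clean positions or admits false positives at error-prone ones.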

Low-frequency variants

Machine-learning

Next generation sequencing

SNVs

Somatic mutations

Statistical modeling

Cancer -- Genetic aspects

Genetics -- Statistics

Genomics

Machine learning -- Mathematical models

Mathematical optimization

Biopsy

Population genetics

Identifier	oai:union.ndltd.org:IUPUI/oai:scholarworks.iupui.edu:1805/8891
Date	16 November 2015
Creators	Hao, Yangyang
Contributors	Liu, Yunlong; Edenberg, Howard J.; Li, Lang; Nakshatr, Harikrishna
Source Sets	Indiana University-Purdue University Indianapolis
Language	en_US
Detected Language	English
