
A fast and accurate model to detect germline SNPs and somatic SNVs with high-throughput sequencing

The rapid development of high-throughput sequencing technology offers a new opportunity to extend the scale and resolution of genomic research. Calling genetic variants at single-base resolution efficiently and accurately, whether germline single nucleotide polymorphisms (SNPs) or somatic single nucleotide variants (SNVs), is a fundamental challenge in sequencing data analysis, because these variants have been reported to influence transcriptional regulation, alternative splicing, non-coding RNA regulation and protein coding. Many applications have been developed to tackle this challenge. However, shallow sequencing depth and cellular heterogeneity prevent those tools from attaining satisfactory accuracy, and the sheer volume of sequencing data makes the process inefficient.

In this dissertation, the performance of prevalent read aligners and SNP callers for second-generation sequencing (SGS) is first evaluated. The significantly lower coverage and poorer SNP calling performance of SGS in the regulatory regions of the human genome, attributed to their high GC content, are then investigated.

To enhance the capability to call SNPs, especially within lower-depth regions, a fast and accurate SNP detection (FaSD) program that uses a binomial-distribution-based algorithm and a mutation probability is proposed. In comparisons with popular software, benchmarked against SNP arrays and high-depth sequencing data, FaSD is demonstrated to have the best SNP calling accuracy in terms of genotype concordance rate and AUC. Furthermore, FaSD can finish SNP calling within four hours for 10X human genome SGS data on a standard desktop computer.
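As a rough illustration of what a binomial-based genotype call combined with a mutation probability can look like, the following minimal sketch scores three diploid genotypes from the read counts at one site. It is not FaSD's actual algorithm, which the abstract does not specify. The function names, the per-base `error_rate`, the `mutation_prob` value, and the 2:1 split of that prior between heterozygous and homozygous variant genotypes are all assumptions made for the example.

```python
from math import comb


def genotype_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Binomial likelihood of the observed read counts under each genotype.

    error_rate is a hypothetical per-base sequencing error rate.
    """
    depth = ref_count + alt_count
    # Assumed probability that a single read shows the alternate allele.
    p_alt = {
        "ref/ref": error_rate,
        "ref/alt": 0.5,
        "alt/alt": 1.0 - error_rate,
    }
    return {
        genotype: comb(depth, alt_count) * p ** alt_count * (1.0 - p) ** ref_count
        for genotype, p in p_alt.items()
    }


def call_genotype(ref_count, alt_count, error_rate=0.01, mutation_prob=0.001):
    """Return the most probable genotype and the normalized posteriors.

    mutation_prob is a hypothetical prior probability that the site is variant;
    it is split 2:1 between ref/alt and alt/alt purely for illustration.
    """
    priors = {
        "ref/ref": 1.0 - mutation_prob,
        "ref/alt": mutation_prob * 2.0 / 3.0,
        "alt/alt": mutation_prob / 3.0,
    }
    likelihoods = genotype_likelihoods(ref_count, alt_count, error_rate)
    weighted = {g: likelihoods[g] * priors[g] for g in priors}
    total = sum(weighted.values())
    posteriors = {g: w / total for g, w in weighted.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


if __name__ == "__main__":
    for ref, alt in [(10, 0), (5, 5), (0, 10)]:
        best, post = call_genotype(ref, alt)
        print(ref, alt, best, round(post[best], 4))
```

In a sketch like this, the prior matters most at low depth: with only a few reads, one or two alternate-allele observations may not outweigh a small mutation probability, which is one reason low-coverage regions are hard to call.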

Lastly, an updated version of FaSD that incorporates joint genotype likelihoods, FaSD-somatic, is proposed to call cancer somatic SNVs between paired tumor and normal samples. Extensive assessments on various types of cancer demonstrate that FaSD-somatic has the best overall performance, whether benchmarked against known somatic SNVs and germline SNPs from databases or against somatic SNVs called from higher-depth data. Building on and improving FaSD, FaSD-somatic is also the fastest somatic SNV caller among current programs, and can finish calling somatic mutations within 14 hours for 50X paired tumor and normal samples on an ordinary server. / published_or_final_version / Biochemistry / Doctoral / Doctor of Philosophy

Identifier: oai:union.ndltd.org:HKU/oai:hub.hku.hk:10722/197115
Date: January 2014
Creators: Wang, Weixin, 王煒欣
Contributors: Wang, JJ, Lam, TW
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Source Sets: Hong Kong University Theses
Language: English
Detected Language: English
Type: PG_Thesis
Rights: The author retains all proprietary rights, (such as patent rights) and the right to use in future works., Creative Commons: Attribution 3.0 Hong Kong License
Relation: HKU Theses Online (HKUTO)
