Global ETD Search

1	Evaluating the Application of Allele Frequency in the Saudi Population Variant Detection Alsaedi, Sakhaa 26 April 2020 Human Mendelian disease in Saudi Arabia is both significant and challenging. Next-generation sequencing (NGS) has led to important discoveries of the genetic variants responsible for inherited disease. However, the success of clinical genomics using NGS requires accurate and consistent identification of rare genome variants. Rarity is a very important criterion for pathogenicity. Here we describe a model to detect variants by analyzing allele frequencies of a Saudi population. This work will help improve the variant calling workflow by providing robust frequency estimates, making it easier to detect the rare and unusual variants that are frequently associated with inherited disease. variants variant calling saudi genome population mendelians diseases allele frequency
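The role of allele frequency as a rarity criterion can be illustrated with a minimal sketch. It is not the model described in the thesis; the `GenotypeCounts` type, the function names, and the 1% threshold are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GenotypeCounts:
    """Genotype counts at one biallelic site in a diploid cohort (hypothetical type)."""
    hom_ref: int  # individuals with two reference alleles
    het: int      # individuals with one alternate allele
    hom_alt: int  # individuals with two alternate alleles


def alt_allele_frequency(counts: GenotypeCounts) -> float:
    """Alternate allele frequency = alternate alleles / all observed alleles."""
    total_alleles = 2 * (counts.hom_ref + counts.het + counts.hom_alt)
    if total_alleles == 0:
        raise ValueError("no genotyped individuals at this site")
    return (counts.het + 2 * counts.hom_alt) / total_alleles


def is_rare(counts: GenotypeCounts, threshold: float = 0.01) -> bool:
    """Flag a variant as rare; the 1% default is an assumed cut-off, not from the source."""
    return alt_allele_frequency(counts) < threshold


site = GenotypeCounts(hom_ref=990, het=10, hom_alt=0)
print(alt_allele_frequency(site), is_rare(site))
```

A population-specific cohort matters here because a variant that looks rare in a global database may be common locally, and the reverse can also happen.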
2	Analysis of RNA and DNA sequencing data : Improved bioinformatics applications Sigurgeirsson, Benjamín January 2016 Massively parallel sequencing has rapidly revolutionized DNA and RNA research. Sample preparation is steadily advancing, sequencing costs have plummeted and throughput is ever growing. This progress has resulted in exponential growth in data generation with a corresponding demand for bioinformatic solutions. This thesis addresses methodological aspects of this sequencing revolution and applies them to selected biological topics. Papers I and II are technical in nature and concern sample preparation and data analysis of RNA sequencing data. Paper I is focused on RNA degradation and paper II on generating strand-specific RNA-seq libraries. Papers III and IV deal with current biological issues. In paper III, whole exomes of cancer patients undergoing chemotherapy are sequenced and their genetic variants are associated with their toxicity-induced adverse drug reactions. In paper IV, a comprehensive view of the gene expression of the endometrium is assessed at two time points of the menstrual cycle. Together these papers show relevant aspects of contemporary sequencing technologies and how they can be applied to diverse biological topics. RNA sequencing exome sequencing bioinformatics gene expression differential expression variant calling
3	Určování genetických variant z masivně paralelních sekvenačních dat pomocí lokálních reassembly / Variant calling using local reference-helped assemblies Dráb, Martin January 2017 Despite active development in recent years, sequencing a genome remains a challenge. Current technologies are not able to read the whole genome in one piece. Instead, the target genome is shattered into a large number of small pieces that are then sequenced separately. The process of assembling these small pieces to obtain the sequence of the whole genome is painful and resource-consuming. Multiple algorithms have been developed to solve the assembly problem. This thesis presents yet another assembly algorithm, based on de Bruijn graphs and focused on sequencing short genome regions. The algorithm is compared to well-known solutions in the field.
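The de Bruijn graph idea mentioned in this abstract can be sketched as follows. This is a generic textbook construction, not the thesis algorithm; the function names `build_de_bruijn` and `walk_unambiguous` are hypothetical, and the sketch assumes error-free reads and ignores reverse complements.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ACGTC", "GTCAG"]  # two overlapping toy reads
graph = build_de_bruijn(reads, k=3)
print(walk_unambiguous(graph, "AC"))
```

Branches in the graph (nodes with more than one successor) are where sequencing errors, repeats, or real variants show up, which is why the walk stops there.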
4	Methods for Detecting Mutations in Non-model Organisms January 2020 abstract: Next-generation sequencing is a powerful tool for detecting genetic variation. However, it is also error-prone, with error rates that are much larger than mutation rates. This can make mutation detection difficult; and while increasing sequencing depth can often help, sequence-specific errors and other non-random biases cannot be detected by increased depth. The problem of accurate genotyping is exacerbated when there is not a reference genome or other auxiliary information available. I explore several methods for sensitively detecting mutations in non-model organisms using an example Eucalyptus melliodora individual. I use the structure of the tree to find bounds on its somatic mutation rate and evaluate several algorithms for variant calling. I find that conventional methods are suitable if the genome of a close relative can be adapted to the study organism. However, with structured data, a likelihood framework that is aware of this structure is more accurate. I use the techniques developed here to evaluate a reference-free variant calling algorithm. I also use this data to evaluate a k-mer based base quality score recalibrator (KBBQ), a tool I developed to recalibrate base quality scores attached to sequencing data. Base quality scores can help detect errors in sequencing reads, but are often inaccurate. The most popular method for correcting this issue requires a known set of variant sites, which is unavailable in most cases. I simulate data and show that errors in this set of variant sites can cause calibration errors. I then show that KBBQ accurately recalibrates base quality scores while requiring no reference or other information and performs as well as other methods.
Finally, I use the Eucalyptus data to investigate the impact of quality score calibration on the quality of output variant calls and show that improved base quality score calibration increases the sensitivity and reduces the false positive rate of a variant calling algorithm. / Dissertation/Thesis / Doctoral Dissertation Molecular and Cellular Biology 2020 Bioinformatics Computer science Biology DNA Sequencing Mutation Quality Scores Sequencing Error Variant Calling
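What it means for a base quality score to be well calibrated can be sketched with the standard Phred relation Q = -10 log10(p). The code below is not KBBQ and does not reproduce its k-mer based method; it only compares reported scores with an empirical error rate, assuming each base already carries a known error label. The function names and the add-one smoothing are assumptions made for illustration.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q: float) -> float:
    """Phred score Q corresponds to error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse of the Phred relation: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def empirical_quality(observations):
    """Empirical Phred score for each reported score.

    `observations` is an iterable of (reported_q, is_error) pairs. Add-one
    smoothing (an assumption of this sketch) keeps the estimate finite
    when no errors are observed in a bin.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for reported_q, is_error in observations:
        totals[reported_q] += 1
        if is_error:
            errors[reported_q] += 1
    return {
        q: error_prob_to_phred((errors[q] + 1) / (totals[q] + 2))
        for q in totals
    }


# 98 bases reported at Q30, 9 of which are actually wrong: badly overconfident.
observations = [(30, True)] * 9 + [(30, False)] * 89
print(empirical_quality(observations))
```

A calibrated score would map each reported value close to itself; here bases reported at Q30 behave like Q10 bases, which is the kind of mismatch recalibration aims to remove.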
5	Les défis du séquençage à haut débit dans l'exploration génétique des cancers du sein et de l'ovaire. / Challenges of Next Generation Sequencing in the exploration of genetic predispositions to breast and/or ovarian cancers Muller, Etienne 12 December 2017 Breast and ovarian cancers arise in 5 to 10% of cases in a context of genetic predisposition, of which only a small proportion is explained by the presence of a pathogenic variant in the BRCA1, BRCA2 and PALB2 genes. High-throughput sequencing makes it possible to explore this missing heritability, but it poses a new challenge in computing, statistics and biology alike. Three approaches using this new technology were used to search for new predisposition factors. First, the risks associated with 34 genes known or suspected to be involved in predisposition were estimated from the analysis of 5,131 index cases and the development of a new statistical approach. Second, the contribution of mosaic de novo mutations to the syndrome was explored in 1,750 index cases from the previous study, using outLyzer, a software tool developed specifically to detect variants present at low frequency. Finally, the sequencing-based exploration of the missing heritability was extended to a panel of 201 genes involved in cancer, in 118 patients selected for the early onset of their disease, a feature strongly suggestive of a predisposition factor. The results of this work validated the relevance of studying PALB2, RAD51C and RAD51D for patient management, and also suggest an underestimated involvement of mosaic variants. However, there are very probably still other highly penetrant genetic factors to be discovered, whose modulation of risk follows an oligogenic model. Next Generation Sequencing High Throughput Sequencing Bioinformatics Mosaic Variant-calling 660.65
6	Filtering of Clinical NGS Data to Improve Low Allele Frequency Variant Calling Cumlin, Tomas January 2022 Massively parallel sequencing (NGS) is useful for detecting and later classifying somatic driver mutations in cancer tumours. False-positive variants occur in the NGS workflow and may be mistaken for low-frequency somatic cancer mutations in a patient sample. This creates a need to reduce the noise rate in the NGS workflow, since doing so may improve the detection of low allele frequency variants, in particular cancer mutations. The aim of this project was to reduce the level of false-positive variants in an NGS workflow. The scope was limited to substitution errors and their neighbouring nucleotides. The project was also a way to understand how different types of substitution errors are distributed in the data, whether their frequencies are affected by neighbouring nucleotides, and how data processing may affect these substitution rates. A bioinformatic pipeline was set up in which a commercially available genomic DNA sample with known variants was subjected to different trimming and filtering settings. The goal was to reduce the substitution error rate as much as possible without removing any true variants from the data. The optimised settings were trimming 5 bp from the tail of each sequencing read and filtering out sequencing reads that contained 5 or more substitutions. Three additional samples, two clinical and one commercial, were tested with these settings. The results showed that in all samples, C:G>T:A substitutions had a higher frequency than the other substitution types. For all samples, A:T>C:G substitutions where the neighbouring nucleotide was a C or a G on each side had a higher frequency than A:T>C:G substitutions with other neighbouring nucleotides on both sides. Those substitution types were especially targeted by the trimming.
For the two commercial samples, substitutions that resulted in the nucleotide combinations >XAA or >XTT had a higher frequency than the same substitution types that did not result in those nucleotide combinations. Filtering reads with 5 or more substitutions particularly targeted these substitution types. Consequently, filtering had a greater effect on the commercial samples than on the clinical samples. Overall, trimming and filtering reduced transversions more than transitions, increasing the transition/transversion ratio after processing the data. The results suggest that trimming and filtering can be a useful way to computationally reduce the transversion errors introduced in an NGS workflow, but that they reduce transition errors, in particular A:T>G:C transitions, to a lesser extent. To confirm these findings, more samples should be tested using this methodology. To better understand the effect of trimming and filtering on variant calling, the scope could in the future be expanded to include small insertions and deletions. Next-generation sequencing Variant calling Clinical sequencing Substitution error Cancer sequencing Error rate substitution Somatic mutation Bioinformatics and Systems Biology Bioinformatik och systembiologi
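The two settings reported in this abstract (trim 5 bp from the read tail, discard reads with 5 or more substitutions) and the transition/transversion ratio can be sketched as below. This is not the thesis pipeline. The sketch assumes ungapped read/reference pairs of equal length and assumes trimming is applied before filtering; all function names are hypothetical.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def trim_tail(read: str, ref: str, n: int = 5):
    """Drop the last n aligned positions from a read and its reference segment."""
    if n <= 0:
        return read, ref
    return read[:-n], ref[:-n]


def substitutions(read: str, ref: str):
    """List (reference_base, read_base) pairs at mismatching positions."""
    return [(r, b) for r, b in zip(ref, read) if r != b]


def keep_read(read: str, ref: str, max_subs: int = 5) -> bool:
    """Keep a read only if it has fewer than max_subs substitutions."""
    return len(substitutions(read, ref)) < max_subs


def ts_tv_ratio(subs) -> float:
    """Transition/transversion ratio of a list of (ref, alt) substitutions."""
    ts = sum(1 for s in subs if s in TRANSITIONS)
    tv = len(subs) - ts
    return ts / tv if tv else float("inf")


ref = "AAAAAAAAAACCCCC"
read = "AAAAAAAAAAGGGGG"  # five C>G transversions, all in the read tail
print(keep_read(read, ref))                    # untrimmed read is filtered out
print(keep_read(*trim_tail(read, ref, n=5)))   # trimmed read survives
print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```

In the toy read every error sits in the last five bases, so trimming rescues a read that the substitution filter alone would discard.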
7	Statistical methods for variant discovery and functional genomic analysis using next-generation sequencing data Tang, Man 03 January 2020 The development of high-throughput next-generation sequencing (NGS) techniques produces massive amounts of data, allowing the identification of biomarkers in early disease diagnosis and driving the transformation of most disciplines in biology and medicine. Greater effort is needed in developing novel, powerful, and efficient tools for NGS data analysis. This dissertation focuses on modeling "omics" data in various NGS applications with a primary goal of developing novel statistical methods to identify sequence variants, find transcription factor (TF) binding patterns, and decode the relationship between TF binding and gene expression levels. Accurate and reliable identification of sequence variants, including single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (INDELs), plays a fundamental role in NGS applications. Existing methods for calling these variants often make the simplifying assumption of positional independence and fail to leverage the dependence of genotypes at nearby loci induced by linkage disequilibrium. We propose vi-HMM, a hidden Markov model (HMM)-based method for calling SNPs and INDELs in mapped short read data. Simulation experiments show that, under various sequencing depths, vi-HMM outperforms existing methods in terms of sensitivity and F1 score. When applied to human whole genome sequencing data, vi-HMM demonstrates higher accuracy in calling SNPs and INDELs. One important NGS application is chromatin immunoprecipitation followed by sequencing (ChIP-seq), which characterizes protein-DNA relations through genome-wide mapping of TF binding sites. Multiple TFs binding to DNA sequences often show complex binding patterns, which indicate how TFs with similar functionalities work together to regulate the expression of target genes.
To help uncover the transcriptional regulation mechanism, we propose a novel nonparametric Bayesian method to detect the clustering pattern of multiple-TF bindings from ChIP-seq datasets. A simulation study demonstrates that our method performs best with regard to precision, recall, and F1 score, in comparison to traditional methods. We also apply the method to real data and observe several TF clusters that have been recognized previously in mouse embryonic stem cells. Recent advances in ChIP-seq and RNA sequencing (RNA-Seq) technologies provide more reliable and accurate characterization of TF binding sites and gene expression measurements, which serve as a basis to study the regulatory functions of TFs on gene expression. We propose a log Gaussian Cox process with a wavelet-based functional model to quantify the relationship between TF binding site locations and gene expression levels. Through a simulation study, we demonstrate that our method performs well, especially with large sample size and small variance. It also shows a remarkable ability to distinguish real local features in the function estimates. / Doctor of Philosophy / The development of high-throughput next-generation sequencing (NGS) techniques produces massive amounts of data and brings about innovations in biology and medicine. Greater effort is needed in developing novel, powerful, and efficient tools for NGS data analysis. In this dissertation, we mainly focus on three problems closely related to NGS and its applications: (1) how to improve variant calling accuracy, (2) how to model transcription factor (TF) binding patterns, and (3) how to quantify the contribution of TF binding to gene expression. We develop novel statistical methods to identify sequence variants, find TF binding patterns, and explore the relationship between TF binding and gene expression.
We expect our findings will be helpful in promoting a better understanding of disease causality and facilitating the design of personalized treatments. next-generation sequencing hidden Markov model variant calling transcription factor nonparametric Bayesian log Gaussian Cox process Dirichlet process mixture gene expression wavelet-based functional model
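The precision, recall (sensitivity), and F1 score used in this abstract to compare methods can be sketched for a variant call set as follows. This is a generic evaluation against a truth set, not the dissertation's benchmarking code; the `evaluate_calls` name and the `(chrom, pos, ref, alt)` tuple representation are assumptions.

```python
def evaluate_calls(called, truth):
    """Compare called variants with a truth set; each variant is a hashable key."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives: called and real
    fp = len(called - truth)   # false positives: called but not real
    fn = len(truth - called)   # false negatives: real but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")}
called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
          ("chr2", 50, "G", "A"), ("chr2", 90, "A", "T")}
print(evaluate_calls(called, truth))
```

F1 is the harmonic mean of precision and recall, so a caller cannot score well by being only sensitive or only conservative.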
