Connecting bioinformatics analysis to scientific practice : an integrated information behaviour and task analysis approach / Bartlett, Joan Catherine. January 2004
Thesis (Ph. D.)--University of Toronto, 2004. / Adviser: Elaine Toms. Completed at the Faculty of Information Studies, University of Toronto. Includes bibliographical references (leaves 179-187).
Ballinger, Tracy J.
29 October 2015
<p> In the last century cancer has become increasingly prevalent and is the second largest killer in the United States, estimated to afflict 1 in 4 people during their life. Despite our long history with cancer and our herculean efforts to thwart the disease, in many cases we still do not understand the underlying causes or have successful treatments. In my graduate work, I’ve developed two approaches to the study of cancer genomics and applied them to the whole genome sequencing data of cancer patients from The Cancer Genome Atlas (TCGA). In collaboration with Dr. Ewing, I built a pipeline to detect retrotransposon insertions from paired-end high-throughput sequencing data and found somatic retrotransposon insertions in a fifth of cancer patients. </p><p> My second novel contribution to the study of cancer genomics is the development of the CN-AVG pipeline, a method for reconstructing the evolutionary history of a single tumor by predicting the order of structural mutations such as deletions, duplications, and inversions. The CN-AVG theory was developed by Drs. Haussler, Zerbino, and Paten and samples potential evolutionary histories for a tumor using Markov Chain Monte Carlo sampling. I contributed to the development of this method by testing its accuracy and limitations on simulated evolutionary histories. I found that the ability to reconstruct a history decays exponentially with increased breakpoint reuse, but that we can estimate how accurately we reconstruct a mutation event using the likelihood scores of the events. I further designed novel techniques for the application of CN-AVG to whole genome sequencing data from actual patients and applied these techniques to search for evolutionary patterns in glioblastoma multiforme using sequencing data from TCGA. My results show patterns of two-hit deletions, as we would expect, and amplifications occurring over several mutational events. 
I also find that the CN-AVG method frequently makes use of whole chromosome copy number changes followed by localized deletions, a bias that could be mitigated by modifying the cost function for an evolutionary history. </p>
Roberts, Rick Lee
20 October 2015
<p> Many enzymes of the major metabolic pathways are categorized into superfamilies which share common folds. Current models postulate that these superfamilies are the result of gene duplications coupled with mutations that result in the acquisition of new functions. Some of these new functions are considered advantageous and selected for, while others may simply be tolerated. The latter can result in metabolites that are produced at low rates, are of no known use to the cell, and can become toxic when accumulated. Concurrent with the evolution of this tolerable or potentially detrimental metabolism, organisms are selected to evolve a means of correcting or “proofreading” these non-canonical metabolites to counterbalance their detrimental effects. Metabolite proofreading is a process of intermediary metabolism, analogous to DNA proofreading, that acts on these abnormal metabolites to prevent their accumulation and toxic effects. </p><p> Here we structurally characterize ethylmalonyl-CoA decarboxylase (EMCD), a member of the family of enoyl-CoA hydratases within the crotonase superfamily of proteins, which is coded by the ECHDC1 (enoyl-CoA hydratase domain containing 1) gene. EMCD has been shown to have a metabolic proofreading property, acting on the metabolic byproduct ethylmalonyl-CoA to prevent its accumulation, which could result in oxidative damage. We use the complementary methods of in situ crystallography, small angle X-ray scattering, and single crystal X-ray crystallography to structurally characterize EMCD, followed by homology analysis in order to propose a mechanism of action. This represents the first structure of a crotonase superfamily member thought to function as a metabolite proofreading enzyme.</p>
08 September 2017
<p> Humoral immunity is driven by the expansion, somatic hypermutation, and selection of B cell clones. Each clone is the progeny of a single B cell responding to antigen, with diversified Ig receptors. The advent of next-generation sequencing technologies enables deep profiling of the Ig repertoire. This large-scale characterization provides a window into the micro-evolutionary dynamics of the adaptive immune response and has a variety of applications in basic science and clinical studies. Clonal relationships are not directly measured, but must be computationally inferred from these sequencing data. In this dissertation, we use a combination of human experimental and simulated data to characterize the performance of hierarchical clustering-based methods for partitioning sequences into clones. Our results suggest that hierarchical clustering using single linkage with nucleotide Hamming distance identifies clones with high confidence and provides a fully automated method for clonal grouping. The performance estimates we develop provide important context to interpret clonal analysis of repertoire sequencing data and allow for rigorous testing of other clonal grouping algorithms. We present the clonal grouping tool as well as other tools for advanced analyses of large-scale Ig repertoire sequencing data through a suite of utilities, Change-O. All Change-O tools utilize a common data format, which enables the seamless integration of multiple analyses into a single workflow. We then apply the Change-O suite in concert with the nucleotide coding sequences for WNV-specific antibodies derived from single cells to identify expanded WNV-specific clones in the repertoires of recently infected subjects through quantitative Ig repertoire sequencing analysis. 
The method proposed in this dissertation to computationally identify B cell clones in Ig repertoire sequencing data with high confidence is made available through the Change-O suite and can be applied to provide insight into the dynamics of the adaptive immune response.</p><p>
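The clonal grouping idea in the abstract above can be illustrated with a small sketch. It is not the Change-O implementation; it only assumes that sequences of equal length are compared by normalized nucleotide Hamming distance and that the single-linkage tree is cut at a fixed threshold, which is equivalent to taking connected components of the graph linking every pair within that threshold. The function names and the threshold value are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage_clones(seqs, threshold):
    """Group sequences into clones by single linkage.

    Two sequences share a clone if a chain of sequences connects them in
    which each consecutive pair has normalized Hamming distance <= threshold.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        # Assumption: only non-empty sequences of equal length are compared.
        if not seqs[i] or len(seqs[i]) != len(seqs[j]):
            continue
        if hamming(seqs[i], seqs[j]) / len(seqs[i]) <= threshold:
            parent[find(i)] = find(j)

    clones = {}
    for i, s in enumerate(seqs):
        clones.setdefault(find(i), []).append(s)
    return sorted(sorted(c) for c in clones.values())
```

With a threshold of 0.1, the sequences `AAAAAAAAAA`, `AAAAAAAAAT`, and `AAAAAAAATT` fall into one clone because each neighbouring pair differs at a single position, even though the first and last differ at two. This chaining behaviour is the defining property of single linkage.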
Combining Protein Interactions and Functionality Classification in NS3 to Determine Specific Antiviral Targets in Dengue / Alomair, Lamya. 15 September 2017
<p> Dengue virus (DENV) is a serious worldwide health concern, putting about 2.5 billion people in more than 100 countries at risk. Dengue, a member of the Flaviviridae family, is transmitted to humans via mosquitoes and is a deadly viral disease. Unfortunately, there are no vaccines or antivirals that can prevent this infection, which is why researchers are diligently working to find cures. The DENV genome codes for multiple nonstructural proteins, one of which is the NS3 enzyme, which participates in different steps of the viral life cycle, including viral replication, viral RNA genome synthesis, and host immune mechanisms. Recent studies suggest a role for fatty acid biogenesis during DENV infection, including posttranslational protein modification. Phosphorylation is among the protein posttranslational modifications and plays essential roles in protein folding, interactions, signal transduction, survival, and apoptosis. </p><p> In silico study provides a powerful approach to gain a better understanding of biological systems at the gene level. NS3 has the potential to be phosphorylated by any of the ∼500 human kinases. We predicted potential kinases that might phosphorylate NS3 and calculated Dena ranking score using neural network and other machine learning based webserver programs. These scores enabled us to investigate and identify the top kinases that phosphorylate DENV NS3. We hypothesize that preventing the phosphorylation of NS3 may interrupt the viral replication and participate in antiviral evasion. Using multiple sequence alignment bioinformatics tools, we verified the results of the highly conserved residues and the residues around active sites whose phosphorylation may have a potential effect on viral replication. We further verified the results with multiple bioinformatics tools. 
Moreover, we included the Zika virus in our research and analysis, taking into consideration that Zika is related to the dengue virus, since both belong to the Flavivirus genus and infect humans, which might lead to many similarities between Zika and dengue, and that Zika is available for <i>in vitro</i> testing. </p><p> Our studies propose that the host-mediated phosphorylation of NS3 would affect its capability to interact with NS5, and that knocking out one of the interacting proteins may inhibit viral replication. These results will open new doors for further investigation, and future work is expected to help identify the key inhibition mechanisms.</p><p>
Webber, James Trubek
16 November 2017
<p> Cancer is a complex and multifaceted disease, and a vast amount of time and effort has been spent on characterizing its behaviors, identifying its weaknesses, and discovering effective treatments. Two major obstacles stand in the way of progress toward effective precision treatment for the majority of patients.</p><p> First, cancer's extraordinary heterogeneity—both between and even within patients—means that most patients present with a disease slightly different from every previously recorded case. New methods are necessary to analyze the growing body of patient data so that we can classify each new patient with as much accuracy and precision as possible. In chapter 2 I present a method that integrates data from multiple genomics platforms to identify axes of variation across breast cancer patients, and to connect these gene modules to potential therapeutic options. In this work we find modules describing variation in the tumor microenvironment and activation of different cellular processes. We also illustrate the challenges and pitfalls of translating between model systems and patients, as many gene modules are poorly conserved when moving between datasets.</p><p> A second problem is that cancer cells are constantly evolving, and many treatments inevitably lead to resistance as new mutations arise or compensatory systems are activated. To overcome this we must find rational combinations that will prevent resistant adaptation before it can start. Starting in chapter 3 I present a series of projects in which we used a high-throughput proteomics approach to characterize the activity of a large proportion of protein kinases, ending with the discovery of a promising drug combination for the treatment of breast cancer in chapter 8.</p><p>
abstract: Study of canine cancer’s molecular underpinnings holds great potential for informing veterinary and human oncology. Sporadic canine cancers are highly abundant (~4 million diagnoses/year in the United States), and the dog’s unique genomic architecture due to selective inbreeding, alongside the high similarity between dog and human genomes, both confer power for improving understanding of cancer genes. However, characterization of canine cancer genome landscapes has been limited. It is hindered by lack of canine-specific tools and resources. To enable robust and reproducible comparative genomic analysis of canine cancers, I have developed a workflow for somatic and germline variant calling in canine cancer genomic data. I have first adapted a human cancer genomics pipeline to create a semi-automated canine pipeline used to map genomic landscapes of canine melanoma, lung adenocarcinoma, osteosarcoma and lymphoma. This pipeline also forms the backbone of my novel comparative genomics workflow. Practical impediments to comparative genomic analysis of dog and human include challenges identifying similarities in mutation type and function across species. For example, canine genes could have evolved functions that differ from those of their human orthologs. Hence, I undertook a systematic statistical evaluation of dog and human cancer genes and assessed functional similarities and differences between orthologs to improve understanding of the roles of these genes in cancer across species. I tested this pipeline on canine and human Diffuse Large B-Cell Lymphoma (DLBCL), given that canine DLBCL is the most comprehensively genomically characterized canine cancer. Logistic regression with genes bearing somatic coding mutations in each cancer was used to determine if conservation metrics (sequence identity, network placement, etc.) could explain co-mutation of genes in both species. 
Using this model, I identified 25 co-mutated and evolutionarily similar genes that may be compelling cross-species cancer genes. For example, PCLO was identified as a co-mutated, conserved gene; it had previously been identified as recurrently mutated in human DLBCL, but with an unclear role in oncogenesis. Further investigation of these genes might shed new light on the biology of lymphoma in dogs and humans, and this approach may more broadly serve to prioritize new genes for comparative cancer biology studies. / Dissertation/Thesis / Doctoral Dissertation Biomedical Informatics 2018
Studying Low Complexity Structures in Bioinformatics Data Analysis of Biological and Biomedical Data / Causey, Jason L. 02 June 2018
<p> Biological, biomedical, and radiological data tend to be large, complex, and noisy. Gene expression studies contain expression levels for thousands of genes and hundreds or thousands of patients. Chest Computed Tomography images used for diagnosing lung cancer consist of hundreds of 2-D image “slices”, each containing hundreds of thousands of pixels. Beneath the size and apparent complexity of many of these data are simple and sparse structures. These low complexity structures can be leveraged into new approaches to biological, biomedical, and radiological data analyses. Two examples are presented here. First, we present SparRec (Sparse Recovery), a new framework for imputation of GWAS data, based on a matrix completion (MC) model that takes advantage of the low rank and low number of co-clusters of GWAS matrices. SparRec is flexible enough to impute meta-analyses with multiple cohorts genotyped on different sets of SNPs, even without a reference panel. Compared with Mendel-Impute, another MC method, our low-rank based method achieves similar accuracy and efficiency even with up to 90% missing data; our co-clustering based method has advantages in running time. MC methods are shown to have advantages over statistics-based methods, including Beagle and fastPhase. Second, we demonstrate NoduleX, a method for predicting lung nodule malignancy from chest Computed Tomography (CT) data, based on deep convolutional neural networks. For training and validation, we analyze >1000 lung nodules in images from the LIDC/IDRI cohort and compare our results with classifications provided by four experienced thoracic radiologists who participated in the LIDC project. NoduleX achieves high accuracy for nodule malignancy classification, with an AUC of up to 0.99, commensurate with the radiologists’ analysis. 
Whether they are leveraged directly or extracted using mathematical optimization and machine learning techniques, low complexity structures provide researchers with powerful tools for taming complex data. </p><p>
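The low-rank idea behind matrix completion can be shown with a toy sketch. It is not SparRec's algorithm; it assumes the simplest possible case, a rank-1 matrix fitted by alternating least squares, with missing entries marked as `None`. The function name and the iteration count are hypothetical.

```python
def complete_rank1(matrix, n_iter=200):
    """Fill missing entries (None) with a rank-1 fit u[i] * v[j].

    Alternating least squares: fix v and solve for each u[i] using only the
    observed entries in row i, then fix u and solve for each v[j].
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iter):
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            denom = sum(v[j] ** 2 for j in obs)
            if denom > 0:  # keep the old value if the row has no usable data
                u[i] = sum(matrix[i][j] * v[j] for j in obs) / denom
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            denom = sum(u[i] ** 2 for i in obs)
            if denom > 0:
                v[j] = sum(matrix[i][j] * u[i] for i in obs) / denom
    # Observed entries are returned unchanged; only gaps are imputed.
    return [
        [matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
         for j in range(n_cols)]
        for i in range(n_rows)
    ]
```

For the matrix `[[1, 2], [2, 4], [3, None]]`, whose observed entries are consistent with rank 1, the missing value is recovered as approximately 6. Real genotype matrices would need a higher rank and regularization, which this sketch omits.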
Vohr, Samuel H.
14 January 2017
<p> Forensic scientists routinely use DNA for identification and to match samples with individuals. Although standard approaches are effective on a wide variety of samples in various conditions, issues such as low-template DNA samples and mixtures of DNA from multiple individuals pose significant challenges. Extreme examples of these challenges can be found in the field of ancient DNA, where DNA recovered from ancient remains is highly fragmented and marked by patterns of DNA-damage. Additionally, ancient libraries are often characterized by low endogenous DNA content and contaminating DNA from outside sources. As a result, standard forensics approaches, such as amplification of short-tandem repeats, are not effective on ancient samples. Alternatively, ancient DNA is routinely directly sequenced using high-throughput sequencing to survey the molecules that are present within a library. However, the resulting sequences are not easily compared for the purposes of identification, as each data set represents a random and, in some cases, non-overlapping, sample of the genome.</p><p> In this dissertation, I present two approaches for interpreting shotgun sequences that address two common issues in forensic and ancient DNA: extremely low nuclear genome coverage and mixtures of sequences from multiple individuals. First, I present an approach to test for a common source individual between extremely low-coverage sequence data sets that makes use of the vast number of single-nucleotide polymorphisms (SNPs) discovered by surveys of human genetic diversity. As almost no observed SNP positions will be common to both samples, our method uses patterns of linkage disequilibrium as modeled by a panel of haplotypes to determine whether observations made across samples are consistent with originating from a single individual. I demonstrate the power of this approach using coalescent simulations, downsampled high-throughput sequencing data and published ancient DNA data. 
Second, I present an approach for interpreting mixtures of mitochondrial DNA sequences from multiple individuals. Mixed DNA samples are common in forensics investigations, either from the direct nature of a case (e.g., a sample containing DNA from both a victim and a perpetrator) or from outside contamination. I describe an expectation maximization approach for detecting the mitochondrial haplogroups contributing to a mixture and partitioning fragments by haplogroup to reconstruct the underlying haplotypes. I demonstrate the approach’s feasibility, accuracy, and sensitivity on both <i>in silico</i> and <i>in vitro</i> sequence mixtures. Finally, I present the results of applying our mixture interpretation approach to ancient contact DNA recovered from ∼700-year-old moccasin and cordage samples.</p>
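The expectation maximization step described above can be sketched in a few lines. This is a generic mixture-proportion EM, not the dissertation's implementation: it assumes the per-fragment likelihoods under each candidate haplogroup are already computed and strictly positive, and the function names are hypothetical.

```python
def em_mixture(likelihoods, n_iter=100):
    """Estimate mixing proportions for a fixed set of components.

    likelihoods[i][k] is P(fragment i | component k), assumed precomputed
    and positive. Returns (proportions, responsibilities).
    """
    n = len(likelihoods)
    k = len(likelihoods[0])
    pi = [1.0 / k] * k  # start from uniform proportions
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each fragment.
        resp = []
        for row in likelihoods:
            weighted = [p * l for p, l in zip(pi, row)]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
        # M-step: each proportion becomes the average responsibility.
        pi = [sum(r[j] for r in resp) / n for j in range(k)]
    return pi, resp


def partition(resp):
    """Assign each fragment to its most probable component."""
    return [max(range(len(r)), key=r.__getitem__) for r in resp]
```

Given three fragments that favour the first component and one that favours the second, the estimated proportions sum to one, lean toward the first component, and `partition` separates the fragments accordingly.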
28 November 2018
<p> With advances in genome sequencing technology, datasets with large sample sizes can be generated relatively quickly and cheaply, especially compared to the past decade or so. We can utilize this data to analyze the associations between genetic variants and gene expression, and how that in turn relates to specific phenotypes. We will explore the impact of structural variants (SVs) on gene expression and microRNA expression in healthy individuals. This dissertation is an application of expression quantitative trait loci (eQTL) analysis techniques on several of these datasets, as well as a description of an eQTL analysis pipeline software package.</p><p>
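At its simplest, an eQTL test asks whether expression of a gene changes with the number of copies of a variant. The sketch below is not the pipeline described in the abstract; it assumes genotypes coded as 0, 1, or 2 copies and fits an ordinary least-squares line of expression on genotype. The function name is hypothetical.

```python
def eqtl_slope(genotypes, expression):
    """Least-squares slope and intercept of expression on genotype dosage."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        # No genotype variation means the effect cannot be estimated.
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    return slope, intercept
```

The slope estimates the change in expression per additional copy of the variant. A real analysis would also need covariates, a significance test, and correction for multiple testing.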