331 |
Predicting "Essential" Genes in Microbial Genomes: A Machine Learning Approach to Knowledge Discovery in Microbial Genomic Data
Palaniappan, Krishnaveni 01 January 2010
Essential genes constitute the minimal gene set of an organism that is indispensable for its survival under the most favorable conditions. Accurately identifying and predicting the genes essential for an organism's survival has both theoretical and practical relevance in genome biology and medicine. From a theoretical perspective, it provides insight into the minimal requirements for cellular life and plays a key role in the emerging field of synthetic biology; from a practical perspective, it facilitates efficient identification of potential drug targets (e.g., for antibiotics) in novel pathogens. However, characterizing the essential genes of an organism requires sophisticated experimental studies that are expensive and time-consuming. The goal of this research study was to investigate machine learning methods to accurately classify/predict "essential genes" in newly sequenced microbial genomes based solely on their genomic sequence data.
This study formulates the prediction of essential genes as a binary classification problem and systematically investigates the applicability of three different supervised classification methods to this task. In particular, Decision Tree (DT), Support Vector Machine (SVM), and Artificial Neural Network (ANN) based classifier models were constructed and trained on genomic features derived solely from gene sequence data of 14 experimentally validated microbial genomes whose essential genes are known. A set of 52 relevant features derived from genomic sequence (including gene and protein sequence features, protein physicochemical features and protein sub-cellular features) was used as input for the learners. The training and test datasets used in this study reflected the between-class imbalance (i.e. a skewed majority class vs. minority class) that is intrinsic to this data domain and to the essential gene prediction problem. Two imbalance reduction techniques (homology reduction and random undersampling of 50% of the majority class) were devised to reduce the imbalance without artificially balancing the datasets or compromising classifier generalizability. The classifier models were trained and evaluated using 10-fold stratified cross validation on both the full multi-genome datasets and their class-imbalance-reduced variants to assess their ability to discriminate essential genes from non-essential genes. In addition, the classifiers were evaluated using novel blind testing strategies, the LOGO (Leave-One-Genome-Out) and LOTO (Leave-One-Taxon group-Out) tests, on carefully constructed held-out datasets (genome-wise for LOGO and taxonomic group-wise for LOTO) that were not used in training the classifier models. The prediction performance metrics accuracy, sensitivity, specificity, precision and area under the Receiver Operating Characteristic curve (AU-ROC) were assessed for the DT, SVM and ANN derived models.
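The Leave-One-Genome-Out split and the 50% random undersampling described above can be sketched as follows. This is a minimal illustration and not the thesis's code: the record layout `(genome_id, features, label)`, the function names, the toy data, and the convention that essential genes are the minority class labelled `1` are all assumptions made for the example.

```python
import random
from collections import defaultdict


def leave_one_genome_out(records):
    """Yield (held_out_genome, train, test) splits.

    `records` is a list of (genome_id, features, label) tuples (assumed layout).
    Every gene from the held-out genome goes to the test set; genes from all
    other genomes form the training set.
    """
    by_genome = defaultdict(list)
    for rec in records:
        by_genome[rec[0]].append(rec)
    for held_out in sorted(by_genome):
        test = by_genome[held_out]
        train = [r for g, recs in by_genome.items() if g != held_out for r in recs]
        yield held_out, train, test


def undersample_majority(records, fraction=0.5, seed=0):
    """Keep every minority record (label 1) and a random `fraction` of the
    majority records (label 0). The classes stay imbalanced, only less so."""
    rng = random.Random(seed)
    minority = [r for r in records if r[2] == 1]
    majority = [r for r in records if r[2] == 0]
    kept = rng.sample(majority, int(len(majority) * fraction))
    return minority + kept


# Toy data: three hypothetical genomes, one essential gene (label 1) each.
toy = [
    ("genomeA", [0.1], 1), ("genomeA", [0.4], 0), ("genomeA", [0.5], 0),
    ("genomeB", [0.2], 1), ("genomeB", [0.9], 0),
    ("genomeC", [0.3], 0), ("genomeC", [0.7], 0), ("genomeC", [0.8], 1),
]

for genome, train, test in leave_one_genome_out(toy):
    print(genome, "train:", len(train), "test:", len(test))
```

A Leave-One-Taxon group-Out split would follow the same pattern with a taxonomic group identifier in place of the genome identifier.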
Empirical results from 10 × 10-fold stratified cross validation, Leave-One-Genome-Out (LOGO) and Leave-One-Taxon group-Out (LOTO) blind testing experiments indicate that SVM and ANN based models perform better than Decision Tree based models. In 10 × 10-fold cross validation, the SVM based models achieved an AU-ROC score of 0.80, while ANN and DT achieved 0.79 and 0.68 respectively. Both LOGO (genome-wise) and LOTO (taxon-wise) blind tests revealed the extent to which these classifiers generalize across different genomes and taxonomic orders.
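For readers unfamiliar with the reported metrics, the sketch below computes accuracy, sensitivity, specificity, precision and AU-ROC on made-up labels and scores. It is an illustration only: the function names and data are assumptions, AU-ROC is computed through its rank interpretation (the fraction of positive/negative pairs in which the positive scores higher, ties counting one half), and the code assumes both classes are present and at least one positive prediction is made.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and precision from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # recall on essential genes
        "specificity": tn / (tn + fp),   # recall on non-essential genes
        "precision": tp / (tp + fp),
    }


def au_roc(y_true, scores):
    """Area under the ROC curve as the probability that a random positive
    outranks a random negative (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 2 essential genes (label 1) and 3 non-essential genes.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = confusion_metrics(y_true, y_pred)
auc = au_roc(y_true, scores)
print(metrics, round(auc, 3))
```

Because AU-ROC depends only on the ranking of scores, it is less sensitive to the choice of decision threshold than accuracy, which matters for imbalanced data such as this.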
This study empirically demonstrated the merits of applying machine learning methods to predict essential genes in microbial genomes using only the gene sequence and features derived from it. It also demonstrated that essential genes can be predicted from sequence-derived features without using homology information. LOGO and LOTO blind test results reveal that the trained classifiers do generalize across genomes and taxonomic boundaries, and they provide a first critical estimate of predictive performance on microbial genomes. Overall, this study provides a systematic assessment of applying DT, ANN and SVM to this prediction problem.
An important potential application of this study will be to apply the resultant predictive model/approach and integrate it as a genome annotation pipeline method for comparative microbial genome and metagenome analysis resources such as the Integrated Microbial Genome Systems (IMG and IMG/M).
|
332 |
Estudo de bioinformática aplicado à análise de expressão gênica utilizando dados oriundos de sequenciamento por tecnologia de "Next-Generation" em animais controle e em modelos de epilepsia do lobo temporal mesial / Bioinformatics study applied to gene expression analysis using data from sequencing by "Next-Generation" technology in control animals and in models of epilepsy of mesial temporal lobe
Brumatti Gonçalves, Kátia Cristiane, 1976- 27 August 2018
Advisors: Íscia Teresinha Lopes Cendes, Cristiane de Souza Rocha / Dissertation (master's) - Universidade Estadual de Campinas, Faculdade de Ciências Médicas / Issue date: 2015
Abstract: The field of bioinformatics associated with Next Generation Sequencing (NGS) is still immature. The microarray technique has been widely used in recent decades to study gene expression levels, but it has limitations. RNA sequencing (RNA-Seq) has advantages over current approaches because it allows the whole transcriptome to be surveyed at high throughput, which makes it useful for studying complex transcriptomes; it also allows the analysis of alternative splicing. Many tools have been developed to address different aspects of RNA-Seq data analysis, and that analysis remains a constant challenge. In this context, the objective of this study was to use bioinformatics methods for gene expression analysis of RNA-Seq data. To this end, raw data from two different experiments were used: a) normal animals, in which a comparative analysis was made of the hippocampal regions (CA1, CA2 and CA3) and the dentate gyrus, and b) pilocarpine-treated animals and control animals. Across the two experiments, three differentially expressed genes were found in common (Nnat, Sv2b and Neurod6), all of which are involved in the central nervous system.
In the alternative splicing analysis, the MISO (Mixture of Isoforms) tool gave better and more detailed results than the Cuffdiff-based pipeline, since MISO also quantifies transcripts; its results revealed 6 transcripts (Arpp21, Gria1, Gria2, Nrxn1, Dclk1 and Rtn1) common to the hippocampal regions that are highly expressed in the dentate gyrus. Many software packages for differential analysis are currently emerging; however, the pipeline used in this work is still one of the main tools for RNA-Seq analysis, because it uses reliable algorithms and allows the analyses to be adapted when necessary. This study proposed a pipeline for differential expression analysis and alternative splicing identification for data obtained by RNA-Seq. A total of 5760 transcripts considered significantly expressed were identified, and the results suggest that 6 transcripts arise from alternative splicing. / Master's / Medical Pathophysiology / Master of Science
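The two quantities at the heart of the abstract above, differential expression and isoform usage, can be illustrated with a deliberately naive sketch. It is not what Cuffdiff or MISO compute: Cuffdiff applies its own statistical model to normalized abundances, and MISO infers its "percent spliced in" (Ψ) estimates with a Bayesian model. The function names, the pseudocount, and the read counts below are assumptions for illustration.

```python
import math


def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """Log2 ratio of mean expression between two conditions.
    The pseudocount (an assumption) avoids division by zero for silent genes."""
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


def naive_psi(inclusion_reads, exclusion_reads):
    """Naive 'percent spliced in': the share of informative reads that
    support inclusion of an exon. Returns None when there are no reads."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total


# Hypothetical numbers: expression 7 vs 1, and 30 inclusion vs 10 exclusion reads.
print(log2_fold_change(7, 1))   # log2(8 / 2)
print(naive_psi(30, 10))
```

A point estimate like this ignores read depth and mapping uncertainty, which is why dedicated tools report confidence intervals or significance tests alongside it.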
|
333 |
INTEGRATIVE OMICS REVEALS INSIGHTS INTO HUMAN LIVER DEVELOPMENT, DISEASE ETIOLOGY, AND PRECISION MEDICINE
Zhipeng Liu (8126406) 20 December 2019
Transcriptomic regulation of the human liver is a tightly controlled and highly dynamic process. Genetic and environmental influences on this process play pivotal roles in the development of multiple liver disorders. Although large-scale genomics studies of developed adult livers have yielded accumulating knowledge, the factors contributing to interindividual variability in pediatric livers remain largely uninvestigated. In the first two chapters of the present study, we addressed this question through an integrative analysis of both genetic variation and transcriptome-wide RNA expression profiles in a pediatric human liver cohort spanning developmental stages from embryonic to adulthood. Our systematic analysis revealed a transcriptome-wide transition from stem-cell-like to liver-specific profiles during the course of human liver development. Moreover, for the first time, we observed different genetic control of hepatic gene expression at different developmental stages. Motivated by the critical roles of genetic variation and development in regulating hepatic gene expression, we constructed robust predictive models to impute virtual liver gene expression from easily available genotype and demographic information. Our model is promising for improving both PK/PD modeling and disease diagnosis for pediatric patients. In the last two chapters of the study, we analyzed the genomics data in a more liver-disease-related context. Specifically, in the third chapter, we identified macrophage migration inhibitory factor (MIF) and its related pathways as potential targets underlying human liver fibrosis through an integrative omics analysis. In the last chapter, utilizing the largest publicly available GWAS summary data to date, we dissected the causal relationships among three important and clinically related metabolic diseases: non-alcoholic fatty liver disease (NAFLD), type 2 diabetes (T2D), and obesity. Our analysis suggested new subtypes and provided insights into the precision treatment or prevention of the three complex diseases. Taken together, through integrative analysis of multiple levels of genomics information, we improved the current understanding of human liver development and the pathogenesis of liver disorders, and drew implications for precision medicine.
|
334 |
Neural networks for imputation of missing genotype data : An alternative to the classical statistical methods in bioinformatics
Andersson, Alfred January 2020
In this project, two different machine learning models were tested in an attempt to impute missing genotype data from patients on two different panels. As the privacy of the patients had to be protected, initial training was done on data simulated from the 1000 Genomes Project. The first model consisted of two convolutional variational autoencoders, and the latent representations of the networks were shuffled to force the networks to find the same patterns in the two datasets. This model was unfortunately unsuccessful at imputing the missing data. The second model was based on a UNet structure and was more successful at the task of imputation. This model had one encoder for each dataset, making each encoder specialized in finding patterns in its own data. Further improvements are required for the model to be fully capable of imputing the missing data.
|
335 |
A Transcriptomic Exploration of Hawaiian Drosophilid Development and Evolution
Chenevert, Madeline M 20 December 2019
One in four known species of fruit flies inhabits the Hawaiian Islands. From a small number of colonizing flies, a wide range of species evolved, some of which managed to reverse-colonize continental environments. In order to explore the developmental pathways that separate the Hawaiian Drosophila proper from the Scaptomyza group, which contains the reverse-colonized species, the transcriptomes of one better-known species from each group, Scaptomyza anomala and Drosophila grimshawi, were analyzed to find changes in gene expression between the two groups. This study describes a novel transcriptome for S. anomala studies as well as unusual changes in gene expression in D. grimshawi relative to other species, revealing the priorities of both species in early development.
|
336 |
Differentiating Between a Protein and its Decoy Using Nested Graph Models and Weighted Graph Theoretical Invariants
Green, Hannah E 01 May 2017
To determine the function of a protein, we must know its 3-dimensional structure, which can be difficult to ascertain. Currently, predictive models are used to determine the structure of a protein from its sequence, but these models do not always predict the correct structure. To this end we use a nested graph model along with weighted invariants to minimize the errors and improve the accuracy of a predictive model to determine if we have the correct structure for a protein.
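To make the idea of a weighted graph invariant concrete, the sketch below summarizes a small weighted graph by a tuple of numbers that does not depend on how the vertices are named, and compares a "native" graph with a "decoy". The abstract does not say which invariants or which nested graph construction the thesis uses, so the invariants chosen here (vertex count, total edge weight, largest and smallest weighted degree), the function names, and the toy graphs are all assumptions.

```python
def weighted_degrees(edges):
    """Sum of incident edge weights for each vertex.
    `edges` is a list of (u, v, weight) tuples for an undirected graph."""
    deg = {}
    for u, v, w in edges:
        deg[u] = deg.get(u, 0.0) + w
        deg[v] = deg.get(v, 0.0) + w
    return deg


def invariant_vector(edges):
    """A small tuple of weighted invariants: (number of vertices,
    total edge weight, max weighted degree, min weighted degree)."""
    deg = weighted_degrees(edges)
    total_weight = sum(w for _, _, w in edges)
    return (len(deg), total_weight, max(deg.values()), min(deg.values()))


# Hypothetical contact graphs: vertices are residues, weights are contact strengths.
native = [("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 0.5)]
decoy = [("A", "B", 1.0), ("B", "C", 2.0)]   # same chain, one contact missing

print(invariant_vector(native))
print(invariant_vector(decoy))
```

Vectors like these could then serve as input features to a classifier that separates correct structures from decoys; that pairing is a suggestion and not a claim about the thesis.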
|
337 |
Streamlining user processes for a general data repository for life science in accordance with the FAIR principles
Asklöf, Anna January 2021
With the increasing amounts of data generated in life science, methods for data storage and sharing are being developed and implemented. Online data repositories are increasingly used for data sharing. The national Swedish platform Science for Life Laboratory (SciLifeLab) has decided to use an institutional data repository as a means of handling the increasing amounts of data generated at the platform. In this project, the system used for the institutional repository at SciLifeLab was studied and compared to implementations of the same system at other institutions in order to create user documentation for the repository. This documentation was created with the FAIR principles as guidance. Feedback on the guidelines was then sought from users, and the user documentation was improved based on the feedback received. Items published on the repository were evaluated using a FAIR evaluation tool called FAIR evaluation services. These results and their correlation with the items' records on the repository were then investigated. Of ten evaluated datasets, all except one scored exactly the same on the FAIR evaluation services tests. This could indicate that the tests used do not evaluate the aspects needed to capture the differences between these published items. Consequently, no conclusions can be drawn about the extent to which user documentation can increase the FAIRness of data.
|
338 |
Applications in computational structural biology: the generation of a protein modelling pipeline and the structural analysis of patient-derived mutations
Guzmán-Vega, Francisco J. 04 1900
Besides helping us advance the understanding of the physicochemical principles governing the three-dimensional folding of proteins and their mechanisms of action, the ability to build, evaluate, and optimize reliable 3D protein models has provided valuable tools for the development of different applications in the fields of biotechnology, medicine, and synthetic biology. The development of automated algorithms has made many of the current methodologies for protein modelling and visualization available to researchers from all backgrounds, without the need to be familiarized with the inner workings of their statistical and biophysical principles. However, there is still a lack in some areas where the learning curves are too steep for the methods to be widely used by the average non-programmer molecular biologist, or the implementation of the methods lacks key features to improve the interpretability and impact of their results.
Throughout this work, I focus on two different applications in the field of structural biology where computational methods provide useful tools to aid synthetic biology or medical research. The first application is the implementation of a pipeline to build models of protein complexes by joining structured domains with disordered linkers, in individual or multiple chains, and with the possibility of building symmetric structures. Its capabilities and performance in generating complex constructs are evaluated, and possible areas of improvement are described. The second, no less important, application involves the structural analysis of patient-derived protein mutants using protein modelling techniques and visualization tools, to elucidate the potential molecular basis for the patient's phenotype. The methodology for these analyses is described, along with the results and observations from 22 such cases in 13 different proteins. Finally, the need for a dedicated pipeline for the structure-based prediction of the effect of different types of mutations on the stability and function of proteins, complementary to available sequence-based approaches, is highlighted.
|
339 |
A suite of computational tools to interrogate sequence data with local haplotype analysis within complex Plasmodium infections and other microbial mixtures
Hathaway, Nicholas J. 19 March 2018
The rapid development of DNA sequencing technologies has opened up new avenues of research, including the investigation of population structure within infectious diseases (both within patients and between populations). In order to take advantage of these advances in technology and the generation of new types of data, novel bioinformatics tools are needed that won't succumb to artifacts introduced by the data generation, and thus provide accurate and precise results. To achieve this goal, I have created several tools.
First, SeekDeep, a pipeline for analyzing targeted amplicon sequencing datasets from various technologies, is able to achieve 1-base resolution even at low frequencies and read depths, allowing for accurate comparison between samples and the detection of important SNPs. Next, PathWeaver is a local haplotype assembler designed for complex infections and highly variable genomic regions with poor reference mapping. PathWeaver is able to create highly accurate haplotypes without generating chimeric assemblies. PathWeaver was applied to VAR2CSA, the key Plasmodium falciparum protein in pregnancy-associated malaria, which revealed population sub-structuring within the key binding domain of the protein that was observed to be present globally, and confirmed copy number variation. Finally, the program Carmen is able to utilize PathWeaver to augment the results from targeted amplicon approaches by reporting where and when local haplotypes have been found previously.
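What "within-sample haplotype frequencies at 1-base resolution" means can be shown with a deliberately naive sketch: collapse identical amplicon reads, discard sequences below a frequency cutoff, and renormalize. This is not SeekDeep's algorithm, which the abstract describes only at a high level and which has to distinguish true low-frequency variants from sequencing errors; the function name, the cutoff, and the reads are assumptions.

```python
from collections import Counter


def haplotype_frequencies(reads, min_fraction=0.1):
    """Collapse identical reads into haplotypes and return their relative
    frequencies, ignoring haplotypes rarer than `min_fraction` of all reads."""
    counts = Counter(reads)
    total = sum(counts.values())
    kept = {seq: n for seq, n in counts.items() if n / total >= min_fraction}
    kept_total = sum(kept.values())
    return {seq: n / kept_total for seq, n in kept.items()}


# Hypothetical amplicon reads from one mixed infection: two haplotypes that
# differ at a single base, plus one read treated here as noise.
reads = ["ACGT"] * 12 + ["ACGA"] * 7 + ["TCGT"] * 1
freqs = haplotype_frequencies(reads)
for seq, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(seq, round(f, 3))
```

A fixed cutoff like this trades sensitivity for specificity, which is the kind of artifact-versus-signal problem the tools in this thesis are designed to handle more carefully.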
These rigorously tested tools allow the analysis of local haplotype data from various technologies and approaches to provide accurate, precise and easily accessible results.
|
340 |
Predicting safe drug combinations with Graph Neural Networks (GNN)
Amanzadi, Amirhossein January 2021
Many people, especially in old age, take multiple drugs to treat complex or co-existing diseases. Identifying side effects caused by polypharmacy is crucial for reducing patient mortality and morbidity, which in turn improves quality of life. Since the space of possible drug combinations is immense, it is infeasible to examine them all in the lab. In silico models can offer a convenient solution; however, due to the lack of a sufficient amount of homogeneous data, it is difficult to develop models that are both reliable and scalable in their ability to accurately predict polypharmacy side effects. Recent advances in the field of representation learning have utilized the power of graph networks to harmonize information from heterogeneous biological databases and interactomes. This thesis takes advantage of those techniques and combines them with state-of-the-art Graph Neural Network algorithms to implement a deep learning pipeline capable of predicting the adverse drug reactions of any given pair of drugs.
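The core operation behind graph neural networks, message passing, can be sketched without any deep learning library. In the toy below, each node averages its own feature vector with those of its neighbours, and a drug pair is then scored by a sigmoid of the dot product of the two updated drug vectors. Nothing here is trained, and the abstract does not describe the thesis's actual architecture: the graph, the features, the mean aggregation, and the scoring rule are all assumptions.

```python
import math


def message_pass(features, neighbors):
    """One message-passing round: each node's new vector is the element-wise
    mean of its own vector and its neighbours' vectors."""
    updated = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in neighbors.get(node, [])]
        updated[node] = [sum(col) / len(group) for col in zip(*group)]
    return updated


def pair_score(embeddings, drug_a, drug_b):
    """Sigmoid of the dot product of two node embeddings, read here as an
    (untrained) score for a side effect of the drug pair."""
    dot = sum(x * y for x, y in zip(embeddings[drug_a], embeddings[drug_b]))
    return 1.0 / (1.0 + math.exp(-dot))


# Hypothetical graph: two drugs that share one protein target.
features = {"drugA": [1.0, 0.0], "drugB": [0.0, 1.0], "protein1": [1.0, 1.0]}
neighbors = {
    "drugA": ["protein1"],
    "drugB": ["protein1"],
    "protein1": ["drugA", "drugB"],
}

h = message_pass(features, neighbors)
score = pair_score(h, "drugA", "drugB")
print(h["drugA"], h["drugB"], round(score, 4))
```

Before message passing the two drug vectors are orthogonal; afterwards they overlap because both absorbed information from the shared target, which is the intuition for why graph structure helps in predicting interactions between drugs.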
|