11 |
A VISUALIZATION TOOL FOR CROSS-EXPERIMENT GENE EXPRESSION ANALYSIS OF C. ELEGANS
Xue, Lin 01 January 2007
Forty-six genomic gene expression studies of the free-living soil nematode C. elegans have been published. To facilitate exploratory analysis of those studies, we constructed a database containing all the published C. elegans expression datasets. A Perl CGI program, called Microarray Analysis Display (MAdisplay), allows gene expression clustergrams of any combination of entered genes and datasets to be viewed (http://elegans.uky.edu/gl/madisplay). Perl programs were used to preprocess the raw data from different sources into a common format and to transform the data to display the expression changes relative to each experiment's controls. Three hundred lists of genes from figures and tables were extracted from the publications and made available in the GeneLists database, which also contains Gene Ontology and KEGG gene lists. We used these tools to examine in a systematic fashion the mean expression of gene lists in the set of microarray and SAGE experiments. Seventy-nine percent of publication-derived gene lists show a strong expression change (p-value < 0.001) in more than one experiment, with the median being fourteen out of the 127 experiments that are derived from the forty-six publications. This indicates that groups of genes identified in one publication typically show an expression effect in many other experiments.
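The two transformations this abstract describes, expression relative to an experiment's controls and the mean expression of a gene list, can be illustrated with a minimal sketch. The original tools were written in Perl and their exact formulas are not given here, so the log2 ratio, the function names and the intensity values below are all assumptions made for illustration.

```python
import math
from statistics import mean


def log2_ratios(treated, control):
    """Per-gene log2(treated / control) for genes measured in both conditions.

    Assumption: a log2 ratio is used as the 'change relative to control';
    the abstract does not state the exact transformation.
    """
    return {
        gene: math.log2(treated[gene] / control[gene])
        for gene in treated
        if gene in control and treated[gene] > 0 and control[gene] > 0
    }


def gene_list_mean(ratios, gene_list):
    """Mean log2 ratio over the genes of a list that were actually measured."""
    values = [ratios[gene] for gene in gene_list if gene in ratios]
    return mean(values) if values else None


# Hypothetical intensities for a single experiment (not real data).
control = {"geneA": 100.0, "geneB": 200.0, "geneC": 50.0}
treated = {"geneA": 400.0, "geneB": 200.0, "geneC": 25.0}

ratios = log2_ratios(treated, control)
list_mean = gene_list_mean(ratios, ["geneA", "geneB"])
print(ratios)
print(list_mean)
```

Repeating the last step for every gene list and every experiment would give the kind of list-by-experiment table the abstract summarizes. The significance test behind the reported p-values is not reproduced in this sketch.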
|
12 |
Statistical Models for Next Generation Sequencing Data
Wang, Yiyi 03 October 2013
Three statistical models are developed to address problems in Next-Generation Sequencing data. The first two models are designed for RNA-Seq data and the third is designed for ChIP-Seq data. The first of the RNA-Seq models uses a Bayesian nonparametric model to detect genes that are differentially expressed across treatments. A negative binomial sampling distribution is used for each gene’s read count such that each gene may have its own parameters. Despite the consequent large number of parameters, parsimony is imposed by a clustering inherent in the Bayesian nonparametric framework. A Bayesian discovery procedure is adopted to calculate the probability that each gene is differentially expressed. A simulation study and real data analysis show that this method performs at least as well as existing leading methods in some cases. The second RNA-Seq model shares the framework of the first model, but replaces the usual random partition prior from the Dirichlet process by a random partition prior indexed by distances from Gene Ontology (GO). The use of the external biological information yields improvements in statistical power over the original Bayesian discovery procedure. The third model addresses the problem of identifying protein binding sites for ChIP-Seq data. An exact test via a stochastic approximation is used to test the hypothesis that the treatment effect is independent of the sequence count intensity effect. The sliding window procedure for ChIP-Seq data is followed. The p-value and the adjusted false discovery rate are calculated for each window. For the sites identified as peak regions, three candidate models are proposed for characterizing the bimodality of the ChIP-Seq data, and the stochastic approximation in Monte Carlo (SAMC) method is used for selecting the best of the three. Real data analysis shows that this method produces results comparable to those of other existing methods and is advantageous in identifying bimodality of the data.
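The window-level steps of the ChIP-Seq procedure can be sketched as follows. This is a minimal illustration only: the exact test via stochastic approximation is not reproduced, the Benjamini-Hochberg adjustment is assumed as the false discovery rate correction (the abstract does not name one), and the read positions and p-values are hypothetical.

```python
def sliding_window_counts(positions, genome_length, width, step):
    """Count reads whose start position falls in each window [start, start + width)."""
    counts = []
    for start in range(0, genome_length - width + 1, step):
        end = start + width
        counts.append((start, sum(start <= p < end for p in positions)))
    return counts


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: BH stands in for the 'adjusted false discovery rate'
    mentioned in the abstract.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical read start positions on a 40 bp toy genome.
windows = sliding_window_counts([5, 12, 14, 30], genome_length=40, width=10, step=10)
print(windows)

# Hypothetical per-window p-values from some test.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Windows whose adjusted value falls below a chosen cutoff would then be carried forward as candidate peak regions.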
|
13 |
Computational analysis of multilevel omics data for the elucidation of molecular mechanisms of cancer
Fatai, Azeez Ayomide January 2015
Philosophiae Doctor - PhD / Cancer is a group of diseases that arises from irreversible genomic and epigenomic alterations that result in unrestrained proliferation of abnormal cells. Detailed understanding of the molecular mechanisms underlying a cancer would aid the identification of most, if not all, genes responsible for its progression and the development of molecularly targeted chemotherapy. The challenge of recurrence after treatment shows that our understanding of cancer mechanisms is still poor. As a contribution to overcoming this challenge, we provide an integrative multi-omic analysis on glioblastoma multiforme (GBM) for which large data sets on different classes of genomic and epigenomic alterations have been made available in the Cancer Genome Atlas data portal. The first part of this study involves protein network analysis for the elucidation of GBM tumourigenic molecular mechanisms, identification of driver genes, prioritization of genes in chromosomal regions with copy number alteration, and co-expression and transcriptional analysis. Functional modules were obtained by edge-betweenness clustering of a protein network constructed from genes with predicted functional impact mutations and differentially expressed genes. Pathway enrichment analysis was performed on each module to identify statistical overrepresentation of signaling pathways. Known and novel candidate cancer driver genes were identified in the modules, and functionally relevant genes in chromosomal regions altered by homologous deletion or high-level amplification were prioritized with the protein network. Co-expressed modules enriched in cancer biological processes and transcription factor targets were identified using network genes that demonstrated high expression variance. Our findings show that GBM's molecular mechanisms are much more complex than those reported in previous studies.
We next identified differentially expressed miRNAs for which target genes associated with the protein network were also differentially expressed. MiRNAs and target genes were prioritized based on the number of targeted genes and targeting miRNAs, respectively. MiRNAs that correlated with time to progression were selected by an elastic net-penalized Cox regression model for survival analysis. These miRNAs were combined into a signature that independently predicted adjuvant therapy-linked progression-free survival in GBM and its subtypes and overall survival in GBM. The results show that miRNAs play significant roles in GBM progression and patients' survival. Finally, a prognostic mRNA signature that independently predicted progression-free and overall survival was identified. Pathway enrichment analysis was carried out on genes with high expression variance across a cohort to identify those in chemoradioresistance-associated pathways. A support vector machine-based method was then used to identify a set of genes that discriminated between rapidly and slowly progressing GBM patients, with a minimal cross-validation error rate of 5%. The prognostic value of the gene set was demonstrated by its ability to predict adjuvant therapy-linked progression-free and overall survival in GBM and its subtypes and was validated in an independent data set. We have identified a set of genes involved in tumourigenic mechanisms that could potentially be exploited as targets in drug development for the treatment of primary and recurrent GBM. Furthermore, given their demonstrated accuracy in this study, the identified miRNA and mRNA signatures have strong potential to be combined and developed into a robust clinical test for predicting prognosis and treatment response.
|
14 |
Automatic Protein Function Annotation Through Text Mining
Toonsi, Sumyyah 25 August 2019
Knowledge of a protein’s function is essential to many studies in molecular biology, genetic experiments and protein-protein interactions. The Gene Ontology (GO) captures gene products' functions in classes and establishes relationships between them. Manually annotating proteins with GO functions from the biomedical literature is a tedious process which calls for automation. We develop a novel, dictionary-based method to annotate proteins with functions from text. We extract text-based features from words matched against a dictionary of GO. Since classes are included upon any word match with their class description, negative samples outnumber positive ones. To mitigate this imbalance, we apply strict rules before weakly labeling the dataset according to the curated annotations. Furthermore, we discard samples of low statistical evidence and train a logistic regression classifier. The results of a 5-fold cross-validation show a high precision of 91% and an accuracy of 96% in the best-performing fold. The worst fold showed a precision of 80% and an accuracy of 95%. We conclude by explaining how this method can be used for similar annotation problems.
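The dictionary-matching step and the two reported metrics can be illustrated with a minimal sketch. It is not the author's pipeline: the tokenizer, the overlap score and the function names are assumptions, the three dictionary entries are a tiny illustrative subset, and the labels in the metric example are hypothetical.

```python
import re


def tokenize(text):
    """Lower-case a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def candidate_classes(sentence, go_dictionary):
    """Score every GO class whose description shares at least one word with the sentence.

    The score is the fraction of description words found in the sentence
    (an assumed feature, chosen only for illustration).
    """
    words = tokenize(sentence)
    scores = {}
    for go_id, description in go_dictionary.items():
        desc_words = tokenize(description)
        overlap = words & desc_words
        if overlap:
            scores[go_id] = len(overlap) / len(desc_words)
    return scores


def precision_and_accuracy(predicted, actual):
    """Precision and accuracy for two equally long lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if (tp + fp) else None
    return precision, (tp + tn) / len(actual)


# Tiny illustrative dictionary of GO class descriptions.
go_dictionary = {
    "GO:0003677": "DNA binding",
    "GO:0005515": "protein binding",
    "GO:0016301": "kinase activity",
}

scores = candidate_classes("The enzyme shows DNA binding in vitro.", go_dictionary)
print(scores)

# Hypothetical predictions and curated labels for five candidates.
metrics = precision_and_accuracy(
    [True, True, False, False, True],
    [True, False, False, False, True],
)
print(metrics)
```

The single shared word "binding" is enough to pull in the protein binding class as a candidate, which shows how matching on any word inflates the number of negative samples.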
|
15 |
Development of computational tools and resources for systems biology of bacterial pathogens
Kumar, Ranjit 06 August 2011
Bacterial pathogens are a major cause of disease in humans, agricultural plants and farm animals. Even after decades of research they remain a challenge to health care, as they are known to evolve rapidly and develop resistance to existing drugs. Systems biology is an emerging area of research where all of the components of a system, their interactions, and their dynamics can be studied in a comprehensive, quantitative, and integrative fashion to generate predictive models. When applied to bacterial pathogenesis, systems biology approaches will help identify potential novel molecular targets for drug discovery. A prerequisite for conducting systems analysis is the identification of the building blocks of the system, i.e. the individual components of the system (structural annotation), their functions (functional annotation) and the interactions among the individual components (interaction prediction). In the context of bacterial pathogenesis, it is necessary to identify the host-pathogen interactions. This dissertation describes computational resources that enable comprehensive systems-level study of host-pathogen systems to enhance our understanding of bacterial pathogenesis. It specifically focuses on improving the structural and functional annotation of pathogen genomes as well as identifying host-pathogen interactions at a genome scale. The novel contributions of this dissertation towards systems biology of bacterial pathogens include three computational tools/resources. “TAAPP” (Tiling array analysis pipeline for prokaryotes) is a web-based tool for the analysis of whole-genome tiling array data for bacterial pathogens. TAAPP helps improve the structural annotation of bacterial genomes. “ISO-IEA” (Inferred from sequence orthology - Inferred from electronic annotation) is a tool that can be used for the functional annotation of any sequenced genome.
“HPIDB” (Host pathogen interaction database) was developed with a data mining capability that includes host-pathogen interaction prediction. The new knowledge gained through the implementation of these tools is the description of the non-coding RNA as well as a computationally predicted host-pathogen interaction network for the human respiratory pathogen Streptococcus pneumoniae. In summary, the computational tools and resources developed in this dissertation study will enable building systems biology models of bacterial pathogens.
|
16 |
Genome Snapshot and Molecular Marker Development in Penstemon (Plantaginaceae)
Dockter, Rhyan B. 01 July 2011
Penstemon Mitchell (Plantaginaceae) is one of the largest, most diverse plant genera in North America. Its unique diversity, paired with its drought tolerance and overall hardiness, gives Penstemon a vast amount of potential in the landscaping industry, especially in the more arid western United States where its species naturally thrive. In order to develop Penstemon lines for more widespread commercial and private landscaping use, we must improve our understanding of the vast genetic diversity of the genus on a molecular level. In this study we utilize genome reduction and barcoding to optimize 454-pyrosequencing in four target species of Penstemon (P. cyananthus, P. davidsonii, P. dissectus and P. fruticosus). Sequencing and assembly produced contigs representing an average of 0.5% of the Penstemon species. From the sequence, SNP information and microsatellite markers were extracted. One hundred and thirty-three interspecific microsatellite markers were discovered, of which 50 met desired primer parameters and were of high quality with readable bands on 3% Metaphor gels. Of the microsatellite markers, 82% were polymorphic with an average heterozygosity value of 0.51. An average of one SNP in 2,890 bp per species was found within the individual species assemblies and one SNP in 97 bp was found between any two supposed homologous sequences of the four species. An average of 21.5% of the assembled contigs were associated with putative genes involved in cellular components, biological processes, and molecular functions. On average 19.7% of the assembled contigs were identified as repetitive elements, among which LTRs, DNA transposons and other unclassified repeats were discovered. Our study demonstrates the effectiveness of using the GR-RSC technique to selectively reduce the genome size to putative homologous sequence in different species of Penstemon.
It has also enabled us to gain greater insight into the microsatellite, SNP, putative gene and repetitive element content of the Penstemon genome, which provides essential tools for further genetic work including plant breeding and phylogenetics.
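The two marker statistics quoted in this abstract, heterozygosity and SNP spacing, can be illustrated with a short sketch. The abstract does not say which heterozygosity estimator was used, so expected heterozygosity (one minus the sum of squared allele frequencies) is an assumption here, and the allele counts and sequence lengths are hypothetical.

```python
def expected_heterozygosity(allele_counts):
    """Expected heterozygosity H = 1 - sum(p_i ** 2) for one marker.

    Assumption: this estimator is used for illustration only; the source
    does not state how its heterozygosity values were computed.
    """
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no alleles observed")
    return 1.0 - sum((count / total) ** 2 for count in allele_counts.values())


def bp_per_snp(aligned_bp, snp_count):
    """Average number of base pairs per SNP in an aligned region."""
    if snp_count == 0:
        raise ValueError("no SNPs observed")
    return aligned_bp / snp_count


# Hypothetical microsatellite marker with two allele sizes (in bp).
h = expected_heterozygosity({"150": 6, "154": 4})
print(round(h, 2))

# Hypothetical assembly: 10 SNPs found in 28,900 aligned bp.
spacing = bp_per_snp(28900, 10)
print(spacing)
```

Averaging the first quantity over all polymorphic markers gives a per-study value of the kind reported above.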
|
17 |
Improving structural and functional annotation of the chicken genome
Buza, Teresia 11 December 2009
Chicken is an important non-mammalian vertebrate model organism for biomedical research, especially for vaccine production and the study of embryology and development. Chicken is also an important agricultural species and a major source of high-quality protein worldwide. In addition, chicken is an important model organism for comparative and evolutionary genomics. Exploitation of this genome as a biomedical model is hindered by its incomplete structural and functional annotation. This incomplete annotation makes it difficult for researchers to model their functional genomics datasets. Improving structural and functional annotation of the chicken genome will allow researchers to derive biological meaning from their functional genomics datasets. The objectives of this study were to identify proteins expressed in multiple chicken tissues, to functionally annotate experimentally confirmed proteins expressed in different chicken tissues, to quantify and assess the Gene Ontology (GO) annotation quality, and to facilitate functional annotation of microarray data. The results of this research have proven to be a fundamental resource for improving the structural and functional annotation of the chicken genome. Specifically, we have improved the structural annotation of the chicken genome by adding support to predicted proteins. In addition, we have improved the functional annotation of the chicken genome by assigning useful biological information to proteomics datasets and the whole-genome chicken array. The Gene Ontology Annotation Quality (GAQ) and Array GO Mapper (AGOM) tools developed in this study will continue to facilitate functional modeling of chicken arrays and high-throughput experimental datasets from microarray and proteomics studies.
The ultimate positive impact of these results is to provide the field of biomedical research with useful information for comparative biology, a better understanding of chicken biological systems and diseases, drug discovery and eventually the development of therapies.
|
18 |
Mathematical and Experimental Investigation of Ontological Similarity Measures and Their Use in Biomedical Domains
Yu, Xinran 18 August 2010
No description available.
|
19 |
Méthodes sémantiques pour la comparaison inter-espèces de voies métaboliques : application au métabolisme des lipides chez l'humain, la souris et la poule / Semantic methods for the cross-species metabolic pathways comparison: application to human, mice and chicken lipid metabolism
Bettembourg, Charles 16 December 2013
Cross-species comparison of metabolic pathways is an important task in biology. It is a major issue for both human health and agronomy. Currently, knowledge is acquired from experiments on a relatively small number of species referred to as "models". A better understanding of a species makes it possible to validate or reject an inference made from these experimental data, and to determine whether, or to what extent, results obtained on a model species can be transposed to another species. This thesis proposes a method for cross-species comparison of metabolic pathways. The method compares each step of a metabolic pathway using the associated Gene Ontology annotations. This work validates the value of semantic similarity measures for interpreting these annotations, proposes the joint use of a semantic particularity measure, and proposes a method based on similarity and particularity patterns to interpret each metabolic pathway step.
Several gene products are involved throughout a metabolic pathway. They are associated with annotations that describe their biological roles. Because they are based on a shared ontology, these annotations make it possible to compare data from different species and to take several levels of abstraction into account. Several semantic measures that quantify the similarity between gene products from their annotations have been developed previously. We identified and used a semantic similarity measure appropriate for cross-species comparisons. Because they focus on the part common to the compared gene products, semantic similarity measures ignore the characteristics specific to a single gene product. Cross-species comparison of metabolic pathways, however, has to quantify not only the similarity of the gene products involved, but also their particularities. We developed a semantic particularity measure addressing this issue. For each pathway step, we compute a profile combining its semantic similarity value and its two semantic particularity values. Regarding the interpretation of the results, it is not possible to establish formally that two gene products are similar, or that one of them has significant particularities, without a similarity threshold and a particularity threshold. So far, these interpretations were based on an implicit or arbitrary threshold. To address this gap, we developed a method for defining thresholds for the semantic similarity and particularity measures. Finally, we applied a cross-species similarity measure and our particularity measure to compare lipid metabolism among human, mouse and chicken. We then interpreted the results using the previously defined thresholds. In all three species, we observed particularities, including for similar gene products. They notably concerned biological processes and cellular components. The molecular functions show strong similarity and few particularities. These results are biologically relevant.
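The idea of a per-step profile (one similarity value and two particularity values, read against thresholds) can be illustrated with a deliberately simplified sketch. It does not implement the measures used or developed in the thesis: the set-overlap similarity, the particularity defined as the fraction of a product's terms absent from the other, the toy ontology and the threshold values are all assumptions chosen for illustration.

```python
def with_ancestors(terms, parents):
    """Return the given ontology terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def step_profile(terms_a, terms_b, parents):
    """Return (similarity, particularity_a, particularity_b) for two gene products.

    Assumed definitions, not the thesis's measures: similarity is the
    Jaccard index of the ancestor-closed term sets; particularity is the
    fraction of one product's terms that the other product lacks.
    """
    a = with_ancestors(terms_a, parents)
    b = with_ancestors(terms_b, parents)
    if not a or not b:
        raise ValueError("both gene products need at least one annotation")
    return len(a & b) / len(a | b), len(a - b) / len(a), len(b - a) / len(b)


def interpret(profile, sim_threshold, par_threshold):
    """Turn a profile into three yes/no readings using explicit thresholds."""
    similarity, par_a, par_b = profile
    return similarity >= sim_threshold, par_a >= par_threshold, par_b >= par_threshold


# Toy ontology: child term -> list of parent terms (hypothetical labels).
parents = {"A": ["root"], "B": ["A"], "C": ["root"]}

profile = step_profile({"B"}, {"B", "C"}, parents)
print(profile)

# Hypothetical thresholds; the thesis derives its thresholds by its own method.
reading = interpret(profile, sim_threshold=0.5, par_threshold=0.2)
print(reading)
```

In this toy case the two products read as similar while the second one still shows a particularity, which is the kind of pattern the abstract reports for similar gene products.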
|
20 |
Machine Learning Methods for Microarray Data Analysis
Gabbur, Prasad January 2010
Microarrays emerged in the 1990s as a consequence of the efforts to speed up the process of drug discovery. They revolutionized molecular biological research by enabling monitoring of thousands of genes together. Typical microarray experiments measure the expression levels of a large number of genes on very few tissue samples. The resulting sparsity of data presents major challenges to statistical methods used to perform any kind of analysis on this data. This research posits that phenotypic classification and prediction serve as good objective functions for both optimization and evaluation of microarray data analysis methods. This is because classification measures what is needed for diagnostics and provides quantitative performance measures such as leave-one-out (LOO) or held-out prediction accuracy and confidence. Under the classification framework, various microarray data normalization procedures are evaluated using a class label hypothesis testing framework and also employing Support Vector Machines (SVM) and linear discriminant based classifiers. A novel normalization technique based on minimizing the squared correlation coefficients between expression levels of gene pairs is proposed and evaluated along with the other methods. Our results suggest that most normalization methods helped classification on the datasets considered except the rank method, most likely due to its quantization effects. Another contribution of this research is in developing machine learning methods for incorporating an independent source of information, in the form of gene annotations, to analyze microarray data. Recently, genes of many organisms have been annotated with terms from a limited vocabulary called Gene Ontologies (GO), describing the genes' roles in various biological processes, molecular functions and their locations within the cell. Novel probabilistic generative models are proposed for clustering genes using both their expression levels and GO tags.
These models are similar in essence to the ones used for multimodal data, such as images and words, with learning and inference done in a Bayesian framework. The multimodal generative models are used for phenotypic class prediction. More specifically, the problems of phenotype prediction for static gene expression data and state prediction for time-course data are emphasized. Using GO tags for organisms whose genes have been studied more comprehensively leads to an improvement in prediction. Our methods also have the potential to provide a way to assess the quality of available GO tags for the genes of various model organisms.
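Leave-one-out (LOO) accuracy, the evaluation measure named in this abstract, can be illustrated with a minimal sketch. A nearest-centroid classifier is swapped in here for the SVM and linear discriminant classifiers used in the work, purely to keep the example dependency-free, and the expression values and class labels are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to the sample.

    Stand-in classifier (an assumption), not the SVM or linear
    discriminant classifiers of the original work.
    """
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], sample))

    return min(sums, key=squared_distance)


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, and report the hit rate."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical two-gene expression profiles for six tissue samples.
xs = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
ys = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

loo = leave_one_out_accuracy(xs, ys)
print(loo)
```

Running the same loop on data processed by different normalization procedures would give one accuracy per procedure, which is the style of comparison the abstract describes.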
|