11 |
Multiple testing using the posterior probability of half-space: application to gene expression data
Labbe, Aurelie January 2005
We consider the problem of testing the equality of two sample means when the number of tests performed is large. In the context of gene expression data, our goal is to detect a set of genes that are differentially expressed under two treatments or two biological conditions. A null hypothesis of no difference in gene expression under the two conditions is constructed. Since such a hypothesis is tested for each gene, thousands of tests are performed simultaneously, and multiple testing issues arise. The aim of our research is to connect Bayesian analysis and frequentist theory in the context of multiple comparisons by deriving some properties shared by p-values and posterior probabilities. The ultimate goal of this work is to use the posterior probability of the one-sided alternative hypothesis (or, equivalently, the posterior probability of the half-space) in the same spirit as a p-value. We show, for instance, that such a Bayesian probability can be used as an input to some standard multiple testing procedures that control the false discovery rate.
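As an illustration of the kind of standard procedure the abstract refers to, the following is a minimal sketch of the Benjamini-Hochberg step-up rule, one common way to control the false discovery rate. It is not taken from the thesis: the function name `benjamini_hochberg` and the example scores are assumptions, and the input is simply a list of per-gene scores that behave like p-values (the abstract proposes using posterior probabilities of the half-space in this role).

```python
def benjamini_hochberg(scores, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up rule.

    `scores` holds one p-value-like score per hypothesis (hypothetical input).
    """
    m = len(scores)
    # Sort hypothesis indices by increasing score.
    order = sorted(range(m), key=lambda i: scores[i])
    # Find the largest rank k whose score is at most k * alpha / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if scores[idx] <= rank * alpha / m:
            k = rank
    # Reject the k hypotheses with the smallest scores.
    return sorted(order[:k])


example_scores = [0.001, 0.008, 0.039, 0.041, 0.042,
                  0.060, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(example_scores, alpha=0.05)
```

With these example scores only the two smallest fall under their rank-dependent thresholds, so `rejected` contains indices 0 and 1.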
|
12 |
A New Reclassification Method for Highly Uncertain Microarray Data in Allergy Gene Prediction
Paul, Jasmin 11 April 2012
The analysis of microarray data is a challenging task because of the large dimensionality and small sample size involved. Although a few methods are available to address the problem of small sample size, they are not sufficiently successful in dealing with microarray data from extremely small samples (fewer than about 20). We propose a method that incorporates information from diverse sources into the analysis of microarray data so as to improve the predictability of significant genes. A transformed data set, including statistical parameters, literature mining and gene ontology data, is evaluated. We performed classification experiments to identify potential allergy-related genes. Feature selection is used to identify the effect of features on classifier behaviour.
An exploratory and domain knowledge analysis was performed on noisy real-life allergy data, and subsets of genes were selected as the positive and negative classes. A new set of transformed variables, depending on the mean and standard deviation statistics of the data distribution and on other data sources, was identified. Significant allergy- and immune-related genes from the microarray data were selected. Experiments showed that the classification predictability of significant genes can be improved. Important features from the transformed variable set were also identified.
|
13 |
Development and application of analysis modules in MADIBA, a Web-based toolkit for the interpretation of microarray data
Law, Philip John 12 August 2009
Microarray technology makes it possible to identify changes in the gene expression of an organism under various conditions. The challenge for researchers who employ microarray expression profiling is to derive biological meaning from the data once pre-processing is completed and a cluster of co-expressed genes has been obtained. Data mining is thus essential for deducing significant biological information such as the identification of new biological mechanisms or putative drug targets. While many algorithms and software packages have been developed for analysing gene expression, the extraction of relevant information from experimental data is still a substantial challenge, requiring significant time and skill. MADIBA (MicroArray Data Interface for Biological Annotation) facilitates the assignment of biological meaning to gene expression clusters by automating the post-processing stage. A relational database has been designed to store the data from gene to pathway for Plasmodium falciparum, Oryza sativa (rice), Arabidopsis thaliana, and Pectobacterium atrosepticum (Pba). As input, the user submits a cluster of genes, as either gene identifiers or gene sequences. Tools within the web interface allow rapid analyses: identifying the Gene Ontology terms relevant to each cluster, visualising the metabolic pathways in which the gene products are implicated, their genomic localisations, and putative common transcriptional regulatory elements in the upstream sequences, and running an analysis specific to the organism being studied. The user has the option of outputting selected results of the analyses in either PDF or plain text format. MADIBA is an integrated, online tool that will assist researchers in interpreting their results and understanding the meaning of the co-expression of a cluster of genes.
MADIBA was used to analyse a number of gene clusters from several experiments: expression profiling of the Plasmodium falciparum life cycle, a Ralstonia solanacearum infection of Arabidopsis thaliana, a rice treatment with BTH, a millet SA- and MeJ-treatment experiment, and an expI mutant experiment in Pectobacterium atrosepticum. Data from Plasmodium falciparum and rice were used to illustrate MADIBA's functionality. For the A. thaliana analyses, the DRASTIC database was implemented to identify how genes respond to various treatments. In addition, a method named PCA Experiment Comparer was developed, which compares the expression values of the numerous experiments in NASCArrays. Using the A. thaliana-R. solanacearum interaction data, several related experiments matched in both the susceptible and resistant interactions. In the millet analyses, besides the defence-related genes identified, several genes involved in photosynthesis were also found, possibly suggesting a relation between light and defence signalling. The Pba data identified genes involved in quorum sensing, as well as some associated genes with no known function that may also be related to this regulatory process. With the advent of whole-genome microarray chips and an increasing number of organisms being sequenced, tools such as MADIBA will become even more significant in understanding the underlying biology. MADIBA provides access to several genomic data sources and analyses, allowing users to quickly annotate and visualise the results. MADIBA is freely available and can be accessed at http://www.bi.up.ac.za/MADIBA/. Copyright / Dissertation (MSc)--University of Pretoria, 2009. / Biochemistry / unrestricted
|
14 |
Microarray data analysis methods and their applications to gene expression data analysis for Saccharomyces cerevisiae under oxidative stress
Sha, Wei 12 June 2006
Oxidative stress is a harmful condition in a cell, tissue, or organ, caused by an imbalance between reactive oxygen species or other oxidants and the capacity of antioxidant defense systems to remove them. These oxidants cause wide-ranging damage to macromolecules, including proteins, lipids, DNA and carbohydrates. Oxidative stress is an important pathophysiologic component of a number of diseases, such as Alzheimer's disease, diabetes and certain cancers. Cells contain effective defense mechanisms to respond to oxidative stress. Despite much accumulated knowledge about these responses, their kinetics, especially the kinetics of early responses, are still not clearly understood.
The Yap1 transcription factor is crucial for the normal response to a variety of stress conditions, including oxidative stress. Previous studies of Yap1 regulation began measuring gene expression profiles at least 20 minutes after the induction of oxidative stress. Genes and pathways regulated by Yap1 in the early oxidative stress response (within 20 minutes) were not identified in these studies.
Here we study the kinetics of the early oxidative stress response induced by cumene hydroperoxide (CHP) in Saccharomyces cerevisiae wild type and yap1 mutant. Gene expression profiles after exposure to CHP were obtained under controlled conditions using Affymetrix Yeast Genome S98 arrays. The oxidative stress response was measured at 8 time points over the 120 minutes following the addition of CHP, with the earliest time point 3 minutes after exposure. Statistical analysis methods, including ANOVA, k-means clustering analysis, and pathway analysis, were used to analyze the data. The results of this study provide a dynamic resolution of the oxidative stress responses in S. cerevisiae and contribute to a richer understanding of the antioxidant defense systems. They also provide a global view of the roles that Yap1 plays under normal and oxidative stress conditions. / Ph. D.
|
15 |
Redução dimensional de dados de alta dimensão e poucas amostras usando Projection Pursuit / Dimension reduction of datasets with large dimensionalities and few samples using Projection Pursuit
Espezua Llerena, Soledad 30 July 2013
Reducing the dimension of datasets is an important step in pattern recognition and machine learning. Projection Pursuit (PP) has emerged as a relevant technique for this purpose: it searches for projections of the data onto low-dimensional spaces in which interesting structures are revealed. Despite the relative success of PP in many dimension reduction problems, the literature shows only limited application of it to datasets with many features and few samples, such as those generated in molecular biology. In this thesis we study ways to exploit the potential of PP in problems of high dimensionality and few samples in order to ease the subsequent construction of classifiers. Among the main contributions of this work are: i) Sequential Projection Pursuit Modified (SPPM), a sequential method for searching projection spaces, based on a genetic algorithm (GA) and specialized crossover operators; and ii) Block Sequential Projection Pursuit Modified (Block-SPPM) and Whitened Sequential Projection Pursuit Modified (W-SPPM), two strategies for applying SPPM to problems with more attributes than samples. The first strategy is based on partitioning the attributes, while the latter is based on a pre-compaction of the data followed by a projection search. Experimental evaluations on public gene expression datasets showed the efficacy of the proposals in improving the accuracy of popular classification algorithms relative to several other dimension reduction methods, covering both feature selection and feature extraction, with W-SPPM offering the best compromise between accuracy and computational cost.
|
16 |
Detection and characterization of 3D-signature phosphorylation site motifs and their contribution towards improved phosphorylation site prediction in proteins
Durek, Pawel, Schudoma, Christian, Weckwerth, Wolfram, Selbig, Joachim, Walther, Dirk January 2009
Background:
Phosphorylation of proteins plays a crucial role in the regulation and activation of metabolic and signaling pathways and constitutes an important target for pharmaceutical intervention. Central to the phosphorylation process is the recognition of specific target sites by protein kinases followed by the covalent attachment of phosphate groups to the amino acids serine, threonine, or tyrosine. The experimental identification as well as computational prediction of phosphorylation sites (P-sites) has proved to be a challenging problem. Computational methods have focused primarily on extracting predictive features from the local, one-dimensional sequence information surrounding phosphorylation sites.
Results:
We characterized the spatial context of phosphorylation sites and assessed its usability for improved phosphorylation site predictions. We identified 750 non-redundant, experimentally verified sites with three-dimensional (3D) structural information available in the Protein Data Bank (PDB) and grouped them according to their respective kinase family. We studied the spatial distribution of amino acids around phosphoserines, phosphothreonines, and phosphotyrosines to extract signature 3D-profiles. Characteristic spatial distributions of amino acid residue types around phosphorylation sites were indeed discernible, especially when kinase-family-specific target sites were analyzed. To test the added value of using spatial information for the computational prediction of phosphorylation sites, Support Vector Machines were applied using both sequence and structural information. Compared to sequence-only prediction methods, a small but consistent performance improvement was obtained when the prediction was informed by 3D-context information.
Conclusion:
While local one-dimensional amino acid sequence information was observed to harbor most of the discriminatory power, spatial context information was identified as relevant for the recognition of kinases and their cognate target sites and can be used for an improved prediction of phosphorylation sites. A web-based service (Phos3D) implementing the developed structure-based P-site prediction method has been made available at http://phos3d.mpimp-golm.mpg.de.
|
17 |
Mixed Effects Models For Time Series Gene Expression Data
Erkan, Ibrahim 01 December 2011
Experimental factors such as the cell type and the treatment may have different impacts on the expression levels of individual genes, which are quantitative measurements obtained from microarrays. The measurements may be collected at a few unevenly spaced time points with replicates. The aim of this study is to take cell type, treatment and short time series attributes into account and to infer their effects on individual genes. A linear mixed effects (LME) model was proposed for the gene expression data, and the performance of the model was validated by a simulation study. Realistic data sets were generated that preserve the structure of the real-life sample data studied by Nymark et al. (2007). The predictive performance of the model was evaluated by measures such as accuracy, sensitivity and specificity, and was compared to that of a competing method, Limma (Smyth, 2004). Both methods were also compared on real-life data. Simulation results showed that the predictive performance of LME is as high as 99% and that it produces a False Discovery Rate (FDR) as low as 0.4%, whereas Limma has an FDR of at least 32%. Moreover, LME has almost 99% predictive capability on the continuous time parameter, where Limma reaches only about 67% and cannot handle continuous independent variables at all.
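The performance measures named in this abstract can be made concrete with a short sketch. The code below is an illustrative assumption, not the study's evaluation code: the function name `performance_measures` and the toy gene labels are hypothetical, and the definitions are the standard confusion-matrix ones, with FDR taken as the fraction of called genes that are false positives.

```python
def performance_measures(truth, called):
    """Compute accuracy, sensitivity, specificity and FDR.

    `truth[i]` is True if gene i is truly affected (known in a simulation);
    `called[i]` is True if the method flags gene i. Both are hypothetical inputs.
    """
    pairs = list(zip(truth, called))
    tp = sum(1 for t, c in pairs if t and c)          # true positives
    fp = sum(1 for t, c in pairs if not t and c)      # false positives
    fn = sum(1 for t, c in pairs if t and not c)      # false negatives
    tn = sum(1 for t, c in pairs if not t and not c)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "fdr": fp / (tp + fp) if tp + fp else 0.0,
    }


# Toy example: 3 truly affected genes out of 8, 3 genes called.
truth = [True, True, True, False, False, False, False, False]
called = [True, True, False, True, False, False, False, False]
measures = performance_measures(truth, called)
```

In this toy case two of the three called genes are correct, so the FDR is one third while the accuracy is 0.75.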
|
18 |
Bayesian variable selection in clustering via Dirichlet process mixture models
Kim, Sinae 17 September 2007
The increased collection of high-dimensional data in various fields has raised a strong interest in clustering algorithms and variable selection procedures. In this dissertation, I propose a model-based method that addresses the two problems simultaneously. I use Dirichlet process mixture models to define the cluster structure and to introduce in the model a latent binary vector to identify discriminating variables. I update the variable selection index using a Metropolis algorithm and obtain inference on the cluster structure via a split-merge Markov chain Monte Carlo technique. I evaluate the method on simulated data and illustrate an application with a DNA microarray study. I also show that the methodology can be adapted to the problem of clustering functional high-dimensional data. There I employ wavelet thresholding methods in order to reduce the dimension of the data and to remove noise from the observed curves. I then apply variable selection and sample clustering methods in the wavelet domain. Thus my methodology is wavelet-based and aims at clustering the curves while identifying wavelet coefficients describing discriminating local features. I exemplify the method on high-dimensional and high-frequency tidal volume traces measured under an induced panic attack model in normal humans.
|
19 |
IMPROVED GENE PAIR BIOMARKERS FOR MICROARRAY DATA CLASSIFICATION
Khamesipour, Alireza 01 August 2018
The Top Scoring Pair (TSP) classifier, based on the notion of relative ranking reversals in the expressions of two marker genes, has been proposed as a simple, accurate, and easily interpretable decision rule for classification and class prediction of gene expression profiles. We introduce the AUC-based TSP classifier (AUCTSP), which is based on the Area Under the ROC (Receiver Operating Characteristic) Curve. The AUCTSP classifier works according to the same principle as TSP but differs from it in that the probabilities that determine the top scoring pair are computed from the relative rankings of the two marker genes across all subjects rather than for each individual subject. Although the classification is still done on an individual subject basis, the generalization that the AUC-based probabilities provide during training yields an overall better and more stable classifier. Through extensive simulation results and case studies involving classification in ovarian, leukemia, colon, and breast and prostate cancers and diffuse large B-cell lymphoma, we show the superiority of the proposed approach in terms of improving classification accuracy, avoiding overfitting and being less prone to selecting non-informative pivot genes. The proposed AUCTSP is a simple yet reliable and robust rank-based classifier for gene expression classification. While the AUCTSP works by the same principle as TSP, its ability to determine the top scoring gene pair from the relative rankings of two marker genes across all subjects, as opposed to each individual subject, results in significant performance gains in classification accuracy. In addition, the proposed method tends to avoid selecting non-informative (pivot) genes as members of the top-scoring pair.
We have also proposed the use of the AUC test statistic to reduce the computational cost of the TSP in selecting the most informative pair of genes for diagnosing a specific disease. We have shown the efficacy of the proposed method in selecting informative genes through case studies in ovarian, colon, leukemia, breast and prostate cancers and diffuse large B-cell lymphoma. On these datasets we compared the original TSP with a TSP applied to a subset of differentially expressed genes selected by the AUC probability, in terms of the selected pairs, computational cost, running time and classification performance. In all of the case studies, the reduced-size TSP dramatically lowered the computational cost and time complexity of selecting the top scoring pair of genes relative to the original TSP, without degrading the performance of the classifier. Using the AUC probability, we were able to reduce the computational cost and CPU running time of the TSP by 79% and 84% respectively, on average, in the tested case studies. In addition, using the AUC probability before applying the TSP tends to avoid the selection of genes that are not expressed ("pivot" genes), owing to the imposed condition. We have demonstrated through LOOCV and 5-fold cross-validation that the reduced-size TSP and the TSP perform approximately the same in terms of classification accuracy for smaller threshold values. In conclusion, we suggest using the AUC test statistic to reduce the size of the dataset for extensions of the TSP method, e.g. the k-TSP and TST, in order to make these methods feasible and cost-effective.
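To make the ranking-reversal idea concrete, here is a minimal sketch of the original TSP rule as it is commonly described: each gene pair (i, j) is scored by the absolute difference between the two classes in the within-class frequency of the event "gene i is expressed below gene j", and the highest-scoring pair defines the decision rule. This is not the AUC-based variant proposed in the thesis; the names `tsp_fit` and `tsp_predict` and the toy data are assumptions made for illustration, and the sketch assumes both classes contain at least one sample.

```python
from itertools import combinations


def tsp_fit(samples, labels):
    """Pick the gene pair with the largest ranking-reversal score.

    `samples` is a list of expression vectors, `labels` a list of 0/1 classes.
    Returns (score, i, j, less_means_one), where `less_means_one` is True when
    the event x[i] < x[j] is more frequent in class 1 than in class 0.
    """
    n_genes = len(samples[0])
    best = None
    for i, j in combinations(range(n_genes), 2):
        freq = []
        for cls in (0, 1):
            rows = [s for s, y in zip(samples, labels) if y == cls]
            # Within-class frequency of the event x[i] < x[j].
            freq.append(sum(1 for r in rows if r[i] < r[j]) / len(rows))
        score = abs(freq[0] - freq[1])
        if best is None or score > best[0]:
            best = (score, i, j, freq[1] > freq[0])
    return best


def tsp_predict(model, sample):
    """Classify one sample from the ordering of the selected gene pair."""
    _, i, j, less_means_one = model
    return int((sample[i] < sample[j]) == less_means_one)


# Toy data: genes 0 and 1 swap order between classes; gene 2 is noise.
samples = [[5, 1, 3], [6, 2, 8], [7, 1, 0],   # class 0
           [1, 5, 3], [2, 6, 0], [1, 7, 8]]   # class 1
labels = [0, 0, 0, 1, 1, 1]
model = tsp_fit(samples, labels)
prediction = tsp_predict(model, [1, 9, 4])
```

The exhaustive loop over all gene pairs is what makes the original TSP costly on full microarray datasets, which is the cost the abstract's AUC-based pre-filtering aims to reduce.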
|
20 |
Computational Methods for Knowledge Integration in the Analysis of Large-scale Biological Networks
January 2012
As we migrate into an era of personalized medicine, understanding how bio-molecules interact with one another to form cellular systems is one of the key focus areas of systems biology. Several challenges, such as the dynamic nature of cellular systems, uncertainty due to environmental influences, and the heterogeneity between individual patients, render this a difficult task. In the last decade, several algorithms have been proposed to elucidate cellular systems from data, resulting in numerous data-driven hypotheses. However, because of the large number of variables involved in the process, many of which are unknown or not measurable, such computational approaches often lead to a high proportion of false positives. This renders interpretation of the data-driven hypotheses extremely difficult. Consequently, only a small proportion of these hypotheses are subjected to further experimental validation, which limits their potential to augment existing biological knowledge. This dissertation develops a framework of computational methods for the analysis of such data-driven hypotheses that leverages existing biological knowledge. Specifically, I show how biological knowledge can be mapped onto these hypotheses and subsequently augmented through novel hypotheses. Biological hypotheses are learnt at three levels of abstraction (individual interactions, functional modules and relationships between pathways), corresponding to three complementary aspects of biological systems. The computational methods developed in this dissertation are applied to high-throughput cancer data, resulting in novel hypotheses with potentially significant biological impact. / Dissertation/Thesis / Ph.D. Computer Science 2012
|