1 |
A New Reclassification Method for Highly Uncertain Microarray Data in Allergy Gene Prediction
Paul, Jasmin 11 April 2012
The analysis of microarray data is a challenging task because of the large dimensionality and small sample size involved. Although a few methods are available to address the problem of small sample size, they do not deal well enough with microarray data from extremely small samples (roughly fewer than 20). We propose a method that incorporates information from diverse sources into the analysis of microarray data so as to improve the predictability of significant genes. A transformed data set, including statistical parameters, literature mining and gene ontology data, is evaluated. We performed classification experiments to identify potential allergy-related genes. Feature selection is used to identify the effect of features on classifier behaviour.
An exploratory and domain knowledge analysis was performed on noisy real-life allergy data, and a subset of genes was selected as the positive and negative classes. A new set of transformed variables, depending on the mean and standard deviation statistics of the data distribution and on other data sources, was identified. Significant allergy- and immune-related genes from the microarray data were selected. Experiments showed that the classification predictability of significant genes can be improved. Important features from the transformed variable set were also identified.
|
2 |
Microarray data analysis methods and their applications to gene expression data analysis for Saccharomyces cerevisiae under oxidative stress
Sha, Wei 12 June 2006
Oxidative stress is a harmful condition in a cell, tissue, or organ, caused by an imbalance between reactive oxygen species or other oxidants and the capacity of antioxidant defense systems to remove them. These oxidants cause wide-ranging damage to macromolecules, including proteins, lipids, DNA and carbohydrates. Oxidative stress is an important pathophysiologic component of a number of diseases, such as Alzheimer's disease, diabetes and certain cancers. Cells contain effective defense mechanisms to respond to oxidative stress. Despite much accumulated knowledge about these responses, their kinetics, especially the kinetics of early responses, are still not clearly understood.
The Yap1 transcription factor is crucial for the normal response to a variety of stress conditions, including oxidative stress. Previous studies on Yap1 regulation began measuring gene expression profiles no earlier than 20 minutes after the induction of oxidative stress. Genes and pathways regulated by Yap1 in the early oxidative stress response (within 20 minutes) were not identified in these studies.
Here we study the kinetics of the early oxidative stress response induced by cumene hydroperoxide (CHP) in Saccharomyces cerevisiae wild type and yap1 mutant. Gene expression profiles after exposure to CHP were obtained under controlled conditions using Affymetrix Yeast Genome S98 arrays. The oxidative stress response was measured at 8 time points over the 120 minutes after the addition of CHP, with the earliest time point at 3 minutes after exposure. Statistical analysis methods, including ANOVA, k-means clustering analysis, and pathway analysis, were used to analyze the data. The results from this study provide a dynamic resolution of the oxidative stress responses in S. cerevisiae and contribute to a richer understanding of the antioxidant defense systems. They also provide a global view of the roles that Yap1 plays under normal and oxidative stress conditions. / Ph. D.
|
3 |
Bayesian variable selection in clustering via Dirichlet process mixture models
Kim, Sinae 17 September 2007
The increased collection of high-dimensional data in various fields has raised a strong interest in clustering algorithms and variable selection procedures. In this dissertation, I propose a model-based method that addresses the two problems simultaneously. I use Dirichlet process mixture models to define the cluster structure and to introduce in the model a latent binary vector to identify discriminating variables. I update the variable selection index using a Metropolis algorithm and obtain inference on the cluster structure via a split-merge Markov chain Monte Carlo technique. I evaluate the method on simulated data and illustrate an application with a DNA microarray study. I also show that the methodology can be adapted to the problem of clustering functional high-dimensional data. There I employ wavelet thresholding methods in order to reduce the dimension of the data and to remove noise from the observed curves. I then apply variable selection and sample clustering methods in the wavelet domain. Thus my methodology is wavelet-based and aims at clustering the curves while identifying wavelet coefficients describing discriminating local features. I exemplify the method on high-dimensional and high-frequency tidal volume traces measured under an induced panic attack model in normal humans.
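The wavelet thresholding step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which wavelet family or threshold rule the dissertation uses, so the one-level Haar transform, the hard-threshold rule, and all function names below are assumptions chosen only for illustration.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: returns (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Invert haar_step, interleaving the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def hard_threshold(coeffs, cutoff):
    """Zero every coefficient whose magnitude does not exceed the cutoff."""
    return [c if abs(c) > cutoff else 0.0 for c in coeffs]


def denoise(signal, cutoff):
    """Threshold the detail coefficients (illustrative rule) and reconstruct."""
    approx, detail = haar_step(signal)
    return inverse_haar_step(approx, hard_threshold(detail, cutoff))
```

With a cutoff of zero the curve is reconstructed unchanged; a larger cutoff removes small local fluctuations and leaves fewer non-zero coefficients to cluster on.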
|
4 |
IMPROVED GENE PAIR BIOMARKERS FOR MICROARRAY DATA CLASSIFICATION
Khamesipour, Alireza 01 August 2018
The Top Scoring Pair (TSP) classifier, based on the notion of relative ranking reversals in the expressions of two marker genes, has been proposed as a simple, accurate, and easily interpretable decision rule for classification and class prediction of gene expression profiles. We introduce the AUC-based TSP classifier (AUCTSP), which is based on the Area Under the ROC (Receiver Operating Characteristic) Curve. The AUCTSP classifier works according to the same principle as TSP but differs from it in that the probabilities that determine the top scoring pair are computed from the relative rankings of the two marker genes across all subjects rather than for each individual subject. Although classification is still done on an individual-subject basis, the generalization that the AUC-based probabilities provide during training yields an overall better and more stable classifier. Through extensive simulation results and case studies involving classification in ovarian, leukemia, colon, breast and prostate cancers and diffuse large B-cell lymphoma, we show the superiority of the proposed approach in terms of improving classification accuracy, avoiding overfitting and being less prone to selecting non-informative pivot genes. The proposed AUCTSP is a simple yet reliable and robust rank-based classifier for gene expression classification. While the AUCTSP works by the same principle as TSP, its ability to determine the top scoring gene pair from the relative rankings of two marker genes across all subjects, as opposed to each individual subject, results in significant gains in classification accuracy. In addition, the proposed method tends to avoid selecting non-informative (pivot) genes as members of the top-scoring pair.
We have also proposed the use of the AUC test statistic to reduce the computational cost of the TSP in selecting the most informative pair of genes for diagnosing a specific disease. We have proven the efficacy of our proposed method in selecting informative genes through case studies in ovarian, colon, leukemia, breast and prostate cancers and diffuse large B-cell lymphoma. On these datasets we have compared the reduced-size TSP, which is applied to a subset of differentially expressed genes selected by AUC probability, with the original TSP in terms of selected pairs, computational cost, running time and classification performance. In all of the case studies, the reduced-size TSP dramatically reduces the computational cost and time complexity of selecting the top scoring pair of genes compared with the original TSP, without degrading the performance of the classifier. Using the AUC probability, we were able to reduce the computational cost and CPU running time of the TSP by 79% and 84% respectively, on average, in the tested case studies. In addition, because of the imposed condition, using the AUC probability before applying the TSP tends to avoid the selection of genes that are not expressed ("pivot" genes). We have demonstrated through LOOCV and 5-fold cross validation that the reduced-size TSP and the TSP perform approximately the same in terms of classification accuracy for smaller threshold values. In conclusion, we suggest using the AUC test statistic to reduce the size of the dataset for extensions of the TSP method, e.g. the k-TSP and TST, in order to make these methods feasible and cost effective.
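The pair-scoring idea behind this abstract can be sketched as follows. The abstract does not give the exact AUCTSP formula, so this is not the author's method: it shows the classic TSP score, |P(X_i < X_j | class A) - P(X_i < X_j | class B)|, preceded by a per-gene AUC filter computed as a Mann-Whitney probability. The function names, the `auc_margin` parameter and the tie-breaking by tuple order are assumptions made for illustration.

```python
from itertools import combinations


def tsp_score(expr_a, expr_b, gene_i, gene_j):
    """Classic TSP score for one gene pair; each sample is a list of expression values."""
    def frac_reversed(samples):
        return sum(s[gene_i] < s[gene_j] for s in samples) / len(samples)
    return abs(frac_reversed(expr_a) - frac_reversed(expr_b))


def gene_auc(expr_a, expr_b, gene):
    """AUC of a single gene: P(value in A > value in B), ties counted as 1/2."""
    wins = 0.0
    for a in expr_a:
        for b in expr_b:
            if a[gene] > b[gene]:
                wins += 1.0
            elif a[gene] == b[gene]:
                wins += 0.5
    return wins / (len(expr_a) * len(expr_b))


def top_scoring_pair(expr_a, expr_b, auc_margin=0.0):
    """Best (score, (i, j)) among genes whose AUC is at least auc_margin from 0.5.

    Raises ValueError if fewer than two genes pass the (hypothetical) filter.
    """
    n_genes = len(expr_a[0])
    kept = [g for g in range(n_genes)
            if abs(gene_auc(expr_a, expr_b, g) - 0.5) >= auc_margin]
    return max((tsp_score(expr_a, expr_b, i, j), (i, j))
               for i, j in combinations(kept, 2))
```

On a toy data set in which gene 2 is constant, any pair that includes gene 2 can still reach the maximum score, which illustrates the pivot-gene problem; filtering genes by AUC first removes that gene and also shrinks the number of pairs to score.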
|
5 |
Novel Monte Carlo Approaches to Identify Aberrant Pathways in Cancer
Gu, Jinghua 27 August 2013
Recent breakthroughs in high-throughput biotechnology have promoted the integration of multi-platform data to investigate signal transduction pathways within a cell. In order to model complicated dynamics and heterogeneity of biological pathways, sophisticated computational models are needed to address unique properties of both the biological hypothesis and the data. In this dissertation work, we have proposed and developed methods using Markov Chain Monte Carlo (MCMC) techniques to solve complex modeling problems in human cancer research by integrating multi-platform data. We focus on two research topics: 1) identification of transcriptional regulatory networks and 2) uncovering of aberrant intracellular signal transduction pathways.
We propose a robust method, called GibbsOS, to identify condition-specific gene regulatory patterns between transcription factors and their target genes. A Gibbs sampler is employed to sample target genes from the marginal function of the outlier sum of regression t-statistics. Numerical simulation has demonstrated a significant performance improvement of GibbsOS over existing methods against noise and false positive connections in binding data. We have applied GibbsOS to breast cancer cell line datasets and identified condition-specific regulatory rewiring in human breast cancer.
We also propose a novel method, namely Gibbs sampler to Infer Signal Transduction (GIST), to detect aberrant pathways that are highly associated with biological phenotypes or clinical information. By converting predefined potential functions into a Gibbs distribution, GIST estimates edge directions by learning the distribution of linear signaling pathway structures. Through the sampling process, the algorithm is able to infer signal transduction directions that are jointly determined by both gene expression and network topology. We demonstrate the advantage of the proposed algorithms on simulation data under different settings of the noise level in gene expression and of false-positive connections in the protein-protein interaction (PPI) network.
Another major contribution of the dissertation work is that we have improved the traditional perspective on aberrant signal transduction by further investigating the structural linkage of signaling pathways. We develop a method called Structural Organization to Uncover pathway Landscape (SOUL), which emphasizes modularized pathway structures in the reconstructed pathway landscape. GIST and SOUL provide a unique angle from which to computationally model alternative pathways and pathway crosstalk. The proposed new methods can bring insight to drug discovery research by targeting nodal proteins that oversee multiple signaling pathways, rather than treating individual pathways separately. A complete pathway identification protocol, namely Infer Modularization of PAthway CrossTalk (IMPACT), is developed to bridge downstream regulatory networks with upstream signaling cascades. We have applied IMPACT to datasets from treated breast cancer patients to investigate how estrogen receptor (ER) signaling pathways are related to drug resistance. The pathway proteins identified from patient datasets are well supported by breast cancer cell line models. We hypothesize from computational results that the HSP90AA1 protein is an important nodal protein that oversees multiple signaling pathways to drive drug resistance. Cell viability analysis has supported our hypothesis by showing a significant decrease in the viability of endocrine-resistant cells compared with non-resistant cells when 17-AAG (a drug that inhibits HSP90AA1) is applied.
We believe that this dissertation work not only offers novel computational tools towards understanding complicated biological problems, but more importantly, it provides a valuable paradigm where systems biology connects data with hypotheses using computational modeling. Initial success of using microarray datasets to study endocrine resistance in breast cancer has shed light on translating results from high throughput datasets to biological discoveries in complicated human disease studies. As the next generation biotechnology becomes more cost-effective, the power of the proposed methods to untangle complicated aberrant signaling rewiring and pathway crosstalk will be finally unleashed. / Ph. D.
|
6 |
TESTING FOR DIFFERENTIALLY EXPRESSED GENES AND KEY BIOLOGICAL CATEGORIES IN DNA MICROARRAY ANALYSIS
SARTOR, MAUREEN A. January 2007
No description available.
|
7 |
Integrative Modeling and Analysis of High-throughput Biological Data
Chen, Li 21 January 2011
Computational biology is an interdisciplinary field that focuses on developing mathematical models and algorithms to interpret biological data so as to understand biological problems. With current high-throughput technology development, different types of biological data can be measured on a large scale, which calls for more sophisticated computational methods to analyze and interpret the data. In this dissertation research work, we propose novel methods to integrate, model and analyze multiple types of biological data, including microarray gene expression data, protein-DNA interaction data and protein-protein interaction data. These methods will help improve our understanding of biological systems.
First, we propose a knowledge-guided multi-scale independent component analysis (ICA) method for biomarker identification on time course microarray data. Guided by a knowledge gene pool related to a specific disease under study, the method can determine disease relevant biological components from ICA modes and then identify biologically meaningful markers related to the specific disease. We have applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperform several baseline methods for biomarker identification.
Second, we propose a novel method for transcriptional regulatory network identification by integrating gene expression data and protein-DNA binding data. The approach is built upon a multi-level analysis strategy designed for suppressing false positive predictions. With this strategy, a regulatory module becomes increasingly significant as more relevant gene sets are formed at finer levels. At each level, a two-stage support vector regression (SVR) method is utilized to reduce false positive predictions by integrating binding motif information and gene expression data; a significance analysis procedure is followed to assess the significance of each regulatory module. The resulting performance on simulation data and yeast cell cycle data shows that the multi-level SVR approach outperforms other existing methods in the identification of both regulators and their target genes. We have further applied the proposed method to breast cancer cell line data to identify condition-specific regulatory modules associated with estrogen treatment. Experimental results show that our method can identify biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer.
Third, we propose a bootstrapping Markov Random Field (MRF)-based method for subnetwork identification on microarray data by incorporating protein-protein interaction data. Methodologically, an MRF-based network score is first derived by considering the dependency among genes to increase the chance of selecting hub genes. A modified simulated annealing search algorithm is then utilized to find the optimal/suboptimal subnetworks with maximal network score. A bootstrapping scheme is finally implemented to generate confident subnetworks. Experimentally, we have compared the proposed method with other existing methods, and the resulting performance on simulation data shows that the bootstrapping MRF-based method outperforms other methods in identifying the ground truth subnetwork and hub genes. We have then applied our method to breast cancer data to identify significant subnetworks associated with drug resistance. The identified subnetworks not only show good reproducibility across different data sets, but also indicate several pathways and biological functions potentially associated with the development of breast cancer and drug resistance. In addition, we propose to develop network-constrained support vector machines (SVM) for cancer classification and prediction, by taking into account the network structure to construct classification hyperplanes. The simulation study demonstrates the effectiveness of our proposed method. The study on the real microarray data sets shows that our network-constrained SVM, together with the bootstrapping MRF-based subnetwork identification approach, can achieve better classification performance than conventional biomarker selection approaches and SVMs.
We believe that the research presented in this dissertation not only provides novel and effective methods to model and analyze different types of biological data, but that the extensive experiments on several real microarray data sets also show its potential to improve the understanding of biological mechanisms related to cancers by generating novel hypotheses for further study. / Ph. D.
|
8 |
Conception et analyse des biopuces à ADN en environnements parallèles et distribués / Design and analysis of DNA microarrays in parallel and distributed environments
Jaziri, Faouzi 23 June 2014
Microorganisms represent the greatest diversity of living beings. They play a crucial role in all biological processes thanks to their huge metabolic potential and their capacity to adapt to different ecological niches. The development of new genomic approaches allows better knowledge of the microbial communities involved in the functioning of complex environments. In this context, DNA microarrays are high-throughput tools able to study the presence or the expression levels of several thousand genes, combining qualitative and quantitative aspects in a single experiment. However, the design and analysis of DNA microarrays, with their current high-density formats as well as the huge amount of data to process, are complex but crucial steps. To improve the quality and performance of these two steps, we have proposed new bioinformatics approaches for the design and analysis of DNA microarrays in parallel and distributed environments. These multipurpose approaches use high performance computing (HPC) and new software engineering approaches, especially model driven engineering (MDE), to overcome the current limitations. We first developed PhylGrid 2.0, a new distributed approach for the large-scale selection of explorative probes for phylogenetic DNA microarrays using computing grids. This software was used to build PhylOPDb: a comprehensive 16S rRNA oligonucleotide probe database for prokaryotic identification.
MetaExploArrays, parallel software for oligonucleotide probe selection on different computing architectures (a PC, a multiprocessor, a cluster or a computing grid) that uses meta-programming and a model driven engineering approach, was developed to give users flexibility according to their hardware resources. We then developed PhylInterpret, new software that facilitates the analysis of hybridization results of DNA microarrays. PhylInterpret uses the concepts of propositional logic to determine the prokaryotic composition of metagenomic samples. Finally, a new parallelization method based on model driven engineering (MDE) has been proposed to compute a complete backtranslation of short peptides to select probes for functional microarrays.
|
9 |
Variance of Difference as Distance Like Measure in Time Series Microarray Data Clustering
Mukhopadhyay, Sayan January 2014
Our intention is to find similarity among the time series expressions of genes in microarray experiments. It is hypothesized that at a given time point the concentration of one gene's mRNA is directly affected by the concentration of another gene's mRNA, and that this may have biological significance. We define the dissimilarity between two time-series data sets as the variance of the Euclidean distances at each time point. The large number of gene expressions makes calculating the variance of the distance at each point computationally expensive and therefore challenging in terms of execution time. For this reason we use an autoregressive model that reduces a nineteen-point gene expression series to a three-point vector. It allows us to find the variance of difference between two data sets without point-to-point matching. Previous analysis of microarray experiment data found that 62 genes are regulated following EGF (Epidermal Growth Factor) and HRG (Heregulin) treatment of MCF-7 breast cancer cells. We have chosen these suspected cancer-related genes as our reference and investigated which additional genes have similar time point expression profiles. Keeping variance of difference as the measure of distance, we have used several methods for clustering the gene expression data, such as our own maximum clique finding heuristics and hierarchical clustering. The results obtained were validated through a text mining study. New predictions from our study could be a basis for further investigations into the genesis of breast cancer. Overall, 84 new genes were found, of which 57 are related to cancer; of these, 35 are associated with breast cancer.
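The dissimilarity measure described in this abstract can be sketched in a few lines. The abstract speaks of the variance of pointwise distances while the title speaks of the variance of the difference; this sketch assumes the signed pointwise difference and the population variance, and the function name is hypothetical. It does not include the autoregressive reduction step.

```python
from statistics import pvariance


def variance_of_difference(profile_x, profile_y):
    """Population variance of the pointwise differences between two expression profiles."""
    if len(profile_x) != len(profile_y):
        raise ValueError("profiles must have the same number of time points")
    return pvariance([x - y for x, y in zip(profile_x, profile_y)])
```

Under this assumption two profiles that differ only by a constant offset have dissimilarity zero, so genes whose expression rises and falls together are grouped even when their absolute levels differ.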
|