11 |
Analyzing Gene Expression Data in Terms of Gene Sets: Gene Set Enrichment Analysis
Li, Wei 01 December 2009
DNA microarray technology simultaneously monitors the expression of thousands of genes and aims to identify genes that are differentially expressed under different conditions. From a statistical point of view, the task can be restated as identifying genes strongly associated with the response or covariate of interest. Gene Set Enrichment Analysis (GSEA) is a method that focuses the analysis on functionally related gene sets instead of single genes. It helps biologists interpret DNA microarray data using their prior biological knowledge of the genes in a gene set. GSEA has been shown to efficiently identify gene sets containing known disease-related genes in real experiments. Here we evaluate the statistical power of this method through simulation studies. The results show that the power of GSEA is good enough to identify gene sets highly associated with the response or covariate of interest.
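As a rough illustration of the gene-set-level idea described above, the following minimal sketch computes an unweighted, Kolmogorov-Smirnov-style running-sum enrichment score over a ranked gene list. It is an assumption-laden simplification, not the thesis's procedure: the function name, the gene identifiers, and the choice of the unweighted variant are all hypothetical, and no significance assessment is included.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (illustrative sketch).

    ranked_genes: genes ordered from most to least associated with the response.
    gene_set: collection of genes forming the set of interest.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    members = set(gene_set)
    n_total = len(ranked_genes)
    n_hit = sum(1 for g in ranked_genes if g in members)
    if n_hit == 0 or n_hit == n_total:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    hit_step = 1.0 / n_hit                 # step up when a set member is met
    miss_step = 1.0 / (n_total - n_hit)    # step down otherwise
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += hit_step if gene in members else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list: a set concentrated at the top scores near +1,
# a set concentrated at the bottom scores near -1.
ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_score = enrichment_score(ranking, {"g1", "g2"})
bottom_score = enrichment_score(ranking, {"g5", "g6"})
```

In practice the score would be compared against a null distribution, for example one obtained by permuting the sample labels.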
|
12 |
Genetic Risk Factors for PTSD: A Gene-Set Analysis of Neurotransmitter Receptors
Lewis, Michael 08 July 2020
PTSD is a moderately heritable disorder that causes intense and chronic suffering in many afflicted individuals. The pathogenesis of PTSD is not well understood, and genetic mechanisms are particularly elusive. Neurotransmitter systems are thought to contribute to PTSD etiology and are the targets of most pharmacotherapies used to treat PTSD, including the only two FDA-approved options and a wide array of off-label options. However, the degree to which variation in genes that encode and regulate neurotransmitter receptors increases the risk of developing PTSD is unclear. Recently, large collaborative groups of PTSD genetics researchers have completed genome-wide association studies (GWAS) using massive sample sizes and have made summary statistics available for public use. In 2018, a new technique for high-powered analysis of GWAS summary statistics called GSA-SNP2 was introduced. In order to explore the relationship between PTSD and genetic variants in widely theorized molecular targets, this study applied GSA-SNP2 to manually curated neurotransmitter receptor gene-sets. Curated gene-sets included nine "neurotransmitter receptor group" gene-sets and 45 "receptor subtype" gene-sets. Each "neurotransmitter receptor group" gene-set was designed to capture the concentration of genetic risk factors for PTSD within genes that encode all receptor subtypes activated by a given neurotransmitter. In contrast, "receptor subtype" gene-sets focused on specific subtypes and also accounted for intracellular signaling; each was designed to capture the concentration of genetic risk factors for PTSD within genes that encode specific receptor subtypes and the intracellular signaling proteins through which they exert their effects. Due to practical considerations, this work used summary statistics derived from a GWAS with far fewer participants (2,424 cases; 7,113 controls) than initially planned (23,212 cases; 151,447 controls).
Prior to controlling for multiple comparisons, 7 of the investigated gene-sets reached statistical significance at the p ≤ .05 level. However, after controlling for multiple comparisons, none of the investigated gene-sets reached statistical significance. Due to the limited statistical power of the current work, these results should be interpreted very cautiously. The current study is best interpreted as a preliminary study and is most informative in relation to refining study design. Implications for next steps are emphasized in the discussion, and nominally significant results are synthesized with the literature to demonstrate the types of research questions that might be addressed by applying a refined version of this study design to a larger sample. / Doctor of Philosophy / Though nearly all individuals will be exposed to a potentially traumatic event in their lifetime, only a small percentage will experience PTSD, which is a severe psychological disorder. Though genetics are known to contribute to an individual's level of risk for developing PTSD, relatively little is known about which particular genetic differences are key. Neurotransmitter receptors are thought to contribute to the risk for PTSD and are a key aspect of medications for PTSD. However, little is known about whether genetic differences in neurotransmitter receptors contribute to the risk of developing PTSD. Recently, large collaborative groups of PTSD genetics researchers have completed studies which investigate genetic risk factors from across the genome using massive sample sizes and have made the statistical output of these studies available to the public. In 2018, a new technique called GSA-SNP2 was created to assist efforts to analyze aspects of that statistical output that have not been previously analyzed. This study used GSA-SNP2 to analyze the degree to which groups of neurotransmitter receptor genes contribute to the risk of developing PTSD.
Due to the coronavirus pandemic, the researcher did not have access to the computing power needed to analyze the initially planned data which included 23,212 individuals with PTSD and 151,447 individuals without PTSD. As a substitute, the current work is an analysis using statistical output data from a study which included 2,424 individuals with PTSD and 7,113 individuals without PTSD. Based on a level of statistical significance that is typically used in most psychological studies, seven of the investigated gene-sets contribute highly to the risk for PTSD. However, it was necessary to use a different threshold for statistical significance due to the testing of many different groups of genes. After making that adjustment, none of the investigated gene-sets reached statistical significance. Due to limited statistical power of the current work, these results should be interpreted very cautiously. The current study is best interpreted as a preliminary study and is most informative in relation to refining study design. Implications for next steps are emphasized in discussion and nominally significant results are synthesized with the literature to demonstrate the types of research questions that might be addressed by applying a refined version of this study design to a larger sample.
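The effect of adjusting the significance threshold for many gene-sets can be sketched as follows. The abstract does not name the correction that was used, so Bonferroni and Benjamini-Hochberg are shown here purely as common assumptions; the count of 54 tests follows from the nine plus 45 gene-sets mentioned above, and the p-values are invented placeholders, not results from the study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its stepped threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


n_gene_sets = 9 + 45  # group gene-sets plus subtype gene-sets
threshold = bonferroni_threshold(0.05, n_gene_sets)

# Placeholder p-values (NOT from the study): seven nominally significant, rest not.
example_p = [0.004, 0.011, 0.02, 0.03, 0.035, 0.04, 0.049] + [0.5] * 47
nominal = [p for p in example_p if p <= 0.05]
bonferroni_hits = [p for p in example_p if p <= threshold]
bh_hits = benjamini_hochberg(example_p)
```

With these placeholder values, seven tests pass the nominal threshold while neither correction retains any of them, which mirrors the qualitative pattern the abstract reports.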
|
13 |
Multi-omics Data Integration for Identifying Disease Specific Biological Pathways
Lu, Yingzhou 05 June 2018
Pathway analysis is an important task for gaining novel insights into the molecular architecture of many complex diseases. With the advancement of new sequencing technologies, a large amount of quantitative gene expression data has been continuously acquired. The emergence of omics data sets such as proteomics has facilitated the investigation of disease-relevant pathways.
Although much work has been done to explore single-omics data, little work has been reported using multi-omics data integration, mainly due to methodological and technological limitations. While a single omics data set can provide useful information about the underlying biological processes, multi-omics data integration would give a much more comprehensive view of the cause-effect processes responsible for diseases and their subtypes.
This project investigates the combination of miRNAseq, proteomics, and RNAseq data on seven types of muscular dystrophies and a control group. These unique multi-omics data sets provide the opportunity to identify the most relevant disease-specific biological pathways. We first perform a t-test and the OVEPUG test separately to define the differentially expressed genes in the protein and mRNA data sets. In multi-omics data sets, miRNA also plays a significant role in muscle development by regulating its target genes in the mRNA data set. To exploit the relationship between miRNA and gene expression, we consult TargetScan, a commonly used library, to collect all miRNA-mRNA and miRNA-protein co-expression pairs. Next, using statistical analyses such as Pearson's correlation coefficient or a t-test, we measure the biologically expected correlation of each gene with its upstream miRNAs and identify the miRNA-mRNA and miRNA-protein pairs that show negative correlation. Furthermore, we identify and assess the most relevant disease-specific pathways by submitting the differentially expressed genes and the negatively correlated genes to the gene-set libraries separately, and further characterize these prioritized marker subsets using IPA (Ingenuity Pathway Analysis) or KEGG. We then use Fisher's method to combine the p-values derived from the separate gene sets into a joint significance test assessing common pathway relevance. In conclusion, we find all negatively correlated miRNA-mRNA and miRNA-protein pairs and identify several pathophysiological pathways related to muscular dystrophies by gene set enrichment analysis.
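The step of combining p-values with Fisher's method can be sketched as below. Fisher's statistic is X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null when the k p-values are independent; for even degrees of freedom the upper tail has a closed form, so no external library is needed. The function name and the example p-values are hypothetical and are not taken from the thesis.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method (illustrative sketch).

    Uses the closed-form chi-square survival function for 2k degrees of freedom:
    P(X >= x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")

    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j   # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one pathway from two separate gene lists.
joint_p = fisher_combined_p([0.05, 0.05])
```

The independence assumption matters here: p-values derived from overlapping or correlated gene lists would make the combined value too optimistic.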
This novel multi-omics data integration study and the subsequent pathway identification will shed new light on pathophysiological processes in muscular dystrophies and improve our understanding of the molecular pathophysiology of muscle disorders, which may in turn support the prevention and treatment of disease and improve health in the long term. / Master of Science / Identification of biological pathways plays a central role in understanding both human health and disease. A biological pathway is a series of information processing steps, carried out through interactions among molecules in a cell, that partially determines the phenotype of the cell. Specifically, identifying disease-specific pathways will guide focused studies on complex diseases and thus potentially improve the prevention and treatment of diseases.
To identify disease-specific pathways, it is crucial to develop computational methods and statistical tests that can integrate multi-omics data (data from multiple omes such as the genome and the proteome). Compared to single-omics data, multi-omics data will help in gaining a more comprehensive understanding of the molecular architecture of disease processes.
In this thesis, we propose a novel data analytics pipeline for multi-omics data integration. We test and apply our method on real proteomics data sets on muscular dystrophy subtypes, and identify several biologically plausible pathways related to muscular dystrophies.
|
14 |
TESTING FOR DIFFERENTIALLY EXPRESSED GENES AND KEY BIOLOGICAL CATEGORIES IN DNA MICROARRAY ANALYSIS
SARTOR, MAUREEN A. January 2007
No description available.
|
15 |
Diel Mediated Populus balsamifera Transcriptome Components Test the Impacts of Artificial Nighttime Lighting
Skaf, Joseph 27 November 2012
Artificial nighttime lighting (ANL) is known to adversely affect animals, but little is known about its consequences for plants. Two genotypes of Populus balsamifera, a common urban tree, were used to investigate how ANL impacts plants. While the two genotypes varied in their physiological sensitivity to ANL, lower levels of net leaf carbon assimilation compared to control samples suggested that ANL perturbed the perception of time of day in these plants. Gene set analysis on a subset of PopGenExpress microarray samples identified time-of-day-specific processes in P. balsamifera, and a set of candidate ANL-sensitive genes was identified from these. Transcript measurements from the two genotypes revealed that ANL affects plants at the molecular level, because the diel cycling of the putative ANL-sensitive genes was perturbed. Together, these results suggest that ANL affects plants at the physiological and molecular levels by perturbing their perception of time of day.
|
17 |
隨機森林分類方法於基因組顯著性檢定上之應用 / Assessing the significance of a Gene Set
卓達瑋 Unknown Date
Microarray data analysis has become an important topic in biomedical research. One major goal is to explore the relationship between gene expression and specific phenotypes. Most methods developed so far are single-gene methods, which use only the information in individual genes and cannot appropriately account for relationships among genes. This research focuses on gene set analysis, which tests the significance of a set of genes with respect to a phenotype. To capture the relationship between a gene set and the phenotype, we propose using the performance of a complex classifier in the statistical test: the test error rate of a Random Forests classifier is adopted as the test statistic, and the statistical conclusion is drawn from its permutation-based p-value. We compare our test with seven other existing gene set analysis methods through simulation studies and find that our method performs very well in terms of both a controlled type I error rate and high power. Finally, the method is applied to several real data sets, and the results are discussed.
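The general shape of a permutation test that uses a classifier's error rate as the test statistic can be sketched as follows. To keep the example self-contained, a leave-one-out nearest-centroid classifier is swapped in for Random Forests; this stand-in, the function names, the number of permutations, and the toy expression values are all assumptions for illustration and do not reproduce the thesis's method.

```python
import random


def loo_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier.

    Stand-in for the Random Forests test error used as the test statistic.
    """
    n_features = len(X[0])
    errors = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        for j in range(len(X)):
            if j == i:
                continue  # hold sample i out
            acc = sums.setdefault(y[j], [0.0] * n_features)
            for d in range(n_features):
                acc[d] += X[j][d]
            counts[y[j]] = counts.get(y[j], 0) + 1
        best_label, best_dist = None, None
        for label in sorted(sums):
            dist = sum((X[i][d] - sums[label][d] / counts[label]) ** 2
                       for d in range(n_features))
            if best_dist is None or dist < best_dist:
                best_label, best_dist = label, dist
        if best_label != y[i]:
            errors += 1
    return errors / len(X)


def permutation_p_value(X, y, n_perm=200, seed=0):
    """P-value: share of label permutations with error <= the observed error."""
    rng = random.Random(seed)
    observed = loo_error(X, y)
    labels = list(y)
    as_good = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if loo_error(X, labels) <= observed:
            as_good += 1
    return (as_good + 1) / (n_perm + 1)  # add-one avoids a p-value of zero


# Toy expression matrix (made up): rows are samples, columns are genes in the set.
group0 = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]
group1 = [[a + 5.0, b + 5.0] for a, b in group0]
X = group0 + group1
y = [0] * 6 + [1] * 6
p_value = permutation_p_value(X, y)
```

A low classification error relative to the permuted-label errors indicates that the gene set as a whole carries information about the phenotype, which is the intuition behind the proposed test.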
|
18 |
Pathway-centric approaches to the analysis of high-throughput genomics data
Hänzelmann, Sonja, 1981- 11 October 2012
In the last decade, molecular biology has expanded from a reductionist view to a systems-wide view that tries to unravel the complex interactions of cellular components. Owing to the emergence of high-throughput technology it is now possible to interrogate entire genomes at an unprecedented resolution. The dimension and unstructured nature of these data made it evident that new methodologies and tools are needed to turn data into biological knowledge. To contribute to this challenge we exploited the wealth of publicly available high-throughput genomics data and developed bioinformatics methodologies focused on extracting information at the pathway rather than the single gene level. First, we developed Gene Set Variation Analysis (GSVA), a method that facilitates the organization and condensation of gene expression profiles into gene sets. GSVA enables pathway-centric downstream analyses of microarray and RNA-seq gene expression data. The method estimates sample-wise pathway variation over a population and allows for the integration of heterogeneous biological data sources with pathway-level expression measurements. To illustrate the features of GSVA, we applied it to several use-cases employing different data types and addressing biological questions. GSVA is made available as an R package within the Bioconductor project.
Secondly, we developed a pathway-centric, genome-based strategy to reposition drugs in type 2 diabetes (T2D). This strategy consists of two steps: first, a regulatory network is constructed and used to identify disease-driving modules; then these modules are searched for compounds that might target them. Our strategy is motivated by the observation that disease genes tend to group together in the same neighborhood, forming disease modules, and that multiple genes might have to be targeted simultaneously to attain an effect on the pathophenotype. To find potential compounds, we used compound-exposed genomics data deposited in public databases. We collected about 20,000 samples that have been exposed to about 1,800 compounds. Gene expression can be seen as an intermediate phenotype reflecting underlying dysregulated pathways in a disease. Hence, genes contained in the disease modules that elicit similar transcriptional responses upon compound exposure are assumed to have a potential therapeutic effect. We applied the strategy to gene expression data of human islets from diabetic and healthy individuals and identified four potential compounds (methimazole, pantoprazole, bitter orange extract and torcetrapib) that might have a positive effect on insulin secretion. This is the first time a regulatory network of human islets has been used to reposition compounds for T2D.
In conclusion, this thesis contributes two pathway-centric approaches to important bioinformatics problems, such as the assessment of biological function and in silico drug repositioning. These contributions demonstrate the central role of pathway-based analyses in interpreting high-throughput genomics data.
|
19 |
Computational development of regulatory gene set networks for systems biology applications
Suphavilai, Chayaporn January 2014
Indiana University-Purdue University Indianapolis (IUPUI) / In systems biology, biological networks are used to gain insights into biological systems. While the traditional approach to studying biological networks is based on identifying interactions among genes, or on ranking gene sets according to lists of differentially expressed genes, little is known about interactions between higher-order biological systems, that is, a network of gene sets. Several types of gene set network have been proposed, including co-membership, linkage, and co-enrichment human gene set networks. However, to our knowledge, none of them contains directionality information. Therefore, in this study we propose a method to construct a regulatory gene set network, a directed network that reveals novel relationships among gene sets. The regulatory gene set network was constructed using publicly available gene regulation data. A directed edge in a regulatory gene set network represents a regulatory relationship from one gene set to another. The regulatory gene set network was compared with another type of gene set network to show that it provides additional information. To show that a regulatory gene set network is useful for understanding the underlying mechanism of a disease, an Alzheimer's disease (AD) regulatory gene set network was constructed.
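One simple way to picture a directed gene set network is sketched below. The abstract does not state the rule used to turn gene-level regulations into gene-set-level edges, so the rule here (draw an edge from set A to set B when enough regulator-target pairs run from genes in A to genes in B) is purely an assumption, as are the function name, the `min_support` parameter, and the toy data.

```python
def regulatory_gene_set_edges(gene_sets, regulations, min_support=1):
    """Build directed gene-set edges from gene-level regulations (hypothetical rule).

    gene_sets: dict mapping gene set name -> set of member genes.
    regulations: iterable of (regulator_gene, target_gene) pairs.
    Returns a dict mapping (source_set, target_set) -> number of supporting pairs.
    """
    support = {}
    for regulator, target in regulations:
        sources = [name for name, genes in gene_sets.items() if regulator in genes]
        targets = [name for name, genes in gene_sets.items() if target in genes]
        for src in sources:
            for dst in targets:
                if src != dst:  # ignore regulation inside a single gene set
                    support[(src, dst)] = support.get((src, dst), 0) + 1
    return {edge: n for edge, n in support.items() if n >= min_support}


# Toy data (made up): two regulations run from set A into set B, one from B into C.
toy_sets = {"A": {"g1", "g2"}, "B": {"g3"}, "C": {"g4"}}
toy_regulations = [("g1", "g3"), ("g2", "g3"), ("g3", "g4")]
edges = regulatory_gene_set_edges(toy_sets, toy_regulations)
strong_edges = regulatory_gene_set_edges(toy_sets, toy_regulations, min_support=2)
```

Unlike a co-membership network, where an edge only records shared genes, the edges here are ordered pairs, so A to B and B to A are distinct relationships.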
In addition, we developed Pathway and Annotated Gene-set Electronic Repository (PAGER), an online systems biology tool for constructing and visualizing gene and gene set networks from multiple gene set collections. PAGER is available at http://discern.uits.iu.edu:8340/PAGER/. Global regulatory and global co-membership gene set networks were pre-computed. PAGER contains 166,489 gene sets, 92,108,741 co-membership edges, 697,221,810 regulatory edges, 44,188 genes, 651,586 unique gene regulations, and 650,160 unique gene interactions. PAGER provided several unique features including constructing regulatory gene set networks, generating expanded gene set networks, and constructing gene networks within a gene set.
However, tissue-specific and disease-specific information was not considered when constructing the disease-specific network, so the network might not represent the high-level relationships among gene sets in the disease context with high accuracy. Our framework can therefore be improved by collecting higher-resolution data, such as tissue-specific and disease-specific gene regulations and gene sets. In addition, experimental gene expression data can be applied to add more information to the gene set network. In the current version of PAGER, the size of gene and gene set networks is limited to 100 nodes due to browser memory constraints. Our future plan is to integrate the internal gene or protein interactions inside pathways in order to support future systems biology studies.
|
20 |
Analyse intégrative de données de grande dimension appliquée à la recherche vaccinale / Integrative analysis of high-dimensional data applied to vaccine research
Hejblum, Boris 06 March 2015
Gene expression data is recognized as high-dimensional data that needs specific statistical tools for its analysis. But in the context of vaccine trials, other measures, such as flow-cytometry measurements, are also high-dimensional. In addition, such measurements are often repeated over time. This work is built on the idea that using the maximum of available information, by modeling prior knowledge and integrating all data at hand, will improve the inference and the interpretation of biological results from high-dimensional data. First, we present an original methodological development, Time-course Gene Set Analysis (TcGSA), for the analysis of longitudinal gene expression data, taking into account prior biological knowledge in the form of predefined gene sets. Second, we describe two integrative analyses of two different vaccine studies. The first study reveals lower expression of inflammatory pathways consistently associated with lower viral rebound following an HIV therapeutic vaccine. The second study highlights the role of a testosterone-mediated group of genes linked to lipid metabolism in sex differences in immunological response to a flu vaccine. Finally, we introduce a new model-based clustering approach for the automated treatment of cell populations from flow-cytometry data, namely a Dirichlet process mixture of skew t-distributions, with a sequential posterior approximation strategy for dealing with repeated measurements. Hence, the automatic recognition of the cell populations could allow a practical improvement of the daily work of immunologists as well as a better interpretation of gene expression data after taking into account the frequency of all cell populations.
|