12 |
Comparisons of statistical modeling for constructing gene regulatory networks
Chen, Xiaohui 11 1900
Genetic regulatory networks are important both scientifically and for practical medicine. With the availability of high-throughput measurement technologies such as microarrays and sequencing, regulatory networks have been studied intensively over the last decade. Statistical interpretation of these high-throughput data sets, which comprise billions of bits, is crucial for biologists to extract meaningful results. In this thesis, we compare a variety of existing regression models and apply them to construct regulatory networks that span transcription factors and microRNAs. We also propose an extended algorithm to address the local optimum issue in finding the Maximum A Posteriori estimator. An E. coli mRNA expression microarray data set with known bona fide interactions is used to evaluate our models, and we show that our regression networks with a properly chosen prior can perform comparably to the state-of-the-art regulatory network construction algorithm. Finally, we apply our models to a p53-related data set, the NCI-60 data. By further incorporating available prior structural information from sequencing data, we identify several significantly enriched interactions with cell proliferation function. In both data sets, we select specific examples to show that many regulatory interactions can be confirmed by previous studies or functional enrichment analysis. By comparing statistical models, we conclude that combining different models with over-representation analysis and prior structural information can improve the quality of prediction and facilitate biological interpretation.
Keywords: regulatory network, variable selection, penalized maximum likelihood estimation, optimization, functional enrichment analysis. / Science, Faculty of / Graduate
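The abstract does not say which regression models or priors were compared, so the following is only an illustrative sketch of one common penalized-likelihood choice: L1-penalized (lasso) regression of a target gene on candidate regulators, which corresponds to a MAP estimate under a Laplace prior. The function names, the coordinate-descent solver, and the toy data are assumptions for illustration, not the thesis's algorithm.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t (the lasso proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by coordinate descent.

    X is a list of n rows (samples) with p columns (candidate regulators);
    y holds the expression of one target gene. Non-zero coefficients are
    read as candidate regulator -> target edges.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero column carries no information
            # Correlation of column j with the partial residual
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Toy data (hypothetical): the target depends only on the first regulator.
X_toy = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y_toy = [2.0, -2.0, 2.0, -2.0]
beta_toy = lasso_cd(X_toy, y_toy, lam=0.5)
```

In this sketch a larger `lam` corresponds to a stronger sparsity prior and therefore fewer predicted edges; the choice of `lam` plays the role of the "properly chosen prior" only in a loose sense.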
|
13 |
Identification of potential biomarkers in lung cancer as possible diagnostic agents using bioinformatics and molecular approaches
Ahmed, Firdous January 2015
Magister Scientiae - MSc / Lung cancer remains the leading cause of cancer deaths worldwide, with the majority of cases attributed to non-small cell lung carcinomas. At the time of diagnosis, a large percentage of patients present with advanced-stage disease, ultimately resulting in a poor prognosis. The identification of circulatory markers overexpressed by the tumour tissue could facilitate the discovery of an early, specific, non-invasive diagnostic tool, as well as improving prognosis and treatment protocols. The aim was to analyse gene expression data from both microarray and RNA sequencing platforms, using bioinformatics and statistical analysis tools. Enrichment analysis sought to identify genes that were differentially expressed (p < 0.05, FC > 2) and had the potential to be secreted into the extracellular circulation, by using Gene Ontology terms of the Cellular Component. Results identified 1 657 statistically significant genes between normal and early lung cancer tissue, with only 1 gene differentially expressed (DE) between early and late stage disease. Following statistical analysis, 171 DE genes were selected as potential early stage biomarkers. The overall sensitivity of RNAseq, in comparison to arrays, enabled the identification of 57 potential serum markers. These genes of interest were all downregulated in the tumour tissue, and while they did not facilitate the discovery of an ideal diagnostic marker based on the criteria set in this study, their roles in disease initiation and progression require further analysis.
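The selection criteria quoted above (p < 0.05, FC > 2) can be sketched as a simple filter. This is a minimal illustration, assuming each gene comes with a p-value and a tumour/normal fold change expressed as a positive ratio, and assuming the fold-change cutoff is applied in both directions (the abstract reports downregulated genes, but does not state how FC was defined). Gene names and values are hypothetical.

```python
from math import log2


def select_de_genes(genes, p_cutoff=0.05, fc_cutoff=2.0):
    """Return names of genes passing both the p-value and fold-change cutoffs.

    genes: iterable of (name, p_value, fold_change) with fold_change > 0,
    given as a tumour/normal ratio. The cutoff is symmetric on the log2
    scale, so a ratio of 0.25 (4-fold down) passes just like 4.0 (4-fold up).
    """
    selected = []
    for name, p_value, fold_change in genes:
        if p_value < p_cutoff and abs(log2(fold_change)) > log2(fc_cutoff):
            selected.append(name)
    return selected


# Hypothetical toy input: (gene, p-value, tumour/normal ratio)
toy_genes = [
    ("geneA", 0.01, 4.0),   # significant, up 4-fold
    ("geneB", 0.01, 0.25),  # significant, down 4-fold
    ("geneC", 0.20, 8.0),   # large change but not significant
    ("geneD", 0.01, 1.5),   # significant but small change
]
de_genes = select_de_genes(toy_genes)
```

In practice the p-values would normally be corrected for multiple testing before thresholding; the abstract does not say whether that was done, so no correction is shown here.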
|
14 |
Gene-pair based statistical methods for testing gene set enrichment in microarray gene expression studies
Zhao, Kaiqiong 16 September 2016
Gene set enrichment analysis aims to discover sets of genes, such as biological pathways or protein complexes, which may show moderate but coordinated differentiation across experimental conditions. Existing gene set enrichment approaches use a single-gene statistic as the measure of differentiation for individual genes.
These approaches do not use any inter-gene correlations, although it is known that genes in a pathway often interact with each other.
Motivated by the need for taking gene dependence into account, we propose a novel gene set enrichment algorithm, where the gene-gene correlation is addressed via a gene-pair representation strategy. Relying on an appropriately defined gene pair statistic, the gene set statistic is formulated using a competitive null hypothesis.
Extensive simulation studies show that our proposed approach can correctly control the type I error (false positive rate), and retain good statistical power for detecting true differential expression. The new method is also applied to analyze several gene expression datasets. / October 2016
|
15 |
COMPUTATIONAL TOOLS FOR THE DYNAMIC CATEGORIZATION AND AUGMENTED UTILIZATION OF THE GENE ONTOLOGY
Hinderer, Eugene Waverly, III 01 January 2019
Ontologies provide an organization of language, in the form of a network or graph, which is amenable to computational analysis while remaining human-readable. Although they are used in a variety of disciplines, ontologies in the biomedical field, such as Gene Ontology, are of interest for their role in organizing terminology used to describe—among other concepts—the functions, locations, and processes of genes and gene-products. Due to the consistency and level of automation that ontologies provide for such annotations, methods for finding enriched biological terminology from a set of differentially identified genes in a tissue or cell sample have been developed to aid in the elucidation of disease pathology and unknown biochemical pathways. However, despite their immense utility, biomedical ontologies have significant limitations and caveats. One major issue is that gene annotation enrichment analyses often result in many redundant, individually enriched ontological terms that are highly specific and weakly justified by statistical significance. These large sets of weakly enriched terms are difficult to interpret without manually sorting into appropriate functional or descriptive categories. Also, relationships that organize the terminology within these ontologies do not contain descriptions of semantic scoping or scaling among terms. Therefore, there exists some ambiguity, which complicates the automation of categorizing terms to improve interpretability.
We emphasize that existing methods risk producing incorrect mappings to categories as a result of these ambiguities, unless simplified and incomplete versions of these ontologies are used which omit problematic relations. Such ambiguities could have a significant impact on term categorization: we have calculated upper boundary estimates of potential false categorizations as high as 121,579 for the misinterpretation of a single scoping relation, has_part, which accounts for approximately 18% of the total possible mappings between terms in the Gene Ontology. However, the omission of problematic relationships results in a significant loss of retrievable information. In the Gene Ontology, omitting a single relation accounts for a 6% reduction, and this percentage should increase drastically when considering all relations in an ontology. To address these issues, we have developed methods which categorize individual ontology terms into broad, biologically-related concepts to improve the interpretability and statistical significance of gene-annotation enrichment studies, while also addressing the lack of semantic scoping and scaling descriptions among ontological relationships so that annotation enrichment analyses can be performed across a more complete representation of the ontological graph.
We show that, when compared to similar term categorization methods, our method produces categorizations that match hand-curated ones with similar or better accuracy, while not requiring the user to compile lists of individual ontology term IDs. Furthermore, our handling of problematic relations produces a more complete representation of ontological information from a scoping perspective, and we demonstrate instances where medically-relevant terms, and by extension putative gene targets, are identified in our annotation enrichment results that would otherwise be missed when using traditional methods. Additionally, we observed a marginal yet consistent improvement in statistical power in enrichment results when our methods were used, compared to traditional enrichment analyses that utilize ontological ancestors. Finally, using scalable and reproducible data workflow pipelines, we have applied our methods to several genomic, transcriptomic, and proteomic collaborative projects.
|
16 |
Analyzing Gene Expression Data in Terms of Gene Sets: Gene Set Enrichment Analysis
Li, Wei 01 December 2009
DNA microarray technology simultaneously monitors the expression of thousands of genes and aims to identify genes that are differentially expressed under different conditions. From the statistical point of view, this can be restated as identifying genes strongly associated with the response or covariate of interest. Gene Set Enrichment Analysis (GSEA) is a method that focuses the analysis on the level of functionally related gene sets instead of single genes. It helps biologists interpret DNA microarray data using their previous biological knowledge of the genes in a gene set. GSEA has been shown to efficiently identify gene sets containing known disease-related genes in real experiments. Here we evaluate the statistical power of this method by simulation studies. The results show that the power of GSEA is good enough to identify the gene sets highly associated with the response or covariate of interest.
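To make the gene-set-level idea concrete, here is a minimal sketch of a running-sum enrichment score. It uses the unweighted, Kolmogorov-Smirnov-like form, which is a simplification: the published GSEA statistic weights hits by each gene's association with the phenotype and assesses significance by permutation. The function name and toy gene identifiers are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    ranked_genes: genes ordered from most to least associated with the
    response. Walking down the list, the running sum rises by 1/n_hit at
    each gene in gene_set and falls by 1/n_miss otherwise; the score is
    the running-sum value with the largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical toy ranking: the set sits entirely at the top or bottom.
toy_ranking = ["g1", "g2", "g3", "g4"]
es_top = enrichment_score(toy_ranking, {"g1", "g2"})
es_bottom = enrichment_score(toy_ranking, {"g3", "g4"})
```

A power simulation of the kind the abstract describes would repeat such a computation over many simulated data sets and count how often a truly associated gene set is declared significant.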
|
17 |
Developing bioinformatics tools for metabolomics
Xia, Jianguo Unknown Date
No description available.
|
18 |
Výpočetní metody pro anotační analýzy genetických variací / Computational Methods for Annotation Analysis of Genetic Variations
Fülöp, Tibor January 2015
The analysis and interpretation of DNA variations is important for studying the genetic background of heredity, diseases, and other phenotypic traits. This thesis briefly introduces the field of molecular biology and the basic principles of genetics, and describes methods for annotation analysis of genetic variations, genome-wide association studies, and methods for enrichment analysis together with their implementations. As part of this work we present Varanto, a new web tool that can be used to annotate, visualize, and analyze genetic variations. It can be used to perform annotation enrichment analysis with the hypergeometric test for a given set of variations. Varanto includes a web user interface developed with the Shiny framework for R. The performance and functionality of the tool are tested and demonstrated using performance benchmarks and through the analysis and interpretation of data from previously published genome-wide association studies.
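The hypergeometric enrichment test mentioned in the abstract can be sketched as follows. This is a generic illustration of the test, not Varanto's actual implementation (which is written in R); the function name and the toy numbers are assumptions.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, drawn, observed):
    """Upper-tail hypergeometric p-value P(X >= observed).

    population: total number of variations in the background
    annotated:  background variations carrying the annotation
    drawn:      size of the input set of variations
    observed:   input variations carrying the annotation
    """
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(population, drawn)


# Toy numbers (hypothetical): 10 background variations, 4 annotated,
# an input set of 3 of which 2 carry the annotation.
p_toy = hypergeom_enrichment_pvalue(10, 4, 3, 2)
```

A small p-value indicates that the annotation occurs in the input set more often than expected from random draws out of the background.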
|
19 |
Mining high-level brain imaging genetic associations
Yao, Xiaohui 16 January 2018
Indiana University-Purdue University Indianapolis (IUPUI) / Imaging genetics is an emerging research field in neurodegenerative diseases. It studies the influence of genetic variants on brain structure and function. Genome-wide association studies (GWAS) of brain imaging have identified a few independent risk loci for individual imaging quantitative traits (iQTs), which however display only modest effect sizes and explain limited heritability. This thesis focuses on mining high-level imaging genetic associations and their applications to neurodegenerative diseases. It first presents a novel network-based GWAS framework for identifying functional modules, employing a two-step strategy in a top-down manner. The framework first integrates a tissue-specific network with the GWAS of the corresponding phenotype, using regression models in addition to classification, to re-prioritize genome-wide associations. It then detects densely connected and disease-relevant modules based on interactions among the top reprioritizations. The discovered modules show both phenotypic specificity and dense interaction. We applied the framework to an amygdala imaging genetics analysis in the study of Alzheimer's disease (AD). It effectively detects densely interacting modules, and the reprioritizations achieve the highest concordance with AD genes. We then present an extension of the above framework, named top-neighbor-based GWAS (tnGWAS), and compare it with previous approaches. tnGWAS extracts densely connected modules from top GWAS findings, based on the hypothesis that relevant modules consist of top GWAS findings and their close neighbors. It is applied to a hippocampus imaging genetics analysis in AD research, and yields the densest interactions among top candidate genes. Experimental results demonstrate that precise context does help explore the collective effects of genes with functional interactions specific to the studied phenotype.
In the second part, a novel imaging genetic enrichment analysis (IGEA) paradigm is proposed for discovering complex associations among genetic modules and brain circuits. In addition to genetic modules, brain regions of interest are also grouped into circuits. We expand the scope of one-dimensional enrichment analysis into imaging genetics. This framework jointly considers meaningful gene sets (GS) and brain circuits (BC), and examines whether a given GS-BC module is enriched in gene-iQT findings. We conduct a proof-of-concept study and demonstrate the framework's performance by applying it to a brain-wide imaging genetics study of AD.
|
20 |
Multi-omics Data Integration for Identifying Disease Specific Biological Pathways
Lu, Yingzhou 05 June 2018
Pathway analysis is an important task for gaining novel insights into the molecular architecture of many complex diseases. With the advancement of new sequencing technologies, a large amount of quantitative gene expression data has been continuously acquired. The emergence of omics data sets such as proteomics has facilitated the investigation of disease-relevant pathways.
Although much work has been done to explore single-omics data, little work has been reported using multi-omics data integration, mainly due to methodological and technological limitations. While a single omics data set can provide useful information about the underlying biological processes, multi-omics data integration would give a much more comprehensive view of the cause-effect processes responsible for diseases and their subtypes.
This project investigates the combination of miRNAseq, proteomics, and RNAseq data on seven types of muscular dystrophies and a control group. These unique multi-omics data sets give us the opportunity to identify the most relevant disease-specific biological pathways. We first perform a t-test and the OVEPUG test separately to define the differentially expressed genes in the protein and mRNA data sets. In multi-omics data sets, miRNAs also play a significant role in muscle development by regulating their target genes in the mRNA data set. To exploit the relationship between miRNA and gene expression, we consult the commonly used TargetScan library to collect all miRNA-mRNA and miRNA-protein pairs. Next, using statistical analyses such as Pearson's correlation coefficient or a t-test, we measure the biologically expected correlation of each gene with its upstream miRNAs and identify the miRNA-mRNA and miRNA-protein pairs that show negative correlation. Furthermore, we identify and assess the most relevant disease-specific pathways by inputting the differentially expressed genes and the negatively correlated genes into the gene-set libraries respectively, and further characterize these prioritized marker subsets using IPA (Ingenuity Pathway Analysis) or KEGG. We then use Fisher's method to combine the p-values derived from the separate gene sets into a joint significance test assessing common pathway relevance. In conclusion, we find all negatively correlated miRNA-mRNA and miRNA-protein pairs and identify several pathophysiological pathways related to muscular dystrophies by gene set enrichment analysis.
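The p-value combination step can be illustrated with a short sketch of Fisher's method. It assumes the input p-values are independent and strictly positive; the function name and example values are hypothetical, and the thesis's own pipeline may differ in detail. Under the null hypothesis the statistic -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom, whose upper tail has a closed form for even degrees of freedom.

```python
from math import exp, log


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p_i)
    and combined_p is the chi-square upper tail with 2k degrees of
    freedom: exp(-s/2) * sum_{i=0}^{k-1} (s/2)^i / i!.
    """
    k = len(pvalues)
    if k == 0 or any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("need at least one p-value, each in (0, 1]")
    statistic = -2.0 * sum(log(p) for p in pvalues)
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i  # builds (s/2)^i / i! incrementally
        total += term
    return statistic, exp(-half) * total


# Hypothetical example: one pathway scored in two separate gene sets.
stat_toy, p_combined = fisher_combined_pvalue([0.05, 0.05])
```

With a single input the combined p-value equals that input, which is a convenient sanity check. If the gene sets overlap heavily, the independence assumption is violated and the combined p-value will tend to be too small.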
This novel multi-omics data integration study and the subsequent pathway identification will shed new light on pathophysiological processes in muscular dystrophies and improve our understanding of the molecular pathophysiology of muscle disorders, with the long-term aim of better preventing and treating disease and improving health. / Master of Science / Identification of biological pathways plays a central role in understanding both human health and disease. A biological pathway is a series of information processing steps, carried out through interactions among molecules in a cell, that partially determines the phenotype of the cell. Specifically, identifying disease-specific pathways will guide focused studies on complex diseases, thus potentially improving the prevention and treatment of diseases.
To identify disease-specific pathways, it is crucial to develop computational methods and statistical tests that can integrate multi-omics data (data from multiple omes such as the genome and proteome). Compared to single-omics data, multi-omics data will help gain a more comprehensive understanding of the molecular architecture of disease processes.
In this thesis, we propose a novel data analytics pipeline for multi-omics data integration. We test our method on, and apply it to, real proteomics data sets from muscular dystrophy subtypes, and identify several biologically plausible pathways related to muscular dystrophies.
|