Global ETD Search

1	Differential Dependency Network and Data Integration for Detecting Network Rewiring and Biomarkers Fu, Yi 30 January 2020 (has links) Rapid advances in high-throughput molecular profiling techniques have enabled large-scale genomics, transcriptomics, and proteomics-based biomedical studies, generating an enormous amount of multi-omics data. Processing and summarizing multi-omics data, modeling interactions among biomolecules, and detecting condition-specific dysregulation using multi-omics data are some of the most important yet challenging analytics tasks. In the case of detecting somatic DNA copy number aberrations using bulk tumor samples in cancer research, normal cell contamination is a significant confounding factor that weakens detection power regardless of the method used. To address this problem, we propose a computational approach, BACOM 2.0, to more accurately estimate the normal cell fraction and accordingly reconstruct DNA copy number signals in cancer cells. Specifically, by introducing allele-specific absolute normalization, BACOM 2.0 can accurately detect deletion types and aneuploidy in cancer cells directly from DNA copy number data. Genes work through complex networks to support cellular processes. Dysregulated genes can cause structural changes in biological networks, also known as network rewiring. Genes with a large number of rewired edges are more likely to be associated with functional alterations leading to phenotype transitions, and hence are potential biomarkers in diseases such as cancers. The differential dependency network (DDN) method was proposed to detect such network rewiring and biomarkers. However, the existing DDN method and software tool have two major drawbacks. Firstly, in imbalanced sample groups, DDN suffers from systematic bias and produces false positive differential dependencies. 
Secondly, the computational time of the block coordinate descent algorithm in DDN increases rapidly with the number of involved samples and molecular entities. To address the imbalanced sample group problem, we propose a sample-scale-wide normalized formulation to correct systematic bias and design a simulation study for testing the performance. To address high computational complexity, we propose several strategies to accelerate DDN learning, including two reformulated algorithms for block-wise coefficient updating in the DDN optimization problem. The strategies also include one for discarding predictors and one for parallel computing. More importantly, experimental results show that DDN learning with the combined acceleration strategies is hundreds of times faster than the original method on medium-sized data. We applied the DDN method to several biomedical omics datasets and detected significant phenotype-specific network rewiring. With a random-graph-based detection strategy, we discovered hub-node-defined biomarkers that helped to generate or validate several novel scientific hypotheses in collaborative research projects. For example, the hub genes detected by the DDN method in proteomics data from artery samples are significantly enriched in the citric acid cycle pathway, which plays a critical role in the development of atherosclerosis. To detect intra-omics and inter-omics network rewiring, we propose a method called multiDDN that uses a multi-layer signaling model to integrate multi-omics data. We adapt the block coordinate descent algorithm, together with the acceleration strategies, to solve the multiDDN optimization problem. The simulation study shows that, compared with the DDN method on single-omics data, the multiDDN method is considerably more accurate in detecting network rewiring. 
We applied the multiDDN method to real multi-omics data from the CPTAC ovarian cancer dataset and detected multiple hub genes that are associated with histone protein deacetylation and were previously reported in independent ovarian cancer data analyses. / Doctor of Philosophy / We witnessed the start of the human genome project decades ago and have since stepped into the era of omics. Omics are comprehensive approaches for analyzing genome-wide biomolecular profiles. The rapid development of high-throughput technologies enables us to produce an enormous amount of omics data such as genomics, transcriptomics, and proteomics data, leaving researchers swimming in a sea of omics information that was once unimaginable. Yet the era of omics brings new challenges: to process the huge volumes of data, to summarize the data, to reveal the interactions between entities, to link various types of omics data, and to discover mechanisms hidden behind omics data. In processing omics data, one factor that weakens follow-up data analysis is sample impurity. We call tumor samples contaminated by normal cells heterogeneous samples. The genomic signals measured from heterogeneous samples are a mixture of signals from both tumor cells and normal cells. To correct the mixed signals and recover the true signals from pure tumor cells, we propose a computational approach called BACOM 2.0 to estimate the normal cell fraction and correct the genomic signals accordingly. By introducing a novel normalization method that identifies the neutral component in mixed signals of genomic copy number data, BACOM 2.0 can accurately detect genes' deletion types and abnormal chromosome numbers in tumor cells. In cells, genes connect to other genes and form complex biological networks to perform their functions. Dysregulated genes can cause structural change in biological networks, also known as network rewiring. 
In a biological network with network rewiring events, a large number of rewired connections linked to a single hub gene suggests concentrated gene dysregulation. This hub gene has more impact on the network and hence is more likely to be associated with the functional change of the network, which ultimately leads to abnormal phenotypes such as cancer. Therefore, the hub genes linked with network rewiring are potential indicators of disease status, known as biomarkers. The differential dependency network (DDN) method was proposed to detect network rewiring events and biomarkers from omics data. However, the DDN method still has a few drawbacks. Firstly, for two groups of data with unequal sample sizes, DDN consistently detects false targets of network rewiring. The permutation test, which applies the same method to randomly shuffled samples and is supposed to distinguish true targets from random effects, suffers from the same problem and can let those false targets pass. We propose a new formulation that corrects the errors caused by unequal group sizes and design a simulation study to test the new formulation's correctness. Secondly, the computing time for solving DDN problems is unbearably long when processing omics data with a large number of samples or a large number of genes. We propose several strategies to increase DDN's computation speed, including three redesigned formulas for efficiently updating the results, one rule to preselect predictor variables, and one acceleration technique that uses multiple CPU cores simultaneously. In the timing test, the DDN method with increased computing speed is much faster than the original method. To detect network rewiring within the same omics data or between different types of omics data, we propose a method called multiDDN that uses an integrated model to process multiple types of omics data. We solve the new problem by adapting the block coordinate descent algorithm. 
The test on simulated data shows that multiDDN is more accurate than single-omics DDN. We applied the DDN or multiDDN method to several omics datasets and detected significant network rewiring associated with diseases. We detected hub nodes from the network rewiring events. These hub genes, as potential biomarkers, help us to ask new meaningful questions in related research. molecular data integration differential network analysis biomarker
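The idea in entry 1 that genes with many rewired edges are candidate biomarkers can be illustrated with a minimal sketch. It assumes two condition-specific networks have already been estimated and are given as edge lists. The function names, the `min_degree` cutoff, and the node labels are hypothetical. A plain degree threshold stands in for the random-graph-based detection strategy, which the abstract does not specify.

```python
from collections import Counter


def rewired_edges(edges_a, edges_b):
    """Return undirected edges present in exactly one of the two networks."""
    def normalize(edges):
        # frozenset makes ("g1", "g2") and ("g2", "g1") the same edge
        return {frozenset(edge) for edge in edges}

    return normalize(edges_a) ^ normalize(edges_b)


def hub_nodes(edges_a, edges_b, min_degree=2):
    """Rank nodes by how many rewired edges touch them (hypothetical cutoff)."""
    degree = Counter(
        node for edge in rewired_edges(edges_a, edges_b) for node in edge
    )
    hubs = [node for node, count in degree.items() if count >= min_degree]
    # Highest rewiring degree first, ties broken alphabetically
    return sorted(hubs, key=lambda node: (-degree[node], node))


# Toy edge lists with placeholder node names (not real data)
condition_a = [("g1", "g2"), ("g1", "g3"), ("g4", "g3")]
condition_b = [("g2", "g1"), ("g1", "g5"), ("g1", "g4")]
hubs = hub_nodes(condition_a, condition_b)
```

In this toy case the edge between `g1` and `g2` is shared by both conditions, so it does not count as rewired; the remaining four edges determine the ranking.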
2	Differential Network Analysis based on Omic Data for Cancer Biomarker Discovery Zuo, Yiming 16 June 2017 (has links) Recent advances in high-throughput techniques enable the generation of a large amount of omic data such as genomics, transcriptomics, proteomics, metabolomics, glycomics, etc. Typically, differential expression analysis (e.g., Student's t-test, ANOVA) is performed to identify biomolecules (e.g., genes, proteins, metabolites, glycans) with significant changes at the individual level between biologically disparate groups (disease cases vs. healthy controls) for cancer biomarker discovery. However, differential expression analyses in independent studies of the same clinical types of patients have often led to different sets of significant biomolecules with only a few in common. This may be attributed to the fact that biomolecules are members of strongly intertwined biological pathways and highly interactive with each other. Without considering these interactions, differential expression analysis could lead to biased results. Network-based methods provide a natural framework to study the interactions between biomolecules. Commonly used data-driven network models include relevance networks, Bayesian networks, and Gaussian graphical models. In addition to data-driven network models, there are many publicly available databases such as STRING, KEGG, Reactome, and ConsensusPathDB, from which one can extract various types of interactions to build knowledge-driven networks. While both data- and knowledge-driven networks have their pros and cons, an appropriate approach to incorporate the prior biological knowledge from publicly available databases into data-driven network models is desirable for more robust and biologically relevant network reconstruction. 
Recently, there has been a growing interest in differential network analysis, where a connection in the network represents a statistically significant change in the pairwise interaction between two biomolecules in different groups. From the rewiring interactions shown in differential networks, biomolecules that have strongly altered connectivity between distinct biological groups can be identified. These biomolecules might play an important role in the disease under study. In fact, differential expression and differential network analyses investigate omic data from two complementary perspectives: the former focuses on changes at the level of individual biomolecules between different groups, while the latter concentrates on changes at the level of pairs of biomolecules. Therefore, an approach that can integrate differential expression and differential network analyses is likely to discover more reliable and powerful biomarkers. To achieve these goals, we start by proposing a novel data-driven network model (i.e., LOPC) to reconstruct sparse biological networks. The sparse networks contain only direct interactions between biomolecules, which can help researchers to focus on the more informative connections. Then we propose a novel method (i.e., dwgLASSO) to incorporate prior biological knowledge into data-driven network models to build biologically relevant networks. Differential network analysis is applied based on the networks constructed for biologically disparate groups to identify cancer biomarker candidates. Finally, we propose a novel network-based approach (i.e., INDEED) to integrate differential expression and differential network analyses to identify more reliable and powerful cancer biomarker candidates. INDEED is further expanded as INDEED-M to utilize omic data at different levels of the human biological system (e.g., transcriptomics, proteomics, metabolomics), which we believe is promising to increase our understanding of cancer. 
Matlab and R packages for the proposed methods have been developed and are available on GitHub (https://github.com/Hurricaner1989) to share with the research community. / Ph. D. / High-throughput techniques such as transcriptomics, proteomics, and metabolomics are widely used to generate ‘big’ data for cancer biomarker discovery. Typically, differential expression analysis is performed to identify cancer biomarkers. However, discrepancies are observed among independent studies that apply differential expression analysis to the same clinical types of samples. This may be attributed to the fact that biomolecules such as genes, proteins, and metabolites are members of strongly intertwined biological pathways and highly interactive with each other. Without considering these interactions, differential expression analysis could lead to biased results. In this dissertation, we propose to identify cancer biomarker candidates using network-based approaches. We start by proposing a novel data-driven network model (i.e., LOPC) to reconstruct sparse biological networks. Then we propose a novel method (i.e., wgLASSO) to incorporate prior biological knowledge from publicly available databases into a purely data-driven network model to build biologically relevant networks. In addition, a novel differential network analysis method (i.e., dwgLASSO) is proposed to identify cancer biomarkers. Finally, we propose a novel network-based approach (i.e., INDEED) to integrate differential expression and differential network analyses. INDEED is further expanded as INDEED-M to utilize omic data at different levels of the human biological system (e.g., transcriptomics, proteomics, and metabolomics) to identify cancer biomarkers from a systems biology perspective. Matlab and R packages for the proposed methods have been developed and shared with the research community. differential expression analysis differential network analysis cancer biomarker discovery
3	Evaluation of network inference algorithms and their effects on network analysis for the study of small metabolomic data sets Greenyer, Haley 24 May 2022 (has links) Motivation: Alzheimer’s Disease (AD) is a highly prevalent, neurodegenerative disease which causes gradual cognitive decline. As documented in the literature, evidence has recently mounted for the role of metabolic dysfunction in AD. Metabolomic data has therefore been increasingly used in AD studies. Metabolomic disease studies often suffer from small sample sizes and inflated false discovery rates. It is therefore of great importance to identify algorithms best suited for the inference of metabolic networks from small cohort disease studies. For future benchmarking, and for the development of new metabolic network inference methods, it is similarly important to identify appropriate performance measures for small sample sizes. Results: The performances of 13 different network inference algorithms, including correlation-based, regression-based, information theoretic, and hybrid methods, were assessed through benchmarking and structural network analyses. Benchmarking was performed on simulated data with known structures across six sample sizes using three different summative performance measures: area under the Receiver Operating Characteristic Curve, area under the Precision Recall Curve, and Matthews Correlation Coefficient. Structural analyses (commonly applied in disease studies), including betweenness, closeness, and eigenvector centrality were applied to simulated data. Differential network analysis was additionally applied to experimental AD data. Based on the performance measure benchmarking and network analysis results, I identified Probabilistic Context Likelihood Relatedness of Correlation with Biweight Midcorrelation (PCLRCb) (a novel variation of the PCLRC algorithm) to be best suited for the prediction of metabolic networks from small-cohort disease studies. 
Additionally, I identified Matthews Correlation Coefficient as the best measure with which to evaluate the performance of metabolic network inference methods across small sample sizes. / Graduate Alzheimer's Metabolomics Differential Network Analysis sample size network inference mouse model
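For readers unfamiliar with the Matthews Correlation Coefficient named in entry 3, the following minimal sketch shows how it can score an inferred network against a known structure. Each unordered node pair is treated as one binary prediction (edge or no edge). The function names and the toy metabolite labels are illustrative assumptions, not taken from the thesis.

```python
import math
from itertools import combinations


def edge_confusion(true_edges, predicted_edges, nodes):
    """Count TP, FP, FN, TN over all unordered node pairs."""
    truth = {frozenset(edge) for edge in true_edges}
    predicted = {frozenset(edge) for edge in predicted_edges}
    tp = fp = fn = tn = 0
    for pair in combinations(nodes, 2):
        key = frozenset(pair)
        if key in truth and key in predicted:
            tp += 1
        elif key in predicted:
            fp += 1
        elif key in truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def matthews_corrcoef(tp, fp, fn, tn):
    """Standard MCC; returns 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy network with placeholder metabolite names (not real data)
nodes = ["m1", "m2", "m3", "m4"]
true_edges = [("m1", "m2"), ("m2", "m3"), ("m3", "m4")]
predicted_edges = [("m2", "m1"), ("m2", "m3"), ("m1", "m4")]
counts = edge_confusion(true_edges, predicted_edges, nodes)
score = matthews_corrcoef(*counts)
```

Because the formula uses all four confusion counts, the score accounts for true negatives as well as true positives, which matters for sparse networks where most node pairs are not connected.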
4	Machine learning enabled bioinformatics tools for analysis of biologically diverse samples Lu, Yingzhou 25 August 2023 (has links) Advanced molecular profiling technologies, utilizing the entire human genome, have opened new avenues to study biological systems. Recent decades have seen the generation of vast volumes of multi-omics data spanning a broad range of phenotypes. The development of advanced bioinformatics tools to identify informative biomarkers from these data has become increasingly important. These tools are crucial for extracting meaningful biomarkers from these data, especially for understanding the biological pathways responsible for disease development. The identification of signature genes and the analysis of differentially networked genes are two fundamental and critically important tasks. However, many current methodologies employ test statistics that don't align perfectly with the signature definition, potentially leading to the identification of imprecise signatures. The task is challenging because the test statistics employed by many prevailing methods fall short of fulfilling the exact definition of a marker gene, inherently leaving them susceptible to deriving inaccurate features. The problem is further compounded when attempting to identify marker genes across biologically diverse samples, especially when comparing more than two biological conditions. Additionally, traditional differential group analysis or co-expression analysis under singular conditions often falls short in certain scenarios. For instance, the subtle expression levels of transcription factors (TFs) make their detection daunting, despite their pivotal role in guiding gene expression. Pinpointing the intricate network landscape of complex ailments and isolating core genes for subsequent analysis are challenging tasks. Yet, these marker genes are instrumental in identifying potential pivotal pathways. 
Multi-omics data, with its inherent complexity and diversity, presents unique challenges that traditional methods might struggle to address effectively. Recognizing this, our team sought to introduce new and innovative techniques specifically designed to handle this intricate dataset. To overcome these challenges, it is vital to develop and adopt innovative methods tailored to handle the complexity and diversity inherent in multi-omics data. In response to these challenges, we have pioneered the Cosine-based One-sample Test (COT), a method meticulously crafted for the analysis of biologically diverse samples. Tailored to discern marker genes across a spectrum of subtypes using their expression profiles, COT employs a one-sample test framework. The test statistic within COT utilizes cosine similarity, comparing a molecule's expression profile across various subtypes with the precise mathematical representation of ideal marker genes. To ensure ease of application and accessibility, we've encapsulated the COT workflow within a Python package. To assess its effectiveness, we undertook an exhaustive evaluation, juxtaposing the marker genes detection capabilities of COT against its contemporaries. This evaluation employed realistic simulation data. Our findings indicated that COT was not only adept at handling gene expression data but was also proficient with proteomics data. This data, sourced from enriched tissue or cell subtype samples, further accentuated COT's superior performance. We demonstrated the heightened effectiveness of COT when applied to gene expression and proteomics data originating from distinct tissue or cell subtypes. This led to innovative findings and hypotheses in several biomedical case studies. Additionally, we have enhanced the Differential Dependency Network (DDN) framework to detect network rewiring between different conditions where significantly rewired network modes serve as informative biomarkers. 
Using cross-condition data and a block-wise Lasso network model, DDN detects significant network rewiring together with a subnetwork of hub molecular entities. In DDN 3.0, we took imbalanced sample sizes into consideration, integrated several acceleration strategies to enable it to handle large datasets, and enhanced the network presentation for more informative network displays, including a color-coded differential dependency network and a gradient heatmap. We applied it to simulated data and real data to detect critical changes in molecular network topology. The current tool stands as a valuable blueprint for the development and validation of mechanistic disease models. This foundation aids in offering a coherent interpretation of data, deepening our understanding of disease biology, and sparking new hypotheses ripe for subsequent validation and exploration. As we chart our future course, our vision is to expand the scope of tools like COT and DDN 3.0 and to explore the vast realm of multi-omics data, including those from longitudinal studies or clinical trials. We're looking at incorporating datasets from longitudinal studies and clinical trials, domains where data complexity scales to new heights. We believe that these tools can facilitate a more nuanced and comprehensive understanding of disease development and progression. Furthermore, by integrating these methods with other advanced bioinformatics and machine learning tools, we aim to create a holistic pipeline that will allow for seamless extraction of significant biomarkers and actionable insights from multi-omics data. This is a promising step towards precision medicine, where individual genomic information can guide personalized treatment strategies. / Doctor of Philosophy / Recent advances in technology have allowed us to study human biology on a much larger scale than ever before. These technologies have produced a lot of data on many different types of traits. 
As a result, it's becoming increasingly important to develop tools that can sift through this data and find meaningful biomarkers – essentially, indicators that can help us understand what causes diseases. Two key parts of this process are identifying 'signature genes' and analyzing groups of genes that work together differently depending on the circumstances. But, current methods have their drawbacks – they don't always pick out the right genes and can struggle when comparing more than two groups at once. There are also other challenges when it comes to identifying groups of genes that express differently or work together under one set of conditions. For instance, some important genes – known as transcription factors (TFs) – control the activity of other genes. But because TFs are often expressed at low levels, they're hard to detect, even though they play a key role in controlling gene activity. And, it can be tough to identify 'hub' genes, which are central to gene networks and can help us understand the potential key pathways in diseases. To address these challenges, we introduced the Cosine based One-sample Test (COT), a novel approach to identify pivotal genes across diverse samples. COT gauges the alignment of a gene's expression profile with the quintessential marker genes' definition. Our evaluations underscore COT's robust performance, paving the way for deeper disease understanding. Further enhancing our toolkit, we've refined the Differential Dependency Network (DDN), a method to unravel the dynamic interplay of genes under diverse conditions. DDN 3.0 is a more robust iteration, adept at accommodating varied sample sizes, efficiently processing vast datasets, and offering richer visualizations of gene networks. Its prowess in pinpointing crucial alterations in gene networks is noteworthy. The Cosine based One-sample Test (COT) and the Differential Dependency Network (DDN) are revolutionary tools, poised to significantly elevate genomics research. 
COT, with its precision in gauging the alignment of a gene's expression pattern with predefined ideal gene markers, emerges as an invaluable asset in the hunt for marker genes. It acts as a fine-tuned sieve, meticulously screening vast datasets to unveil these crucial genetic signposts. On the other hand, DDN offers a comprehensive framework to decipher the intricate web of gene interactions under diverse conditions. It meticulously analyzes the interplay between genes, spotlighting potential 'hub' genes and highlighting shifts in their dynamic relationships. Together, COT and DDN not only pave the way for the identification of pivotal marker genes but also furnish a richer, more nuanced understanding of the genomic landscape. By leveraging these tools, researchers are empowered to unravel the intricate tapestry of genes, laying the foundation for groundbreaking discoveries in genomics. Looking to the future, we plan to apply COT and DDN 3.0 to more complex datasets. We believe these tools will give us a better understanding of how diseases develop and progress. By integrating these methods with other advanced tools, we're aiming to create a complete system for extracting important biomarkers and insights from this complex data. This is a big step towards precision medicine, where a person's unique genetic information could guide their treatment strategy. machine learning biomarkers pathway analysis differential network analysis multi-omics integration
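The cosine-based statistic described in entry 4 can be sketched as follows. This is only an illustration under stated assumptions, not the COT package's implementation: it assumes a gene is summarized by its non-negative mean expression in each subtype, and that the "ideal marker" for subtype k is a profile expressed in subtype k only, i.e. the k-th unit vector. Under that assumption the cosine similarity reduces to the k-th entry divided by the vector norm. The function name is hypothetical.

```python
import math


def cosine_marker_score(subtype_means):
    """Return (best_subtype_index, cosine) for one gene.

    Assumes non-negative subtype mean expressions and unit-vector
    ideal markers, so cos(x, e_k) = x[k] / ||x||.
    """
    norm = math.sqrt(sum(value * value for value in subtype_means))
    if norm == 0:
        # A gene with no expression anywhere cannot be a marker
        return None, 0.0
    cosines = [value / norm for value in subtype_means]
    best = max(range(len(cosines)), key=lambda index: cosines[index])
    return best, cosines[best]


# Toy profiles over three subtypes (placeholder values, not real data)
specific_gene = cosine_marker_score([0.0, 10.0, 0.0])
mixed_gene = cosine_marker_score([3.0, 0.0, 4.0])
flat_gene = cosine_marker_score([2.0, 2.0, 2.0])
```

A score of 1.0 means the gene is expressed in a single subtype, while a gene expressed equally in all three subtypes scores 1/sqrt(3), so ranking genes by this score favors subtype-specific expression over high overall expression.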
5	BioNetStat: a tool for biological networks differential analysis Carvalho, Vinícius Jardim 08 February 2018 (has links) The diversity of interactions that occur within biological systems, from the organelles of a cell to the entire biosphere, can be modeled using network theory. The dynamics of the interactions among elements is an intrinsic property of these systems. Several tools have been proposed to compare networks that represent the many states assumed by a system. However, none of them can compare structural characteristics of more than two networks simultaneously. Given the large number of states a system can assume, we built a statistical tool to compare two or more networks and indicate key variables in the process under study. The main proposal of this work was to compare correlation networks using measures based on graph spectra (the set of eigenvalues of the adjacency matrices), such as the spectral distribution. This measure is associated with several structural characteristics of networks, such as the number of walks, diameter, and cliques. In addition to the spectral distribution, we also compared networks by spectral entropy, degree distribution, and node centralities. We used two different biological datasets (gene expression of tumor cells and plant metabolism) to carry out performance tests of the tool and for the case studies. The proposed method is implemented in an R package called BioNetStat, with a graphical interface for users without programming experience. We found that the tests are efficient at differentiating more than two networks. In addition, increasing the number of networks compared and decreasing the number of sample units both reduce the statistical power of the test. 
We also show that significant time is saved by performing a single analysis to compare many networks instead of comparing them pairwise. Furthermore, the method pointed out groups of variables with a central role in the biological systems studied that were not found in analyses where only the expression or concentration of the elements was studied. It was thus possible to differentiate cancer cell types or organs of plant organisms through network centralities. The variables identified allow the user to generate hypotheses about their roles in the processes under study. BioNetStat can thus help detect possible new discoveries associated with the mechanisms by which systems function. / The diversity of interactions among the elements of biological systems can be studied using network theory. Moreover, the dynamics of these interactions is an inherent trait of those systems. In this sense, several tools have been proposed to compare networks, in which each network represents a state assumed by the system. However, biological systems can generally assume many more than two states, and none of the tools is able to compare structural characteristics among more than two networks simultaneously. To solve this issue, we developed a statistical tool to compare two or more networks and highlight key variables of a system. Here we describe the new method, called BioNetStat, which is able to compare correlation networks using traits that are based on graph spectra (the set of eigenvalues of the adjacency matrix), such as the spectral distribution. This measure is associated with several structural characteristics of networks such as the number of walks, diameter, and cliques. In addition to the spectral distribution, BioNetStat can also compare networks by their node centralities. 
We used two different biological datasets, gene expression of tumor cells and plant metabolism, to evaluate the performance of BioNetStat and as case studies. The tool is implemented in an R package, and it also has a user-friendly interface. We showed that BioNetStat is efficient in distinguishing more than two networks. In comparison with a similar tool (GSCA), increasing the number of compared networks reduces the statistical power of BioNetStat less than that of GSCA. Furthermore, BioNetStat is able to find signaling pathways in a larger proportion than GSCA, complementing tools proposed in the literature. In the case studies, the method pointed out variables, and sets of variables, with a central role in biological systems, which were not highlighted when only gene expression patterns or metabolomics were studied. For instance, BioNetStat allowed us to differentiate among cancer types and plant organs. The BioNetStat results bring new findings on what differentiates the states, giving us a systemic view of our study subject and allowing new hypotheses to be proposed about the studied processes. Coexpression network Correlation network Differential network analysis Network analysis Networks theory Systems biology
6	Condition-specific differential subnetwork analysis for biological systems Jhamb, Deepali 04 1900 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / Biological systems behave differently under different conditions. Advances in sequencing technology over the last decade have led to the generation of enormous amounts of condition-specific data. However, these measurements often fail to identify low abundance genes/proteins that can be biologically crucial. In this work, a novel text-mining system was first developed to extract condition-specific proteins from the biomedical literature. The literature-derived data was then combined with proteomics data to construct condition-specific protein interaction networks. Further, an innovative condition-specific differential analysis approach was designed to identify key differences, in the form of subnetworks, between any two given biological systems. The framework developed here was implemented to understand the differences between limb regeneration-competent Ambystoma mexicanum and –deficient Xenopus laevis. This study provides an exhaustive systems level analysis to compare regeneration competent and deficient subnetworks to show how different molecular entities inter-connect with each other and are rewired during the formation of an accumulation blastema in regenerating axolotl limbs. This study also demonstrates the importance of literature-derived knowledge, specific to limb regeneration, to augment the systems biology analysis. Our findings show that although the proteins might be common between the two given biological conditions, they can have a high dissimilarity based on their biological and topological properties in the subnetwork. The knowledge gained from the distinguishing features of limb regeneration in amphibians can be used in future to chemically induce regeneration in mammalian systems. 
The approach developed in this dissertation is scalable and adaptable to understand differential subnetworks between any two biological systems. This methodology will not only facilitate the understanding of biological processes and molecular functions which govern a given system but also provide novel intuitions about the pathophysiology of diseases/conditions. Limb regeneration Text mining Differential network analysis Subnetwork analysis Concept based mining Extremities (Anatomy) -- Regeneration Extremities (Anatomy) -- Physiology Text processing (Computer science) Data mining
