  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
31

Paralelização de inferência em redes credais utilizando computação distribuída para fatoração de matrizes esparsas / Parallelization of credal network inference using distributed computing for sparse matrix factorization.

Ramon Fortes Pereira 25 April 2017
This study aims to improve the computational performance of inference algorithms for credal networks by applying parallel computing and distributed systems techniques to sparse matrix factorization algorithms. Roughly speaking, parallel computing techniques transform a system into one whose algorithms can be executed concurrently. Matrix factorization is a family of mathematical techniques for decomposing a matrix into a product of two or more matrices. Sparse matrices are matrices in which most entries are zero. Credal networks are similar to Bayesian networks, which are acyclic graphs that represent a joint probability distribution through conditional probabilities and their independence relations. Credal networks can be regarded as an extension of Bayesian networks for dealing with uncertainty or poor data quality. To apply parallel sparse matrix factorization to credal network inference, the inference uses the variable elimination technique, in which the acyclic graph of the credal network is associated with a sparse matrix and each eliminated variable is analogous to the elimination of a column.
32

Stochastic process analysis for Genomics and Dynamic Bayesian Networks inference.

Lebre, Sophie 14 September 2007
This thesis is dedicated to the development of statistical and computational methods for the analysis of DNA sequences and gene expression time series.
First, we study a parsimonious Markov model called the Mixture Transition Distribution (MTD) model, which is a mixture of Markovian transitions. The large number of constraints on the parameters of this model prevents the formulation of an analytical expression for the Maximum Likelihood Estimate (MLE). We propose to approximate the MLE with an EM algorithm. After comparing the performance of this algorithm with results from the literature, we use it to evaluate the relevance of MTD modeling for bacterial DNA coding sequences in comparison with standard Markovian modeling.
Then we propose two different approaches for recovering genetic regulatory networks. We model these networks with Dynamic Bayesian Networks (DBNs) whose edges describe the dependency relationships between time-delayed gene expression levels. The aim is to estimate the topology of this graph even though the number of repeated measurements is very low compared with the number of observed genes.
To address this dimensionality problem, we first assume that the dependency relationships are homogeneous, that is, the graph topology is constant across time. We then propose to approximate this graph by considering partial order dependencies. The concept of partial order dependence graphs, already introduced for static and undirected graphs, is adapted and characterized for DBNs using the theory of graphical models. From these results, we develop a deterministic procedure for DBN inference.
Finally, we relax the homogeneity assumption by considering a succession of several homogeneous phases. We consider a multiple changepoint regression model. Each changepoint indicates a change in the regression model parameters, which corresponds to the way an expression level depends on the others. Using reversible jump MCMC methods, we develop a stochastic algorithm that simultaneously infers the locations of the changepoints and the structure of the network within the phases they delimit.
Both approaches are validated on simulated and real data.
33

Robust inference of gene regulatory networks : System properties, variable selection, subnetworks, and design of experiments

Nordling, Torbjörn E. M. January 2013
In this thesis, inference of biological networks from in vivo data generated by perturbation experiments is considered, i.e. deduction of the causal interactions that exist among the observed variables. Knowledge of such regulatory influences is essential in biology. A system property, interampatteness, is introduced that explains why the variation in existing gene expression data is concentrated in a few “characteristic modes” or “eigengenes”, and why previously inferred models have a large number of false positive and false negative links. An interampatte system is characterized by strong INTERactions enabling simultaneous AMPlification and ATTEnuation of different signals, and we show that perturbation of individual state variables, e.g. genes, typically leads to ill-conditioned data with both characteristic and weak modes. The weak modes are typically dominated by measurement noise due to poor excitation, and their existence hampers network reconstruction. The excitation problem is solved by iterative design of correlated multi-gene perturbation experiments that counteract the intrinsic signal attenuation of the system. The next perturbation should be designed such that the expected response practically spans an additional dimension of the state space. The proposed design is numerically demonstrated for the Snf1 signalling pathway in S. cerevisiae. The impact of unperturbed and unobserved latent state variables, which exist in any real biological system, on the inferred network and on the required set-up of the experiments for network inference is analysed. Their existence implies that, in general, a subnetwork of pseudo-direct causal regulatory influences, accounting for all environmental effects, is inferred. In principle, the number of latent states and different paths between the nodes of the network can be estimated, but their identity cannot be determined unless they are observed or perturbed directly.
Network inference is recognized as a variable/model selection problem and solved by considering all possible models of a specified class that can explain the data at a desired significance level, and by classifying only the links present in all of these models as existing. As shown, these links can be determined without any parameter estimation by reformulating the variable selection problem as a robust rank problem. Solution of the rank problem enables assignment of confidence to individual interactions, without resorting to any approximation or asymptotic results. This is demonstrated by reverse engineering of the synthetic IRMA gene regulatory network from published data. A previously unknown activation of transcription of SWI5 by CBF1 in the IRMA strain of S. cerevisiae is proven to exist, which serves to illustrate that even the accumulated knowledge of well-studied genes is incomplete.
34

Exploring the Boundaries of Gene Regulatory Network Inference

Tjärnberg, Andreas January 2015
To understand how the components of a complex system like the biological cell interact and regulate each other, we need to collect data for how the components respond to system perturbations. Such data can then be used to solve the inverse problem of inferring a network that describes how the pieces influence each other. The work in this thesis deals with modelling the cell regulatory system, often represented as a network, with tools and concepts derived from systems biology. The first investigation focuses on network sparsity and algorithmic biases introduced by penalised network inference procedures. Many contemporary network inference methods rely on a sparsity parameter such as the L1 penalty term used in the LASSO. However, a poor choice of the sparsity parameter can give highly incorrect network estimates. In order to avoid such poor choices, we devised a method to optimise the sparsity parameter, which maximises the accuracy of the inferred network. We showed that it is effective on in silico data sets with a reasonable level of informativeness and demonstrated that accurate prediction of network sparsity is key to elucidate the correct network parameters. The second investigation focuses on how knowledge from association networks can be transferred to regulatory network inference procedures. It is common that the quality of expression data is inadequate for reliable gene regulatory network inference. Therefore, we constructed an algorithm to incorporate prior knowledge and demonstrated that it increases the accuracy of network inference when the quality of the data is low. The third investigation aimed to understand the influence of system and data properties on network inference accuracy. L1 regularisation methods commonly produce poor network estimates when the data used for inference is ill-conditioned, even when the signal to noise ratio is so high that all links in the network can be proven to exist for the given significance. 
In this study we elucidated some general principles for the conditions under which strongly degraded accuracy is to be expected. Moreover, the study allowed us to estimate the expected accuracy from the conditions of simulated data, which was used to predict the performance of inference algorithms on biological data. Finally, we built a software package, GeneSPIDER, for solving problems encountered during the previous investigations. The software package supports highly controllable network and data generation as well as data analysis and exploration in the context of network inference. / At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 4: Manuscript.
35

Inférence des réseaux de régulation de la synthèse des protéines de réserve du grain de blé tendre (Triticum aestivum L.) en réponse à l'approvisionnement en azote et en soufre / Inference and analysis of regulatory networks involved in wheat (Triticum aestivum L.) grain storage protein synthesis and their response to nitrogen and sulfur supply

Vincent, Jonathan 10 September 2014
Grain storage protein content and composition are the main determinants of bread wheat (Triticum aestivum L.) end-use value and nutritional quality. Scaling laws governing grain protein composition according to grain nitrogen and sulfur content could be the outcome of a finely tuned regulation network. Although it has been demonstrated that the main regulation of grain storage protein accumulation occurs at the transcriptomic level in cereals, knowledge of the underlying molecular mechanisms remains elusive. Moreover, the effects of nitrogen and sulfur on these mechanisms are unknown. The issue of skyrocketing data generation in research projects is addressed by developing high-throughput bioinformatics approaches. Extracting knowledge from such massive amounts of data is therefore an important challenge. The work presented herein aims at elucidating the regulatory networks involved in grain storage protein synthesis and their response to nitrogen and sulfur supply using a rule discovery approach. This approach was extended and implemented as a web-oriented platform dedicated to the inference and analysis of regulatory networks from qualitative and quantitative –omics data. This platform allowed us to define different semantics in a comprehensive framework, each semantic having its own biological meaning, thus providing us with globally informative networks. Spatiotemporal specificity of transcription factor expression was observed, and particular attention was paid to the relationship of these factors with grain storage proteins in the inferred networks.
The work initiated here opens up a field of innovative investigation to identify new targets for plant breeding and for an improved end-use value and nutritional quality of wheat in the context of limited inputs. Further analyses should improve the understanding of the control of grain protein composition and make it possible to produce wheat adapted to specific uses or deficient in the protein fractions responsible for gluten allergenicity and intolerance.
36

Information-theoretic variable selection and network inference from microarray data

Meyer, Patrick E. 16 December 2008
Statisticians are used to modelling interactions between variables on the basis of observed data. In many emerging fields, such as bioinformatics, they are confronted with datasets that have thousands of variables, a lot of noise, non-linear dependencies and only tens of samples. Detecting functional relationships when the data contain such uncertainty is a major challenge.
Our work focuses on variable selection and network inference from datasets with many variables and few samples (a high variable-to-sample ratio), such as microarray data. Variable selection is the area of machine learning whose objective is to select, among a set of input variables, those that lead to the best predictive model. Applying variable selection methods to gene expression data can, for example, improve cancer diagnosis and prognosis by identifying a new molecular signature of the disease. Network inference consists of representing the dependencies between the variables of a dataset by a graph. Hence, when applied to microarray data, network inference can reverse-engineer the transcriptional regulatory network of a cell with a view to discovering new drug targets to cure diseases.
In this work, two original tools are proposed: MASSIVE (Matrix of Average Sub-Subset Information for Variable Elimination), a new feature selection method, and MRNET (Minimum Redundancy NETwork), a new network inference algorithm. Both tools rely on the computation of mutual information, an information-theoretic measure of dependency. More precisely, MASSIVE and MRNET approximate the mutual information between a subset of variables and a target variable by combining mutual informations between sub-subsets of variables and the target. These approximations make it possible to estimate a series of low-variate densities instead of one large multivariate density. Low-variate densities are well suited to datasets with a high variable-to-sample ratio, since they are rather cheap in terms of computational cost and do not require a large number of samples to be estimated accurately. Numerous experimental results show the competitiveness of these new approaches. Finally, our thesis has led to freely available source code for MASSIVE and an open-source R and Bioconductor package for network inference. / Doctorate in Sciences, specialization in Computer Science
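As an illustration of the quantity both tools build on, the following is a minimal sketch of the plain empirical ("plug-in") estimate of mutual information between two discrete variables. It is not the MASSIVE or MRNET algorithm, and the function name `mutual_information` is hypothetical; it only assumes that the data have already been discretized.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples.

    Illustrative helper only (hypothetical name); it is not the estimator
    used by MASSIVE or MRNET, which the abstract does not specify.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)            # marginal counts of X
    count_y = Counter(ys)            # marginal counts of Y
    count_xy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    gene_a = [0, 0, 1, 1]
    print(mutual_information(gene_a, gene_a))        # identical variables
    print(mutual_information(gene_a, [0, 1, 0, 1]))  # independent pattern
```

Because only pairwise (low-variate) counts are needed, such an estimate remains usable with few samples, which is the property the abstract emphasizes.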
37

Exact Bayesian Inference in Graphical Models : Tree-structured Network Inference and Segmentation / Inférence bayésienne exacte dans les modèles graphiques : inférence de réseaux à structure arborescente et segmentation

Schwaller, Loïc 09 September 2016
In this dissertation we investigate the problem of network inference. The statistical framework tailored to this task is that of graphical models, in which the (in)dependence relationships satisfied by a multivariate distribution are represented through a graph. The task is then to learn the structure of the model from observations on the vertices. We consider the problem from a Bayesian perspective and focus on a subset of graphs that makes structure inference possible in an exact and efficient manner, namely spanning trees. Indeed, a function defined on spanning trees can be integrated with cubic complexity with respect to the number of variables, provided that the function factorises over the edges, in spite of the super-exponential cardinality of this set. A careful choice of prior distributions on both graphs and distribution parameters makes it possible to use this result for network inference in tree-structured graphical models, for which we provide a complete and formal framework.
We also consider the situation in which observations are organised in a multivariate time series. We assume that the underlying graph describing the dependence structure of the distribution is affected by an unknown number of abrupt changes over time. Our goal is then to retrieve the number and locations of these change-points, which is a segmentation problem. Using spanning trees and assuming that segments are independent of one another, we show that this can be achieved with polynomial complexity with respect to both the number of variables and the length of the series.
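The cubic-time summation over spanning trees mentioned in the abstract can be illustrated with the classical Matrix-Tree theorem: the sum, over all spanning trees, of the product of edge weights equals the determinant of the weighted graph Laplacian with one row and column removed. The sketch below assumes this is the underlying result (the abstract does not name it); the function name `spanning_tree_weight_sum` is hypothetical, and exact rational arithmetic is used only to keep the example free of rounding issues.

```python
from fractions import Fraction


def spanning_tree_weight_sum(weights):
    """Sum over all spanning trees of the product of their edge weights.

    `weights` is a symmetric n x n matrix (list of lists) with a zero
    diagonal. Uses the Matrix-Tree theorem: the result is the determinant
    of the weighted Laplacian with its last row and column removed.
    Hypothetical helper for illustration, not code from the thesis.
    """
    n = len(weights)
    if n == 1:
        return Fraction(1)  # a single vertex has exactly one (empty) tree
    m = n - 1
    # Build the reduced Laplacian: degree on the diagonal, -w off-diagonal.
    lap = [[Fraction(0)] * m for _ in range(m)]
    for i in range(m):
        for j in range(n):
            if i != j:
                w = Fraction(weights[i][j])
                lap[i][i] += w
                if j < m:
                    lap[i][j] -= w
    # Determinant by Gaussian elimination: O(n^3) arithmetic operations.
    det = Fraction(1)
    for col in range(m):
        pivot = next((r for r in range(col, m) if lap[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular: the graph is disconnected
        if pivot != col:
            lap[col], lap[pivot] = lap[pivot], lap[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= lap[col][col]
        for r in range(col + 1, m):
            factor = lap[r][col] / lap[col][col]
            for c in range(col, m):
                lap[r][c] -= factor * lap[col][c]
    return det


if __name__ == "__main__":
    # Complete graph on 4 vertices with unit weights: counts spanning trees.
    k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
    print(spanning_tree_weight_sum(k4))
```

With unit weights the function simply counts spanning trees; with edge weights interpreted as per-edge factors of a prior or likelihood, the same determinant yields the normalizing sum over all tree structures without enumerating them.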
38

Classification et inférence de réseaux pour les données RNA-seq / Clustering and network inference for RNA-seq data

Gallopin, Mélina 09 December 2015
This thesis gathers methodological contributions to the statistical analysis of next-generation high-throughput transcriptome sequencing data (RNA-seq). RNA-seq data are discrete, and the number of samples sequenced is usually small because of the cost of the technology. These two points are the main statistical challenges in modelling RNA-seq data.
The first part of the thesis is dedicated to the co-expression analysis of RNA-seq data using model-based clustering, with the aim of detecting modules of co-expressed genes. A natural model for discrete RNA-seq data is a Poisson mixture model. However, a Gaussian mixture model in conjunction with a simple transformation applied to the data is a reasonable alternative. We propose to compare the two alternatives using a data-driven criterion to select the model that best fits each dataset. In addition, we present a model selection criterion that takes external gene annotations into account and makes it easier to obtain biologically interpretable clusters. This criterion is not specific to RNA-seq data. It is useful in any co-expression analysis using model-based clustering designed to enrich functional annotation databases.
The second part of the thesis is dedicated to network inference using graphical models. The aim of network inference is to detect dependence relationships among genes based on their expression levels. We propose a network inference model based on Poisson distributions that takes into account the discrete nature and high inter-sample variability of RNA-seq data. However, network inference methods require a large number of samples. For Gaussian graphical models, a competing model, we propose a non-asymptotic approach to detect relevant subsets of genes based on a block-diagonal decomposition of the covariance matrix. This method is not specific to RNA-seq data and reduces the dimension of any network inference problem based on the Gaussian graphical model.
39

Computational Cancer Research: Network-based analysis of cancer data disentangles clinically relevant alterations from molecular measurements

Seifert, Michael 12 September 2022
Cancer is a very complex genetic disease driven by combinations of mutated genes. This complexity strongly complicates the identification of driver genes and poses enormous challenges for revealing how they influence cancerogenesis, prognosis or therapy response. Thousands of molecular profiles of the major human cancer types have been measured in recent years. Apart from well-studied frequently mutated genes, little is still known about the role of rarely mutated genes in cancer or the interplay of mutated genes in individual cancers. Gene expression and mutation profiles can be measured routinely, but computational methods for identifying driver candidates and predicting their potential impacts on downstream targets and clinically relevant characteristics rarely exist. Instead of focusing only on frequently mutated genes, each cancer patient should be analyzed using the full information in their cancer-specific molecular profiles to improve the understanding of cancerogenesis and to predict the prognosis and therapy response of individual patients more precisely. This requires novel computational methods for the integrative analysis of molecular cancer data. A promising way to realize this is to consider cancer as a disease of cellular networks. Therefore, I have developed a novel network-based approach for the integrative analysis of molecular cancer data over recent years. This approach directly learns gene regulatory networks from gene expression and copy number data and further makes it possible to quantify the impacts of altered genes on clinically relevant downstream targets using network propagation. This habilitation thesis summarizes the results of seven of my publications. All publications focus on the integrative analysis of molecular cancer data, with an overarching connection to the newly developed network-based approach.
In the first three publications, networks were learned to identify major regulators that distinguish characteristic gene expression signatures with applications to astrocytomas, oligodendrogliomas, and acute myeloid leukemia. Next, the central publication of this habilitation thesis, which combines network inference with network propagation, is introduced. The great value of this approach is demonstrated by quantifying potential direct and indirect impacts of rare and frequent gene copy number alterations on patient survival. Further, the publication of the corresponding user-friendly R package regNet is introduced. Finally, two additional publications that also strongly highlight the value of the developed network-based approach are presented with the aims to predict cancer gene candidates within the region of the 1p/19q co-deletion of oligodendrogliomas and to determine driver candidates associated with radioresistance and relapse of prostate cancer. All seven publications are embedded into a brief introduction that motivates the scientific background and the major objectives of this thesis. The background is briefly going from the hallmarks of cancer over the complexity of cancer genomes down to the importance of networks in cancer. This includes a short introduction of the mathematical concepts that underlie the developed network inference and network propagation algorithms. Further, I briefly motivate and summarize my studies before the original publications are presented. The habilitation thesis is completed with a general discussion of the major results with a specific focus on the utilized network-based data analysis strategies. Major biologically and clinically relevant findings of each publication are also briefly summarized.
40

Redes complexas de expressão gênica: síntese, identificação, análise e aplicações / Gene expression complex networks: synthesis, identification, analysis and applications

Lopes, Fabricio Martins 21 February 2011
Os avanços na pesquisa em biologia molecular e bioquímica permitiram o desenvolvimento de técnicas capazes de extrair informações moleculares de milhares de genes simultaneamente, como DNA Microarrays, SAGE e, mais recentemente RNA-Seq, gerando um volume massivo de dados biológicos. O mapeamento dos níveis de transcrição dos genes em larga escala é motivado pela proposição de que o estado funcional de um organismo é amplamente determinado pela expressão de seus genes. No entanto, o grande desafio enfrentado é o pequeno número de amostras (experimentos) com enorme dimensionalidade (genes). Dessa forma, se faz necessário o desenvolvimento de novas técnicas computacionais e estatísticas que reduzam o erro de estimação intrínseco cometido na presença de um pequeno número de amostras com enorme dimensionalidade. Neste contexto, um foco importante de pesquisa é a modelagem e identificação de redes de regulação gênica (GRNs) a partir desses dados de expressão. O objetivo central nesta pesquisa é inferir como os genes estão regulados, trazendo conhecimento sobre as interações moleculares e atividades metabólicas de um organismo. Tal conhecimento é fundamental para muitas aplicações, tais como o tratamento de doenças, estratégias de intervenção terapêutica e criação de novas drogas, bem como para o planejamento de novos experimentos. Nessa direção, este trabalho apresenta algumas contribuições: (1) software de seleção de características; (2) nova abordagem para a geração de Redes Gênicas Artificiais (AGNs); (3) função critério baseada na entropia de Tsallis; (4) estratégias alternativas de busca para a inferência de GRNs: SFFS-MR e SFFS-BA; (5) investigação biológica das redes gênicas envolvidas na biossíntese de tiamina, usando a Arabidopsis thaliana como planta modelo. 
O software de seleção de características consiste de um ambiente de código livre, gráfico e multiplataforma para problemas de bioinformática, que disponibiliza alguns algoritmos de seleção de características, funções critério e ferramentas de visualização gráfica. Em particular, implementa um método de inferência de GRNs baseado em seleção de características. Embora existam vários métodos propostos na literatura para a modelagem e identificação de GRNs, ainda há um problema muito importante em aberto: como validar as redes identificadas por esses métodos computacionais? Este trabalho apresenta uma nova abordagem para validação de tais algoritmos, considerando três aspectos principais: (a) Modelo para geração de Redes Gênicas Artificiais (AGNs), baseada em modelos teóricos de redes complexas, os quais são usados para simular perfis temporais de expressão gênica; (b) Método computacional para identificação de redes gênicas a partir de dados temporais de expressão; e (c) Validação das redes identificadas por meio do modelo AGN. O desenvolvimento do modelo AGN permitiu a análise e investigação das características de métodos de inferência de GRNs, levando ao desenvolvimento de um estudo comparativo entre quatro métodos disponíveis na literatura. 
A avaliação dos métodos de inferência levou ao desenvolvimento de novas metodologias para essa tarefa: (a) uma função critério, baseada na entropia de Tsallis, com objetivo de inferir os inter-relacionamentos gênicos com maior precisão; (b) uma estratégia alternativa de busca para a inferência de GRNs, chamada SFFS-MR, a qual tenta explorar uma característica local das interdependências regulatórias dos genes, conhecida como predição intrinsecamente multivariada; e (c) uma estratégia de busca, interativa e flutuante, que baseia-se na topologia de redes scale-free, como uma característica global das GRNs, considerada como uma informação a priori, com objetivo de oferecer um método mais adequado para essa classe de problemas e, com isso, obter resultados com maior precisão. Também é objetivo deste trabalho aplicar a metodologia desenvolvida em dados biológicos, em particular na identificação de GRNs relacionadas a funções específicas de Arabidopsis thaliana. Os resultados experimentais, obtidos a partir da aplicação das metodologias propostas, mostraram que os respectivos ganhos de desempenho foram significativos e adequados para os problemas a que foram propostos. / Thanks to recent advances in molecular biology and biochemistry, allied to an ever increasing amount of experimental data, the functional state of thousands of genes can now be extracted simultaneously by using methods such as DNA microarrays, SAGE, and more recently RNA-Seq, generating a massive volume of biological data. The mapping of gene transcription levels at large scale is motivated by the proposition that information of the functional state of an organism is broadly determined by its gene expression. However, the main limitation faced is the small number of samples (experiments) with huge dimensionalities (genes). 
Thus, it is necessary to develop new computational and statistical techniques to reduce the inherent estimation error committed in the presence of a small number of samples with large dimensionality. In this context, a particularly important line of investigation is the modeling and identification of gene regulatory networks (GRNs) from expression data sets. The main objective of this research is to infer how genes are regulated, bringing knowledge about the molecular interactions and metabolic activities of an organism. Such knowledge is fundamental for many applications, such as disease treatment, therapeutic intervention strategies and drug design, as well as for planning new high-throughput experiments. In this direction, this work presents some contributions: (1) feature selection software; (2) a new approach for the generation of artificial gene networks (AGNs); (3) a criterion function based on Tsallis entropy; (4) alternative search strategies for GRN inference: SFFS-MR and SFFS-BA; (5) a biological investigation of GRNs involved in thiamine biosynthesis, adopting Arabidopsis thaliana as a model plant. The feature selection software is an open-source, multiplatform graphical environment for bioinformatics problems, which supports several feature selection algorithms, criterion functions and graphic visualization tools. In particular, a feature selection method for GRN inference is also implemented in the software. Although several methods have been proposed in the literature for the modeling and identification of GRNs, an important problem remains open: how to validate such methods and their results? 
This work presents a new approach for the validation of such algorithms by considering three main aspects: (a) an Artificial Gene Network (AGN) generation model, based on theoretical models of complex networks, which is used to simulate temporal expression data; (b) a computational method for GRN identification from temporal expression data; and (c) validation of the identified AGN-based network through comparison with the original network. The development of the AGN model made it possible to analyze and investigate the characteristics of GRN inference methods, leading to a comparative study of four inference methods available in the literature. The evaluation of inference methods led to the development of new methodologies for this task: (a) a new criterion function based on Tsallis entropy, in order to infer the genetic inter-relationships with better precision; (b) an alternative search strategy for GRN inference, called SFFS-MR, which tries to exploit a local property of the regulatory gene interdependencies known as intrinsically multivariate prediction; and (c) a search strategy, interactive and floating, which is based on scale-free network topology as a global property of GRNs, treated as a priori information, in order to provide a more appropriate method for this class of problems and thereby achieve results with better precision. A further objective of this work is to apply the developed methodology to biological data, particularly in identifying GRNs related to specific functions of Arabidopsis thaliana. The experimental results, obtained from the application of the proposed methodologies, indicate that the respective performance gains were significant and adequate for the problems for which they were proposed.
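
The abstract above names a criterion function based on Tsallis entropy without giving its form. As a rough illustration only, the sketch below uses the standard definition S_q = (1 - sum_i p_i^q) / (q - 1) and scores a candidate predictor set by the conditional entropy of the target gene, estimated from discretized samples. The function names, the frequency-weighted averaging, and the toy data are assumptions made for this example; they are not taken from the thesis or its software.

```python
from collections import Counter, defaultdict
import math


def tsallis_entropy(probs, q):
    """Tsallis entropy S_q = (1 - sum p_i^q) / (q - 1).

    For q -> 1 this reduces to the Shannon entropy (in nats).
    """
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)


def conditional_tsallis(predictor_rows, target, q):
    """Hypothetical criterion: mean Tsallis entropy of the target within each
    observed predictor state, weighted by how often that state occurs.
    Lower values mean the predictors determine the target more sharply."""
    groups = defaultdict(Counter)
    for row, y in zip(predictor_rows, target):
        groups[tuple(row)][y] += 1
    n = len(target)
    total = 0.0
    for counts in groups.values():
        m = sum(counts.values())
        total += (m / n) * tsallis_entropy([c / m for c in counts.values()], q)
    return total


# Toy binary data (assumed): the first predictor fully determines the target,
# the second carries no information about it.
target = [0, 0, 1, 1]
informative = [(0,), (0,), (1,), (1,)]
uninformative = [(0,), (0,), (0,), (0,)]
score_good = conditional_tsallis(informative, target, q=2.0)
score_bad = conditional_tsallis(uninformative, target, q=2.0)
```

In a feature selection search such as SFFS, a criterion of this kind would be evaluated for each candidate predictor subset and the subset with the lowest score kept; the choice of q is a free parameter here.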
