Global ETD Search

1	Développement d'outils statistiques pour l'analyse de données transcriptomiques par les réseaux de co-expression de gènes / A systemic approach to statistical analysis to transcriptomic data through co-expression network analysis
Brunet, Anne-Claire, 17 June 2016
New biotechnologies now make it possible to collect a wide variety and large volume of biological data (genomic, proteomic, metagenomic...), opening up new avenues of research into biological processes. This thesis focuses on transcriptomic data, which characterize the activity or expression level of several tens of thousands of genes in a given cell. The aim was to propose statistical tools suited to these high-dimensional data (n << p), which are collected on samples that are very small relative to the very large number of variables (here, gene expression levels).
The first part of the thesis presents supervised learning methods, such as Breiman's random forests and penalized regression models, used in the high-dimensional setting to select the genes (expression variables) most relevant to the disease under study. We discuss the limits of these methods for selecting genes that are relevant not only statistically but also biologically, particularly when selecting within groups of highly correlated variables, that is, groups of co-expressed genes. Standard learning methods assume that each gene can act in isolation in the model, which is unrealistic in practice. An observable biological trait results from a set of reactions within a complex system in which genes interact with one another, and genes involved in the same biological function tend to be co-expressed (correlated expression).
The second part therefore turns to gene co-expression networks, in which two genes are linked if they are co-expressed. More precisely, we seek to identify communities of genes in these networks, that is, groups of co-expressed genes, and then to select the communities most relevant to the disease as well as the "key genes" of these communities. This facilitates biological interpretation, because a community of genes can often be associated with a biological function. We propose an original and efficient approach that simultaneously addresses the modeling of the gene co-expression network and the detection of gene communities in that network. We demonstrate the performance of our approach by comparing it with popular existing methods for analysing gene co-expression networks (WGCNA and spectral methods). Finally, by analysing a real data set, the last part of the thesis shows that the proposed approach yields results that are biologically convincing, easier to interpret, and more robust than those obtained with standard supervised learning methods.
Transcriptomic data; Gene networks; Co-expression network; Variable selection; Dimensionality reduction; Penalized regression; Network clustering; Machine learning
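As a rough illustration of the idea that two genes are linked in a co-expression network when their expression profiles are correlated, the following minimal sketch builds such a network by thresholding the absolute Pearson correlation. It is not the method proposed in the thesis: the function names, the toy data, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.8):
    """Link two genes when |correlation| of their profiles reaches the threshold.

    `threshold` is a hypothetical cut-off, not a value taken from the thesis.
    """
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


# Hypothetical toy data: 4 genes measured on 5 samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
    "geneD": [4.9, 1.2, 4.1, 1.8, 3.1],
}
edges = coexpression_edges(expression)
```

In this toy example the two linked pairs form two small groups of co-expressed genes; real analyses work with thousands of genes and need a dedicated community detection step on the resulting graph.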
2	Sauvegarde des données dans les réseaux P2P / Data Backup in P2P Networks
Tout, Rabih, 25 June 2010
Data backup is essential to avoid losing data. Several backup methods and strategies exist, using different types of media. The most effective backup methods often require service subscription fees, reflecting the hardware and administration costs borne by providers. The great success of P2P networks and file-sharing applications has made these networks usable by many applications, especially since users can share their resources with one another. Because traditional backup solutions do not scale, P2P networks have become attractive for backup applications. The instability of P2P networks, caused by the high rate of peer movement, makes communication between peers very difficult. In a backup context, communication between nodes is essential, which requires a well-organized network. Moreover, the persistence of backed-up data in the network remains a major challenge: a backup is pointless if the backed-up data are lost and restoration becomes impossible.
The objective of this thesis is to improve the organization of backups in P2P networks and to guarantee the persistence of backed-up data. We therefore developed a planning approach that lets nodes organize themselves so as to communicate better with one another. To guarantee persistence, we proposed a probabilistic approach that determines, according to variations in the system, the minimum number of replicas needed for at least one copy to remain in the system after a given time. Both approaches have been implemented in a P2P backup application.
Peer-to-Peer; Backup; File-sharing network; Network clustering; Scheduling; Redundancy; Persistence; Erasure coding; Probabilistic guarantee; 004
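To make the replica-count idea concrete, here is a minimal sketch under a strong simplifying assumption that is not stated in the abstract: each replica is lost independently with the same probability `loss_prob` over the time window of interest. Under that assumption, at least one of n replicas survives with probability 1 - loss_prob^n, and the smallest n reaching a target probability can be found by direct search. The function name and parameters are hypothetical, and the thesis's actual model may differ.

```python
def min_replicas(loss_prob, target):
    """Smallest n with 1 - loss_prob**n >= target.

    Assumes (hypothetically) independent replica losses with a common
    probability `loss_prob` over the time window considered.
    """
    if not (0.0 < loss_prob < 1.0):
        raise ValueError("loss_prob must be strictly between 0 and 1")
    if not (0.0 < target < 1.0):
        raise ValueError("target must be strictly between 0 and 1")
    n = 1
    # Add replicas until the survival probability reaches the target.
    while 1.0 - loss_prob ** n < target:
        n += 1
    return n


# Example: each replica has a 50% chance of being lost, and we want
# at least one surviving copy with probability 0.99.
replicas_needed = min_replicas(0.5, 0.99)
```

A direct search is used instead of the closed form with logarithms to avoid rounding surprises when the target is hit exactly.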
3	Uncovering and Managing the Impact of Methodological Choices for the Computational Construction of Socio-Technical Networks from Texts
Diesner, Jana, 01 September 2012
This thesis is motivated by the need for scalable and reliable methods and technologies that support the construction of network data based on information from text data. Ultimately, the resulting data can be used for answering substantive and graph-theoretical questions about socio-technical networks. One main limitation of constructing network data from text data is that validating the resulting network data can be difficult or even infeasible, e.g. in the cases of covert, historical and large-scale networks. This thesis addresses this problem by identifying how the coding choices that must be made when extracting network data from text data affect the structure of networks and the results of network analysis. My findings suggest that conducting reference resolution on text data can alter the identity and weight of 76% of the nodes and 23% of the links, and can cause major changes in the values of commonly used network metrics. Also, performing reference resolution prior to relation extraction leads to the retrieval of completely different sets of key entities than when this pre-processing technique is not applied. Based on the outcome of the presented experiments, I recommend strategies for avoiding or mitigating the identified issues in practical applications. When extracting socio-technical networks from texts, the set of relevant node classes might go beyond the classes that are typically supported by tools for named entity extraction. I address this lack of technology by developing an entity extractor that combines an ontology for socio-technical networks, which originates from the social sciences, is theoretically grounded and has been empirically validated in prior work, with a supervised machine learning technique based on probabilistic graphical models.
Beyond showing that the resulting prediction models achieve state-of-the-art accuracy rates, I also describe the process of integrating these models into an existing and publicly available end-user product. As a result, users can apply these models to new text data in a convenient fashion. While a plethora of methods exists for building network data from information explicitly or implicitly contained in text data, there is a lack of research on how the resulting networks compare with respect to their structure and properties. This also applies to networks that can be extracted by using the aforementioned entity extractor as part of the relation extraction process. I address this knowledge gap by comparing the networks extracted with this process to network data built with three alternative methods: text coding based on thesauri that associate text terms with node classes, the construction of network data from meta-data on texts, such as key words and index terms, and building network data in collaboration with subject matter experts. The outcomes of these comparative analyses suggest that thesauri generated with the entity extractor developed for this thesis need adjustments with respect to particular categories and types of errors. I provide tools and strategies to assist with these refinements. My results also show that once these changes have been made, and in contrast to manually constructed thesauri, the prediction models generalize with acceptable accuracy to other domains (news wire data, scientific writing, emails) and writing styles (formal, casual). The comparisons of networks constructed with different methods show that ground truth data built by subject matter experts are hardly reproduced by any automated method that analyzes text bodies, and even less so by methods that exploit existing meta-data from text corpora. Thus, aiming to reconstruct social networks from text data leads to largely incomplete networks.
Synthesizing the findings from this work, I outline which types of information on socio-technical networks are best captured by which network data construction method, and how best to combine these methods in order to gain a more comprehensive view of a network. When both text data and relational data are available as sources of information on a network, prior work has integrated these data by enhancing social networks with content nodes that represent salient terms from the text data. I present a methodological advancement to this technique and test its performance on the datasets used for the previously mentioned evaluation studies. By using this approach, multiple types of behavioral data, namely interactions between people as well as their language use, can be taken into account. I conclude that extracting content nodes from groups of structurally equivalent agents can be an appropriate strategy for enabling the comparison of the content that people produce, perceive or disseminate. These equivalence classes can represent a variety of social roles and social positions that network members occupy. At the same time, extracting content nodes from groups of structurally coherent agents can be suitable for enabling the enhancement of social networks with content nodes. The results from applying the latter approach to text data include a comparison of the outcome of topic modeling, an efficient and unsupervised information extraction technique, to the outcomes of alternative methods, including entity extraction based on supervised machine learning. My findings suggest that key entities from meta-data knowledge networks might serve as proper labels for unlabeled topics. Also, unsupervised and supervised learning lead to the retrieval of similar entities as highly likely members of highly likely topics, and key nodes from text-based knowledge networks, respectively.
In summary, the contributions made with this thesis help people to collect, manage and analyze rich network data at any scale. This is a precondition for asking substantive and graph-theoretical questions, testing hypotheses, and advancing theories about networks. This thesis uses an interdisciplinary and computationally rigorous approach to work towards this goal, thereby advancing the intersection of network analysis, natural language processing and computing.
socio-technical networks; semantic networks; information networks; entity extraction; relation extraction; reference resolution; co-occurrence based network construction; accuracy assessment; network clustering; grouping in networks; Software Engineering
4	CLUSTERING AND VISUALIZATION OF GENOMIC DATA
Sutharzan, Sreeskandarajan, 26 July 2019
No description available.
Bioinformatics; Biology; Botany; RNA-Seq; Gene Ontology; network clustering; nucleotide sequence clustering; Self Organizing Map; SOM; NGS; Influenza A virus; HA Segment 4; retina regeneration; epithelial to mesenchymal transition; hypoxia; prime number algorithm; machine learning
