Global ETD Search

21	Topological and domain Knowledge-based subgraph mining : application on protein 3D-structures / Fouille de sous-graphes basée sur la topologie et la connaissance du domaine : application sur les structures 3D de protéines Dhifli, Wajdi 11 December 2013 (has links) Cette thèse est à l'intersection de deux domaines de recherche en plein expansion, à savoir la fouille de données et la bioinformatique. Avec l'émergence des bases de graphes au cours des dernières années, de nombreux efforts ont été consacrés à la fouille des sous-graphes fréquents. Mais le nombre de sous-graphes fréquents découverts est exponentiel, cela est dû principalement à la nature combinatoire des graphes. Beaucoup de sous-graphes fréquents ne sont pas pertinents parce qu'ils sont redondants ou tout simplement inutiles pour l'utilisateur. En outre, leur nombre élevé peut nuire ou même rendre parfois irréalisable toute utilisation ultérieure. La redondance dans les sous-graphes fréquents est principalement due à la similarité structurelle et / ou sémantique, puisque la plupart des sous-graphes découverts diffèrent légèrement dans leur structures et peuvent exprimer des significations similaires ou même identiques. Dans cette thèse, nous proposons deux approches de sélection des sous-graphes représentatifs parmi les fréquents afin d'éliminer la redondance. Chacune des approches proposées s'intéresse à un type spécifique de redondance. La première approche s'adresse à la redondance sémantique où la similarité entre les sous-graphes est mesurée en fonction de la similarité entre les étiquettes de leurs noeuds, en utilisant les connaissances de domaine. La deuxième approche s'adresse à la redondance structurelle où les sous-graphes sont représentés par des descripteurs topologiques définis par l'utilisateur, et la similarité entre les sous-graphes est mesurée en fonction de la distance entre leurs descriptions topologiques respectives. Les principales données d'application de cette thèse sont les structures 3D des protéines. Ce choix repose sur des raisons biologiques et informatiques. D'un point de vue biologique, les protéines jouent un rôle crucial dans presque tous les processus biologiques. Ils sont responsables d'une variété de fonctions physiologiques. D'un point de vue informatique, nous nous sommes intéressés à la fouille de données complexes. Les protéines sont un exemple parfait de ces données car elles sont faites de structures complexes composées d'acides aminés interconnectés qui sont eux-mêmes composées d'atomes interconnectés. Des grandes quantités de structures protéiques sont actuellement disponibles dans les bases de données en ligne. Les structures 3D des protéines peuvent être transformées en graphes où les acides aminés représentent les noeuds du graphe et leurs connexions représentent les arêtes. Cela permet d'utiliser des techniques de fouille de graphes pour les étudier. L'importance biologique des protéines et leur complexité ont fait d'elles des données d'application appropriées pour cette thèse. / This thesis is in the intersection of two proliferating research fields, namely data mining and bioinformatics. With the emergence of graph data in the last few years, many efforts have been devoted to mining frequent subgraphs from graph databases. Yet, the number of discovered frequentsubgraphs is usually exponential, mainly because of the combinatorial nature of graphs. Many frequent subgraphs are irrelevant because they are redundant or just useless for the user. Besides, their high number may hinder and even makes further explorations unfeasible. Redundancy in frequent subgraphs is mainly caused by structural and/or semantic similarities, since most discovered subgraphs differ slightly in structure and may infer similar or even identical meanings. In this thesis, we propose two approaches for selecting representative subgraphs among frequent ones in order to remove redundancy. Each of the proposed approaches addresses a specific type of redundancy. The first approach focuses on semantic redundancy where similarity between subgraphs is measured based on the similarity between their nodes' labels, using prior domain knowledge. The second approach focuses on structural redundancy where subgraphs are represented by a set of user-defined topological descriptors, and similarity between subgraphs is measured based on the distance between their corresponding topological descriptions. The main application data of this thesis are protein 3D-structures. This choice is based on biological and computational reasons. From a biological perspective, proteins play crucial roles in almost every biological process. They are responsible of a variety of physiological functions. From a computational perspective, we are interested in mining complex data. Proteins are a perfect example of such data as they are made of complex structures composed of interconnected amino acids which themselves are composed of interconnected atoms. Large amounts of protein structures are currently available in online databases, in computer analyzable formats. Protein 3D-structures can be transformed into graphs where amino acids are the graph nodes and their connections are the graph edges. This enables using graph mining techniques to study them. The biological importance of proteins, their complexity, and their availability in computer analyzable formats made them a perfect application data for this thesis. Sélection de motifs Fouille de motifs Sous-graphe fréquent Sous-graphe représentant non-substitué Graphe représentant topologique Structure de protéine Feature selection Pattern mining Frequent subgraph Representative unsubstituted subgraph Topological representative subgraph Protein structure
22	Topological and domain Knowledge-based subgraph mining : application on protein 3D-structures Dhifli, Wajdi 11 December 2013 (has links) (PDF) This thesis is in the intersection of two proliferating research fields, namely data mining and bioinformatics. With the emergence of graph data in the last few years, many efforts have been devoted to mining frequent subgraphs from graph databases. Yet, the number of discovered frequentsubgraphs is usually exponential, mainly because of the combinatorial nature of graphs. Many frequent subgraphs are irrelevant because they are redundant or just useless for the user. Besides, their high number may hinder and even makes further explorations unfeasible. Redundancy in frequent subgraphs is mainly caused by structural and/or semantic similarities, since most discovered subgraphs differ slightly in structure and may infer similar or even identical meanings. In this thesis, we propose two approaches for selecting representative subgraphs among frequent ones in order to remove redundancy. Each of the proposed approaches addresses a specific type of redundancy. The first approach focuses on semantic redundancy where similarity between subgraphs is measured based on the similarity between their nodes' labels, using prior domain knowledge. The second approach focuses on structural redundancy where subgraphs are represented by a set of user-defined topological descriptors, and similarity between subgraphs is measured based on the distance between their corresponding topological descriptions. The main application data of this thesis are protein 3D-structures. This choice is based on biological and computational reasons. From a biological perspective, proteins play crucial roles in almost every biological process. They are responsible of a variety of physiological functions. From a computational perspective, we are interested in mining complex data. Proteins are a perfect example of such data as they are made of complex structures composed of interconnected amino acids which themselves are composed of interconnected atoms. Large amounts of protein structures are currently available in online databases, in computer analyzable formats. Protein 3D-structures can be transformed into graphs where amino acids are the graph nodes and their connections are the graph edges. This enables using graph mining techniques to study them. The biological importance of proteins, their complexity, and their availability in computer analyzable formats made them a perfect application data for this thesis. [INFO:INFO_OH] Computer Science/Other [INFO:INFO_OH] Informatique/Autre [SPI:OTHER] Engineering Sciences/Other Feature selection Pattern mining Frequent subgraph Representative unsubstituted subgraph Topological representative subgraph Protein structure
23	Spanning Halin Subgraphs Involving Forbidden Subgraphs Yang, Ping 09 May 2016 (has links) In structural graph theory, connectivity is an important notation with a lot of applications. Tutte, in 1961, showed that a simple graph is 3-connected if and only if it can be generated from a wheel graph by repeatedly adding edges between nonadjacent vertices and applying vertex splitting. In 1971, Halin constructed a class of edge-minimal 3-connected planar graphs, which are a generalization of wheel graphs and later were named “Halin graphs” by Lovasz and Plummer. A Halin graph is obtained from a plane embedding of a tree with no stems having degree 2 by adding a cycle through its leaves in the natural order determined according to the embedding. Since Halin graphs were introduced, many useful properties, such as Hamiltonian, hamiltonian-connected and pancyclic, have been discovered. Hence, it will reveal many properties of a graph if we know the graph contains a spanning Halin subgraph. But unfortunately, until now, there is no positive result showing under which conditions a graph contains a spanning Halin subgraph. In this thesis, we characterize all forbidden pairs implying graphs containing spanning Halin subgraphs. Consequently, we provide a complete proof conjecture of Chen et al. Our proofs are based on Chudnovsky and Seymour’s decomposition theorem of claw-free graphs, which were published recently in a series of papers. Forbidden pairs Spanning subgraph Halin graph Strong spanning Halin subgraph 3-connected graph Claw-free Z_3-free B_{1 2}-free.
24	Cost-based optimization of graph queries in relational database management systems Trissl, Silke 14 June 2012 (has links) Graphen sind in vielen Bereichen des Lebens zu finden, wobei wir speziell an Graphen in der Biologie interessiert sind. Knoten in solchen Graphen sind chemische Komponenten, Enzyme, Reaktionen oder Interaktionen, die durch Kanten miteinander verbunden sind. Eine effiziente Ausführung von Graphanfragen ist eine Herausforderung. In dieser Arbeit präsentieren wir GRIcano, ein System, das die effiziente Ausführung von Graphanfragen erlaubt. Wir nehmen an, dass Graphen in relationalen Datenbankmanagementsystemen (RDBMS) gespeichert sind. Als Graphanfragesprache schlagen wir eine erweiterte Version der Pathway Query Language (PQL) vor. Der Hauptbestandteil von GRIcano ist ein kostenbasierter Anfrageoptimierer. Diese Arbeit enthält Beiträge zu allen drei benötigten Komponenten des Optimierers, der relationalen Algebra, Implementierungen und Kostenmodellen. Die Operatoren der relationalen Algebra sind nicht ausreichend, um Graphanfragen auszudrücken. Daher stellen wir zuerst neue Operatoren vor. Wir schlagen den Erreichbarkeits-, Distanz-, Pfadlängen- und Pfadoperator vor. Zusätzlich geben wir Regeln für die Umformung von Ausdrücken an. Des Weiteren präsentieren wir Implementierungen für jeden vorgeschlagenen Operator. Der Hauptbeitrag ist GRIPP, eine Indexstruktur, die die effiziente Ausführung von Erreichbarkeitsanfragen auf sehr großen Graphen erlaubt. Wir zeigen, wie GRIPP und die rekursive Anfragestrategie genutzt werden können, um Implementierungen für alle Operatoren bereitzustellen. Die dritte Komponente von GRIcano ist das Kostenmodell, das Kardinalitätsabschätzungen der Operatoren und Kostenfunktionen für die Implementierungen benötigt. Basierend auf umfangreichen Experimenten schlagen wir in dieser Arbeit Funktionen dafür vor. Der neue Ansatz unserer Kostenmodelle ist, dass die Funktionen nur Kennzahlen der Graphen verwenden. Abschließend zeigen wir die Wirkungsweise von GRIcano durch Beispielanfragen auf echten biologischen Graphen. / Graphs occur in many areas of life. We are interested in graphs in biology, where nodes are chemical compounds, enzymes, reactions, or interactions that are connected by edges. Efficiently querying these graphs is a challenging task. In this thesis we present GRIcano, a system that efficiently executes graph queries. For GRIcano we assume that graphs are stored and queried using relational database management systems (RDBMS). We propose an extended version of the Pathway Query Language PQL to express graph queries. The core of GRIcano is a cost-based query optimizer. This thesis makes contributions to all three required components of the optimizer, the relational algebra, implementations, and cost model. Relational algebra operators alone are not sufficient to express graph queries. Thus, we first present new operators to rewrite PQL queries to algebra expressions. We propose the reachability, distance, path length, and path operator. In addition, we provide rewrite rules for the newly proposed operators in combination with standard relational algebra operators. Secondly, we present implementations for each proposed operator. The main contribution is GRIPP, an index structure that allows us to answer reachability queries on very large graphs. GRIPP has advantages over other existing index structures, which we review in this work. In addition, we show how to employ GRIPP and the recursive query strategy as implementation for all four proposed operators. The third component of GRIcano is the cost model, which requires cardinality estimates for operators and cost functions for implementations. Based on extensive experimental evaluation of our proposed algorithms we present functions to estimate the cardinality of operators and the cost of executing a query. The novelty of our approach is that these functions only use key figures of the graph. We finally present the effectiveness of GRIcano using exemplary graph queries on real biological networks. Graphanfragen Anfrageoptimierung Erreichbarkeitsanfragen Pfadanfragen Subgraph-homeomorphismus Graph Query Query Optimization Reachability Query Path Query Subgraph homeomorphism 004 Informatik 28 Informatik, Datenverarbeitung ST 270 ddc:004
25	Towards New Bounds for the 2-Edge Connected Spanning Subgraph Problem Legault, Philippe January 2017 (has links) Given a complete graph K_n = (V,E) with non-negative edge costs c ∈ R^E, the problem multi-2EC_cost is that of finding a 2-edge connected spanning multi-subgraph of K_n with minimum cost. It is believed that there are no efficient ways to solve the problem exactly, as it is NP-hard. Methods such as approximation algorithms, which rely on lower bounds like the linear programming relaxation multi-2EC^LP of multi-2EC , thus become vital cost cost to obtain solutions guaranteed to be close to the optimal in a fast manner. In this thesis, we focus on the integrality gap αmulti-2EC of multi-2EC^LP , which is a measure of the quality of multi-2EC^LP as a lower bound. Although we currently only know cost that 6/5 ≤ αmulti-2EC_cost ≤ 3 , the integrality gap for multi-2EC_cost has been conjectured to be 6/5. We explore the idea of using the structure of solutions for αmulti-2EC_cost and the concept of convex combination to obtain improved bounds for αmulti-2EC_cost. We focus our efforts on a family J of half-integer solutions that appear to give the largest integrality gap for multi-2EC_cost. We successfully show that the conjecture αmulti-2EC_cost = 6/5 is true for any cost functions optimized by some x∗ ∈ J. We also study the related problem 2EC_size, which consists of finding the minimum size 2-edge connected spanning subgraph of a 2-edge connected graph. The problem is NP-hard even at its simplest, when restricted to cubic 3-edge connected graphs. We study that case in the hope of finding a more general method, and we show that every 3-edge connected cubic graph G = (V ′, E′), with n = \|V ′\| allows a 2EC_size solution for G of size at most 7n/6 This improves upon Boyd, Iwata and Takazawa’s guarantee of 6n/5 and extend Takazawa’s 7n/6 guarantee for bipartite cubic 3-edge connected graphs to all cubic 3-edge connected graphs. Approximation algorithm Integrality gap Cubic graphs Half-integer
26	Porovnání přístupů k dotazování chemických sloučenin / Comparison of Approaches for Querying of Chemical Compounds Šípek, Vojtěch January 2019 (has links) The purpose of this thesis is to perform an analysis of approaches to querying chemical databases and to validate or invalidate its results. Currently, there exists no work which would compare the performance and memory usage of the best performing approaches on the same data set. In this thesis, we address this lack of information and we create an un-biased benchmark of the most popular index building methods for subgraph querying of chemical databases. Also, we compare the results of such benchmark with the performance results of an SQL and a graph database. 1
27	Subgraph Isomorphism Search In Massive Graph Data / Isomorphisme de Sous-Graphes dans les graphes de données massifs Nabti, Chems Eddine 15 December 2017 (has links) L'interrogation de graphes de données est un problème fondamental qui connait un grand intérêt, en particulier pour les données structurées massives où les graphes constituent une alternative prometteuse aux bases de données relationnelles pour la modélisation des grandes masses de données. Cependant, l'interrogation des graphes de données est différente et plus complexe que l'interrogation des données relationnelles à base de tables. La tâche principale impliquée dans l'interrogation de graphes de données est la recherche d'isomorphisme de sous-graphes qui est un problème NP-complet.La recherche d'isomorphisme de sous-graphes est un problème très important impliqué dans divers domaines comme la reconnaissance de formes, l'analyse des réseaux sociaux, la biologie, etc. Il consiste à énumérer les sous-graphes d'un graphe de données qui correspondent à un graphe requête. Les solutions les plus connues de ce problème sont basées sur le retour arrière (backtracking). Elles explorent un grand espace de recherche, ce qui entraîne un coût de traitement élevé, notamment dans le cas de données massives.Pour réduire le temps et la complexité en espace mémoire dans la recherche d'isomorphisme de sous-graphes, nous proposons d'utiliser des graphes compressés. Dans notre approche, la recherche d'isomorphisme de sous-graphes est réalisée sur une représentation compressée des graphes sans les décompresser. La compression des graphes s'effectue en regroupant les sommets en super-sommets. Ce concept est connu dans la théorie des graphes par la décomposition modulaire. Il sert à générer une représentation en arbre d'un graphe qui met en évidence des groupes de sommets qui ont les mêmes voisins. Avec cette compression, nous obtenons une réduction substantielle de l'espace de recherche et par conséquent, une économie significative dans le temps de traitement.Nous proposons également une nouvelle représentation des sommets du graphe, qui simplifie le filtrage de l'espace de recherche. Ce nouveau mécanisme appelé compact neighborhood Index (CNI) encode l'information de voisinage autour d'un sommet en un seul entier. Cet encodage du voisinage réduit la complexité du temps de filtrage de cubique à quadratique. Ce qui est considérable pour les données massifs.Nous proposons également un algorithme de filtrage itératif qui repose sur les caractéristiques des CNIs pour assurer un élagage global de l'espace de recherche.Nous avons évalué nos approches sur plusieurs datasets et nous les avons comparées avec les algorithmes de l’état de l’art / Querying graph data is a fundamental problem that witnesses an increasing interest especially for massive structured data where graphs come as a promising alternative to relational databases for big data modeling. However, querying graph data is different and more complex than querying relational table-based data. The main task involved in querying graph data is subgraph isomorphism search which is an NP-complete problem. Subgraph isomorphism search, is an important problem which is involved in various domains such as pattern recognition, social network analysis, biology, etc. It consists to enumerate the subgraphs of a data graph that match a query graph. The most known solutions of this problem are backtracking-based. They explore a large search space which results in a high computational cost when we deal with massive graph data. To reduce time and memory space complexity of subgraph isomorphism search. We propose to use compressed graphs. In our approach, subgraph isomorphism search is achieved on compressed representations of graphs without decompressing them. Graph compression is performed by grouping vertices into super vertices. This concept is known, in graph theory, as modular decomposition. It is used to generate a tree representation of a graph that highlights groups of vertices that have the same neighbors. With this compression we obtain a substantial reduction of the search space and consequently a significant saving in the processing time. We also propose a novel encoding of vertices that simplifies the filtering of the search space. This new mechanism is called compact neighborhood Index (CNI). A CNI distills all the information around a vertex in a single integer. This simple neighborhood encoding reduces the time complexity of vertex filtering from cubic to quadratic which is considerable for big graphs. We propose also an iterative local global filtering algorithm that relies on the characteristics of CNIs to ensure a global pruning of the search space.We evaluated our approaches on several real-word datasets and compared them with the state of the art algorithms Isomorphisme de sous-graphe Graphes de données massifs Subgraph Isomorphism Search Massive Graph Data 004
28	Exploratory search through large video corpora Castañón, Gregory David 21 June 2016 (has links) Activity retrieval is a growing field in electrical engineering that specializes in the search and retrieval of relevant activities and events in video corpora. With the affordability and popularity of cameras for government, personal and retail use, the quantity of available video data is rapidly outscaling our ability to reason over it. Towards the end of empowering users to navigate and interact with the contents of these video corpora, we propose a framework for exploratory search that emphasizes activity structure and search space reduction over complex feature representations. Exploratory search is a user driven process wherein a person provides a system with a query describing the activity, event, or object he is interested in finding. Typically, this description takes the implicit form of one or more exemplar videos, but it can also involve an explicit description. The system returns candidate matches, followed by query refinement and iteration. System performance is judged by the run-time of the system and the precision/recall curve of of the query matches returned. Scaling is one of the primary challenges in video search. From vast web-video archives like youtube (1 billion videos and counting) to the 30 million active surveillance cameras shooting an estimated 4 billion hours of footage every week in the United States, trying to find a set of matches can be like looking for a needle in a haystack. Our goal is to create an efficient archival representation of video corpora that can be calculated in real-time as video streams in, and then enables a user to quickly get a set of results that match. First, we design a system for rapidly identifying simple queries in large-scale video corpora. Instead of focusing on feature design, our system focuses on the spatiotemporal relationships between those features as a means of disambiguating an activity of interest from background. We define a semantic feature vocabulary of concepts that are both readily extracted from video and easily understood by an operator. As data streams in, features are hashed to an inverted index and retrieved in constant time after the system is presented with a user's query. We take a zero-shot approach to exploratory search: the user manually assembles vocabulary elements like color, speed, size and type into a graph. Given that information, we perform an initial downsampling of the archived data, and design a novel dynamic programming approach based on genome-sequencing to search for similar patterns. Experimental results indicate that this approach outperforms other methods for detecting activities in surveillance video datasets. Second, we address the problem of representing complex activities that take place over long spans of space and time. Subgraph and graph matching methods have seen limited use in exploratory search because both problems are provably NP-hard. In this work, we render these problems computationally tractable by identifying the maximally discriminative spanning tree (MDST), and using dynamic programming to optimally reduce the archive data based on a custom algorithm for tree-matching in attributed relational graphs. We demonstrate the efficacy of this approach on popular surveillance video datasets in several modalities. Finally, we design an approach for successive search space reduction in subgraph matching problems. Given a query graph and archival data, our algorithm iteratively selects spanning trees from the query graph that optimize the expected search space reduction at each step until the archive converges. We use this approach to efficiently reason over video surveillance datasets, simulated data, as well as large graphs of protein data. Electrical engineering Computer vision Dynamic programming Machine learning Subgraph matching Video search
29	Efficient Virtual Network Embedding onto A Hierarchical-Based Substrate Network Framework Ghazar, Tay 12 March 2013 (has links) The current Internet architecture presents a barrier to accommodate the vigorous arising demand for deploying new network services and applications. The next-generation architecture views the network virtualization as the gateway to overcome this limitation. Network virtualization promises to run efficiently and securely multiple dedicated virtual networks (VNs) over a shared physical infrastructure. Each VN is tailored to host a unique application based on the user’s preferences. This thesis addresses the problem of the efficient embedding of multiple VNs onto a shared substrate network (SN). The contribution of this thesis are twofold: First, a novel hierarchical SN management framework is proposed that efficiently selects the optimum VN mapping scheme for the requested VN from more than one proposed VN mapping candidates obtained in parallel. In order to accommodate the arbitrary architecture of the VNs, the proposed scheme divides the VN request into smaller subgraphs, and individually maps them on the SN using a variation of the exact subgraph matching techniques. Second, the physical resources pricing policy is introduced that is based on time-ofuse, that reflects the effect of resource congestion introduced by VN users. The preferences of the VN users are first represented through corresponding demand-utility functions that quantify the sensitivity of the applications hosted by the VNs to resource consumption and time-of-use. A novel model of time-varying VNs is presented, where users are allowed to up- or down-scale the requested resources to continuously maximize their utility while minimizing the VNs embedding cost. In contrast to existing solutions, the proposed work does not impose any limitations on the size or topology of the VN requests. Instead, the search is customized according to the VN size and the associated utility. Extensive simulations are then conducted to demonstrate the improvement achieved through the proposed work in terms of network utilization, the ratio of accepted VN requests and the SP profits. network virtualization virtual networks congestion pricing time-of-use pricing subgraph matching network management hierarchical structure
30	A Distributed Graph Mining Framework Based On Mapreduce Alkan, Sertan 01 January 2010 (has links) (PDF) The frequent patterns hidden in a graph can reveal crucial information about the network the graph represents. Existing techniques to mine the frequent subgraphs in a graph database generally rely on the premise that the data can be fit into main memory of the device that the computation takes place. Even though there are some algorithms that are designed using highly optimized methods to some extent, many lack the solution to the problem of scalability. In this thesis work, our aim is to find and enumerate the subgraphs that are at least as frequent as the designated threshold in a given graph. Here, we propose a new distributed algorithm for frequent subgraph mining problem that can scale horizontally as the computing cluster size increases. The method described here, uses a partitioning method and Map/Reduce programming model to distribute the computation of frequent subgraphs. In the core of this algorithm, we make use of an existing graph partitioning method to split the given data in the distributed file system and to merge and join the computed subgraphs without losing information. The frequent subgraph computation in each split is done using another known method which can enumerate the frequent patterns. Although current algorithms can efficiently find frequent patterns, they are not parallel or distributed algorithms in that even when they partition the data, they are designed to work on a single machine. Furthermore, these algorithms are computationally expensive but not fault tolerant and are not designed to work on a distributed file system. Using the Map/Reduce paradigm, we distribute the computation of frequent patterns to every machine in a cluster. Our algorithm, first bi-partitions the data via successive Map/Reduce jobs, then invokes another Map/Reduce job to compute the subgraphs in partitions using CloseGraph, recovers the whole set by invoking a series of Map/Reduce jobs to merge-join the previously found patterns. The implementation uses an open source Map/Reduce environment, Hadoop. In our experiments, our method can scale up to large graphs, as the graph data size gets bigger, this method performs better than the existing algorithms. QA Computer Software 76.75-76.765

Search results