Global ETD Search

Query and mining in large graph databases.

Graphs can model complex structural relationships among data objects and are widely used in many application domains. As these domains develop, graph databases have become large and continue to grow rapidly, which poses new challenges for graph query and mining. This thesis investigates three problems: (1) how to find a correspondence between the nodes of two large graphs so that substructures in one graph are mapped to similar substructures in the other; (2) how to retrieve the graphs most similar to a query graph from a database consisting of a large number of graphs; and (3) how to select subgraph features to build an automated classification model for a database whose graphs belong to different classes.

For the first problem, we propose a novel two-stage approach that efficiently matches two large graphs with thousands of nodes while achieving high matching quality. In the first stage, an anchor-selection/expansion scheme heuristically constructs a good initial matching: anchor nodes are selected first, and the matching is then expanded outward from them. In the second stage, a new algorithm refines the initial matching; we analyze the optimality of this refinement and prove that the refined matching improves on the initial one.

For the second problem, we introduce a new graph distance measure based on the maximum common subgraph (MCS) of two graphs, which captures both their common and their differing structures. Since computing the MCS of two graphs is NP-complete, we propose a fast algorithm for top-k graph similarity queries that significantly reduces the number of MCS computations. The algorithm prunes unqualified graphs using three lower bounds on the distance: the first two are derived from the structures of the two graphs, and the third follows from the triangle property of the distance measure. Three index schemes, offering different tradeoffs between pruning power and construction cost, support query processing.

For the third problem, we identify two main shortcomings of the widely used discriminative score for feature selection and introduce a new diversified discriminative score that measures the diversity of features as well as their discriminative power. We analyze the properties of this score from several perspectives and show that it makes positive and negative graphs more separable. New feature selection algorithms based on this score achieve high classification accuracy.

Detailed summary in vernacular field only. / Zhu, Yuanyuan. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2013. / Includes bibliographical references (leaves 137-146). / Abstract also in Chinese.
Contents:

Abstract
Abstract in Chinese
Acknowledgments
Contents
List of Tables
List of Figures
Notations
Chapter 1. Introduction
    1.1. Motivation
        1.1.1. Large Graph Matching
        1.1.2. Top-k Graph Similarity Query
        1.1.3. Diversified Discriminative Feature Selection
    1.2. Contribution
Chapter 2. Preliminaries
Chapter 3. Related Work
    3.1. Graph Matching
        3.1.1. Exact Graph Matching
        3.1.2. Approximate Graph Matching
    3.2. Graph Similarity Query
    3.3. Graph Classification
Chapter 4. Large Graph Matching
    4.1. Problem Statement
    4.2. An Overview: Construction and Refinement
    4.3. Matching Construction
        4.3.1. Global and Local Node Similarity
        4.3.2. Anchor Selection and Expansion
        4.3.3. Discussion on τ for Anchor Selection
    4.4. Matching Refinement
        4.4.1. Vertex Cover Based Refinement
        4.4.2. Refinement and Its Optimality
        4.4.3. Randomly Refinement Excluding C - F₁
        4.4.4. Randomly Refinement Including C - F₁
    4.5. Labeled Graph Handling
    4.6. Experiments
        4.6.1. Comparison with the Approximate Algorithms
        4.6.2. Comparison with the Exact Algorithm
        4.6.3. Parameter and Scalability Testing
        4.6.4. Sensitivity of Randomness (PN)
        4.6.5. Effectiveness of Label Distribution
    4.7. Summary
Chapter 5. Top-k Graph Similarity Query
    5.1. Problem Statement
    5.2. The Framework
    5.3. Pruning without Indexing
        5.3.1. Edge Frequency Based Lower Bound
        5.3.2. Adjacency List Based Lower Bound
        5.3.3. Query Processing
    5.4. Pruning with Indexing
        5.4.1. The Triangle Property of Graph Distance
        5.4.2. Query Processing
        5.4.3. Indexing
        5.4.4. Discussion on the Generality of Our Framework
    5.5. Experiments
        5.5.1. Similarity Measures Evaluation
        5.5.2. Query Performance Evaluation
        5.5.3. Indexing Cost Evaluation
    5.6. Summary
Chapter 6. Diversified Discriminative Feature Selection
    6.1. Problem Statement
    6.2. Discriminative Score
        6.2.1. The Single Feature Discriminative Score
        6.2.2. A New Diversified Discriminative Score
    6.3. Property Statistics of Discriminative Score
    6.4. The Algorithms
    6.5. Ensemble D&D
    6.6. Experiments
        6.6.1. D&D Performance Analysis
        6.6.2. Comparison with Existing Algorithms
        6.6.3. Performance on Patterns Mined by GAIA
    6.7. Summary
Chapter 7. Conclusion and Future Work
    7.1. Conclusion
    7.2. Future work
Bibliography
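
Illustrative note: the abstract describes answering top-k similarity queries by pruning candidates with cheap lower bounds, including one derived from the triangle property, so that the expensive distance is computed as rarely as possible. The Python sketch below shows that general filter-and-verify pattern only. It is not the thesis's algorithm: the toy graph model (edge sets over uniquely labeled nodes), the stand-in `distance` (size of the edge symmetric difference), the size-based bound, the single `pivot` graph, and all function names are assumptions made for illustration.

```python
import heapq
from typing import FrozenSet, Iterable, List, Tuple

Edge = Tuple[str, str]
Graph = FrozenSet[Edge]  # toy model: undirected edges over uniquely labeled nodes


def make_graph(edges: Iterable[Edge]) -> Graph:
    """Store each undirected edge in sorted order so (a, b) equals (b, a)."""
    return frozenset((min(a, b), max(a, b)) for a, b in edges)


def distance(g1: Graph, g2: Graph) -> int:
    """Stand-in for an expensive MCS-based distance (assumption, not the thesis's measure).

    With unique node labels the common subgraph is the shared edge set, so we
    count the edges that are not shared. This is a metric on edge sets.
    """
    return len(g1 ^ g2)


def size_lower_bound(g1: Graph, g2: Graph) -> int:
    """Cheap structural bound: the symmetric difference is at least the size gap."""
    return abs(len(g1) - len(g2))


def top_k(query: Graph, database: List[Graph], k: int, pivot: Graph):
    """Return ([(distance, index), ...] for the k nearest graphs, number of
    query-to-candidate distance computations)."""
    if k <= 0:
        return [], 0
    # Hypothetical one-pivot index; in practice this would be built offline.
    pivot_dist = [distance(pivot, g) for g in database]
    d_qp = distance(query, pivot)
    # Triangle inequality: |d(q, p) - d(p, g)| <= d(q, g).
    bounds = sorted(
        (max(size_lower_bound(query, g), abs(d_qp - pivot_dist[i])), i)
        for i, g in enumerate(database)
    )
    best: List[Tuple[int, int]] = []  # max-heap on distance via negation
    computed = 0
    for lower_bound, i in bounds:
        # Every remaining candidate has a bound at least this large, so once
        # the bound reaches the current k-th best distance we can stop.
        if len(best) == k and lower_bound >= -best[0][0]:
            break
        d = distance(query, database[i])
        computed += 1
        if len(best) < k:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i))
    return sorted((-neg_d, i) for neg_d, i in best), computed
```

Calling `top_k(query, database, k, pivot)` returns the k closest graphs as `(distance, index)` pairs together with the number of exact distance computations performed, which can be compared against `len(database)` to see how much work the bounds saved. Ties at the k-th distance are broken arbitrarily.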

Databases

Graph theory--Data processing

Data mining

Querying (Computer science)

Data structures (Computer science)

Identifier	oai:union.ndltd.org:cuhk.edu.hk/oai:cuhk-dr:cuhk_328752
Date	January 2013
Contributors	Zhu, Yuanyuan, Chinese University of Hong Kong Graduate School. Division of Systems Engineering and Engineering Management.
Source Sets	The Chinese University of Hong Kong
Language	English, Chinese
Detected Language	English
Type	Text, bibliography
Format	electronic resource, remote, 1 online resource (xii, 146 leaves) : ill.
Rights	Use of this resource is governed by the terms and conditions of the Creative Commons “Attribution-NonCommercial-NoDerivatives 4.0 International” License (http://creativecommons.org/licenses/by-nc-nd/4.0/)
