1

Enforcing Temporal and Ontological Dependencies Over Graphs

Alipourlangouri, Morteza January 2022 (has links)
Graphs provide powerful abstractions and are widely used across different areas. There is an increasing demand for the graph data model to represent data in applications such as network management, web page analysis, knowledge graphs, and social networks. These graphs are usually dynamic and capture the time-evolving relationships between entities. Enforcing and maintaining data quality in graphs is critical for decision making, operational efficiency, and accurate data analysis: recent studies have shown that data scientists spend 60-80% of their time cleaning and organizing data [2]. This effort motivates the need for effective data cleaning tools that reduce the user burden. Data quality management is studied along a set of dimensions, including data consistency, data deduplication, information completeness, data currency, and data accuracy. Achieving all of these characteristics is often not possible in practice due to personnel costs and for performance reasons. In this thesis, we tackle three problems in two data quality dimensions: data consistency and data deduplication.

To address data consistency over temporal graphs, we present a new class of data dependencies called Temporal Graph Functional Dependencies (TGFDs). TGFDs generalize functional dependencies to temporal graphs, modeled as a sequence of graph snapshots induced by time intervals, and enforce both topological constraints and attribute value dependencies that must be satisfied by these snapshots. We establish complexity results for the satisfiability and implication problems of TGFDs, propose a sound and complete axiomatization system for TGFDs, and present efficient parallel algorithms that detect inconsistencies in temporal graphs as violations of TGFDs.

To address the data deduplication problem, we first study key discovery for graphs. Keys for graphs use topology and value constraints to uniquely identify entities in a graph database, and they are the main tool for data deduplication in graphs. We present two properties that define a key, minimality and support, and an algorithm that mines keys over graphs via frequent subgraph expansion. However, existing key constraints identify entities by enforcing label equality on node types; such constraints can be too restrictive to characterize structures and node labels that are syntactically different but semantically equivalent. Lastly, we propose a new class of key constraints, Ontological Graph Keys (OGKs), which extend conventional graph keys with ontological subgraph matching between entity labels and an external ontology. We study the entity matching problem with OGKs and develop efficient algorithms that perform entity matching based on a Chase procedure. The proposed dependencies and algorithms improve consistency detection in temporal graphs, automate the discovery of keys in graphs, and enrich the semantic expressiveness of graph keys. / Dissertation / Doctor of Science (PhD)
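For intuition only, the sketch below checks a simplified TGFD-style rule over a sequence of graph snapshots: within a time window, nodes of a given label that agree on a determinant attribute must also agree on the dependent attribute. The snapshot encoding, attribute names, and the `check_tgfd_violations` helper are illustrative assumptions, not the thesis's actual definitions or algorithms.

```python
# Illustrative sketch only: a simplified TGFD-style check over graph snapshots.
# The encoding and semantics are assumptions for exposition.
from typing import Dict, List, Tuple

Node = Dict[str, str]                         # attribute name -> value
Snapshot = Dict[str, Node]                    # node id -> attributes
TemporalGraph = List[Tuple[int, Snapshot]]    # (timestamp, snapshot), sorted by time

def check_tgfd_violations(graph: TemporalGraph,
                          label: str, lhs: str, rhs: str,
                          window: int) -> List[Tuple[int, int, str]]:
    """Report timestamp pairs within `window` where two nodes of the given
    label agree on `lhs` but disagree on `rhs` (a violation)."""
    violations = []
    for i, (t1, s1) in enumerate(graph):
        for t2, s2 in graph[i:]:
            if t2 - t1 > window:          # snapshots are assumed time-ordered
                break
            for v1 in s1.values():
                for v2 in s2.values():
                    if (v1.get("label") == label == v2.get("label")
                            and v1.get(lhs) == v2.get(lhs)
                            and v1.get(rhs) != v2.get(rhs)):
                        violations.append((t1, t2, v1.get(lhs)))
    return violations

# Example: a player with the same name must keep the same team within 7 days.
snapshots = [
    (1, {"n1": {"label": "player", "name": "Ada", "team": "Red"}}),
    (5, {"n2": {"label": "player", "name": "Ada", "team": "Blue"}}),
]
print(check_tgfd_violations(snapshots, "player", "name", "team", window=7))
# -> [(1, 5, 'Ada')]
```

The thesis's parallel detection algorithms operate over graph patterns and time intervals rather than the exhaustive pairwise scan above; the sketch only conveys what a violation looks like.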
2

Towards effective analysis of big graphs : from scalability to quality

Tian, Chao January 2017 (has links)
This thesis investigates two central issues underlying graph analysis, namely scalability and quality. We first study incremental problems for graph queries, which compute the changes to the old query answer in response to updates to the input graph. An incremental problem is called bounded if its cost is determined only by the sizes of the query and the changes. Desirable as boundedness is, our first results are negative: for common graph queries such as graph traversal, connectivity, keyword search and pattern matching, the incremental problems are unbounded. In light of these negative results, we propose two new characterizations of the effectiveness of incremental computation, and show that the incremental computations above can still be conducted effectively, either by reducing the computations on big graphs to small data, or by incrementalizing batch algorithms so as to minimize unnecessary recomputation.

We next study problems related to improving the quality of graphs. To uniquely identify entities represented by vertices in a graph, we propose a class of keys that are recursively defined in terms of graph patterns and interpreted with subgraph isomorphism. As an application, we study the entity matching problem, which is to find all pairs of entities in a graph that are identified by a given set of keys. Although the problem is proved intractable and cannot be parallelized in logarithmic rounds, we provide two parallel scalable algorithms for it. In addition, to catch numeric inconsistencies in real-life graphs, we extend graph functional dependencies with linear arithmetic expressions and comparison predicates, referred to as NGDs. NGDs strike a balance between expressivity and complexity: if we allow non-linear arithmetic expressions, even of degree at most 2, the satisfiability and implication problems become undecidable. A localizable incremental algorithm is developed to detect errors using NGDs, whose cost is determined by small neighborhoods of the nodes in the updates rather than by the entire graph. Finally, a rule-based method to clean graphs is proposed. We extend graph entity dependencies (GEDs) as data quality rules. Given a graph, a set of GEDs and a block of ground truth, we fix violations of the GEDs in the graph by combining data repairing and object identification. The method finds certain fixes to the errors detected by GEDs, i.e., as long as the GEDs and the ground truth are correct, the fixes are guaranteed correct as their logical consequences. Several fundamental results underlying the method are established, and an algorithm is developed to implement it. We also parallelize the method and guarantee that its running time decreases as more processors are used.
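As a rough illustration of the kind of numeric rule NGDs express, the sketch below checks a toy constraint combining a linear arithmetic expression with a comparison predicate, and restricts the check to the updated nodes and their neighbors in the spirit of localized error detection. The graph encoding, attribute names, and rule are assumptions made for illustration; they are not the NGD syntax or the incremental algorithm developed in the thesis.

```python
# Illustrative sketch only: a toy NGD-flavored numeric consistency check,
# scoped to updated nodes and their neighborhoods. All names are assumptions.
from typing import Dict, List, Set

Graph = Dict[str, Dict[str, float]]   # node id -> numeric attributes
Edges = Dict[str, Set[str]]           # node id -> neighbor ids

def violates_price_rule(attrs: Dict[str, float]) -> bool:
    """Toy rule: total_price = base_price + shipping (linear arithmetic),
    and shipping >= 0 (comparison predicate)."""
    linear_ok = abs(attrs["total_price"] - (attrs["base_price"] + attrs["shipping"])) <= 1e-6
    return (not linear_ok) or attrs["shipping"] < 0.0

def check_updates(graph: Graph, edges: Edges, updated: Set[str]) -> List[str]:
    """Check only updated nodes and their neighbors, mimicking the idea that
    error detection can be localized instead of scanning the entire graph."""
    scope = set(updated)
    for n in updated:
        scope |= edges.get(n, set())
    return [n for n in sorted(scope)
            if "total_price" in graph.get(n, {}) and violates_price_rule(graph[n])]

# Example: p2's total_price is inconsistent with base_price + shipping.
graph = {
    "p1": {"base_price": 100.0, "shipping": 5.0, "total_price": 105.0},
    "p2": {"base_price": 100.0, "shipping": 5.0, "total_price": 120.0},
}
edges = {"p1": {"p2"}, "p2": {"p1"}}
print(check_updates(graph, edges, updated={"p2"}))   # -> ['p2']
```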
