Research in natural language processing (NLP) focuses recently on the development of learned language models called word embedding models like word2vec, fastText, and BERT. Pre-trained on large amounts of unstructured text in natural language, those embedding models constitute a rich source of common knowledge in the domain of the text used for the training. In the NLP community, significant improvements are achieved by using those models together with deep neural network models. To support applications to benefit from word embeddings, we extend the capabilities of traditional relational database systems, which are still by far the most common DBMSs but only provide limited text analysis features. Therefore, we implement (a) novel database operations involving embedding representations to allow a database user to exploit the knowledge encoded in word embedding models for advanced text analysis operations. The integration of those operations into database query language enables users to construct queries using novel word embedding operations in conjunction with traditional query capabilities of SQL. To allow efficient retrieval of embedding representations and fast execution of the operations, we implement (b) novel search algorithms and index structures for approximated kNN-Joins and integrate those into a relational database management system. Moreover, we investigate techniques to optimize embedding representations of text values in database systems. Therefore, we design (c) a novel context adaptation algorithm. This algorithm utilizes the structured data present in the database to enrich the embedding representations of text values to model their context-specific semantic in the database. Besides, we provide (d) support for selecting a word embedding model suitable for a user's application. Therefore, we developed a data processing pipeline to construct a dataset for domain-specific word embedding evaluation. Finally, we propose (e) novel embedding techniques for pre-training on tabular data to support applications working with text values in tables. Our proposed embedding techniques model semantic relations arising from the alignment of words in tabular layouts that can only hardly be derived from text documents, e.g., relations between table schema and table body. In this way, many applications, which either employ embeddings in supervised machine learning models, e.g., to classify cells in spreadsheets, or through the application of arithmetic operations, e.g., table discovery applications, can profit from the proposed embedding techniques.:1 INTRODUCTION
1.1 Contribution
1.2 Outline
2 REPRESENTATION OF TEXT FOR NATURAL LANGUAGE PROCESSING
2.1 Natural Language Processing Systems
2.2 Word Embedding Models
2.2.1 Matrix Factorization Methods
2.2.2 Learned Distributed Representations
2.2.3 Contextualize Word Embeddings
2.2.4 Advantages of Contextualize and Static Word Embeddings
2.2.5 Properties of Static Word Embeddings
2.2.6 Node Embeddings
2.2.7 Non-Euclidean Embedding Techniques
2.3 Evaluation of Word Embeddings
2.3.1 Similarity Evaluation
2.3.2 Analogy Evaluation
2.3.3 Cluster-based Evaluation 2.4 Application for Tabular Data
2.4.1 Semantic Search
2.4.2 Data Curation
2.4.3 Data Discovery
3 SYSTEM OVERVIEW
3.1 Opportunities of an Integration
3.2 Characteristics of Word Vectors
3.3 Objectives and Challenges
3.4 Word Embedding Operations
3.5 Performance Optimization of Operations
3.6 Context Adaptation
3.7 Requirements for Model Recommendation
3.8 Tabular Embedding Models
4 MANAGEMENT OF EMBEDDING REPRESENTATIONS IN DATABASE SYSTEMS
4.1 Integration of Operations in an RDBMS
4.1.1 System Architecture
4.1.2 Storage Formats
4.1.3 User-Defined Functions
4.1.4 Web Application
4.2 Nearest Neighbor Search
4.2.1 Tree-based Methods
4.2.2 Proximity Graphs
4.2.3 Locality-Sensitive Hashing
4.2.4 Quantization Techniques
4.3 Applicability of ANN Techniques for Word Embedding kNN-Joins
4.4 Related Work on kNN Search in Database Systems
4.5 ANN-Joins for Relational Database Systems
4.5.1 Index Architecture
4.5.2 Search Algorithm
4.5.3 Distance Calculation
4.5.4 Optimization Capabilities
4.5.5 Estimation of the Number of Targets 4.5.6 Flexible Product Quantization
4.5.7 Further Optimizations
4.5.8 Parameter Tuning
4.5.9 kNN-Joins for Word2Bits
4.6 Evaluation
4.6.1 Experimental Setup
4.6.2 Influence of Index Parameters on Precision and Execution Time
4.6.3 Performance of Subroutines
4.6.4 Flexible Product Quantization
4.6.5 Accuracy of the Target Size Estimation
4.6.6 Performance of Word2Bits kNN-Join
4.7 Summary
5 CONTEXT ADAPTATION FOR WORD EMBEDDING OPTIMIZATION
5.1 Related Work
5.1.1 Graph and Text Joint Embedding Methods
5.1.2 Retrofitting Approaches
5.1.3 Table Embedding Models
5.2 Relational Retrofitting Approach
5.2.1 Data Preparation
5.2.2 Relational Retrofitting Problem
5.2.3 Relational Retrofitting Algorithm
5.2.4 Online-RETRO
5.3 Evaluation Platform: Retro Live
5.3.1 Functionality
5.3.2 Interface
5.4 Evaluation
5.4.1 Datasets
5.4.2 Training of Embeddings
5.4.3 Machine Learning Models
5.4.4 Evaluation of ML Models
5.4.5 Run-time Measurements
5.4.6 Online Retrofitting
5.5 Summary
6 MODEL RECOMMENDATION
6.1 Related Work
6.1.1 Extrinsic Evaluation
6.1.2 Intrinsic Evaluation
6.2 Architecture of FacetE
6.3 Evaluation Dataset Construction Pipeline
6.3.1 Web Table Filtering and Facet Candidate Generation
6.3.2 Check Soft Functional Dependencies
6.3.3 Post-Filtering
6.3.4 Categorization
6.4 Evaluation of Popular Word Embedding Models
6.4.1 Domain-Agnostic Evaluation
6.4.2 Evaluation of a Single Facet
6.4.3 Evaluation of an Object Set
6.5 Summary
7 TABULAR TEXT EMBEDDINGS
7.1 Related Work
7.1.1 Static Table Embedding Models
7.1.2 Contextualized Table Embedding Models
7.2 Web Table Embedding Model
7.2.1 Preprocessing
7.2.2 Text Serialization
7.2.3 Encoding Model
7.2.4 Embedding Training
7.3 Applications for Table Embeddings
7.3.1 Table Union Search
7.3.2 Classification Tasks
7.4 Evaluation
7.4.1 Intrinsic Evaluation
7.4.2 Table Union Search Evaluation
7.4.3 Table Layout Classification
7.4.4 Spreadsheet Cell Classification
7.5 Summary
8 CONCLUSION
8.1 Summary
8.2 Directions for Future Work
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
A CONVEXITY OF RELATIONAL RETROFITTING
B EVALUATION OF THE RELATIONAL RETROFITTING HYPERPARAMETERS
Identifer | oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:76665 |
Date | 18 November 2021 |
Creators | Günther, Michael |
Contributors | Lehner, Wolfgang, Bizer, Christian, Technische Universität Dresden |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text |
Rights | info:eu-repo/semantics/openAccess |
Relation | info:eu-repo/grantAgreement/Deutsche Forschungsgemeinschaft/Graduiertenkolleg 1907 RoSI: Rollenbasierte Software-Infrastrukturen für durchgängig-kontextsensitive Systeme/221322883/ |
Page generated in 0.0076 seconds