51

Word Embeddings in Database Systems

Günther, Michael 18 November 2021
Research in natural language processing (NLP) has recently focused on the development of learned language models, called word embedding models, such as word2vec, fastText, and BERT. Pre-trained on large amounts of unstructured text in natural language, these embedding models constitute a rich source of common knowledge in the domain of the text used for training. In the NLP community, significant improvements are achieved by using these models together with deep neural network models. To enable applications to benefit from word embeddings, we extend the capabilities of traditional relational database systems, which are still by far the most common DBMSs but provide only limited text analysis features. Specifically, we implement (a) novel database operations involving embedding representations that allow a database user to exploit the knowledge encoded in word embedding models for advanced text analysis. Integrating these operations into the database query language enables users to combine novel word embedding operations with the traditional query capabilities of SQL. To allow efficient retrieval of embedding representations and fast execution of the operations, we implement (b) novel search algorithms and index structures for approximate kNN-joins and integrate them into a relational database management system. Moreover, we investigate techniques to optimize embedding representations of text values in database systems. To this end, we design (c) a novel context adaptation algorithm that utilizes the structured data present in the database to enrich the embedding representations of text values and model their context-specific semantics in the database. In addition, we provide (d) support for selecting a word embedding model suitable for a user's application, developing a data processing pipeline to construct a dataset for domain-specific word embedding evaluation. Finally, we propose (e) novel embedding techniques for pre-training on tabular data to support applications working with text values in tables. Our proposed embedding techniques model semantic relations arising from the alignment of words in tabular layouts that can hardly be derived from text documents, e.g., relations between a table schema and the table body. In this way, many applications that either employ embeddings in supervised machine learning models (e.g., to classify cells in spreadsheets) or apply arithmetic operations on them (e.g., table discovery applications) can profit from the proposed embedding techniques.
Table of contents:
1 INTRODUCTION 1.1 Contribution 1.2 Outline
2 REPRESENTATION OF TEXT FOR NATURAL LANGUAGE PROCESSING 2.1 Natural Language Processing Systems 2.2 Word Embedding Models 2.2.1 Matrix Factorization Methods 2.2.2 Learned Distributed Representations 2.2.3 Contextualized Word Embeddings 2.2.4 Advantages of Contextualized and Static Word Embeddings 2.2.5 Properties of Static Word Embeddings 2.2.6 Node Embeddings 2.2.7 Non-Euclidean Embedding Techniques 2.3 Evaluation of Word Embeddings 2.3.1 Similarity Evaluation 2.3.2 Analogy Evaluation 2.3.3 Cluster-based Evaluation 2.4 Application for Tabular Data 2.4.1 Semantic Search 2.4.2 Data Curation 2.4.3 Data Discovery
3 SYSTEM OVERVIEW 3.1 Opportunities of an Integration 3.2 Characteristics of Word Vectors 3.3 Objectives and Challenges 3.4 Word Embedding Operations 3.5 Performance Optimization of Operations 3.6 Context Adaptation 3.7 Requirements for Model Recommendation 3.8 Tabular Embedding Models
4 MANAGEMENT OF EMBEDDING REPRESENTATIONS IN DATABASE SYSTEMS 4.1 Integration of Operations in an RDBMS 4.1.1 System Architecture 4.1.2 Storage Formats 4.1.3 User-Defined Functions 4.1.4 Web Application 4.2 Nearest Neighbor Search 4.2.1 Tree-based Methods 4.2.2 Proximity Graphs 4.2.3 Locality-Sensitive Hashing 4.2.4 Quantization Techniques 4.3 Applicability of ANN Techniques for Word Embedding kNN-Joins 4.4 Related Work on kNN Search in Database Systems 4.5 ANN-Joins for Relational Database Systems 4.5.1 Index Architecture 4.5.2 Search Algorithm 4.5.3 Distance Calculation 4.5.4 Optimization Capabilities 4.5.5 Estimation of the Number of Targets 4.5.6 Flexible Product Quantization 4.5.7 Further Optimizations 4.5.8 Parameter Tuning 4.5.9 kNN-Joins for Word2Bits 4.6 Evaluation 4.6.1 Experimental Setup 4.6.2 Influence of Index Parameters on Precision and Execution Time 4.6.3 Performance of Subroutines 4.6.4 Flexible Product Quantization 4.6.5 Accuracy of the Target Size Estimation 4.6.6 Performance of Word2Bits kNN-Join 4.7 Summary
5 CONTEXT ADAPTATION FOR WORD EMBEDDING OPTIMIZATION 5.1 Related Work 5.1.1 Graph and Text Joint Embedding Methods 5.1.2 Retrofitting Approaches 5.1.3 Table Embedding Models 5.2 Relational Retrofitting Approach 5.2.1 Data Preparation 5.2.2 Relational Retrofitting Problem 5.2.3 Relational Retrofitting Algorithm 5.2.4 Online-RETRO 5.3 Evaluation Platform: Retro Live 5.3.1 Functionality 5.3.2 Interface 5.4 Evaluation 5.4.1 Datasets 5.4.2 Training of Embeddings 5.4.3 Machine Learning Models 5.4.4 Evaluation of ML Models 5.4.5 Run-time Measurements 5.4.6 Online Retrofitting 5.5 Summary
6 MODEL RECOMMENDATION 6.1 Related Work 6.1.1 Extrinsic Evaluation 6.1.2 Intrinsic Evaluation 6.2 Architecture of FacetE 6.3 Evaluation Dataset Construction Pipeline 6.3.1 Web Table Filtering and Facet Candidate Generation 6.3.2 Check Soft Functional Dependencies 6.3.3 Post-Filtering 6.3.4 Categorization 6.4 Evaluation of Popular Word Embedding Models 6.4.1 Domain-Agnostic Evaluation 6.4.2 Evaluation of a Single Facet 6.4.3 Evaluation of an Object Set 6.5 Summary
7 TABULAR TEXT EMBEDDINGS 7.1 Related Work 7.1.1 Static Table Embedding Models 7.1.2 Contextualized Table Embedding Models 7.2 Web Table Embedding Model 7.2.1 Preprocessing 7.2.2 Text Serialization 7.2.3 Encoding Model 7.2.4 Embedding Training 7.3 Applications for Table Embeddings 7.3.1 Table Union Search 7.3.2 Classification Tasks 7.4 Evaluation 7.4.1 Intrinsic Evaluation 7.4.2 Table Union Search Evaluation 7.4.3 Table Layout Classification 7.4.4 Spreadsheet Cell Classification 7.5 Summary
8 CONCLUSION 8.1 Summary 8.2 Directions for Future Work
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
A CONVEXITY OF RELATIONAL RETROFITTING
B EVALUATION OF THE RELATIONAL RETROFITTING HYPERPARAMETERS
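The kNN-join in (b) pairs each text value of one relation with its k most similar text values in another, by distance in the embedding space. A minimal brute-force sketch in Python, illustrative only: the thesis integrates approximate indexes such as product quantization into the RDBMS, which this sketch does not attempt, and all names and data here are placeholders.

```python
import numpy as np

def knn_join(left_vecs, right_vecs, k=3):
    """Brute-force kNN-join: for each row vector in left_vecs, return the
    indices of the k nearest row vectors in right_vecs by cosine similarity."""
    # Normalize rows so that the dot product equals cosine similarity.
    l = left_vecs / np.linalg.norm(left_vecs, axis=1, keepdims=True)
    r = right_vecs / np.linalg.norm(right_vecs, axis=1, keepdims=True)
    sims = l @ r.T                          # (n_left, n_right) similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]  # top-k columns per row, descending

# Toy usage: 5 and 8 random 50-dimensional "word vectors".
rng = np.random.default_rng(0)
pairs = knn_join(rng.normal(size=(5, 50)), rng.normal(size=(8, 50)), k=2)
print(pairs)  # row i lists the 2 nearest right-hand rows for left row i
```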
52

Evaluation of Sentence Representations in Semantic Text Similarity Tasks / Utvärdering av meningsrepresentation för semantisk textlikhet

Balzar Ekenbäck, Nils January 2021
This thesis explores methods of representing sentences in vector form for semantic text similarity using word embeddings and benchmarks them against sentence-based evaluation test sets. Two methods were used to evaluate the representations: STS Benchmark and STS Benchmark converted to a binary similarity task. Results showed that preprocessing of the word vectors could significantly boost performance in both tasks, and the study concludes that word embeddings still provide an acceptable solution for specific applications. The study also concluded that the dataset used might not be ideal for this type of evaluation, as the sentence pairs in general had a high lexical overlap. To tackle this, the study suggests that a paraphrasing dataset could act as a complement, but that further investigation would be needed. / Denna avhandling undersöker metoder för att representera meningar i vektorform för semantisk textlikhet och jämför dem med meningsbaserade testmängder. För att utvärdera representationerna användes två metoder: STS Benchmark, en vedertagen metod för att utvärdera språkmodellers förmåga att utvärdera semantisk likhet, och STS Benchmark konverterad till en binär likhetsuppgift. Resultaten visade att förbehandling av texten och ordvektorerna kunde ge en signifikant ökning i resultatet för dessa uppgifter. Studien konkluderade även att datamängden som användes kanske inte är ideal för denna typ av utvärdering, då meningsparen i stort hade ett högt lexikalt överlapp. Som komplement föreslår studien en parafrasdatamängd, något som skulle kräva ytterligare studier.
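A common construction for sentence representations of this kind averages the word vectors of each sentence and scores a pair by cosine similarity. A minimal sketch under that assumption; the toy embedding table below stands in for the pre-trained vectors an actual experiment would load.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy embedding table; real experiments load pre-trained word vectors.
emb = {w: rng.normal(size=50) for w in "a cat sat on the mat dog lay rug".split()}

def sentence_vec(tokens, emb):
    """Average the word vectors of the in-vocabulary tokens of a sentence."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

s1 = "a cat sat on the mat".split()
s2 = "a dog lay on the rug".split()
score = cosine(sentence_vec(s1, emb), sentence_vec(s2, emb))
print(score)  # for the binary variant, threshold this score, e.g. score > 0.5
```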
53

[en] CORPUS FOR ACADEMIC DOMAIN: MODELS AND APPLICATIONS / [pt] CORPUS PARA O DOMÍNIO ACADÊMICO: MODELOS E APLICAÇÕES

IVAN DE JESUS PEREIRA PINTO 16 November 2021
[pt] Dados acadêmicos (e.g., Teses, Dissertações) englobam aspectos de toda uma sociedade, bem como seu conhecimento científico. Neles, há uma riqueza de informações a ser explorada por modelos computacionais, e que podem ser positivos para sociedade. Os modelos de aprendizado de máquina, em especial, possuem uma crescente necessidade de dados para treinamento, que precisam ser estruturados e de tamanho considerável. Seu uso na área de processamento de linguagem natural é pervasivo nas mais diversas tarefas. Este trabalho realiza o esforço de coleta, construção, análise do maior corpus acadêmico conhecido na língua portuguesa. Foram treinados modelos de vetores de palavras, bag-of-words e transformer. O modelo transformer BERTAcadêmico apresentou os melhores resultados, com 77 por cento de f1-score na classificação da Grande Área de conhecimento e 63 por cento de f1-score na classificação da Área de conhecimento nas categorizações de Teses e Dissertações. É feita ainda uma análise semântica do corpus acadêmico através da modelagem de tópicos, e uma visualização inédita das áreas de conhecimento em forma de clusters. Por fim, é apresentada uma aplicação que faz uso dos modelos treinados, o SucupiraBot. / [en] Academic data (e.g., theses, dissertations) encompass aspects of a whole society, as well as its scientific knowledge. They contain a wealth of information to be explored by computational models, which can be positive for society. Machine learning models in particular have a growing need for training data, which must be structured and of considerable size. Their use in natural language processing (NLP) is pervasive across many different tasks. This work collects, constructs, and analyzes the largest known academic corpus in the Portuguese language, and trains models on it. Word embedding, bag-of-words, and transformer models were trained. The BERTAcadêmico transformer showed the best results, with an f1-score of 77 percent for classifying the Great Area of Knowledge and 63 percent for classifying the Area of Knowledge of theses and dissertations. A semantic analysis of the academic corpus is carried out through topic modeling, together with a novel visualization of the knowledge areas as clusters. Finally, an application that makes use of the trained models, the SucupiraBot, is presented.
54

Word embeddings for monolingual and cross-language domain-specific information retrieval / Ordinbäddningar för enspråkig och tvärspråklig domänspecifik informationssökning

Wigder, Chaya January 2018
Various studies have shown the usefulness of word embedding models for a wide variety of natural language processing tasks. This thesis examines how word embeddings can be incorporated into domain-specific search engines for both monolingual and cross-language search. This is done by testing various embedding model hyperparameters, as well as methods for weighting the relative importance of words to a document or query. In addition, methods for generating domain-specific bilingual embeddings are examined and tested. The system was compared to a baseline that used cosine similarity without word embeddings, and for both the monolingual and bilingual search engines the use of monolingual embedding models improved performance above the baseline. However, bilingual embeddings, especially for domain-specific terms, tended to be of too poor quality to be used directly in the search engines. / Flera studier har visat att ordinbäddningsmodeller är användbara för många olika språkteknologiuppgifter. Denna avhandling undersöker hur ordinbäddningsmodeller kan användas i sökmotorer för både enspråkig och tvärspråklig domänspecifik sökning. Experiment gjordes för att optimera hyperparametrarna till ordinbäddningsmodellerna och för att hitta det bästa sättet att vikta ord efter hur viktiga de är i dokumentet eller sökfrågan. Dessutom undersöktes metoder för att skapa domänspecifika tvåspråkiga inbäddningar. Systemet jämfördes med en baslinje utan inbäddningar baserad på cosinuslikhet, och för både enspråkiga och tvärspråkliga sökningar var systemet som använde enspråkiga inbäddningar bättre än baslinjen. Däremot var de tvåspråkiga inbäddningarna, särskilt för domänspecifika ord, av låg kvalitet och gav för dåliga resultat för direkt användning inom sökmotorer.
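One standard way to "weight the relative importance of words to a document or query" is an IDF-weighted average of word vectors followed by cosine ranking. A hedged sketch under that assumption; the IDF scheme and toy data are illustrative choices, not necessarily the thesis's exact setup.

```python
import math
import numpy as np

def idf_weights(docs):
    """Inverse document frequency for each term over tokenized documents."""
    n, df = len(docs), {}
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    return {t: math.log(n / c) for t, c in df.items()}

def weighted_vec(tokens, emb, idf):
    """IDF-weighted average of the word vectors of a token sequence."""
    vecs = [idf.get(t, 1.0) * emb[t] for t in tokens if t in emb]
    total = sum(idf.get(t, 1.0) for t in tokens if t in emb)
    return np.sum(vecs, axis=0) / max(total, 1e-9)

def rank(query_tokens, docs, emb):
    """Return document indices sorted by cosine similarity to the query."""
    idf = idf_weights(docs)
    q = weighted_vec(query_tokens, emb, idf)
    def cos(doc):
        v = weighted_vec(doc, emb, idf)
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
    return sorted(range(len(docs)), key=lambda i: cos(docs[i]), reverse=True)

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in "the a neural network patent search engine query".split()}
docs = [["the", "neural", "network"], ["a", "patent", "search", "engine"]]
print(rank(["search", "query"], docs, emb))  # indices, most similar first
```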
55

Semantically Aligned Sentence-Level Embeddings for Agent Autonomy and Natural Language Understanding

Fulda, Nancy Ellen 01 August 2019
Many applications of neural linguistic models rely on their use as pre-trained features for downstream tasks such as dialog modeling, machine translation, and question answering. This work presents an alternate paradigm: rather than treating linguistic embeddings as input features, we treat them as common sense knowledge repositories that can be queried using simple mathematical operations within the embedding space, without the need for additional training. Because current state-of-the-art embedding models were not optimized for this purpose, this work presents a novel embedding model designed and trained specifically for the purpose of "reasoning in the linguistic domain". Our model jointly represents single words, multi-word phrases, and complex sentences in a unified embedding space. To facilitate common-sense reasoning beyond straightforward semantic associations, the embeddings produced by our model exhibit carefully curated properties including analogical coherence and polarity displacement. In other words, rather than training the model on a smorgasbord of tasks and hoping that the resulting embeddings will serve our purposes, we have instead crafted training tasks and placed constraints on the system that are explicitly designed to induce the properties we seek. The resulting embeddings perform competitively on the SemEval 2013 benchmark and outperform state-of-the-art models on two key semantic discernment tasks introduced in Chapter 8. The ultimate goal of this research is to empower agents to reason about low-level behaviors in order to fulfill abstract natural language instructions in an autonomous fashion. An agent equipped with an embedding space of sufficient caliber could potentially reason about new situations based on their similarity to past experience, facilitating knowledge transfer and one-shot learning. As our embedding model continues to improve, we hope to see these and other abilities become a reality.
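Querying an embedding space "using simple mathematical operations" typically means vector arithmetic, and analogical coherence is exactly the property that makes offsets like vec(king) - vec(man) + vec(woman) land near vec(queen). A toy sketch of such an analogy query; the 2-dimensional vectors are fabricated purely for illustration.

```python
import numpy as np

def analogy(a, b, c, emb):
    """Return the word whose vector is closest (by cosine similarity) to
    vec(b) - vec(a) + vec(c), excluding the three query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = float(target @ v / (np.linalg.norm(target) * np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Toy 2-d space where the man->woman offset is parallel to king->queen.
emb = {"man": np.array([1.0, 0.0]), "woman": np.array([1.0, 1.0]),
       "king": np.array([3.0, 0.1]), "queen": np.array([3.0, 1.1])}
print(analogy("man", "king", "woman", emb))  # -> "queen"
```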
56

Sudoku Variants on the Torus

Wyld, Kira A 01 January 2017 (has links)
This paper examines the mathematical properties of Sudoku puzzles defined on a torus. We seek to answer, for these variants, the questions that have been explored for traditional Sudoku, and we do so for two such embeddings. The end result of this paper is a deeper mathematical understanding of logic puzzles of this type, as well as a fun new puzzle to play.
57

Embeddings of Product Graphs Where One Factor is a Hypercube

Turner, Bethany 29 April 2011 (has links)
Voltage graph theory can be used to describe embeddings of product graphs if one factor is a Cayley graph. We use voltage graphs to explore embeddings of various products where one factor is a hypercube, describing some minimal and symmetrical embeddings. We then define a graph product, the weak symmetric difference, and illustrate a voltage graph construction useful for obtaining an embedding of the weak symmetric difference of an arbitrary graph with a hypercube.
58

Word embeddings and Patient records: The identification of MRI risk patients

Kindberg, Erik January 2019
Identification of risks ahead of MRI examinations has been identified as a cumbersome and time-consuming process at the Linköping University Hospital radiology clinic. The hospital staff often have to search through large amounts of unstructured patient data to find information about implants. Word embeddings have been identified as a possible tool to speed up this process. The purpose of this thesis is to evaluate this method, which is done by training a Word2Vec model on patient journal data and analyzing the closest neighbours of key search words by calculating cosine similarity. The 50 closest neighbours of each search word are categorized and annotated as relevant or not to the task of identifying risk patients ahead of MRI examinations. Ten search words were explored, leading to a total of 500 terms being annotated. In total, 14 different categories were observed in the results, and 8 of these were considered relevant. Out of the 500 terms, 340 (68%) were considered relevant. In addition, 48 implant models could be observed, which are particularly interesting because if a patient has an implant, hospital staff need to determine its exact model and the MRI conditions of that model. Overall, these findings point towards a positive answer to the aim of the thesis, although further development is needed.
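The pipeline described above (train Word2Vec on journal text, then inspect a search word's nearest neighbours by cosine similarity) can be sketched with gensim; the corpus and search word below are placeholders, not actual patient data.

```python
from gensim.models import Word2Vec

# Placeholder corpus: one tokenized sentence per clinical note.
notes = [["patient", "has", "pacemaker", "implant"],
         ["mri", "contraindicated", "due", "to", "pacemaker"],
         ["no", "implant", "reported", "before", "mri"]]

model = Word2Vec(sentences=notes, vector_size=50, window=5,
                 min_count=1, epochs=50, workers=1, seed=0)

# Nearest neighbours by cosine similarity; in the thesis, the 50 closest
# terms per search word were then categorized and annotated by hand.
print(model.wv.most_similar("pacemaker", topn=3))
```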
59

Auslander-Reiten theory for systems of submodule embeddings

Unknown Date
In this dissertation, we investigate aspects of Auslander-Reiten theory adapted to the setting of systems of submodule embeddings. Using this theory, we can compute Auslander-Reiten quivers of such categories, which, among other things, yield valuable information about the indecomposable objects in such a category. A main result of the dissertation is an adaptation to this situation of the theorem of Auslander and Ringel-Tachikawa, which states that for an artinian ring R of finite representation type, each R-module is a direct sum of finite-length indecomposable R-modules. In cases where this applies, the indecomposable objects obtained in the Auslander-Reiten quiver give the building blocks for the objects in the category. We also briefly discuss in which cases systems of submodule embeddings form a Frobenius category, and for a few examples explore the pointwise Calabi-Yau dimension of such a category. / by Audrey Moore. / Thesis (Ph.D.)--Florida Atlantic University, 2009.
60

Exponential Family Embeddings

Rudolph, Maja January 2018
Word embeddings are a powerful approach for capturing semantic similarity among terms in a vocabulary. Exponential family embeddings extend the idea of word embeddings to other types of high-dimensional data. Exponential family embeddings have three ingredients: embeddings as latent variables, a predefined conditioning set for each observation called the context, and a conditional likelihood from the exponential family. The embeddings are inferred with a scalable algorithm. This thesis highlights three advantages of the exponential family embeddings model class: (A) The approximations used for existing methods such as word2vec can be understood as a biased stochastic gradient procedure on a specific type of exponential family embedding model, the Bernoulli embedding. (B) By choosing different likelihoods from the exponential family, we can generalize the task of learning distributed representations to different application domains. For example, we can learn embeddings of grocery items from shopping data, embeddings of movies from click data, or embeddings of neurons from recordings of zebrafish brains. On all three applications, we find exponential family embedding models to be more effective than other types of dimensionality reduction. They better reconstruct held-out data and find interesting qualitative structure. (C) Finally, the probabilistic modeling perspective allows us to incorporate structure and domain knowledge in the embedding space. We develop models for studying how language varies over time, differs between related groups of data, and how word usage differs between languages. Key to the success of these methods is that the embeddings share statistical information through hierarchical priors or neural networks. We demonstrate the benefits of this approach in empirical studies of Senate speeches, scientific abstracts, and shopping baskets.
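The three ingredients can be written compactly: each observation x_i is conditioned on its context x_{c_i} through an exponential-family likelihood whose natural parameter couples an embedding vector rho_i with the context vectors alpha_j. A sketch in our notation, following the standard formulation, with the Bernoulli case from (A) as the example:

```latex
x_i \mid x_{c_i} \sim \mathrm{ExpFam}\!\left(\eta_i(x_{c_i}),\, t(x_i)\right),
\qquad
\eta_i(x_{c_i}) = \rho_i^{\top} \sum_{j \in c_i} \alpha_j x_j,
\qquad
p\!\left(x_i = 1 \mid x_{c_i}\right) = \sigma\!\left(\eta_i(x_{c_i})\right).
```

For the Bernoulli embedding, the last expression is the logistic model on which, per (A), word2vec's approximations act as biased stochastic gradients.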
