Global ETD Search

1	Embedding Network Information for Machine Learning-based Intrusion Detection DeFreeuw, Jonathan Daniel 18 January 2019 As computer networks grow and demonstrate more complicated and intricate behaviors, traditional intrusion detection systems have fallen behind in their ability to protect network resources. Machine learning has stepped to the forefront of intrusion detection research due to its potential to predict future behaviors. However, training these systems requires network data such as NetFlow, which contains information about relationships between hosts that requires human understanding to extract. Additionally, standard methods of encoding this categorical data struggle to capture similarities between points. To counteract this, we evaluate a method of embedding IP addresses and transport-layer ports into a continuous space, called IP2Vec. We demonstrate this embedding on two separate datasets, CTU'13 and UGR'16, and combine the UGR'16 embedding with several machine learning methods. We compare the models with and without the embedding to evaluate the benefits of including network behavior in an intrusion detection system. We show that the addition of embeddings improves the F1-scores for all models in the multiclass classification problem given in the UGR'16 data. / MS / As computer networks grow and demonstrate more complicated and intricate behaviors, traditional network protection tools like firewalls struggle to protect personal computers and servers. Machine learning has stepped to the forefront to counteract this by learning and predicting behavior on a network. However, this learned behavior fails to capture much of the information regarding relationships between computers on a network. Additionally, standard techniques to convert network information into numbers struggle to capture many of the similarities between machines.
To counteract this, we evaluate a method to capture relationships between IP addresses and ports, called an embedding. We demonstrate this embedding on two different datasets of network traffic, and evaluate the embedding on one dataset with several machine learning methods. We compare the models with and without the embedding to evaluate the benefits of including network behavior in an intrusion detection system. We show that including network behavior in machine learning models improves performance when classifying attacks found in the UGR’16 data. word embeddings; intrusion detection
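The claim that standard categorical encodings fail to capture similarity can be illustrated with a small sketch. It one-hot encodes three ports and compares them by cosine similarity: every pair of distinct values scores exactly 0, so the two web ports look no more alike than a web port and SSH. The dense vectors in `toy_embedding` are invented numbers for illustration only; they are not IP2Vec output, and all names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_hot(index, size):
    """One-hot vector with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


ports = ["80", "443", "22"]
one_hot_vecs = {p: one_hot(i, len(ports)) for i, p in enumerate(ports)}

# Assumption: made-up 2-d vectors in which the two web ports lie close together.
toy_embedding = {"80": [0.9, 0.1], "443": [0.8, 0.2], "22": [-0.1, 0.95]}

onehot_web = cosine(one_hot_vecs["80"], one_hot_vecs["443"])
onehot_ssh = cosine(one_hot_vecs["80"], one_hot_vecs["22"])
dense_web = cosine(toy_embedding["80"], toy_embedding["443"])
dense_ssh = cosine(toy_embedding["80"], toy_embedding["22"])

print(onehot_web, onehot_ssh)  # one-hot: all distinct pairs are equally dissimilar
print(dense_web > dense_ssh)   # dense vectors can rank 443 closer to 80 than 22
```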
2	Detecting Lexical Semantic Change Using Probabilistic Gaussian Word Embeddings Moss, Adam January 2020 In this work, we test two novel methods of using word embeddings to detect lexical semantic change, attempting to overcome limitations associated with conventional approaches to this problem. Using a diachronic corpus spanning over a hundred years, we generate word embeddings for each decade with the intention of evaluating how meaning changes are represented in embeddings for the same word across time. Our approach differs from previous works in this field in that we encode words as probabilistic Gaussian distributions and bimodal probabilistic Gaussian mixtures, rather than conventional word vectors. We provide a discussion and analysis of our results, comparing the approaches we implemented with those used in previous works. We also conduct further analysis on whether additional information regarding the nature of semantic change can be discerned from particular qualities of the embeddings we generated for our experiments. In our results, we find that encoding words as probabilistic Gaussian embeddings can provide an enhanced degree of reliability with regard to detecting lexical semantic change. Furthermore, we are able to represent additional information regarding the nature of such changes through the variance of these embeddings. Encoding words as bimodal Gaussian mixtures, however, is generally unsuccessful for this task, proving not reliable enough at distinguishing between discrete senses to effectively detect and measure such changes. We provide potential explanations for the results we observe and propose changes to our approach that could improve performance. historical linguistics; historical semantics; lexical semantic change; diachronic semantic change; word embeddings; probabilistic word embeddings; gaussian word embeddings
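As a rough illustration of how two Gaussian embeddings of the same word from different decades might be compared, the sketch below computes the KL divergence between two Gaussians with diagonal covariance. The abstract does not state which measure or covariance form the thesis uses, so both are assumptions, and the means and variances are invented values.

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for Gaussians with diagonal covariance.

    Each argument is a list with one entry per dimension; variances must be > 0.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
    return 0.5 * total


# Hypothetical embeddings of one word in an early and a late decade.
early_mu, early_var = [0.0, 0.0], [1.0, 1.0]
late_mu, late_var = [1.0, 0.0], [2.0, 2.0]

no_change = kl_diag_gaussians(early_mu, early_var, early_mu, early_var)
change = kl_diag_gaussians(early_mu, early_var, late_mu, late_var)
print(no_change, change)
```

In this toy case the mean shifts and the variance doubles, so the divergence is positive, while comparing a distribution with itself gives zero. KL divergence is asymmetric, so the direction of comparison is a design choice.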
3	A recurrent neural network architecture for biomedical event trigger classification Bopaiah, Jeevith 01 January 2018 A “biomedical event” is a broad term used to describe the roles and interactions between entities (such as proteins, genes and cells) in a biological system. The task of biomedical event extraction aims at identifying and extracting these events from unstructured texts. An important component in the early stage of the task is biomedical trigger classification, which involves identifying and classifying words/phrases that indicate an event. In this thesis, we present our work on biomedical trigger classification developed using the multi-level event extraction dataset. We restrict the scope of our classification to 19 biomedical event types grouped under four broad categories - Anatomical, Molecular, General and Planned. While most of the existing approaches are based on traditional machine learning algorithms that require extensive feature engineering, our model relies on neural networks to implicitly learn important features directly from the text. We use natural language processing techniques to transform the text into vectorized inputs that can be used in a neural network architecture. To our knowledge, this is the first time neural attention strategies have been explored in the area of biomedical trigger classification. Our best results were obtained from an ensemble of 50 models, which produced a micro F-score of 79.82%, an improvement of 1.3% over the previous best score. LSTM; word embeddings; biomedical triggers; attention layer; Artificial Intelligence and Robotics
4	TimeLink: Visualizing Diachronic Word Embeddings and Topics Williams, Lemara Faith 11 June 2024 The task of analyzing a collection of documents generated over time is daunting. A natural way to ease the task is to summarize the documents into the topics that exist within them. The temporal aspect of topics can frame relevance based on when topics are introduced and when topics stop being mentioned. It creates trends and patterns that can be traced by individual key terms taken from the corpus. If trends are being established, there must be a way to visualize them through the key terms. Creating a visual system to support this analysis can help users quickly gain insights from the data, significantly easing the burden of the original analysis technique. However, creating a visual system for terms is not easy. Work has been done to develop word embeddings, allowing researchers to treat words as numerical data. This makes it possible to create simple charts based on word embeddings, such as scatter plots. However, these methods lose effectiveness with multiple time slices and suffer from point overlap. A visualization method that addresses these problems while also visualizing diachronic word embeddings in an interesting way with added semantic meaning is hard to find. TimeLink addresses these problems. TimeLink is proposed as a dashboard system to help users gain insights from the movement of diachronic word embeddings. It comprises a Sankey diagram showing the path of a selected key term to a cluster in a time period. This local cluster is also mapped to a global topic based on an original corpus of documents from which the key terms are drawn. On the dashboard, users are given different tools to aid a focused analysis, such as filtering key terms and emphasizing specific clusters.
TimeLink provides insightful visualizations focused on temporal word embeddings while maintaining the insights provided by global topic evolution, advancing our understanding of how topics evolve over time. / Master of Science / The task of analyzing documents collected over time is daunting. Grouping documents into topics can help frame relevance based on when topics are introduced and when they stop being mentioned. The creation of topics also makes it possible to visualize trends and patterns. Creating a visual system to support this analysis can help users quickly gain insights from the data, significantly easing the burden of the original analysis technique of browsing individual documents. A visualization system for this analysis typically focuses on the terms that affect established topics. Some visualization methods, like scatter plots, implement this but lose effectiveness as more data is introduced. TimeLink is proposed as a dashboard system to aid users in drawing insights from the development of terms over time. In addition to addressing problems in other visualizations, it visualizes the movement of terms intuitively and adds semantic meaning. TimeLink provides insightful visualizations focused on the movement of terms while maintaining the insights provided by global topic evolution, advancing our understanding of how topics evolve over time. High Dimensional Visualizations; Clustering; Diachronic Word Embeddings; Topic Modeling
5	Biomedical Semantic Embeddings: Using Hybrid Sentences to Construct Biomedical Word Embeddings and its Applications Shaik, Arshad 12 1900 Word embeddings are a useful method that has shown enormous success in various NLP tasks, not only in the open domain but also in the biomedical domain. The biomedical domain provides various domain-specific resources and tools that can be exploited to improve the performance of these word embeddings. However, most of the research related to word embeddings in the biomedical domain focuses on analysis of model architecture, hyper-parameters and input text. In this paper, we use SemMedDB to design new sentences called 'Semantic Sentences'. We then use these sentences, in addition to biomedical text, as inputs to the word embedding model. This approach aims at introducing biomedical semantic types defined by UMLS into the vector space of word embeddings. The semantically rich word embeddings presented here rival state-of-the-art biomedical word embeddings in both semantic similarity and relatedness metrics up to 11%. We also demonstrate how these semantic types in word embeddings can be utilized. machine learning; word embeddings; biomedical resources; skip-gram model
6	Leveraging Degree of Isomorphism to Improve Cross-Lingual Embedding Space for Low-Resource Languages Bhowmik, Kowshik January 2022 No description available. Artificial Intelligence; Cross-Lingual Word Embeddings; Word Embeddings; Low-Resource Languages; Bilingual Lexicon Induction; Computational Linguistics; Natural Language Processing
7	[en] A FAST AND SPACE-ECONOMICAL APPROACH TO WORD MOVER'S DISTANCE / [pt] UMA ABORDAGEM RÁPIDA E ECONÔMICA PARA WORD MOVER'S DISTANCE MATHEUS TELLES WERNER 02 April 2020 The Word Mover's Distance (WMD) proposed in Kusner et al. [ICML, 2015] is a distance between documents that takes advantage of semantic relations among words that are captured by their word embeddings. This distance proved to be quite effective, obtaining state-of-the-art error rates for classification tasks, but also impractical for large collections or large documents, because it requires solving a transportation problem on a complete bipartite graph for each pair of documents.
Under assumptions that are supported by empirical properties of the distances between word embeddings, we simplify WMD to obtain a new distance whose computation requires solving a max-flow problem in a sparse graph, which can be solved much faster than the transportation problem in a dense graph. Our experiments show that we can obtain a performance gain of up to three orders of magnitude over WMD while maintaining the same error rates in document classification tasks. [en] DOCUMENT DISTANCE; WORD MOVER'S DISTANCE; WORD EMBEDDINGS
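To give a feel for how WMD can be cheapened, the sketch below computes the relaxed lower bound described by Kusner et al., in which each word sends all of its weight to its nearest word in the other document. This is not the max-flow method of the thesis, whose details are not given in the abstract. The word vectors and documents are toy values, and all function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nbow(tokens):
    """Normalized bag-of-words: word -> share of the document's tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src, dst, vectors):
    """Each source word moves all of its weight to its nearest target word."""
    return sum(
        weight * min(euclidean(vectors[word], vectors[target]) for target in dst)
        for word, weight in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b, vectors):
    """Lower bound on WMD: the larger of the two one-sided relaxations."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided_cost(a, b, vectors), one_sided_cost(b, a, vectors))


# Assumption: invented 2-d vectors standing in for real word embeddings.
toy_vectors = {
    "obama": [1.0, 0.0],
    "president": [1.0, 1.0],
    "speaks": [0.0, 1.0],
    "greets": [0.0, 2.0],
}
doc_a = ["obama", "speaks"]
doc_b = ["president", "greets"]

distance = relaxed_wmd(doc_a, doc_b, toy_vectors)
same_doc = relaxed_wmd(doc_a, doc_a, toy_vectors)
print(distance, same_doc)
```

The relaxation avoids the transportation problem entirely, at the price of being only a lower bound on the true distance.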
8	Word Embeddings in Database Systems Günther, Michael 18 November 2021 Research in natural language processing (NLP) has recently focused on the development of learned language models called word embedding models, such as word2vec, fastText, and BERT. Pre-trained on large amounts of unstructured natural-language text, these embedding models constitute a rich source of common knowledge in the domain of the text used for training. In the NLP community, significant improvements are achieved by using these models together with deep neural network models. To help applications benefit from word embeddings, we extend the capabilities of traditional relational database systems, which are still by far the most common DBMSs but provide only limited text analysis features. To this end, we implement (a) novel database operations involving embedding representations that allow a database user to exploit the knowledge encoded in word embedding models for advanced text analysis operations. The integration of these operations into the database query language enables users to construct queries using novel word embedding operations in conjunction with the traditional query capabilities of SQL. To allow efficient retrieval of embedding representations and fast execution of the operations, we implement (b) novel search algorithms and index structures for approximate kNN-joins and integrate them into a relational database management system. Moreover, we investigate techniques to optimize embedding representations of text values in database systems. For this purpose, we design (c) a novel context adaptation algorithm. This algorithm utilizes the structured data present in the database to enrich the embedding representations of text values so that they model their context-specific semantics in the database. In addition, we provide (d) support for selecting a word embedding model suitable for a user's application.
To this end, we developed a data processing pipeline to construct a dataset for domain-specific word embedding evaluation. Finally, we propose (e) novel embedding techniques for pre-training on tabular data to support applications working with text values in tables. Our proposed embedding techniques model semantic relations arising from the alignment of words in tabular layouts that can hardly be derived from text documents, e.g., relations between table schema and table body. In this way, many applications that employ embeddings either in supervised machine learning models, e.g., to classify cells in spreadsheets, or through the application of arithmetic operations, e.g., table discovery applications, can profit from the proposed embedding techniques.
  1 INTRODUCTION 1.1 Contribution 1.2 Outline
  2 REPRESENTATION OF TEXT FOR NATURAL LANGUAGE PROCESSING 2.1 Natural Language Processing Systems 2.2 Word Embedding Models 2.2.1 Matrix Factorization Methods 2.2.2 Learned Distributed Representations 2.2.3 Contextualized Word Embeddings 2.2.4 Advantages of Contextualized and Static Word Embeddings 2.2.5 Properties of Static Word Embeddings 2.2.6 Node Embeddings 2.2.7 Non-Euclidean Embedding Techniques 2.3 Evaluation of Word Embeddings 2.3.1 Similarity Evaluation 2.3.2 Analogy Evaluation 2.3.3 Cluster-based Evaluation 2.4 Application for Tabular Data 2.4.1 Semantic Search 2.4.2 Data Curation 2.4.3 Data Discovery
  3 SYSTEM OVERVIEW 3.1 Opportunities of an Integration 3.2 Characteristics of Word Vectors 3.3 Objectives and Challenges 3.4 Word Embedding Operations 3.5 Performance Optimization of Operations 3.6 Context Adaptation 3.7 Requirements for Model Recommendation 3.8 Tabular Embedding Models
  4 MANAGEMENT OF EMBEDDING REPRESENTATIONS IN DATABASE SYSTEMS 4.1 Integration of Operations in an RDBMS 4.1.1 System Architecture 4.1.2 Storage Formats 4.1.3 User-Defined Functions 4.1.4 Web Application 4.2 Nearest Neighbor Search 4.2.1 Tree-based Methods 4.2.2 Proximity Graphs 4.2.3 Locality-Sensitive Hashing 4.2.4 Quantization Techniques 4.3 Applicability of ANN Techniques for Word Embedding kNN-Joins 4.4 Related Work on kNN Search in Database Systems 4.5 ANN-Joins for Relational Database Systems 4.5.1 Index Architecture 4.5.2 Search Algorithm 4.5.3 Distance Calculation 4.5.4 Optimization Capabilities 4.5.5 Estimation of the Number of Targets 4.5.6 Flexible Product Quantization 4.5.7 Further Optimizations 4.5.8 Parameter Tuning 4.5.9 kNN-Joins for Word2Bits 4.6 Evaluation 4.6.1 Experimental Setup 4.6.2 Influence of Index Parameters on Precision and Execution Time 4.6.3 Performance of Subroutines 4.6.4 Flexible Product Quantization 4.6.5 Accuracy of the Target Size Estimation 4.6.6 Performance of Word2Bits kNN-Join 4.7 Summary
  5 CONTEXT ADAPTATION FOR WORD EMBEDDING OPTIMIZATION 5.1 Related Work 5.1.1 Graph and Text Joint Embedding Methods 5.1.2 Retrofitting Approaches 5.1.3 Table Embedding Models 5.2 Relational Retrofitting Approach 5.2.1 Data Preparation 5.2.2 Relational Retrofitting Problem 5.2.3 Relational Retrofitting Algorithm 5.2.4 Online-RETRO 5.3 Evaluation Platform: Retro Live 5.3.1 Functionality 5.3.2 Interface 5.4 Evaluation 5.4.1 Datasets 5.4.2 Training of Embeddings 5.4.3 Machine Learning Models 5.4.4 Evaluation of ML Models 5.4.5 Run-time Measurements 5.4.6 Online Retrofitting 5.5 Summary
  6 MODEL RECOMMENDATION 6.1 Related Work 6.1.1 Extrinsic Evaluation 6.1.2 Intrinsic Evaluation 6.2 Architecture of FacetE 6.3 Evaluation Dataset Construction Pipeline 6.3.1 Web Table Filtering and Facet Candidate Generation 6.3.2 Check Soft Functional Dependencies 6.3.3 Post-Filtering 6.3.4 Categorization 6.4 Evaluation of Popular Word Embedding Models 6.4.1 Domain-Agnostic Evaluation 6.4.2 Evaluation of a Single Facet 6.4.3 Evaluation of an Object Set 6.5 Summary
  7 TABULAR TEXT EMBEDDINGS 7.1 Related Work 7.1.1 Static Table Embedding Models 7.1.2 Contextualized Table Embedding Models 7.2 Web Table Embedding Model 7.2.1 Preprocessing 7.2.2 Text Serialization 7.2.3 Encoding Model 7.2.4 Embedding Training 7.3 Applications for Table Embeddings 7.3.1 Table Union Search 7.3.2 Classification Tasks 7.4 Evaluation 7.4.1 Intrinsic Evaluation 7.4.2 Table Union Search Evaluation 7.4.3 Table Layout Classification 7.4.4 Spreadsheet Cell Classification 7.5 Summary
  8 CONCLUSION 8.1 Summary 8.2 Directions for Future Work
  BIBLIOGRAPHY
  LIST OF FIGURES
  LIST OF TABLES
  A CONVEXITY OF RELATIONAL RETROFITTING
  B EVALUATION OF THE RELATIONAL RETROFITTING HYPERPARAMETERS
info:eu-repo/classification/ddc/004 ddc:004
9	Word embeddings and Patient records : The identification of MRI risk patients Kindberg, Erik January 2019 Identifying risks ahead of MRI examinations has been found to be a cumbersome and time-consuming process at the Linköping University Hospital radiology clinic. The hospital staff often have to search through large amounts of unstructured patient data to find information about implants. Word embeddings have been identified as a possible tool to speed up this process. The purpose of this thesis is to evaluate this method, which is done by training a Word2Vec model on patient journal data and analyzing the close neighbours of key search words by calculating cosine similarity. The 50 closest neighbours of each search word are categorized and annotated as relevant or not relevant to the task of identifying risk patients ahead of MRI examinations. Ten search words were explored, leading to a total of 500 terms being annotated. In total, 14 different categories were observed in the result, and 8 of these were considered relevant. Out of the 500 terms, 340 (68%) were considered relevant. In addition, 48 implant models could be observed, which are particularly interesting because if a patient has an implant, hospital staff need to determine its exact model and the MRI conditions of that model. Overall, these findings point towards a positive answer for the aim of the thesis, although further developments are needed. word2vec; word embeddings; patient records; MRI safety; digital healthcare
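The neighbour analysis described above can be sketched in a few lines: rank all other terms by cosine similarity to a search word and keep the top k. The vectors below are invented three-dimensional stand-ins, not output of the Word2Vec model trained in the thesis, and the thesis used k = 50 rather than the k = 2 shown here. The last line reproduces the reported share of relevant terms.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(query, vectors, k):
    """Return the k terms most similar to `query`, best match first."""
    query_vec = vectors[query]
    scored = [
        (cosine_similarity(query_vec, vec), term)
        for term, vec in vectors.items()
        if term != query
    ]
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]


# Assumption: toy vectors chosen so that implant-related terms cluster together.
toy_vectors = {
    "pacemaker": [0.9, 0.1, 0.0],
    "implant": [0.8, 0.2, 0.1],
    "stent": [0.7, 0.3, 0.0],
    "headache": [0.0, 0.1, 0.9],
}

neighbours = nearest_neighbours("pacemaker", toy_vectors, k=2)
relevant_share = 340 / 500  # annotated terms judged relevant, as reported
print(neighbours, relevant_share)
```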
10	Zpracování češtiny s využitím kontextualizované reprezentace / Czech NLP with Contextualized Embeddings Vysušilová, Petra January 2021 With the increasing amount of digital data in the form of unstructured text, the importance of natural language processing (NLP) increases. The most successful technologies of recent years are deep neural networks. This work applies state-of-the-art methods, namely transfer learning of Bidirectional Encoder Representations from Transformers (BERT), to three Czech NLP tasks: part-of-speech tagging, lemmatization and sentiment analysis. We applied a BERT model with a simple classification head to three Czech sentiment datasets: mall, facebook, and csfd, and we achieved state-of-the-art results. We also explored several possible architectures for tagging and lemmatization and obtained new state-of-the-art results in both tagging and lemmatization with a fine-tuning approach on data from the Prague Dependency Treebank. Specifically, we achieved an accuracy of 98.57% for tagging, 99.00% for lemmatization, and 98.19% for joint accuracy of both tasks. The best models for all tasks are publicly available.
