Global ETD Search

11	Rychlá adaptace počítačové podpory hry Krycí jména pro nové jazyky / Fast Adaptation of Codenames Computer Assistant for New Languages Jareš, Petr January 2021 (has links) This thesis extends a system of an artificial player of a word-association game Codenames to easy addition of support for new languages. The system is able to play Codenames in roles as a guessing player, a clue giver or, by their combination a Duet version player. For analysis of different languages a neural toolkit Stanza was used, which is language independent and enables automated processing of many languages. It was mainly about lemmatization and part of speech tagging for selection of clues in the game. For evaluation of word associations were several models tested, where the best results had a method Pointwise Mutual Information and predictive model fastText. The system supports playing Codenames in 36 languages comprising 8 different alphabets.
12	Rozpoznání pojmenovaných entit v textu Süss, Martin January 2019 (has links) This thesis deals with the named entity recognition (NER) in text. It is realized by machine learning techniques. Recently, techniques for creating word embeddings models have been introduced. These word vectors can encode many useful relationships between words in text data, such as their syntactic or semantic similarity. Modern NER systems use these vector features for improving their quality. However, only few of them investigate in greater detail how much these vectors have impact on recognition and whether they can be optimized for even greater recognition quality. This thesis examines various factors that may affect the quality of word embeddings, and thus the resulting quality of the NER system. A series of experiments have been performed, which examine these factors, such as corpus quality and size, vector dimensions, text preprocessing techniques, and various algorithms (Word2Vec, GloVe and FastText) and their parameters. Their results bring useful findings that can be used within creation of word vectors and thus indirectly increase the resulting quality of NER systems.
13	Low Supervision, Low Corpus size, Low Similarity! Challenges in cross-lingual alignment of word embeddings : An exploration of the limitations of cross-lingual word embedding alignment in truly low resource scenarios Dyer, Andrew January 2019 (has links) Cross-lingual word embeddings are an increasingly important reseource in cross-lingual methods for NLP, particularly for their role in transfer learning and unsupervised machine translation, purportedly opening up the opportunity for NLP applications for low-resource languages. However, most research in this area implicitly expects the availablility of vast monolingual corpora for training embeddings, a scenario which is not realistic for many of the world's languages. Moreover, much of the reporting of the performance of cross-lingual word embeddings is based on a fairly narrow set of mostly European language pairs. Our study examines the performance of cross-lingual alignment across a more diverse set of language pairs; controls for the effect of the corpus size on which the monolingual embedding spaces are trained; and studies the impact of spectral graph properties of the embedding spsace on alignment. Through our experiments on a more diverse set of language pairs, we find that performance in bilingual lexicon induction is generally poor in heterogeneous pairs, and that even using a gold or heuristically derived dictionary has little impact on the performance on these pairs of languages. We also find that the performance for these languages only increases slowly with corpus size. Finally, we find a moderate correlation between the isospectral difference of the source and target embeddings and the performance of bilingual lexicon induction. We infer that methods other than cross-lingual alignment may be more appropriate in the case of both low resource languages and heterogeneous language pairs. word embeddings cross-lingual multilingual low-resource corpus size Vecmap FastText alignment orthogonal eigenvalues Laplacian isospectral isomorphic bilingual lexicon induction
14	Clustering and Summarization of Chat Dialogues : To understand a company’s customer base / Klustring och Summering av Chatt-Dialoger Hidén, Oskar, Björelind, David January 2021 (has links) The Customer Success department at Visma handles about 200 000 customer chats each year, the chat dialogues are stored and contain both questions and answers. In order to get an idea of what customers ask about, the Customer Success department has to read a random sample of the chat dialogues manually. This thesis develops and investigates an analysis tool for the chat data, using the approach of clustering and summarization. The approach aims to decrease the time spent and increase the quality of the analysis. Models for clustering (K-means, DBSCAN and HDBSCAN) and extractive summarization (K-means, LSA and TextRank) are compared. Each algorithm is combined with three different text representations (TFIDF, S-BERT and FastText) to create models for evaluation. These models are evaluated against a test set, created for the purpose of this thesis. Silhouette Index and Adjusted Rand Index are used to evaluate the clustering models. ROUGE measure together with a qualitative evaluation are used to evaluate the extractive summarization models. In addition to this, the best clustering model is further evaluated to understand how different data sizes impact performance. TFIDF Unigram together with HDBSCAN or K-means obtained the best results for clustering, whereas FastText together with TextRank obtained the best results for extractive summarization. This thesis applies known models on a textual domain of customer chat dialogues, something that, to our knowledge, has previously not been done in literature. Machine Learning NLP Text Representations Clustering Extractive summarization TFIDF S-BERT FastText K-means DBSCAN HDBSCAN LSA TextRank Word Mover's Distance (WMD) Computer Engineering Datorteknik
15	Text feature mining using pre-trained word embeddings Sjökvist, Henrik January 2018 (has links) This thesis explores a machine learning task where the data contains not only numerical features but also free-text features. In order to employ a supervised classifier and make predictions, the free-text features must be converted into numerical features. In this thesis, an algorithm is developed to perform that conversion. The algorithm uses a pre-trained word embedding model which maps each word to a vector. The vectors for multiple word embeddings belonging to the same sentence are then combined to form a single sentence embedding. The sentence embeddings for the whole dataset are clustered to identify distinct groups of free-text strings. The cluster labels are output as the numerical features. The algorithm is applied on a specific case concerning operational risk control in banking. The data consists of modifications made to trades in financial instruments. Each such modification comes with a short text string which documents the modification, a trader comment. Converting these strings to numerical trader comment features is the objective of the case study. A classifier is trained and used as an evaluation tool for the trader comment features. The performance of the classifier is measured with and without the trader comment feature. Multiple models for generating the features are evaluated. All models lead to an improvement in classification rate over not using a trader comment feature. The best performance is achieved with a model where the sentence embeddings are generated using the SIF weighting scheme and then clustered using the DBSCAN algorithm. / Detta examensarbete behandlar ett maskininlärningsproblem där data innehåller fritext utöver numeriska attribut. För att kunna använda all data för övervakat lärande måste fritexten omvandlas till numeriska värden. En algoritm utvecklas i detta arbete för att utföra den omvandlingen. Algoritmen använder färdigtränade ordvektormodeller som omvandlar varje ord till en vektor. Vektorerna för flera ord i samma mening kan sedan kombineras till en meningsvektor. Meningsvektorerna i hela datamängden klustras sedan för att identifiera grupper av liknande textsträngar. Algoritmens utdata är varje datapunkts klustertillhörighet. Algoritmen appliceras på ett specifikt fall som berör operativ risk inom banksektorn. Data består av modifikationer av finansiella transaktioner. Varje sådan modifikation har en tillhörande textkommentar som beskriver modifikationen, en handlarkommentar. Att omvandla dessa kommentarer till numeriska värden är målet med fallstudien. En klassificeringsmodell tränas och används för att utvärdera de numeriska värdena från handlarkommentarerna. Klassificeringssäkerheten mäts med och utan de numeriska värdena. Olika modeller för att generera värdena från handlarkommentarerna utvärderas. Samtliga modeller leder till en förbättring i klassificering över att inte använda handlarkommentarerna. Den bästa klassificeringssäkerheten uppnås med en modell där meningsvektorerna genereras med hjälp av SIF-viktning och sedan klustras med hjälp av DBSCAN-algoritmen. Word embeddings Feature engineering Unsupervised learning Deep learning fast Text Operational risk Ordvektorer Attributgenerering Oövervakat lärande Djupinlärning fastText Operativ risk Computational Mathematics Beräkningsmatematik
16	Semantically Aligned Sentence-Level Embeddings for Agent Autonomy and Natural Language Understanding Fulda, Nancy Ellen 01 August 2019 (has links) Many applications of neural linguistic models rely on their use as pre-trained features for downstream tasks such as dialog modeling, machine translation, and question answering. This work presents an alternate paradigm: Rather than treating linguistic embeddings as input features, we treat them as common sense knowledge repositories that can be queried using simple mathematical operations within the embedding space, without the need for additional training. Because current state-of-the-art embedding models were not optimized for this purpose, this work presents a novel embedding model designed and trained specifically for the purpose of "reasoning in the linguistic domain".Our model jointly represents single words, multi-word phrases, and complex sentences in a unified embedding space. To facilitate common-sense reasoning beyond straightforward semantic associations, the embeddings produced by our model exhibit carefully curated properties including analogical coherence and polarity displacement. In other words, rather than training the model on a smorgaspord of tasks and hoping that the resulting embeddings will serve our purposes, we have instead crafted training tasks and placed constraints on the system that are explicitly designed to induce the properties we seek. The resulting embeddings perform competitively on the SemEval 2013 benchmark and outperform state-of- the-art models on two key semantic discernment tasks introduced in Chapter 8.The ultimate goal of this research is to empower agents to reason about low level behaviors in order to fulfill abstract natural language instructions in an autonomous fashion. An agent equipped with an embedding space of sucient caliber could potentially reason about new situations based on their similarity to past experience, facilitating knowledge transfer and one-shot learning. As our embedding model continues to improve, we hope to see these and other abilities become a reality. Nancy Fulda dissertation knowledge representation linguistic embeddings word2vec FastText universal sentence encoder BERT embeddings sentence embedding common-sense reasoning Computer Sciences Physical Sciences and Mathematics
17	Language identification for typologically similar low-resource languages: : A case study of Meänkieli, Kven and Finnish / Språkidentifering för typologiskt närbesläktade lågresursspråk: : En fallstudie för meänkieli, kvänska och finska Larsson, Jacob January 2024 (has links) This study examines different methods of language identification for the languages Meänkieli, Kven, and Finnish. The methods explored are two n-gram-based classifiers; Naive Bayes and TextCat and one word embedding-based classifier; fastText. These models were trained on approximately 100.000 sentences taken from the three languages and further divided into four separate datasets to examine how data availability impacts the final performance of the trained models. The study found that the best model for the examined dataset was the fastText classifier, but for languages with less available material a naive Bayes classifier might be more appropriate. / Denna studie utforskar olika metoder av språkidentifering för språken meänkieli, kvänska och finska. Två metoder baserade på n-gram undersöks; naive Bayes och TextCat samt en metod med ordinbäddningar; fastText. Dessa modeller tränades på sammanlagt 100 000 meningar taget från dessa tre språk och delades vidare in i fyra delmängder för att utvärdera hur stor inverkan storleken av träningsdata har på de tränade modellerna. Studien fann att den bästa implementationen utifrån den undersökta datamängden var fastText, medans språk med färre resurser skulle förmodligen gynnas bättre av en språkidentifering byggd med en naive Bayes klassifierare. Natural language processing minority languages naive Bayes classifiers fastText TextCat machine learning low-resource languages Meänkieli Tornedalian Finnish Finnish Kven Språkteknologi minoritetspråk naiv Bayes klassifierare fastText TextCat maskininlärning lågresursspråk meänkieli tornedalsfinska kvänska finska General Language Studies and Linguistics

Page generated in 0.0283 seconds