Global ETD Search

41	Rychlá adaptace počítačové podpory hry Krycí jména pro nové jazyky / Fast Adaptation of Codenames Computer Assistant for New Languages Jareš, Petr January 2021 (has links) This thesis extends an artificial player for the word-association game Codenames so that support for new languages can be added easily. The system can play Codenames as a guessing player, as a clue giver, or, combining the two roles, as a player of the Duet version. Languages were analysed with Stanza, a neural toolkit that is language independent and enables automated processing of many languages; it was used mainly for lemmatization and part-of-speech tagging when selecting clues. Several models were tested for evaluating word associations, and the best results were obtained with Pointwise Mutual Information and the predictive model fastText. The system supports playing Codenames in 36 languages covering 8 different alphabets.
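As a rough illustration of the Pointwise Mutual Information score mentioned in the abstract above, the sketch below computes PMI from raw co-occurrence counts. The function name, its parameters and the choice of base-2 logarithm are assumptions made for this example; the thesis may estimate the probabilities differently.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) from raw counts.

    count_xy: number of contexts in which both words occur (hypothetical input)
    count_x, count_y: number of contexts containing each word
    total: total number of contexts
    """
    if count_xy == 0:
        # The pair never co-occurs, so the score is negative infinity.
        return float("-inf")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    # PMI compares the joint probability with what independence would predict.
    return math.log2(p_xy / (p_x * p_y))
```

A score of zero means the two words co-occur exactly as often as independence would predict; positive scores indicate association, which is what a clue giver would look for.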
42	Low Supervision, Low Corpus size, Low Similarity! Challenges in cross-lingual alignment of word embeddings : An exploration of the limitations of cross-lingual word embedding alignment in truly low resource scenarios Dyer, Andrew January 2019 (has links) Cross-lingual word embeddings are an increasingly important resource in cross-lingual methods for NLP, particularly for their role in transfer learning and unsupervised machine translation, purportedly opening up the opportunity for NLP applications for low-resource languages. However, most research in this area implicitly expects the availability of vast monolingual corpora for training embeddings, a scenario which is not realistic for many of the world's languages. Moreover, much of the reporting of the performance of cross-lingual word embeddings is based on a fairly narrow set of mostly European language pairs. Our study examines the performance of cross-lingual alignment across a more diverse set of language pairs; controls for the effect of the corpus size on which the monolingual embedding spaces are trained; and studies the impact of spectral graph properties of the embedding space on alignment. Through our experiments on a more diverse set of language pairs, we find that performance in bilingual lexicon induction is generally poor in heterogeneous pairs, and that even using a gold or heuristically derived dictionary has little impact on the performance on these pairs of languages. We also find that the performance for these languages only increases slowly with corpus size. Finally, we find a moderate correlation between the isospectral difference of the source and target embeddings and the performance of bilingual lexicon induction. We infer that methods other than cross-lingual alignment may be more appropriate in the case of both low resource languages and heterogeneous language pairs. 
word embeddings cross-lingual multilingual low-resource corpus size Vecmap FastText alignment orthogonal eigenvalues Laplacian isospectral isomorphic bilingual lexicon induction
43	Descriptive Music Search With Domain-Specific Word Embeddings / Deskriptiv musiksökning med domänspecifika ordinbäddningar Liu, Alva January 2019 (has links) Descriptive search is a type of exploratory search that allows users to search for content by providing descriptors. Instead of having a specific target in mind, the user looks for a recommendation of items that matches the given descriptors. However, in the music domain, descriptive words do not necessarily have the same semantic meaning as they have in a generic text corpus. In this study, we investigate if we can train a shallow neural model on playlist data for descriptive music search, and if the model can capture music-specific word semantics. We carry out three experiments to evaluate our model. The first and the second experiments evaluate if the model can predict tracks that are relevant to given search queries, and the third experiment evaluates whether the model successfully captures domain-specific word semantics. From our experiments, we conclude that our model trained on playlist data indeed can capture music-specific word semantics and generate reasonable track predictions. For future work, we suggest exploring ways to re-rank the top results retrieved by the model and to diversify and/or personalize the ordering of the results. / Deskriptiv sökning är en typ av utforskande informationshämtning där användare söker efter material med hjälp av beskrivande sökord. Istället för att ange namnet på ett objekt i söksträngen så kan användaren med ord beskriva objekt som efterfrågas. I ett musiksammanhang har dock många beskrivande ord inte samma betydelse som de har i ett generellt sammanhang. Vi undersöker därför i vår studie om vi kan träna ett grunt neuralt nätverk med spellistsdata för deskriptiv musiksökning, och om modellen kan lära sig musik-specifika betydelser av ord. Vi utför totalt tre olika experiment för att utvärdera modellen. 
De första två experimenten undersöker om modellen kan föreslå relevanta låtar givet beskrivande söksträngar och det sista experimentet undersöker om modellen fångar domän-specifika betydelser av sökorden. Resultaten från våra experiment tyder på att modellen lyckas fånga musik-specifika språkmönster och kan föreslå rimliga låtar för deskriptiva söksträngar. För att göra modellen mer användningsbar föreslår vi att undersöka möjligheterna att omranka toppresultaten från modellen, och diversifiera samt personalisera ordningen av resultaten efter individuella användare. descriptive search word embeddings domain knowledge extrinsic evaluation fastText deskriptiv sökning ordvektorer domänkunskap indirekt utvärdering fastText Computer and Information Sciences Data- och informationsvetenskap
44	Evaluation of Sentence Representations in Semantic Text Similarity Tasks / Utvärdering av meningsrepresentation för semantisk textlikhet Balzar Ekenbäck, Nils January 2021 (has links) This thesis explores methods for constructing sentence representations for semantic text similarity from word embeddings and benchmarks them against sentence-based evaluation test sets. Two methods were used to evaluate the representations: STS Benchmark and STS Benchmark converted to a binary similarity task. Results showed that preprocessing of the word vectors could significantly boost performance in both tasks, and the study concludes that word embeddings still provide an acceptable solution for specific applications. The study also concluded that the dataset used might not be ideal for this type of evaluation, as the sentence pairs in general had a high lexical overlap. To tackle this, the study suggests that a paraphrasing dataset could act as a complement but that further investigation would be needed. / Denna avhandling undersöker metoder för att representera meningar i vektorform för semantisk textlikhet och jämför dem med meningsbaserade testmängder. För att utvärdera representationerna användes två metoder: STS Benchmark, en vedertagen metod för att utvärdera språkmodellers förmåga att utvärdera semantisk likhet, och STS Benchmark konverterad till en binär likhetsuppgift. Resultaten visade att förbehandling av texten och ordvektorerna kunde ge en signifikant ökning i resultatet för dessa uppgifter. Studien konkluderade även att datamängden som användes kanske inte är ideal för denna typ av utvärdering, då meningsparen i stort hade ett högt lexikalt överlapp. Som komplement föreslår studien en parafrasdatamängd, något som skulle kräva ytterligare studier. 
sentence representations semantic text similarity sentence similarity word embeddings sentence embeddings meningsrepresentation semantisk textlikhet meningslikhet ordinbäddningar meningsinbäddningar Computer and Information Sciences Data- och informationsvetenskap
45	Designing a Question Answering System in the Domain of Swedish Technical Consulting Using Deep Learning / Design av ett frågebesvarande system inom svensk konsultverksamhet med användning av djupinlärning Abrahamsson, Felix January 2018 (has links) Question Answering systems are greatly sought after in many areas of industry. Unfortunately, as most research in Natural Language Processing is conducted in English, the applicability of such systems to other languages is limited. Moreover, these systems often struggle in dealing with long text sequences. This thesis explores the possibility of applying existing models to the Swedish language, in a domain where the syntax and semantics differ greatly from typical Swedish texts. Additionally, the text length may vary arbitrarily. To solve these problems, transfer learning techniques and state-of-the-art Question Answering models are investigated. Furthermore, a novel, divide-and-conquer based technique for processing long texts is developed. Results show that the transfer learning is partly unsuccessful, but the system is capable of performing reasonably well in the new domain regardless. Furthermore, the system shows great performance improvement on longer text sequences with the use of the new technique. / System som givet en text besvarar frågor är högt eftertraktade inom många arbetsområden. Eftersom majoriteten av all forskning inom naturligtspråkbehandling behandlar engelsk text är de flesta system inte direkt applicerbara på andra språk. Utöver detta har systemen ofta svårt att hantera långa textsekvenser. Denna rapport utforskar möjligheten att applicera existerande modeller på det svenska språket, i en domän där syntaxen och semantiken i språket skiljer sig starkt från typiska svenska texter. Dessutom kan längden på texterna variera godtyckligt. För att lösa dessa problem undersöks flera tekniker inom transferinlärning och frågebesvarande modeller i forskningsfronten. 
En ny metod för att behandla långa texter utvecklas, baserad på en dekompositionsalgoritm. Resultaten visar på att transfer learning delvis misslyckas givet domänen och modellerna, men att systemet ändå presterar relativt väl i den nya domänen. Utöver detta visas att systemet presterar väl på långa texter med hjälp av den nya metoden. Question Answering Deep Learning Machine Learning Transfer Learning Natural Language Processing Technical Consulting Word Embeddings Divide and Conquer Computer Sciences Datavetenskap (datalogi)
46	Word embeddings for monolingual and cross-language domain-specific information retrieval / Ordinbäddningar för enspråkig och tvärspråklig domänspecifik informationssökning Wigder, Chaya January 2018 (has links) Various studies have shown the usefulness of word embedding models for a wide variety of natural language processing tasks. This thesis examines how word embeddings can be incorporated into domain-specific search engines for both monolingual and cross-language search. This is done by testing various embedding model hyperparameters, as well as methods for weighting the relative importance of words to a document or query. In addition, methods for generating domain-specific bilingual embeddings are examined and tested. The system was compared to a baseline that used cosine similarity without word embeddings, and for both the monolingual and bilingual search engines the use of monolingual embedding models improved performance above the baseline. However, bilingual embeddings, especially for domain-specific terms, tended to be of too poor quality to be used directly in the search engines. / Flera studier har visat att ordinbäddningsmodeller är användningsbara för många olika språkteknologiuppgifter. Denna avhandling undersöker hur ordinbäddningsmodeller kan användas i sökmotorer för både enspråkig och tvärspråklig domänspecifik sökning. Experiment gjordes för att optimera hyperparametrarna till ordinbäddningsmodellerna och för att hitta det bästa sättet att vikta ord efter hur viktiga de är i dokumentet eller sökfrågan. Dessutom undersöktes metoder för att skapa domänspecifika tvåspråkiga inbäddningar. Systemet jämfördes med en baslinje utan inbäddningar baserad på cosinuslikhet, och för både enspråkiga och tvärspråkliga sökningar var systemet som använde enspråkiga inbäddningar bättre än baslinjen. Däremot var de tvåspråkiga inbäddningarna, särskilt för domänspecifika ord, av låg kvalitet och gav för dåliga resultat för direkt användning inom sökmotorer. 
information retrieval domain-specific information retrieval cross-language information retrieval word embeddings bilingual embeddings informationssökning domänspecifik informationssökning tvärspråklig informationssökning ordinbäddningar tvåspråkiga inbäddningar Computer Sciences Datavetenskap (datalogi)
47	Text feature mining using pre-trained word embeddings Sjökvist, Henrik January 2018 (has links) This thesis explores a machine learning task where the data contains not only numerical features but also free-text features. In order to employ a supervised classifier and make predictions, the free-text features must be converted into numerical features. In this thesis, an algorithm is developed to perform that conversion. The algorithm uses a pre-trained word embedding model which maps each word to a vector. The vectors for multiple word embeddings belonging to the same sentence are then combined to form a single sentence embedding. The sentence embeddings for the whole dataset are clustered to identify distinct groups of free-text strings. The cluster labels are output as the numerical features. The algorithm is applied on a specific case concerning operational risk control in banking. The data consists of modifications made to trades in financial instruments. Each such modification comes with a short text string which documents the modification, a trader comment. Converting these strings to numerical trader comment features is the objective of the case study. A classifier is trained and used as an evaluation tool for the trader comment features. The performance of the classifier is measured with and without the trader comment feature. Multiple models for generating the features are evaluated. All models lead to an improvement in classification rate over not using a trader comment feature. The best performance is achieved with a model where the sentence embeddings are generated using the SIF weighting scheme and then clustered using the DBSCAN algorithm. / Detta examensarbete behandlar ett maskininlärningsproblem där data innehåller fritext utöver numeriska attribut. För att kunna använda all data för övervakat lärande måste fritexten omvandlas till numeriska värden. En algoritm utvecklas i detta arbete för att utföra den omvandlingen. 
Algoritmen använder färdigtränade ordvektormodeller som omvandlar varje ord till en vektor. Vektorerna för flera ord i samma mening kan sedan kombineras till en meningsvektor. Meningsvektorerna i hela datamängden klustras sedan för att identifiera grupper av liknande textsträngar. Algoritmens utdata är varje datapunkts klustertillhörighet. Algoritmen appliceras på ett specifikt fall som berör operativ risk inom banksektorn. Data består av modifikationer av finansiella transaktioner. Varje sådan modifikation har en tillhörande textkommentar som beskriver modifikationen, en handlarkommentar. Att omvandla dessa kommentarer till numeriska värden är målet med fallstudien. En klassificeringsmodell tränas och används för att utvärdera de numeriska värdena från handlarkommentarerna. Klassificeringssäkerheten mäts med och utan de numeriska värdena. Olika modeller för att generera värdena från handlarkommentarerna utvärderas. Samtliga modeller leder till en förbättring i klassificering över att inte använda handlarkommentarerna. Den bästa klassificeringssäkerheten uppnås med en modell där meningsvektorerna genereras med hjälp av SIF-viktning och sedan klustras med hjälp av DBSCAN-algoritmen. Word embeddings Feature engineering Unsupervised learning Deep learning fastText Operational risk Ordvektorer Attributgenerering Oövervakat lärande Djupinlärning fastText Operativ risk Computational Mathematics Beräkningsmatematik
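The step in the abstract above where word vectors are combined into a sentence embedding with SIF weighting can be sketched roughly as follows. This is a minimal illustration that assumes the usual SIF weight a / (a + p(w)); the function name, the toy inputs and the default value of `a` are hypothetical. It leaves out the common-component removal that usually follows SIF averaging, as well as the DBSCAN clustering step, so it is not the thesis's implementation.

```python
def sif_sentence_embedding(tokens, vectors, word_prob, a=1e-3):
    """Weighted average of word vectors with SIF-style weights a / (a + p(w)).

    tokens: list of words in the sentence
    vectors: dict mapping word -> list of floats (a pre-trained embedding, assumed)
    word_prob: dict mapping word -> estimated unigram probability (assumed)
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip out-of-vocabulary words
        # Frequent words get a small weight, rare words a weight close to 1.
        weight = a / (a + word_prob.get(tok, 0.0))
        for i, x in enumerate(vectors[tok]):
            total[i] += weight * x
        count += 1
    if count == 0:
        return total  # no known words: return the zero vector
    return [x / count for x in total]
```

The resulting sentence vectors would then be passed to a clustering algorithm, and the cluster label of each sentence would serve as the numerical feature described in the abstract.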
48	Lexicalisation souple en réalisation de texte Gazeau, Avril 08 1900 (has links) GenDR est un réalisateur de texte symbolique qui prend en entrée un graphe, une représentation sémantique, et génère les graphes sous forme d’arbres de dépendances syntaxiques lui correspondant. L’une des tâches de GenDR lui permettant d’effectuer cette transduction est la lexicalisation profonde. Il s’agit de choisir les bonnes unités lexicales exprimant les sémantèmes de la représentation sémantique d’entrée. Pour ce faire, GenDR a besoin d’un dictionnaire sémantique établissant la correspondance entre les sémantèmes et les unités lexicales correspondantes dans une langue donnée. L’objectif de cette étude est d’élaborer un module de lexicalisation souple construisant automatiquement un dictionnaire sémantique du français riche pour GenDR, son dictionnaire actuel étant très pauvre. Plus le dictionnaire de GenDR est riche, plus sa capacité à paraphraser s’élargit, ce qui lui permet de produire la base de textes variés et naturels correspondant à un même sens. Pour y parvenir, nous avons testé deux méthodes. La première méthode consistait à réorganiser les données du Réseau Lexical du Français sous la forme d’un dictionnaire sémantique, en faisant de chacun de ses noeuds une entrée du dictionnaire et des noeuds y étant reliés par un type de lien lexical que nous appelons fonctions lexicales paradigmatiques sémantiquement vides ses lexicalisations. La deuxième méthode consistait à tester la capacité d’un modèle de langue neuronal contextuel à générer des lexicalisations supplémentaires potentielles correspondant aux plus proches voisins du vecteur calculé pour chaque entrée du dictionnaire afin de l’enrichir. Le dictionnaire construit à partir du Réseau lexical du français est compatible avec GenDR et sa couverture a été considérablement élargie. 
L’utilité des lexicalisations supplémentaires générées par le modèle neuronal s’est avérée limitée, ce qui nous amène à conclure que le modèle testé n’est pas tout à fait apte à accomplir le genre de tâche que nous lui avons demandée. / GenDR is an automatic text realiser. Its input is a graph, a semantic representation, and its output is the corresponding graphs in the form of syntactic dependency trees. One of the tasks GenDR performs to carry out this transduction is called deep lexicalization, i.e. choosing the right lexical units to express the semantemes of the input semantic representation. To do so, GenDR needs access to a semantic dictionary that maps the semantemes to the corresponding lexical units in a given language. This study aims to develop a flexible lexicalization module that automatically builds a rich French semantic dictionary for GenDR, its current one being very poor. The more data the semantic dictionary contains, the more paraphrases GenDR is able to produce, which enables it to generate the basis for natural and diverse texts associated with the same meaning. To achieve this, we tested two different methods. The first one involved reorganizing the French Lexical Network in the shape of a semantic dictionary, by using each of the network’s nodes as a dictionary entry and the nodes linked to it by a special lexical relationship we call semantically empty paradigmatic lexical functions as its lexicalizations. The second method involved testing a contextual neural language model’s ability to generate potential additional lexicalizations by calculating the vector of each of the dictionary entries and generating its closest neighbours in order to expand the semantic dictionary’s coverage. The dictionary we built from the data contained in the French Lexical Network is compatible with GenDR and its coverage has been significantly broadened. 
The usefulness of the additional lexicalizations produced by the language model turned out to be limited, which brings us to the conclusion that the tested model is not fully able to perform the kind of task we asked of it. réalisation automatique de texte interface sémantique-syntaxe lexicalisation plongements lexicaux automatic text realization syntax-semantics interface lexicalization word embeddings Linguistics / Linguistique (UMI : 0290)
49	Classification of Transcribed Voice Recordings : Determining the Claim Type of Recordings Submitted by Swedish Insurance Clients / Klassificering av Transkriberade Röstinspelningar Piehl, Carl January 2021 (has links) In this thesis, we investigate the problem of building a text classifier for transcribed voice recordings submitted by insurance clients. We compare different models in the context of two tasks. The first is a binary classification problem, where the models are tasked with determining if a transcript belongs to a particular type or not. The second is a multiclass problem, where the models have to choose between several types when labelling transcripts, resulting in a data set with a highly imbalanced class distribution. We evaluate four different models: pretrained BERT and three LSTMs with different word embeddings. The word embeddings used are ELMo, word2vec and a baseline with a randomly initialized embedding layer. In the binary task, we are more concerned with false positives than false negatives. Thus, we also use weighted cross entropy loss to achieve high precision for the positive class, while sacrificing recall. In the multiclass task, we use focal loss and weighted cross entropy loss to reduce bias toward majority classes. We find that BERT outperforms the other models and that the baseline model performs worst across both tasks. The difference in performance is greatest in the multiclass task on classes with fewer samples. This demonstrates the benefit of using large language models in data-constrained scenarios. In the binary task, we find that weighted cross entropy loss provides a simple, yet effective, framework for conditioning the model to favor certain types of errors. In the multiclass task, both focal loss and weighted cross entropy loss are shown to reduce bias toward majority classes. 
However, we also find that BERT fine tuned with regular cross entropy loss does not show bias toward majority classes, having high recall across all classes. / I examensarbetet undersöks klassificering av transkriberade röstinspelningar från försäkringskunder. Flera modeller jämförs på två uppgifter. Den första är binär klassificering, där modellerna ska särskilja på inspelningar som tillhör en specifik klass av ärende från resterande inspelningar. I det andra inkluderas flera olika klasser som modellerna ska välja mellan när inspelningar klassificeras, vilket leder till en ojämn klassfördelning. Fyra modeller jämförs: förtränad BERT och tre LSTM-nätverk med olika varianter av förtränade inbäddningar. De inbäddningar som används är ELMo, word2vec och en basmodell som har inbäddningar som inte förtränats. I det binära klassificeringsproblemet ligger fokus på att minimera antalet falskt positiva klassificeringar, därför används viktad korsentropi. Utöver detta används även fokal förlustfunktion när flera klasser inkluderas, för att minska partiskhet mot majoritetsklasser. Resultaten indikerar att BERT är en starkare modell än de andra modellerna i båda uppgifterna. Skillnaden mellan modellerna är tydligast när flera klasser används, speciellt på de klasser som är underrepresenterade. Detta visar på fördelen av att använda stora, förtränade, modeller när mängden data är begränsad. I det binära klassificeringsproblemet ser vi även att en viktad förlustfunktion ger ett enkelt men effektivt sätt att reglera vilken typ av fel modellen ska vara partisk mot. När flera klasser inkluderas ser vi att viktad korsentropi, samt fokal förlustfunktion, kan bidra till att minska partiskhet mot överrepresenterade klasser. Detta var dock inte fallet för BERT, som visade bra resultat på minoritetsklasser även utan att modifiera förlustfunktionen. 
Text Classification Word embeddings BERT LSTM Cost-sensitive learning Focal loss Textklassificering Ordinbäddningar BERT LSTM Kostnadskänslig inlärning Fokal förlustfunktion Computer and Information Sciences Data- och informationsvetenskap
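The two loss functions named in the abstract above, weighted cross entropy and focal loss, can be illustrated for a single example as follows. This is a minimal pure-Python sketch under the standard definitions; the function names, the default `gamma` and the example weights are assumptions for illustration and do not reproduce the thesis's training setup.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, target, class_weights):
    """Cross entropy for one example, scaled by the weight of its true class."""
    p = softmax(logits)[target]
    return -class_weights[target] * math.log(p)


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: down-weights examples the model already gets right."""
    p = softmax(logits)[target]
    return -((1.0 - p) ** gamma) * math.log(p)
```

With `gamma=0` the focal loss reduces to plain cross entropy. Raising the weight of one class in `class_weights` makes errors on that class more costly, which is one way to trade recall against precision as the abstract describes.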
50	Optimering av en chattbot för det svenska språket / Optimization of a Chatbot for the Swedish Language Mutaliev, Mohammed, Almimar, Ibrahim January 2021 (has links) Chattbotutvecklare på Softronic använder i dagsläget Rasa-ramverket och dess standardkomponenter för bearbetning av användarinmatning. Det här är problematiskt då standardkomponenterna inte är optimerade för det svenska språket. Till följd av detta efterfrågades en utvärdering av samtliga Rasa-komponenter med syfte att identifiera de mest gynnsamma komponenterna för att maximera klassificeringsträffsäkerhet. I detta examensarbete framtogs och jämfördes flera Rasa-pipelines med olika komponenter för tokenisering, känneteckensextrahering och klassificering. Resultaten av komponenterna för tokenisering visade att Rasas WhitespaceTokenizer överträffade både SpacyTokenizer och StanzaTokenizer. För känneteckensextrahering var CountVectorsFeaturizer, LanguageModelFeaturizer (med LaBSE-modellen) och FastTextFeaturizer (med den officiella fastText-modellen tränad på svenska Wikipedia) de mest optimala komponenterna. Den klassificerare som i allmänhet presterade bäst var DIETClassifier, men det fanns flera tillfällen där SklearnIntentClassifier överträffade den. Detta arbete resulterade i flera pipelines som överträffade Rasas standard-pipeline. Av dessa pipelines var det två som presterade bäst. Den första pipeline implementerade komponenterna WhitespaceTokenizer, CountVectorsFeaturizer, FastTextFeaturizer (med den officiella fastText-modellen tränad på svenska Wikipedia) och DIETClassifier med en klassificeringsträffsäkerhet på 91% (F1-score). Den andra pipeline implementerade komponenterna WhitespaceTokenizer, LanguageModelFeaturizer (med LaBSE-modellen) och SklearnIntentClassifier med en klassificeringsträffsäkerhet på 91,5% (F1-score). / Chatbot developers at Softronic currently use the Rasa framework and its default components for processing user input. 
This is problematic as the default components are not optimized for the Swedish language. Consequently, an evaluation of all Rasa components was requested, with the purpose of identifying the most favorable components to maximize classification accuracy. In this thesis, several Rasa pipelines were developed and compared, using different components for tokenization, feature extraction and classification. The results for the tokenization components showed that Rasa's WhitespaceTokenizer surpassed both SpacyTokenizer and StanzaTokenizer. For feature extraction, CountVectorsFeaturizer, LanguageModelFeaturizer (with the LaBSE model) and FastTextFeaturizer (with the official fastText model trained on Swedish Wikipedia) were the best-performing components. The classifier that generally performed best was DIETClassifier, but there were several occasions where SklearnIntentClassifier surpassed it. This work resulted in several pipelines that exceeded Rasa’s standard pipeline. Of these pipelines, two performed best. The first pipeline implemented the components WhitespaceTokenizer, CountVectorsFeaturizer, FastTextFeaturizer (with the official fastText model trained on Swedish Wikipedia) and DIETClassifier with a classification accuracy of 91% (F1 score). The other pipeline implemented the components WhitespaceTokenizer, LanguageModelFeaturizer (with the LaBSE model) and SklearnIntentClassifier with a classification accuracy of 91.5% (F1 score). Chatbots machine learning natural language processing tokenization feature extraction classification word embeddings transformers Chattbottar maskininlärning naturlig språkbearbetning tokenisering känneteckensextrahering klassificering ordinbäddningar transformatorer Computer Systems Datorsystem
