31

Distributional Representations of Words for Effective Information Retrieval : Information Retrieval in Customer Support Forums

Lachmann, Tim, Sabel, Johan January 2017 (has links)
As the abundance of information in society increases, so does the need for more sophisticated methods of information retrieval. Extracting relevant data from internal company systems becomes a more complex task as larger amounts of information must be handled and more communication is transferred to digital platforms. In recent years, methods for embedding words in vector space have made great strides; notably, in 2013 Google sent ripples across the field of natural language processing with a new model called Word2vec, significantly outperforming former practices. We implement a search engine utilizing word embeddings based on Word2vec and related models for the IT company Kundo and their product Kundo Forum. The results demonstrate the potential to improve information retrieval recall by a significant margin without diminishing precision. Coupled with the primary subject of information retrieval, we also analyze the market and product development implications of an improved search engine.
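A minimal sketch of the kind of embedding-based retrieval the abstract describes, assuming gensim's Word2Vec: documents and the query are each mapped to the centroid of their word vectors and documents are ranked by cosine similarity, so queries can match documents that share no exact words. The toy corpus and parameters are illustrative, not the authors' setup.

```python
import numpy as np
from gensim.models import Word2Vec

docs = [
    "how do I reset my password",
    "the invoice email never arrived",
    "password recovery link is broken",
]
tokenized = [d.lower().split() for d in docs]

# Train a small Word2Vec model; in practice one would use a large forum corpus.
model = Word2Vec(tokenized, vector_size=100, window=5, min_count=1, epochs=50)

def embed(tokens):
    """Average the vectors of in-vocabulary tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def search(query, k=2):
    """Rank all documents by cosine similarity to the embedded query."""
    q = embed(query.lower().split())
    scores = []
    for i, toks in enumerate(tokenized):
        d = embed(toks)
        denom = np.linalg.norm(q) * np.linalg.norm(d) or 1.0
        scores.append((float(q @ d / denom), i))
    return [docs[i] for _, i in sorted(scores, reverse=True)[:k]]

print(search("forgot my password"))  # matches documents with related wording
```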
32

Καρδία in the New Testament and Other Ancient Greek Literature : Using a Corpus Approach to Investigate the Semantics of καρδία Against the Backdrop of New Testament Lexicography

Möller, Gustaf January 2024 (has links)
The semantics of New Testament words is a complex subject, as these words often have backgrounds in both extrabiblical Greek literature and the Septuagint and, by extension, are subject to Hebraic influence. Καρδία, often translated "heart", is no exception. In some Greek literature, the organ is referred to literally, but in the New Testament, καρδία is used exclusively figuratively. Another layer of complexity is added when the nature of this figurative usage is considered, as it includes aspects of cognition, volition, morality, and more. In this thesis, I studied how καρδία is used in the New Testament in comparison to the Septuagint, investigating the existing notion of a "biblical usage" of the word. This usage was then compared to its usage in literature from periods ranging from 800 to 270 BCE, further exploring the existence of a distinct biblical usage from a diachronic perspective. For this study, I adopted an interdisciplinary approach inspired by computational and corpus linguistics, dedicating a substantial part of the thesis to evaluating the approach within the field of New Testament lexicography. The word's usage in the New Testament and the Septuagint was found to be similar, and I was able to propose some areas where this similarity is most evident. This biblical usage of καρδία was not found to share much similarity with its usage in extrabiblical literature, with a biblical "moral" and "theological" usage standing out as the main points of contrast. For the purposes of New Testament lexicography, the approach was found beneficial for the collection of evidence, although some issues will need further investigation.
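As an illustration of this kind of corpus comparison, one could train a separate embedding model per corpus and compare a target word's nearest neighbours across corpora: little overlap suggests divergent usage. The toy corpora below are invented placeholders, not the thesis's Greek texts, and the transliterated token stands in for καρδία.

```python
from gensim.models import Word2Vec

def neighbours(corpus, target, topn=10):
    """Train an embedding model on one corpus, return the target's neighbours."""
    model = Word2Vec(corpus, vector_size=50, min_count=1, epochs=100)
    return {w for w, _ in model.wv.most_similar(target, topn=topn)}

# Placeholder corpora: one "figurative" and one "literal" usage pattern.
corpus_a = [["kardia", "believes", "repents"], ["kardia", "rejoices"]] * 20
corpus_b = [["kardia", "beats", "chest"], ["kardia", "wound"]] * 20

shared = neighbours(corpus_a, "kardia") & neighbours(corpus_b, "kardia")
print(shared)  # small overlap suggests the corpora use the word differently
```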
33

Analyzing the Profiles of EMBA Programs by Text Mining Methodology : A Case of Nine National Universities

林庭竹, Lin, Ting Chu Unknown Date (has links)
In recent years, the market for Executive MBA (EMBA) programs in Taiwan has gradually become saturated, the proportion of business managers expected to enroll in an EMBA is leveling off, and the rise of EMBA programs across mainland China, Hong Kong, and Taiwan will further affect the development of Taiwanese EMBA programs. This study therefore examines both the supply side and the demand side of Taiwanese EMBA programs, profiling the characteristics revealed by each school's faculty and students, so that Taiwan's EMBA programs can develop into differentiated programs with distinctive school-specific features. In the first stage of the study, the EMBA programs established by nine top national universities in Taiwan were selected as research subjects. A web crawler written in Python collected the publication titles and abstracts of EMBA faculty and students at the nine schools, totalling 23,033 faculty documents and 7,342 student documents. After segmenting the texts with Jieba, 14 management disciplines were used as supply-side target terms and 12 government-defined occupational categories as demand-side target terms. A Word2Vec model was then used to compute the terms in the faculty and student texts most related to each target term, yielding a set of 20 related terms per target term. In the second stage, the cosine similarity between these related terms and the terms in the faculty and student texts was computed to identify the common supply- and demand-side characteristics of each school's faculty and students, representing the profile of that EMBA program. The first-stage results show that when identifying related terms through feature vectors, the Word2Vec model accurately identifies terms with the same meaning as, or closely associated with, the target terms, and the cosine similarities between the 20 related terms and their target terms were mostly above 0.7; the expanded term sets built with Word2Vec are therefore highly accurate. The second-stage rankings of supply- and demand-side cosine similarities show that each school's EMBA differs in how the texts of its faculty and students rank against the target terms. Each program can therefore use these differences as indicators of its distinctive character, developing a differentiated program and increasing the willingness of Taiwanese business managers to enroll in an EMBA.
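A hedged sketch of the first-stage pipeline as summarised above, assuming Jieba for segmentation and gensim's Word2Vec for term expansion. The documents, the target term, and the 0.7 threshold placement are placeholders standing in for the study's actual data and the 14 discipline terms.

```python
import jieba
from gensim.models import Word2Vec

raw_docs = ["行銷管理與品牌策略之研究", "財務管理與公司治理機制", "行銷通路與消費者行為分析"]
tokenized = [list(jieba.cut(d)) for d in raw_docs]  # Jieba word segmentation

model = Word2Vec(tokenized, vector_size=100, min_count=1, epochs=100)

target = "行銷"  # stand-in for one of the 14 management-discipline target terms
if target in model.wv:
    related = model.wv.most_similar(target, topn=20)     # (term, cosine) pairs
    expanded = [term for term, sim in related if sim > 0.7]  # keep strong links
    print(expanded)  # the expanded term set for this target term
```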
34

A Study on Stock Market Trend Prediction : Sentiment Analysis of Financial Commentary Texts

蔡宇祥, Tsai, Yu Shiang Unknown Date (has links)
Previous research indicates that posts on social networking sites influence public sentiment, which in turn affects stock market fluctuations. For investors, being able to quickly analyze large volumes of financial posts from social media to infer investor sentiment and predict market trends could therefore improve investment returns. Prior research on text sentiment analysis has confirmed that supervised learning methods can achieve good classification results through simple quantification, but the training data for supervised methods must have predefined known classes, which limits their ability to handle unknown classes. This study therefore uses deep learning to extract stock-market-related articles from a large dataset and applies a sentiment analysis method for financial texts that combines supervised and unsupervised learning: unsupervised learning is used to identify the topics of Weibo financial posts, compute sentiment indices, and label sentiment polarity, while supervised learning is used to build a classification model that predicts the trend of the Shanghai Composite Index. Finally, visualization tools are used for trend-line analysis to identify topics with leading-indicator properties. In the experiments, word2vec was used to extract relevant stock-market articles, effectively filtering the texts to be analyzed. For topic modeling, LDA was ultimately adopted for topic labeling: because the number of documents greatly exceeded the number of topic terms, the TF-IDF matrix was too sparse and K-means clustering performed poorly. For sentiment labeling, the expanded sentiment lexicon judged word polarity better than NTUSD, and the trend line of the computed sentiment index effectively predicted the trend of the Shanghai index. Moreover, not all topics' sentiment indices had leading-indicator properties; only the sentiment indices of the company-performance and Shanghai-index topics reacted ahead of the Shanghai index trend, so the classification model was built on the texts of these two topics. Comparing the accuracy of the sentiment-index classification model with a model using pure index indicators, the former was 7% more accurate, confirming that sentiment analysis can effectively improve the accuracy of Shanghai index trend prediction and help investors increase stock market returns.
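A rough sketch of the unsupervised stage described in the abstract, assuming scikit-learn's LDA for topic assignment and a simple lexicon-based sentiment index; the posts, lexicon entries, and topic count are illustrative placeholders, not the study's Weibo data or expanded lexicon.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Pre-segmented toy posts; real input would be Weibo financial posts.
posts = ["公司 營收 成長 樂觀", "大盤 下跌 恐慌", "公司 獲利 創新高"]
pos_words, neg_words = {"成長", "樂觀", "創新高"}, {"下跌", "恐慌"}

# Bag-of-words over whitespace-separated tokens, then LDA topic assignment.
X = CountVectorizer(token_pattern=r"[^ ]+").fit_transform(posts)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)

def sentiment_index(text):
    """Lexicon-based score in [-1, 1]: (positive - negative) / total hits."""
    toks = text.split()
    pos = sum(t in pos_words for t in toks)
    neg = sum(t in neg_words for t in toks)
    return (pos - neg) / max(pos + neg, 1)

for post, dist in zip(posts, topics):
    print(dist.argmax(), sentiment_index(post))  # topic id and sentiment score
```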
35

Ambiguous synonyms : Implementing an unsupervised WSD system for division of synonym clusters containing multiple senses

Wallin, Moa January 2019 (has links)
When clustering synonyms, complications arise when words have multiple senses, as the synonyms of each sense are erroneously clustered together. The task of automatically distinguishing word senses in cases of ambiguity, known as word sense disambiguation (WSD), has been an extensively researched problem over the years. This thesis studies the possibility of applying an unsupervised machine-learning-based WSD system for analysing existing synonym clusters (N = 149) and dividing them correctly when two or more senses are present. Based on sense embeddings induced from a large corpus, cosine similarities are calculated between the sense embeddings of words in the clusters, making it possible to suggest divisions in cases where different words are closer to different senses of a proposed ambiguous word. The system output is then evaluated by four participants, all experts in the area. The results show that, according to the participants, the system manages to correctly divide the clusters in no more than 31% of the cases. Moreover, some differences exist between the participants' ratings, although none of the participants predominantly agree with the system's divisions. Evidently, further research and improvements are needed and are suggested for the future.
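A simplified sketch of the division step: given one embedding per induced sense of the ambiguous word, each remaining cluster member is assigned to its closest sense by cosine similarity, and the cluster is split accordingly. The two-dimensional toy vectors are placeholders for sense embeddings induced from a large corpus.

```python
import numpy as np

# Two induced senses of an ambiguous word, plus the other cluster members.
sense_vecs = {"bank_1": np.array([1.0, 0.1]), "bank_2": np.array([0.1, 1.0])}
cluster = {"shore": np.array([0.2, 0.9]), "lender": np.array([0.9, 0.2])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assign each cluster member to its nearest sense.
division = {s: [] for s in sense_vecs}
for word, vec in cluster.items():
    best = max(sense_vecs, key=lambda s: cosine(vec, sense_vecs[s]))
    division[best].append(word)

print(division)  # {'bank_1': ['lender'], 'bank_2': ['shore']}
```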
36

Sentiment analysis of Swedish reviews and transfer learning using Convolutional Neural Networks

Sundström, Johan January 2018 (has links)
Sentiment analysis is a field within machine learning that focuses on determining the contextual polarity of subjective information. It is a technique that can be used to analyze the "voice of the customer" and has been applied successfully to opinionated information in English, such as customer reviews, political opinions, and social media data. A major problem with machine learning models is that they are domain dependent and therefore do not perform well in other domains. Transfer learning, or domain adaptation, is a research field that studies a model's ability to transfer knowledge across domains. In the extreme case, a model trains on data from one domain, the source domain, and tries to make accurate predictions on data from another domain, the target domain. The deep learning model known as the Convolutional Neural Network (CNN) has in recent years gained much attention due to its performance in computer vision, both for in-domain classification and transfer learning. It has also performed well on natural language processing problems but has not been investigated to the same extent for transfer learning within this area. The purpose of this thesis has been to investigate how well suited the CNN is for cross-domain sentiment analysis of Swedish reviews. The research was conducted by investigating how the model performs when trained with data from different domains with varying amounts of source and target data. Additionally, the impact of different text representations on the model's transferability was also studied. This study has shown that a CNN without pre-trained word embeddings is not well suited for transfer learning, since it performs worse than a traditional logistic regression model. Substituting 20% of the source training data with target data can, in many of the test cases, boost performance by 7-8% for both the logistic regression and the CNN model. Using pre-trained word embeddings produced by a word2vec model increases the CNN's transferability as well as its in-domain performance, and it outperforms the logistic regression model and the CNN without pre-trained embeddings in the majority of test cases.
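A minimal sketch of a CNN text classifier of the kind evaluated in the thesis, assuming Keras. The embedding layer can be initialised from pre-trained word2vec weights (here a random stand-in) and frozen, which is the configuration the abstract reports as most transferable; vocabulary size, sequence length, and filter counts are placeholders.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, seq_len = 5000, 100, 200
pretrained = np.random.rand(vocab_size, embed_dim)  # stand-in for word2vec weights

model = tf.keras.Sequential([
    layers.InputLayer(input_shape=(seq_len,)),
    layers.Embedding(vocab_size, embed_dim,
                     weights=[pretrained], trainable=False),  # frozen embeddings
    layers.Conv1D(128, 5, activation="relu"),   # n-gram-like feature detectors
    layers.GlobalMaxPooling1D(),                # strongest feature per filter
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # binary sentiment polarity
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```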
37

Recognising Moral Foundations in Online Extremist Discourse : A Cross-Domain Classification Study

van Luenen, Anne Fleur January 2020 (has links)
So far, studies seeking to recognise moral foundations in texts have been relatively successful (Araque et al., 2019; Lin et al., 2018; Mooijman et al., 2017; Rezapour et al., 2019). There are, however, two issues with these studies: firstly, gathering and annotating sufficient material for training is an extensive process; secondly, models are only trained and tested within the same domain. It is as yet unexplored how these models for moral foundation prediction perform when tested in other domains, but from their experience with annotation, Hoover et al. (2017) describe how moral sentiments on one topic (e.g. Black Lives Matter) might be completely different from moral sentiments on another (e.g. presidential elections). This study explores to what extent such models generalise to other domains. More specifically, we focus on training on Twitter data from non-extremist sources and testing on data from an extremist (white nationalist) forum. We conducted two experiments. In the first, we test whether cross-domain classification of moral foundations is possible. Additionally, we compare the performance of a model using the Word2Vec embeddings used in previous studies to a model using the newer BERT embeddings. We find that although performance drops significantly on the extremist out-domain test sets, out-domain classification is not impossible. Furthermore, we find that the BERT model generalises marginally better to the out-domain test set than the Word2Vec model. In the second experiment, we attempt to improve generalisation to the extremist test data by providing contextual knowledge. Although this does not improve the model, it does show the model's robustness against noise. Finally, we suggest an alternative approach for accounting for contextual knowledge.
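A schematic sketch of the cross-domain setup, assuming scikit-learn: a classifier is trained on embedded in-domain texts and scored on both an in-domain and an out-domain test set. The embed() function is a placeholder that could wrap averaged Word2Vec vectors or BERT sentence vectors, the two representations the study compares.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def evaluate(embed, train, test_in, test_out):
    """Train on in-domain (text, label) pairs, score on both test splits."""
    X, y = [embed(t) for t, _ in train], [l for _, l in train]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    scores = {}
    for name, split in [("in-domain", test_in), ("out-domain", test_out)]:
        Xs, ys = [embed(t) for t, _ in split], [l for _, l in split]
        scores[name] = f1_score(ys, clf.predict(Xs), average="macro")
    return scores  # expect a drop on the out-domain set, as the study found
```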
38

Mining Parallel Corpora from the Web

Kúdela, Jakub January 2016 (has links)
Title: Mining Parallel Corpora from the Web Author: Bc. Jakub Kúdela Author's e-mail address: jakub.kudela@gmail.com Department: Department of Software Engineering Thesis supervisor: Doc. RNDr. Irena Holubová, Ph.D. Supervisor's e-mail address: holubova@ksi.mff.cuni.cz Thesis consultant: RNDr. Ondřej Bojar, Ph.D. Consultant's e-mail address: bojar@ufal.mff.cuni.cz Abstract: Statistical machine translation (SMT) is one of the most popular approaches to machine translation today. It uses statistical models whose parameters are derived from the analysis of a parallel corpus required for the training. The existence of a parallel corpus is the most important prerequisite for building an effective SMT system. Various properties of the corpus, such as its volume and quality, highly affect the results of the translation. The web can be considered an ever-growing source of considerable amounts of parallel data to be mined and included in the training process, thus increasing the effectiveness of SMT systems. The first part of this thesis summarizes some of the popular methods for acquiring parallel corpora from the web. Most of these methods search for pairs of parallel web pages by looking for the similarity of their structures. However, we believe there still exists a non-negligible amount of parallel...
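A toy sketch of the structure-similarity idea the abstract mentions: represent each web page by its sequence of HTML tags and treat pages with near-identical tag sequences as candidate parallel pages. The HTML snippets are invented placeholders, and real systems add many more signals (URLs, lengths, link structure).

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TagSequence(HTMLParser):
    """Collect the sequence of opening tags in a page."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_seq(html):
    parser = TagSequence()
    parser.feed(html)
    return parser.tags

page_en = "<html><body><h1>Hello</h1><p>Welcome</p></body></html>"
page_cs = "<html><body><h1>Ahoj</h1><p>Vítejte</p></body></html>"

score = SequenceMatcher(None, tag_seq(page_en), tag_seq(page_cs)).ratio()
print(score)  # 1.0: identical structure, a strong parallel-page candidate
```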
39

Automated Image Suggestions for News Articles : An Evaluation of Text and Image Representations in an Image Retrieval System / Automatiska bildförslag till nyhetsartiklar

Svensson, Pontus January 2020 (has links)
Multimodal machine learning is a subfield of machine learning that aims to relate data from different modalities, such as texts and images. One of the many applications that could be built upon this technique is an image retrieval system that, given a text query, retrieves suitable images from a database. In this thesis, a retrieval system based on canonical correlation is used to suggest images for news articles. Different dense text representations produced by Word2vec and Doc2vec, and image representations produced by pre-trained convolutional neural networks, are explored to find out how they affect the suggestions. Which part of an article is best suited as a query to the system is also studied, and experiments are carried out to determine whether an article's date of publication can be used to improve the suggestions. The results show that Word2vec outperforms Doc2vec in the task, which indicates that the meaning of article texts is not as important as the individual words they consist of. Furthermore, the queries are improved by rewarding words that are particularly significant.
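A condensed sketch of canonical-correlation retrieval, assuming scikit-learn's CCA: text and image features are projected into a shared space where images can be ranked by cosine similarity to a projected text query. The random feature matrices stand in for real Word2vec/Doc2vec text vectors and CNN image features.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
text_feats = rng.standard_normal((200, 300))   # e.g. averaged Word2vec vectors
image_feats = rng.standard_normal((200, 512))  # e.g. CNN penultimate features

# Learn projections that maximise correlation between paired text and images.
cca = CCA(n_components=50).fit(text_feats, image_feats)
q_text, db_images = cca.transform(text_feats[:1], image_feats)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

ranking = np.argsort([-cosine(q_text[0], img) for img in db_images])
print(ranking[:5])  # indices of the five best-matching images
```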
40

Object Classification using Language Models

From, Gustav January 2022 (has links)
In today's modern digital world, more and more emails and messages must be sent, processed, and handled. Categorizing and classifying these texts can take an incredibly long time and costs companies time and money. If the classification could be done automatically by a computer based on the content of the text or message, it would be a major gain for Easit AB and its customers. To facilitate text classification, Easit needs a solution made up of one language model and one classifier model: the language model converts raw text to a vector representative of the text, and the classifier determines which predefined labels fit the vector. The end goal is not to create the best possible solution, but to build a general understanding of different language and classifier models and of how to design a system that is both fast and accurate. BERT was the primary language model during evaluation, but doc2Vec and one-hot encoding were also tested. The classifiers consisted of boundary-condition models or dense neural networks, all trained without knowledge of which language model the text vectors came from. The validation accuracy on the IMDB comment dataset with BERT ranged from 75% to 94%, depending mostly on the language model rather than on the classifier. Neural networks proved best suited as classifiers, mainly because of their scalability to multiple labels. The knowledge from this work resulted in a recommendation to Easit for an alternative-based system solution.
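A minimal sketch of the two-part design the abstract describes, assuming the Hugging Face transformers library and a standard public BERT checkpoint: the language model turns raw text into a vector, and a separate, swappable classifier maps the vector to labels. The logistic regression classifier and the example texts are illustrative stand-ins, not Easit's actual models or data.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def text_to_vector(text):
    """Encode text with BERT and use the [CLS] embedding as its vector."""
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[0, 0].numpy()

# The classifier never sees which language model produced the vectors.
texts = ["great movie", "terrible plot", "loved it", "waste of time"]
labels = [1, 0, 1, 0]
X = [text_to_vector(t) for t in texts]
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict([text_to_vector("really enjoyable")]))
```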
