About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Specificity Prediction for Sentences in Press Releases

He, Tiantian January 2020 (has links)
Specificity is an important factor in text analysis. While much research on sentence specificity has been conducted on news text, very little is known about press releases, the journalistic documents that companies share with the press and other media outlets. Our study is devoted to specificity in press releases. In this research, we analyze press releases about digital transformation written by pump companies, and develop tools for the automatic measurement of sentence specificity. The goals of the research are to 1) explore the effects of data combination, 2) analyze features for specificity prediction, and 3) compare the effectiveness of classification and probability estimation. Through our experiments on various combinations of training data, we find that adding news data to the model effectively improves probability estimation, but the effect on classification is not noticeable. In terms of features, we find that sentence length plays an essential role in specificity prediction. We remove twelve insignificant features, and this modification results in a model that runs faster while achieving comparable scores. We also find that both classification and probability estimation have drawbacks. With regard to probability estimation, models can score well by only making predictions around the threshold. Binary classification depends on the threshold, and setting the threshold requires careful consideration. Moreover, classification scores cannot sift out models that make unreliable judgements about sentences of high and low specificity.
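The classification-versus-probability-estimation trade-off described above can be illustrated with a toy sketch (not the thesis code; all scores and the threshold are invented): a model that hedges near the threshold earns a passable error score even though its binary labels carry no more information than a confident model's.

```python
# A minimal sketch (not the thesis code) contrasting probability estimation
# with threshold-based classification for sentence specificity.
# Gold specificity scores lie in [0, 1]; 0.5 is an assumed decision threshold.

def classify(probs, threshold=0.5):
    """Turn specificity probabilities into binary labels."""
    return [1 if p >= threshold else 0 for p in probs]

def mean_absolute_error(gold, pred):
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

gold = [0.9, 0.8, 0.1, 0.2, 0.55, 0.45]

# A "hedging" model that always predicts near the threshold still gets a
# modest MAE, yet its probabilities say almost nothing about specificity.
hedging = [0.51, 0.51, 0.49, 0.49, 0.51, 0.49]
confident = [0.85, 0.75, 0.15, 0.25, 0.60, 0.40]

print(round(mean_absolute_error(gold, hedging), 3))    # hedging MAE
print(round(mean_absolute_error(gold, confident), 3))  # confident MAE
print(classify(hedging))    # identical binary labels...
print(classify(confident))  # ...despite very different probability quality
```

Both models yield the same binary labels here, which is exactly why classification scores alone cannot expose a model that hedges around the threshold.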
22

Clustering Short Texts: Categorizing Initial Utterances from Customer Service Dialogue Agents

Hang, Sijia January 2021 (has links)
Text classification relies on labeled data, which is not always available or which requires expensive manual labour. User-generated short texts are produced in abundance in customer service sectors through transcripts of phone calls or online chats. This kind of unstructured textual data can be noisy and thus poses challenges to unsupervised classification methods developed for standard documents such as news articles. This thesis project explores possible methods for the unsupervised classification of user-generated short texts in Swedish, using a real-world dataset of short texts collected from first utterances in a Conversational Interactive Voice Response solution. Such texts represent a spectrum of subdomains that customer service representatives may handle, but they have not been extensively explored in the literature. We experiment with three types of pretrained word embeddings as text representation methods, and two clustering algorithms, on two representative but different subsets of the data as well as on the full dataset. The experimental results show that static fastText embeddings are better suited than state-of-the-art contextual embeddings, such as those derived from BERT, for representing noisy short texts for clustering. In addition, we conduct a manual (re-)labeling of selected subsets of the data as an exploratory analysis, which shows that the provided labels are not reliable for meaningful evaluation. Furthermore, as the data often covers several overlapping concepts in a narrow domain, the existing pretrained embeddings are not effective at capturing the nuanced differences, and the clustering algorithms do not separate the data points in a way that fits the operational objectives according to the provided labels.
Nevertheless, our qualitative analysis shows that unsupervised clustering algorithms could contribute to the goal of minimizing manual effort in the data labeling process, to a certain degree, as a preprocessing step; more could be achieved in a semi-supervised "human-in-the-loop" manner.
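The embed-then-cluster setup described above can be sketched roughly as follows (not the thesis pipeline; the tiny two-dimensional "embedding table" and texts are invented, and a plain k-means stands in for the clustering algorithms used):

```python
import numpy as np

# A hedged sketch of clustering short texts via mean-pooled word vectors,
# in the spirit of a fastText + k-means setup. The embedding table below
# is invented for illustration; real fastText vectors are 300-dimensional.

EMB = {
    "invoice": [1.0, 0.1], "bill": [0.9, 0.2], "payment": [0.95, 0.15],
    "login": [0.1, 1.0], "password": [0.2, 0.9], "account": [0.15, 0.95],
}

def embed(text):
    """Represent a short text as the mean of its word vectors."""
    vecs = [EMB[w] for w in text.split() if w in EMB]
    return np.mean(vecs, axis=0)

def kmeans(X, k, iters=10):
    # farthest-point initialisation, then standard Lloyd iterations
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

texts = ["invoice payment", "bill invoice", "login password", "account login"]
X = np.stack([embed(t) for t in texts])
labels = kmeans(X, k=2)
print(labels)  # billing texts land in one cluster, login texts in the other
```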
23

Transfer Learning for Multilingual Offensive Language Detection with BERT

Casula, Camilla January 2020 (has links)
The popularity of social media platforms has led to an increase in user-generated content being posted on the Internet. Users, masked behind what they perceive as anonymity, can express offensive and hateful thoughts on these platforms, creating a need to detect and filter abusive content. Since the amount of data available on the Internet is impossible to analyze manually, automatic tools are the most effective choice for detecting offensive and abusive messages. Academic research on the detection of offensive language on social media has been on the rise in recent years, with more and more shared tasks being organized on the topic. State-of-the-art deep learning models such as BERT have achieved promising results on offensive language detection in English. However, multilingual offensive language detection systems, which address several languages at once, have remained underexplored until recently. In this thesis, we investigate whether transfer learning can improve the performance of classifiers for detecting offensive speech in Danish, Greek, Arabic, Turkish, German, and Italian. More specifically, we first experiment with using machine-translated data as input to a classifier, which allows us to evaluate whether machine-translated data can help classification. We then experiment with fine-tuning multiple pre-trained BERT models at once. This parallel fine-tuning process, named multi-channel BERT (Sohn and Lee, 2019), allows us to exploit cross-lingual information with the goal of understanding its impact on the detection of offensive language. Both the use of machine-translated data and the exploitation of cross-lingual information could help the task of detecting offensive language in cases where little or no annotated data is available, for example for low-resource languages. We find that using machine-translated data, either exclusively or mixed with gold data, to train a classifier on the task can often improve its performance.
Furthermore, we find that fine-tuning multiple BERT models in parallel can positively impact classification, although it can lead to robustness issues for some languages.
24

Improving an Information Retrieval System by Using Machine Learning to Improve User Relevance Feedback / Förbättring av ett informationssökningssystem genom att använda maskininlärning för att förbättra relevansåterkoppling från en användare

Nordin, Alexandra January 2016 (has links)
The aim of this thesis work is to improve the performance of an existing information retrieval system that uses relevance feedback to perform query expansion. Improving this system is a constant goal, because the retrieved documents form the basis for various data analysis tasks; it is therefore important that precision and recall are high. A user can choose to give relevance feedback when executing a query: the user marks documents in the search result as relevant or irrelevant and redoes the search based on this feedback. The original query is then expanded based on the user's feedback. The approach presented in this thesis uses the documents marked as relevant or irrelevant to train a classifier that labels unknown documents from the search result as relevant, irrelevant, or unknown. The aim is to classify unknown documents and add them to the set of feedback documents used for the query expansion. The underlying assumption is that the more feedback a user gives, the better the query expansion will perform. The system developed in this thesis is evaluated for the English language. The results show that integrating the classifier into the existing system improved performance in three out of four use cases. The existing system already performs well, but small improvements matter, so it would be beneficial to integrate the classifier into the existing system.
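The feedback-expansion idea can be sketched as follows (a hypothetical Rocchio-style stand-in, not the system's actual classifier; documents and the margin value are invented): documents too close to both centroids stay "unknown" rather than being added to the feedback set.

```python
import math
from collections import Counter

# A hedged sketch of the idea: use the user's relevant/irrelevant judgements
# to classify remaining documents, and only add confidently classified ones
# to the feedback set used for query expansion.

def vec(doc):
    return Counter(doc.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    c = Counter()
    for d in docs:
        c.update(vec(d))
    return c

def classify(doc, rel_docs, irr_docs, margin=0.1):
    """Return 'relevant', 'irrelevant', or 'unknown' (too close to call)."""
    s_rel = cosine(vec(doc), centroid(rel_docs))
    s_irr = cosine(vec(doc), centroid(irr_docs))
    if s_rel - s_irr > margin:
        return "relevant"
    if s_irr - s_rel > margin:
        return "irrelevant"
    return "unknown"

relevant = ["solar panel installation costs", "solar energy panel efficiency"]
irrelevant = ["football match results", "football league table"]

print(classify("cheap solar panel offers", relevant, irrelevant))
print(classify("football cup final", relevant, irrelevant))
print(classify("the weather today", relevant, irrelevant))
```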
25

Emotional Content in Novels for Literary Genre Prediction : And Impact of Feature Selection on Text Classification Models

Yako, Mary January 2021 (has links)
Automatic literary genre classification presents a challenging task for Natural Language Processing (NLP) systems, mainly because literary texts have deeper levels of meaning, hold distinctive themes, and communicate particular messages and emotions. We conduct a study in which we experiment with building literary genre classifiers based on the emotions in novels, to investigate the effects that emotion-related features have on genre prediction models. We begin with an analysis of emotions, describing emotional composition and density in the dataset, which consists of novels categorized into eight different genres. Genre prediction models are built using three algorithms: Random Forest, Support Vector Machine, and k-Nearest Neighbor. We build models based on emotion-word counts and emotional words in a novel, and compare them to models using commonly employed features, namely bag-of-words and TF-IDF. Moreover, we apply a feature selection dimensionality reduction procedure to the TF-IDF feature set and study its impact on classification performance. Finally, we train and test classifiers on a combination of the two best emotion-related feature sets, and compare them with classifiers trained and tested on a combination of bag-of-words and the reduced TF-IDF features. Our results confirm that using features of emotional content in novels improves classification performance, yielding 75% F1 compared to a bag-of-words baseline of 71% F1, and that the TF-IDF feature filtering method positively impacts genre classification performance on literary texts.
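The TF-IDF filtering step can be sketched as follows (not the thesis setup; the corpus, the scoring rule, and k are invented, and the thesis's actual selection criterion may differ): score every term, keep only the top-k, and represent documents in the reduced space.

```python
import math
from collections import Counter

# A hedged sketch of TF-IDF feature filtering for dimensionality reduction.
# The tiny corpus and k are invented for illustration.

docs = [
    "the knight drew his sword in fear",
    "the detective found the hidden clue",
    "love and fear filled the old castle",
    "the detective followed the suspect at night",
]

tokenized = [d.split() for d in docs]
df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
N = len(docs)

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

weights = [tfidf(d) for d in tokenized]

# rank features by their total TF-IDF mass across the corpus, keep top-k
total = Counter()
for w in weights:
    total.update(w)
k = 5
selected = [t for t, _ in total.most_common(k)]
print(selected)  # words appearing in every document (e.g. "the") score zero

reduced = [{t: w.get(t, 0.0) for t in selected} for w in weights]
print(len(reduced[0]))  # every document now lives in a k-dimensional space
```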
26

Unsupervised Lexical Semantic Change Detection with Context-Dependent Word Representations / Oövervakad inlärning av lexikalsemantisk förändring genom kontextberoende ordrepresentationer

You, Huiling January 2021 (has links)
In this work, we explore the usefulness of contextualized embeddings from language models for lexical semantic change (LSC) detection. With diachronic corpora spanning two time periods, we construct word embeddings for a selected set of target words, aiming to detect potential LSC of each target word across time. We explore different systems of embeddings covering three topics: contextualized vs. static word embeddings, token- vs. type-based embeddings, and multilingual vs. monolingual language models. We use a multilingual dataset covering three languages (English, German, Swedish) and explore each system of embeddings on two subtasks, a binary classification task and a ranking task. We compare the performance of the different systems of embeddings, and seek to answer our research questions through discussion and analysis of the experimental results. We show that contextualized word embeddings are on par with static word embeddings in the classification task. Our results also show that in most cases it is more beneficial to use contextualized embeddings from a multilingual model than from a language-specific model. We find that the token-based setting is strong for static embeddings, while the type-based setting works better for contextual embeddings, especially in the ranking task. We provide some explanation for the results we achieve, and propose improvements to our experiments for future work.
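The ranking subtask can be sketched as follows (invented toy vectors, not the thesis data): semantic change is approximated as the cosine distance between a target word's averaged embeddings in the two periods, and targets are ranked by that distance.

```python
import numpy as np

# A hedged sketch of the ranking subtask with invented two-dimensional
# token vectors; "plane" shifts between periods while "tree" stays stable.

def avg(vectors):
    return np.mean(vectors, axis=0)

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# token-level vectors per period for each target word
period1 = {"plane": [[1.0, 0.0], [0.9, 0.1]], "tree": [[0.0, 1.0], [0.1, 0.9]]}
period2 = {"plane": [[0.1, 1.0], [0.0, 0.9]], "tree": [[0.05, 1.0], [0.0, 0.95]]}

change = {w: cosine_distance(avg(period1[w]), avg(period2[w])) for w in period1}
ranking = sorted(change, key=change.get, reverse=True)
print(ranking)  # → ['plane', 'tree']: the word with the larger shift first
```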
27

Evaluation Of Methods For Automatically Deciding Article Type For Newspapers

Eriksson, Adam January 2021 (has links)
No description available.
28

A Lexical Comparison Using Word Embedding Mapping from an Academic Word Usage Perspective

Cai, Xuemei January 2020 (has links)
This thesis applies the word embedding mapping approach to make a lexical comparison from an academic word usage perspective. We aim to demonstrate the differences in academic word usage between a corpus of student writing and a corpus of academic English, as well as between the student writing corpus and a corpus of social media texts. The Vecmap mapping algorithm, commonly used to solve cross-language mapping problems, was used to map the academic English vector space and the social media text vector space into the common student writing vector space, to facilitate the comparison of word representations from different corpora and to visualize the comparison results. The average distance was defined as a measure of the word usage differences of 420 typical academic words between each pair of corpora, and principal component analysis was applied to visualize the differences. A rank-biased overlap approach was adopted to evaluate the results of the proposed approach. The experimental results show that academic word usage in the student writing corpus is more similar to the academic English corpus than to the social media text corpus.
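The core mapping step can be sketched with an orthogonal Procrustes solution, which underlies Vecmap-style supervised mapping (toy random matrices stand in for the real embedding spaces; the dimensions are invented):

```python
import numpy as np

# A simplified sketch of the mapping step: given matched word vectors from
# two spaces, learn an orthogonal map and measure the average post-mapping
# distance, as in the thesis's average-distance comparison.

def orthogonal_map(X, Y):
    """Find orthogonal W minimising ||XW - Y|| (Procrustes solution)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(42)
X = rng.normal(size=(420, 50))  # e.g. 420 academic-word vectors, space A

# construct space B as a rotated copy of A plus small noise
theta = np.linalg.qr(rng.normal(size=(50, 50)))[0]  # random rotation
Y = X @ theta + 0.01 * rng.normal(size=X.shape)

W = orthogonal_map(X, Y)
avg_dist = float(np.mean(np.linalg.norm(X @ W - Y, axis=1)))
print(round(avg_dist, 3))  # small residual: the map recovers the rotation
```

Because W is constrained to be orthogonal, the map preserves distances within each space, so the remaining average distance reflects genuine differences in word usage rather than distortions introduced by the mapping.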
29

Using Attention-based Sequence-to-Sequence Neural Networks for Transcription of Historical Cipher Documents

Renfei, Han January 2020 (has links)
Encrypted historical manuscripts (also called ciphers), containing encoded information, provide a useful resource for gaining new insight into our history. Transcribing these manuscripts from image format to a computer-readable format is a necessary step towards decrypting them. In this thesis project, we explore automatic approaches to Handwritten Text Recognition (HTR) for line-by-line cipher image transcription. We applied an attention-based Sequence-to-Sequence (Seq2Seq) model to the automatic transcription of ciphers with three different writing systems, and tested and developed algorithms for the recognition of cipher symbols and their locations. To evaluate our method on different levels, the model is trained and tested on ciphers with various symbol sets, from digits to graphical signs. To identify useful approaches for improving transcription performance, we conducted an ablation study regarding the attention mechanism and other deep learning tricks. The results show an accuracy lower than 50%, indicating considerable room for improvement and plenty of future work.
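The attention step at the heart of such a Seq2Seq model can be sketched in a few lines (shapes and values invented; the thesis model's exact attention variant may differ): each decoder query attends over the encoder states and receives a weighted context vector.

```python
import numpy as np

# A minimal numpy sketch of scaled dot-product attention, the mechanism
# the ablation study examines. All values are invented for illustration.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

enc_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # encoder outputs
query = np.array([[1.0, 0.0]])                               # one decoder step

context, w = attention(query, enc_states, enc_states)
print(w.round(2))        # attention distribution over the three positions
print(context.round(2))  # context vector fed to the decoder
```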
30

Named Entity Recognition for Social Media Text

Zhang, Yaxi January 2019 (has links)
This thesis aims to perform named entity recognition for English social media texts. Named Entity Recognition (NER) is applied in many NLP tasks as an important preprocessing procedure. Social media texts contain a lot of real-time data and therefore serve as a valuable source for information extraction. Nevertheless, NER for social media texts is a rather challenging task due to the noisy context. Traditional approaches to this task use hand-crafted features, but these prove both time-consuming and very task-specific, and fail to deliver satisfactory performance. The goal of this thesis is to tackle the task by automatically identifying and annotating named entities of multiple types with the help of neural network methods. We experiment with three different word embeddings and character embeddings in neural network architectures that combine long short-term memory (LSTM), bidirectional LSTM (BI-LSTM), and conditional random field (CRF) layers to obtain the best result. The data and evaluation tool come from the 2017 shared task on Noisy User-generated Text (W-NUT). We achieve a best F1 score of 42.44 using a BI-LSTM-CRF with character-level representations extracted by a BI-LSTM and pre-trained GloVe word embeddings. We also find that the results could be improved with larger training data sets.
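The CRF decoding step on top of the BI-LSTM emission scores can be sketched with a plain Viterbi search (all scores below are invented for illustration; the real model learns them from data):

```python
import numpy as np

# A hedged sketch of CRF decoding over per-token tag scores: Viterbi search
# for the highest-scoring tag sequence under learned transition scores.

def viterbi(emissions, transitions):
    """emissions: (T, K) per-token tag scores; transitions: (K, K)."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

tags = ["O", "B-PER", "I-PER"]
# invented emission scores for the tokens "went", "with", "Taylor", "Swift"
emissions = np.array([
    [2.0, 0.1, 0.0],
    [2.0, 0.1, 0.0],
    [0.2, 2.0, 0.5],
    [0.2, 0.5, 2.0],
])
# transition scores discourage I-PER without a preceding B-PER/I-PER
transitions = np.array([
    [0.5, 0.0, -3.0],
    [0.0, -1.0, 1.0],
    [0.0, -1.0, 1.0],
])

best = viterbi(emissions, transitions)
print([tags[i] for i in best])  # → ['O', 'O', 'B-PER', 'I-PER']
```

The transition matrix is what the CRF layer adds over a plain softmax: it lets the decoder enforce label-sequence constraints (such as I-PER only following B-PER/I-PER) that token-level classification cannot.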
