
Detecting Rhetorical Figures Based on Repetition of Words: Chiasmus, Epanaphora, Epiphora

Dubremetz, Marie January 2017 (has links)
This thesis deals with the detection of three rhetorical figures based on repetition of words: chiasmus (“Fair is foul, and foul is fair.”), epanaphora (“Poor old European Commission! Poor old European Council.”) and epiphora (“This house is mine. This car is mine. You are mine.”). For a computer, locating all repetitions of words is trivial, but locating just those repetitions that achieve a rhetorical effect is not. How can we make this distinction automatically? First, we propose a new definition of the problem. We observe that rhetorical figures are a graded phenomenon, with universally accepted prototypical cases, equally clear non-cases, and a broad range of borderline cases in between. This makes it natural to view the problem as a ranking task rather than a binary detection task. We therefore design a model for ranking candidate repetitions in terms of decreasing likelihood of having a rhetorical effect, which allows potential users to decide for themselves where to draw the line with respect to borderline cases. Second, we address the problem of collecting annotated data to train the ranking model. Thanks to a selective method of annotation, we can reduce the annotation work by three orders of magnitude for chiasmus, and by one order of magnitude for epanaphora and epiphora. In this way, we show that it is feasible to develop a system for detecting the three figures without an insurmountable amount of human work. Finally, we propose an evaluation scheme and apply it to our models. The evaluation reveals that, even with a very incompletely annotated corpus, a system for repetitive figure detection can be trained to achieve reasonable accuracy. We investigate the impact of different linguistic features, including length, n-grams, part-of-speech tags, and syntactic roles, and find that different features are useful for different figures.
We also apply the system to four different types of text: political discourse, fiction, titles of articles and novels, and quotations. Here the evaluation shows that the system is robust to shifts in genre and that the frequencies of the three rhetorical figures vary with genre.
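The candidate-enumeration step described above (finding every inverted word repetition, then ranking) can be sketched as follows. This is an illustrative sketch only: the span-length cue used for ranking is a stand-in for the richer feature set (n-grams, POS tags, syntactic roles) the thesis actually uses.

```python
import re

def chiasmus_candidates(text, max_span=30):
    """Enumerate inverted word repetitions (A ... B ... B ... A);
    every candidate still needs ranking to separate rhetoric from chance."""
    tokens = re.findall(r"\w+", text.lower())
    n = len(tokens)
    candidates = []
    for i in range(n):
        for j in range(i + 1, min(i + max_span, n)):
            if tokens[j] == tokens[i]:
                continue
            for k in range(j + 1, min(i + max_span, n)):
                if tokens[k] != tokens[j]:
                    continue
                for l in range(k + 1, min(i + max_span, n)):
                    if tokens[l] == tokens[i]:
                        candidates.append((i, j, k, l))
    return candidates

def rank_by_span(candidates):
    """Toy ranking cue: shorter total span first (one of many possible
    features; not the thesis's learned ranking model)."""
    return sorted(candidates, key=lambda c: c[3] - c[0])
```

On “Fair is foul, and foul is fair.” the enumerator returns several inversions (fair/foul, fair/is, is/foul), which illustrates why ranking, rather than binary detection, is the natural formulation.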

Finding Synonyms in Medical Texts : Creating a system for automatic synonym extraction from medical texts

Cederblad, Gustav January 2018 (has links)
This thesis describes the work of creating an automatic system for identifying synonyms and semantically related words in medical texts. Prior to this work, as part of the E-care@home project, medical texts were classified as either lay or specialized by both a lay annotator and an expert annotator. The lay annotator, in this case, is a person without any medical knowledge, whereas the expert annotator has professional knowledge in medicine. Using these texts made it possible to create co-occurrence matrices from which related words could be identified. Fifteen medical terms were chosen as system input. The Dice similarity of these words within a context window of ten words around them was calculated. As output, five candidate related terms for each medical term were returned. Only unigrams were considered. The candidate related terms were evaluated using a questionnaire, in which 223 healthcare professionals rated the similarity on a scale from one to five. A Fleiss kappa test showed that the agreement among these raters was 0.28, which indicates fair agreement. The evaluation further showed a significant correlation between the human ratings and the relatedness score (Dice similarity): words with higher Dice similarity tended to receive higher human ratings. However, the Dice similarity interval in which the words got the highest average human rating was 0.35-0.39. This result means that there is much room for improving the system. Further development should remove the unigram limitation and expand the corpus to provide more accurate and reliable results.
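The window-based Dice pipeline can be sketched as below. The corpus, the window size, and the exact counting scheme are simplified assumptions; the thesis works from annotated lay/specialized medical texts.

```python
from collections import Counter

def dice_related(tokens, targets, window=10, top_k=5):
    """Rank candidate related terms for each target word by Dice
    similarity computed from co-occurrence counts in a sliding window."""
    freq = Counter(tokens)
    cooc = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            pair = frozenset((w, tokens[j]))
            if len(pair) == 2:  # skip self-pairs
                cooc[pair] += 1
    related = {}
    for t in targets:
        scores = {}
        for pair, c in cooc.items():
            if t in pair:
                (other,) = pair - {t}
                # Dice: twice the joint count over the summed frequencies
                scores[other] = 2 * c / (freq[t] + freq[other])
        related[t] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return related
```

With a toy corpus such as “heart attack causes chest pain heart attack needs treatment”, the top-ranked candidate for “heart” is “attack”, since the pair co-occurs most relative to the words’ frequencies.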

Better cooperation through communication in multi-agent reinforcement learning

Kiseliou, Ivan January 2020 (has links)
Cooperative needs play a critical role in the organisation of natural communication systems. A number of recent studies in multi-agent reinforcement learning have established that artificial agents can similarly develop functional communication when required to complete a cooperative task. This thesis studies the emergence of communication in reinforcement learning agents, using a custom card game environment as a test-bed. Two contrasting approaches, encompassing continuous and discrete modes of communication, were appraised experimentally. Based on the average game completion rate, the agents provisioned with a continuous communication channel consistently exceeded the no-communication baseline. A qualitative analysis of the agents’ behavioural strategies reveals a clearly defined communication protocol as well as the deployment of playing tactics unseen in the baseline agents. The agents equipped with the discrete channel, on the other hand, failed to learn to utilise it effectively, ultimately showing no improvement over the baseline.
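The continuous-channel setup can be illustrated with a minimal sketch: each agent emits a bounded real-valued message alongside its action, and the partner receives that message as part of its input on the next step. The dimensions, the tanh bound, and the argmax policy are illustrative assumptions, not the thesis architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class CommAgent:
    """Toy agent that outputs an action plus a continuous message vector;
    the message is delivered to its partner on the next step."""
    def __init__(self, obs_dim, msg_dim, n_actions):
        self.W_act = 0.1 * rng.normal(size=(obs_dim + msg_dim, n_actions))
        self.W_msg = 0.1 * rng.normal(size=(obs_dim, msg_dim))

    def step(self, obs, incoming_msg):
        # Policy input is the agent's own observation plus the partner's message
        x = np.concatenate([obs, incoming_msg])
        action = int(np.argmax(x @ self.W_act))
        outgoing = np.tanh(obs @ self.W_msg)  # bounded continuous message
        return action, outgoing

# Two agents exchange messages over a short rollout.
a, b = CommAgent(4, 2, 3), CommAgent(4, 2, 3)
msg_a = msg_b = np.zeros(2)
for _ in range(3):
    obs_a, obs_b = rng.normal(size=4), rng.normal(size=4)
    act_a, msg_a = a.step(obs_a, msg_b)
    act_b, msg_b = b.step(obs_b, msg_a)
```

The key design point is that a continuous channel lets gradients (in a differentiable implementation) flow through the message, whereas a discrete channel requires extra machinery to learn from, which is consistent with the discrete agents' failure reported above.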

Studies of Cipher Keys from the 16th Century : Transcription, Systematisation and Analysis

Tudor, Crina January 2019 (has links)
In historical cryptography, a cipher key represents the set of rules by which one converts between plaintext and ciphertext within an encryption system. At present, few studies focus on analysing keys, especially at large scale or in a systematic manner. In this paper, we describe a uniform transcription standard for the keys in the DECODE database. In this way, we intend to lay a strong foundation for further studies on large sets of key transcriptions. We believe that a homogeneous set of transcriptions is an ideal starting point for comparative studies, especially from a chronological perspective, as it can reveal potential patterns in the evolution of encryption methods. We also build a script that performs an in-depth analysis of the components of a key, using our standardized transcription files as input. Finally, we give a detailed account of our findings and show that our method can reliably extract valuable information from the transcription file, such as the method of encryption or the types of symbols used for encoding, without the need for additional manual analysis of the original key.
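The component-analysis step can be sketched as follows. The line format (`plaintext -> code|code`) is a hypothetical stand-in for the DECODE transcription standard, which this listing does not specify; the symbol taxonomy and the homophony heuristic are likewise assumptions.

```python
import re
from collections import Counter

def analyse_key(lines):
    """Tally code-symbol types and infer a coarse encryption method from a
    (hypothetical) key transcription where each line reads 'plain -> code|code'."""
    symbol_types = Counter()
    homophones = Counter()
    for line in lines:
        m = re.match(r"(\S+)\s*->\s*(.+)", line)
        if not m:
            continue
        plain, codes = m.groups()
        for code in codes.split("|"):
            code = code.strip()
            homophones[plain] += 1
            if code.isdigit():
                symbol_types["digit"] += 1
            elif code.isalpha():
                symbol_types["letter"] += 1
            else:
                symbol_types["graphic"] += 1
    # Several codes for one plaintext unit suggests a homophonic cipher
    method = ("homophonic" if any(v > 1 for v in homophones.values())
              else "simple substitution")
    return symbol_types, method
```

Given `["a -> 12|47", "b -> 5", "et -> &"]`, the script would report three digit codes, one graphic symbol, and a homophonic method, mirroring the kind of information the thesis extracts without manual analysis.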

Neural Network Based Automatic Essay Scoring for Swedish / Neurala nätverk för automatisk bedömning av uppsatser i nationella prov i svenska

Ruan, Rex Dajun January 2020 (has links)
This master thesis presents a method for automatic essay scoring of Swedish national tests written by upper secondary school students, deploying neural network architectures and linguistic feature extraction within the framework of Swegram. Four sorts of linguistic aspects are involved in the feature extraction: count-based, lexical, morphological, and syntactic. One of three recurrent network variants (vanilla RNN, GRU, and LSTM), together with a specific model parameter setting, is implemented in the Automatic Essay Scoring (AES) model, with extracted features measuring linguistic complexity as the text representation. The AES model is evaluated through inter-rater agreement with the human-assigned grade as target label, in terms of quadratic weighted kappa (QWK) and exact percent agreement. Our best observed averaged QWK and averaged exact percent agreement over 10 folds, among all experimented models, are 0.50 and 52% respectively.
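The count-based feature group can be sketched as below. This particular feature inventory is an assumption for illustration; Swegram computes a richer set, and the other three groups (lexical, morphological, syntactic) require tagged and parsed input.

```python
import re

def count_based_features(essay):
    """A sketch of count-based linguistic complexity features."""
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    tokens = re.findall(r"\w+", essay.lower())
    return {
        "n_tokens": len(tokens),
        "n_sentences": len(sentences),
        "avg_sentence_len": len(tokens) / max(len(sentences), 1),
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),
        "avg_word_len": sum(map(len, tokens)) / max(len(tokens), 1),
    }
```

Each essay is thereby reduced to a fixed-length numeric vector, which is the text representation the recurrent models consume instead of raw word sequences.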

Multilingual Dependency Parsing of Uralic Languages : Parsing with zero-shot transfer and cross-lingual models using geographically proximate, genealogically related, and syntactically similar transfer languages

Erenmalm, Elsa January 2020 (has links)
One way to improve dependency parsing scores for low-resource languages is to make use of existing resources from other closely related or otherwise similar languages. In this paper, we look at eleven Uralic target languages (Estonian, Finnish, Hungarian, Karelian, Livvi, Komi Zyrian, Komi Permyak, Moksha, Erzya, North Sámi, and Skolt Sámi) with treebanks of varying sizes and select transfer languages based on geographical, genealogical, and syntactic distances. We focus primarily on the performance of parser models trained on various combinations of geographically proximate and genealogically related transfer languages, in target-trained, zero-shot, and cross-lingual configurations. We find that models trained on combinations of geographically proximate and genealogically related transfer languages reach the highest LAS in most zero-shot models, while our highest-performing cross-lingual models were trained on genealogically related languages. We also find that cross-lingual models outperform zero-shot transfer models. We then select syntactically similar transfer languages for three target languages, and find a slight improvement in the case of Hungarian. We discuss the results and conclude with suggestions for possible future work.
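The transfer-language selection step can be sketched as a weighted combination of distance measures. The distance values and weights below are invented purely for illustration; the thesis derives its geographic, genealogical, and syntactic distances from actual resources.

```python
def rank_transfer_languages(distances, weights):
    """Rank candidate transfer languages by a weighted sum of
    per-language distance measures (smaller combined distance first)."""
    def combined(lang):
        return sum(weights[m] * distances[lang][m] for m in weights)
    return sorted(distances, key=combined)

# Hypothetical candidates for a North Sámi target (all values invented):
candidates = {
    "Finnish":   {"geo": 0.2, "gen": 0.3},
    "Estonian":  {"geo": 0.5, "gen": 0.4},
    "Norwegian": {"geo": 0.1, "gen": 0.9},
}
order = rank_transfer_languages(candidates, {"geo": 0.5, "gen": 0.5})
```

Varying the weights lets one trade geographic proximity against genealogical relatedness, which is essentially the comparison carried out across the zero-shot and cross-lingual configurations above.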

Extracting Text into Meta-Data : Improving machine text-understanding of news-media articles / Extrahera Meta-Data från texter : Förbättra förståelsen för nyheter med hjälp av maskininlärning

Lindén, Johannes January 2021 (has links)
Society is constantly in need of information. It is important to consume event-based information about what is happening around us, as well as facts and knowledge. As society grows, the amount of information to consume grows with it. This thesis demonstrates one way to extract and represent knowledge from text in a machine-readable way for news media articles. Three objectives are considered when developing a machine learning system to retrieve categories, entities, relations and other meta-data from text paragraphs. The first is to sort the terminology by topic; this makes it easier for machine learning algorithms to understand the text and the unique words used. The second objective is to construct a service for use in production, where scalability and performance are evaluated. Features are implemented to iteratively improve the model predictions, and several versions are run at the same time to, for example, compare them in an A/B test. The third objective is to further extract the gist of what is expressed in the text. The gist is extracted in the form of triples, by connecting two related entities using a combination of natural language processing algorithms. The research presents a comparison between five different auto-categorization algorithms, and an evaluation of their hyperparameters and of how they would perform under the pressure of thousands of big, concurrent predictions. The aim is to build an auto-categorization system that can be used in the news media industry to help writers and journalists focus more on the story rather than on filling in meta-data for each article. The best-performing algorithm is a bidirectional Long Short-Term Memory (BiLSTM) neural network. Three different information extraction algorithms for extracting the gist of paragraphs are also compared.
The proposed information extraction algorithm supports extracting information from texts in multiple languages, with competitive accuracy compared with the state-of-the-art OpenIE and MinIE algorithms, which extract information in a single language. The use of multilingual models helps local news media write articles in different languages, which in turn helps integrate immigrants into society. (At the time of the public defence the following papers were unpublished: paper 4 submitted.)
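The triple-extraction idea, connecting two related entities around a relation, can be illustrated with a deliberately naive sketch. The verb list and the split-on-verb heuristic are toy assumptions standing in for the combination of NLP algorithms (and for OpenIE/MinIE-style systems) discussed above.

```python
VERBS = {"is", "are", "was", "has", "have", "founded", "acquired", "owns"}

def naive_triples(sentence):
    """Naive pattern-based (subject, relation, object) extraction:
    split the sentence at the first occurrence of a known verb."""
    tokens = sentence.strip(".").split()
    triples = []
    for i, tok in enumerate(tokens):
        if tok.lower() in VERBS and 0 < i < len(tokens) - 1:
            subj = " ".join(tokens[:i])
            obj = " ".join(tokens[i + 1:])
            triples.append((subj, tok, obj))
    return triples
```

Real systems replace the verb list with dependency parses and handle nested clauses, negation, and multiple languages, but the output shape, a relation linking two entities, is the same.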

Smoothening of Software documentation : comparing a self-made sequence to sequence model to a pre-trained model GPT-2 / Utjämning av mjukvarudokumentation

Tao, Joakim, Thimrén, David January 2021 (has links)
This thesis was done in collaboration with Ericsson AB, with the goal of researching the possibility of creating a machine learning model that can transfer the style of a text into another arbitrary style depending on the data used. The purpose was to make their technical documentation appear to have been written in one cohesive style, for a better reading experience. Two approaches were tested: the first was to implement an encoder-decoder model from scratch, and the second was to fine-tune the pre-trained GPT-2 model created by a team at OpenAI on the specific task. Both models were trained on data provided by Ericsson, consisting of sentences extracted from their documentation. To evaluate the models, training loss, test sentences, and BLEU scores were used, and these were compared with each other and with other state-of-the-art models. The models did not succeed in transforming text into a general technical documentation style, but a good understanding of what would need to be improved and adjusted was obtained. (This thesis was presented online via Microsoft Teams on June 22, 2021.)
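The BLEU evaluation mentioned above can be sketched with a minimal sentence-level implementation. This simplified version stops at bigrams and uses a basic brevity penalty; it is a stand-in for the full metric, not the thesis's evaluation code.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=2):
    """Minimal BLEU: geometric mean of modified n-gram precisions
    (up to max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    bp = (1.0 if len(candidate) > len(reference)
          else math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate and reference score 1.0; any divergence in word choice or order lowers the precisions and hence the score, which is why BLEU can complement training loss when judging style transfer.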

Automated Essay Scoring for English Using Different Neural Network Models for Text Classification

Deng, Xindi January 2021 (has links)
Writing skills are an essential evaluation criterion for a student’s creativity, knowledge, and intellect. Consequently, academic writing is a common part of university and college admissions applications, standardized tests, and classroom assessments. However, essay scoring is quite a daunting task for teachers, and Automated Essay Scoring may be a helpful tool in the teacher’s decision-making. There have been many successful models with supervised or unsupervised machine learning algorithms in the field of Automated Essay Scoring. This thesis makes a comparative study of various neural network models with supervised machine learning algorithms and different linguistic feature combinations. It also shows that the same linguistic features are applicable to more than one language. The models studied in this experiment include TextCNN, TextRNN_LSTM, TextRNN_GRU, and TextRCNN, trained on the essays from the Automated Student Assessment Prize (ASAP) from Kaggle competitions. Each essay is represented with linguistic features measuring linguistic complexity. Those features are divided into four groups: count-based, morphological, syntactic, and lexical features, and the four groups can form a total of 14 combinations. The models are evaluated via three measurements: accuracy, F1 score, and Quadratic Weighted Kappa. The experimental results show that models trained only with count-based features outperform the models trained using other feature combinations. In addition, TextRNN_LSTM performs best, with an accuracy of 54.79%, an F1 score of 0.55, and a Quadratic Weighted Kappa of 0.59, which beats the statistically-based baseline models.
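Quadratic Weighted Kappa, the agreement measure used in both essay-scoring theses in this listing, can be implemented directly from its definition: one minus the ratio of quadratically weighted observed disagreement to expected disagreement.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_grades):
    """QWK between two integer gradings with labels in [0, n_grades)."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    observed = np.zeros((n_grades, n_grades))
    for i, j in zip(a, b):
        observed[i, j] += 1
    # Quadratic penalty grows with the squared grade difference
    weights = np.array([[(i - j) ** 2 for j in range(n_grades)]
                        for i in range(n_grades)]) / (n_grades - 1) ** 2
    # Expected matrix under independence of the two raters
    expected = np.outer(np.bincount(a, minlength=n_grades),
                        np.bincount(b, minlength=n_grades)) / len(a)
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Perfect agreement yields 1.0; unlike plain accuracy, QWK penalizes a two-grade miss more than a one-grade miss, which suits ordinal essay grades.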

Targeted Topic Modeling for Levantine Arabic

Zahra, Shorouq January 2020 (has links)
Topic models for focused analysis aim to capture topics within the limiting scope of a targeted aspect (which can be thought of as an inner topic within a certain domain). To serve their analytic purposes, topics are expected to be semantically coherent and closely aligned with human intuition. This in itself poses a major challenge for the more common topic modeling algorithms, which, in a broader sense, perform a full analysis covering all aspects and themes within a collection of texts. The paper attempts to construct a viable focused-analysis topic model which learns topics from Twitter data written in a closely related group of non-standardized varieties of Arabic widely spoken in the Levant region (i.e. Levantine Arabic). Results are compared to a baseline model as well as to another targeted topic model designed precisely to serve the purpose of focused analysis. The model is capable of adequately capturing topics containing terms which fall within the scope of the targeted aspect when judged overall. Nevertheless, it fails to produce human-friendly and semantically coherent topics: several topics contained a number of intruding terms, while others contained terms which, though still relevant to the targeted aspect, appeared to be thrown together at random.
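The targeting idea, restricting topic discovery to documents that mention the aspect, can be illustrated with a crude sketch. This filter-and-contrast heuristic is a stand-in for an actual targeted topic model and is not the method evaluated in the thesis.

```python
from collections import Counter

def targeted_terms(docs, aspect_terms, top_k=5):
    """Crude targeted analysis: keep only documents mentioning the
    aspect, then surface terms that are frequent in those documents
    relative to the whole collection."""
    aspect_terms = set(aspect_terms)
    relevant = [d for d in docs if aspect_terms & set(d)]
    fg = Counter(w for d in relevant for w in d if w not in aspect_terms)
    bg = Counter(w for d in docs for w in d)
    # Score by the share of a term's occurrences that fall in relevant docs
    scored = {w: c / bg[w] for w, c in fg.items() if c > 1}
    return sorted(scored, key=lambda w: (scored[w], fg[w]), reverse=True)[:top_k]
```

A real targeted model would additionally cluster the surviving terms into multiple coherent topics; the sketch only shows why aspect filtering narrows the vocabulary the model has to organize.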
