111

Textual Features as Quality : A Study of Automatically Measuring Quality in Technical Documentation / Textuella särdrag som kvalitet : En studie om att automatiskt mäta kvalitet i teknisk dokumentation

Hantosi Albertsson, Sarah January 2015
This thesis investigates which textual features are perceived as violations of quality in the internal technical documentation at Saab, and how features selected by expert judgment can be evaluated automatically. Using data generated through a participatory design process, the thesis proposes a new automatic method for assessing the quality of technical documentation. Technical writers and editors took part in answering the thesis's first question, and the result is a collection of textual features that can be quantified. From this collection, four textual features were selected and then implemented programmatically to evaluate the text's readability, its uniqueness, and its syntactic structure, using dependency parsing and dependency length as a quality score. The quality score generated by the system is considered validated. The thesis thus shows that there are good prospects for using such measures as part of a quality assessment of technical documentation.
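Two of the quantified features are simple enough to sketch. The minimal Python illustration below assumes LIX as the readability measure (a standard index for Swedish text, though the thesis may have used another formula) and pre-parsed head indices for the dependency-length score; it is a sketch of the idea, not the system built at Saab.

```python
import re

def lix(text: str) -> float:
    """LIX readability: words/sentences + 100 * (long words / words).

    Long words have more than six characters; higher LIX means harder
    text. Assumed metric; the thesis's exact choice may differ.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

def mean_dependency_length(heads: list[int]) -> float:
    """Average distance between each token and its syntactic head.

    `heads` holds, for each 1-indexed token, the index of its head
    (0 marks the root, which is skipped). Longer average dependencies
    are commonly read as a sign of harder syntactic structure.
    """
    lengths = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(lengths) / len(lengths)

# Toy Swedish instruction text and hypothetical head indices.
print(lix("Tryck på knappen för att starta pumpen. Kontrollera trycket."))
print(mean_dependency_length([0, 1, 1, 6, 6, 3]))
```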
112

Abbreviation Expansion in Swedish Clinical Text : Using Distributional Semantic Models and Levenshtein Distance Normalization

Tengstrand, Lisa January 2014
In the medical domain, and especially in clinical texts, non-standard abbreviations are prevalent, which impairs readability for patients. To ease the understanding of physicians' notes, abbreviations need to be identified and expanded into their original forms. This thesis presents a distributional semantic approach to finding candidates for the original form of an abbreviation, combined with Levenshtein distance to choose the correct candidate among the semantically related words. The method is applied to radiology reports and medical journal texts, and a comparison is made to general Swedish. The results show that the correct expansion of the abbreviation can be found in 40% of the cases, an improvement of 24 percentage points over the baseline (16%) and an increase of 22 percentage points over using word space models alone (18%).
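The reranking step is easy to illustrate. The sketch below is a simplification under stated assumptions, not the thesis's code: the distributional model is assumed to have already proposed a list of semantically related candidates, and the expansion with the smallest length-normalized Levenshtein distance to the abbreviation is selected.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_expansion(abbrev: str, candidates: list[str]) -> str:
    """Choose the semantically related candidate closest in spelling.

    The distance is normalized by the longer string so that short and
    long candidates remain comparable.
    """
    return min(candidates,
               key=lambda c: levenshtein(abbrev, c) / max(len(abbrev), len(c)))

# Hypothetical word-space candidates for the abbreviation "pat".
print(best_expansion("pat", ["patient", "undersökning", "patologi"]))
```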
113

The generation of phrase-structure representations from principles

LeBlanc, David C. January 1990
Implementations of grammatical theory have traditionally been based upon Context-Free Grammar (CFG) formalisms which all but ignore questions of learnability. Even implementations based upon theories of Generative Grammar (GG), a paradigm supposedly motivated by learnability, rarely address such questions. In this thesis we examine a GG theory formulated primarily to address questions of learnability and present an implementation based upon it. The theory argues from Chomsky's definition of epistemological priority that principles which match elements and structures from prelinguistic systems with elements and structures in linguistic systems are preferable to those which are defined purely linguistically or non-linguistically. A procedure for constructing phrase-structure representations from prelinguistic relations using principles of node percolation (rather than the traditional X-bar theory of GG theories or the phrase-structure rules of CFG theories) is presented, and this procedure is integrated into a left-to-right, primarily bottom-up parsing mechanism. Specifically, we present a parsing mechanism which derives phrase-structure representations of sentences from Case- and θ-relations using a small number of Percolation Principles. These Percolation Principles simply determine the categorial features of the dominant node of any two adjacent nodes in a representational tree, doing away with explicit phrase-structure rules altogether. The parsing mechanism also instantiates appropriate empty categories using a filler-driven paradigm for leftward argument and non-argument movement. Procedures modelling learnability are not implemented in this work, but the applicability of the presented model to a computational model of language is discussed.
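To make the role of the Percolation Principles concrete, here is a hypothetical Python sketch (the rule table and node structure are invented for illustration, not taken from LeBlanc's implementation): given two adjacent nodes, a small table decides which node projects its categorial features to the dominating node, with no explicit phrase-structure rules consulted.

```python
from dataclasses import dataclass

@dataclass
class Node:
    category: str          # e.g. "V", "N", "P", "D"
    children: tuple = ()

# Hypothetical percolation rules: which of two adjacent categories
# projects its features to the node that dominates both.
PERCOLATION_RULES = {
    ("V", "N"): "V",   # verb + nominal argument -> verbal projection
    ("P", "N"): "P",   # preposition + nominal   -> prepositional projection
    ("D", "N"): "N",   # determiner + noun       -> nominal projection
}

def percolate(left: Node, right: Node) -> Node:
    """Build the dominating node for two adjacent nodes.

    The dominant categorial features come from the rule table alone,
    replacing explicit phrase-structure rules.
    """
    head = PERCOLATION_RULES.get((left.category, right.category))
    if head is None:
        raise ValueError(f"no principle for {left.category}+{right.category}")
    return Node(category=head, children=(left, right))

# Bottom-up combination of a verb and its nominal argument.
vp = percolate(Node("V"), Node("N"))
print(vp.category)  # V
```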
114

Identifying Base Noun Phrases by Means of Recurrent Neural Networks : Using Morphological and Dependency Features

Wang, Tonghe January 2020
Noun phrases convey key information in communication and are of interest in NLP tasks. A base NP is defined as the headword and the left-hand-side modifiers of a noun phrase. In this thesis, we identify base NPs in Universal Dependencies treebanks in English and French using an RNN architecture. The data consist of three multi-layered treebanks in which each sentence is annotated in both constituency and dependency formalisms. To build our training data, we find base NPs in the constituency layers and project them onto the dependency layer by labeling the corresponding tokens. For input features, we devised 18 configurations of features available in UD annotation. We train RNN models with LSTM and GRU cells for different numbers of epochs on these configurations of features. Tested on monolingual and bilingual test sets, our models delivered satisfactory token-based F1 scores (92.70% on English, 94.87% on French, and 94.29% on the bilingual test set). The most predictive configuration of features is found to be pos_dep_parent_child_morph, which covers 1) the dependency relations between the current token, its syntactic head, and its leftmost and rightmost syntactic dependents; 2) the PoS tags of these tokens; and 3) the morphological features of the current token.
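As an illustration of what the winning configuration encodes, the sketch below assembles the pos_dep_parent_child_morph features for one token of a dependency-annotated sentence. The Token structure and function name are assumptions for illustration, not the thesis's code.

```python
from dataclasses import dataclass

@dataclass
class Token:
    idx: int        # 1-indexed position, CoNLL-U style
    pos: str        # universal PoS tag
    deprel: str     # dependency relation to the head
    head: int       # index of the syntactic head (0 = root)
    morph: str      # morphological features, e.g. "Number=Plur"

def pos_dep_parent_child_morph(sent: list[Token], i: int) -> list[str]:
    """Features for token i: PoS, deprel, and morphology of the token,
    plus PoS and deprel of its head and its leftmost/rightmost dependents."""
    tok = sent[i]
    head = sent[tok.head - 1] if tok.head > 0 else None
    deps = [t for t in sent if t.head == tok.idx]
    left = deps[0] if deps else None
    right = deps[-1] if deps else None
    feats = [tok.pos, tok.deprel, tok.morph]
    for ctx in (head, left, right):
        feats += [ctx.pos, ctx.deprel] if ctx else ["NONE", "NONE"]
    return feats

# "old books": the noun heads the adjective; features for "books".
sent = [Token(1, "ADJ", "amod", 2, "Degree=Pos"),
        Token(2, "NOUN", "obj", 0, "Number=Plur")]
print(pos_dep_parent_child_morph(sent, 1))
```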
115

Specificity Prediction for Sentences in Press Releases

He, Tiantian January 2020
Specificity is an important factor in text analysis. While much research on sentence specificity experiments on news text, very little is known about press releases. Our study is devoted to specificity in press releases, journalistic documents that companies share with the press and other media outlets. In this research, we analyze press releases about digital transformation written by pump companies and develop tools for the automatic measurement of sentence specificity. The goals of the research are to 1) explore the effects of data combination, 2) analyze features for specificity prediction, and 3) compare the effectiveness of classification and probability estimation. Through our experiments on various combinations of training data, we find that adding news data to the model effectively improves probability estimation, but the effect on classification is not noticeable. In terms of features, we find that sentence length plays an essential role in specificity prediction. We remove twelve insignificant features, and this modification results in a model that runs faster while achieving comparable scores. We also find that both classification and probability estimation have drawbacks. With regard to probability estimation, models can score well by only making predictions around the threshold. Binary classification depends on the threshold, and setting it requires careful consideration. Moreover, classification scores cannot sift out models that make unreliable judgments about high- and low-specificity sentences.
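The contrast between probability estimation and thresholded classification can be sketched in a few lines. The setup below is purely illustrative, using sentence length as the lone feature (the one the study found essential); the study's actual models and feature set differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: sentence length (in tokens) as the single feature,
# label 1 = specific, 0 = general. Purely illustrative values.
X = np.array([[5], [7], [9], [18], [22], [30], [12], [25]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

model = LogisticRegression().fit(X, y)

lengths = np.array([[8], [15], [28]])
probs = model.predict_proba(lengths)[:, 1]   # probability estimation
labels = (probs >= 0.5).astype(int)          # classification: fixed threshold

for n, p, l in zip(lengths.ravel(), probs, labels):
    print(f"length={n:2d}  P(specific)={p:.2f}  label={l}")
# A model whose probabilities hover near the threshold can score well on
# probability metrics while its hard labels stay uninformative.
```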
116

Clustering Short Texts: Categorizing Initial Utterances from Customer Service Dialogue Agents

Hang, Sijia January 2021
Text classification requires labeled data, which is not always available or requires expensive manual labour. User-generated short texts are produced in abundance in customer service sectors through transcripts of phone calls or online chats. This kind of unstructured textual data can be noisy and thus poses challenges to unsupervised classification methods developed for standard documents such as news articles. This thesis project explores possible methods for unsupervised classification of user-generated short texts in Swedish on a real-world dataset of short texts collected from first utterances in a Conversational Interactive Voice Response solution. Such texts represent a spectrum of subdomains that customer service representatives may handle, but they are not extensively explored in the literature. We experiment with three types of pretrained word embeddings as text representation methods and two clustering algorithms, on two representative but different subsets of the data as well as on the full dataset. The experimental results show that static fastText embeddings are better suited than state-of-the-art contextual embeddings, such as those derived from BERT, for representing noisy short texts for clustering. In addition, we conduct manual (re-)labeling of selected subsets of the data as an exploratory analysis of the dataset, which shows that the provided labels are not reliable for meaningful evaluation. Furthermore, as the data often cover several overlapping concepts in a narrow domain, the existing pretrained embeddings are not effective at capturing the nuanced differences, and the clustering algorithms do not separate the data points in a way that fits the operational objectives reflected in the provided labels. Nevertheless, our qualitative analysis shows that unsupervised clustering algorithms could contribute to the goal of minimizing the manual effort in the data labeling process to a certain degree as a preprocessing step, but more could be achieved in a semi-supervised "human-in-the-loop" manner.
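A condensed version of the best-performing pipeline might look like the following sketch; the pretrained-model file name and the number of clusters are placeholders rather than details from the thesis.

```python
import fasttext
from sklearn.cluster import KMeans

# Placeholder path: any pretrained Swedish fastText model in .bin format.
model = fasttext.load_model("cc.sv.300.bin")

utterances = [
    "jag vill ändra min faktura",
    "hur säger jag upp mitt abonnemang",
    "min räkning stämmer inte",
]

# Static sentence vectors: fastText averages subword-aware word vectors,
# which the thesis found more robust on noisy short texts than BERT-style
# contextual embeddings.
vectors = [model.get_sentence_vector(u) for u in utterances]

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(vectors)
for text, cluster in zip(utterances, kmeans.labels_):
    print(cluster, text)
```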
117

Transfer Learning for Multilingual Offensive Language Detection with BERT

Casula, Camilla January 2020
The popularity of social media platforms has led to an increase in user-generated content being posted on the Internet. Users, masked behind what they perceive as anonymity, can express offensive and hateful thoughts on these platforms, creating a need to detect and filter abusive content. Since the amount of data available on the Internet is impossible to analyze manually, automatic tools are the most effective choice for detecting offensive and abusive messages. Academic research on the detection of offensive language on social media has been on the rise in recent years, with more and more shared tasks being organized on the topic. State-of-the-art deep-learning models such as BERT have achieved promising results on offensive language detection in English. However, multilingual offensive language detection systems, which focus on several languages at once, have remained underexplored until recently. In this thesis, we investigate whether transfer learning can be useful for improving the performance of a classifier for detecting offensive speech in Danish, Greek, Arabic, Turkish, German, and Italian. More specifically, we first experiment with using machine-translated data as input to a classifier. This allows us to evaluate whether machine-translated data can help classification. We then experiment with fine-tuning multiple pre-trained BERT models at once. This parallel fine-tuning process, named multi-channel BERT (Sohn and Lee, 2019), allows us to exploit cross-lingual information with the goal of understanding its impact on the detection of offensive language. Both the use of machine-translated data and the exploitation of cross-lingual information could help the task of detecting offensive language in cases in which there is little or no annotated data available, for example for low-resource languages. We find that using machine-translated data, either exclusively or mixed with gold data, to train a classifier on the task can often improve its performance. Furthermore, we find that fine-tuning multiple BERT models in parallel can positively impact classification, although it can lead to robustness issues for some languages.
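The parallel fine-tuning idea can be sketched as follows: several pre-trained encoders each read their own view of the input (for example, the original text and a machine translation), and their pooled representations are concatenated before a shared classification layer. This is a schematic PyTorch/transformers sketch with placeholder checkpoint names, not the thesis's actual configuration.

```python
import torch
from torch import nn
from transformers import AutoModel

class MultiChannelBert(nn.Module):
    """Concatenate pooled outputs of several BERT encoders, then classify."""

    def __init__(self, checkpoints, num_labels=2):
        super().__init__()
        self.encoders = nn.ModuleList(
            AutoModel.from_pretrained(name) for name in checkpoints
        )
        hidden = sum(enc.config.hidden_size for enc in self.encoders)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, channel_inputs):
        # channel_inputs: one (input_ids, attention_mask) pair per encoder,
        # e.g. the same sentence tokenized by each channel's own tokenizer.
        pooled = [
            enc(input_ids=ids, attention_mask=mask).last_hidden_state[:, 0]
            for enc, (ids, mask) in zip(self.encoders, channel_inputs)
        ]
        return self.classifier(torch.cat(pooled, dim=-1))

# Placeholder checkpoints: e.g. a multilingual and a monolingual encoder.
model = MultiChannelBert(["bert-base-multilingual-cased", "bert-base-cased"])
```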
118

Emotional Content in Novels for Literary Genre Prediction : And Impact of Feature Selection on Text Classification Models

Yako, Mary January 2021
Automatic literary genre classification presents a challenging task for Natural Language Processing (NLP) systems, mainly because literary texts have deeper levels of meaning, hold distinctive themes, and communicate particular messages and emotions. We conduct a study in which we experiment with building literary genre classifiers based on emotions in novels, to investigate the effect that emotion-related features have on genre prediction models. We begin with an analysis of emotions describing the emotional composition and density in the dataset. The experiments are carried out on a dataset of novels categorized into eight different genres. Genre prediction models are built using three algorithms: Random Forest, Support Vector Machine, and k-Nearest Neighbor. We build models based on emotion-word counts and on the emotional words themselves, and compare them to models using the commonly employed bag-of-words and TF-IDF features. Moreover, we apply a feature-selection dimensionality-reduction procedure to the TF-IDF feature set and study its impact on classification performance. Finally, we train and test classifiers on a combination of the two best-performing emotion-related feature sets, and compare them to classifiers trained and tested on a combination of bag-of-words and the reduced TF-IDF features. Our results confirm that using features of emotional content in novels improves classification performance, reaching 75% F1 against a bag-of-words baseline of 71% F1, and that the TF-IDF feature filtering method positively impacts genre classification performance on literary texts.
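The feature-filtering step can be sketched with scikit-learn. The chi-squared scoring function, the number of retained features, and the toy data below are assumptions for illustration, not the study's exact procedure.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

# Toy stand-ins for novel texts and their genres.
novels = [
    "her heart raced as the letter arrived at the manor",
    "the creature crawled out of the crypt at midnight",
    "a stolen kiss under the summer stars sealed their love",
    "blood dripped from the attic door while the house groaned",
]
genres = ["romance", "horror", "romance", "horror"]

# TF-IDF features, then keep only the k terms most associated with the
# genre labels; a smaller feature space trains faster and, as in the
# study, can achieve comparable scores.
clf = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=10),
    RandomForestClassifier(random_state=0),
)
clf.fit(novels, genres)
print(clf.predict(["a ghost wailed behind the cellar door"]))
```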
119

Unsupervised Lexical Semantic Change Detection with Context-Dependent Word Representations / Oövervakad inlärning av lexikalsemantisk förändring genom kontextberoende ordrepresentationer

You, Huiling January 2021
In this work, we explore the usefulness of contextualized embeddings from language models for lexical semantic change (LSC) detection. With diachronic corpora spanning two time periods, we construct word embeddings for a selected set of target words, aiming to detect potential LSC of each target word across time. We explore different systems of embeddings to cover three topics: contextualized vs. static word embeddings, token- vs. type-based embeddings, and multilingual vs. monolingual language models. We use a multilingual dataset covering three languages (English, German, Swedish) and explore each system of embeddings through two subtasks, a binary classification task and a ranking task. We compare the performance of the different systems of embeddings and seek to answer our research questions through discussion and analysis of the experimental results. We show that contextualized word embeddings are on par with static word embeddings in the classification task. Our results also show that, in most cases, it is more beneficial to use contextualized embeddings from a multilingual model than from a language-specific model. We find that the token-based setting is strong for static embeddings, and the type-based setting for contextualized embeddings, especially for the ranking task. We provide explanations for the results we achieve and propose improvements to our experiments for future work.
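In the type-based setting, the change score for a target word reduces to the cosine distance between its averaged (prototype) vectors in the two time periods; ranking target words by this score addresses the ranking subtask. The sketch below assumes contextual vectors for each occurrence have already been extracted from a language model; the toy arrays stand in for real embeddings.

```python
import numpy as np

def change_score(vectors_t1: np.ndarray, vectors_t2: np.ndarray) -> float:
    """Cosine distance between a word's prototype vectors in two periods.

    Each input is an (n_occurrences, dim) array of contextual embeddings
    of the target word; averaging yields one type-level vector per period.
    A higher distance suggests stronger semantic change.
    """
    p1, p2 = vectors_t1.mean(axis=0), vectors_t2.mean(axis=0)
    cos = p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return 1.0 - float(cos)

# Toy stand-ins for embeddings of one target word in two time periods.
rng = np.random.default_rng(0)
old = rng.normal(0.0, 1.0, size=(50, 768))
new = rng.normal(0.3, 1.0, size=(50, 768))
print(round(change_score(old, new), 3))
```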
120

Evaluation of Methods for Automatically Deciding Article Type for Newspapers

Eriksson, Adam January 2021
No description available.
