41. Emotional Content in Novels for Literary Genre Prediction: And Impact of Feature Selection on Text Classification Models
Yako, Mary. January 2021.
Automatic literary genre classification presents a challenging task for Natural Language Processing (NLP) systems, mainly because literary texts have deeper levels of meaning, hold distinctive themes, and communicate certain messages and emotions. We conduct a study in which we experiment with building literary genre classifiers based on emotions in novels, to investigate the effect that emotion-related features have on genre prediction models. We begin by performing an analysis of emotions, describing emotional composition and density in the dataset. The experiments are carried out on a dataset of novels categorized into eight different genres. Genre prediction models are built using three algorithms: Random Forest, Support Vector Machine, and k-Nearest Neighbor. We build models based on emotion-word counts and emotional words in a novel, and compare them to models using commonly employed features, bag-of-words and TF-IDF. Moreover, we apply a feature-selection dimensionality reduction procedure to the TF-IDF feature set and study its impact on classification performance. Finally, we train and test the classifiers on a combination of the two best-performing emotion-related feature sets, and compare them to classifiers trained and tested on a combination of bag-of-words and the reduced TF-IDF features. Our results confirm that using features of emotional content in novels improves classification performance, reaching 75% F1 compared to a bag-of-words baseline of 71% F1, and that the TF-IDF feature filtering method positively impacts genre classification performance on literary texts.
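As a rough illustration of the pipeline this abstract describes, the sketch below combines TF-IDF features, a feature-selection filter, and the three named classifiers. The use of scikit-learn, the chi-squared selection criterion, and k=1000 are illustrative assumptions, since the abstract does not specify the implementation.

# Sketch of the genre-prediction setup: TF-IDF features, a top-k feature
# filter, and the three classifiers named above. The chi-squared criterion
# and k=1000 are assumptions, not the thesis's actual settings.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_pipeline(classifier, k=1000):
    """TF-IDF vectorization -> top-k feature filter -> classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("select", SelectKBest(chi2, k=k)),
        ("clf", classifier),
    ])

classifiers = {
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(),
    "k-NN": KNeighborsClassifier(),
}
pipelines = {name: build_pipeline(clf) for name, clf in classifiers.items()}
# With a real corpus: pipeline.fit(train_texts, train_genres), then score
# predictions with sklearn.metrics.f1_score(..., average="macro").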

42. Unsupervised Lexical Semantic Change Detection with Context-Dependent Word Representations
You, Huiling. January 2021.
In this work, we explore the usefulness of contextualized embeddings from language models for lexical semantic change (LSC) detection. With diachronic corpora spanning two time periods, we construct word embeddings for a selected set of target words, aiming to detect potential LSC of each target word across time. We explore different embedding systems covering three topics: contextualized vs. static word embeddings, token- vs. type-based embeddings, and multilingual vs. monolingual language models. We use a multilingual dataset covering three languages (English, German, Swedish) and evaluate each embedding system on two subtasks, a binary classification task and a ranking task. We compare the performance of the different embedding systems, and seek to answer our research questions through discussion and analysis of the experimental results. We show that contextualized word embeddings are on par with static word embeddings in the classification task. Our results also show that in most cases it is more beneficial to use contextualized embeddings from a multilingual model than from a language-specific model. We find that the token-based setting is strong for static embeddings, and the type-based setting for contextualized embeddings, especially for the ranking task. We provide some explanation for the results we achieve, and propose improvements that could be made to our experiments in future work.
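A minimal sketch of one of the settings described above, a type-based use of contextualized embeddings for the ranking subtask: average a target word's contextual vectors in each period and rank words by the cosine distance between the averages. The multilingual BERT checkpoint, the mean pooling, and the assumption that the target occurs in every sentence are illustrative choices, not the thesis's exact system.

# Type-based contextualized LSC sketch: per-period average of the target's
# contextual vectors, ranked by cosine distance. Model and pooling choices
# are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def type_embedding(sentences, target):
    """Mean of the target word's contextual vectors over all sentences
    (assumes the target's subword sequence occurs in each sentence)."""
    vecs = []
    for sent in sentences:
        enc = tok(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]      # (seq_len, dim)
        ids = tok(target, add_special_tokens=False)["input_ids"]
        toks = enc["input_ids"][0].tolist()
        for i in range(len(toks) - len(ids) + 1):           # locate target span
            if toks[i:i + len(ids)] == ids:
                vecs.append(hidden[i:i + len(ids)].mean(0))
    return torch.stack(vecs).mean(0)

def change_score(sents_t1, sents_t2, target):
    v1 = type_embedding(sents_t1, target)
    v2 = type_embedding(sents_t2, target)
    return 1 - torch.cosine_similarity(v1, v2, dim=0).item()  # higher = more change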

43. Evaluation of Methods for Automatically Deciding Article Type for Newspapers
Eriksson, Adam. January 2021.
No description available.

44. A Lexical Comparison Using Word Embedding Mapping from an Academic Word Usage Perspective
Cai, Xuemei. January 2020.
This thesis applies a word embedding mapping approach to make a lexical comparison from an academic word usage perspective. We aim to demonstrate the differences in academic word usage between a corpus of student writing and a corpus of academic English, as well as between the student writing corpus and social media texts. The VecMap mapping algorithm, commonly used to solve cross-lingual mapping problems, was used to map the academic English vector space and the social media text vector space into the common student writing vector space, to facilitate the comparison of word representations from different corpora and to visualize the comparison results. The average distance was defined as a measure of the word usage differences of 420 typical academic words between each pair of corpora, and principal component analysis was applied to visualize the differences. A rank-biased overlap approach was adopted to evaluate the results of the proposed approach. The experimental results show that the usage of academic words in the student writing corpus is more similar to that in the academic English corpus than to that in the social media text corpus.
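The core mapping step can be sketched as an orthogonal Procrustes alignment over the shared vocabulary, followed by the average-distance measure the abstract defines. VecMap itself adds normalization, re-weighting, and self-learning stages, so this is a simplified stand-in, not the thesis's implementation.

# Simplified embedding-space mapping: learn an orthogonal rotation W over
# shared vocabulary, then measure per-word cosine distances after mapping.
import numpy as np

def procrustes_align(src, tgt, shared_words, src_vocab, tgt_vocab):
    """Orthogonal W minimizing ||src @ W - tgt|| over the shared words."""
    X = np.stack([src[src_vocab[w]] for w in shared_words])
    Y = np.stack([tgt[tgt_vocab[w]] for w in shared_words])
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def average_distance(words, src, tgt, W, src_vocab, tgt_vocab):
    """Mean cosine distance of mapped source vectors vs. target vectors,
    e.g. over the 420 academic words mentioned above."""
    dists = []
    for w in words:
        u = src[src_vocab[w]] @ W
        v = tgt[tgt_vocab[w]]
        cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        dists.append(1.0 - cos)
    return float(np.mean(dists))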

45. Using Attention-based Sequence-to-Sequence Neural Networks for Transcription of Historical Cipher Documents
Renfei, Han. January 2020.
Encrypted historical manuscripts (also called ciphers), containing encoded information, provide a useful resource for giving new insight into our history. Transcribing these manuscripts from image format into a computer-readable format is a necessary step towards decrypting them. In this thesis project, we explore automatic approaches to Handwritten Text Recognition (HTR) for line-by-line transcription of cipher images. We applied an attention-based Sequence-to-Sequence (Seq2Seq) model to the automatic transcription of ciphers with three different writing systems, and tested and developed algorithms for the recognition of cipher symbols and their locations. To evaluate our method at different levels, the model is trained and tested on ciphers with various symbol sets, from digits to graphical signs. To identify useful approaches for improving transcription performance, we conducted an ablation study of the attention mechanism and other deep learning tricks. The results show an accuracy lower than 50% and indicate substantial room for improvement and plenty of future work.
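The attention mechanism at the heart of such a Seq2Seq transcriber can be sketched as follows: at each decoding step, the decoder state attends over encoder features extracted from the line image, and the resulting context vector conditions the next symbol prediction. The additive (Bahdanau-style) formulation and all dimensions here are illustrative assumptions, not the thesis's exact architecture.

# Minimal additive attention module for a Seq2Seq transcription decoder.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, src_len, enc_dim); dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(
            self.w_enc(enc_states) + self.w_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                                      # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)
        context = (weights.unsqueeze(-1) * enc_states).sum(1)  # (batch, enc_dim)
        return context, weights

# At each step, `context` is concatenated with the previous symbol embedding
# and fed to the decoder RNN, which predicts the next cipher symbol.
attn = AdditiveAttention(enc_dim=256, dec_dim=256, attn_dim=128)
context, weights = attn(torch.randn(1, 50, 256), torch.randn(1, 256))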

46. Named Entity Recognition for Social Media Text
Zhang, Yaxi. January 2019.
This thesis aims to perform named entity recognition for English social media texts. Named Entity Recognition (NER) is applied in many NLP tasks as an important preprocessing procedure. Social media texts contain lots of real-time data and therefore serve as a valuable source for information extraction. Nevertheless, NER for social media texts is a rather challenging task due to the noisy context. Traditional approaches to this task use hand-crafted features, but these prove to be both time-consuming and very task-specific, and as a result fail to deliver satisfactory performance. The goal of this thesis is to tackle the task by automatically identifying and annotating named entities of multiple types with the help of neural network methods. We experiment with three different word embeddings and character embedding neural network architectures that combine long short-term memory (LSTM), bidirectional LSTM (BI-LSTM), and conditional random field (CRF) layers to get the best result. The data and evaluation tool come from the previous shared tasks on Noisy User-generated Text (W-NUT) in 2017. We achieve a best F1 score of 42.44 using BI-LSTM-CRF with character-level representations extracted by a BI-LSTM and pre-trained GloVe word embeddings. We also find that the results could be improved with larger training data sets.
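The best-performing architecture above can be sketched as the following skeleton: GloVe-initialized word embeddings concatenated with a character-level BI-LSTM representation, a word-level BI-LSTM producing tag emissions, and a CRF scoring the tag sequence. All dimensions are illustrative, and the CRF layer is taken from the pytorch-crf package rather than the thesis's codebase.

# Skeleton of a BI-LSTM-CRF tagger with character-level BI-LSTM features.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab, char_vocab, n_tags, w_dim=100, c_dim=25, hid=200):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, w_dim)        # init from GloVe
        self.char_emb = nn.Embedding(char_vocab, c_dim)
        self.char_lstm = nn.LSTM(c_dim, c_dim, bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(w_dim + 2 * c_dim, hid // 2,
                                 bidirectional=True, batch_first=True)
        self.proj = nn.Linear(hid, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def emissions(self, words, chars):
        # chars: (batch, seq_len, word_len) -> char-level word representation
        b, s, w = chars.shape
        c = self.char_emb(chars.view(b * s, w))
        _, (h, _) = self.char_lstm(c)                     # final fwd/bwd states
        c = torch.cat([h[0], h[1]], dim=-1).view(b, s, -1)
        x = torch.cat([self.word_emb(words), c], dim=-1)
        out, _ = self.word_lstm(x)
        return self.proj(out)

    def loss(self, words, chars, tags, mask):
        return -self.crf(self.emissions(words, chars), tags, mask=mask)

    def predict(self, words, chars, mask):
        return self.crf.decode(self.emissions(words, chars), mask=mask)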

47. How negation influences word order in languages: Automatic classification of word order preference in positive and negative transitive clauses
Lyu, Chen. January 2023.
In this work, we explore the possibility of using word alignment in a parallel corpus to project linguistic annotations, such as part-of-speech tags and dependency relations, from high-resource languages to low-resource languages. We use a parallel corpus of Bible translations, comprising 1,444 translations in 986 languages, and a well-developed parser to annotate the source languages (English, French, German, and Czech). The annotations are projected to the low-resource languages based on the word alignment results. We then design and refine a procedure that detects verbs and the subjects and objects linked to them, and finds and counts the word orders. We used data from The World Atlas of Language Structures (WALS) to check whether our program gives satisfactory results, including for some Central African languages with different word orders in positive and negative clauses, and our method gives acceptable results. We explain our results and propose some candidate languages with different word orders in positive and negative clauses; after consulting grammar books, we confirm that one language out of three has this feature. Finally, some possible ways to improve the performance of this method are described.
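The projection step can be sketched very simply: for each alignment link between a parsed source sentence and an unannotated target sentence, copy the source token's tag to the aligned target token. The data structures below are hypothetical simplifications of the thesis's pipeline.

# Annotation projection sketch: copy POS tags across word-alignment links.
def project_pos(source_tags, alignments):
    """
    source_tags: POS tags of the source sentence, e.g. ["PRON", "VERB", ...]
    alignments:  (src_index, tgt_index) pairs from a word aligner
                 such as eflomal or fast_align.
    Returns {tgt_index: tag}; unaligned target tokens stay unannotated.
    """
    projected = {}
    for src_i, tgt_i in alignments:
        # Later links overwrite earlier ones; a real pipeline would resolve
        # one-to-many alignments more carefully.
        projected[tgt_i] = source_tags[src_i]
    return projected

# Example: English "she sees him" aligned to a hypothetical SOV target.
tags = ["PRON", "VERB", "PRON"]
links = [(0, 0), (1, 2), (2, 1)]           # verb moves to final position
print(project_pos(tags, links))            # {0: 'PRON', 2: 'VERB', 1: 'PRON'}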

48. Expressiveness in virtual talking faces
Svanfeldt, Gunilla. January 2006.
In this thesis, different aspects of how to make synthetic talking faces more expressive have been studied. How can we collect data for the studies? How is lip articulation affected by expressive speech? Can the recorded data be used interchangeably in different face models? Can we use eye movements in the agent for communicative purposes? The work of this thesis includes studies of these questions, as well as an experiment using a talking head as a complement to a targeted audio device in order to increase the intelligibility of speech. The data collection described in the first paper resulted in two multimodal speech corpora. The subsequent analysis of the recorded data showed that expressive modes strongly affect speech articulation, although further studies are needed in order to acquire more quantitative results, to cover more phonemes and expressions, and to generalise the results beyond a single individual. When switching the files containing facial animation parameters (FAPs) between different face models (as well as research sites), some problematic issues were encountered, despite the fact that both face models were created according to the MPEG-4 standard. The evaluation of the implemented emotional expressions showed that the best recognition results were obtained when the face model and FAP file originated from the same site. The perception experiment in which a synthetic talking head was combined with a targeted audio, parametric loudspeaker showed that the virtual face augmented the intelligibility of speech, especially when the sound beam was directed slightly to the side of the listener, i.e. at lower sound intensities. In the experiment with eye gaze in a virtual talking head, the possibility of achieving mutual gaze with the observer was assessed. The results indicated that this is possible, but also pointed to some design features in the face model that need to be altered in order to achieve better control of the perceived gaze direction.

49. Ensuring Brand Safety by Using Contextual Text Features: A Study of Text Classification with BERT
Song, Lingqing. January 2023.
When advertisements are placed on web pages, the context in which the advertisements appear is important. For example, manufacturers of kitchen knives may not want their advertisement to appear in a news article about a knife-wielding murderer. The purpose of the current work is to explore the ability of pre-trained language models in text classification tasks that determine whether the content of a given article is brand-safe, that is, suitable for brand advertising. A Norwegian-language news dataset containing 3,600 news items was manually labelled with negative topics. Five pre-trained BERT language models were tested, including one multilingual BERT and four models pre-trained specifically on Norwegian. Different training settings and fine-tuning methods were also tested for the two best-performing models. It was found that more structurally complex language models, and language models trained on corpora that were large or had larger vocabularies, performed better on the text classification task during testing. However, the performance of smaller models is also acceptable if a trade-off is made between better performance and the time and processing power required. As far as training and fine-tuning settings are concerned, this work found that for news texts the initial part of an article, which often contains the most information, is the optimal choice of input to the BERT model. Another contribution of this work is the manual tagging of a Norwegian news dataset with negative topics. This thesis also points to some possible directions for future work, such as experimenting with different label granularities, experimenting with multilingual controlled training, and training with few samples.
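A minimal sketch of this fine-tuning setup follows: a pre-trained BERT with a classification head, where tokenizer truncation keeps only the initial part of each article, the input strategy the thesis found optimal for news texts. The checkpoint name is one plausible Norwegian model, not necessarily one of the five the thesis tested.

# Fine-tuning sketch: BERT sequence classifier, truncating each article to
# its head. Checkpoint, epochs, and max_length are illustrative assumptions.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "NbAiLab/nb-bert-base"      # assumed example Norwegian model
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2)            # brand-safe vs. not brand-safe

def encode(batch):
    # truncation=True keeps the first max_length tokens, i.e. the head of
    # the article, discarding the rest.
    return tok(batch["text"], truncation=True, max_length=512)

# With HuggingFace Datasets: ds = ds.map(encode, batched=True), then:
# trainer = Trainer(model=model,
#                   args=TrainingArguments("out", num_train_epochs=3),
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()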

50. Exploring Patient Classification Based on Medical Records: The case of implant bearing patients
Danielsson, Benjamin. January 2022.
In this thesis, the application of transformer-based models to the real-world task of identifying patients as implant-bearing is investigated. The task is approached as a classification task, and five transformer-based models relying on the BERT architecture are implemented, along with a Support Vector Machine (SVM) as a baseline for comparison. The models are fine-tuned with Swedish medical texts, i.e. patients' medical histories. The five transformer-based models make use of two pre-trained BERT models: one released by the National Library of Sweden, and a second using the same pre-trained model but further pre-trained on domain-specific language. These are in turn fine-tuned using five different architectures: (1) a typical BERT model, (2) GAN-BERT, (3) RoBERT, (4) chunkBERT, and (5) a frequency-based optimized BERT. The final classifier, the SVM baseline, is trained using TF-IDF as the feature space. The data used in the thesis come from a subset of an unreleased corpus from four Swedish clinics covering a span of five years. The subset contains electronic medical records of patients belonging to the radiology and cardiology clinics. Four training sets were created, containing 100, 200, 300, and 903 labelled records respectively; the test set, containing 300 labelled samples, was also created from this subset. The labels on which the models are trained were created by marking patients as implant-bearing based on the number of implant terms each patient history contains. The results are promising and show favourable performance when classifying the patient histories. Models trained on 903 and 300 samples are able to outperform the baseline, and at their peak, BERT, chunkBERT, and the frequency-based optimized BERT achieve an F1-measure of 0.97. When trained on 100 and 200 labelled records, all of the transformer-based models are outperformed by the baseline, except for the semi-supervised GAN-BERT, which achieves competitive scores with 200 records. There is no clear delineation between using the pre-trained BERT and the BERT model with additional pre-training on domain-specific language; however, further research could shed additional light on the subject, since the results are inconclusive.
Part of the project: Patient-Safe Magnetic Resonance Imaging Examination by AI-based Medical Screening
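Medical histories routinely exceed BERT's 512-token limit, which is what a variant like chunkBERT addresses. The sketch below shows one common interpretation of chunk-based classification: split the token sequence into fixed-size chunks, encode each with BERT, mean-pool the [CLS] vectors, and classify the pooled document vector. This is an illustrative interpretation, not the thesis's exact chunkBERT implementation; the checkpoint is the National Library of Sweden model the abstract mentions.

# Chunk-based long-document classification sketch in the spirit of chunkBERT.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

checkpoint = "KB/bert-base-swedish-cased"   # National Library of Sweden model
tok = AutoTokenizer.from_pretrained(checkpoint)
bert = AutoModel.from_pretrained(checkpoint)
classifier = nn.Linear(bert.config.hidden_size, 2)  # implant-bearing or not

def classify_long_text(text, chunk_size=510):
    ids = tok(text, add_special_tokens=False)["input_ids"]
    cls_vecs = []
    for i in range(0, len(ids), chunk_size):
        # Re-add special tokens around each 510-token chunk.
        chunk = [tok.cls_token_id] + ids[i:i + chunk_size] + [tok.sep_token_id]
        with torch.no_grad():
            out = bert(torch.tensor([chunk]))
        cls_vecs.append(out.last_hidden_state[0, 0])     # [CLS] vector
    doc_vec = torch.stack(cls_vecs).mean(0)              # mean-pool chunks
    return classifier(doc_vec).argmax().item()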