121 |
A Lexical Comparison Using Word Embedding Mapping from an Academic Word Usage Perspective / Cai, Xuemei / January 2020
This thesis applies a word embedding mapping approach to make a lexical comparison from an academic word usage perspective. We aim to demonstrate the differences in academic word usage between a corpus of student writing and a corpus of academic English, and between the same student writing corpus and a corpus of social media texts. The VecMap mapping algorithm, commonly used to solve cross-lingual mapping problems, was used to map the academic English vector space and the social media text vector space into the common student writing vector space, which facilitates the comparison of word representations from different corpora and the visualization of the comparison results. The average distance was defined as a measure of the differences in usage of 420 typical academic words between each pair of corpora, and principal component analysis was applied to visualize the differences. A rank-biased overlap approach was adopted to evaluate the results of the proposed approach. The experimental results show that academic word usage in the student writing corpus is more similar to that in the academic English corpus than to that in the social media text corpus.
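The rank-biased overlap measure mentioned in the abstract can be sketched as follows. This is a minimal illustration of the standard truncated form, RBO@k = (1 - p) * sum over d = 1..k of p^(d-1) * |A[:d] ∩ B[:d]| / d, and not necessarily the variant used in the thesis; the function name, the default `p`, and the example word lists are assumptions.

```python
def rank_biased_overlap(ranking_a, ranking_b, p=0.9, depth=None):
    """Truncated rank-biased overlap of two ranked lists (no extrapolation).

    Assumes neither list contains duplicates. `p` weights the top ranks:
    smaller values concentrate the score on the first few positions.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if depth is None:
        depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        # Agreement at depth d: shared items among the top d of each list.
        agreement = len(seen_a & seen_b) / d
        score += p ** (d - 1) * agreement
    return (1.0 - p) * score


# Hypothetical nearest-neighbour lists for one academic word in two corpora.
neighbours_students = ["analyse", "evaluate", "assess"]
neighbours_academic = ["analyse", "assess", "examine"]
overlap = rank_biased_overlap(neighbours_students, neighbours_academic, p=0.5)
```

Because the sum is truncated at a finite depth k, two identical lists score 1 - p^k rather than 1, so scores are best compared at a fixed depth and a fixed `p`.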
|
122 |
Using Attention-based Sequence-to-Sequence Neural Networks for Transcription of Historical Cipher Documents / Renfei, Han / January 2020
Encrypted historical manuscripts (also called ciphers), which contain encoded information, provide a useful resource for gaining new insight into our history. Transcribing these manuscripts from image format to a computer-readable format is a necessary step for decrypting them. In this thesis project, we explore automatic approaches to Handwritten Text Recognition (HTR) for line-by-line transcription of cipher images. We applied an attention-based Sequence-to-Sequence (Seq2Seq) model for the automatic transcription of ciphers with three different writing systems. We tested and developed algorithms for the recognition of cipher symbols and their locations. To evaluate our method on different levels, the model is trained and tested on ciphers with various symbol sets, from digits to graphical signs. To find useful approaches for improving transcription performance, we conducted an ablation study of the attention mechanism and other deep learning tricks. The results show an accuracy lower than 50% and indicate considerable room for improvement and plenty of future work.
|
123 |
Named Entity Recognition for Social Media Text / Zhang, Yaxi / January 2019
This thesis aims to perform named entity recognition for English social media texts. Named Entity Recognition (NER) is applied in many NLP tasks as an important preprocessing procedure. Social media texts contain a great deal of real-time data and therefore serve as a valuable source for information extraction. Nevertheless, NER for social media texts is a rather challenging task due to the noisy context. Traditional approaches to this task use hand-crafted features, which are both time-consuming to build and very task-specific. As a result, they fail to deliver satisfactory performance. The goal of this thesis is to tackle this task by automatically identifying and annotating named entities of multiple types with the help of neural network methods. In this thesis, we experiment with three different word embeddings and with character embedding neural network architectures that combine long short-term memory (LSTM), bidirectional LSTM (BI-LSTM) and conditional random field (CRF) to get the best result. The data and evaluation tool come from the previous shared tasks on Noisy User-generated Text (W-NUT) in 2017. We achieve the best F1 score, 42.44, using BI-LSTM-CRF with character-level representations extracted by a BI-LSTM and pre-trained GloVe word embeddings. We also find that the results could be improved with larger training data sets.
|
124 |
Syntactic effects from lexical decision in sentences : implications for human parsing / Wright, Barton Day / January 1982
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Psychology, 1982. / MICROFICHE COPY AVAILABLE IN ARCHIVES AND HUMANITIES. / Bibliography: leaves 99-100. / by Barton Day Wright. / Ph.D.
|
125 |
How negation influences word order in languages : Automatic classification of word order preference in positive and negative transitive clauses / Lyu, Chen / January 2023
In this work, we explore the possibility of using word alignment in a parallel corpus to project linguistic annotations, such as part-of-speech tags and dependency relations, from high-resource languages to low-resource languages. We use a parallel corpus of Bible translations, comprising 1,444 translations in 986 languages, and a well-developed parser to annotate the source languages (English, French, German, and Czech). The annotations are projected to low-resource languages based on the word alignment results. We then design and refine a process for detecting verbs and the subjects and objects linked to each verb, and for finding and counting the resulting word orders. We used data from The World Atlas of Language Structures (WALS) to check whether our program gives satisfactory results, including for some Central African languages with different word orders in positive and negative clauses. Our method gives acceptable results. We explain our results and propose some candidate languages with different word orders in positive and negative clauses. After consulting grammar books, we confirm that one language out of three has this feature. Some possible ways to improve the performance of this method are also described.
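The step of detecting a verb with its subject and object and counting the resulting word orders can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's program: the token layout `(position, head_position, upos, deprel)`, the Universal Dependencies style labels `nsubj` and `obj`, the externally supplied polarity flag, and the function names are all hypothetical.

```python
from collections import Counter


def clause_order(tokens):
    """Return e.g. 'SVO' for the first verb with both a subject and an object.

    Each token is (position, head_position, upos, deprel); this layout is an
    assumption made for illustration. Returns None if no such verb exists.
    """
    for pos, _head, upos, _rel in tokens:
        if upos != "VERB":
            continue
        roles = {"V": pos}
        for dep_pos, dep_head, _dep_upos, dep_rel in tokens:
            if dep_head == pos and dep_rel == "nsubj":
                roles.setdefault("S", dep_pos)
            elif dep_head == pos and dep_rel == "obj":
                roles.setdefault("O", dep_pos)
        if len(roles) == 3:
            # Sort the role letters by their position in the sentence.
            return "".join(sorted(roles, key=roles.get))
    return None


def count_orders(clauses):
    """Count word orders separately for positive and negative clauses.

    `clauses` is an iterable of (tokens, is_negative) pairs; how negation is
    detected is left outside this sketch.
    """
    counts = {"positive": Counter(), "negative": Counter()}
    for tokens, is_negative in clauses:
        order = clause_order(tokens)
        if order is not None:
            counts["negative" if is_negative else "positive"][order] += 1
    return counts


# Two toy clauses: a positive SVO clause and a negative SOV clause.
positive = [(1, 2, "PRON", "nsubj"), (2, 0, "VERB", "root"), (3, 2, "NOUN", "obj")]
negative = [(1, 4, "PRON", "nsubj"), (2, 4, "NOUN", "obj"),
            (3, 4, "PART", "advmod"), (4, 0, "VERB", "root")]
order_counts = count_orders([(positive, False), (negative, True)])
```

A language whose dominant order differs between the two counters would be a candidate for the kind of polarity-dependent word order the abstract describes.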
|
126 |
Expressiveness in virtual talking faces / Svanfeldt, Gunilla / January 2006
In this thesis, different aspects of how to make synthetic talking faces more expressive have been studied. How can we collect data for the studies? How is lip articulation affected by expressive speech? Can the recorded data be used interchangeably in different face models? Can we use eye movements in the agent for communicative purposes? The work of this thesis includes studies of these questions and also an experiment using a talking head as a complement to a targeted audio device, in order to increase the intelligibility of the speech. The data collection described in the first paper resulted in two multimodal speech corpora. In the subsequent analysis of the recorded data, it could be stated that expressive modes strongly affect speech articulation, although further studies are needed to acquire more quantitative results, to cover more phonemes and expressions, and to be able to generalise the results to more than one individual. When switching the files containing facial animation parameters (FAPs) between different face models (as well as research sites), some problematic issues were encountered, despite the fact that both face models were created according to the MPEG-4 standard. The evaluation test of the implemented emotional expressions showed that the best recognition results were obtained when the face model and FAP file originated from the same site. The perception experiment in which a synthetic talking head was combined with a targeted-audio parametric loudspeaker showed that the virtual face augmented the intelligibility of speech, especially when the sound beam was directed slightly to the side of the listener, i.e. at lower sound intensities. In the experiment with eye gaze in a virtual talking head, the possibility of achieving mutual gaze with the observer was assessed. The results indicated that it is possible, but also pointed to some design features in the face model that need to be altered in order to achieve better control of the perceived gaze direction. / QC 20101126
|
127 |
What's in a letter? / Schein, Aaron J / 01 January 2012
Sentiment analysis is a burgeoning field in natural language processing used to extract and categorize opinion in evaluative documents. We look at recommendation letters, which pose unique challenges to standard sentiment analysis systems. Our dataset consists of eighteen letters from applications to UMass Worcester Memorial Medical Center’s residency program in Obstetrics and Gynecology. Given a small dataset, we develop a method intended for use by domain experts to systematically explore their intuitions about the topical make-up of documents on which they make critical decisions. By leveraging WordNet and the WordNet Propagation algorithm, the method allows a user to develop topic seed sets from real data and propagate them into robust lexicons for use on new data. We show how one pass through the method yields useful feedback on our beliefs about the make-up of recommendation letters. Finally, we outline future directions that assume a fuller dataset.
|
128 |
Ensuring Brand Safety by Using Contextual Text Features: A Study of Text Classification with BERT / Song, Lingqing / January 2023
When advertisements are placed on web pages, the context in which the advertisements are presented is important. For example, manufacturers of kitchen knives may not want their advertisement to appear in a news article about a knife-wielding murderer. The purpose of the current work is to explore the ability of pre-trained language models on text classification tasks to determine whether the content of a given article is brand-safe, that is, suitable for brand advertising. A Norwegian-language news dataset containing 3600 news items was manually labelled with negative topics. Five pre-trained BERT language models were tested, including one multilingual BERT and four language models pre-trained specifically on Norwegian. Different training settings and fine-tuning methods were also tested for the two best-performing models. It was found that more structurally complex language models, and language models trained on corpora that were large or had larger vocabularies, performed better on the text classification task during testing. However, the performance of smaller models is also acceptable when better performance must be traded off against the time and processing power required. As far as training and fine-tuning settings are concerned, this work found that for news texts, the initial part of an article, which often contains the most information, is the best part to use as input to the BERT model. Another contribution of this work is the manual tagging of a Norwegian news dataset with negative topics. This thesis also points to some possible directions for future work, such as experimenting with different label granularities, experimenting with multilingual controlled training, and training with few samples.
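The finding about using the opening of an article as model input can be illustrated with a small truncation helper. This is a minimal sketch, not the thesis's preprocessing: it splits on whitespace instead of using BERT's subword tokenizer, and `truncate_tokens`, `max_tokens`, and `strategy` are hypothetical names. The default of 512 reflects the usual input limit of standard BERT models.

```python
def truncate_tokens(text, max_tokens=512, strategy="head"):
    """Keep at most `max_tokens` whitespace tokens from one end of `text`.

    strategy="head" keeps the opening of the article; strategy="tail" keeps
    the end. Whitespace splitting is a stand-in for a real subword tokenizer.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if strategy not in ("head", "tail"):
        raise ValueError("strategy must be 'head' or 'tail'")
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return tokens
    return tokens[:max_tokens] if strategy == "head" else tokens[-max_tokens:]


article = "lead sentence with the key facts followed by background and quotes"
head_input = truncate_tokens(article, max_tokens=4, strategy="head")
tail_input = truncate_tokens(article, max_tokens=4, strategy="tail")
```

Comparing a classifier fed `head_input` with one fed `tail_input` is one simple way to test which part of an article carries the most signal.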
|
129 |
Exploring Patient Classification Based on Medical Records : The case of implant bearing patients / Danielsson, Benjamin / January 2022
In this thesis, the application of transformer-based models to the real-world task of identifying patients as implant bearing is investigated. The task is approached as a classification task, and five transformer-based models relying on the BERT architecture are implemented, along with a Support Vector Machine (SVM) as a baseline for comparison. The models are fine-tuned with Swedish medical texts, i.e. patients’ medical histories. The five transformer-based models in question make use of two pre-trained BERT models: one released by the National Library of Sweden, and a second one using the same pre-trained model but further pre-trained on domain-specific language. These are in turn fine-tuned using five different types of architectures: (1) a typical BERT model, (2) GAN-BERT, (3) RoBERT, (4) chunkBERT, and (5) a frequency-based optimized BERT. The final classifier, an SVM baseline, is trained using TF-IDF as the feature space. The data used in the thesis comes from a subset of an unreleased corpus from four Swedish clinics that covers a span of five years. The subset contains electronic medical records of patients belonging to the radiology and cardiology clinics. Four training sets were created, containing 100, 200, 300, and 903 labelled records respectively. The test set, containing 300 labelled samples, was also created from this subset. The labels on which the models are trained are created by labelling patients as implant bearing based on the number of implant terms each patient history contains. The results are promising and show favourable performance when classifying the patient histories. Models trained on 903 and 300 samples are able to outperform the baseline, and at their peak, BERT, chunkBERT and the frequency-based optimized BERT achieve an F1-measure of 0.97. When trained using 100 and 200 labelled records, all of the transformer-based models are outperformed by the baseline, except for the semi-supervised GAN-BERT, which is able to achieve competitive scores with 200 records. There is no clear delineation between using the pre-trained BERT and the BERT model that has additional pre-training on domain-specific language. However, it is believed that further research could shed additional light on the subject, since the results are inconclusive. / Patient-Safe Magnetic Resonance Imaging Examination by AI-based Medical Screening
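The labelling rule described in the abstract, where a patient history is marked as implant bearing based on how many implant terms it contains, can be sketched as follows. This is a minimal illustration only: the term list, the threshold of 2, and the function names are hypothetical, and the thesis's actual terms and cut-off are not given in the abstract.

```python
import re

# Hypothetical implant terms; the thesis's real term list is not given here.
IMPLANT_TERMS = ("pacemaker", "stent", "protes")


def count_implant_terms(history, terms=IMPLANT_TERMS):
    """Count case-insensitive substring occurrences of each term in a history."""
    text = history.lower()
    return sum(len(re.findall(re.escape(term), text)) for term in terms)


def label_history(history, threshold=2, terms=IMPLANT_TERMS):
    """Label a history as implant bearing if it has at least `threshold` hits.

    The threshold value is an assumption made for this sketch.
    """
    return count_implant_terms(history, terms) >= threshold


with_implant = "Patienten har pacemaker. Pacemaker kontrollerad utan anmärkning."
without_implant = "Ingen tidigare operation. Inga anmärkningar."
labels = [label_history(with_implant), label_history(without_implant)]
```

Labels produced this way are only as reliable as the term list, which is one reason to evaluate the trained classifiers on a separately labelled test set.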
|
130 |
Creating Knowledge Graphs using Distributional Semantic Models / Sandelius, Hugo / January 2016
This report investigates a method for creating knowledge graphs, a specific way of structuring information, using distributional semantic models. Two different algorithms for selecting graph edges and two different algorithms for labelling edges are tried, and variations of those are evaluated. We perform experiments comparing our knowledge graphs with existing high-quality, manually constructed knowledge graphs with respect to graph structure and edge labels. We find that the algorithms usually produce graphs with a structure similar to that of manually constructed knowledge graphs, as long as the data set is sufficiently large and general, and that the similarity of edge labels to manually chosen edge labels varies widely depending on the input.
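One simple way to select graph edges from a distributional semantic model is to connect each word to its nearest neighbours by cosine similarity. The sketch below illustrates that idea only; the abstract does not say which edge-selection algorithms the report uses, and the function names, the value of `k`, and the toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_edges(vectors, k=1):
    """Return undirected edges linking every word to its k most similar words."""
    edges = set()
    for word, vec in vectors.items():
        neighbours = sorted(
            (other for other in vectors if other != word),
            key=lambda other: cosine(vec, vectors[other]),
            reverse=True,
        )[:k]
        for other in neighbours:
            # Store each edge with its endpoints sorted so (a, b) == (b, a).
            edges.add(tuple(sorted((word, other))))
    return edges


# Toy two-dimensional vectors standing in for real word embeddings.
toy_vectors = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.0, 1.0],
    "bus": [0.1, 1.0],
}
graph_edges = knn_edges(toy_vectors, k=1)
```

Edge labelling, the second step named in the abstract, is not covered by this sketch.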
|