Duplicate Detection and Text Classification on Simplified Technical English / Dublettdetektion och textklassificering på Förenklad Teknisk Engelska

Lund, Max January 2019 (has links)
This thesis investigates the most effective way of classifying text labels and clustering duplicate texts in technical documentation written in Simplified Technical English. On the classification task, pre-trained transformer language models (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf vectors using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that tf-idf vectors with a low distance threshold in DBSCAN are preferable for duplicate detection.
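As an illustration of the configuration the results favour, the sketch below clusters near-duplicate texts using tf-idf vectors and DBSCAN with cosine distance and a low threshold. This is a minimal sketch, not the author's code; the toy documents and the eps value are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Toy texts standing in for Simplified Technical English documentation.
docs = [
    "Remove the cover before replacing the filter.",
    "Remove the cover before you replace the filter.",
    "Tighten the bolts to the specified torque.",
    "Tighten all bolts to the torque given in the table.",
    "Disconnect the battery before doing maintenance.",
]

# tf-idf vectors; DBSCAN with cosine distance and a low eps groups
# texts whose wording overlaps heavily, i.e. likely duplicates.
vectors = TfidfVectorizer().fit_transform(docs)
labels = DBSCAN(eps=0.4, min_samples=2, metric="cosine").fit_predict(vectors)

for label, doc in zip(labels, docs):
    print(label, doc)  # label -1 marks texts with no near-duplicate
```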

Extractive Multi-document Summarization of News Articles

Grant, Harald January 2019 (has links)
Publicly available data grows exponentially through web services and technological advancements. To comprehend large data streams, multi-document summarization (MDS) can be used. This thesis investigates the area of multi-document summarization. Multiple systems for extractive multi-document summarization are implemented using modern techniques, in the form of the pre-trained BERT language model for word embeddings and sentence classification, combined with well-proven techniques: the TextRank ranking algorithm, the Waterfall architecture and anti-redundancy filtering. The systems are evaluated on the DUC-2002, 2006 and 2007 datasets using the ROUGE metric. The results show that the BM25 sentence representation implemented in the TextRank model, using the Waterfall architecture and an anti-redundancy technique, outperforms the other implementations and provides results competitive with other state-of-the-art systems. A cohesive model is derived from the leading system and tried in a user study using a real-time news detection application with users from the news domain. The study shows a clear preference for cohesive summaries in extractive multi-document summarization: the cohesive summary is preferred in the majority of cases.
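To make the ranking step concrete, here is a minimal TextRank-style sketch: sentences are ranked with PageRank over a sentence-similarity graph and the top-ranked ones form the extractive summary. Note the assumptions: the thesis's leading system uses BM25 sentence similarities and a Waterfall architecture over multiple documents, whereas this sketch substitutes tf-idf cosine similarity on a single list of sentences.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, n=2):
    """Rank sentences with PageRank over a similarity graph and
    return the top-n sentences as an extractive summary."""
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    graph = nx.from_numpy_array(sim)          # weighted sentence graph
    scores = nx.pagerank(graph)               # TextRank = PageRank here
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:n]
    return [sentences[i] for i in sorted(top)]  # keep document order
```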

Cluster Analysis with Meaning : Detecting Texts that Convey the Same Message / Klusteranalys med mening : Detektering av texter som uttrycker samma sak

Öhrström, Fredrik January 2018 (has links)
Textual duplicates can be hard to detect when they differ in wording but carry similar semantic meaning. At Etteplan, a technical documentation company, many writers accidentally re-write existing instructions explaining procedures. These "duplicates" clutter the database and represent duplicated work, and the condition of the database will only deteriorate as the company expands. This thesis attempts to map where the problem is worst and to estimate how many duplicates there are. The corpus is small, but written in a controlled natural language called Simplified Technical English. The method uses document embeddings from doc2vec, clustering with HDBSCAN* and validation with the Density-Based Clustering Validation (DBCV) index to chart the problem. A survey was sent out to determine a threshold for when documents stop being duplicates, and using this value a theoretical duplicate count was calculated.
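A minimal sketch of this pipeline is given below, pairing gensim's Doc2Vec with the hdbscan library, whose relative_validity_ attribute provides a DBCV-based quality score. The toy texts and the tiny training settings are invented; the thesis's corpus, parameters and survey-derived threshold are not reproduced here.

```python
import numpy as np
import hdbscan
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy texts standing in for the Simplified Technical English corpus.
texts = [
    "remove the cover before replacing the filter",
    "remove the cover before you replace the filter",
    "tighten the bolts to the specified torque",
    "tighten all bolts to the given torque",
    "disconnect the battery before maintenance",
    "check the oil level every week",
]
corpus = [TaggedDocument(t.split(), [i]) for i, t in enumerate(texts)]

# Train document embeddings (tiny settings for the toy example).
model = Doc2Vec(corpus, vector_size=16, min_count=1, epochs=100)
vectors = np.array([model.dv[i] for i in range(len(texts))])

# HDBSCAN* clusters candidate duplicates; gen_min_span_tree=True lets
# the library expose relative_validity_, a DBCV-style validation score.
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, gen_min_span_tree=True)
labels = clusterer.fit_predict(vectors)
print(labels, clusterer.relative_validity_)
```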

Modelling Phone-Level Pronunciation in Discourse Context

Jande, Per-Anders January 2006 (has links)
Analytic knowledge about the systematic variation in a language has an important place in the description of the language. Such knowledge is interesting, e.g., in the language teaching domain, as a background for various types of linguistic studies, and in the development of more dynamic speech technology applications. In previous studies, the effects of single variables or relatively small groups of related variables on the pronunciation of words have been studied separately. The work described in this thesis takes a holistic perspective on pronunciation variation and focuses on a method for creating general descriptions of phone-level pronunciation in discourse context. The discourse context is defined by a large set of linguistic attributes ranging from high-level variables such as speaking style down to the articulatory feature level. Models of phone-level pronunciation in the context of a discourse have been created for the central standard Swedish language variety. The models are represented in the form of decision trees, which are readable for both machines and humans. A data-driven approach was taken for the pronunciation modelling task, and the work involved the annotation of recorded speech with linguistic and related information. The decision tree models were induced from the annotation. An important part of the work on pronunciation modelling was also the development of a pronunciation lexicon for Swedish. In a cross-validation experiment, several sets of pronunciation models were created with access to different parts of the attributes in the annotation. The prediction accuracy of the pronunciation models could be improved by 42.2% by making information from layers above the phoneme level accessible during model training. Optimal models were obtained when attributes from all layers of annotation were used. The goal for the models was to produce pronunciation representations representative of the language variety, and not necessarily of the individual speakers on whose speech the models were trained. In the cross-validation experiment, model-produced phone strings were compared to key phonetic transcripts of actual speech, and the phone error rate was defined as the share of discrepancies between the respective phone strings. Thus, the phone error rate is the sum of actual errors and discrepancies resulting from desired adaptations from a speaker-specific pronunciation to a pronunciation reflecting general traits of the language variety. The optimal models gave an average phone error rate of 8.2%.
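To illustrate the modelling idea rather than the thesis's actual system, the sketch below induces a decision tree that predicts the realized phone from symbolic context attributes. The attributes, toy data and phone labels are invented; the real models are trained on a far richer multi-layer annotation of recorded Swedish speech.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy examples: context attributes -> realized phone for an
# underlying /t/ (the real annotation spans many more layers).
contexts = [
    {"stress": "unstressed", "rate": "fast", "next": "vowel"},
    {"stress": "unstressed", "rate": "fast", "next": "consonant"},
    {"stress": "stressed", "rate": "slow", "next": "vowel"},
    {"stress": "stressed", "rate": "slow", "next": "consonant"},
]
realizations = ["d", "-", "t", "t"]  # "-" = elided

# One-hot encode the symbolic attributes and induce a readable tree,
# mirroring the human- and machine-readable models of the thesis.
vec = DictVectorizer()
X = vec.fit_transform(contexts)
tree = DecisionTreeClassifier().fit(X, realizations)
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
```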

Linguistic interference in translated academic texts : A case study of Portuguese interference in abstracts translated into English

Galvao, Gabriela January 2009 (has links)
This study deals with linguistic interference in abstracts of scientific papers translated from Portuguese into English, collected from the online scientific database SciELO. The aim of this study is to analyze linguistic interference phenomena in 50 abstracts from the fields of humanities, history, social sciences, technology and natural sciences. The types of interference discussed are syntactic/grammatical, lexical/semantic and pragmatic interference. This study is mainly qualitative. The qualitative method was used in order to find out what kinds of interference phenomena occur in the abstracts, analyze the possible reasons for their occurrence and present some suggestions to avoid the problems discussed. In addition, a quantitative analysis was carried out to interpret the results (figures and percentages) of the study. The analysis is aimed at providing some guidance for future translations. This study concluded that translations from a Romance language (in this case Portuguese) into a Germanic language (English) tend to be more objective and/or sometimes lose original meanings attributed in the source text. Another important finding was that abstracts from the humanities, history and social sciences present more cases of interference phenomena than the ones belonging to technology and natural sciences. These findings imply that many abstracts within these areas have a high probability of being subject to the phenomena discussed and, consequently, of having parts of their original meaning lost or misinterpreted in the target texts. Keywords: abstracts, bilingualism, cross-linguistic influence, linguistic interference, linguistic transfer, non-native speakers of English, Portuguese-English interference, source text, target text, translation.

Translation as Linear Transduction : Models and Algorithms for Efficient Learning in Statistical Machine Translation

Saers, Markus January 2011 (has links)
Automatic translation has seen tremendous progress in recent years, mainly thanks to statistical methods applied to large parallel corpora. Transductions represent a principled approach to modeling translation, but existing transduction classes are either not expressive enough to capture structural regularities between natural languages or too complex to support efficient statistical induction on a large scale. A common approach is to severely prune search over a relatively unrestricted space of transduction grammars. These restrictions are often applied at different stages in a pipeline, with the obvious drawback of committing to irrevocable decisions that should not have been made. In this thesis we will instead restrict the space of transduction grammars to a space that is less expressive, but can be efficiently searched. First, the class of linear transductions is defined and characterized. They are generated by linear transduction grammars, which represent the natural bilingual case of linear grammars, as well as the natural linear case of inversion transduction grammars (and higher order syntax-directed transduction grammars). They are recognized by zipper finite-state transducers, which are equivalent to finite-state automata with four tapes. By allowing this extra dimensionality, linear transductions can represent alignments that finite-state transductions cannot, and by keeping the mechanism free of auxiliary storage, they become much more efficient than inversion transductions. Secondly, we present an algorithm for parsing with linear transduction grammars that allows pruning. The pruning scheme imposes no restrictions a priori, but guides the search to potentially interesting parts of the search space in an informed and dynamic way. Being able to parse efficiently allows learning of stochastic linear transduction grammars through expectation maximization. All the above work would be for naught if linear transductions were too poor a reflection of the actual transduction between natural languages. We test this empirically by building systems based on the alignments imposed by the learned grammars. The conclusion is that stochastic linear inversion transduction grammars learned from observed data stand up well to the state of the art.
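As a toy illustration of what makes a transduction grammar linear, the sketch below generates aligned sentence pairs from a hand-written grammar in which every rule contains at most one nonterminal, flanked by terminal strings in both languages at once. The grammar, words and rule encoding are all invented for illustration and are far simpler than the stochastic grammars induced in the thesis.

```python
import random

# Each rule rewrites a nonterminal to (source-prefix, target-prefix,
# child-nonterminal-or-None, source-suffix, target-suffix). Allowing at
# most one nonterminal per rule is what makes the grammar linear.
RULES = {
    "S": [("the cat", "katten", "VP", "", "")],
    "VP": [("sleeps", "sover", None, "", ""),
           ("does not", "", "V", "", "inte")],  # reordering across languages
    "V": [("sleep", "sover", None, "", "")],
}

def generate(symbol):
    """Expand a nonterminal into an aligned (source, target) pair."""
    s_pre, t_pre, child, s_suf, t_suf = random.choice(RULES[symbol])
    if child is None:
        return s_pre, t_pre
    s_mid, t_mid = generate(child)
    source = " ".join(part for part in (s_pre, s_mid, s_suf) if part)
    target = " ".join(part for part in (t_pre, t_mid, t_suf) if part)
    return source, target

print(generate("S"))  # e.g. ('the cat does not sleep', 'katten sover inte')
```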

Classifying Hate Speech using Fine-tuned Language Models

Brorson, Erik January 2018 (has links)
Given the explosion in the size of social media, the amount of hate speech is also growing. To efficiently combat this issue we need reliable and scalable machine learning models. Current solutions rely either on crowdsourced datasets that are limited in size, or on training data from self-identified hateful communities, which lacks specificity. In this thesis we introduce a novel semi-supervised modelling strategy: the model is first trained on the freely available data from the hateful communities and then fine-tuned to classify hateful tweets from crowdsourced annotated datasets. We show that our model reaches state-of-the-art performance with minimal hyper-parameter tuning.
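A minimal sketch of the second stage of this strategy is shown below, using the Hugging Face transformers API: a pre-trained model, assumed to have already been adapted on text from the hateful communities, is fine-tuned on annotated tweets. The checkpoint name, toy tweets and labels are placeholders; the thesis's actual model and fine-tuning recipe may differ.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; in the two-stage strategy this would be a model
# already trained on data from self-identified hateful communities.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

tweets = ["example hateful tweet", "example harmless tweet"]  # invented
labels = torch.tensor([1, 0])  # 1 = hateful, 0 = not hateful

batch = tok(tweets, padding=True, truncation=True, return_tensors="pt")
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few fine-tuning steps on the toy batch
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```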

Relation Classification using Semantically-Enhanced Syntactic Dependency Paths : Combining Semantic and Syntactic Dependencies for Relation Classification using Long Short-Term Memory Networks

Capshaw, Riley January 2018 (has links)
Many approaches to solving tasks in the field of Natural Language Processing (NLP) use syntactic dependency trees (SDTs) as a feature to represent the latent nonlinear structure within sentences. Recently, work in parsing sentences to graph-based structures which encode semantic relationships between words—called semantic dependency graphs (SDGs)—has gained interest. This thesis seeks to explore the use of SDGs in place of and alongside SDTs within a relation classification system based on long short-term memory (LSTM) neural networks. Two methods for handling the information in these graphs are presented and compared between two SDG formalisms. Three new relation extraction system architectures have been created based on these methods and are compared to a recent state-of-the-art LSTM-based system, showing comparable results when semantic dependencies are used to enhance syntactic dependencies, but with significantly fewer training parameters.
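The sketch below shows the general shape of such a system: a dependency path between two entities, whether drawn from an SDT or an SDG, is encoded with an LSTM whose final hidden state is classified into a relation label. The dimensions, vocabulary and path encoding are invented; they do not reproduce the thesis's three architectures.

```python
import torch
import torch.nn as nn

class PathLSTM(nn.Module):
    """Encode a dependency path (word and edge-label ids interleaved)
    with an LSTM and classify the relation between its endpoints."""
    def __init__(self, vocab_size, n_relations, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_relations)

    def forward(self, path_ids):              # shape: (batch, path_len)
        _, (hidden, _) = self.lstm(self.embed(path_ids))
        return self.out(hidden[-1])           # logits over relation labels

# Toy usage: the ids would come from a real vocabulary over words and
# dependency labels, e.g. ["caused", "nsubj", "explosion"].
model = PathLSTM(vocab_size=1000, n_relations=19)
logits = model(torch.tensor([[5, 17, 42]]))
```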

Word Representations and Machine Learning Models for Implicit Sense Classification in Shallow Discourse Parsing

Callin, Jimmy January 2017 (has links)
CoNLL 2015 featured a shared task on shallow discourse parsing. In 2016, the efforts continued with an increasing focus on sense classification. In the case of implicit sense classification, there was an interesting mix of traditional and modern machine learning classifiers using word representation models. In this thesis, we explore the performance of a number of these models, and investigate how they perform using a variety of word representation models. We show that there are large performance differences between word representation models for certain machine learning classifiers, while others are more robust to the choice of word representation model. We also show that with the right choice of word representation model, simple and traditional machine learning classifiers can reach competitive scores even when compared with modern neural network approaches.
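A minimal sketch of the "traditional classifier over word representations" setup is given below: each discourse argument is represented by the average of its word vectors, and a linear classifier predicts the implicit sense. The random vectors, example arguments and sense labels are stand-ins for a real embedding model and the CoNLL shared-task data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def avg_vector(tokens, embeddings, dim=50):
    """Average the word vectors of a discourse argument."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Random vectors standing in for a real word representation model.
rng = np.random.default_rng(0)
vocab = "the market fell investors worried it rose anyway".split()
embeddings = {w: rng.normal(size=50) for w in vocab}

# Invented argument pairs and implicit sense labels.
arg_pairs = [("the market fell", "investors worried"),
             ("the market fell", "it rose anyway")]
senses = ["Contingency.Cause", "Comparison.Contrast"]

X = [np.concatenate([avg_vector(a.split(), embeddings),
                     avg_vector(b.split(), embeddings)])
     for a, b in arg_pairs]
clf = LogisticRegression(max_iter=1000).fit(X, senses)
print(clf.predict(X))
```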

Definition Extraction From Swedish Technical Documentation : Bridging the gap between industry and academy approaches

Helmersson, Benjamin January 2016 (has links)
Terminology is concerned with the creation and maintenance of concept systems, terms and definitions. Automatic term and definition extraction is used to simplify this otherwise manual and sometimes tedious process. This thesis presents an integrated approach of pattern matching and machine learning, utilising feature vectors in which each feature is a Boolean function of a regular expression. The integrated approach is compared with the two more classic approaches, showing a significant increase in recall while maintaining a comparable precision score. Less promising is the negative correlation between the performance of the integrated approach and training size. Further research is suggested.
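The sketch below illustrates the integrated feature design: each feature in the vector is a Boolean function of a regular expression, and the resulting vectors feed a standard classifier. The cue patterns, sentences and labels are invented examples of definition cues, not the thesis's actual pattern set.

```python
import re
import numpy as np
from sklearn.svm import LinearSVC

# Invented cue patterns; the thesis's regular expressions differ.
PATTERNS = [
    re.compile(r"\bis defined as\b"),
    re.compile(r"\brefers to\b"),
    re.compile(r"\bmeans\b"),
    re.compile(r"\bi\.e\.\b"),
]

def features(sentence):
    """Boolean feature vector: one regex-match indicator per pattern."""
    return [bool(p.search(sentence)) for p in PATTERNS]

sentences = [
    "A turbine is defined as a rotary mechanical device.",
    "The valve refers to the component that controls the flow.",
    "Check the oil level every week.",
    "Tighten the bolts before starting the engine.",
]
is_definition = [1, 1, 0, 0]  # invented labels

X = np.array([features(s) for s in sentences])
clf = LinearSVC().fit(X, is_definition)
print(clf.predict(X))
```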
