191

Tree Transformations in Inductive Dependency Parsing

Nilsson, Jens January 2007
This licentiate thesis deals with automatic syntactic analysis, or parsing, of natural languages. A parser constructs the syntactic analysis, which it learns by looking at correctly analyzed sentences, known as training data. The general topic is manipulation of the training data in order to improve parsing accuracy. Several studies on automatic, data-driven syntactic parsing with constituency-based theories have shown that training data, annotated according to a linguistic theory, often needs to be adapted in various ways in order to achieve an adequate automatic analysis: a linguistically sound constituent structure is not necessarily well suited for learning and parsing with existing data-driven methods. Modifications to the constituency-based trees in the training data, and corresponding modifications to the parser output, have successfully been applied to increase parsing accuracy. This thesis investigates whether similar modifications, in the form of tree transformations applied to training data annotated with dependency structures, can improve accuracy for data-driven dependency parsers. Two types of tree transformations are in focus. The first concerns non-projectivity. The full potential of dependency parsing can only be realized if non-projective constructions are allowed, but these pose a problem for projective dependency parsers, while non-projective parsers tend, among other things, to be slower. In order to maintain the benefits of projective parsing, a tree transformation technique for recovering non-projectivity while using a projective parser is presented here. The second type of transformation concerns linguistic phenomena that are possible but hard for a parser to learn, given a certain choice of dependency analysis. This study concentrates on two such phenomena, coordination and verb groups, to which tree transformations are applied in order to improve parsing accuracy in cases where the original structure does not coincide with a structure that is easy to learn. Empirical evaluations are performed using treebank data from various languages and more than one dependency parser. The results show that the benefit of these tree transformations, used in preprocessing and postprocessing, is to a large extent independent of language, treebank and parser.
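To make the first transformation concrete, the sketch below shows the core "lifting" step behind pseudo-projective parsing (Nivre and Nilsson, 2005): the dependent of a non-projective arc is reattached to its grandparent until the tree is projective. The data layout (a head array with an artificial root at index 0) and the lift-anything order are assumptions of this sketch, not the thesis's exact algorithm; the full technique also encodes each lift in the arc label so a postprocessing step can restore the non-projective arc after parsing.

```python
# Minimal projectivization sketch; heads[i] is the head of token i,
# with heads[0] == 0 as the artificial root.

def is_projective(head, dep, heads):
    """True if every token strictly between head and dep descends from head."""
    lo, hi = sorted((head, dep))
    for tok in range(lo + 1, hi):
        a = tok
        while a not in (0, head):  # climb toward the root
            a = heads[a]
        if a != head:
            return False
    return True

def projectivize(heads):
    """Lift dependents of non-projective arcs to their grandparents
    until the whole tree is projective."""
    heads = list(heads)
    changed = True
    while changed:
        changed = False
        for dep in range(1, len(heads)):
            h = heads[dep]
            if h != 0 and not is_projective(h, dep, heads):
                heads[dep] = heads[h]  # the lift: reattach to grandparent
                changed = True
    return heads

# Example: token 1 depends on token 3 across token 2, which does not
# descend from 3 -- a non-projective arc that a single lift repairs.
print(projectivize([0, 3, 0, 2, 2]))  # -> [0, 2, 0, 2, 2]
```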
192

Creation of a customised character recognition application

Sandgren, Frida January 2005
This master’s thesis describes the work of creating a customised optical character recognition (OCR) application, intended for use in the digitisation of theses submitted to Uppsala University in the 18th and 19th centuries. For this purpose, the open-source software Gamera has been used for recognition and classification of the characters in the documents. The software provides specific algorithms for the analysis of heritage documents and is designed to be used as a tool for creating domain-specific (i.e. customised) recognition applications. Using the Gamera classifier training interface, classifier data was created that reflects the characters in these particular theses. The data can then be used for automatic recognition of ‘new’ characters by loading it into one of Gamera’s classifiers. The output of Gamera is a set of classified glyphs (i.e. small images of characters), stored in an XML-based format. However, as OCR typically involves translating images of text into a machine-readable format, a complementary OCR module was needed. For this purpose, an external Gamera module for page segmentation was modified and used. In addition, a script controlling the OCR process was created, which initiates the page segmentation on Gamera-classified glyphs; the result is written to text files. Finally, in a test of recognition accuracy, one of the theses was used to create both training and test data. The results show an average accuracy of 82% and point to the need for a better pre-processing module that removes more noise from the images and recognises different character sizes before the images are run through the OCR process.
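As an illustration, a rough sketch of the recognition step through Gamera's Python interface might look as follows. The calls are taken from Gamera's documented kNN workflow, but the exact names should be verified against the Gamera documentation, and the file names are placeholders.

```python
from gamera.core import init_gamera, load_image
from gamera import knn

init_gamera()

# Load a scanned page and binarize it before segmentation.
image = load_image("thesis_page.tiff").to_onebit()

# Connected-component analysis produces the glyphs to be classified.
glyphs = image.cc_analysis()

# Load classifier data created with Gamera's training interface
# (the XML-based format mentioned above).
classifier = knn.kNNInteractive()
classifier.from_xml_filename("training_data.xml")

# Assign each glyph the class of its nearest neighbours.
classifier.classify_list_automatic(glyphs)

for glyph in glyphs:
    print(glyph.get_main_id())  # e.g. "lower.a"
```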
193

Exploring the Compositionality of German Particle Verbs

Rawein, Carina January 2018
In this thesis we explore the compositionality of German particle verbs using distributional similarity and pre-trained word embeddings. We investigate the compositionality of 100 pairs of particle verbs and their base verbs. The resulting ranking is compared to a ranking based on human compositionality ratings. In our distributional approach we use features such as context window size and content words, and we only include particle verbs with a single word sense. We then compare the distributional approach to a ranking produced with pre-trained word embeddings. While none of the results are statistically significant, we show that word embeddings are not automatically superior to the more traditional distributional approach.
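The core of such a ranking is a similarity score between each particle verb and its base verb, correlated against the human ratings. A minimal sketch, assuming the embedding vectors are already loaded as numpy arrays; the verb pair shown is an illustrative placeholder:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def compositionality_scores(pairs, vectors):
    """pairs: (particle_verb, base_verb) tuples, e.g. ("abholen", "holen");
    vectors: dict mapping each verb to its embedding vector."""
    return [cosine(vectors[pv], vectors[bv]) for pv, bv in pairs]

def correlation_with_humans(scores, human_ratings):
    """Spearman rank correlation between model scores and human ratings."""
    rho, p_value = spearmanr(scores, human_ratings)
    return rho, p_value
```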
194

A Natural Language Interface for Querying Linked Data

Akrin, Christoffer, Tham, Simon January 2020
The thesis introduces a proof-of-concept idea that could spark interest from many industries: a remote Natural Language Interface (NLI) for querying Knowledge Bases (KBs). The system applies natural language technology tools provided by Stanford CoreNLP and queries KBs using the query language SPARQL. Natural Language Processing (NLP) is used to analyze the semantics of a question written in natural language and to generate relational information about the question. With correctly defined relations, the question can be posed against KBs containing relevant Linked Data. The Linked Data follows the Resource Description Framework (RDF) model, expressing relations as semantic triples: subject-predicate-object. With our NLI, any KB can be understood semantically. By providing correct training data, the system can learn the semantics of the RDF data stored in the KB; this makes it possible to extract relational information from questions about the KB, translate the questions to SPARQL, and run the queries against the KB.
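A sketch of the final querying step: once a question such as "Who wrote Hamlet?" has been reduced to a subject-predicate-object pattern, it can be sent to a SPARQL endpoint. The DBpedia endpoint and the dbo:author predicate below are illustrative stand-ins, not the thesis's actual KB.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?author WHERE {
        <http://dbpedia.org/resource/Hamlet> dbo:author ?author .
    }
""")
sparql.setReturnFormat(JSON)

# Each binding is one matched RDF triple; print the object of the triple.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["author"]["value"])
```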
195

Parafrasidentifiering med maskinklassificerad data : utvärdering av olika metoder / Paraphrase identification with computer classified paraphrases : An evaluation of different methods

Johansson, Oskar January 2020
This thesis investigates how the language model BERT and a MaLSTM architecture perform at identifying paraphrases in the Microsoft Paraphrase Research Corpus (MPRC) when trained on automatically identified paraphrases from the Paraphrase Database (PPDB). The methods are compared against each other to determine which performs best, and the approach of training on machine-classified data for use on human-classified data is evaluated against other classifications of the same dataset. The sentence pairs used to train the models are taken from the highest-ranked paraphrases in PPDB, together with non-paraphrases created from the same dataset by a generation method. The results show that BERT is capable of identifying some paraphrases in MPRC, whereas the MaLSTM architecture fails at the task despite being able to distinguish paraphrases from non-paraphrases during training. Both BERT and MaLSTM performed worse at identifying paraphrases in MPRC than models such as StructBERT, which was trained and evaluated on the same dataset. Reasons why MaLSTM fails at the task are discussed; the main one raised is that the sentences in the non-paraphrases of the training data differ too much from each other compared to how they look in MPRC. Finally, the importance of further research on how machine-generated paraphrases can be used in paraphrase-related research is discussed.
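For reference, the defining feature of the MaLSTM architecture is its similarity function: two sentences are encoded by a shared LSTM and compared with the Manhattan similarity exp(-||h_a - h_b||_1), which lies in (0, 1]. A minimal sketch follows; the layer sizes and embedding setup are assumptions, not the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class MaLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        # Use the final hidden state as the sentence representation.
        _, (h, _) = self.lstm(self.embed(token_ids))
        return h[-1]

    def forward(self, ids_a, ids_b):
        h_a, h_b = self.encode(ids_a), self.encode(ids_b)
        # Similarity close to 1 suggests a paraphrase pair.
        return torch.exp(-torch.sum(torch.abs(h_a - h_b), dim=1))
```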
196

Improving BERTScore for Machine Translation Evaluation Through Contrastive Learning

Yousuf, Oreen January 2022
Since the advent of automatic evaluation, tasks within Natural Language Processing (NLP), including Machine Translation, have been able to make better use of both time and labor. More recently, multilingual pre-trained models (MLMs) have increased many languages’ capacity to participate in NLP research. The contextualized representations generated by these MLMs influence several downstream tasks and have inspired practitioners to make better sense of them. We propose the adoption of BERTScore, coupled with contrastive learning, for machine translation evaluation in lieu of BLEU, the industry-leading metric. BERTScore computes a similarity score for each token in a candidate and reference sentence, but it does away with exact matches in favor of token similarity computed from contextual embeddings. We improve BERTScore via contrastive-learning-based fine-tuning of the MLMs. We use contrastive learning to improve BERTScore across different language pairs in both high- and low-resource settings (English-Hausa, English-Chinese), across three models (XLM-R, mBERT, and LaBSE) and across three domains (news, religious, combined). We also investigated the effects of pairing relatively linguistically similar low-resource languages (Somali-Hausa), and of data size, on BERTScore and its Pearson correlation with human judgments. We found that reducing the distance between cross-lingual embeddings via contrastive learning gives BERTScore a substantially greater correlation with system-level human evaluation than BLEU for mBERT and LaBSE in all language pairs in multiple domains.
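A minimal numpy sketch of the BERTScore matching described above: each candidate token is greedily matched to its most similar reference token by cosine similarity of contextual embeddings, and vice versa. Real BERTScore additionally applies IDF weighting and baseline rescaling; the embeddings here are assumed to come from an MLM such as XLM-R.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """cand_emb: (m, d) and ref_emb: (n, d) contextual token embeddings."""
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # (m, n) cosine similarity matrix
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)
```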
197

Object Classification using Language Models

From, Gustav January 2022
In today’s modern digital world, more and more emails and messages must be sent, processed and handled. Categorizing and classifying these texts can take a very long time and costs companies time and money. If the classification could be done automatically by a computer, based on the content of the text or message, it would mean a major gain for Easit AB and its customers. To facilitate text classification, Easit needs a two-part solution consisting of one language model and one classifier model. The language model converts raw text to a vector that represents the text, and the classifier interprets which predefined labels fit that vector. The end goal is not to create the best possible solution, but to build a general understanding of different language and classifier models and of how to design a system that is both fast and accurate. BERT was the primary language model during evaluation, but doc2Vec and one-hot encoding were also tested. The classifiers consisted of boundary-condition models or dense neural networks, all trained without knowledge of which language model the text vectors came from. The validation accuracy on the IMDB comment dataset with BERT was between 75% and 94%, depending mostly on the language model rather than the classifier. Dense neural networks fit best as classifiers, mainly because of their scalability to multiple labels. The knowledge from this work resulted in a recommendation to Easit for an alternative-based system solution.
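A sketch of the two-part design: a language model turns raw ticket text into a vector, and a small dense network maps the vector to a label, trained with no knowledge of which encoder produced the vectors. The model name, example texts and labels are illustrative placeholders.

```python
from sentence_transformers import SentenceTransformer
from sklearn.neural_network import MLPClassifier

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in language model

texts = ["Please reset my password", "The invoice amount is wrong"]
labels = ["account", "billing"]

X = encoder.encode(texts)                  # language model: text -> vector
classifier = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
classifier.fit(X, labels)                  # classifier: vector -> label

print(classifier.predict(encoder.encode(["I cannot log in"])))
```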
198

Algoritm för automatiserad generering av metadata / Algorithm for Automated Generation of Metadata

Karlsson, Fredrik, Berg, Fredrik January 2015
Sveriges Radio stores its data in large archives, which makes it hard to retrieve specific information. The sheer size of the archives makes finding information about a specific event a major problem. Solving this requires a more consistent use of metadata, so an investigation of metadata and keyword generation was carried out. The task was to develop an algorithm that automatically generates keywords from transcribed radio shows. The work also included a survey of previous work to see which systems and algorithms can be used to generate keywords. In addition, an application was developed that suggests ready-made keywords to a user; it was compared with and evaluated against already existing software. The methods used build on both linguistic and statistical algorithms. An analysis of the results showed that the developed application generated many precise keywords, but also a large number of keywords overall. The comparison with an existing program showed that recall was better for the developed application, while precision was better for the existing program.
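On the statistical side, a common baseline for this kind of keyword generation is TF-IDF: words that are frequent in one transcript but rare across the archive rank highest. The sketch below is that baseline, not necessarily the algorithm the thesis settled on, and the corpus argument is a placeholder.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def keywords(transcript, corpus, top_n=10):
    """corpus: list of other transcripts used as background statistics."""
    vectorizer = TfidfVectorizer(max_df=0.8)  # drop words common to most shows
    tfidf = vectorizer.fit_transform(corpus + [transcript])
    scores = tfidf.toarray()[-1]              # weights for the target transcript
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(scores, terms), reverse=True)
    return [term for score, term in ranked[:top_n] if score > 0]
```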
199

Metod för automatiserad sammanfattning och nyckelordsgenerering / Method for automated summary and keyword generator

Björkvall, Dennis, Ploug, Martin January 2016
The company Widespace handles hundreds of tickets per week, and getting into each individual ticket demands a great deal of overview from every employee. Because of this quantity, building that overview becomes a major problem. Solving it requires a more consistent use of metadata, so a literature study on metadata, automated summarization and keyword generation was carried out. The task was to develop a prototype that automatically generates a summary of the text of a ticket, generates a list of keywords, and indicates which language the text is written in. The work also included a survey of earlier work to see which systems and methods can be used for this task. Two prototypes developed in-house, MkOne and MkTwo, were compared with each other and then evaluated. The methods used build on both statistical and linguistic processes. An analysis of the results showed that the MkOne prototype delivered the best summaries and that its keyword list provided keywords of high precision and wide coverage.
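For the summarization step, a classic statistical approach in the Luhn style scores sentences by the frequency of their words and keeps the top-ranked ones in document order. The sketch below is that illustrative baseline only; the actual prototypes also use linguistic processing.

```python
import re
from collections import Counter

def summarize(text, n_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    best = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in best)
```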
200

Methods for increasing cohesion in automatically extracted summaries of Swedish news articles : Using and extending multilingual sentence transformers in the data-processing stage of training BERT models for extractive text summarization / Metoder för att öka kohesionen i automatiskt extraherade sammanfattningar av svenska nyhetsartiklar

Andersson, Elsa January 2022
Developments in deep learning and machine learning at large have created a plethora of opportunities for training automatic text summarization (ATS) models that produce summaries of higher quality. ATS can be split into extractive and abstractive tasks: extractive models select sentences from the original text to form summaries, whereas abstractive models generate novel sentences. While extractive summaries are often preferred over abstractive ones, summaries created by extractive models trained on Swedish texts often lack cohesion, which affects the readability and overall quality of the summary. There is therefore a need to improve the training of ATS models with respect to cohesion, while maintaining other text qualities such as content coverage. This thesis explores and implements methods at the data-processing stage aimed at improving the cohesion of generated summaries. The methods are based on Sentence-BERT, which creates advanced sentence embeddings that can be used to rank the sentences of a text by whether they should be included in the extractive summary. Three models are trained using different methods and evaluated using ROUGE and BERTScore, which measure content coverage, and Coh-Metrix, which measures cohesion. The results of the evaluation suggest that the methods can indeed be used to create more cohesive summaries, although content coverage was reduced, which leaves considerable room for future work on the implementation.
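A simplified sketch of the sentence-ranking idea: embed every sentence with a multilingual Sentence-BERT model and extract the sentences closest to the document centroid. The model name is illustrative, and the thesis's actual method extends this with trained extractive models, so this is a baseline under stated assumptions rather than the implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def extract_summary(sentences, n=3):
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    emb = model.encode(sentences, normalize_embeddings=True)
    centroid = emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = emb @ centroid                 # cosine, since rows are unit norm
    top = sorted(np.argsort(scores)[-n:])   # best n, kept in document order
    return [sentences[i] for i in top]
```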
