621

Keyword Extraction from Swedish Court Documents / Extraktion av nyckelord från svenska rättsdokument

Grosz, Sandra January 2020
This thesis addresses the problem of extracting keywords which represent the rulings and the grounds for the rulings in Swedish court documents. The problem of identifying candidate keywords was divided into two steps: first preprocessing the documents, and second extracting keywords from the preprocessed documents using a keyword extraction algorithm. The preprocessing methods used in conjunction with the keyword extraction algorithms were stop-word removal and stemming. Three different approaches to extracting keywords were then used: a statistical approach, a machine-learning approach, and a graph-based approach. The three approaches were evaluated to measure the quality of the extracted keywords and the rate at which keywords of insufficient quality were rejected. Of the three approaches implemented and evaluated, the graph-based approach showed the most promise. However, the results also showed that none of the three approaches was accurate enough to be used without human supervision.
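The abstract does not spell out the extraction algorithms, so the following is only a minimal sketch of the kind of graph-based approach it describes: a TextRank-style co-occurrence graph ranked with PageRank, preceded by the stop-word and stemming preprocessing mentioned above. NLTK's Swedish resources and the networkx library are assumptions for illustration, not the thesis's actual toolchain.

```python
# Minimal TextRank-style keyword extraction sketch (assumed tooling:
# NLTK Swedish stop words/stemmer, networkx PageRank).
import re
import networkx as nx
from nltk.corpus import stopwords              # nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

def extract_keywords(text, top_n=10, window=3):
    stemmer = SnowballStemmer("swedish")
    stop = set(stopwords.words("swedish"))
    # Preprocess: lowercase, tokenize, drop stop words and numbers, stem.
    tokens = [stemmer.stem(t) for t in re.findall(r"\w+", text.lower())
              if t not in stop and not t.isdigit()]
    # Build a co-occurrence graph over a sliding window of tokens.
    graph = nx.Graph()
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if tok != other:
                graph.add_edge(tok, other)
    # Rank candidate keywords by centrality in the graph.
    ranks = nx.pagerank(graph)
    return sorted(ranks, key=ranks.get, reverse=True)[:top_n]
```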
622

Exploring the Potential of Twitter Data and Natural Language Processing Techniques to Understand the Usage of Parks in Stockholm / Utforska potentialen för användning av Natural Language Processing på Twitter data för att förstå användningen av parker i Stockholm

Norsten, Theodor January 2020
Traditional methods used to investigate the usage of parks consist of questionnaires, a very time- and resource-consuming approach. Today, more than four billion people use some form of social media platform daily. This generates huge amounts of data every day across various platforms and has created a potential new source for retrieving large amounts of data. This report investigates a modern approach, using Natural Language Processing on Twitter data to understand how parks in Stockholm are being used. Natural Language Processing (NLP) is an area within artificial intelligence that refers to the process of reading, analyzing, and understanding large amounts of text data, and it is considered the future of understanding unstructured text. Twitter data were obtained through Twitter's open API. Data on three parks in Stockholm were collected for the period 2015-2019. Three analyses were then performed: temporal analysis, sentiment analysis, and topic modeling. The results of these analyses show that it is possible to understand what attitudes and activities are associated with visiting parks using NLP on social media data. It is clear that sentiment analysis is a difficult task for computers to solve and is still at an early stage of development, and the results of the sentiment analysis carry some uncertainty. To achieve more reliable results, the analysis would need much more data, more thorough cleaning methods, and a basis in English-language tweets. One significant conclusion from the results is that people's attitudes and activities linked to each park are clearly correlated with that park's attributes. Another clear pattern is that park usage peaks significantly during holiday celebrations, and positive sentiment is the emotion most strongly linked with park visits. The findings suggest that future studies focus on combining the approach in this report with geospatial data from a social media platform where users share their geolocation to a greater extent.
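As a rough illustration of the sentiment and temporal analyses, the sketch below labels already-collected tweets and counts sentiment per day. The VADER analyzer is an assumed stand-in (and is English-oriented, echoing the abstract's point about English tweets); the report's actual pipeline and Twitter API client are not reproduced.

```python
# Sentiment + temporal sketch over pre-collected tweets (assumed stand-in:
# NLTK's VADER analyzer; nltk.download("vader_lexicon") needed once).
from collections import Counter
from nltk.sentiment.vader import SentimentIntensityAnalyzer

tweets = [  # toy records standing in for data from Twitter's API
    {"text": "Lovely midsummer picnic in the park!", "date": "2019-06-21"},
    {"text": "The park was crowded and noisy today.", "date": "2019-06-21"},
]

analyzer = SentimentIntensityAnalyzer()
by_day = Counter()
for tweet in tweets:
    score = analyzer.polarity_scores(tweet["text"])["compound"]
    label = ("positive" if score > 0.05
             else "negative" if score < -0.05 else "neutral")
    by_day[(tweet["date"], label)] += 1    # temporal sentiment profile

print(by_day.most_common())
```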
623

Using semantic folding with TextRank for automatic summarization / Semantisk vikning med TextRank för automatisk sammanfattning

Karlsson, Simon January 2017
This master thesis deals with automatic summarization of text and how semantic folding can be used as a similarity measure between sentences in the TextRank algorithm. The method was implemented and compared with two common similarity measures: cosine similarity of tf-idf vectors, and the number of overlapping terms in two sentences. The three methods were implemented, and the linguistic features used in their construction were stop words, part-of-speech filtering, and stemming. Five different part-of-speech filters were used, with different mixtures of nouns, verbs, and adjectives. The three methods were evaluated by summarizing documents from the Document Understanding Conference and comparing the output to gold-standard summaries created by human judges. Comparison between the system summaries and the gold-standard summaries was made with the ROUGE-1 measure. The algorithm with semantic folding performed worst of the three methods, but its F-score was only 0.0096 lower than that of cosine similarity of tf-idf vectors, which performed best. For semantic folding, the average precision was 46.2% and the average recall 45.7% with the best-performing part-of-speech filter.
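For reference, a minimal sketch of the stronger baseline is shown below: cosine similarity of tf-idf vectors plugged into TextRank. Semantic folding itself requires a trained semantic "retina" and is not reproduced; scikit-learn and networkx are assumed tools here, not necessarily the thesis's.

```python
# TextRank extractive summarization with the tf-idf cosine baseline;
# semantic folding would swap in a different sentence-similarity matrix.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences, num_sentences=3):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    sim = cosine_similarity(tfidf)       # sentence-to-sentence similarities
    graph = nx.from_numpy_array(sim)     # weighted sentence graph
    ranks = nx.pagerank(graph)           # TextRank = PageRank over sentences
    top = sorted(ranks, key=ranks.get, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]  # preserve original order
```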
624

Deep Text Mining of Instagram Data Without Strong Supervision / Textutvinning från Instagram utan Precis Övervakning

Hammar, Kim January 2018
With the advent of social media, our online feeds increasingly consist of short, informal, and unstructured text. This data can be analyzed for the purpose of improving user recommendations and detecting trends. The sheer volume of unstructured text that is available makes the intersection of text processing and machine learning a promising avenue of research. Current methods that use machine learning for text processing are in many cases dependent on annotated training data. However, considering the heterogeneity and variability of social media, obtaining strong supervision for social media data is in practice both difficult and expensive. In light of this limitation, a belief that has shaped this thesis is that text mining methods which can be applied without strong supervision are of higher practical interest. This thesis investigates unsupervised methods for scalable processing of text from social media. In particular, the thesis targets a classification and extraction task in the fashion domain on the image-sharing platform Instagram. Instagram is one of the largest social media platforms, containing both text and images. Still, research on text processing in social media is to a large extent limited to Twitter data, and little attention has been paid to text mining of Instagram data. The aim of this thesis is to broaden the scope of state-of-the-art methods for information extraction and text classification to the unsupervised setting, working with informal text on Instagram. Its main contributions are (1) an empirical study of text from Instagram; (2) an evaluation of word embeddings for Instagram text; (3) a distributed implementation of the FastText algorithm; (4) a system for fashion attribute extraction on Instagram using word embeddings; and (5) a multi-label clothing classifier for Instagram text, built with deep learning techniques and minimal supervision. The empirical study demonstrates that the text distribution on Instagram exhibits the long-tail phenomenon, that the text is just as noisy as has been reported in studies of Twitter text, and that comment sections are multilingual. In experiments with word embeddings for Instagram, the importance of hyperparameter tuning is demonstrated and a mismatch between pre-trained embeddings and social media text is observed. Furthermore, word embeddings are confirmed to be a useful asset for information extraction: experimental results show that word embeddings beat a baseline using Levenshtein distance on the task of extracting fashion attributes from Instagram. The results also show that the distributed implementation of FastText reduces the time it takes to train word embeddings by a factor that scales with the number of machines used for training. Finally, our research demonstrates that weak supervision can be used to train a deep classifier, achieving an F1 score of 0.61 on the task of classifying clothes in Instagram posts based only on the associated text, which is on par with human performance.
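The sketch below shows the word-embedding side of this pipeline, using gensim's single-machine FastText as an assumed stand-in for the thesis's distributed implementation; the corpus and hyperparameters are toy placeholders.

```python
# Subword-aware embeddings for noisy Instagram-style text; gensim's
# single-machine FastText stands in for the distributed implementation.
from gensim.models import FastText

corpus = [  # toy tokenized captions/comments
    ["love", "this", "floral", "summer", "dress"],
    ["new", "denim", "jacket", "streetstyle"],
    ["vintage", "leather", "boots", "ootd"],
]

model = FastText(
    sentences=corpus,
    vector_size=50,  # the abstract stresses that such hyperparameters matter
    window=3,
    min_count=1,     # keep rare tokens; Instagram text is long-tailed
    epochs=10,
)

# Subword n-grams yield vectors even for misspellings and hashtag variants,
# supporting nearest-neighbour attribute extraction.
print(model.wv.most_similar("dress", topn=3))
```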
625

Turn of Phrase: Contrastive Pre-Training for Discourse-Aware Conversation Models

Laboulaye, Roland 16 August 2021
Understanding long conversations requires recognizing a discourse flow unique to conversation. Recent advances in unsupervised representation learning of text have been attained primarily through language modeling, which models discourse only implicitly and within a small window. These representations are in turn evaluated chiefly on sentence-pair or paragraph-question-pair benchmarks, which measure only local discourse coherence. In order to improve performance on discourse-reliant, long-conversation tasks, we propose Turn-of-Phrase pre-training, an objective designed to encode long-conversation discourse flow. We leverage tree-structured Reddit conversations in English, selecting, relative to a chosen conversation path through the tree, paths of varying degrees of relatedness. The final utterance of the chosen path is appended to the related paths, and the model learns to identify the most coherent conversation path. We demonstrate that our pre-training objective encodes conversational discourse awareness by improving performance on a dialogue act classification task. We then demonstrate the value of transferring discourse awareness with a comprehensive array of conversation-level classification tasks evaluating persuasion, conflict, and deception.
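A schematic of the contrastive objective, under assumptions: candidate conversation paths are already encoded into vectors, the coherent path sits at index 0, and raw cosine similarities serve as logits. The paper's encoder architecture and Reddit sampling procedure are not reproduced.

```python
# Contrastive path-coherence objective (schematic): score candidate
# conversation paths against the appended final utterance and train with
# cross-entropy; index 0 is the coherent path by construction.
import torch
import torch.nn.functional as F

def contrastive_loss(path_embs, utterance_emb):
    # path_embs: (num_candidates, dim); utterance_emb: (dim,)
    sims = F.cosine_similarity(path_embs, utterance_emb.unsqueeze(0), dim=-1)
    target = torch.zeros(1, dtype=torch.long)   # coherent path index
    return F.cross_entropy(sims.unsqueeze(0), target)

# Toy usage with random stand-ins for encoder outputs.
path_embs = torch.randn(4, 768, requires_grad=True)
loss = contrastive_loss(path_embs, torch.randn(768))
loss.backward()   # gradients would flow into the encoder during training
```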
626

Improving Eligibility Prescreening for Alzheimer’s Disease and Related Dementias Clinical Trials with Natural Language Processing

Idnay, Betina Ross Saldua January 2022
Alzheimer’s disease and related dementias (ADRD) are among the leading causes of disability and mortality in the older population worldwide and a costly public health issue, yet there is still no treatment for prevention or cure. Clinical trials are available, but successful recruitment has been a longstanding challenge. One strategy to improve recruitment is eligibility prescreening, a resource-intensive process in which clinical research staff manually go through electronic health records to identify potentially eligible patients. Natural language processing (NLP), an informatics approach used to extract relevant data from various structured and unstructured data types, may improve eligibility prescreening for ADRD clinical trials. Guided by the Fit between Individuals, Task, and Technology framework, this dissertation research aims to optimize eligibility prescreening for ADRD clinical research by evaluating the sociotechnical factors influencing the adoption of NLP-driven tools. A systematic review of the literature was conducted to identify NLP systems that have been used for eligibility prescreening in clinical research. Following this, three NLP-driven tools were evaluated for ADRD clinical research eligibility prescreening: Criteria2Query, i2b2, and Leaf. We conducted an iterative mixed-methods usability evaluation with twenty clinical research staff using a cognitive walkthrough with a think-aloud protocol, the Post-Study System Usability Questionnaire, and a directed deductive content analysis. Moreover, we conducted a cognitive task analysis with sixty clinical research staff to assess the impact of cognitive complexity on the usability of NLP systems and to identify the sociotechnical gaps and cognitive support needed in using NLP systems for ADRD clinical research eligibility prescreening. The results show that understanding the role of NLP systems in improving eligibility prescreening is critical to the advancement of clinical research recruitment. All three systems are generally usable and accepted by a group of clinical research staff. The cognitive walkthrough with a think-aloud protocol informed iterative system refinement, resulting in high system usability. Cognitive complexity has no significant effect on system usability; however, the system, the order of evaluation, job position, and computer literacy are associated with system usability. Key recommendations for system development and implementation include improving system intuitiveness and the overall user experience through comprehensive consideration of user needs and task-completion requirements, and implementing focused training on database querying to improve clinical research staff’s aptitude in eligibility prescreening and advance workforce competency. Finally, this study contributes to our understanding of how clinical research staff conduct electronic eligibility prescreening for ADRD clinical research. The findings highlight the importance of leveraging human-computer collaboration in eligibility prescreening with NLP-driven tools, which provide an opportunity to identify and enroll participants of diverse backgrounds who are eligible for ADRD clinical research and to accelerate treatment development.
627

Syntax-based Concept Extraction For Question Answering

Glinos, Demetrios 01 January 2006
Question answering (QA) stands squarely along the path from document retrieval to text understanding. As an area of research interest, it serves as a proving ground where strategies for document processing, knowledge representation, question analysis, and answer extraction may be evaluated in real world information extraction contexts. The task is to go beyond the representation of text documents as "bags of words" or data blobs that can be scanned for keyword combinations and word collocations in the manner of internet search engines. Instead, the goal is to recognize and extract the semantic content of the text, and to organize it in a manner that supports reasoning about the concepts represented. The issue presented is how to obtain and query such a structure without either a predefined set of concepts or a predefined set of relationships among concepts. This research investigates a means for acquiring from text documents both the underlying concepts and their interrelationships. Specifically, a syntax-based formalism for representing atomic propositions that are extracted from text documents is presented, together with a method for constructing a network of concept nodes for indexing such logical forms based on the discourse entities they contain. It is shown that meaningful questions can be decomposed into Boolean combinations of question patterns using the same formalism, with free variables representing the desired answers. It is further shown that this formalism can be used for robust question answering using the concept network and WordNet synonym, hypernym, hyponym, and antonym relationships. This formalism was implemented in the Semantic Extractor (SEMEX) research tool and was tested against the factoid questions from the 2005 Text Retrieval Conference (TREC), which operated upon the AQUAINT corpus of newswire documents. After adjusting for the limitations of the tool and the document set, correct answers were found for approximately fifty percent of the questions analyzed, which compares favorably with other question answering systems.
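A minimal sketch of the WordNet relations the abstract mentions, via NLTK; SEMEX's logical-form and concept-network machinery is not reproduced here.

```python
# WordNet synonym/hypernym/hyponym/antonym lookup of the kind used to relax
# matching between question terms and concept nodes (NLTK assumed;
# nltk.download("wordnet") needed once).
from nltk.corpus import wordnet as wn

def related_terms(word):
    terms = {"synonyms": set(), "hypernyms": set(),
             "hyponyms": set(), "antonyms": set()}
    for synset in wn.synsets(word):
        terms["synonyms"].update(l.name() for l in synset.lemmas())
        terms["hypernyms"].update(l.name() for h in synset.hypernyms()
                                  for l in h.lemmas())
        terms["hyponyms"].update(l.name() for h in synset.hyponyms()
                                 for l in h.lemmas())
        for lemma in synset.lemmas():
            terms["antonyms"].update(a.name() for a in lemma.antonyms())
    return terms

print(related_terms("ruling")["hypernyms"])
```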
628

The Hermeneutics Of The Hard Drive: Using Narratology, Natural Language Processing, And Knowledge Management To Improve The Effectiveness Of The Digital Forensic Process

Pollitt, Mark 01 January 2013
In order to protect the safety of our citizens and to ensure a civil society, we ask our law enforcement, judiciary, and intelligence agencies, under the rule of law, to seek probative information which can be acted upon for the common good. This information may be used in court to prosecute criminals, or it can be used to conduct offensive or defensive operations to protect our national security. As the citizens of the world store more and more information in digital form, and as they live an ever-greater portion of their lives online, law enforcement, the judiciary, and the Intelligence Community will continue to struggle with finding, extracting, and understanding the data stored on computers. But this trend also affords greater opportunity for law enforcement. This dissertation describes how several disparate approaches (knowledge management, content analysis, narratology, and natural language processing) can be combined in an interdisciplinary way to address the growing difficulty of developing useful, actionable intelligence from the ever-increasing corpus of digital evidence. After exploring how these techniques might apply to the digital forensic process, I will suggest two new theoretical constructs, the Hermeneutic Theory of Digital Forensics and the Narrative Theory of Digital Forensics, linking existing theories of forensic science, knowledge management, content analysis, narratology, and natural language processing in order to identify and extract narratives from digital evidence. An experimental approach will be described and prototyped. The results of these experiments demonstrate the potential of natural language processing techniques in digital forensics.
629

Enhanced Content-Based Fake News Detection Methods with Context-Labeled News Sources

Arnfield, Duncan 01 December 2023
This work examined the relative effectiveness of multilayer perceptron, random forest, and multinomial naïve Bayes classifiers, trained using bag-of-words and term frequency-inverse document frequency (tf-idf) transformations of documents in the Fake News Corpus and the Fake and Real News Dataset. The goal of this work was to help meet the formidable challenges posed by the proliferation of fake news to society, including the erosion of public trust, disruption of social harmony, and endangerment of lives. The training included the use of context-categorized fake news in an effort to enhance the tools’ effectiveness. It was found that tf-idf provided more accurate results than bag of words across all evaluation metrics for identifying fake news instances, and that the Fake News Corpus yielded much higher result metrics than the Fake and Real News Dataset. In comparison to state-of-the-art methods, the models performed as expected.
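A minimal sketch of one feature/classifier pairing from the comparison (tf-idf with multinomial naive Bayes) using scikit-learn; the toy strings stand in for documents from the two named datasets.

```python
# One pairing from the comparison: tf-idf features + multinomial naive
# Bayes. Swapping TfidfVectorizer for CountVectorizer gives the
# bag-of-words variant; MLPClassifier/RandomForestClassifier the others.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [  # toy stand-ins for Fake News Corpus documents
    "City council approves new transit budget after public hearing",
    "SHOCKING: celebrity secretly a lizard, anonymous sources say",
    "Study finds moderate exercise improves sleep quality",
    "Miracle cure the government does not want you to know about",
]
labels = ["real", "fake", "real", "fake"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["New study says this common food is deadly"]))
```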
630

TSPOONS: Tracking Salience Profiles of Online News Stories

Paterson, Kimberly Laurel 01 June 2014
News space is a relatively nebulous term that describes the general discourse concerning events that affect the populace. Past research has focused on qualitatively analyzing news space in an attempt to answer big questions about how the populace relates to the news and how they respond to it. We want to ask: when do stories begin? Which stories stand out among the noise? In order to answer the big questions about news space, we need to track the course of individual stories in the news. By analyzing the specific articles that comprise stories, we can synthesize the information gained from several stories to see a more complete picture of the discourse. The individual articles, the groups of articles that become stories, and the overall themes that connect stories together all complete the narrative about what is happening in society. TSPOONS provides a framework for analyzing news stories and answering two main questions: what were the important stories during some time frame, and what were the important stories involving some topic? Drawing technical news stories from Techmeme.com, TSPOONS generates profiles of each news story, quantitatively measuring the importance, or salience, of news stories as well as quantifying the impact of these stories over time.
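One plausible reading of a salience profile, sketched under assumptions: articles are already grouped into stories, and per-day coverage volume serves as a crude salience proxy. TSPOONS's actual salience measure is not reproduced.

```python
# Salience-profile sketch: per-day article counts per story as a crude
# proxy for salience over time (story grouping assumed already done).
from collections import defaultdict

articles = [  # toy records standing in for Techmeme.com stories
    {"story": "apple-wwdc", "date": "2014-06-02"},
    {"story": "apple-wwdc", "date": "2014-06-02"},
    {"story": "apple-wwdc", "date": "2014-06-03"},
    {"story": "net-neutrality", "date": "2014-06-02"},
]

profiles = defaultdict(lambda: defaultdict(int))
for a in articles:
    profiles[a["story"]][a["date"]] += 1      # daily coverage per story

for story, days in profiles.items():
    peak = max(days, key=days.get)            # day of peak salience
    print(story, "peak:", peak, "total articles:", sum(days.values()))
```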
