241

A Rule-Based Normalization System for Greek Noisy User-Generated Text

Toska, Marsida January 2020
The ever-growing usage of social media platforms generates vast amounts of textual data daily, which could potentially serve as a great source of information. Mining user-generated data for commercial, academic, or other purposes has therefore already attracted the interest of the research community. However, the informal writing that often characterizes online user-generated texts poses a challenge for automatic text processing with Natural Language Processing (NLP) tools. To mitigate the effect of noise in these texts, lexical normalization has been proposed as a preprocessing step: in short, the task of converting non-standard word forms into their canonical form. The present work contributes to this field by developing a rule-based normalization system for Greek tweets. We analyze the categories of out-of-vocabulary (OOV) word forms identified in the dataset and define hand-crafted rules, which we combine with edit distance (a Levenshtein distance approach) to tackle noise in the cases under scope. To evaluate the system we perform both an intrinsic and an extrinsic evaluation, the latter exploring the effect of normalization on part-of-speech tagging. The results of the intrinsic evaluation suggest that our system reaches an accuracy of approx. 95%, compared to approx. 81% for the baseline. In the extrinsic evaluation, we observe a boost of approx. 8% in tagging performance when the text has been preprocessed through lexical normalization.
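As an illustration of the approach this abstract describes (hand-crafted rules backed by a Levenshtein-distance fallback against a lexicon), here is a minimal Python sketch; the rules, lexicon, and distance threshold are toy assumptions, not the thesis's actual resources:

```python
# Minimal sketch: rule lookup first, Levenshtein fallback second.
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

RULES = {"oooxi": "oxi"}              # hypothetical hand-crafted rule
LEXICON = {"oxi", "nai", "kala"}      # toy in-vocabulary word list

def normalize(token: str) -> str:
    if token in LEXICON:
        return token                  # already in-vocabulary
    if token in RULES:
        return RULES[token]           # rule-based correction
    # Fall back to the closest lexicon entry within a small edit budget.
    best = min(LEXICON, key=lambda w: levenshtein(token, w))
    return best if levenshtein(token, best) <= 2 else token

print(normalize("ooxi"))  # -> "oxi"
```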
242

Rättssäker Textanalys

Svensson, Henrik, Lindqvist, Kalle January 2019
Natural language processing is a research area in which new advances are constantly being made. A significant portion of the text analysis that takes place in this field aims at a satisfactory application in the dialogue between human and computer. In this study, we instead want to focus on the impact natural language processing can have on the human learning process. At the same time, our practical test area has a future bearing on one of the most basic preconditions for a legally secure society, namely police report writing. By creating a theoretical foundation of ideas that combines important aspects of natural language processing and official police report writing, and then implementing them in an educational web platform intended for police students, we are of the opinion that our research adds something new to the computer science and sociological fields. The purpose of this work is to act as the first steps towards a web application that supports Swedish police documentation.
243

Automatic Extraction of Narrative Structure from Long Form Text

Eisenberg, Joshua Daniel 02 November 2018
Automatic understanding of stories is a long-time goal of the artificial intelligence and natural language processing research communities. Stories literally explain the human experience. Understanding our stories promotes the understanding of both individuals and groups of people: various cultures, societies, families, organizations, governments, and corporations, to name a few. People use stories to share information. Stories are told by narrators in linguistic bundles of words called narratives. My work has given computers awareness of narrative structure, specifically where the boundaries of a narrative lie in a text. This is the task of determining where a narrative begins and ends, a non-trivial task because people rarely tell one story at a time. People don't announce when they are starting or stopping their stories: they interrupt each other; they tell stories within stories. Before my work, computers had no awareness of narrative boundaries, essentially where stories begin and end. My programs can extract narrative boundaries from novels and short stories with an F1 of 0.65. Before this I worked on teaching computers to identify which paragraphs of text have story content, with an F1 of 0.75 (which is state of the art). Additionally, I have taught computers to identify the narrative point of view (POV; how the narrator identifies themselves) and diegesis (how involved in the story's action the narrator is), with an F1 of over 0.90 for both narrative characteristics. For the narrative POV, diegesis, and narrative level extractors I ran annotation studies, with high agreement, that allowed me to teach computational models to identify structural elements of narrative through supervised machine learning. My work has given computers the ability to find where stories begin and end in raw text. This allows for further automatic analysis, such as extraction of plot, intent, event causality, and event coreference; these tasks are impossible when the computer can't distinguish which stories are told in which spans of text. There are two key contributions in my work: 1) the identification of features that accurately extract elements of narrative structure, and 2) the gold-standard data and reports generated from running annotation studies on identifying narrative structure.
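A minimal sketch of the supervised setup this abstract describes (classifying paragraphs by story content and scoring with F1); the features and toy data below are illustrative stand-ins, not the thesis's feature set:

```python
# Toy paragraph-level story/non-story classifier with an F1 report.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

paragraphs = [
    "Once upon a time a fox crossed the river and met an old crow.",
    "The appendix lists all abbreviations used throughout this report.",
    "She grabbed the rope, jumped, and landed on the moving train.",
    "Table 3 summarizes quarterly revenue by region.",
]
labels = [1, 0, 1, 0]  # 1 = paragraph contains story content

vec = TfidfVectorizer()
X = vec.fit_transform(paragraphs)
clf = LogisticRegression().fit(X, labels)

pred = clf.predict(X)
# Trivially high on training data; shown only to illustrate the metric.
print(f1_score(labels, pred))
```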
244

Pre-training a knowledge enhanced model in biomedical domain for information extraction

Yan, Xi January 2022
While recent years have seen a rise of research in knowledge-graph-enriched pre-trained language models (PLMs), few studies have tried to transfer this work to the biomedical domain. This thesis is a first attempt to pre-train a large-scale biological knowledge-enriched language model (KPLM). Building on the framework of CoLAKE (T. Sun et al., 2020), a general-purpose KPLM, this study pre-trains on PubMed abstracts (a large-scale corpus of medical text) and BIKG (AstraZeneca's biological knowledge graph). We first obtain abstracts from PubMed together with their entity-linking results, then connect the entities in the abstracts to BIKG to form sub-graphs. These sub-graphs and the sentences from the PubMed abstracts are fed to the CoLAKE model for pre-training. By training on three objectives (masking word nodes, masking entity nodes, and masking relation nodes), this research aims not only to enhance the model's capacity for modeling natural language but also to infuse in-depth knowledge. The model is then fine-tuned on named entity recognition (NER) and relation extraction tasks over three benchmark datasets: ChemProt (Kringelum et al., 2016), DrugProt (from the drug-protein/gene interaction text-mining shared task), and DDI (Segura-Bedmar et al., 2013). Empirical results show that the model outperforms state-of-the-art models on the relation extraction task on the DDI dataset, with an F1 score of 91.2%. On DrugProt and ChemProt, the model also shows improvement over the SciBERT baseline.
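The three masking objectives can be illustrated schematically; the node types and [MASK] convention below are assumptions for exposition, not CoLAKE's actual implementation:

```python
# Schematic masking of word, entity, and relation nodes in a word-knowledge graph.
import random

# Each node in the unified word-knowledge graph carries a type tag.
nodes = [
    ("aspirin", "entity"), ("inhibits", "relation"), ("COX-1", "entity"),
    ("reduces", "word"), ("inflammation", "word"),
]

def mask_nodes(nodes, node_type, rate=0.15, seed=0):
    """Replace a fraction of nodes of the given type with a [MASK] placeholder."""
    rng = random.Random(seed)
    return [("[MASK]", t) if t == node_type and rng.random() < rate else (tok, t)
            for tok, t in nodes]

# One pre-training objective per node type: the model must recover the masked nodes.
for objective in ("word", "entity", "relation"):
    print(objective, mask_nodes(nodes, objective, rate=0.5))
```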
245

A STUDY OF TRANSFORMER MODELS FOR EMOTION CLASSIFICATION IN INFORMAL TEXT

Alvaro S Esperanca (11797112) 07 January 2022
Textual emotion classification is a task in affective AI that branches from sentiment analysis and focuses on identifying emotions expressed in a given text excerpt. It has a wide variety of applications that improve human-computer interactions, particularly by empowering computers to better understand subjective human language. Significant research has been done on this task, but very little of it leverages one of the most emotion-bearing symbols used in modern communication: emojis. In this thesis, we propose several transformer-based models for emotion classification that process emojis as input tokens and leverage pretrained models, among them ReferEmo, a model that processes emojis as textual inputs and leverages DeepMoji to generate affective feature vectors used as reference when aggregating different modalities of text encoding. To evaluate ReferEmo, we experimented on the SemEval 2018 and GoEmotions datasets, two benchmark datasets for emotion classification, and achieved competitive performance compared to state-of-the-art models tested on these datasets. Notably, our model performs better on the underrepresented classes of each dataset.
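To make the "emojis as input tokens" idea concrete, here is a hedged sketch using the Hugging Face transformers API; the model name, label count, and emoji list are illustrative assumptions, and ReferEmo's DeepMoji feature aggregation is omitted:

```python
# Register emojis as atomic tokens so they are not split or dropped.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=6)  # e.g. six emotion classes

tokenizer.add_tokens(["😂", "😭", "❤️"])        # emojis become single tokens
model.resize_token_embeddings(len(tokenizer))  # grow embeddings to match

print(tokenizer.tokenize("so happy today 😂❤️"))
```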
246

Password habits of Sweden

Gustafsson, Daniel January 2023
The password is the first line of defence in most modern web services; it is therefore critical to choose a strong one. Many previous studies have found patterns that could be improved in global users' password creation, but none have examined the patterns of Swedish users in particular. In this project, passwords of Swedish users were gathered from underground forums and analyzed to determine whether Swedish users create passwords differently from global users and whether there are any weak patterns in their passwords. We found that Swedish users often use words or names found in a Swedish NLP corpus in their passwords, and that they use lowercase letters more frequently than global users. We also found that several of the most popular Swedish websites use weak password policies, which might contribute to Swedish users choosing weak passwords.
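A minimal sketch of the kind of corpus-based pattern check this abstract describes; the word list and passwords below are toy placeholders, not the study's data:

```python
# Check leaked passwords against a Swedish word list and measure lowercase usage.
swedish_words = {"sommar", "hej", "stockholm", "katt"}  # stand-in for an NLP corpus
passwords = ["sommar2023", "Hej123!", "qwerty", "katt99"]

contains_word = sum(any(w in p.lower() for w in swedish_words) for p in passwords)
all_lowercase_letters = sum(
    all(c.islower() for c in p if c.isalpha()) for p in passwords)

print(f"{contains_word}/{len(passwords)} contain a Swedish corpus word")
print(f"{all_lowercase_letters}/{len(passwords)} use only lowercase letters")
```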
247

Automated Extraction of Insurance Policy Information : Natural Language Processing techniques to automate the process of extracting information about the insurance coverage from unstructured insurance policy documents.

Hedberg, Jacob, Furberg, Erik January 2023
This thesis investigates Natural Language Processing (NLP) techniques to extract relevant information from long and unstructured insurance policy documents. The goal is to reduce the time readers need to understand the coverage within the documents. The study uses predefined insurance policy coverage parameters, created by industry experts, to represent what is covered in the policy documents. Three NLP approaches are used to classify text sequences into insurance parameter classes. The thesis shows that using SBERT to create vector representations of text, enabling cosine similarity calculations, is an effective approach. The top-scoring sequences for each parameter are assigned that parameter class. This approach significantly reduces the number of sequences a user needs to read but misclassifies some positive examples. To improve the model, the parameter definitions and training data were combined into a support set. Similarity scores were calculated between all sequences and the support sets for each parameter using different pooling strategies. This few-shot classification approach performed well for the use case, improving the model's performance significantly. In conclusion, this thesis demonstrates that NLP techniques can be applied to help understand unstructured insurance policy documents. The model developed in this study can be used to extract important information and reduce the time needed to understand the contents of an insurance policy document. A human expert would, however, still be required to interpret the extracted text. The balance between the amount of relevant information and the amount of text shown depends on how many of the top-scoring sequences are classified for each parameter. The study also identifies some limitations of the approach depending on available data. Overall, this research provides insight into the potential implications of NLP techniques for information extraction and the insurance industry.
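A sketch of the SBERT-plus-cosine-similarity step, using the sentence-transformers library; the model name, parameter definitions, and policy sequences are illustrative assumptions, not the thesis's data:

```python
# Score policy text sequences against coverage-parameter definitions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical coverage-parameter definitions acting as class anchors.
parameters = {
    "fire_damage": "Coverage for damage caused by fire to the insured property.",
    "water_damage": "Coverage for damage caused by water leakage or flooding.",
}
sequences = [
    "The insurer compensates losses resulting from fire in the premises.",
    "This section governs the notice period for contract termination.",
]

param_emb = model.encode(list(parameters.values()), convert_to_tensor=True)
seq_emb = model.encode(sequences, convert_to_tensor=True)

scores = util.cos_sim(seq_emb, param_emb)  # rows: sequences, columns: parameters
for seq, row in zip(sequences, scores):
    best = int(row.argmax())
    print(f"{list(parameters)[best]} ({row[best].item():.2f}): {seq[:50]}...")
```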
248

Text Classification using the Teacher-Student Chatroom Corpus / Text klassificering med Teacher-Student Chatroom Corpus

Österberg, Marcus January 2023
Advancements in Artificial Intelligence, especially in the field of natural language processing, have opened new possibilities for educational chatbots. One of these is a chatbot that can simulate a conversation between teacher and student for continuous learner support. In an up-scaled learning environment, teachers have less time to interact with each student individually; a resource for practicing interactions with students could help alleviate this issue. In this thesis, we present a machine-learning model combined with a heuristic approach used in the creation of a chatbot. The machine-learning model learns language understanding from prebuilt language representations, which are fine-tuned on teacher-student conversations. The heuristic compares responses and picks the highest score for response retrieval. A data quality analysis is also performed on the teacher-student conversation dataset. In terms of results, the bert-base-cased language model performed best for text classification, with a weighted F1-score of 0.70. The dataset used for the machine-learning model showed consistency and completeness issues regarding labelling. The Technology Acceptance Model was used to evaluate the solution; this evaluation shows a high perceived ease of use but a low perceived usefulness of the present solution. The thesis contributes the innovative TUM (topic understanding model), an educational chatbot, and an evaluation of the Teacher-Student Chatroom Corpus regarding its use for text classification.
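A minimal sketch of the retrieval heuristic this abstract describes (compare candidate responses, pick the highest score); the topics, candidates, and scoring scheme are illustrative assumptions:

```python
# Score candidate teacher responses against the predicted topic; pick the best.
candidates = [
    {"text": "Great, can you use that word in a sentence?", "topic": "vocabulary"},
    {"text": "Remember that the past tense of 'go' is 'went'.", "topic": "grammar"},
    {"text": "Let's move on to the next exercise.", "topic": "general"},
]

def retrieve(predicted_topic: str, confidence: float) -> str:
    def score(c):
        # Prefer on-topic responses; a small bonus keeps a general fallback
        # available when the classifier's confidence is low.
        topic_score = confidence if c["topic"] == predicted_topic else 0.0
        fallback = 0.1 if c["topic"] == "general" else 0.0
        return topic_score + fallback
    return max(candidates, key=score)["text"]

print(retrieve("grammar", confidence=0.82))  # -> the grammar response
```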
249

Revisiting Item Semantics in Measurement: A New Perspective Using Modern Natural Language Processing Embedding Techniques

Guo, Feng 11 August 2023
No description available.
250

Controllable sentence simplification in Swedish : Automatic simplification of sentences using control prefixes and mined Swedish paraphrases

Monsen, Julius January 2023
The ability to read and comprehend text is essential in everyday life. Some people, including individuals with dyslexia and cognitive disabilities, may experience difficulties with this. Thus, it is important to make textual information accessible to diverse target audiences. Automatic Text Simplification (ATS) techniques aim to reduce the linguistic complexity in texts to facilitate readability and comprehension. However, existing ATS systems often lack customization to specific user needs, and simplification data for languages other than English is limited. This thesis addressed ATS in a Swedish context, building upon novel methods that provide more control over the simplification generation process, enabling user customization. A dataset of Swedish paraphrases was mined from a large amount of text data. ATS models were then trained on this dataset utilizing prefix-tuning with control prefixes. Two sets of text attributes and their effects on performance were explored for controlling the generation. The first had been used in previous research, and the second was extracted in a data-driven way from existing text complexity measures. The trained ATS models for Swedish and additional models for English were evaluated and compared using SARI and BLEU metrics. The results for the English models were consistent with results from previous research using controllable generation mechanisms, although slightly lower. The Swedish models provided significant improvements over the baseline, in the form of a fine-tuned BART model, and compared to previous Swedish ATS results. These results highlight the efficiency of using paraphrase data paired with controllable generation mechanisms for simplification. Furthermore, the different sets of attributes provided very similar results, pointing to the fact that both these sets of attributes manage to capture aspects of simplification. The process of mining paraphrases, selecting control attributes and other methodological implications are discussed, leading to suggestions for future research.
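The control-attribute idea can be sketched with plain control tokens; note that the thesis learns continuous control prefixes via prefix-tuning, so the bucketed token below is a simplified stand-in, and the attribute name and example sentences are assumptions:

```python
# Prepend a quantized text attribute as a control token on training pairs,
# so the model learns to condition its simplification on the requested value.
def add_control_tokens(source: str, target: str) -> str:
    # Compression ratio between target and source lengths, bucketed to 0.05.
    ratio = round(len(target) / len(source) / 0.05) * 0.05
    # Other attributes (e.g. word rank, dependency depth) could be added the same way.
    return f"<NbChars_{ratio:.2f}> {source}"

src = "Myndigheten ålägger fastighetsägaren att inkomma med yttrande."
tgt = "Ägaren ska svara myndigheten."
print(add_control_tokens(src, tgt))
# At inference time, the user picks the control values to steer the output.
```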
