151 |
Argument Mining: Claim Annotation, Identification, Verification
Karamolegkou, Antonia, January 2021
Researchers writing scientific articles summarize their work in abstracts that state the final outcome of their study. Argumentation mining can be used to extract the researchers' claim as well as the evidence that could support it. The rapid growth in the number of scientific articles demands automated tools that can help detect scientific claims and evaluate their veracity. However, there are neither many studies focusing on claim identification and verification nor many annotated corpora available to train deep learning models effectively. For this reason, we annotated two argument mining corpora and performed several experiments with state-of-the-art BERT-based models, aiming to identify and verify scientific claims. We find that SciBERT provides the best results regardless of the dataset. Furthermore, increasing the amount of training data can improve the performance of every model we used. These findings highlight the need for large-scale argument mining corpora, as well as domain-specific pre-trained models.
|
152 |
Named-entity recognition with BERT for anonymization of medical records
Bridal, Olle, January 2021
Sharing data is an important part of the progress of science in many fields. In natural language processing, a field largely dominated by deep learning, textual resources are in high demand. In certain domains, such as that of medical records, the sharing of data is limited by ethical and legal restrictions and therefore requires anonymization. Manual anonymization is tedious and expensive, so automated anonymization is of great value. Since medical records consist of unstructured text, pieces of sensitive information have to be identified in order to be masked for anonymization. Named-entity recognition (NER) is the subtask of information extraction in which named entities, such as person names or locations, are identified and categorized. Recently, models that leverage unsupervised training on large quantities of unlabeled data have performed impressively on the NER task, which shows promise for their use in anonymization. In this study, a small set of medical records was annotated with named-entity tags. Because no training data was available, a BERT model already fine-tuned for NER was evaluated on this annotated set. The aim was to find out how well the model would perform NER on medical records, and to explore the possibility of using the model to anonymize medical records. The most positive result was that the model was able to identify all person names in the dataset. The average accuracy for identifying all entity types was, however, relatively low. The discussion notes that the success in identifying person names shows promise for the model's application to anonymization. However, because the overall accuracy is significantly worse than that of models fine-tuned on domain-specific data, it is suggested that there might be better methods for anonymization in the absence of relevant training data.
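The masking step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes an NER model has already returned character spans with labels; the record text, the spans, and the function name `mask_entities` are all invented for illustration and are not taken from the thesis.

```python
from typing import List, Tuple

# Each entity is (start, end, label) with character offsets into the text.
# These spans are hand-written stand-ins for the output of an NER model.
Entity = Tuple[int, int, str]


def mask_entities(text: str, entities: List[Entity]) -> str:
    """Replace every entity span with a placeholder such as [PER]."""
    masked = text
    # Work from the end of the text so that earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


# Invented example record; not real patient data.
record = "Anna Berg was admitted in Uppsala."
spans = [(0, 9, "PER"), (26, 33, "LOC")]
print(mask_entities(record, spans))  # [PER] was admitted in [LOC].
```

Any entity the model misses stays in clear text, which is why per-entity recall matters more than average accuracy for this use.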
|
153 |
Improving Solr search with Natural Language Processing: An NLP implementation for information retrieval in Solr / Att förbättra Solr med Natural Language Processing
Lager, Adam, January 2021
The field of AI is emerging fast, and institutions and companies are pushing the limits of what is possible. Natural Language Processing (NLP) is a branch of AI where the goal is to understand human speech and/or text. In this work, the technology is used to improve an inverted index, the full-text search engine Solr. Solr is open source and has integrated OpenNLP, making it a suitable choice for these kinds of operations. NLP-enabled Solr showed great results compared to the Solr that is currently running on the systems: although NLP-Solr was slightly worse in terms of precision, it excelled at recall and at returning the correct documents.
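To make the terms in the abstract above concrete, here is a minimal sketch of an inverted index together with precision and recall. It is a toy in plain Python, not Solr's implementation; the documents, the query, and the relevance judgment are invented.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased token to the ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents containing any query term (OR semantics)."""
    hits: Set[int] = set()
    for token in query.lower().split():
        hits |= index.get(token, set())
    return hits


def precision_recall(retrieved: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Precision = correct hits / retrieved; recall = correct hits / relevant."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Invented toy collection; document 1 is assumed to be the only relevant one.
docs = {
    1: "solr search engine",
    2: "language processing in search",
    3: "open source tools",
}
index = build_inverted_index(docs)
retrieved = search(index, "search engine")
precision, recall = precision_recall(retrieved, relevant={1})
print(retrieved, precision, recall)  # {1, 2} 0.5 1.0
```

The toy query shows the trade-off the abstract reports: matching more broadly returns every relevant document (high recall) at the cost of some irrelevant ones (lower precision).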
|
154 |
Applied Machine Learning for Online Education
Serena Alexis Nicoll (12476796), 28 April 2022
We consider the problem of developing innovative machine learning tools for online education and evaluate their ability to provide instructional resources. Prediction tasks for student behavior are a complex problem spanning a wide range of topics: we complement current research in student grade prediction and clickstream analysis by considering data from three areas of online learning: Social Learning Networks (SLN), Instructor Feedback, and Learning Management Systems (LMS). In each of these categories, we propose a novel method for modelling data and an associated tool that may be used to assist students and instructors. First, we develop a methodology for analyzing instructor-provided feedback and determining how it correlates with changes in student grades using NLP and NER-based feature extraction. We demonstrate that student grade improvement can be well approximated by a multivariate linear model with average fits across course sections approaching 83%, and determine several contributors to student success. Additionally, we develop a series of link prediction methodologies that utilize spatial and time-evolving network architectures to pass network state between space and time periods. Through evaluation on six real-world datasets, we find that our method obtains substantial improvements over Bayesian models, linear classifiers, and an unsupervised baseline, with AUCs typically above 0.75 and reaching 0.99. Motivated by Federated Learning, we extend our model of student discussion forums to model an entire classroom as an SLN. We develop a methodology to represent student actions across different course materials in a shared, low-dimensional space that allows characteristics from actions of different types to be passed jointly to a downstream task. Performance comparisons against several baselines in centralized, federated, and personalized learning demonstrate that our model offers more distinctive representations of students in a low-dimensional space, which in turn results in improved accuracy on a common downstream prediction task. Results from these three research thrusts indicate the ability of machine learning methods to accurately model student behavior across multiple data types and suggest their ability to benefit students and instructors alike through future development of assistive tools.
|
155 |
Multi-Layer Web Services Discovery using Word Embedding and Clustering Techniques
Obidallah, Waeal, 25 February 2021
Web services discovery is the process of finding the right Web services that best match the end-users’ functional and non-functional requirements. Artificial intelligence, natural language processing, data mining, and text mining techniques have been applied by researchers in Web services discovery to facilitate the process of matchmaking. This thesis contributes to the area of Web services discovery and recommendation, adopting the Design Science Research Methodology to guide the development of useful knowledge, including design theory and artifacts.
The lack of a comprehensive review of Web services discovery and recommendation in the literature motivated us to conduct a systematic literature review. Our main purpose in conducting the systematic literature review was to identify and systematically compare current clustering and association rules techniques for Web services discovery and recommendation by providing answers to various research questions, investigating the prior knowledge, and identifying gaps in the related literature.
We then propose a conceptual model and a typology of Web services discovery systems. The conceptual model provides a high-level representation of Web services discovery systems, including their various elements, tasks, and relationships. The proposed typology of Web services discovery systems is composed of five groups of characteristics: storage and location characteristics, formalization characteristics, matchmaking characteristics, automation characteristics, and selection characteristics. We reference the typology to compare Web services discovery methods and architectures from the extant literature by linking them to the five proposed characteristics.
We employ the proposed conceptual model with its specified characteristics to design and develop the multi-layer data mining architecture for Web services discovery using word embedding and clustering techniques. The proposed architecture consists of five layers: Web services description and data preprocessing; word embedding and representation; syntactic similarity; semantic similarity; and clustering. In the first layer, we identify the steps to parse and preprocess the Web services documents. Bag of Words with Term Frequency–Inverse Document Frequency and three word-embedding models are employed for Web services representation in the second layer. Then in the third layer, four distance measures, including Cosine, Euclidean, Minkowski, and Word Mover, are studied to find the similarities between Web services documents. In layer four, WordNet and Normalized Google Distance are employed to represent and find the similarity between Web services documents. Finally, in the fifth layer, three clustering algorithms, including affinity propagation, K-means, and hierarchical agglomerative clustering, are investigated to cluster Web services based on the observed documents’ similarities. We demonstrate how each component of the five layers is employed in the process of Web services clustering using randomly selected Web services documents.
We conduct an experimental analysis in which we cluster Web services using a collected dataset of Web services documents and evaluate the clustering performance. Using a ground truth for evaluation purposes, we observe that clusters built on the word embedding models performed better than those built using the Bag of Words with Term Frequency–Inverse Document Frequency model. Among the three word embedding models, the pre-trained Word2Vec skip-gram model reported higher performance in clustering Web services. Among the three semantic similarity measures, path-based WordNet similarity reported higher clustering performance. Across the different word representation models and the syntactic and semantic similarity measures, the affinity propagation clustering technique performed better in discovering similarities among Web services.
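The representation and syntactic-similarity layers described above can be sketched in a few lines. This is a minimal illustration that assumes a plain tf × log(N/df) weighting and cosine similarity; the abstract does not state the exact TF-IDF variant used, and the three service descriptions are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented Web service descriptions, already tokenized.
docs = [
    "get weather forecast by city".split(),
    "get weather report by zip".split(),
    "convert currency exchange rate".split(),
]
vecs = tfidf_vectors(docs)
sim_weather = cosine_similarity(vecs[0], vecs[1])  # shares "get", "weather", "by"
sim_other = cosine_similarity(vecs[0], vecs[2])    # shares no terms
print(sim_weather > sim_other, sim_other)  # True 0.0
```

A clustering algorithm such as affinity propagation or K-means would then operate on the pairwise similarities or on the vectors themselves. Bag-of-words vectors score documents with no shared terms as 0 even when they are related in meaning, which is the gap the word-embedding and semantic layers are meant to close.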
|
156 |
Fine-grained sentiment analysis of product reviews in Swedish
Westin, Emil, January 2020
In this study we gather customer reviews from Prisjakt, a Swedish price comparison site, with the goal of studying the relationship between review and rating, a task known as sentiment analysis. The purpose of the study is to evaluate three different supervised machine learning models on a fine-grained dependent variable representing the review rating. For classification, a binary and a multinomial model are used, with the one-versus-one strategy implemented in a Support Vector Machine with a linear kernel, evaluated with F1, accuracy, precision and recall scores. We also use Support Vector Regression (SVR), approximating the fine-grained variable as continuous, evaluated using MSE. Furthermore, the three models are evaluated on a balanced and an unbalanced dataset in order to investigate the effects of class imbalance. The results show that the SVR performs better on unbalanced fine-grained data, with the best fine-grained model reaching an MSE of 4.12, compared to 6.84 for the balanced SVR. The binary SVM model reaches an accuracy of 86.37% and a weighted F1 macro of 86.36% on the unbalanced data, while the balanced binary SVM model reaches approximately 80% for both measures. The multinomial model shows the worst performance due to its inability to handle class imbalance, despite the implementation of class weights. Furthermore, results from feature engineering show that SVR benefits marginally from certain regex conversions, and that tf-idf weighting shows better performance on the balanced sets than on the unbalanced sets.
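The two kinds of evaluation metric named above measure different things, which a small worked example can show. The ratings and predictions below are invented and the rating scale is an assumption; they are not data from the study.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared distance between predicted and true ratings (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of exactly correct labels (classification)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented fine-grained ratings and continuous predictions.
true_ratings = [1, 5, 8, 10]
predicted_ratings = [2.0, 4.0, 8.0, 7.0]
print(mean_squared_error(true_ratings, predicted_ratings))  # 2.75

# Invented binary labels (0 = negative, 1 = positive).
true_labels = [0, 1, 1, 1]
predicted_labels = [0, 0, 1, 1]
print(accuracy(true_labels, predicted_labels))  # 0.75
```

MSE penalizes a prediction by how far it lands from the true rating, so a near miss costs little; accuracy treats every wrong label the same. This is one reason treating a fine-grained rating as continuous can suit regression better than multinomial classification.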
|
157 |
Indexation et apprentissage de termes et de relations à partir de comptes rendus de radiologie / Automatic extraction of semantic information in the radiologic reports for search in of medical imaging
Ramadier, Lionel, 18 November 2016
In the medical field, the computerization of the health professions and the development of the personal medical record (dossier médical personnel, DMP) are driving a rapid increase in the volume of digital medical information. The need to convert and manipulate all this information in a structured form is a major challenge. It is the starting point for developing appropriate query tools, for which methods from natural language processing (NLP) seem well suited. The work in this thesis falls within the field of medical document analysis and addresses the representation of biomedical information (in particular from radiology) and access to it. We propose to build a knowledge base dedicated to radiology inside a general knowledge base (the lexical-semantic network JeuxDeMots). We show the value of the hypothesis of non-separation between different types of knowledge in the context of document analysis. This hypothesis is that using general knowledge, in addition to specialist knowledge, significantly improves the analysis of medical documents. At the level of the lexical-semantic network, the manual and automated addition of meta-information on annotations (frequency information, relevance information, etc.) is particularly useful. This network combines weights and annotations on typed relations between terms and concepts, together with an inference mechanism whose purpose is to improve the quality and coverage of the network. We describe how, from semantic information present in the network, the raw indexes built for each report can be augmented to improve document retrieval. We then present a method for extracting semantic relations between terms or concepts. This extraction is performed using linguistic patterns to which we added semantic constraints. The evaluation results show that the hypothesis of non-separation between different types of knowledge improves the relevance of indexing. Index augmentation improves recall, while the semantic constraints improve the precision of relation extraction.
|
158 |
Klassificering av transkriberade telefonsamtal med Support Vector Machines för ökad effektivitet inom vården / Classification of transcribed telephone calls with support vector machines for increased efficiency in healthcare
Höglind, Sanna; Sundström, Emelie, January 2019
Every year, the administration of Patientnämnden in Stockholm receives thousands of phone calls from people wishing to make complaints about health care in Region Stockholm. The aim of this work is to investigate how an NLP robot for classifying incoming complaints could contribute to increased efficiency of the operation. The complaints were classified using a method based on Support Vector Machines. To optimize the accuracy of the model, the impact of the length of the word vectors on accuracy was investigated. The final model reached an accuracy of 53.10%. This result was then analyzed with the goal of identifying potential improvements to the model. For future work, it could therefore be interesting to investigate how the number of calls, the number of people recording the calls, and the class distribution in the dataset affect the accuracy. A SWOT analysis was performed to investigate how the efficiency of the administration of Patientnämnden in Stockholm would be affected by the implementation of an NLP robot. The analysis showed clear benefits of automating complaint management, but also that such an implementation must be done with caution, ensuring that the available competence is sufficient to prevent potential threats.
|
159 |
A Study in Describing Complex Words Using Wikipedia's Categorisation System: Adding Descriptive Terms to Increase the Comprehension of Swedish Texts / En studie i att förklara komplexa ord med hjälp av Wikipedias kategoriseringssystem
Ragnarsson, Sebastian, January 2023
This thesis offers new input in the field of generating epithets to aid the comprehension of Swedish texts. For whatever reason, a reader might find certain words in a text difficult to understand. For example, they may never have come across the term moussaka before; however, by the simple expedient of assigning an explanatory epithet – in this case, “the dish” moussaka – they can hopefully continue reading uninterrupted. To do this, obscure phrases are identified and extracted based on word class, shallow token features and the Pareto Principle. An algorithm then extracts appropriate epithets for each word using the Wikipedia categorisation system. Although the algorithm developed for the study achieved underwhelming results when extracting obscure phrases, it did prove excellent at assigning appropriate epithets to nouns and proper nouns. With further research, this process can hopefully be utilised as a tool for improving the readability of any text.
|
160 |
Sentiment Analysis for Swedish: The Impact of Emojis on Sentiment Analysis of Swedish Informal Texts
Berggren, Lovisa, January 2023
This study investigates the use of emojis in sentiment analysis for the Swedish language, with the objective of assessing whether emojis improve the performance of the model. Sentiment analysis is an NLP classification task aimed at extracting people's opinions, sentiments, and attitudes from language. Though sentiment analysis as a research area has made a lot of progress recently, there are still some challenges to overcome. In this work, two of these challenges were considered: the analysis of a non-English language and the impact of emojis. These areas were explored by creating a sentiment-annotated dataset of Swedish texts containing emojis, and by creating a Swedish sentiment analysis model for evaluation. The sentiment analysis model created, SweVADER, was based on the English lexicon-based model VADER. The best-performing SweVADER model achieved an accuracy of 0.53 and an F1-score of 0.47. Furthermore, the presence of emojis improved the analysis for most models, but not by much. The results indicate that the use of emojis can improve the sentiment analysis, but there were other features affecting the results as well. The sentiment lexicon used plays a key role, and pre-processing techniques like stemming could affect the performance too. A takeaway from this study is that emojis contain important sentiment information and should not be disregarded. Furthermore, emojis are useful when analyzing texts if there is a lack of linguistic resources for the language in question.
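How emojis can change a lexicon-based prediction is easy to see in a toy scorer. The sketch below is not SweVADER or VADER: it simply sums lexicon values and applies a threshold, and both mini-lexicons, their scores, and the threshold are invented for illustration.

```python
from typing import Dict

# Hypothetical mini-lexicons; the words and scores are invented.
WORD_LEXICON: Dict[str, float] = {"bra": 1.5, "dålig": -1.5}  # "good", "bad"
EMOJI_LEXICON: Dict[str, float] = {"😍": 2.0, "😡": -2.0}


def sentiment_score(text: str, use_emojis: bool = True) -> float:
    """Sum lexicon scores over whitespace-separated tokens."""
    score = 0.0
    for token in text.lower().split():
        score += WORD_LEXICON.get(token, 0.0)
        if use_emojis:
            # Emojis must be separated by spaces to be seen as tokens here.
            score += EMOJI_LEXICON.get(token, 0.0)
    return score


def classify(score: float, threshold: float = 0.5) -> str:
    """Map a summed score to a three-way sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


text = "filmen var okej 😍"  # "the film was okay", followed by an emoji
print(classify(sentiment_score(text, use_emojis=False)))  # neutral
print(classify(sentiment_score(text, use_emojis=True)))   # positive
```

When none of the words are covered by the word lexicon, the emoji is the only sentiment signal available, which matches the abstract's point about languages with limited linguistic resources.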
|