1

Semantic Segmentation of Iron Ore Pellets in the Cloud

Lindberg, Hampus January 2021 (has links)
This master's thesis evaluates data annotation, semantic segmentation and Docker for use in AWS. The provided data had to be annotated to form a dataset for training a neural network, and different network models were then compared on performance. Because AWS supports Docker containers, that option was examined, and the tools available in AWS SageMaker were analyzed for bringing a neural network to the cloud. Images were annotated in Ilastik, giving a dataset of 276 images, and a neural network was built in PyTorch using the Segmentation Models PyTorch library, which made it easy to try different models. The network was first developed in a Google Colab notebook for quick setup and testing. The dataset was then uploaded to AWS S3, and the notebook was moved from Colab to an AWS instance so the dataset could be loaded from S3. A Docker container was created with the necessary packages and libraries as well as the training and inference code, and pushed to the Elastic Container Registry (ECR). This container was used to run training jobs in SageMaker, producing a trained model stored in S3, and the hyperparameter tuning tool was examined to obtain a better-performing model. The two deployment methods in SageMaker were then investigated to understand the complete machine learning solution. The images annotated in Ilastik were deemed sufficient, as the neural network results were satisfactory. The network could use all of the models accessible from Segmentation Models PyTorch, which gave many options. By packaging the network in a Docker container and pushing it to the ECR, all of the SageMaker tools could be used with it. Training jobs run in SageMaker with the container produced a trained model that could be saved to S3. Hyperparameter tuning gave better results than the manually tested parameters and produced the best network; the best model was Unet++ in combination with the Dpn98 encoder. The two deployment methods in SageMaker were explored and are believed to be beneficial in different ways, so the choice has to be reconsidered for each project. By analysis, the cloud solution was deemed the better alternative compared to an in-house solution in all three aspects measured: price, performance and scalability.
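
A minimal sketch of how the best-performing combination reported above (Unet++ with the Dpn98 encoder) could be instantiated with the Segmentation Models PyTorch library. The input channel count, number of classes, loss and optimizer settings are illustrative assumptions, not details taken from the thesis.

    import torch
    import segmentation_models_pytorch as smp

    # Unet++ decoder with a DPN98 encoder, the combination reported as best above.
    # in_channels and classes are assumptions (grayscale pellet images, binary
    # pellet/background segmentation).
    model = smp.UnetPlusPlus(
        encoder_name="dpn98",
        encoder_weights="imagenet",
        in_channels=1,
        classes=1,
    )

    loss_fn = smp.losses.DiceLoss(mode="binary")
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def train_step(images, masks):
        # one optimization step over a batch of images and binary masks
        optimizer.zero_grad()
        logits = model(images)          # (N, 1, H, W) raw scores
        loss = loss_fn(logits, masks)
        loss.backward()
        optimizer.step()
        return loss.item()

In the workflow described above, code of this kind would be packaged in the Docker image pushed to the ECR and invoked by the SageMaker training job.
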
2

Usando aplicações ricas para internet na criação de um ambiente para visualização e edição de regras SWRL / Using rich Internet applications to create an environment for viewing and editing SWRL rules

Orlando, João Paulo 25 May 2012 (has links)
The Semantic Web is a way to explicitly associate meaning with the content of web documents so that they can be processed directly or indirectly by machines. To allow this processing, computers need access to structured collections of information and to sets of inference rules over that content. The Semantic Web Rule Language (SWRL) allows rules and ontology terms, defined using the Web Ontology Language (OWL), to be combined to increase the expressiveness of both. However, as rule sets grow, they become difficult to understand and error prone, especially when used and maintained by more than one person. If SWRL is to become a true web standard, it has to be able to handle large rule sets. To address this problem, we first surveyed business rule systems to identify the key features and interfaces they use and then, based on our findings, proposed techniques that use new visual representations to edit rules in a web application. They allow error detection, identification of similar rules, rule clustering and visualization, and atom reuse between rules. These techniques are implemented in the SWRL Editor, an open-source plug-in for Web-Protégé (a web-based ontology editor) that leverages Web-Protégé's collaborative tools to let groups of users not only view and edit rules but also comment on and discuss them. We carried out two evaluations of the SWRL Editor. The first was a case study of two ontologies from the biomedical domain (an area where SWRL rules are heavily used); the second was a comparison with the only three SWRL rule editors found in the literature. That comparison showed that the SWRL Editor implements more of the key features found in general rule systems than the other three editors.
3

A multi-layered approach to information extraction from tables in biomedical documents

Milosevic, Nikola January 2018 (has links)
The quantity of literature in the biomedical domain is growing exponentially. It is becoming impossible for researchers to cope with this ever-increasing amount of information. Text mining provides methods that can improve access to information of interest through information retrieval, information extraction and question answering. However, most of these systems focus on information presented in the main body of the text while ignoring other parts of the document such as tables and figures. Tables are a potentially important component of research presentation, as authors often include more detailed information in tables than in the textual sections of a document. Tables allow the presentation of large amounts of information in relatively limited space, due to their structural flexibility and ability to present multi-dimensional information. Table processing encapsulates specific challenges that table mining systems need to take into account, including the variety of visual and semantic structures in tables, the variety of information presentation formats, and dense content in table cells. The work presented in this thesis examines a multi-layered approach to information extraction from tables in biomedical documents. We propose a representation model of tables, describing table structures and how they are read, and a method for table structure disentangling and information extraction that consists of: (1) table detection, (2) functional analysis, (3) structural analysis, (4) semantic tagging, (5) pragmatic analysis, (6) cell selection and (7) syntactic processing and extraction. To validate our approach, show its potential and identify remaining challenges, we applied the methodology to two case studies. The aim of the first case study was to extract baseline characteristics of clinical trials (number of patients, age, gender distribution, etc.) from tables. The second case study explored how the methodology can be applied to relationship extraction, examining extraction of drug-drug interactions. Our method performed functional analysis with a precision of 0.9425, recall of 0.9428 and F1-score of 0.9426. Relationships between cells were recognized with a precision of 0.9238, recall of 0.9744 and F1-score of 0.9484. The information extraction methodology achieves state-of-the-art performance in table information extraction, recording F1-scores of 0.82-0.93 for demographic data, adverse event and drug-drug interaction extraction, depending on the complexity of the task and the available semantic resources. The presented methodology demonstrates that information can be efficiently extracted from tables in biomedical literature. Information extraction from tables can be important for enhancing data curation, information retrieval, question answering and decision support systems with additional information from tables that cannot be found in other parts of the document.
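
A toy sketch of a few of the seven stages listed above (table detection, functional analysis, semantic tagging and cell selection) applied to a small example table; all function names, data structures and the vocabulary are hypothetical placeholders for illustration, not the thesis implementation, and the structural and pragmatic stages are omitted.

    from dataclasses import dataclass

    @dataclass
    class Cell:
        row: int
        col: int
        text: str
        role: str = "data"      # "header" or "data" after functional analysis
        tag: str = ""           # semantic tag after tagging

    def detect_tables(document):
        # (1) table detection: here the document is assumed to already be a
        # list of tables, each a list of rows of strings
        return document

    def functional_analysis(table):
        # (2) functional analysis: mark the first row as headers, the rest as data
        cells = [Cell(r, c, t) for r, row in enumerate(table) for c, t in enumerate(row)]
        for cell in cells:
            cell.role = "header" if cell.row == 0 else "data"
        return cells

    def semantic_tagging(cells, vocabulary):
        # (4) semantic tagging: attach a tag when a column header matches a known concept
        headers = {c.col: c.text.lower() for c in cells if c.role == "header"}
        for cell in cells:
            if cell.role == "data" and headers.get(cell.col) in vocabulary:
                cell.tag = vocabulary[headers[cell.col]]
        return cells

    def cell_selection(cells, wanted_tag):
        # (6) cell selection: keep only the data cells carrying the tag of interest
        return [c for c in cells if c.tag == wanted_tag]

    # toy document: one table with a header row and two data rows
    document = [[["Age (years)", "Patients"], ["63.5", "120"], ["61.2", "118"]]]
    vocab = {"age (years)": "age", "patients": "sample_size"}

    for table in detect_tables(document):
        cells = semantic_tagging(functional_analysis(table), vocab)
        for cell in cell_selection(cells, "age"):
            print(cell.row, cell.col, cell.text)   # (7) extraction of the selected values
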
4

Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora

Olsson, Fredrik January 2008 (has links)
This thesis describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. The reason for working with documents, as opposed to for instance sentences or phrases, is that the BootMark method is concerned with the creation of corpora. The claim made in the thesis is that BootMark requires a human annotator to manually annotate fewer documents in order to produce a named entity recognizer with a given performance, than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The intention is then to use the created named entity recognizer as a pre-tagger and thus eventually turn the manual annotation process into one in which the annotator reviews system-suggested annotations rather than creating new ones from scratch. The BootMark method consists of three phases: (1) Manual annotation of a set of documents; (2) Bootstrapping – active machine learning for the purpose of selecting which document to annotate next; (3) The remaining unannotated documents of the original corpus are marked up using pre-tagging with revision. Five emerging issues are identified, described and empirically investigated in the thesis. Their common denominator is that they all depend on the realization of the named entity recognition task, and as such, require the context of a practical setting in order to be properly addressed. The emerging issues are related to: (1) the characteristics of the named entity recognition task and the base learners used in conjunction with it; (2) the constitution of the set of documents annotated by the human annotator in phase one in order to start the bootstrapping process; (3) the active selection of the documents to annotate in phase two; (4) the monitoring and termination of the active learning carried out in phase two, including a new intrinsic stopping criterion for committee-based active learning; and (5) the applicability of the named entity recognizer created during phase two as a pre-tagger in phase three. The outcomes of the empirical investigations concerning the emerging issues support the claim made in the thesis. The results also suggest that while the recognizer produced in phases one and two is as useful for pre-tagging as a recognizer created from randomly selected documents, the applicability of the recognizer as a pre-tagger is best investigated by conducting a user study involving real annotators working on a real named entity recognition task.
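
A minimal sketch of the kind of committee-based active selection used in phase two, assuming generic scikit-learn classifiers over pre-computed document feature vectors; vote entropy as the disagreement measure and all names and data below are illustrative assumptions, not the thesis's actual learners or stopping criterion.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def vote_entropy(committee, X):
        # Disagreement score per sample: entropy of the committee's label votes.
        votes = np.stack([member.predict(X) for member in committee])  # (n_members, n_samples)
        n_members = votes.shape[0]
        scores = []
        for column in votes.T:
            _, counts = np.unique(column, return_counts=True)
            p = counts / n_members
            scores.append(-np.sum(p * np.log(p + 1e-12)))
        return np.array(scores)

    def select_next_document(committee, X_unlabelled):
        # Pick the unlabelled document the committee disagrees on the most.
        return int(np.argmax(vote_entropy(committee, X_unlabelled)))

    # toy data: feature vectors standing in for documents (an assumption for the sketch)
    rng = np.random.default_rng(0)
    X_labelled, y_labelled = rng.normal(size=(20, 5)), rng.integers(0, 2, size=20)
    X_unlabelled = rng.normal(size=(100, 5))

    committee = [
        RandomForestClassifier(n_estimators=10, random_state=seed).fit(X_labelled, y_labelled)
        for seed in range(3)
    ]
    print("annotate document:", select_next_document(committee, X_unlabelled))

In a BootMark-style loop, a score of this kind would choose the next document for the human annotator, and a stopping criterion would monitor when committee disagreement no longer justifies further manual annotation.
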
5

Automatic vs. Manual Data Labeling : A System Dynamics Modeling Approach / Automatisk Kontra Manuell Dataannotering : med Systemdynamiksmodellering

Blank, Clas January 2020 (has links)
Labeled data, a collection of data samples that have been tagged with one or more labels, plays an important role in many software organizations in today's market. It can help in solving automation problems, training and validating machine learning models, or analysing data. Many organizations therefore set up their own labeled data gathering systems to supply them with the data they require. Labeling can be done by humans or via some automated process, but either way it comes with costs to these organizations. This study examines what such a labeled data gathering system can look like and determines which components play a crucial role in how costly an automatic approach is compared to a manual approach, using the company Klarna's label acquisition system as a case study. Two models are presented: one describes a system that uses only humans for data annotation, while the other describes a system where labeling is done via an automatic process. These models are used to compare the costs to an organization of taking either approach. Important findings include the identification of the components that determine which approach is more economically efficient under given circumstances, such as the label decay rate, the expected automatic and manual accuracy, and the number of data points that require labeling.
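
A minimal numerical sketch of the kind of comparison the two models support, treating valid labels as a stock that grows with the labeling rate and shrinks with label decay; all rates, accuracies and costs below are invented placeholders, not values from the study or from Klarna's system.

    # Toy system-dynamics-style simulation: labeled data is a stock that grows with
    # the labeling rate and shrinks with label decay; costs accumulate per label.
    # All parameter values below are invented placeholders.

    def simulate(labels_per_day, cost_per_label, accuracy, decay_rate, days=365):
        valid_labels, total_cost = 0.0, 0.0
        for _ in range(days):
            produced = labels_per_day * accuracy          # only accurate labels add value
            valid_labels += produced - decay_rate * valid_labels
            total_cost += labels_per_day * cost_per_label
        return valid_labels, total_cost

    manual = simulate(labels_per_day=200, cost_per_label=0.50, accuracy=0.98, decay_rate=0.01)
    auto = simulate(labels_per_day=5000, cost_per_label=0.01, accuracy=0.85, decay_rate=0.01)

    for name, (labels, cost) in [("manual", manual), ("automatic", auto)]:
        print(f"{name}: {labels:,.0f} valid labels after one year at cost {cost:,.0f}")

Varying the decay rate, expected accuracy and data volume in a sketch like this shows how those components shift which approach is the more economical one, which is the comparison the thesis models formalize.
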
6

Air Reconnaissance Analysis using Convolutional Neural Network-based Object Detection

Fasth, Niklas, Hallblad, Rasmus January 2020 (has links)
The Swedish armed forces use the Single Source Intelligent Cell (SSIC), developed by Saab, for analysis of aerial reconnaissance video and report generation. The analysis can be time-consuming and demanding for a human operator, and identifying vehicles is an important part of the analysis workflow. Artificial intelligence is widely used for analysis in many industries to aid or replace a human worker. In this paper, the possibility of aiding the human operator with air reconnaissance data analysis is investigated, specifically object detection for finding cars in aerial images. Many state-of-the-art object detection models for vehicle detection in aerial images are based on a Convolutional Neural Network (CNN) architecture. Faster R-CNN- and SSD-based models, both built on this architecture, are implemented. Comprehensive experiments are conducted with the models on two different datasets: the open Video Verification of Identity (VIVID) dataset and a confidential dataset provided by Saab. The datasets are similar, both consisting of aerial images with vehicles. Initial experiments are conducted to find suitable configurations for the proposed models, and a final experiment compares the performance of a human operator and the machine. The results show that object detection can support the work of air reconnaissance image analysis with respect to inference time. The current performance of the object detectors makes them most suitable for applications where speed is more important than accuracy.
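
A minimal sketch of running one of the compared detector families (Faster R-CNN) with torchvision; the COCO-pretrained weights, the placeholder input image and the confidence threshold are assumptions, whereas the thesis models were configured for and trained on the VIVID and Saab datasets.

    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    # COCO-pretrained Faster R-CNN as a stand-in for the aerial-image detectors
    # evaluated in the thesis (which were trained on aerial vehicle data).
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    image = torch.rand(3, 512, 512)          # placeholder for an aerial frame
    with torch.no_grad():
        prediction = model([image])[0]        # dict with boxes, labels, scores

    keep = prediction["scores"] > 0.5         # assumed confidence threshold
    print(prediction["boxes"][keep], prediction["labels"][keep])
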
7

Eye Movement Analysis for Activity Recognition in Everyday Situations

Gustafsson, Anton January 2018 (has links)
The increasing number of smart devices in our everyday environment has created new problems within human-computer interaction, such as how we humans are supposed to interact with these devices efficiently and with ease. Context-aware systems are a possible candidate for solving this problem: if a system could automatically detect people's activities and intentions, it could act accordingly without any explicit input from the user. Eyes have previously been shown to be a rich source of information about a person's cognitive state and current activity, and could therefore be a viable input modality to extract such information from. In this thesis, we examine the possibility of detecting human activity by using a low-cost, home-built monocular eye tracker. An experiment was conducted where participants performed everyday activities in a kitchen while their eye movement data was collected. After the experiment, the data was annotated, preprocessed and classified using multilayer perceptron and random forest classifiers. Even though the collected dataset was small, the results showed a recognition rate of 30-40% depending on the classifier used. This confirms previous work that activity recognition using eye movement data is possible, but also that achieving high accuracy remains challenging.
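
A minimal sketch of the classification step described above, assuming the eye movement recordings have already been turned into fixed-length feature vectors per activity window; the feature dimensionality, number of activity classes and classifier hyperparameters are placeholder assumptions, not values from the thesis.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Placeholder feature matrix: one row per window of gaze data, one column per
    # hand-crafted eye-movement feature (fixation/saccade statistics, etc.).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 24))
    y = rng.integers(0, 4, size=300)          # four kitchen activities (assumption)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    for clf in (MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
                RandomForestClassifier(n_estimators=200)):
        clf.fit(X_train, y_train)
        print(type(clf).__name__, "accuracy:", round(clf.score(X_test, y_test), 3))
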
8

Beyond Privacy Concerns: Examining Individual Interest in Privacy in the Machine Learning Era

Brown, Nicholas James 12 June 2023 (has links)
The deployment of human-augmented machine learning (ML) systems has become a recommended organizational best practice. ML systems use algorithms that rely on training data labeled by human annotators. However, human involvement in reviewing and labeling consumers' voice data to train speech recognition systems for Amazon Alexa, Microsoft Cortana, and the like has raised privacy concerns among consumers and privacy advocates. We use the enhanced APCO model as the theoretical lens to investigate how the disclosure of human involvement during the supervised machine learning process affects consumers' privacy decision making. In a scenario-based experiment with 499 participants, we present various company privacy policies to participants to examine their trust and privacy considerations, then ask them to share reasons why they would or would not opt in to share their voice data to train a company's voice recognition software. We find that the perception of human involvement in the ML training process significantly influences participants' privacy-related concerns, which thereby mediate their decisions to share their voice data. Furthermore, we manipulate four factors of a privacy policy to operationalize various cognitive biases actively present in the minds of consumers and find that default trust and salience biases significantly affect participants' privacy decision making. Our results provide a deeper contextualized understanding of privacy-related concerns that may arise in human-augmented ML system configurations and highlight the managerial importance of considering the role of human involvement in supervised machine learning settings. Importantly, we introduce perceived human involvement as a new construct to the information privacy discourse. Although ubiquitous data collection and increased privacy breaches have elevated the reported concerns of consumers, consumers' behaviors do not always match their stated privacy concerns. Researchers refer to this as the privacy paradox, and decades of information privacy research have identified a myriad of explanations for why this paradox occurs. Yet the underlying crux of the explanations presumes privacy concern to be the appropriate proxy to measure privacy attitude and compare with actual privacy behavior. Often, privacy concerns are situational and can be elicited through the setup of boundary conditions and the framing of different privacy scenarios. Drawing on the cognitive model of empowerment and interest, we propose a multidimensional privacy interest construct that captures consumers' situational and dispositional attitudes toward privacy, which can serve as a more robust measure in conditions leading to the privacy paradox. We define privacy interest as a consumer's general feeling toward reengaging particular behaviors that increase their information privacy. This construct comprises four dimensions (impact, awareness, meaningfulness, and competence) and is conceptualized as a consumer's assessment of contextual factors affecting their privacy perceptions and their global predisposition to respond to those factors. Importantly, interest was originally included in the privacy calculus but is largely absent in privacy studies and theoretical conceptualizations. Following MacKenzie et al. (2011), we developed and empirically validated a privacy interest scale.
This study contributes to privacy research and practice by reconceptualizing a construct in the original privacy calculus theory and offering a renewed theoretical lens through which to view consumers' privacy attitudes and behaviors. / Doctor of Philosophy / The deployment of human-augmented machine learning (ML) systems has become a recommended organizational best practice. ML systems use algorithms that rely on training data labeled by human annotators. However, human involvement in reviewing and labeling consumers' voice data to train speech recognition systems for Amazon Alexa, Microsoft Cortana, and the like has raised privacy concerns among consumers and privacy advocates. We investigate how the disclosure of human involvement during the supervised machine learning process affects consumers' privacy decision making and find that the perception of human involvement in the ML training process significantly influences participants' privacy-related concerns. This thereby influences their decisions to share their voice data. Our results highlight the importance of understanding consumers' willingness to contribute their data to generate complete and diverse data sets to help companies reduce algorithmic biases and systematic unfairness in the decisions and outputs rendered by ML systems. Although ubiquitous data collection and increased privacy breaches have elevated the reported concerns of consumers, consumers' behaviors do not always match their stated privacy concerns. This is referred to as the privacy paradox, and decades of information privacy research have identified a myriad of explanations why this paradox occurs. Yet the underlying crux of the explanations presumes privacy concern to be the appropriate proxy to measure privacy attitude and compare with actual privacy behavior. We propose privacy interest as an alternative to privacy concern and assert that it can serve as a more robust measure in conditions leading to the privacy paradox. We define privacy interest as a consumer's general feeling toward reengaging particular behaviors that increase their information privacy. We found that privacy interest was more effective than privacy concern in predicting consumers' mobilization behaviors, such as publicly complaining about privacy issues to companies and third-party organizations, requesting to remove their information from company databases, and reducing their self-disclosure behaviors. By contrast, privacy concern was more effective than privacy interest in predicting consumers' behaviors to misrepresent their identity. By developing and empirically validating the privacy interest scale, we offer interest in privacy as a renewed theoretical lens through which to view consumers' privacy attitudes and behaviors.
9

Bootstrapping Annotated Job Ads using Named Entity Recognition and Swedish Language Models / Identifiering av namngivna enheter i jobbannonser genom användning av semi-övervakade tekniker och svenska språkmodeller

Nyqvist, Anna January 2021 (has links)
Named entity recognition (NER) is a task that concerns detecting and categorising certain information in text. A promising approach for NER that has recently emerged is fine-tuning Transformer-based language models for this specific task. However, these models may require a relatively large quantity of labelled data to perform well, which can limit the applicability of NER models in real-world applications, as manual annotation is often costly and time-consuming. In this thesis, we investigate the learning curve of human annotation and of a NER model during a semi-supervised bootstrapping process. Special emphasis is given to how the number of classes and the amount of training data used in the process affect the results. We first annotate a set of collected job advertisements and then apply bootstrapping using both annotated and unannotated data while continuously fine-tuning a pre-trained Swedish BERT model. The initial class system is simplified during the bootstrapping process according to model performance and inter-annotator agreement. The model performance increased as the training set grew larger, with a final micro F1-score of 54%. This result provides a good baseline, and we point out several improvements that can be made to further enhance performance. We also identify classes handled differently by the annotators and potential factors as to why. Suggestions for future work include adjusting the current class system further by removing classes that were identified as low-performing in this thesis.
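
A minimal sketch of the fine-tuning step inside the bootstrapping loop, using the Hugging Face transformers library; the Swedish BERT checkpoint name, the job-ad label set and the training arguments are assumptions, and the annotated datasets are only indicated, not constructed here.

    from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Assumed checkpoint: a publicly available Swedish BERT model.
    checkpoint = "KB/bert-base-swedish-cased"
    labels = ["O", "B-SKILL", "I-SKILL", "B-OCCUPATION", "I-OCCUPATION"]  # invented label set

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

    args = TrainingArguments(output_dir="ner-job-ads", num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=5e-5)

    # train_dataset / eval_dataset are assumed to be tokenized, label-aligned datasets
    # produced from the annotated job ads; they are not constructed in this sketch.
    # trainer = Trainer(model=model, args=args,
    #                   train_dataset=train_dataset, eval_dataset=eval_dataset)
    # trainer.train()

In a bootstrapping setup like the one described above, this fine-tuning step would be repeated as newly annotated or pre-tagged job ads are added to the training set.
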
