Return to search

Evaluation of methods for question answering data generation : Using large language models / Utvärdering av metoder för skapande av fråge-svar data

One of the largest challenges in the field of artificial intelligence and machine learning isthe acquisition of a large quantity of quality data to train models on.This thesis investigates and evaluates approaches to data generation in a telecom domain for the task of extractive QA. To do this a pipeline was built using a combination ofBERT-like models and T5 models for data generation. We then evaluated our generateddata using the downstream task of QA on a telecom domain data set. We measured theperformance using EM and F1-scores. We achieved results that are state of the art on thetelecom domain data set.We found that synthetic data generation is a viable approach to obtaining synthetictelecom QA data with the potential of improving model performance when used in addition to human-annotated data. We also found that using models from the general domainprovided results that are on par or better than domain-specific models for the generation, which provides possibilities to use a single generation pipeline for many differentdomains. Furthermore, we found that increasing the amount of synthetic data providedlittle benefit for our models on the downstream task, with diminishing returns setting inquickly. We were unable to pinpoint the reason for this. In short, our approach works butmuch more work remains to understand and optimize it for greater results

Identiferoai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-187956
Date January 2022
CreatorsBissessar, Daniel, Bois, Alexander
PublisherLinköpings universitet, Institutionen för datavetenskap
Source SetsDiVA Archive at Upsalla University
LanguageEnglish
Detected LanguageEnglish
TypeStudent thesis, info:eu-repo/semantics/bachelorThesis, text
Formatapplication/pdf
Rightsinfo:eu-repo/semantics/openAccess

Page generated in 0.0024 seconds