One of the largest challenges in the field of artificial intelligence and machine learning is the acquisition of a large quantity of quality data to train models on. This thesis investigates and evaluates approaches to data generation in a telecom domain for the task of extractive question answering (QA). To do this, a pipeline was built using a combination of BERT-like models and T5 models for data generation. We then evaluated our generated data using the downstream task of QA on a telecom domain data set. We measured the performance using exact match (EM) and F1 scores. We achieved results that are state of the art on the telecom domain data set.

We found that synthetic data generation is a viable approach to obtaining synthetic telecom QA data, with the potential of improving model performance when used in addition to human-annotated data. We also found that using models from the general domain provided results that are on par with or better than domain-specific models for the generation, which opens the possibility of using a single generation pipeline for many different domains. Furthermore, we found that increasing the amount of synthetic data provided little benefit for our models on the downstream task, with diminishing returns setting in quickly. We were unable to pinpoint the reason for this. In short, our approach works, but much more work remains to understand and optimize it for greater results.
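For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how EM and token-level F1 are commonly computed for extractive QA. It assumes SQuAD-style answer normalization (lowercasing, removing punctuation and English articles); the abstract does not state which normalization the thesis used, and the function names `normalize`, `exact_match`, and `f1_score` are illustrative, not taken from the thesis.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Assumed SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a predicted span that contains the reference answer plus one extra token scores 0 on EM but still receives partial credit on F1, which is why the two metrics are usually reported together.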
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-187956 |
Date | January 2022 |
Creators | Bissessar, Daniel, Bois, Alexander |
Publisher | Linköpings universitet, Institutionen för datavetenskap |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |