Active learning is a practical field of machine learning because labeling data, or determining which data to label, can be time-consuming and inefficient. Active learning automates the selection of data to label, but current methods rely heavily on the model. As a result, sampled data often transfers poorly to new models, and sampling bias arises. Both issues are of crucial concern in machine learning deployment. We propose active learning methods that use Combinatorial Coverage to overcome these issues.
The proposed methods are data-centric, and our experiments show that including coverage in active learning yields sampled data that tends to transfer best to different models and has competitive sampling bias compared to benchmark methods. / Master of Science / Machine learning (ML) models are used frequently in a variety of applications. For a model to learn, data is required. Processing this data is often one of the most time-consuming aspects of using ML, if not the most. One especially burdensome aspect of data processing is data labeling, that is, determining which real-world class each data point belongs to. For example, a labeler might determine whether an image contains a plane or a bird. These labels are what allow the ML model to learn from the data.
Active learning is a subfield of machine learning that aims to ease this burden by allowing the model to select which data would be most beneficial to label, so that the entire dataset does not need to be labeled. The issue with current active learning methods is that they are highly model-dependent. In machine learning deployment, the model being used may change while the data stays the same, so data points chosen for labeling with respect to one model may not be ideal for another. This model dependency has led to sampling bias as well: the points chosen for labeling may all be similar, or may not be representative of all the data, leaving the ML model less knowledgeable than it could be.
Relevant work has focused on the sampling bias issue, and several methods have been proposed to combat it. Few of these methods, however, are applicable to any type of ML model. The issue of sampled points not generalizing to different models has been studied, but no solutions have been proposed.
In this work, we present active learning methods using Combinatorial Coverage. Combinatorial Coverage is a statistical technique from the field of Design of Experiments and has commonly been used to design test sets. Its extension to ML is newer and provides a way to focus on the data. We show that this data-focused approach to active learning achieves better performance when the sampled data is used for a different model, and that it achieves competitive sampling bias compared to benchmark methods.
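To make the idea of coverage concrete, the sketch below computes one common form of t-way combinatorial coverage for a small dataset with discrete features: the fraction of all possible t-way feature-value interactions that appear in at least one row. It is an illustration under stated assumptions, not the method developed in the thesis. The names `interactions`, `covered`, `coverage`, and `pick_next` are hypothetical, the features are assumed to be already discretized, and the greedy selection step is only one plausible way coverage might guide which point to label next.

```python
from itertools import combinations
from math import prod


def interactions(row, t):
    """Return every t-way (column indices, values) interaction found in one row."""
    return {(cols, tuple(row[c] for c in cols))
            for cols in combinations(range(len(row)), t)}


def covered(rows, t):
    """Return the set of t-way interactions that appear in at least one row."""
    seen = set()
    for row in rows:
        seen |= interactions(row, t)
    return seen


def coverage(rows, domains, t):
    """Fraction of all possible t-way interactions that appear in rows.

    domains[c] lists the values that feature c can take (assumed discrete).
    """
    total = sum(prod(len(domains[c]) for c in cols)
                for cols in combinations(range(len(domains)), t))
    return len(covered(rows, t)) / total


def pick_next(labeled, pool, t):
    """Hypothetical greedy step: the pool row adding the most unseen interactions."""
    seen = covered(labeled, t)
    return max(pool, key=lambda row: len(interactions(row, t) - seen))


# Three binary features; two rows are already labeled.
domains = [(0, 1), (0, 1), (0, 1)]
labeled = [(0, 0, 0), (1, 1, 1)]
pool = [(0, 0, 0), (0, 1, 1)]

print(coverage(labeled, domains, 2))  # 6 of 12 possible pairs are covered: 0.5
print(pick_next(labeled, pool, 2))    # (0, 1, 1) adds two unseen pairs
```

Because the selection in this sketch depends only on feature values and never consults a model, the chosen points are the same regardless of which model is trained afterward, which is the sense in which a coverage-based approach is data-centric.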
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/111467 |
Date | 04 August 2022 |
Creators | Katragadda, Sai Prathyush |
Contributors | Industrial and Systems Engineering, Beling, Peter A., Bansal, Manish, Freeman, Laura J. |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |