Global ETD Search

Return to search

Hierarchical Bayesian Dataset Selection

Despite the profound impact of deep learning across various domains, supervised model training critically depends on access to large, high-quality datasets, which are often challenging to identify. To address this, we introduce Hierarchical Bayesian Dataset Selection (HBDS), the first dataset selection algorithm that utilizes hierarchical Bayesian modeling, designed for collaborative data-sharing ecosystems. The proposed method efficiently decomposes the contributions of dataset groups and individual datasets to local model performance using Bayesian updates with small data samples. Our experiments on two benchmark datasets demonstrate that HBDS not only offers a computationally lightweight solution but also enhances interpretability compared to existing data selection methods, by revealing deep insights into dataset interrelationships through learned posterior distributions. HBDS outperforms traditional non-hierarchical methods by correctly identifying all relevant datasets, achieving optimal accuracy with fewer computational steps, even when initial model accuracy is low. Specifically, HBDS surpasses its non-hierarchical counterpart by 1.8% on DIGIT-FIVE and 0.7% on DOMAINNET, on average. In settings with limited resources, HBDS achieves a 6.9% higher accuracy than its non-hierarchical counterpart. These results confirm HBDS's effectiveness in identifying datasets that improve the accuracy and efficiency of deep learning models when collaborative data utilization is essential. / Master of Science / Deep learning technologies have revolutionized many domains and applications, from voice recognition in smartphones to automated recommendations on streaming services. However, the success of these technologies heavily relies on having access to large and high-quality datasets. In many cases, selecting the right datasets can be a daunting challenge. To tackle this, we have developed a new method that can quickly figure out which datasets or groups of datasets contribute most to improving the performance of a model with only a small amount of data needed. Our tests prove that this method is not only effective and light on computation but also helps us understand better how different datasets relate to each other.

https://hdl.handle.net/10919/119391

Hierarchical Bayesian

Data-Sharing

Reinforcement Learning

Dataset Selection

Identifer	oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/119391
Date	05 1900
Creators	Zhou, Xiaona
Contributors	Computer Science, Lourentzou, Ismini, Jin, Ran, Thomas, Christopher Lee
Publisher	Virginia Tech
Source Sets	Virginia Tech Theses and Dissertation
Language	English
Detected Language	English
Type	Thesis
Format	ETD, application/pdf, application/pdf
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International, http://creativecommons.org/licenses/by-nc-nd/4.0/

Page generated in 0.0023 seconds

Hierarchical Bayesian Dataset Selection

Description

Links & Downloads

Tags

Additional Fields