Return to search

Data Quality Evaluation and Improvement for Machine Learning

In this research the focus is on data-centric AI with a specific concentration on data quality evaluation and improvement for machine learning. We first present a practical framework for data quality evaluation and improvement, using a legal domain as a case study and build a corpus for legal argument mining. We first created an initial corpus with 4,937 instances that were manually labeled. We define five data quality evaluation dimensions: comprehensiveness, correctness, variety, class imbalance, and duplication, and conducted a quantitative evaluation on these dimensions for the legal dataset and two existing datasets in the medical domain for medical concept normalization. The first group of experiments showed that class imbalance and insufficient training data are the two major data quality issues that negatively impacted the quality of the system that was built on the legal corpus. The second group of experiments showed that the overlap between the test datasets and the training datasets, which we defined as "duplication," is the major data quality issue for the two medical corpora. We explore several widely used machine learning methods for data quality improvement. Compared to pseudo-labeling, co-training, and expectation-maximization (EM), generative adversarial network (GAN) is more effective for automated data augmentation, especially when a small portion of labeled data and a large amount of unlabeled data is available. The data validation process, the performance improvement strategy, and the machine learning framework for data evaluation and improvement discussed in this dissertation can be used by machine learning researchers and practitioners to build high-performance machine learning systems. All the materials including the data, code, and results will be released at: https://github.com/haihua0913/dissertation-dqei.

Identiferoai:union.ndltd.org:unt.edu/info:ark/67531/metadc1944318
Date05 1900
CreatorsChen, Haihua
ContributorsChen, Jiangping, Ding, Junhua, Cleveland, Ana, Liu, Xiaozhong
PublisherUniversity of North Texas
Source SetsUniversity of North Texas
LanguageEnglish
Detected LanguageEnglish
TypeThesis or Dissertation
FormatText
RightsPublic, Chen, Haihua, Copyright, Copyright is held by the author, unless otherwise noted. All rights Reserved.

Page generated in 0.0024 seconds