Data serves as the foundation of effective deep learning algorithms, yet annotating and curating data to maintain high quality is time-intensive. The challenges arise from the vast diversity and sheer volume of data, and from the inherent complexity of labeling each sample. Relying on manual effort to construct high-quality data is therefore impractical and unsustainable in the real world. Instead, this thesis introduces a set of novel techniques to learn effectively from data with less curation, which is more practical for building AI applications.
In this thesis, we systematically study different directions in learning from low-quality data, with a specific focus on visual understanding and robustness to complex label bias and noise. We first examine the bias exhibited across an entire dataset for image classification, and derive debiasing algorithms based on representation learning that exploit the geometry and distribution of embeddings. In this way, we mitigate the uneven performance across image classes caused by data imbalance, and suppress spurious correlations between input images and output predictions, so that the model generalizes better to new classes and maintains robust accuracy with only a small number of labeled samples as reference. We then extend our analysis to the open-text description of each sample and study label noise in multi-modal pre-training. We build our framework upon contrastive language-image pre-training to learn a common representation space, and improve training effectiveness by automatically eliminating false negative labels and correcting false positives. Additionally, our approaches show the potential to tackle label bias in multi-modal training data.
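To make the idea of eliminating false negatives in contrastive language-image pre-training concrete, the following is a minimal, illustrative sketch, not the thesis implementation: it computes a standard CLIP-style symmetric contrastive loss and simply excludes off-diagonal image-text pairs whose similarity is suspiciously high, treating them as likely false negatives. The function name, the `fn_sim_threshold` parameter, and the masking rule are hypothetical choices introduced here for illustration.

```python
# Illustrative sketch only: CLIP-style contrastive loss with suspected
# false negatives masked out of the softmax denominator.
import torch
import torch.nn.functional as F

def contrastive_loss_with_fn_masking(img_emb, txt_emb, temperature=0.07,
                                     fn_sim_threshold=0.9):
    """Symmetric image-text contrastive loss that ignores off-diagonal pairs
    whose cosine similarity is high enough to be likely false negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) pairwise logits
    targets = torch.arange(logits.size(0), device=logits.device)

    # Off-diagonal pairs with very high raw similarity are treated as suspected
    # false negatives and removed from the contrastive denominator.
    with torch.no_grad():
        sim = img_emb @ txt_emb.t()
        off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        fn_mask = (sim > fn_sim_threshold) & off_diag
    logits = logits.masked_fill(fn_mask, float('-inf'))

    # Average the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

In practice, the threshold rule could be replaced by any detector of likely false negatives; the key design point this sketch illustrates is that suspected false negatives are excluded from the loss rather than pushed apart.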
Throughout this dissertation, the unifying focus is on effective approaches to learning from low-quality data, considering learning issues from two complementary aspects of data labeling: bias in the global distribution and noise in the annotation of each individual sample. Unlike prior research developed on biased and noisy labels artificially simulated from well-curated datasets, our approaches have been validated to be resilient to the complex bias and noise found in real-world scenarios. We hope this work contributes to the field of multi-modal machine learning in applications that involve real-world low-quality data and require minimal manual effort in data construction.
Identifier | oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/qsx7-3j70
Date | January 2024
Creators | Ma, Jiawei
Source Sets | Columbia University
Language | English
Detected Language | English
Type | Theses