<p dir="ltr">The quality of crowdsourced annotations has always been a challenge due to variability in annotators' backgrounds, task complexity, the subjective nature of many labeling tasks, and other factors. It is therefore crucial to evaluate these annotations to ensure their reliability. Traditionally, human experts evaluate the quality of crowdsourced annotations, but this approach has its own challenges. Hence, this paper proposes to leverage large language models such as ChatGPT-4 to evaluate the existing crowdsourced MAVEN dataset and to explore their potential as an alternative solution. However, due to the stochastic nature of LLMs, it is important to discern when to trust and when to question their responses. To address this, we introduce a novel approach that applies Rubin's framework to identify and use linguistic cues within LLM responses as indicators of an LLM's certainty level. Our findings reveal that ChatGPT-4 successfully identified 63% of the incorrect labels, highlighting the potential for improving data label quality through human-AI collaboration on these identified inaccuracies. This study underscores the promising role of LLMs in evaluating crowdsourced data annotations, offering a way to enhance the accuracy and fairness of crowdsourced annotations while saving time and costs.</p>
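The approach described above, flagging LLM responses by surface linguistic cues of certainty, can be sketched roughly as follows. This is only an illustrative sketch: the cue lists and the simple counting heuristic are hypothetical examples for exposition, not the thesis's actual lexicon, scoring, or application of Rubin's framework.

```python
# Hypothetical hedge/booster cue lists; the thesis's actual cue inventory
# (derived from Rubin's framework) is not reproduced here.
HEDGE_CUES = ["might", "possibly", "perhaps", "it seems", "i think"]
BOOSTER_CUES = ["definitely", "clearly", "certainly", "undoubtedly"]

def certainty_signal(response: str) -> str:
    """Classify an LLM response as 'low', 'high', or 'neutral' certainty
    by counting simple surface cues in the text."""
    text = response.lower()
    hedges = sum(text.count(cue) for cue in HEDGE_CUES)
    boosters = sum(text.count(cue) for cue in BOOSTER_CUES)
    if hedges > boosters:
        return "low"      # question this judgment; route to a human reviewer
    if boosters > hedges:
        return "high"     # more likely safe to trust
    return "neutral"

print(certainty_signal("The event type might possibly be Attack."))  # low
print(certainty_signal("This is clearly an instance of Motion."))    # high
```

In a human-AI collaboration loop of the kind the abstract suggests, low-certainty responses would be escalated to human annotators while high-certainty disagreements with the crowdsourced label would be prioritized for relabeling.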
Identifier | oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/26214551 |
Date | 09 July 2024 |
Creators | Venkata Divya Sree Pulipati (18469230) |
Source Sets | Purdue University |
Detected Language | English |
Type | Text, Thesis |
Rights | CC BY 4.0 |
Relation | https://figshare.com/articles/thesis/Leveraging_Linguistic_Insights_for_Uncertainty_Calibration_of_ChatGPT_and_Evaluating_Crowdsourced_Annotations/26214551 |