
Faithfulness in Abstractive Summarization: Progress and Challenges

The exponential increase in online text has created a pressing need for automatic summarization systems that can distill key information from lengthy documents. While neural abstractive summarizers have achieved gains in fluency and coherence, a critical challenge has emerged: ensuring faithfulness, i.e., accurately preserving the meaning of the original text. Modern neural abstractive summarizers can distort or fabricate facts, undermining their reliability in real-world applications. This thesis therefore tackles the critical issue of improving faithfulness in abstractive summarization, and it consists of four parts.

The first part examines challenges in evaluating summarization faithfulness, including issues with reference-free metrics and human evaluation. We propose a novel approach for building automated evaluation metrics that are less reliant on spurious correlations and demonstrate significantly improved performance over existing faithfulness evaluation metrics. We further introduce a novel evaluation framework that enables a more holistic assessment of faithfulness by accounting for the abstractiveness of summarization systems. This framework allows more rigorous faithfulness evaluation, differentiating gains from increased extraction from those due to improved abstraction.

The second part focuses on explaining the root causes of faithfulness issues in modern summarization systems. We introduce a novel contrastive approach for attributing errors that vastly outperforms prior work at tracing hallucinations in generated summaries back to training data deficiencies. Moreover, incorporating our method's ideas into an existing technique substantially boosts its performance. Through a case study, we also analyze pre-training biases and demonstrate their propagation to summarization models, yielding biased hallucinations. We show that while mitigation strategies during fine-tuning can reduce overall hallucination rates, the remaining hallucinations still closely reflect intrinsic pre-training biases.

The third part applies insights from the previous parts to develop practical techniques for improving faithfulness. We propose a novel approach for adaptively determining the appropriate level of abstractiveness for a given input to improve overall faithfulness. Our method yields systems that are both more faithful and more abstractive than baseline systems. We further leverage our error attribution approach to clean noisy training data, significantly reducing faithfulness errors in generated outputs. Models trained on datasets cleaned with our approach generate markedly fewer hallucinations than both baseline systems and models trained using other data cleaning techniques.

Finally, the fourth part examines the summarization capabilities of LLMs and assesses their faithfulness. We demonstrate that instruction-tuning and RLHF are key to enabling LLMs to achieve high-quality zero-shot summarization in the news domain, with state-of-the-art LLMs generating summaries comparable to human-written ones. However, this ability does not extend to narrative summarization, where even advanced LLMs struggle to produce consistently faithful summaries. Lastly, we highlight the difficulty of evaluating high-performing LLMs, showing that crowdsourced evaluation of LLM outputs may no longer be reliable as fluency and coherence improve: we observe a substantial gap between crowd workers and experts in identifying deficiencies in LLM-generated narrative summaries.

Identifier oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/4jvc-1c53
Date January 2023
Creators Ladhak, Faisal
Source Sets Columbia University
Language English
Detected Language English
Type Theses