Comprehending charts [1, 2, 3] presents significant challenges for machine learning models due to the diverse and intricate shapes that charts take. The chart extraction task requires precise identification of a chart's key components, while the chart question answering (ChartQA) task integrates visual and textual information to answer queries about the chart's content. This research approaches ChartQA from two directions. First, we introduce ChartFormer, a unified framework that simultaneously identifies and classifies every chart element. Beyond the data elements themselves, ChartFormer also recognizes descriptive components such as the chart title, legend, and axes, yielding a comprehensive understanding of the chart's content. Built on an end-to-end transformer architecture, it is particularly effective for complex instance segmentation tasks involving many object classes with distinct visual structures. Second, we present Question-guided Deformable Co-Attention (QDCAt), which fuses the two modalities by incorporating question information into a deformable offset network and enriching ChartFormer's visual representation through a deformable co-attention block.

Master of Science

Real-world data often encompasses multimodal information, blending textual descriptions with visual representations. Charts in particular pose a significant challenge for machine learning models due to their condensed and complex structure, and existing multimodal methods often fail to integrate them effectively. To address this gap, we introduce ChartFormer, a unified framework designed to enhance chart understanding through instance segmentation, together with a novel Question-guided Deformable Co-Attention (QDCAt) mechanism that integrates visual and textual features for chart question answering (ChartQA), allowing for more comprehensive reasoning. ChartFormer identifies and classifies chart components such as bars, lines, pies, titles, legends, and axes. QDCAt improves answer accuracy by aligning textual information with visual cues: by dynamically adjusting attention based on the question context, it ensures that the model focuses on the most relevant parts of the chart. Extensive experiments demonstrate that ChartFormer and QDChart, our QDCAt-based ChartQA model, outperform their baselines on chart component recognition and ChartQA by 3.2% in mAP and 15.4% in accuracy, respectively.
These results highlight the efficacy of our approach as a robust solution for detailed visual data interpretation, applicable to a wide range of domains, from scientific research to financial analysis.
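To make the fusion mechanism concrete, the sketch below illustrates one way a question-guided deformable co-attention block could be wired in PyTorch. It is a minimal sketch under stated assumptions, not the thesis implementation: the module and parameter names (offset_net, vis_to_txt, etc.), the pooled-question conditioning, the single feature level, and the bidirectional cross-attention layout are all illustrative choices. The sketch only shows the core idea of conditioning deformable sampling offsets on the question embedding and fusing the sampled visual features with the question tokens via co-attention.

```python
# Minimal sketch of question-guided deformable co-attention.
# Assumption: all names and design details below are illustrative,
# not the QDCAt implementation from the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedDeformableCoAttention(nn.Module):
    """Predict deformable sampling offsets conditioned on the question,
    sample the visual feature map at those offsets, then fuse the result
    with question tokens via bidirectional cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Offset network (hypothetical): maps [visual feature; pooled question]
        # to one (dx, dy) offset per sampling point per spatial location.
        self.offset_net = nn.Linear(2 * dim, 2 * num_points)
        self.vis_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, feat_map: torch.Tensor, question: torch.Tensor):
        # feat_map: (B, C, H, W) visual features, e.g. from a chart encoder.
        # question: (B, L, C) question token embeddings.
        B, C, H, W = feat_map.shape
        q_pool = question.mean(dim=1)  # (B, C) pooled question summary

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, H, device=feat_map.device)
        xs = torch.linspace(-1, 1, W, device=feat_map.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack([gx, gy], dim=-1)  # (H, W, 2), (x, y) order

        # Question-guided offsets, one (dx, dy) per point per location.
        vis = feat_map.flatten(2).transpose(1, 2)  # (B, H*W, C)
        ctx = torch.cat([vis, q_pool[:, None].expand(-1, H * W, -1)], dim=-1)
        offsets = self.offset_net(ctx).view(B, H, W, self.num_points, 2)

        # Sample the feature map at the shifted locations; average the points.
        sampled = torch.zeros_like(feat_map)
        for p in range(self.num_points):
            grid = base[None] + offsets[:, :, :, p]  # (B, H, W, 2)
            sampled = sampled + F.grid_sample(feat_map, grid, align_corners=True)
        vis = (sampled / self.num_points).flatten(2).transpose(1, 2)  # (B, H*W, C)

        # Co-attention: visual features attend to the question and vice versa.
        vis = self.norm_v(vis + self.vis_to_txt(vis, question, question)[0])
        txt = self.norm_t(question + self.txt_to_vis(question, vis, vis)[0])
        return vis, txt


# Usage: fuse a toy 2D feature map with toy question token embeddings.
block = QuestionGuidedDeformableCoAttention(dim=256)
fused_vis, fused_txt = block(torch.randn(2, 256, 16, 16), torch.randn(2, 12, 256))
```

Because the offsets are functions of the question embedding, different questions about the same chart sample different spatial locations before fusion, which is one plausible reading of how question guidance steers the model toward the most relevant chart regions.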
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/120922 |
Date | 13 August 2024 |
Creators | Zheng, Hanwen |
Contributors | Computer Science & Applications, Huang, Lifu, Heath, Lenwood S., Thomas, Christopher Lee, Yanardag Delul, Pinar |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |