Learning and reasoning with common sense is a challenging problem in Artificial Intelligence (AI). Humans have the remarkable ability to interpret images and text from different perspectives in multiple modalities, and to use large amounts of commonsense knowledge while performing visual or textual tasks. Inspired by that ability, we approach commonsense learning as leveraging perspectives from multiple modalities for images and text in the context of vision and language tasks.
Given a target task (e.g., textual reasoning, matching images with captions), our system first represents input images and text in multiple modalities (e.g., vision, text, abstract scenes, and facts). Those modalities provide different perspectives from which to interpret the input images and text. Based on those perspectives, the system then performs reasoning to make a joint prediction for the target task. Surprisingly, we show that interpreting textual assertions and scene descriptions in the modality of abstract scenes improves performance on various textual reasoning tasks, and that interpreting images in the modality of Visual Question Answering improves performance on caption retrieval, a visual reasoning task. Using grounding, imagination, and question-answering approaches to interpret images and text in different modalities, we show that learning commonsense knowledge from multiple modalities improves the performance of downstream vision and language tasks, improves the interpretability of the model, and makes more efficient use of training data.
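As an illustration of this joint-prediction step, the sketch below shows one minimal way that scores from modality-specific interpretations could be fused into a single decision; the scorer functions, weights, and candidate answers are hypothetical stand-ins, not the models or fusion scheme actually used in this dissertation.

from typing import Callable, Dict, List

def joint_prediction(candidates: List[str],
                     perspective_scorers: Dict[str, Callable[[str], float]],
                     weights: Dict[str, float]) -> str:
    # Score every candidate under each modality-specific perspective
    # (e.g., text, abstract scene, facts), fuse the scores with a
    # weighted sum, and return the highest-scoring candidate.
    def fused_score(candidate: str) -> float:
        return sum(weights[name] * scorer(candidate)
                   for name, scorer in perspective_scorers.items())
    return max(candidates, key=fused_score)

# Toy usage: stand-in scorers in place of learned modality models.
scorers = {
    "text":           lambda c: 0.7 if "dog" in c else 0.2,
    "abstract_scene": lambda c: 0.6 if "dog" in c else 0.3,
}
weights = {"text": 0.5, "abstract_scene": 0.5}
print(joint_prediction(["a dog runs in the park", "a car is parked"], scorers, weights))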
Complementary to the model aspect, we also study the data aspect of commonsense learning in vision and language. We study active learning for Visual Question Answering (VQA), where a model iteratively grows its knowledge through querying informative questions about images for answers. Drawing analogies from human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a new goal-driven scoring function for deep VQA models under the Bayesian Neural Network framework. Once trained with a large initial training set, a deep VQA model is able to efficiently query informative question-image pairs for answers to improve itself through active learning, saving human effort on commonsense annotations.

Ph. D.

Designing systems that learn and reason with common sense is a challenging problem in Artificial Intelligence (AI). Humans have the remarkable ability to interpret images and text from different perspectives in multiple modalities, and to use large amounts of commonsense knowledge while performing visual or textual tasks. Inspired by that ability, we approach commonsense learning as leveraging perspectives from multiple modalities for images and text in the context of vision and language tasks.
Given a target task, our system first represents the input information (e.g., images and text) in multiple modalities (e.g., vision, text, abstract scenes, and facts). Those modalities provide different perspectives from which to interpret the input information. Based on those perspectives, the system performs reasoning to make a joint prediction that solves the target task. Perhaps surprisingly, we show that imagining (generating) abstract scenes behind input textual scene descriptions improves performance on textual reasoning tasks such as answering fill-in-the-blank and paraphrasing questions, and that answering questions about images improves performance on retrieving image captions. By using perspectives from multiple modalities, our system also makes more efficient use of training data and has a reasoning process that is easier to understand.
Complementary to the system design aspect, we also study the data aspect of commonsense learning in vision and language. We study active learning for Visual Question Answering (VQA). VQA is the task of answering open-ended natural language questions about images. In active learning for VQA, a model iteratively grows its knowledge through querying informative questions about images for answers. Inspired by human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a new goal-driven query selection function. We show that once initialized with a large training set, a VQA model is able to efficiently query informative question-image pairs for answers to improve itself through active learning, saving human effort on commonsense annotations.
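To make the query-selection idea concrete, the sketch below implements the simplest of the three criteria, entropy-based ("cramming") selection, over a pool of unlabeled question-image pairs; the predictor and data are hypothetical stand-ins rather than the deep Bayesian VQA model developed in this work.

import math
from typing import Callable, List, Tuple

def entropy(dist: List[float]) -> float:
    # Shannon entropy of the model's predictive distribution over answers.
    return -sum(p * math.log(p) for p in dist if p > 0)

def select_queries(pool: List[Tuple[str, str]],
                   predict: Callable[[str, str], List[float]],
                   budget: int) -> List[Tuple[str, str]]:
    # Rank unlabeled (image, question) pairs by predictive entropy and
    # pick the most uncertain ones to send to human annotators.
    ranked = sorted(pool, key=lambda pair: entropy(predict(*pair)), reverse=True)
    return ranked[:budget]

# Toy usage: a fake predictor over three candidate answers.
fake_predict = lambda image_id, question: (
    [0.34, 0.33, 0.33] if "color" in question else [0.9, 0.05, 0.05])
pool = [("img1", "what color is the cat?"), ("img2", "is there a cat?")]
print(select_queries(pool, fake_predict, budget=1))  # returns the more uncertain pair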
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/79521 |
Date | 05 October 2017 |
Creators | Lin, Xiao |
Contributors | Electrical and Computer Engineering, Parikh, Devi, Abbott, A. Lynn, Dhillon, Harpreet Singh, Tokekar, Pratap, Batra, Dhruv, Huang, Bert |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertations |
Detected Language | English |
Type | Dissertation |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |