
Unified Approaches for Multi-Task Vision-Language Interactions

Vision and language are two major modalities that humans rely on to perceive the environment and understand the world. Recent advances in Artificial Intelligence (AI) have facilitated the development of a variety of vision-language tasks derived from diverse multimodal interactions in daily life, such as image captioning, image-text matching, visual question answering (VQA), and text-to-image generation. Despite their remarkable performance, most previous state-of-the-art models are specialized for a single vision-language task and lack generalizability across multiple tasks. Additionally, these specialized models complicate algorithm design and introduce redundancy in model deployment when dealing with complex scenes.

In this study, we investigate developing unified approaches capable of solving various vision-language interactions in a multi-task manner. We argue that unified multi-task methods could enjoy several potential advantages: (1) a unified framework for multiple tasks reduces the human effort of designing a separate model for each task; (2) reusing and sharing parameters across tasks improves efficiency; (3) some tasks may be complementary to others, so multi-tasking can boost performance; (4) such methods can handle complex tasks that require the joint collaboration of multiple basic tasks, enabling new applications.

In the first part of this thesis, we explore unified multi-task models with the goal of sharing and reusing as many parameters as possible across different tasks. We start by unifying several vision-language question-answering tasks, such as visual entailment, outside-knowledge VQA, and visual commonsense reasoning, within a simple iterative divide-and-conquer framework: it iteratively decomposes the original text question into sub-questions, solves each sub-question, and derives the answer to the original question, thereby uniformly handling reasoning of various types and semantic levels within one framework. In the next work, we take a step further and unify image-to-text generation, text-to-image generation, vision-language understanding, and image-text matching in a single large-scale Transformer-based model. These two works demonstrate the feasibility, effectiveness, and efficiency of sharing parameters across different tasks in a single model. Nevertheless, they still need to switch between tasks and can only conduct one task at a time.
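As a rough illustration of the iterative divide-and-conquer framework described above, the sketch below shows one way the loop could be organized; the `vqa_model` callable, the prompt formats, and the helper logic are hypothetical assumptions for exposition, not the thesis's actual interfaces.

```python
from typing import Callable, List, Tuple

def divide_and_conquer_vqa(
    vqa_model: Callable[[str, str], str],  # (image_path, prompt) -> text response
    image_path: str,
    question: str,
    max_steps: int = 3,
) -> str:
    """Iteratively decompose the question, answer the sub-questions, then derive the final answer."""
    evidence: List[Tuple[str, str]] = []  # accumulated (sub-question, sub-answer) pairs
    for _ in range(max_steps):
        # Ask the model for one simpler sub-question, or a signal that decomposition is finished.
        sub_q = vqa_model(image_path, f"Decompose into one sub-question, or reply DONE: {question}")
        if sub_q.strip().upper() == "DONE":
            break
        sub_a = vqa_model(image_path, sub_q)  # solve the sub-question against the image
        evidence.append((sub_q, sub_a))
    # Derive the answer to the original question conditioned on the collected sub-answers.
    history = " ".join(f"Q: {q} A: {a}" for q, a in evidence)
    return vqa_model(image_path, f"{history} Original question: {question} Answer:")
```

Because every step is phrased as question answering over the same image, a loop of this shape can cover reasoning tasks as different as visual entailment and visual commonsense reasoning without task-specific heads.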

In the second part of this thesis, we introduce our efforts toward simultaneous multi-task models that can conduct multiple tasks at the same time with a single model. This brings additional advantages: the model can learn to perform different tasks, or combinations of tasks, automatically according to user queries, and the joint interaction of tasks can enable new applications. We begin by compounding spatial understanding and semantic understanding in a single multimodal Transformer-based model. To enable the model to understand and localize local regions, we propose a hybrid region representation that seamlessly bridges regions with the image and the text. Coupled with a carefully curated training dataset, our model can perform spatial and semantic understanding simultaneously and enables a new application: spatial reasoning. Building on this project, we further introduce an effective module for encoding high-resolution images and propose a pre-training method that aligns semantic and spatial understanding at high resolution. In addition, we couple Optical Character Recognition (OCR) capability with spatial understanding in the model and study techniques to improve the compatibility of the various tasks.
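The idea of a hybrid region representation can be pictured with a minimal sketch: a bounding box is exposed to the language stream as quantized coordinate tokens and to the vision stream as a continuous feature pooled from the image feature map. The token format, bin count, and pooling below are illustrative assumptions, not the exact design used in the thesis.

```python
import torch

def region_to_text(box, image_w, image_h, bins=100):
    """Quantize a box (x1, y1, x2, y2) in pixels into discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    tokens = [min(int(c * bins), bins - 1) for c in coords]
    return "<box>" + " ".join(f"<loc_{t}>" for t in tokens) + "</box>"

def region_to_feature(feature_map, box, image_w, image_h):
    """Pool a continuous region embedding from a (C, H, W) image feature map."""
    c, fh, fw = feature_map.shape
    x1, y1, x2, y2 = box
    # Map pixel coordinates onto the feature grid, keeping at least one cell per side.
    gx1 = int(x1 / image_w * fw)
    gx2 = max(int(x2 / image_w * fw), gx1 + 1)
    gy1 = int(y1 / image_h * fh)
    gy2 = max(int(y2 / image_h * fh), gy1 + 1)
    return feature_map[:, gy1:gy2, gx1:gx2].mean(dim=(1, 2))  # (C,) region embedding

# Example: a 224x224 image, a ViT-like 14x14 feature map, and one region.
fmap = torch.randn(768, 14, 14)
box = (30, 40, 120, 160)
print(region_to_text(box, 224, 224))
print(region_to_feature(fmap, box, 224, 224).shape)  # torch.Size([768])
```

Writing coordinates as text tokens lets the model both read region references in user queries and emit them in its outputs, which is what makes joint spatial and semantic understanding, and applications such as spatial reasoning, possible in one pass.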

Identifier: oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/n044-ra77
Date: January 2024
Creators: You, Haoxuan
Source Sets: Columbia University
Language: English
Detected Language: English
Type: Theses
