
Modality Bridging and Unified Multimodal Understanding

Multimodal understanding is a vast realm of research that spans multiple disciplines. A generic multimodal understanding study therefore requires a clear understanding of its goal. Defining the modalities of interest is important, since each modality requires its own considerations. It is equally important to understand whether these modalities should be complementary to each other or have significant overlap in the information they carry. For example, most modalities in biological signals do not overlap significantly with each other, yet they can be used together to improve the range and accuracy of diagnoses. An extreme example of two modalities with significant overlap is an instructional video and its corresponding instructions written out in detailed text. In this study, we focus on multimedia, which includes image, video, audio, and text about real-world everyday events, mostly focused on human activities.

We narrow our study to the important direction of common space learning, since we want to bridge different modalities using the overlap that a given pair of modalities has. Multiple applications require a strong common space to perform desirably. We choose image-text grounding, video-audio autoencoding, video-conditioned text generation, and video-audio-text common space learning for semantic encoding. We examine multiple ideas in each direction and draw important conclusions. In image-text grounding, we learn that different levels of semantic representation help achieve a thorough common space that is representative of both modalities. In video-audio autoencoding, we observe that reconstruction objectives can help build a representative common space. Moreover, there is an inherent problem when dealing with multiple modalities at the same time: different levels of granularity. For example, the sampling rate and granularity of video are much higher and more complicated than those of audio. Hence, it might be more helpful to find a more semantically abstracted common space that does not carry redundant details, especially considering the temporal aspect of the video and audio modalities. In video-conditioned text generation, we examine the possibility of encoding a video sequence using a Transformer (and later decoding the captions using a Transformer decoder). We further explore the possibility of learning latent states for storing real-world concepts without supervision.
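To make the video-conditioned text generation setup concrete, the following is a minimal PyTorch sketch of the idea described above: a Transformer encoder consumes a sequence of video features, and a Transformer decoder generates caption tokens conditioned on the encoded video. The module names, dimensions, and the assumption of pre-extracted frame features are illustrative choices, not the implementation used in this thesis.

```python
# Minimal sketch: Transformer encoder over video features + Transformer decoder for captions.
# All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=6, frame_dim=1024):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, d_model)      # project per-frame video features
        self.token_emb = nn.Embedding(vocab_size, d_model)   # caption token embeddings
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)        # predict the next caption token

    def forward(self, frames, captions):
        # frames:   (batch, num_frames, frame_dim) video features
        # captions: (batch, seq_len) token ids, shifted right during training
        src = self.frame_proj(frames)
        tgt = self.token_emb(captions)
        causal_mask = self.transformer.generate_square_subsequent_mask(captions.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal_mask)
        return self.lm_head(out)                             # (batch, seq_len, vocab_size)


# Usage example with random inputs:
# logits = VideoCaptioner()(torch.randn(2, 16, 1024), torch.randint(0, 10000, (2, 12)))
```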

Using the observations from these three directions, we propose a unified pipeline based on the Transformer architecture to examine whether it is possible to train a (truly) unified pipeline on raw multimodal data without supervision in an end-to-end fashion. This pipeline eliminates ad-hoc feature extraction methods and is independent of any previously trained network, making it simpler and easier to use. Furthermore, it utilizes only one architecture, which enables us to move toward even more simplicity. Hence, we take an ambitious step forward and further unify this pipeline by sharing a single backbone among four major modalities: image, video, audio, and text. We show that it is not only possible to achieve this goal, but also demonstrate the inherent benefits of such a pipeline. We propose a new research direction under multimodal understanding: Unified Multimodal Understanding. This study is the first to examine this idea, and it further pushes the limits by scaling up to multiple tasks, modalities, and datasets.
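The shared-backbone idea can be illustrated with a short PyTorch sketch: each modality is mapped by a light, modality-specific tokenizer into a common token space, and one Transformer encoder is shared by video, audio, and text, producing pooled embeddings in a single common space. The patch sizes, dimensions, and class names below are assumptions for illustration rather than the actual architecture of this work; in practice, a contrastive objective over the pooled embeddings would then align the modalities in the shared space.

```python
# Minimal sketch: modality-specific tokenizers feeding one shared Transformer backbone.
# Shapes, patch sizes, and names are illustrative assumptions, not the thesis implementation.
import torch
import torch.nn as nn


class UnifiedBackbone(nn.Module):
    def __init__(self, d_model=768, nhead=12, num_layers=12, vocab_size=30522):
        super().__init__()
        # Modality-specific tokenizers: raw inputs -> sequences of d_model tokens.
        self.video_patch = nn.Conv3d(3, d_model, kernel_size=(4, 16, 16), stride=(4, 16, 16))
        self.audio_patch = nn.Conv1d(1, d_model, kernel_size=128, stride=128)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # A single Transformer encoder shared by all modalities.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)

    def encode(self, tokens):
        # Mean-pool the shared backbone's outputs into one embedding per clip/sentence.
        return self.backbone(tokens).mean(dim=1)

    def forward(self, video, audio, text):
        # video: (B, 3, T, H, W) raw frames; audio: (B, 1, samples) waveform; text: (B, L) ids
        v = self.video_patch(video).flatten(2).transpose(1, 2)   # (B, Nv, d_model)
        a = self.audio_patch(audio).transpose(1, 2)              # (B, Na, d_model)
        t = self.text_embed(text)                                # (B, L, d_model)
        return self.encode(v), self.encode(a), self.encode(t)    # common-space embeddings
```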

In a nutshell, we examine different possibilities for bridging between a pair of modalities in different applications, observe several limitations, and propose solutions for them. Using these observations, we provide a unified and strong pipeline for learning a common space that could be used for many applications. We show that our approaches perform desirably and significantly outperform the state of the art on different downstream tasks. We set a new baseline with competitive performance for our proposed research direction, Unified Multimodal Understanding.

Identifier: oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/4t6r-xh45
Date: January 2022
Creators: Akbari, Hassan
Source Sets: Columbia University
Language: English
Type: Theses
