A New Framework and Novel Techniques to Multimodal Concept Representation and Fusion

To solve real-world problems, machines need to perceive multiple modalities and fuse the information they carry. This thesis studies learning to understand and fuse multimodal information. Existing approaches follow a three-stage learning paradigm. The first stage trains a model for each modality; for video understanding models, this usually relies on supervised training, which does not scale. Moreover, these modality-specific models are replaced frequently as single-modality perception abilities improve.

The second stage is crossmodal pretraining, which trains a model to align and fuse multiple modalities from paired multimodal data, such as video-caption pairs; this process is resource-intensive and expensive. The third stage further fine-tunes or prompts the resulting model for specific downstream tasks. The key bottleneck of conventional methods lies in the continuous feature representations used for non-textual modalities, which are costly to align and fuse with text.

In this thesis, we investigate representation and fusion based on textual concepts. We propose to map non-textual modalities to textual concepts and then fuse these concepts using text models. We systematically study specific methods of mapping and different architectures for fusion. The proposed methods include an end-to-end video-based text generation model with differentiable tokenization for video and audio concepts, a contrastive-model-based architecture with a zero-shot concept extractor, a deep concept injection algorithm that enables language models to solve multimodal tasks without any training, and a distant supervision framework that learns concepts over long temporal spans.
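The abstract only names these methods, so as a concrete illustration, the following minimal Python sketch shows one way the "map to textual concepts, then fuse with a text model" idea could look, using an off-the-shelf contrastive vision-language model (CLIP) as a zero-shot concept extractor. The concept vocabulary, model checkpoint, frame paths, and prompt format are illustrative assumptions, not the thesis's actual implementation.

# Minimal sketch of "map modalities to textual concepts, then fuse with a text model".
# The concept vocabulary, model choices, and prompt format below are assumptions for
# illustration only; the thesis studies several mapping methods and fusion architectures.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical concept vocabulary; in practice it could come from captions or an ontology.
CONCEPTS = ["dog", "guitar", "beach", "cooking", "crowd", "car", "snow"]

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_concepts(frames: list[Image.Image], top_k: int = 3) -> list[str]:
    """Zero-shot concept extraction: score each frame against the concept vocabulary
    with a contrastive vision-language model and keep the highest-scoring concepts."""
    inputs = processor(text=CONCEPTS, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image        # (num_frames, num_concepts)
    scores = logits.softmax(dim=-1).mean(dim=0)         # average concept scores over frames
    top = scores.topk(top_k).indices.tolist()
    return [CONCEPTS[i] for i in top]

def fuse_with_text_model(concepts: list[str], question: str) -> str:
    """Fusion step: hand the extracted textual concepts to a text-only model as part
    of its input. Here it is just a prompt string; different fusion architectures
    would consume the concepts differently."""
    return f"Video concepts: {', '.join(concepts)}. Question: {question} Answer:"

# Example usage with hypothetical frame files:
# frames = [Image.open(p) for p in ["frame0.jpg", "frame1.jpg"]]
# prompt = fuse_with_text_model(extract_concepts(frames), "What is the person doing?")

Because both stages operate on plain text, either component in such a pipeline can be swapped for a newer model without redoing crossmodal pretraining, which is the upgradability argument the abstract makes.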

With our concept representation, we empirically demonstrate that, without the several orders of magnitude of additional cost that crossmodal pretraining requires, our models achieve competitive or even superior performance on downstream tasks such as video question answering, video captioning, text-video retrieval, and audio-video dialogue. We also examine the possible limitations of concept representations, such as when a dataset's text quality is poor. We believe this shows a potential path toward upgradable multimodal intelligence, whose components can easily be updated to new models or new modalities of data.

Identifier: oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/ckns-m715
Date: January 2024
Creators: Lin, Xudong
Source Sets: Columbia University
Language: English
Detected Language: English
Type: Theses
