
Multimodal Representations for Video

My thesis spans the fields of multimodal and video analysis in computer vision, aiming to bridge the gap between human perception and machine understanding. Recognizing the interplay among signals such as text, audio, and visual data, my research develops novel frameworks that integrate these diverse modalities to achieve a deeper understanding of complex scenes, with a particular emphasis on video analysis.

As part of this exploration, I study diverse tasks such as translation, future prediction, and visual question answering, all connected through the lens of multimodal and video representations. I present novel approaches for each of these challenges, contributing across different facets of computer vision: from dataset creation to algorithmic innovations, and from achieving state-of-the-art results on established benchmarks to introducing new tasks.

Methodologically, my thesis embraces two key approaches: self-supervised learning and the integration of structured representations. Self-supervised learning, a technique that allows models to learn from unlabeled data, helps uncover inherent connections within multimodal and video inputs. Structured representations, on the other hand, serve as a means to capture complex temporal patterns and uncertainties inherent in video analysis. By employing these techniques, I offer novel insights into modeling multimodal representations for video analysis, showing improved performance over prior work in all studied scenarios.
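The abstract does not spell out the specific objectives used, but a common instance of self-supervised learning across modalities is a symmetric contrastive (InfoNCE-style) loss that aligns clip and caption embeddings from unlabeled pairs. The sketch below illustrates that idea; the function name, temperature, batch size, and embedding dimension are illustrative assumptions, not details taken from the thesis.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matching video/text pairs share the same batch index."""
    # Normalize so the dot product is cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix of shape (batch, batch).
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (video-to-text and text-to-video).
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2

# Example: a batch of 8 clip/caption embedding pairs of dimension 256 (hypothetical values).
video_emb = torch.randn(8, 256)
text_emb = torch.randn(8, 256)
print(contrastive_loss(video_emb, text_emb))
```

In this kind of setup, no manual labels are required: the pairing of a video clip with its naturally co-occurring text provides the supervisory signal.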

Identifier: oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/a90w-t724
Date: January 2024
Creators: Suris Coll-Vinent, Didac
Source Sets: Columbia University
Language: English
Detected Language: English
Type: Theses
