Zero-shot Natural Language-Video Localization (NLVL) has shown promising results in training NLVL models solely with raw video data through dynamic video segment proposal generation and pseudo-query annotations. However, existing pseudo-queries are weakly grounded in the source video and, because they are unstructured, lack common ground with it. In this work, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that utilizes commonsense information to bridge the gap between videos and generated pseudo-queries through a commonsense enhancement module. Our approach employs Graph Convolutional Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query vectors prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that our model surpasses both zero-shot and weakly supervised baselines. These results underscore the significance of leveraging commonsense reasoning abilities in multimodal understanding tasks.

Master of Science

Natural Language Video Localization (NLVL) is the task of retrieving relevant video segments from an untrimmed video given a user text query. To train an NLVL system, traditional methods require annotations on the input videos: the span of each video segment (i.e., start and end timestamps) and the accompanying text query describing the segment. These annotations are laborious to collect, whatever the domain or video length. To alleviate this, zero-shot NLVL methods generate the annotations dynamically. However, current zero-shot NLVL approaches suffer from poor alignment between the video and the dynamically generated query, which can introduce noise into the localization process. To this end, this work investigates the impact on zero-shot NLVL of the implicit commonsense knowledge that humans innately possess. We introduce CORONET, a zero-shot NLVL framework that utilizes commonsense information to bridge the gap between videos and generated pseudo-queries. Experiments on two benchmark datasets, covering videos on diverse themes, highlight the effectiveness of leveraging commonsense information.
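
The commonsense enhancement step described in the abstract (a GCN encoding of knowledge-graph concepts, followed by cross-attention over the video and pseudo-query vectors) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names (`gcn_layer`, `cross_attention`, `enhance`), the residual addition, the toy graph, and all numeric values are assumptions made for illustration, and the sketch omits learned attention projections as well as the video-conditioned selection of concepts.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    `adj` is assumed to be a symmetric 0/1 adjacency matrix.
    """
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query row attends over all concept rows."""
    d = len(keys_values[0])
    keys_t = [list(col) for col in zip(*keys_values)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, keys_values)

def enhance(features, concepts):
    """Hypothetical enhancement: add the attended concept summary to each feature row."""
    attended = cross_attention(features, concepts)
    return [[f + a for f, a in zip(f_row, a_row)] for f_row, a_row in zip(features, attended)]

# Toy commonsense graph: three concept nodes connected in a chain (made-up data).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity stands in for a learned weight matrix
concepts = gcn_layer(adj, node_feats, weight)

video = [[0.5, 0.1], [0.2, 0.9]]   # two clip vectors (made-up)
query = [[0.3, 0.3]]               # one pseudo-query vector (made-up)
video_enh = enhance(video, concepts)
query_enh = enhance(query, concepts)
```

In this sketch the same concept encodings enhance both modalities, which is one simple way to give the video and the pseudo-query a shared reference before localization; the actual framework may combine them differently.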
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/115684 |
Date | 07 July 2023 |
Creators | Holla, Meghana |
Contributors | Computer Science and Applications, Lourentzou, Ismini, Ramakrishnan, Narendran, Huang, Lifu |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |