Concept Vectors for Zero-Shot Video Generation

Zero-shot video generation involves generating videos of concepts (action classes) that are not seen in the training phase. Although the research community has explored conditional video generation for long, high-resolution videos, zero-shot video generation remains a largely unexplored and challenging task. Most recent works can generate videos for action-object or motion-content pairs, where both the object (content) and action (motion) are observed separately during training, yet results often lack spatial consistency between foreground and background and cannot generalize to complex scenes with multiple objects or actions. In this work, we propose Concept2Vid, which generates zero-shot videos for classes that are completely unseen during training.
In contrast to prior work, our model is not limited to a predefined fixed set of class-level attributes, but rather utilizes semantic information from multiple videos of the same topic to generate samples from novel classes. We evaluate qualitatively and quantitatively on the Kinetics400 and UCF101 datasets, demonstrating the effectiveness of our proposed model.

Master of Science

Humans can generalize to unseen scenarios without explicit feedback. They can be thought of as self-learning Artificial Intelligence agents that collect data from various modalities (video, audio, text) in their surrounding environment to develop new knowledge and adapt to unseen situations. Many recent studies have learned to perform this process for images, but very few have extended it to videos. Videos provide rich multi-modal data, such as text, audio, and images, and hence encode multifaceted knowledge that introduces more complex temporal and spatial constraints. Leveraging videos in combination with text and audio data can help intelligent systems learn in a way closer to how humans do. Zero-shot video generation (ZSVG) involves generating videos of concepts that are not seen during the training phase of a machine learning model. Generating a zero-shot video requires modeling a multitude of temporal and spatial dependencies: the model must maintain temporal coherence while also understanding object properties. Current approaches to ZSVG are not well suited to these challenges.
We propose Concept2Vid which generates zero-shot videos for classes that are completely unseen during training. In contrast to prior work, our model is not limited to a predefined fixed set of class descriptions, but rather utilizes semantic information from multiple videos of the same topic to generate samples from novel classes. We evaluate qualitatively and quantitatively on the Kinetics400 and UCF101 datasets, demonstrating the effectiveness of our proposed model.
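To make the core idea concrete, the sketch below shows one plausible way to form a concept vector for an unseen class by pooling semantic embeddings from several videos on the same topic. This is a minimal illustration under assumptions of ours (mean pooling, L2 normalization, a 512-dimensional embedding space, and the hypothetical function name concept_vector); it is not the thesis's actual architecture.

import numpy as np

def concept_vector(video_embeddings: np.ndarray) -> np.ndarray:
    # video_embeddings: (num_videos, embed_dim) semantic features extracted
    # from several videos of the same (possibly unseen) action class.
    # Mean-pool across videos, then L2-normalize -- a hypothetical stand-in
    # for however Concept2Vid actually aggregates per-video semantics.
    pooled = video_embeddings.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

# Usage: placeholder embeddings from 5 videos of one unseen class (random
# values here); the resulting vector would condition a video generator
# in place of a fixed class-level attribute vector.
embeddings = np.random.randn(5, 512).astype(np.float32)
z_concept = concept_vector(embeddings)
print(z_concept.shape)  # (512,)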

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/110590
Date: 09 June 2022
Creators: Dani, Riya Jinesh
Contributors: Computer Science, Lourentzou, Ismini, Eldardiry, Hoda, Zhou, Dawei
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertations
Language: English
Detected Language: English
Type: Thesis
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
