
Video Action Understanding: Action Classification, Temporal Localization, and Detection

Video action understanding involves comprehending the actions that humans perform in videos. Central to this task are four fundamental questions: What, When, Where, and Who. These questions encapsulate the essence of action classification, temporal action localization, action detection, and actor recognition, respectively. Despite notable progress on these tasks, many challenges persist, and in this dissertation we propose innovative solutions to tackle them head-on.
First, we address the challenges in action classification ("What?"), specifically multi-view action recognition. We propose a novel transformer decoder-based model with learnable view and action queries that enforces the learning of action features robust to viewpoint shifts. Next, we focus on temporal action localization ("What?" and "When?") and address the challenges introduced by the multi-label setting. Our solution leverages the inherent relationships between complex actions in real-world videos: we introduce an attention-based architecture that models these relationships for temporal action localization.
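As a rough illustration of the query-based design, the following is a minimal PyTorch sketch of a transformer decoder that cross-attends from learnable view and action query embeddings to backbone video tokens. All module choices, dimensions, and head counts here are illustrative assumptions, not the dissertation's implementation.

```python
import torch
import torch.nn as nn

class ViewActionDecoder(nn.Module):
    """Decoder with learnable view and action queries (illustrative sketch)."""

    def __init__(self, feat_dim=512, num_views=5, num_actions=60, num_layers=2):
        super().__init__()
        # Learnable query embeddings: one per camera view, one per action class.
        self.view_queries = nn.Parameter(torch.randn(num_views, feat_dim))
        self.action_queries = nn.Parameter(torch.randn(num_actions, feat_dim))
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.view_head = nn.Linear(feat_dim, num_views)
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, video_tokens):
        # video_tokens: (batch, num_tokens, feat_dim) from any video backbone.
        b = video_tokens.size(0)
        queries = torch.cat([self.view_queries, self.action_queries], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        # Queries cross-attend to the video tokens.
        decoded = self.decoder(queries, video_tokens)
        n_v = self.view_queries.size(0)
        view_logits = self.view_head(decoded[:, :n_v].mean(dim=1))
        action_logits = self.action_head(decoded[:, n_v:].mean(dim=1))
        return view_logits, action_logits
```

In a sketch like this, the action queries are parameters shared across all training clips, so supervision from multiple views of the same action updates the same queries, nudging them toward view-robust action features, mirroring the goal described above.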
Next, we propose Gabriella, a real-time online system for activity detection ("What?", "When?", and "Where?") in security videos. The system has three stages: tubelet extraction, activity classification, and online tubelet merging. For tubelet extraction, we propose a localization network that detects potential foreground regions to generate action tubelets. The detected tubelets are assigned activity class scores by the classification network and merged by our Tubelet-Merge Action-Split (TMAS) algorithm into the final action detections. Finally, we introduce the novel task of joint action and actor recognition ("What?" and "Who?") and solve it with disentangled representation learning: our transformer-based model simultaneously identifies subjects (actors) and their actions, learning to separate actor and action features by combining supervised contrastive losses with a standard cross-entropy loss.
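To make the disentanglement objective concrete, the following is a generic sketch of how a standard supervised contrastive loss can be combined with cross-entropy over separate actor and action branches. The loss formulation, function names, and the weighting factor alpha are assumptions for illustration, not the model's actual code.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    # features: (batch, dim) embeddings; labels: (batch,) integer class labels.
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature          # pairwise similarities
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)    # positives share a label
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    mask = mask & ~self_mask                             # exclude self-pairs
    # Log-softmax over all other samples, then average over the positives.
    logits = sim.masked_fill(self_mask, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)      # avoid -inf * 0 = nan
    pos_counts = mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * mask).sum(dim=1) / pos_counts
    return loss.mean()

def total_loss(actor_feats, action_feats, actor_logits, action_logits,
               actor_labels, action_labels, alpha=0.5):
    # Cross-entropy supervises both classification heads; the contrastive
    # terms pull same-actor (resp. same-action) embeddings together within
    # each branch. alpha is an illustrative weighting hyperparameter.
    ce = F.cross_entropy(actor_logits, actor_labels) + \
         F.cross_entropy(action_logits, action_labels)
    supcon = supervised_contrastive_loss(actor_feats, actor_labels) + \
             supervised_contrastive_loss(action_feats, action_labels)
    return ce + alpha * supcon
```

In this sketch, each branch is penalized only with its own label set, so actor identity has no gradient signal in the action branch and vice versa, which encourages the two feature spaces to separate.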

Identifier: oai:union.ndltd.org:ucf.edu/oai:stars.library.ucf.edu:etd2023-1402
Date: 01 January 2024
Creators: Tirupattur, Praveen
Publisher: STARS
Source Sets: University of Central Florida
Language: English
Detected Language: English
Type: text
Format: application/pdf
Source: Graduate Thesis and Dissertation 2023-2024
