What, When, and Where Exactly? Human Activity Detection in Untrimmed Videos Using Deep Learning

Over the past decade, there has been an explosion in the volume of video data, including internet videos and surveillance camera footage. These videos often feature extended durations with unedited content, predominantly filled with background clutter, while the relevant activities of interest occupy only a small portion of the footage. Consequently, there is a compelling need for advanced processing techniques to automatically analyze this vast reservoir of video data, specifically with the goal of identifying the segments that contain the events of interest. Given that humans are the primary subjects in these videos, comprehending human activities plays a pivotal role in automated video analysis.
This thesis seeks to tackle the challenge of detecting human activities from untrimmed videos, aiming to classify and pinpoint these activities both in their spatial and temporal dimensions. To achieve this, we propose a modular approach. We begin by developing a temporal activity detection framework, and then progressively extend the framework to support activity detection in the spatio-temporal dimension.
To perform temporal activity detection, we introduce an end-to-end trainable deep learning model leveraging 3D convolutions. Additionally, we propose a novel and adaptable fusion strategy to combine both the appearance and motion information extracted from a video, using RGB and optical flow frames. Importantly, we incorporate the learning of this fusion strategy into the activity detection framework.
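To make the fusion idea concrete, below is a minimal PyTorch sketch of one way a learnable appearance–motion fusion could be realized: a per-channel gate, trained jointly with the detector, weights the RGB and optical-flow streams. The module names, the small 3D-conv stems, and the sigmoid-gated form are illustrative assumptions, not the thesis's actual design.

```python
import torch
import torch.nn as nn

class GatedTwoStreamFusion(nn.Module):
    """Sketch of a learnable fusion of appearance (RGB) and motion
    (optical flow) features. Names and gating form are hypothetical."""

    def __init__(self, channels: int):
        super().__init__()
        # Small 3D-conv stems stand in for full appearance/motion backbones.
        self.rgb_stem = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        self.flow_stem = nn.Conv3d(2, channels, kernel_size=3, padding=1)
        # Learnable per-channel gate, optimized end to end with the detector.
        self.gate = nn.Parameter(torch.zeros(channels))

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb: (N, 3, T, H, W); flow: (N, 2, T, H, W)
        a = self.rgb_stem(rgb)
        m = self.flow_stem(flow)
        w = torch.sigmoid(self.gate).view(1, -1, 1, 1, 1)
        return w * a + (1.0 - w) * m  # fused spatio-temporal features
```

Because the gate is a parameter rather than a fixed averaging rule, the network itself decides, per channel, how much to trust appearance versus motion cues.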
Building upon the temporal activity detection framework, we extend it by incorporating a spatial localization module to enable activity detection both in space and time in a holistic end-to-end manner. To accomplish this, we leverage shared spatio-temporal feature maps to jointly optimize both spatial and temporal localization of activities, thus making the entire pipeline more effective and efficient.
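The sketch below illustrates the shared-feature idea in PyTorch: a single 3D backbone feeds both a temporal head (activity class per frame) and a spatial head (a bounding box per frame), so gradients from both tasks update the same feature maps. The layer sizes, pooling resolution, and head designs are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SpatioTemporalDetector(nn.Module):
    """Illustrative joint detector: one backbone, two heads sharing
    spatio-temporal features. Architecture details are hypothetical."""

    def __init__(self, channels: int = 64, num_classes: int = 20):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 7, 7)),  # keep the temporal axis
        )
        self.temporal_head = nn.Linear(channels * 7 * 7, num_classes)
        self.spatial_head = nn.Linear(channels * 7 * 7, 4)  # (x, y, w, h)

    def forward(self, clip: torch.Tensor):
        # clip: (N, 3, T, H, W) -> shared features (N, C, T, 7, 7)
        f = self.backbone(clip)
        n, c, t = f.shape[:3]
        f = f.permute(0, 2, 1, 3, 4).reshape(n, t, -1)  # per-frame features
        return self.temporal_head(f), self.spatial_head(f)
```

Training both heads against a combined loss is what makes the pipeline end-to-end: the spatial and temporal objectives jointly shape the shared representation instead of being optimized in separate stages.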
Finally, we introduce several novel techniques for modeling actor motion, specifically designed for efficient activity recognition. This is achieved by harnessing 2D pose information extracted from video frames and then representing human motion through bone movement, bone orientation, and body joint positions.
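As a concrete illustration of these cues, the NumPy sketch below derives bone vectors from 2D joint positions, then computes bone movement as frame-to-frame differences and bone orientation as unit bone vectors. The skeleton topology and joint indexing are assumptions for this sketch; the thesis's exact representations may differ.

```python
import numpy as np

# Illustrative skeleton: each bone is a (child, parent) joint-index pair.
# This bone list and joint indexing are hypothetical.
BONES = [(1, 0), (2, 1), (3, 2)]  # e.g. a hip -> knee -> ankle chain

def motion_features(poses: np.ndarray) -> dict:
    """poses: (T, J, 2) array of 2D joint positions over T frames.
    Returns the three cues named above: bone movement, bone
    orientation, and raw body joint positions."""
    # Bone vectors: child joint minus parent joint, shape (T, B, 2).
    bones = np.stack([poses[:, c] - poses[:, p] for c, p in BONES], axis=1)
    movement = np.diff(bones, axis=0)  # frame-to-frame bone change, (T-1, B, 2)
    norms = np.linalg.norm(bones, axis=-1, keepdims=True)
    orientation = bones / np.maximum(norms, 1e-8)  # unit bone directions
    return {"movement": movement, "orientation": orientation, "joints": poses}
```

Working from a handful of 2D keypoints per frame, rather than dense pixels or optical flow, is what makes such representations computationally cheap.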
Experimental evaluations on benchmark datasets demonstrate the effectiveness of the proposed temporal and spatio-temporal activity detection methods compared to current state-of-the-art approaches. Moreover, the proposed motion representations excel in both performance and computational efficiency. Ultimately, this research paves the way toward imbuing computers with social visual intelligence, enabling them to comprehend human activities across time and space and opening up exciting possibilities for the future.

Identifier: oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/45709
Date: 06 December 2023
Creators: Rahman, Md Atiqur
Contributors: Laganière, Robert
Publisher: Université d'Ottawa / University of Ottawa
Source Sets: Université d'Ottawa
Language: English
Detected Language: English
Type: Thesis
Format: application/pdf