1.
Active learning of an action detector on untrimmed videos. Bandla, Sunil. 22 July 2014.
Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the "untrimmed" nature of real video data.
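As a minimal sketch of the selection criterion described above (not the thesis's actual implementation; the detector interface, `propose_intervals`, and `predict_proba` are assumed names), one could score each unlabeled clip by the total posterior entropy over its voted interval proposals and request annotation for the highest-scoring clip:

```python
import numpy as np

def interval_entropy(probs, eps=1e-12):
    """Shannon entropy of a detector's class posterior for one interval."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.sum(probs * np.log(probs)))

def estimated_uncertainty_reduction(clip, detector, propose_intervals):
    """Hypothetical utility score: total posterior entropy over the candidate
    intervals voted as likely to contain an action. Annotating the clip would
    (optimistically) remove this much uncertainty."""
    intervals = propose_intervals(clip)          # voting-based interval proposals
    return sum(interval_entropy(detector.predict_proba(clip, iv)) for iv in intervals)

def select_clip_to_annotate(unlabeled_clips, detector, propose_intervals):
    """Active-learning step: pick the untrimmed clip whose annotation is
    expected to reduce the detector's uncertainty the most."""
    utilities = [estimated_uncertainty_reduction(c, detector, propose_intervals)
                 for c in unlabeled_clips]
    return unlabeled_clips[int(np.argmax(utilities))]
```

In practice the utility estimate would come from the voting-based localization framework rather than raw entropy, but the argmax selection loop has this same shape.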
2.
Human extremity detection and its applications in action detection and recognition. Yu, Qingfeng. 02 June 2010.
It has been shown that the locations of internal body joints are sufficient visual cues to characterize human motion. In this dissertation I propose that the locations of human extremities, including the head, hands, and feet, provide a powerful approximation to internal body motion. I propose detecting precise extremities from contours obtained by image segmentation or contour tracking. Junctions of the medial axis of a contour are selected as stars, and contour points whose distance to a star is a local maximum are chosen as candidate extremities. The candidates are then filtered by cues including proximity to other candidates, visibility to the stars, and robustness to noise-smoothing parameters.

I present applications of precise extremities to fast human action detection and recognition. Environment-specific features are built from precise extremities and fed into a block-based Hidden Markov Model to decode the fence-climbing action from continuous videos. Precise extremities are grouped into stable contacts when the same extremity does not move for a certain duration; such stable contacts are used to decompose a long continuous video into shorter pieces, each associated with motion features to form primitive motion units. In this way the sequence is abstracted into more meaningful segments, and a search strategy is used to detect the fence-climbing action. Moreover, I propose the histogram of extremities as a general posture descriptor and test it in a Hidden Markov Model-based framework for action recognition.

I further propose detecting probable extremities from raw images without any segmentation. Modeling an extremity as an image patch instead of a single contour point helps overcome the difficulty of segmentation and increases detection robustness. Extremity patches are represented with Histograms of Oriented Gradients, and detection is performed by window-based image scanning; to reduce the computational load, I adopt the integral histogram technique without sacrificing accuracy. The result is a probability map in which each pixel denotes the probability that the surrounding patch belongs to a specific class of extremity. From this map, I propose the histogram of probable extremities as another general posture descriptor. It is tested on several datasets, and the results are compared with those of precise extremities to show the superiority of probable extremities.
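A toy version of the candidate-selection step, assuming the contour is an ordered list of 2D points and the stars are medial-axis junction points (the proximity, visibility, and smoothing-robustness filters from the dissertation are omitted), might look like this:

```python
import numpy as np

def candidate_extremities(contour, stars, min_separation=5):
    """Toy extremity candidate selection: for each star (a medial-axis
    junction), mark contour points whose distance to the star is a local
    maximum along the contour. Returns indices into `contour`."""
    contour = np.asarray(contour, dtype=float)   # (N, 2) ordered contour points
    n = len(contour)
    candidates = set()
    for star in np.asarray(stars, dtype=float):  # (K, 2) junction points
        dist = np.linalg.norm(contour - star, axis=1)
        for i in range(n):
            prev_d = dist[(i - 1) % n]
            next_d = dist[(i + 1) % n]
            if dist[i] >= prev_d and dist[i] >= next_d:   # cyclic local maximum
                candidates.add(i)
    # crude non-maximum suppression: drop candidates closer than min_separation
    kept = []
    for i in sorted(candidates):
        if all(min(abs(i - j), n - abs(i - j)) >= min_separation for j in kept):
            kept.append(i)
    return kept
```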
3.
Video Action Understanding: Action Classification, Temporal Localization, and Detection. Tirupattur, Praveen. 01 January 2024.
Video action understanding involves comprehending the actions performed by humans in videos. Central to this task are four fundamental questions: What, When, Where, and Who. These questions encapsulate the essence of action classification, temporal action localization, action detection, and actor recognition. Despite notable progress on these tasks, many challenges persist, and in this dissertation we propose innovative solutions to tackle them head-on.
First, we address the challenges in action classification ("What?"), specifically multi-view action recognition. We propose a novel transformer decoder-based model with learnable view and action queries that enforces the learning of action features robust to shifts in viewpoint. Next, we focus on temporal action localization ("What?" and "When?") and address the challenges introduced by the multi-label setting. Our solution leverages the inherent relationships between complex actions in real-world videos: we introduce an attention-based architecture that models these relationships for the task of temporal action localization.
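As a rough illustration of the learnable-query idea, and not the dissertation's exact architecture (layer sizes, head design, and all names below are assumptions), a transformer decoder can refine a set of learnable action and view queries by cross-attending to per-frame features from a video backbone:

```python
import torch
import torch.nn as nn

class QueryActionDecoder(nn.Module):
    """Toy transformer decoder with learnable action/view queries that
    cross-attend to per-frame video features (illustrative only)."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2,
                 num_actions=20, num_views=4):
        super().__init__()
        self.action_queries = nn.Parameter(torch.randn(num_actions, d_model))
        self.view_queries = nn.Parameter(torch.randn(num_views, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, 1)   # presence logit per action query
        self.view_head = nn.Linear(d_model, 1)     # presence logit per view query

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, d_model) from a video backbone
        b = frame_features.size(0)
        queries = torch.cat([self.action_queries, self.view_queries], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        refined = self.decoder(tgt=queries, memory=frame_features)
        n_act = self.action_queries.size(0)
        action_logits = self.action_head(refined[:, :n_act]).squeeze(-1)
        view_logits = self.view_head(refined[:, n_act:]).squeeze(-1)
        return action_logits, view_logits
```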
Next, we propose Gabriella, a real-time online system for activity detection ("What?", "When?", and "Where?") in security videos. The proposed system has three stages: tubelet extraction, activity classification, and online tubelet merging. For tubelet extraction, we propose a localization network that detects potential foreground regions to generate action tubelets. The detected tubelets are assigned activity class scores by the classification network and merged using our Tubelet-Merge Action-Split (TMAS) algorithm to form the final action detections. Finally, we introduce an approach to the novel task of joint action and actor recognition ("What?" and "Who?") based on disentangled representation learning, simultaneously identifying both subjects (actors) and their actions. Our transformer-based model learns to separate actor and action features effectively by employing supervised contrastive losses alongside the standard cross-entropy loss.
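The loss combination mentioned for the actor/action disentanglement model can be sketched as follows; this is an illustrative single-view supervised contrastive term plus cross-entropy on both heads, with the weighting and all names assumed rather than taken from the dissertation:

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Basic supervised contrastive loss: pull together samples sharing a
    label, push apart the rest (single-view variant, illustrative only)."""
    z = F.normalize(features, dim=1)                         # (B, d)
    sim = z @ z.t() / temperature                            # (B, B) similarities
    b = z.size(0)
    self_mask = torch.eye(b, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # log-softmax over all other samples for each anchor
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_count
    return loss[pos_mask.any(dim=1)].mean()                  # skip anchors without positives

def joint_actor_action_loss(action_feat, actor_feat,
                            action_logits, actor_logits,
                            action_labels, actor_labels, weight=0.5):
    """Cross-entropy on both heads plus contrastive terms that encourage
    action features to cluster by action and actor features by actor."""
    ce = F.cross_entropy(action_logits, action_labels) + \
         F.cross_entropy(actor_logits, actor_labels)
    con = supervised_contrastive_loss(action_feat, action_labels) + \
          supervised_contrastive_loss(actor_feat, actor_labels)
    return ce + weight * con
```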
4.
Le mouvement en action : estimation du flot optique et localisation d'actions dans les vidéos / Motion in action: optical flow estimation and action localization in videos. Weinzaepfel, Philippe. 23 September 2016.
With the recent overwhelming growth of digital video content, automatic video understanding has become an increasingly important issue. This thesis introduces several contributions on two automatic video understanding tasks: optical flow estimation and human action localization.

Optical flow estimation consists in computing the displacement of every pixel in a video and faces several challenges, including large non-rigid displacements, occlusions, and motion boundaries. We first introduce an optical flow approach based on a variational model that incorporates a new matching method. The proposed matching algorithm is built upon a hierarchical multi-layer correlational architecture and effectively handles non-rigid deformations and repetitive textures. It improves the flow estimation in the presence of significant appearance changes and large displacements. We also introduce a novel scheme for estimating optical flow based on a sparse-to-dense interpolation of matches that respects edges. This method leverages an edge-aware geodesic distance tailored to respect motion boundaries and to handle occlusions. Furthermore, we propose a learning-based approach for detecting motion boundaries. Motion boundary patterns are predicted at the patch level using structured random forests. We experimentally show that our approach outperforms a flow-gradient baseline on both synthetic data and real-world videos, including a newly introduced dataset of consumer videos.

Human action localization consists in recognizing the actions that occur in a video, such as 'drinking' or 'phoning', as well as their temporal and spatial extent. We first propose a novel approach based on deep convolutional neural networks. The method extracts class-specific tubes, leveraging recent advances in detection and tracking. The tube description is enhanced by spatio-temporal local features, and temporal detection is performed using a sliding-window scheme inside each tube. Our approach outperforms the state of the art on challenging action localization benchmarks. Second, we introduce a weakly supervised action localization method, i.e., one that does not require bounding-box annotation. Action proposals are computed by extracting tubes around the humans, using a human detector robust to unusual poses and occlusions that is learned on a human pose benchmark. A high recall is reached with only a few human tubes, allowing Multiple Instance Learning to be applied effectively. Furthermore, we introduce a new dataset for human action localization. It overcomes the limitations of existing benchmarks, such as the diversity and duration of the videos. Our weakly supervised approach obtains results close to fully supervised ones while significantly reducing the required amount of annotation.
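To make the edge-aware interpolation idea concrete, here is a deliberately simplified sketch: each pixel inherits the flow of its geodesically nearest match, where the geodesic cost penalizes crossing strong image edges. The actual method uses a locally weighted estimator rather than nearest-neighbor copying, and the names and cost model below are assumptions:

```python
import heapq
import numpy as np

def geodesic_sparse_to_dense(matches, edge_map, lam=10.0):
    """Toy sparse-to-dense flow interpolation: assign every pixel the flow of
    its geodesically nearest match, where crossing a strong image edge is
    expensive (illustrative stand-in for edge-aware interpolation).

    matches:  list of ((row, col), (du, dv)) sparse correspondences
    edge_map: (H, W) array of edge strength in [0, 1]
    Returns:  (H, W, 2) dense flow field
    """
    h, w = edge_map.shape
    dist = np.full((h, w), np.inf)
    flow = np.zeros((h, w, 2))
    heap = []
    for (r, c), (du, dv) in matches:
        dist[r, c] = 0.0
        flow[r, c] = (du, dv)
        heapq.heappush(heap, (0.0, r, c))

    # multi-source Dijkstra on a 4-connected grid; step cost grows with edge strength
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r, c]:
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + 1.0 + lam * edge_map[nr, nc]
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    flow[nr, nc] = flow[r, c]   # inherit flow of nearest seed
                    heapq.heappush(heap, (nd, nr, nc))
    return flow
```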