1. Trajectory-based Descriptors for Action Recognition in Real-world Videos
Narayan, Sanath. January 2015
This thesis explores motion trajectory-based approaches to recognize human actions in
real-world, unconstrained videos. Recognizing actions is an important task in applications
such as video retrieval, surveillance, human-robot interaction, sports video analysis, video summarization and behaviour monitoring. A considerable amount of research has been done in this regard. Earlier work focused on videos captured by static cameras, where it was relatively easy to recognise the actions. With more videos being captured by moving cameras, recognising actions under irregular camera motion remains a challenge in unconstrained settings with variations in scale, view, illumination, occlusion and unrelated background motion. With the increase in videos captured from wearable or head-mounted cameras, recognising actions in egocentric videos is also explored in this thesis.
First, an effective motion segmentation method to identify the camera motion
in videos captured by moving cameras is explored. Next, action recognition in videos
captured from a normal third-person view (perspective) is discussed. Further, action recognition approaches for first-person (egocentric) views are investigated. First-person videos often contain frequent unintended camera motion, because head movement moves the head-mounted (wearable) camera. This is followed by recognition of actions in egocentric videos in a multi-camera setting. Lastly, novel feature encoding and subvolume sampling (for “deep” approaches) techniques are explored in the context of action recognition in videos.
The first part of the thesis explores two effective segmentation approaches to identify
the motion due to the camera. The first approach is based on fitting curves to the motion
trajectories and finding the model that best fits the camera motion. The curve
fitting approach works only when the generated trajectories are smooth enough. To overcome
this drawback and segment trajectories under non-smooth conditions, a second approach
based on trajectory scoring and grouping is proposed. By identifying the instantaneous
dominant background motion and accordingly aggregating the scores (denoting the
“foregroundness”) along the trajectory, the motion that is associated with the camera can
be separated from the motion due to foreground objects. Additionally, the segmentation result has been used to align videos from moving cameras, resulting in videos that seem to be captured by nearly-static cameras.
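The scoring idea in the paragraph above can be illustrated with a minimal sketch. It is not the thesis's method: it assumes that the per-frame median displacement over all trajectories approximates the dominant background (camera) motion, and that the mean deviation from it serves as the "foregroundness" score. The function names and the threshold are hypothetical.

```python
import numpy as np

def foregroundness_scores(tracks):
    """Score each trajectory by its mean deviation from the dominant motion.

    tracks: array of shape (n_tracks, n_frames, 2) holding point positions.
    Assumption: the per-frame median displacement over all tracks stands in
    for the instantaneous dominant background (camera) motion.
    """
    disp = np.diff(tracks, axis=1)            # frame-to-frame displacements
    dominant = np.median(disp, axis=0)        # one dominant vector per frame
    deviation = np.linalg.norm(disp - dominant, axis=2)
    return deviation.mean(axis=1)             # aggregate along each trajectory

def split_camera_foreground(tracks, threshold=1.0):
    """Return (is_foreground mask, scores); the threshold is illustrative."""
    scores = foregroundness_scores(tracks)
    return scores > threshold, scores
```

Trajectories whose score stays near zero move with the camera; those above the threshold are treated as foreground. A median is only reasonable when background trajectories are the majority in each frame.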
In the second part of the thesis, recognising actions in normal videos captured from
third-person cameras is investigated. To this end, two kinds of descriptors are explored.
The first descriptor is the covariance descriptor adapted for motion trajectories. The covariance descriptor for a trajectory encodes the co-variations of different features along the trajectory’s length. Being a second-order encoding, covariance captures information about the trajectory that first-order encodings do not. The second
descriptor is based on Granger causality. The novel causality descriptor encodes the
“cause and effect” relationships between the motion trajectories of the actions. This
type of interaction descriptor captures the causal inter-dependencies among the motion
trajectories and encodes complementary information, different from that of descriptors
based on the occurrence of features. Causal dependencies are traditionally computed on time-varying signals. We extend this further to capture dependencies between spatiotemporal signals and compute generalised causality descriptors, which perform better than their traditional counterparts.
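For readers unfamiliar with Granger causality, the following sketch shows the traditional time-series form that the paragraph refers to, not the generalised spatiotemporal descriptor of the thesis. It assumes two 1-D signals (for example, one coordinate of two trajectories), a linear autoregressive model fitted by least squares, and a hypothetical function name `granger_strength`.

```python
import numpy as np

def lagged(x, order):
    """Columns x[t-1], ..., x[t-order] for t = order .. len(x)-1."""
    return np.column_stack([x[order - k:len(x) - k] for k in range(1, order + 1)])

def granger_strength(source, target, order=2):
    """Log ratio of residual variances: restricted model / full model.

    The restricted model predicts target from its own past only; the full
    model also uses the past of source. A larger value means the past of
    source helps predict target (source "Granger-causes" target).
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    y = target[order:]
    own = np.hstack([np.ones((len(y), 1)), lagged(target, order)])
    full = np.hstack([own, lagged(source, order)])
    res_own = y - own @ np.linalg.lstsq(own, y, rcond=None)[0]
    res_full = y - full @ np.linalg.lstsq(full, y, rcond=None)[0]
    return float(np.log(res_own.var() / res_full.var()))
```

Evaluating the measure in both directions for every pair of trajectories would give an asymmetric matrix of dependencies, which is one plausible starting point for a descriptor; how the thesis aggregates such values is not stated in the abstract.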
An egocentric or first-person video is captured from the perspective of the person-of-interest (POI). The POI wears a camera and moves around performing his/her activities.
The camera records the events and activities as seen by the POI. The POI, who is performing the actions or activities, is therefore not seen by his/her own camera. Activities
performed by the POI are called first-person actions and third-person actions are those
done by others and observed by the POI. The third part of the thesis explores action
recognition in egocentric videos. Differentiating first-person and third-person actions is important when summarising/analysing the behaviour of the POI. Thus, the goal is to
recognise both the action and the perspective from which it is observed. Trajectory
descriptors are adapted to recognise actions, with the motion trajectory ranking
method of segmentation used as a pre-processing step to identify the camera motion. The motion
segmentation step is necessary to remove unintended head motion (camera motion) during
video capture. To recognise actions and corresponding perspectives in a multi-camera
setup, a novel inter-view causality descriptor based on the causal dependencies between trajectories in different views is explored. Since this is a new problem, two first-person datasets are created with eight actions in third-person and first-person perspectives. The first is a single-camera dataset with action instances from first-person and third-person views. The second is a multi-camera dataset in which each action instance has multiple first-person and third-person views.
In the final part of the thesis, a feature encoding scheme and a subvolume sampling
scheme for recognising actions in videos are proposed. The proposed Hyper-Fisher Vector
feature encoding is based on embedding the Bag-of-Words encoding into the Fisher Vector
encoding. The resulting encoding is simple, effective and improves the classification
performance over the state-of-the-art techniques. This encoding can be used in place of the traditional Fisher Vector encoding in other recognition approaches. The proposed subvolume sampling scheme, used to generate second-layer features in “deep” approaches for action recognition in videos, is based on iteratively increasing the size of the valid subvolumes in the temporal direction to generate newer subvolumes. The proposed sampling requires fewer subvolumes to be generated to “better represent” the actions and is thus less computationally intensive than the original sampling scheme. The techniques are evaluated on large-scale, challenging, publicly available datasets. The Hyper-Fisher Vector combined with the proposed sampling scheme performs better than the state-of-the-art techniques for action classification in videos.

2. Détection, suivi et ré-identification de personnes à travers un réseau de caméra vidéo / People detection, tracking and re-identification through a video camera network
Souded, Malik. 20 December 2013
This CIFRE thesis was carried out in an industrial context and presents a complete framework for people detection and tracking in a camera network. It addresses the main processing steps: people detection, single-camera people tracking, and people re-identification in a multi-camera context. High performance and real-time processing are the two critical constraints that guided this work.
People detection aims to localise and delimit people in video sequences. The proposed detector is a cascade of classifiers trained with the LogitBoost algorithm on region covariance descriptors. An existing state-of-the-art approach is heavily optimized so that it runs in real time and provides better detection performance. The optimization scheme generalises to many other kinds of detectors for which all possible weak classifiers cannot reasonably be tested. Single-camera people tracking aims to provide a set of reliable images of every person observed by each camera, from which a visual signature is extracted, and to supply real-world information that is useful for re-identification. It is achieved by tracking SIFT features with a specific particle filter, combined with a data association framework that infers object tracks from the tracked SIFT points and handles most possible cases, especially occlusions. Finally, people re-identification is performed with an approach based on global appearance that greatly improves an existing state-of-the-art approach, providing better performance while remaining applicable in real time. A context-aware component is introduced to make the visual signature robust to changes in people's orientation, ensuring better re-identification performance in real applications.
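A region covariance descriptor, on which the detector's classifiers are trained, can be sketched as follows. The abstract does not state which per-pixel features the thesis uses, so the feature set here (pixel coordinates, intensity and absolute gradients) is an assumption, as is the function name `region_covariance`.

```python
import numpy as np

def region_covariance(gray, top, left, height, width):
    """Covariance of per-pixel features over a rectangular image region.

    Assumed features per pixel: x, y, intensity, |dI/dx|, |dI/dy|.
    The region is assumed to lie inside the image and to span at least
    two pixels in each direction. Returns a symmetric 5x5 matrix.
    """
    patch = np.asarray(gray, dtype=float)[top:top + height, left:left + width]
    h, w = patch.shape
    grad_y, grad_x = np.gradient(patch)       # derivatives along rows, columns
    ys, xs = np.mgrid[0:h, 0:w]
    features = np.stack(
        [xs, ys, patch, np.abs(grad_x), np.abs(grad_y)], axis=-1
    ).reshape(-1, 5)
    return np.cov(features, rowvar=False)
```

The descriptor has a fixed size regardless of the region's dimensions, which makes it convenient for comparing candidate windows of different scales.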