1 |
Trajectory-based Descriptors for Action Recognition in Real-world Videos
Narayan, Sanath, January 2015
This thesis explores motion trajectory-based approaches to recognize human actions in real-world, unconstrained videos. Recognizing actions is an important task in applications such as video retrieval, surveillance, human-robot interaction, analysis of sports videos, video summarization and behaviour monitoring. A considerable amount of research has been done in this area. Earlier work focused on videos captured by static cameras, where it was relatively easy to recognise the actions. As more videos are captured by moving cameras, recognizing actions under irregular camera motion remains a challenge in unconstrained settings, with variations in scale, view, illumination and occlusion, and with unrelated motion in the background. With the increase in videos captured from wearable or head-mounted cameras, recognizing actions in egocentric videos is also explored in this thesis.
First, an effective motion segmentation method to identify the camera motion in videos captured by moving cameras is explored. Next, action recognition in videos captured from the normal third-person view (perspective) is discussed. Action recognition approaches for first-person (egocentric) views are then investigated. First-person videos often contain frequent unintended camera motion, because head motion moves the head-mounted (wearable) camera. This is followed by recognition of actions in egocentric videos in a multi-camera setting. Lastly, novel feature encoding and subvolume sampling (for “deep” approaches) techniques are explored in the context of action recognition in videos.
The first part of the thesis explores two effective segmentation approaches to identify the motion due to the camera. The first approach is based on curve fitting of the motion trajectories and finding the model which best fits the camera motion model. The curve fitting approach works only when the generated trajectories are smooth enough. To overcome this drawback and segment trajectories under non-smooth conditions, a second approach based on trajectory scoring and grouping is proposed. By identifying the instantaneous dominant background motion and accordingly aggregating the scores (denoting the “foregroundness”) along the trajectory, the motion that is associated with the camera can be separated from the motion due to foreground objects. Additionally, the segmentation result has been used to align videos from moving cameras, resulting in videos that appear to have been captured by nearly static cameras.
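The scoring idea in the paragraph above can be illustrated with a minimal sketch. It is not the thesis's method: the abstract does not say how the dominant background motion is estimated or how scores are aggregated, so the per-frame median displacement and the mean along the trajectory used here are assumptions, and the function name `foregroundness` is hypothetical.

```python
import numpy as np

def foregroundness(trajectories):
    """Score each trajectory by how far its motion departs from the dominant motion.

    trajectories: array of shape (n_tracks, n_frames, 2) holding (x, y) positions.
    Returns one score per track; higher means more foreground-like.
    """
    tracks = np.asarray(trajectories, dtype=float)
    # Frame-to-frame displacement of every track: (n_tracks, n_frames - 1, 2).
    disp = np.diff(tracks, axis=1)
    # Assumption: the instantaneous dominant (camera) motion is the median
    # displacement over all tracks at each time step.
    dominant = np.median(disp, axis=0)
    # Instantaneous score: distance between a track's motion and the dominant motion.
    inst = np.linalg.norm(disp - dominant, axis=2)
    # Assumption: aggregate along the trajectory with a plain mean.
    return inst.mean(axis=1)

# Five background tracks drift with the camera; one foreground track moves differently.
t = np.arange(4.0)
background = [np.stack([s + t, s + 0 * t], axis=1) for s in range(5)]
foreground = [np.stack([t, 2 * t], axis=1)]
scores = foregroundness(np.array(background + foreground))
```

Tracks with a score near zero follow the dominant motion and would be attributed to the camera; a threshold or grouping step, not shown here, would then separate the two sets.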
In the second part of the thesis, recognising actions in normal videos captured from third-person cameras is investigated. To this end, two kinds of descriptors are explored. The first descriptor is the covariance descriptor adapted for motion trajectories. The covariance descriptor for a trajectory encodes the co-variations of different features along the trajectory’s length. Covariance, being a second-order encoding, captures information about the trajectory that differs from that of a first-order encoding. The second descriptor is based on Granger causality. The novel causality descriptor encodes the “cause and effect” relationships between the motion trajectories of the actions. This type of interaction descriptor captures the causal inter-dependencies among the motion trajectories and encodes complementary information, different from that of descriptors based on the occurrence of features. Causal dependencies are traditionally computed on time-varying signals. We extend them to capture dependencies between spatiotemporal signals and compute generalised causality descriptors, which perform better than their traditional counterparts.
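As a rough illustration of the covariance descriptor mentioned above, the sketch below computes the covariance of per-point features along one trajectory and keeps its upper triangle as a vector. The choice of features, the vectorisation, and the name `covariance_descriptor` are assumptions made for illustration; the abstract does not specify them, and any further mapping of the covariance matrix is omitted.

```python
import numpy as np

def covariance_descriptor(features):
    """Second-order descriptor of one trajectory.

    features: array of shape (length, d), one d-dimensional feature vector
    per trajectory point, with d >= 2.
    Returns the upper triangle (including the diagonal) of the d x d covariance.
    """
    feats = np.asarray(features, dtype=float)
    # Columns are feature dimensions, rows are points along the trajectory.
    cov = np.cov(feats, rowvar=False)
    rows, cols = np.triu_indices(cov.shape[0])
    return cov[rows, cols]

# Toy trajectory whose second feature is always twice the first.
descriptor = covariance_descriptor([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
```

Because the matrix is symmetric, the upper triangle holds all d(d+1)/2 distinct entries, so the descriptor length depends only on the number of features and not on the trajectory length.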
An egocentric or first-person video is captured from the perspective of the person-of-interest (POI). The POI wears a camera and moves around performing his/her activities. The camera records the events and activities as seen by the POI, so the POI performing the actions is not visible to the camera he/she wears. Activities performed by the POI are called first-person actions, and third-person actions are those done by others and observed by the POI. The third part of the thesis explores action recognition in egocentric videos. Differentiating first-person and third-person actions is important when summarising or analysing the behaviour of the POI. Thus, the goal is to recognise both the action and the perspective from which it is observed. Trajectory descriptors are adapted to recognise actions, with the motion trajectory ranking method of segmentation used as a pre-processing step to identify the camera motion. The motion segmentation step is necessary to remove unintended head motion (camera motion) during video capture. To recognise actions and the corresponding perspectives in a multi-camera setup, a novel inter-view causality descriptor based on the causal dependencies between trajectories in different views is explored. Since this is a new problem, two first-person datasets are created with eight actions in third-person and first-person perspectives. The first is a single-camera dataset with action instances from first-person and third-person views. The second is a multi-camera dataset in which each action instance has multiple first-person and third-person views.
In the final part of the thesis, a feature encoding scheme and a subvolume sampling scheme for recognising actions in videos are proposed. The proposed Hyper-Fisher Vector feature encoding is based on embedding the Bag-of-Words encoding into the Fisher Vector encoding. The resulting encoding is simple and effective, and improves classification performance over the state-of-the-art techniques. This encoding can be used in place of the traditional Fisher Vector encoding in other recognition approaches. The proposed subvolume sampling scheme, used to generate second-layer features in “deep” approaches for action recognition in videos, is based on iteratively increasing the size of the valid subvolumes in the temporal direction to generate newer subvolumes. The proposed sampling requires fewer subvolumes to “better represent” the actions and is thus less computationally intensive than the original sampling scheme. The techniques are evaluated on large-scale, challenging, publicly available datasets. The Hyper-Fisher Vector combined with the proposed sampling scheme performs better than the state-of-the-art techniques for action classification in videos.
|
2 |
New methods for image classification, image retrieval and semantic correspondence / Nouvelles méthodes pour classification d'image, recherche d'image et correspondence sémantique
Sampaio de Rezende, Rafael, 19 December 2017
The problem of image representation is at the heart of computer vision. The choice of features extracted from an image changes according to the task we want to study. Image retrieval in large databases demands a compressed global vector representing each image, whereas a semantic segmentation problem requires a clustering map of its pixels. Machine learning techniques are the main tool used for the construction of these representations. In this manuscript, we address the learning of visual features for three distinct problems: image retrieval, semantic correspondence and image classification. First, we study the dependency of a Fisher vector representation on the Gaussian mixture model used as its codewords. We introduce the use of multiple Gaussian mixture models for different backgrounds, e.g. different scene categories, and analyze the performance of these representations for object classification and the impact of scene category as a latent variable. Our second approach proposes an extension to the exemplar SVM (ESVM) feature encoding pipeline. We first show that, by replacing the hinge loss with the square loss in the ESVM cost function, similar results in image retrieval can be obtained at a fraction of the computational cost. We call this model the square-loss exemplar machine, or SLEM. Secondly, we introduce a kernelized SLEM variant which benefits from the same computational advantages but displays improved performance. We present experiments that establish the performance and efficiency of our methods using a large array of base feature representations and standard image retrieval datasets. Finally, we propose a deep neural network for the problem of establishing semantic correspondence. We employ object proposal boxes as elements for matching and construct an architecture that simultaneously learns the appearance representation and geometric consistency. We propose new geometrical consistency scores tailored to the neural network’s architecture. Our model is trained on image pairs obtained from keypoints of a benchmark dataset and evaluated on several standard datasets, outperforming both recent deep learning architectures and previous methods based on hand-crafted features. We conclude the thesis by highlighting our contributions and suggesting possible future research directions.
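To make the hinge-to-square-loss substitution concrete, here is a minimal sketch of an exemplar classifier trained with the square loss, which reduces to a weighted ridge regression with a closed-form solution. The exact objective and weighting used in the thesis are not given in the abstract, so the label convention (+1 for the exemplar, -1 for negatives), the parameters `lam` and `pos_weight`, and the function name `square_loss_exemplar` are all assumptions.

```python
import numpy as np

def square_loss_exemplar(positive, negatives, lam=1.0, pos_weight=1.0):
    """Fit w, b for one positive exemplar against many negatives with the square loss.

    Minimises pos_weight * (w.x0 + b - 1)^2 + mean_i (w.x_i + b + 1)^2 + lam * |w|^2.
    """
    x0 = np.asarray(positive, dtype=float)
    neg = np.asarray(negatives, dtype=float)
    n, d = neg.shape
    # Stack the exemplar on top of the negatives and append a bias column.
    X = np.hstack([np.vstack([x0, neg]), np.ones((n + 1, 1))])
    y = np.concatenate([[1.0], -np.ones(n)])
    weights = np.concatenate([[pos_weight], np.full(n, 1.0 / n)])
    # Ridge penalty on w only; the bias term is left unregularised.
    reg = lam * np.eye(d + 1)
    reg[-1, -1] = 0.0
    # Normal equations of the weighted least-squares problem.
    A = X.T @ (weights[:, None] * X) + reg
    rhs = X.T @ (weights * y)
    sol = np.linalg.solve(A, rhs)
    return sol[:-1], sol[-1]

exemplar = np.array([2.0, 0.0])
negatives = np.array([[-1.0, 0.0], [-1.0, 1.0], [-1.0, -1.0]])
w, b = square_loss_exemplar(exemplar, negatives, lam=0.1)
pos_score = exemplar @ w + b
neg_scores = negatives @ w + b
```

In a formulation like this one, the contribution of the negatives to the normal equations does not depend on the exemplar, so it could be computed once and reused across exemplars, whereas a hinge-loss classifier needs an iterative solver for each one.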
|