Global ETD Search

1	Trajectory-based Descriptors for Action Recognition in Real-world Videos Narayan, Sanath January 2015 This thesis explores motion trajectory-based approaches to recognising human actions in real-world, unconstrained videos. Recognising actions is an important task in applications such as video retrieval, surveillance, human-robot interaction, sports video analysis, video summarisation and behaviour monitoring. A considerable amount of research has been done in this area. Earlier work focused on videos captured by static cameras, where recognising actions was relatively easy. With more videos being captured by moving cameras, recognising actions under irregular camera motion remains a challenge in unconstrained settings with variations in scale, view, illumination, occlusion and unrelated background motion. With the increase in videos captured from wearable or head-mounted cameras, recognising actions in egocentric videos is also explored in this thesis. First, an effective motion segmentation method to identify the camera motion in videos captured by moving cameras is explored. Next, action recognition in videos captured from the normal third-person view (perspective) is discussed. Then, action recognition approaches for first-person (egocentric) views are investigated. First-person videos often contain frequent unintended camera motion, because the motion of the head moves the head-mounted (wearable) camera. This is followed by recognition of actions in egocentric videos in a multi-camera setting. Lastly, novel feature encoding and subvolume sampling (for “deep” approaches) techniques are explored in the context of action recognition in videos. The first part of the thesis explores two effective segmentation approaches to identify the motion due to the camera. 
The first approach is based on curve fitting of the motion trajectories and finding the model that best fits the camera motion. The curve fitting approach works only when the generated trajectories are smooth enough. To overcome this drawback and segment trajectories under non-smooth conditions, a second approach based on trajectory scoring and grouping is proposed. By identifying the instantaneous dominant background motion and accordingly aggregating the scores (denoting the “foregroundness”) along the trajectory, the motion associated with the camera can be separated from the motion due to foreground objects. Additionally, the segmentation result is used to align videos from moving cameras, resulting in videos that appear to be captured by nearly static cameras. In the second part of the thesis, recognising actions in normal videos captured from third-person cameras is investigated. To this end, two kinds of descriptors are explored. The first is the covariance descriptor adapted for motion trajectories. The covariance descriptor for a trajectory encodes the co-variations of different features along the trajectory’s length. Covariance, being a second-order encoding, captures information about the trajectory that differs from that of a first-order encoding. The second descriptor is based on Granger causality. The novel causality descriptor encodes the “cause and effect” relationships between the motion trajectories of the actions. This type of interaction descriptor captures the causal inter-dependencies among the motion trajectories and encodes complementary information, different from that of descriptors based on the occurrence of features. Causal dependencies are traditionally computed on time-varying signals. We extend them to capture dependencies between spatiotemporal signals and compute generalised causality descriptors, which perform better than their traditional counterparts. 
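The two trajectory descriptors described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the choice of per-point features, the autoregressive order and the log-variance-ratio form of the Granger measure are all assumptions made for illustration, and the sketch covers only the traditional time-series form, not the generalised spatiotemporal one.

```python
import numpy as np


def covariance_descriptor(features):
    """Covariance descriptor of one trajectory (hypothetical helper).

    features: (T, d) array with d >= 2, one feature vector per trajectory point.
    Returns the upper triangle of the d x d covariance matrix as a flat vector.
    """
    cov = np.cov(features, rowvar=False)          # (d, d) second-order statistics
    rows, cols = np.triu_indices(cov.shape[0])    # symmetric, so keep one triangle
    return cov[rows, cols]


def _lagged(signal, order):
    """Columns signal[t-1], ..., signal[t-order] for t = order, ..., T-1."""
    T = len(signal)
    return np.column_stack([signal[order - k:T - k] for k in range(1, order + 1)])


def granger_causality(x, y, order=2):
    """Granger measure from x to y: log ratio of residual variances.

    Compares a linear autoregressive model of y on its own past with one
    that also uses the past of x. Larger values mean x helps predict y.
    """
    target = y[order:]
    ones = np.ones((len(target), 1))
    restricted = np.hstack([ones, _lagged(y, order)])     # past of y only
    full = np.hstack([restricted, _lagged(x, order)])     # past of y and x
    res_r = target - restricted @ np.linalg.lstsq(restricted, target, rcond=None)[0]
    res_f = target - full @ np.linalg.lstsq(full, target, rcond=None)[0]
    return float(np.log(np.var(res_r) / np.var(res_f)))


# Usage on synthetic signals where x drives y with a one-step delay.
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.standard_normal()
print(granger_causality(x, y), granger_causality(y, x))
```

In this toy setting the measure from x to y should come out much larger than the measure from y to x, which is the asymmetry a causality descriptor relies on.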
An egocentric or first-person video is captured from the perspective of the person-of-interest (POI). The POI wears a camera while going about his/her activities, and the camera records the events and activities as seen by him/her. The POI performing the actions or activities is therefore not seen by the camera he/she wears. Activities performed by the POI are called first-person actions, while third-person actions are those done by others and observed by the POI. The third part of the thesis explores action recognition in egocentric videos. Differentiating first-person and third-person actions is important when summarising or analysing the behaviour of the POI. Thus, the goal is to recognise both the action and the perspective from which it is observed. Trajectory descriptors are adapted to recognise actions, with the motion trajectory ranking method of segmentation used as a pre-processing step to identify the camera motion. The motion segmentation step is necessary to remove unintended head motion (camera motion) during video capture. To recognise actions and the corresponding perspectives in a multi-camera setup, a novel inter-view causality descriptor based on the causal dependencies between trajectories in different views is explored. Since this is a new problem, two first-person datasets are created with eight actions in third-person and first-person perspectives. The first is a single-camera dataset with action instances from first-person and third-person views. The second is a multi-camera dataset in which each action instance has multiple first-person and third-person views. In the final part of the thesis, a feature encoding scheme and a subvolume sampling scheme for recognising actions in videos are proposed. The proposed Hyper-Fisher Vector feature encoding is based on embedding the Bag-of-Words encoding into the Fisher Vector encoding. 
The resulting encoding is simple, effective and improves the classification performance over the state-of-the-art techniques. This encoding can be used in place of the traditional Fisher Vector encoding in other recognition approaches. The proposed subvolume sampling scheme, used to generate second-layer features in “deep” approaches for action recognition in videos, is based on iteratively increasing the size of the valid subvolumes in the temporal direction to generate newer subvolumes. The proposed sampling requires fewer subvolumes to “better represent” the actions and is thus less computationally intensive than the original sampling scheme. The techniques are evaluated on large-scale, challenging, publicly available datasets. The Hyper-Fisher Vector combined with the proposed sampling scheme performs better than the state-of-the-art techniques for action classification in videos.
Trajectory-based Descriptors; Action Recognition; Egocentric Videos; Covariance Descriptors; Hyper-Fisher Vector; Dense Trajectories; Fisher Vectors; Fisher Kernel; Fisher Vector (FV) Coding Method; Motion Segmentation Method; Electrical Engineering
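For reference, the traditional Fisher Vector encoding that the abstract above uses as its baseline can be sketched as follows. This is not the Hyper-Fisher Vector, whose details the abstract does not give. The sketch assumes a diagonal-covariance Gaussian mixture model fitted beforehand, keeps only the gradients with respect to the component means, and omits the power and L2 normalisation often applied afterwards. The function name and argument layout are hypothetical.

```python
import numpy as np


def fisher_vector_means(X, weights, means, variances):
    """Fisher Vector of local descriptors X, mean gradients only.

    X: (N, D) descriptors. weights: (K,) mixture weights.
    means, variances: (K, D) parameters of a diagonal GMM (assumed pre-fitted).
    Returns a vector of length K * D.
    """
    diff = X[:, None, :] - means[None, :, :]                     # (N, K, D)
    # Log-density of every descriptor under every diagonal Gaussian.
    log_prob = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_prob                        # (N, K)
    log_post -= log_post.max(axis=1, keepdims=True)              # numerical stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                    # soft assignments
    # Soft-assignment-weighted, variance-normalised deviations from each mean.
    grad = np.einsum("nk,nkd->kd", gamma, diff / np.sqrt(variances))
    grad /= X.shape[0] * np.sqrt(weights)[:, None]
    return grad.ravel()


# Usage with a hand-set two-component GMM (illustrative values only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
fv = fisher_vector_means(
    X,
    weights=np.array([0.4, 0.6]),
    means=np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
    variances=np.ones((2, 3)),
)
print(fv.shape)
```

Unlike a Bag-of-Words histogram, which only counts assignments to each codeword, this encoding records how the descriptors deviate from each mixture component, so its length grows with both the number of components and the descriptor dimension.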
2	Generative models : from data generation to representation learning Zhang, Ruixiang 08 1900 Generative modeling is a rapidly advancing field in machine learning, with models demonstrating impressive capabilities for high-dimensional data synthesis across modalities including images, text, and audio. However, significant challenges remain in enhancing sample quality and model controllability, as well as in developing more principled and effective methods for learning structured feature representations with generative models. This dissertation conducts a comprehensive two-part investigation into the frontiers of generative modeling, with a focus on improving sample quality and steerability and on learning high-quality latent representations. 
The first part of the dissertation proposes novel techniques to boost sample quality and enable fine-grained control of generative models. First, a new perspective is introduced that reformulates pretrained generative adversarial networks as energy-based models, enabling more effective sampling that leverages both the generator and the discriminator. Second, an information-theoretic framework is developed to incorporate explicit inductive biases into latent variable models through Bayesian networks and multivariate information bottleneck theory. This provides a unified view for learning structured representations tailored to different applications such as multi-modal modeling and algorithmic fairness. The second part of the dissertation focuses on learning and extracting high-quality features from generative models in a fully unsupervised manner. First, an energy-based approach is presented for unsupervised learning of object-centric scene representations with permutation invariance. The compositionality of the energy function also enables controllable scene manipulation. Second, neural Fisher kernels are proposed to extract compact and useful representations from pretrained generative models. It is shown that low-rank approximations of the Fisher kernel provide a unified representation extraction technique that is competitive with common baselines. Together, the contributions advance generative modeling and representation learning along complementary fronts. They improve sample quality and steerability through new training objectives and inference techniques. They also enable extracting structured latent features from generative models using information-theoretic and neural kernel perspectives. The thesis provides a comprehensive investigation into the interconnected challenges of data synthesis and representation learning for modern generative models. 
Generative models; Representation Learning; Energy-based models; Generative adversarial networks; Variational auto-encoders; Unsupervised learning; Object-centric learning; Scene-understanding; MCMC sampling; Bayesian networks; Variational inference; Fisher kernel
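The idea of treating a pretrained GAN as an energy-based model and sampling with both networks, mentioned in the abstract above, can be illustrated with a toy sketch. The abstract does not state the energy function, so the form assumed here, E(z) = 0.5 * ||z||^2 - d(G(z)) with a standard normal latent prior and d the discriminator logit, is an assumption, as are the Langevin sampler, the step size and the linear stand-ins for the generator and discriminator (chosen so the gradient can be written by hand).

```python
import numpy as np

# Hypothetical toy stand-ins for a pretrained GAN (not from the thesis).
A = np.array([[1.0, 0.5], [0.0, 1.0], [0.3, -0.2]])   # generator G(z) = A z + b
b = np.array([0.1, -0.1, 0.0])
w = np.array([0.5, -0.3, 0.2])                         # discriminator logit d(x) = w . x


def energy_grad(z):
    """Gradient of E(z) = 0.5 * ||z||^2 - d(G(z)) for the linear toy models.

    The constant offset b drops out of the gradient, which is z - A^T w.
    """
    return z - A.T @ w


def langevin_sample(n_chains=2000, n_steps=200, step=0.1, seed=0):
    """Run unadjusted Langevin dynamics in latent space, then decode."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_chains, A.shape[1]))      # initialise from the prior
    for _ in range(n_steps):
        noise = rng.standard_normal(z.shape)
        z = z - 0.5 * step * energy_grad(z) + np.sqrt(step) * noise
    return z, z @ A.T + b                                # latents and generated samples


z, x = langevin_sample()
print(z.mean(axis=0), A.T @ w)
```

In this toy case the discriminator term shifts the latent distribution away from the prior mean towards A^T w, the direction the discriminator scores as more realistic. With real networks the gradient would come from automatic differentiation rather than a closed form.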
