1 |
Video2Vec: Learning Semantic Spatio-Temporal Embedding for Video Representations. January 2016 (has links)
abstract: High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos.
Many video feature extraction algorithms have been proposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as "handcrafted" features because they were deliberately designed around specific considerations. However, they may fail on high-level tasks or complex-scene videos. Following the success of deep convolutional neural networks (CNNs) in extracting global representations for static images, researchers have applied similar techniques to video content. Typical techniques first extract spatial features by processing raw frames with deep convolutional architectures designed for static image classification. Simple averaging, concatenation, or classifier-based fusion/pooling methods are then applied to the extracted features. I argue that features extracted in this way do not capture enough representative information, since videos, unlike images, should be characterized as temporal sequences of semantically coherent visual content, and thus need to be represented in a manner that considers both semantic and spatio-temporal information.
In this thesis, I propose a novel architecture that learns a semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes spatial and temporal information separately, employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders that capture the longer-term temporal structure of the CNN features. The resulting spatio-temporal representation (a vector) is used to learn a mapping, via a Fully Connected Multilayer Perceptron (FC-MLP), into the word2vec semantic embedding space, yielding a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this representation through experiments on action recognition, zero-shot video classification, and semantic (word-to-video) video retrieval on the UCF101 action recognition dataset. / Dissertation/Thesis / Masters Thesis Computer Science 2016
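As a rough illustration of the pipeline described above, the sketch below runs untrained, randomly initialised GRU and MLP weights over simulated per-frame CNN features; all dimensions, weight scales, and class names are illustrative assumptions, not the configuration used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One step of a standard GRU cell (fully connected gates)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde

def encode_sequence(frames, dim_h):
    """Run a randomly initialised GRU over per-frame features and
    return the final hidden state as the channel's video vector."""
    dim_x = frames.shape[1]
    Ws = [rng.standard_normal((dim_h, dim_x)) * 0.1 for _ in range(3)]
    Us = [rng.standard_normal((dim_h, dim_h)) * 0.1 for _ in range(3)]
    h = np.zeros(dim_h)
    for x in frames:
        h = gru_step(h, x, Ws[0], Us[0], Ws[1], Us[1], Ws[2], Us[2])
    return h

# Hypothetical per-frame CNN features: 16 frames x 128-dim appearance
# features, and 16 frames x 128-dim local-motion features.
appearance = rng.standard_normal((16, 128))
motion = rng.standard_normal((16, 128))

# Separate GRU encoders for the two channels, then concatenation.
video_vec = np.concatenate([encode_sequence(appearance, 64),
                            encode_sequence(motion, 64)])

# FC-MLP mapping into a word2vec-sized (300-dim) semantic space.
W1 = rng.standard_normal((128, 256)) * 0.1
W2 = rng.standard_normal((256, 300)) * 0.1
semantic = np.tanh(video_vec @ W1) @ W2

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Zero-shot classification: nearest class word vector (hypothetical classes).
class_embeddings = {"basketball": rng.standard_normal(300),
                    "typing": rng.standard_normal(300)}
prediction = max(class_embeddings, key=lambda c: cosine(semantic, class_embeddings[c]))
```

In the trained system these weights would be learned end-to-end; the sketch only demonstrates the shape of the computation: two channels encoded by GRUs, concatenated, and mapped into an embedding space where nearest-neighbour search over class word vectors enables zero-shot classification.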
|
2 |
Towards Label Efficiency and Privacy Preservation in Video Understanding. Dave, Ishan Rajendrakumar. 01 January 2024 (has links) (PDF)
Video understanding involves tasks such as action recognition, video retrieval, and human pose propagation, which are essential for applications such as surveillance, surgical video analysis, sports analytics, and content recommendation. Progress in this domain has been driven largely by advances in deep learning, facilitated by large-scale labeled datasets. However, video annotation is time-consuming and expensive. This limitation underscores the importance of methods that learn effectively from unlabeled or sparsely labeled data, which makes self-supervised learning (SSL) and semi-supervised learning particularly relevant for video understanding. Another significant challenge in video understanding is privacy preservation, as methods often inadvertently leak private information, a growing concern in the field. In this dissertation, we present methods that improve the label efficiency of deep video models through self-supervised and semi-supervised learning, as well as a self-supervised method designed to mitigate privacy leakage in the action recognition task. Our first contribution is the Temporal Contrastive Learning framework for Video Representation (TCLR). Unlike prior contrastive self-supervised methods, which aim to learn temporal similarity between different clips of the same video, TCLR encourages learning the differences, rather than the similarities, between clips of the same video. TCLR consists of two novel losses that improve upon existing contrastive self-supervised video representations by contrasting temporal segments of the same video at two different temporal aggregation steps: the clip level and the temporal pooling level. Although TCLR offers an effective solution for video-level downstream tasks, it does not encourage a framewise video representation suitable for low-level, temporal-correspondence-based downstream tasks.
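The clip-level contrast described above can be sketched as an InfoNCE-style objective in which two views of the same temporal clip form the positive pair and other clips of the same video serve as negatives; this is a simplified single-video sketch under assumed feature shapes, not the exact TCLR loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_contrastive_loss(views_a, views_b, temperature=0.1):
    """InfoNCE-style loss over n_clips temporal segments of ONE video.
    views_a[i] and views_b[i] encode two views of the same temporal
    clip (the positive pair); clips at other temporal positions of the
    same video act as negatives, so the loss pushes temporally distinct
    clips of a video apart instead of pulling them together."""
    a = views_a / np.linalg.norm(views_a, axis=1, keepdims=True)
    b = views_b / np.linalg.norm(views_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature                 # (n_clips, n_clips)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # diagonal = positives

clips = rng.standard_normal((4, 16))   # 4 clips of one video, 16-dim features
aligned = temporal_contrastive_loss(clips, clips)             # matched views
mismatched = temporal_contrastive_loss(clips, np.roll(clips, 1, axis=0))
assert aligned < mismatched   # correct temporal correspondence scores lower
```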
To promote a more effective framewise video representation, we first eliminate learning shortcuts present in existing temporal pretext tasks by introducing framewise spatial jittering, and we propose more challenging frame-level temporal pretext tasks. Our approach, "No More Shortcuts" (NMS), achieves state-of-the-art performance across a wide range of downstream tasks, encompassing both high-level semantic and low-level temporal correspondence tasks. While the video SSL approaches TCLR and NMS learn only from unlabeled videos, in practice some labeled data often exists. Our next focus is therefore semi-supervised action recognition, where a small set of labeled videos is accompanied by a large pool of unlabeled ones. Drawing on our observations about self-supervised representations, we leverage the unlabeled videos through the complementary strengths of temporally-invariant and temporally-distinctive contrastive self-supervised video representations. Our proposed semi-supervised method, "TimeBalance", introduces a student-teacher framework that dynamically combines the knowledge of two self-supervised teachers according to the nature of each unlabeled video, using a proposed reweighting strategy. Although TimeBalance performs well on coarse-grained actions, it struggles with fine-grained ones. To address this, we propose the "FinePseudo" framework, which leverages temporal alignability to learn phase-aware distances. It also introduces collaborative pseudo-labeling between the video-level and alignability encoders, refining the pseudo-labeling process for fine-grained actions. Although the above video representations are useful for various downstream applications, they often leak a considerable amount of the private information present in the videos. To mitigate such privacy leaks, we propose SPAct, a self-supervised framework that removes private information from input videos without requiring privacy labels.
SPAct exhibits competitive performance compared to supervised methods and introduces new evaluation protocols to assess the generalization capability of the anonymization across novel action and privacy attributes. Overall, this dissertation contributes to the advancement of label-efficient and privacy-preserving video understanding by exploring novel self-supervised and semi-supervised learning approaches and their applications in privacy-preserving action recognition.
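The TimeBalance teacher-reweighting idea mentioned above can be sketched as a per-video convex combination of two teachers' class predictions; the weighting heuristic below (mean inter-clip cosine similarity) is a hypothetical stand-in for the paper's actual reweighting strategy, chosen only to illustrate the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def combined_pseudo_label(logits_inv, logits_dis, w_inv):
    """Blend the predictions of a temporally-invariant teacher and a
    temporally-distinctive teacher with a per-video weight w_inv in [0, 1]."""
    p = w_inv * softmax(logits_inv) + (1 - w_inv) * softmax(logits_dis)
    return p / p.sum()

def invariance_weight(clip_feats):
    """Hypothetical reweighting: videos whose clips look alike lean on the
    invariant teacher; videos with strong temporal variation lean on the
    distinctive one.  Uses mean off-diagonal cosine similarity of clips."""
    f = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    sim = f @ f.T
    n = len(f)
    mean_off_diag = (sim.sum() - n) / (n * (n - 1))
    return (mean_off_diag + 1) / 2        # map [-1, 1] -> [0, 1]

static_clips = np.tile(rng.standard_normal(8), (4, 1))   # near-identical clips
dynamic_clips = rng.standard_normal((4, 8))              # varied clips
assert invariance_weight(static_clips) > invariance_weight(dynamic_clips)

# Blended pseudo-label for the static video (3 hypothetical classes).
p = combined_pseudo_label(np.array([2.0, 0.0, 0.0]),
                          np.array([0.0, 1.0, 0.0]),
                          invariance_weight(static_clips))
assert np.isclose(p.sum(), 1.0)
```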
|
3 |
Segregação socioespacial e turismo: estudo da representação fílmica criada pelos turistas e residentes sobre Natal, Rio Grande do Norte (Socio-spatial segregation and tourism: a study of the filmic representation of Natal, Rio Grande do Norte, created by tourists and residents). Silva, Michel Jairo Vieira da. 09 June 2011 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / This study takes its starting point from a remark by Italo Calvino, who in his novel Invisible Cities advises against saying that "sometimes different cities follow one another on the same site and under the same name, born and dying without knowing each other, without communication among themselves". Through a transdisciplinary approach (drawing on sociology, anthropology, geography, and communication), the research reflects on socio-spatial segregation and tourism through the oppositions poverty-wealth, center-periphery, tradition-spectacularization, and visitor-visited, mapping the tourist circuit and discussing the phenomenon in the real and the touristic city: Natal, the "Sun City" of Rio Grande do Norte. It analyzes videos produced by residents (documentaries) and by tourists (recordings posted on the Internet). A comparative analysis of the realities experienced by these two subjects (resident and tourist) found few similarities and many disparities in urban experience, revealing two quite distinct realities (the tourist region versus the peripheral region). Grounded in phenomenology and social representation theory, and using filmic content analysis, the study shows that the tourist experience offers the visitor a trip segmented and disconnected from the daily life and culture of residents, and from contact with them. Residents, for the most part, live in underserved areas with little prospect in life (represented by the Novo Horizonte community). Their confinement and segregation persist even in moments of leisure and cultural expression (represented by Redinha Beach), since the private and public leisure areas of tourism indirectly bar access by those who cannot contribute to consumption in these places. The study concludes that tourism in Natal is an activity-phenomenon that directs and concentrates public infrastructure investment in the tourist region (Ponta Negra Beach and its surroundings), to the detriment of the poorest and peripheral areas of the city.
|
4 |
Modèles robustes et efficaces pour la reconnaissance d'action et leur localisation / Robust and efficient models for action recognition and localization. Oneata, Dan. 20 July 2015 (has links)
Video interpretation and understanding is one of the long-term research goals in computer vision. Realistic videos such as movies present a variety of challenging machine learning problems, such as action classification and retrieval, human tracking, and human/object interaction classification. Robust visual descriptors for video classification have recently been developed and have shown that it is possible to learn visual classifiers in realistic, difficult settings. However, in order to deploy visual recognition systems at large scale in practice, it becomes important to address the scalability of these techniques. The main goal of this thesis is to develop scalable methods for video content analysis (e.g., for ranking or classification).
|
5 |
Bayesian Nonparametric Modeling of Temporal Coherence for Entity-Driven Video Analytics. Mitra, Adway. January 2015 (has links) (PDF)
In recent times there has been an explosion of online user-generated video content, which has generated significant research interest in video analytics. Human users understand videos in terms of high-level semantic concepts. However, most current research in video analytics is driven by low-level features and descriptors, which often lack semantic interpretation. Existing attempts at semantic video analytics are specialized and require additional resources, such as movie scripts, which are unavailable for most user-generated videos. There are no general-purpose approaches to understanding videos through semantic concepts.
In this thesis we attempt to bridge this gap. We view videos as collections of entities, which are semantic visual concepts such as the persons in a movie or the cars in an F1 race video. We focus on two fundamental tasks in video understanding, namely summarization and scene discovery. Entity-driven video summarization and entity-driven scene discovery are important open problems. They are challenging due to the spatio-temporal nature of videos and the lack of a priori information about entities. We use Bayesian nonparametric methods to solve these problems. In the absence of external resources like scripts, we exploit a fundamental structural property of videos, temporal coherence (TC): adjacent frames should contain the same set of entities and have similar visual features. There have been no focused attempts to model this important property. This thesis makes several contributions to computer vision and Bayesian nonparametrics by addressing entity-driven video understanding through temporal coherence modeling.
Temporal coherence in videos is observed across frames both at the level of features/descriptors and at the semantic level. We start by modeling TC at the level of features/descriptors. A tracklet is a spatio-temporal fragment of a video: a set of spatial regions in a short sequence (5-20 frames) of consecutive frames, each of which encloses a particular entity. We seek a representation of tracklets that aids entity tracking. We explore region descriptors such as covariance matrices of spatial features in individual frames. Due to temporal coherence, such matrices from corresponding spatial regions in successive frames have nearly identical eigenvectors. We utilize this property to model a tracklet by a single covariance matrix and use it for region-based entity tracking, proposing a new method to estimate this matrix. Our method proves much more efficient and effective than alternative covariance-based methods for entity tracking.
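A minimal sketch of the covariance-descriptor idea, assuming per-pixel feature vectors for each region; the log-Euclidean metric used here is one common way to compare covariance (SPD) matrices and is not necessarily the estimator proposed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def region_covariance(region):
    """Covariance descriptor of a spatial region.
    region: (n_pixels, n_features) array of per-pixel features
    (e.g. x, y, intensity, gradients).  Returns (n_features, n_features)."""
    centered = region - region.mean(axis=0)
    return centered.T @ centered / (len(region) - 1)

def log_euclidean_distance(C1, C2, eps=1e-6):
    """Compare SPD covariance matrices with the log-Euclidean metric."""
    def logm_spd(C):
        w, V = np.linalg.eigh(C + eps * np.eye(len(C)))
        return (V * np.log(w)) @ V.T        # V diag(log w) V^T
    return np.linalg.norm(logm_spd(C1) - logm_spd(C2), 'fro')

# Temporal coherence: features of the same entity in consecutive frames
# are nearly identical, so their covariance descriptors stay close.
frame_t = rng.standard_normal((200, 5))
frame_t1 = frame_t + 0.01 * rng.standard_normal((200, 5))  # next frame
other = 3.0 * rng.standard_normal((200, 5))                # different region

d_same = log_euclidean_distance(region_covariance(frame_t),
                                region_covariance(frame_t1))
d_diff = log_euclidean_distance(region_covariance(frame_t),
                                region_covariance(other))
assert d_same < d_diff   # the descriptor tracks the same entity across frames
```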
Next, we model temporal coherence at the semantic level, with special emphasis on videos of movies and TV-series episodes. Each tracklet is associated with an entity (say, a particular person). Spatio-temporally close but non-overlapping tracklets are likely to belong to the same entity, while tracklets that overlap in time can never belong to the same entity. Our aim is to cluster the tracklets according to their associated entities, with the goal of discovering the entities in a video along with all their occurrences. We argue that Bayesian nonparametrics is the most convenient framework for this task. We propose a temporally coherent version of the Chinese Restaurant Process (TC-CRP) that encodes such constraints easily, discovers pure clusters of tracklets, and filters out tracklets resulting from false detections. TC-CRP shows excellent performance on person discovery from TV-series videos. We also discuss semantic video summarization based on entity discovery.
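The constraint structure of TC-CRP can be sketched with a greedy MAP version of constrained CRP seating; the Gaussian likelihood, hyperparameters, and greedy (rather than sampled) assignments below are simplifying assumptions, not the thesis's inference procedure.

```python
import numpy as np

def tc_crp_assign(tracklets, alpha=1.0, sigma=1.0):
    """Greedy MAP seating in a temporally constrained Chinese Restaurant
    Process.  Each tracklet is (start_frame, end_frame, feature).  A
    tracklet may not join a cluster that already contains a temporally
    overlapping tracklet (the same entity cannot appear twice in one
    frame); otherwise it weighs cluster popularity (the CRP prior)
    against feature similarity (a Gaussian likelihood)."""
    clusters, labels = [], []
    for i, (s, e, x) in enumerate(tracklets):
        scores = []
        for c in clusters:
            overlap = any(not (e < tracklets[j][0] or tracklets[j][1] < s)
                          for j in c['members'])
            if overlap:                       # hard must-not-link constraint
                scores.append(-np.inf)
                continue
            mean = c['sum'] / c['n']
            loglik = -np.sum((x - mean) ** 2) / (2 * sigma ** 2)
            scores.append(np.log(c['n']) + loglik)
        # score for opening a new cluster (zero-mean prior predictive)
        scores.append(np.log(alpha) - np.sum(x ** 2) / (2 * (sigma ** 2 + 1)))
        k = int(np.argmax(scores))
        if k == len(clusters):
            clusters.append({'members': [i], 'sum': x.copy(), 'n': 1})
        else:
            clusters[k]['members'].append(i)
            clusters[k]['sum'] = clusters[k]['sum'] + x
            clusters[k]['n'] += 1
        labels.append(k)
    return labels

# Two entities A and B; tracklets 0 and 1 overlap in time, as do 2 and 3,
# so each overlapping pair is forced into different clusters.
A, B = np.array([5.0, 0.0]), np.array([0.0, 5.0])
tracklets = [(0, 10, A), (5, 15, B), (20, 30, A), (20, 30, B)]
labels = tc_crp_assign(tracklets)
print(labels)   # → [0, 1, 0, 1]: two clusters, one per entity
```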
Next, we consider entity-driven temporal segmentation of a video into scenes, where each scene is characterized by the entities present in it. This is a novel application, as existing work on temporal segmentation has focused on low-level frame features rather than entities. We propose EntScene, a generative model for videos based on entities and scenes, and an inference algorithm based on blocked Gibbs sampling for simultaneous entity discovery and scene discovery. Compared to alternative inference algorithms, it achieves significant improvements in segmentation and scene discovery.
Representing videos by low-rank matrices has recently gained popularity and has been used for various tasks in computer vision. In such a representation, each column corresponds to a frame or to a single detection. Owing to temporal coherence, such matrices are likely to have contiguous sets of identical columns, and hence should be low-rank. However, we find that none of the existing low-rank matrix recovery algorithms preserve such structures. We study regularizers that encourage these structures in low-rank matrix recovery by convex optimization, but note that TC-CRP-like Bayesian modeling is better at enforcing them.
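The observation that temporal coherence yields contiguous runs of identical columns, and hence a low-rank matrix, can be checked directly; this demo only illustrates the structure, not any of the recovery algorithms discussed.

```python
import numpy as np

rng = np.random.default_rng(3)

# A "video" matrix with one 20-dim descriptor per column, arranged as
# contiguous runs of identical columns (frames of a static shot):
# three distinct shots of lengths 6, 5 and 4 frames (hypothetical sizes).
shots = [rng.standard_normal(20) for _ in range(3)]
frames = np.column_stack([shots[0]] * 6 + [shots[1]] * 5 + [shots[2]] * 4)

# Temporal coherence makes the matrix rank 3: one direction per shot.
rank = np.linalg.matrix_rank(frames)
assert rank == 3

# With small noise, the singular spectrum still reveals the structure:
# a large gap separates the three "shot" directions from the noise floor.
noisy = frames + 0.01 * rng.standard_normal(frames.shape)
s = np.linalg.svd(noisy, compute_uv=False)
assert s[2] / s[3] > 10
```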
We then focus on modeling temporal coherence in hierarchically grouped sequential data, such as word tokens grouped into sentences, paragraphs, and documents in a text corpus, with application to multi-layer segmentation. We first make a detailed study of existing models for such data and present a taxonomy for them, called Degree-of-Sharing (DoS), based on how various mixture components are shared by the groups of data. We propose the Layered Dirichlet Process, which generalizes the Hierarchical Dirichlet Process to multiple layers and can also handle sequential information easily through a Markovian approach. This is applied to hierarchical co-segmentation of a set of news transcripts into broad categories (such as politics and sports) and individual stories. We also propose an explicit-duration (semi-Markov) approach for this purpose, together with an efficient inference algorithm. Finally, we discuss generative processes for distribution matrices, in which each column is a probability distribution, with an application: inferring the correct answers to questions on online answering forums from opinions provided by different users.
|