  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Video2Vec: Learning Semantic Spatio-Temporal Embedding for Video Representations

January 2016
abstract: High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that preserve high-level information in the videos. Many video feature extraction algorithms have been proposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as “handcrafted” features because they were deliberately designed based on some reasonable considerations. However, these algorithms may fail when dealing with high-level tasks or complex scene videos. Due to the success of using deep convolutional neural networks (CNNs) to extract global representations for static images, researchers have been using similar techniques to tackle video content. Typical techniques first extract spatial features by processing raw images using deep convolutional architectures designed for static image classification. Then simple averaging, concatenation, or classifier-based fusion/pooling methods are applied to the extracted features. I argue that features extracted in such ways do not capture enough representative information, since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual content and thus need to be represented in a manner that considers both semantic and spatio-temporal information. In this thesis, I propose a novel architecture to learn a semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders, which capture the longer-term temporal structure of the CNN features. 
The resultant spatio-temporal representation (a vector) is used to learn a mapping via a Fully Connected Multilayer Perceptron (FC-MLP) to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic video retrieval (word-to-video), using the UCF101 action recognition dataset. / Dissertation/Thesis / Masters Thesis Computer Science 2016
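
The abstract above maps a video vector into a word-embedding space so that an unseen class can be recognized by its name alone. The following is a minimal sketch of that last step only, assuming the video has already been mapped into the embedding space and that cosine similarity is used for matching (the abstract does not state the similarity measure). The three-dimensional vectors, the class names, and the function names `cosine` and `zero_shot_label` are hypothetical stand-ins, not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(video_vec, class_embeddings):
    """Return the class whose word embedding is closest to the video vector."""
    return max(class_embeddings,
               key=lambda name: cosine(video_vec, class_embeddings[name]))


# Hypothetical toy embeddings; real word2vec vectors have hundreds of dimensions.
class_embeddings = {
    "archery": [0.9, 0.1, 0.0],
    "bowling": [0.1, 0.9, 0.1],
    "surfing": [0.0, 0.2, 0.9],
}

# Assumed output of the video-to-embedding mapping for one clip.
video_vec = [0.2, 0.8, 0.3]

print(zero_shot_label(video_vec, class_embeddings))  # prints "bowling"
```

Because the class list is just a dictionary of word vectors, a class with no training videos can be added by inserting its embedding, which is what makes the zero-shot setting possible in this kind of design.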
2

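Segregação socioespacial e turismo: estudo da representação fílmica criada pelos turistas e residentes sobre Natal Rio Grande do Norte / Socio-spatial segregation and tourism: a study of the filmic representation created by tourists and residents of Natal, Rio Grande do Norte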

Silva, Michel Jairo Vieira da 09 June 2011
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / This study takes its starting point from a remark by Italo Calvino, who in his novel Invisible Cities advises against saying "that sometimes different cities follow one another on the same site and under the same name, born and dying without knowing one another, without communication among themselves". Transdisciplinary in character (drawing on sociology, anthropology, geography and communication), the research reflects on socio-spatial segregation and tourism (poverty and wealth, center and periphery, tradition and spectacularization, visitor and visited) by mapping the tourist circuit and discussing the phenomenon in the real city and the tourist city, Natal and the "Sun City", Rio Grande do Norte, through videos produced by residents (documentaries) and tourists (recordings posted on the Internet). A comparative analysis of the realities experienced by these two subjects (resident and tourist) found few similarities and many differences in their experience of the urban, revealing two quite distinct realities (the tourist region versus the peripheral region). Based on phenomenology and social representation theory, and using film content analysis, the study found that the tourist experience offers the visitor a trip that is segmented and detached from daily life, from local culture and from contact with residents. Residents, for the most part, live in underserved, deprived areas with no life prospects (represented by the Novo Horizonte Community). Confinement and segregation occur even in their moments of leisure and cultural expression (represented by Redinha Beach), because private and public tourist leisure areas indirectly prevent access by people who cannot contribute to the consumption that takes place in these areas. The study concludes that tourism in Natal, as an activity and phenomenon, directs and concentrates public infrastructure investment in Ponta Negra and its surroundings (the "tourist island"), to the detriment of the poorest and most peripheral areas of the city.
3

Modèles robustes et efficaces pour la reconnaissance d'action et leur localisation / Robust and efficient models for action recognition and localization

Oneata, Dan 20 July 2015
Video interpretation and understanding is one of the long-term research goals in computer vision. Realistic videos such as movies present a variety of challenging machine learning problems, such as action classification and retrieval, human tracking, and human/object interaction classification. Robust visual descriptors for video classification have recently been developed, and they have shown that it is possible to learn visual classifiers in realistic, difficult settings. However, to deploy visual recognition systems at large scale in practice, it becomes important to address the scalability of the techniques. The main goal of this thesis is to develop scalable methods for video content analysis (e.g., for ranking or classification).
4

Bayesian Nonparametric Modeling of Temporal Coherence for Entity-Driven Video Analytics

Mitra, Adway January 2015
In recent times there has been an explosion of online user-generated video content. This has generated significant research interest in video analytics. Human users understand videos based on high-level semantic concepts. However, most current research in video analytics is driven by low-level features and descriptors, which often lack semantic interpretation. Existing attempts at semantic video analytics are specialized and require additional resources such as movie scripts, which are not available for most user-generated videos. There are no general-purpose approaches to understanding videos through semantic concepts. In this thesis we attempt to bridge this gap. We view videos as collections of entities, which are semantic visual concepts such as the persons in a movie or the cars in an F1 race video. We focus on two fundamental tasks in Video Understanding, namely summarization and scene discovery. Entity-driven Video Summarization and Entity-driven Scene Discovery are important open problems. They are challenging due to the spatio-temporal nature of videos, and also due to the lack of a priori information about entities. We use Bayesian nonparametric methods to solve these problems. In the absence of external resources like scripts, we utilize fundamental structural properties of videos such as temporal coherence, which means that adjacent frames should contain the same set of entities and have similar visual features. There have been no focussed attempts to model this important property. This thesis makes several contributions in Computer Vision and Bayesian nonparametrics by addressing Entity-driven Video Understanding through temporal coherence modeling. Temporal coherence in videos is observed across frames at the level of features/descriptors, as well as at the semantic level. We start with an attempt to model temporal coherence at the level of features/descriptors. 
A tracklet is a spatio-temporal fragment of a video: a set of spatial regions in a short sequence (5-20) of consecutive frames, each of which encloses a particular entity. We attempt to find a representation of tracklets to aid tracking of entities. We explore region descriptors such as Covariance Matrices of spatial features in individual frames. Due to temporal coherence, such matrices from corresponding spatial regions in successive frames have nearly identical eigenvectors. We utilize this property to model a tracklet using a covariance matrix, and use it for region-based entity tracking. We propose a new method to estimate such a matrix. Our method is found to be much more efficient and effective than alternative covariance-based methods for entity tracking. Next, we move to modeling temporal coherence at a semantic level, with special emphasis on videos of movies and TV-series episodes. Each tracklet is associated with an entity (say, a particular person). Spatio-temporally close but non-overlapping tracklets are likely to belong to the same entity, while tracklets that overlap in time can never belong to the same entity. Our aim is to cluster the tracklets based on the entities associated with them, with the goal of discovering the entities in a video along with all their occurrences. We argue that Bayesian nonparametrics is the most convenient approach to this task. We propose a temporally coherent version of the Chinese Restaurant Process (TC-CRP) that can encode such constraints easily, yields pure clusters of tracklets, and also filters out tracklets resulting from false detections. TC-CRP shows excellent performance on person discovery from TV-series videos. We also discuss semantic video summarization based on entity discovery. Next, we consider entity-driven temporal segmentation of a video into scenes, where each scene is characterized by the entities present in it. 
This is a novel application, as existing work on temporal segmentation has focussed on low-level features of frames rather than entities. We propose EntScene, a generative model for videos based on entities and scenes, together with an inference algorithm based on Blocked Gibbs Sampling for simultaneous entity discovery and scene discovery. We compare it to alternative inference algorithms and show significant improvements in terms of segmentation and scene discovery. Video representation by a low-rank matrix has gained popularity recently, and has been used for various tasks in Computer Vision. In such a representation, each column corresponds to a frame or a single detection. Such matrices are likely to have contiguous sets of identical columns due to temporal coherence, and hence they should be low-rank. However, we discover that none of the existing low-rank matrix recovery algorithms are able to preserve such structures. We study regularizers that encourage these structures in low-rank matrix recovery through convex optimization, but note that TC-CRP-like Bayesian modeling is better for enforcing them. We then focus our attention on modeling temporal coherence in hierarchically grouped sequential data, such as word tokens grouped into sentences, paragraphs, documents, etc., in a text corpus. We attempt Bayesian modeling for such data, with application to multi-layer segmentation. We first make a detailed study of existing models for such data. We present a taxonomy for such models called Degree-of-Sharing (DoS), based on how various mixture components are shared by the groups of data in these models. We propose the Layered Dirichlet Process, which generalizes the Hierarchical Dirichlet Process to multiple layers and can also handle sequential information easily through a Markovian approach. This is applied to hierarchical co-segmentation of a set of news transcripts into broad categories (such as politics and sports) and individual stories. 
We also propose an explicit-duration (semi-Markov) approach for this purpose, and provide an efficient inference algorithm for it. We also discuss generative processes for distribution matrices, where each column is a probability distribution. For this we discuss an application: inferring the correct answers to questions on online answering forums from opinions provided by different users.
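
The abstract above states one hard constraint on tracklet clustering: tracklets that overlap in time can never belong to the same entity. The following is a minimal sketch of how such a constraint could be combined with the standard Chinese Restaurant Process prior, in which an item joins an existing cluster with weight equal to the cluster size and opens a new cluster with weight `alpha`. This is an illustrative assumption, not the TC-CRP defined in the thesis, whose details are not given in the abstract. The function names, the `(start_frame, end_frame)` tracklet format, and the toy values are all hypothetical.

```python
def overlaps(a, b):
    """True if two (start_frame, end_frame) intervals share at least one frame."""
    return a[0] <= b[1] and b[0] <= a[1]


def assignment_probs(tracklet, clusters, alpha=1.0):
    """CRP-style prior over cluster assignments for one new tracklet.

    Existing clusters are weighted by their size, a new cluster by alpha.
    Any cluster holding a tracklet that overlaps the new one in time gets
    weight zero, since one entity cannot appear twice in the same frame.
    The last entry of the returned list is the probability of a new cluster.
    """
    weights = []
    for members in clusters:
        if any(overlaps(tracklet, m) for m in members):
            weights.append(0.0)
        else:
            weights.append(float(len(members)))
    weights.append(alpha)
    total = sum(weights)
    return [w / total for w in weights]


# Two hypothetical entity clusters, each a list of tracklet intervals.
clusters = [[(0, 10), (30, 40)], [(5, 15)]]
new_tracklet = (12, 20)

probs = assignment_probs(new_tracklet, clusters, alpha=1.0)
print(probs)  # cluster 1 overlaps in time, so its probability is 0.0
```

In a full sampler these prior weights would be multiplied by a likelihood term comparing the tracklet's visual features with each cluster, which is omitted here to keep the sketch focused on the temporal constraint.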
