1. The clash between two worlds in human action recognition: supervised feature training vs. Recurrent ConvNet. Raptis, Konstantinos, 28 November 2016.
Indiana University-Purdue University Indianapolis (IUPUI). Action recognition has been an active research topic for over three decades, with applications such as surveillance, human-computer interaction, and content-based retrieval. Recent research has focused on datasets drawn from movies, web videos, and TV shows. The nature of these datasets makes action recognition very challenging because of scene variability and complexity: background clutter, occlusions, viewpoint changes, fast irregular motion, and a large spatio-temporal search space (articulation configurations and motions). The use of local space-time image features shows promising results and avoids cumbersome, often inaccurate frame-by-frame segmentation (boundary estimation). We focus on two state-of-the-art methods for the action classification problem: dense trajectories and recurrent neural networks (RNNs). Dense trajectories use typical supervised training (e.g., with Support Vector Machines) of features such as 3D-SIFT, extended SURF, HOG3D, and local trinary patterns; the main idea is to densely sample these features in each frame and track them through the sequence based on optical flow. The deep neural network, on the other hand, uses the input frames to detect actions and produce part proposals, i.e., to estimate information on body parts (shapes and locations). We compare these two approaches, representative of what is used today, qualitatively and numerically, and describe our conclusions with respect to accuracy and efficiency.
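As a rough illustration of the dense-trajectory idea above, the following sketch samples points on a regular grid and tracks them with dense optical flow. It assumes BGR input frames and uses OpenCV's Farneback flow; the grid spacing and trajectory length are illustrative choices, not the parameters used in the thesis.

```python
import cv2
import numpy as np

def track_dense_points(frames, step=10, traj_len=15):
    """Track grid-sampled points through `frames` (BGR) with Farneback flow."""
    h, w = frames[0].shape[:2]
    # Densely sample starting points on a regular grid of (x, y) positions.
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    points = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    trajectories = [[p.copy()] for p in points]

    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:traj_len + 1]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for traj in trajectories:
            x, y = traj[-1]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < h and 0 <= xi < w:
                dx, dy = flow[yi, xi]
                # Follow the dense flow field from the current point.
                traj.append(np.float32([x + dx, y + dy]))
        prev = gray
    return trajectories
```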
2. Trajectory-based Descriptors for Action Recognition in Real-world Videos. Narayan, Sanath, January 2015.
This thesis explores motion-trajectory-based approaches to recognising human actions in real-world, unconstrained videos. Recognising actions is an important task in applications such as video retrieval, surveillance, human-robot interaction, analysis of sports videos, video summarisation, and behaviour monitoring, and a considerable amount of research has been done in this regard. Earlier work focused on videos captured by static cameras, where recognising actions is relatively easy. With more videos now captured by moving cameras, recognising actions under irregular camera motion remains a challenge in unconstrained settings with variations in scale, view, illumination, occlusion, and unrelated background motion. With the increase in videos captured by wearable or head-mounted cameras, recognising actions in egocentric videos is also explored in this thesis.
First, an effective motion segmentation method to identify the camera motion in videos captured by moving cameras is explored. Next, action recognition in videos captured from the normal third-person perspective is discussed. Then, action recognition approaches for first-person (egocentric) views are investigated; first-person videos often contain frequent unintended camera motion, since any motion of the head moves the head-mounted (wearable) camera. This is followed by recognition of actions in egocentric videos in a multi-camera setting. Lastly, novel feature encoding and subvolume sampling techniques (for "deep" approaches) are explored in the context of action recognition in videos.
The first part of the thesis explores two effective segmentation approaches to identify the motion due to the camera. The first approach fits curves to the motion trajectories and selects the model that best matches the camera motion. This curve-fitting approach works when the generated trajectories are smooth enough. To overcome this drawback and segment trajectories under non-smooth conditions, a second approach based on trajectory scoring and grouping is proposed. By identifying the instantaneous dominant background motion and accordingly aggregating the scores (denoting "foregroundness") along each trajectory, the motion associated with the camera can be separated from the motion due to foreground objects. Additionally, the segmentation result has been used to align videos from moving cameras, producing videos that appear to have been captured by nearly static cameras.
In the second part of the thesis, recognising actions in normal videos captured from third-person cameras is investigated. To this end, two kinds of descriptors are explored. The first is the covariance descriptor adapted for motion trajectories: the covariance descriptor of a trajectory encodes the co-variations of different features along the trajectory's length. Covariance, being a second-order encoding, captures information about the trajectory that a first-order encoding misses. The second descriptor is based on Granger causality. This novel causality descriptor encodes the "cause and effect" relationships between the motion trajectories of the actions. This type of interaction descriptor captures the causal inter-dependencies among the motion trajectories and encodes complementary information, different from descriptors based on the occurrence of features. Causal dependencies are traditionally computed on time-varying signals; we extend them to capture dependencies between spatio-temporal signals and compute generalised causality descriptors, which perform better than their traditional counterparts.
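A minimal sketch of a trajectory covariance descriptor in the spirit described above: build a per-point feature matrix along the trajectory (position and velocity here, an illustrative feature choice) and encode their co-variations as the upper triangle of the covariance matrix.

```python
import numpy as np

def covariance_descriptor(traj):
    """traj: (traj_len, 2) array of (x, y) positions along one trajectory."""
    vel = np.gradient(traj, axis=0)        # per-point velocity feature
    feats = np.hstack([traj, vel])         # (traj_len, 4) feature matrix
    cov = np.cov(feats, rowvar=False)      # second-order encoding of co-variations
    iu = np.triu_indices(cov.shape[0])
    return cov[iu]                         # vectorised upper triangle as descriptor
```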
An egocentric or first-person video is captured from the perspective of the person-of-interest (POI), who wears a camera and moves around doing his/her activities. The camera records the events and activities as seen by the POI, but never sees the POI performing them. Activities performed by the POI are called first-person actions, while third-person actions are those done by others and observed by the POI. The third part of the thesis explores action recognition in egocentric videos. Differentiating first-person and third-person actions is important when summarising or analysing the behaviour of the POI; thus, the goal is to recognise both the action and the perspective from which it is observed. Trajectory descriptors are adapted to recognise actions, with the motion-trajectory ranking method of segmentation used as a pre-processing step to identify the camera motion. This motion segmentation step is necessary to remove unintended head motion (camera motion) introduced during video capture. To recognise actions and their perspectives in a multi-camera setup, a novel inter-view causality descriptor based on the causal dependencies between trajectories in different views is explored. Since this is a new problem, two first-person datasets are created with eight actions in third-person and first-person perspectives. The first is a single-camera dataset with action instances from first-person and third-person views; the second is a multi-camera dataset in which each action instance has multiple first-person and third-person views.
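As a rough sketch of scoring causal dependencies between two trajectory signals (for instance, displacements of corresponding trajectories observed in two views), the snippet below uses the standard Granger causality test from statsmodels. It is a stand-in for the thesis's generalised inter-view causality descriptor, which it does not reproduce.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def causality_score(signal_a, signal_b, maxlag=3):
    """F-statistic for 'signal_b Granger-causes signal_a', best over tested lags."""
    data = np.column_stack([signal_a, signal_b])
    results = grangercausalitytests(data, maxlag=maxlag)
    # Keep the strongest F-statistic across lags 1..maxlag.
    return max(results[lag][0]['ssr_ftest'][0] for lag in results)
```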
In the final part of the thesis, a feature encoding scheme and a subvolume sampling scheme for recognising actions in videos are proposed. The proposed Hyper-Fisher Vector encoding embeds the Bag-of-Words encoding into the Fisher Vector encoding. The resulting encoding is simple, effective, and improves classification performance over state-of-the-art techniques; it can be used in place of the traditional Fisher Vector encoding in other recognition approaches. The proposed subvolume sampling scheme, used to generate second-layer features in "deep" approaches to action recognition, iteratively increases the size of the valid subvolumes in the temporal direction to generate new subvolumes. The proposed sampling requires fewer subvolumes to represent the actions well and is thus less computationally intensive than the original sampling scheme. The techniques are evaluated on large-scale, challenging, publicly available datasets. The Hyper-Fisher Vector combined with the proposed sampling scheme performs better than state-of-the-art techniques for action classification in videos.
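For context, here is a minimal sketch of the standard first-order Fisher Vector encoding that the Hyper-Fisher Vector builds on; the Bag-of-Words embedding step itself is not reproduced. The GMM is assumed to be fit with covariance_type='diag'.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_means(descriptors, gmm):
    """First-order Fisher Vector; `gmm` must be fit with covariance_type='diag'."""
    n, d = descriptors.shape
    q = gmm.predict_proba(descriptors)                 # (n, K) soft assignments
    fv = []
    for k in range(gmm.n_components):
        diff = (descriptors - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        grad = (q[:, k:k + 1] * diff).sum(axis=0) / (n * np.sqrt(gmm.weights_[k]))
        fv.append(grad)
    fv = np.concatenate(fv)
    # Power- and L2-normalisation, standard practice for Fisher Vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

In a typical pipeline the vocabulary would be trained once on pooled local descriptors, e.g. `GaussianMixture(n_components=64, covariance_type='diag').fit(train_descriptors)`, and the resulting per-video vectors fed to an SVM.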
3. A video descriptor using orientation tensors and shape-based trajectory clustering. Caetano, Felipe Andrade, 29 August 2014.
Dense trajectories have been shown to be a very promising method in the human action recognition area. Based on that, we propose a new kind of video descriptor, calculated from the relationship between the trajectory's optical flow, the gradient field in its neighborhood, and its spatio-temporal location. Orientation tensors are used to accumulate relevant information over the video, representing the tendency of direction for that kind of movement. Furthermore, a method to cluster trajectories using their shape as the metric is proposed. This allows us to accumulate different motion patterns in separate tensors and to distinguish more easily the trajectories created by real movements from those generated by camera movement. The proposed method achieves the best known recognition rates for methods under the self-descriptor constraint on popular datasets: Hollywood2 (up to 46%) and KTH (up to 94%).
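A minimal sketch of the orientation-tensor accumulation described above: each motion vector v contributes its rank-1 outer product v vᵀ, so the accumulated tensor captures the dominant direction tendencies of the vectors assigned to one trajectory cluster.

```python
import numpy as np

def accumulate_orientation_tensor(vectors):
    """vectors: (num_vectors, dim) motion/gradient vectors from one cluster."""
    # Sum of rank-1 outer products v v^T, computed as a single matrix product.
    tensor = vectors.T @ vectors
    # Normalise so tensors are comparable across videos of different lengths.
    return tensor / (np.linalg.norm(tensor) + 1e-12)
```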
4. A video self-descriptor based on sparse trajectory clustering. Figueiredo, Ana Mara de Oliveira, 10 September 2015.
Human action recognition is a challenging problem in computer vision with many potential applications. In order to describe the main movement of a video, a new motion descriptor is proposed in this work. We combine two methods for estimating the motion between frames: block matching and the brightness gradient of the image. A variable-size block matching algorithm is used to extract displacement vectors as motion information, and the cross product between each block matching vector and the gradient yields the final displacement vectors. These vectors are computed over a frame sequence, giving block trajectories that carry the temporal information. The block matching vectors are also used to cluster the sparse trajectories according to their shape. The proposed method combines this information into orientation tensors to generate the final descriptor. It is called a self-descriptor because it depends only on the input video. The global tensor descriptor is evaluated by classifying the KTH, UCF11, and Hollywood2 video datasets with a non-linear SVM classifier. Results indicate that our sparse-trajectory method is competitive with the well-known dense-trajectories approach using orientation tensors, while requiring less computational effort.
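As a rough sketch of the block-matching half of the descriptor, the snippet below estimates per-block displacement vectors by exhaustive SAD search over grayscale frames. The fixed block size and search radius are simplifications; the thesis uses a variable-size block matching scheme.

```python
import numpy as np

def block_matching(prev, curr, block=16, radius=4):
    """Return (H/block, W/block, 2) displacement vectors via SAD matching."""
    h, w = prev.shape
    out = np.zeros((h // block, w // block, 2), dtype=np.int32)
    for by in range(h // block):
        for bx in range(w // block):
            y0, x0 = by * block, bx * block
            ref = prev[y0:y0 + block, x0:x0 + block].astype(np.int32)
            best, best_dv = None, (0, 0)
            # Exhaustive search over all candidate offsets within the radius.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    if 0 <= y1 and y1 + block <= h and 0 <= x1 and x1 + block <= w:
                        cand = curr[y1:y1 + block, x1:x1 + block].astype(np.int32)
                        sad = np.abs(ref - cand).sum()   # sum of absolute differences
                        if best is None or sad < best:
                            best, best_dv = sad, (dx, dy)
            out[by, bx] = best_dv
    return out
```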