1. The clash between two worlds in human action recognition: supervised feature training vs. Recurrent ConvNet. Raptis, Konstantinos, 28 November 2016.
Indiana University-Purdue University Indianapolis (IUPUI). Action recognition has been an active research topic for over three decades, with applications such as surveillance, human-computer interaction, and content-based retrieval. Recent research has focused on datasets drawn from movies, web videos, and TV shows. The nature of these datasets makes action recognition very challenging because of scene variability and complexity: background clutter, occlusions, viewpoint changes, fast irregular motion, and a large spatio-temporal search space (articulation configurations and motions). The use of local space-time image features shows promising results and avoids cumbersome, often inaccurate frame-by-frame segmentation (boundary estimation). We focus on two state-of-the-art methods for the action classification problem: dense trajectories and recurrent neural networks (RNNs). Dense trajectories use typical supervised training (e.g., with Support Vector Machines) of features such as 3D-SIFT, extended SURF, HOG3D, and local trinary patterns; the main idea is to densely sample these features in each frame and track them through the sequence based on optical flow. The deep neural network, on the other hand, uses the input frames to detect actions and produce part proposals, i.e., to estimate information on body parts (shapes and locations). We compare these two approaches, representative of what is used today, qualitatively and numerically, and describe our conclusions with respect to accuracy and efficiency.
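As a rough illustration of the dense-trajectory idea above, the following sketch samples points on a regular grid and tracks them with dense optical flow. It assumes BGR input frames and uses OpenCV's Farneback flow; the grid spacing and trajectory length are illustrative choices, not the parameters used in the thesis.

```python
import cv2
import numpy as np

def track_dense_points(frames, step=10, traj_len=15):
    """Track grid-sampled points through `frames` (BGR) with Farneback flow."""
    h, w = frames[0].shape[:2]
    # Densely sample starting points on a regular grid of (x, y) positions.
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    points = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    trajectories = [[p.copy()] for p in points]

    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:traj_len + 1]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for traj in trajectories:
            x, y = traj[-1]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < h and 0 <= xi < w:
                dx, dy = flow[yi, xi]
                # Follow the dense flow field from the current point.
                traj.append(np.float32([x + dx, y + dy]))
        prev = gray
    return trajectories
```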
2. Trajectory-based Descriptors for Action Recognition in Real-world Videos. Narayan, Sanath, January 2015.
This thesis explores motion-trajectory-based approaches to recognising human actions in real-world, unconstrained videos. Recognising actions is an important task in applications such as video retrieval, surveillance, human-robot interaction, analysis of sports videos, video summarisation, and behaviour monitoring, and a considerable amount of research has been done in this regard. Earlier work focused on videos captured by static cameras, where recognising actions is relatively easy. With more videos now captured by moving cameras, recognising actions under irregular camera motion remains a challenge in unconstrained settings with variations in scale, view, illumination, occlusion, and unrelated background motion. With the increase in videos captured by wearable or head-mounted cameras, recognising actions in egocentric videos is also explored in this thesis.
First, an effective motion segmentation method to identify the camera motion in videos captured by moving cameras is explored. Next, action recognition in videos captured from the normal third-person perspective is discussed. Then, action recognition approaches for first-person (egocentric) views are investigated; first-person videos often contain frequent unintended camera motion, since any motion of the head moves the head-mounted (wearable) camera. This is followed by recognition of actions in egocentric videos in a multi-camera setting. Lastly, novel feature encoding and subvolume sampling techniques (for "deep" approaches) are explored in the context of action recognition in videos.
The first part of the thesis explores two effective segmentation approaches to identify the motion due to the camera. The first approach fits curves to the motion trajectories and selects the model that best matches the camera motion. This curve-fitting approach works when the generated trajectories are smooth enough. To overcome this drawback and segment trajectories under non-smooth conditions, a second approach based on trajectory scoring and grouping is proposed. By identifying the instantaneous dominant background motion and accordingly aggregating the scores (denoting "foregroundness") along each trajectory, the motion associated with the camera can be separated from the motion due to foreground objects. Additionally, the segmentation result has been used to align videos from moving cameras, producing videos that appear to have been captured by nearly static cameras.
In the second part of the thesis, recognising actions in normal videos captured from third-person cameras is investigated. To this end, two kinds of descriptors are explored. The first is the covariance descriptor adapted for motion trajectories: the covariance descriptor of a trajectory encodes the co-variations of different features along the trajectory's length. Covariance, being a second-order encoding, captures information about the trajectory that a first-order encoding misses. The second descriptor is based on Granger causality. This novel causality descriptor encodes the "cause and effect" relationships between the motion trajectories of the actions. This type of interaction descriptor captures the causal inter-dependencies among the motion trajectories and encodes complementary information, different from descriptors based on the occurrence of features. Causal dependencies are traditionally computed on time-varying signals; we extend them to capture dependencies between spatio-temporal signals and compute generalised causality descriptors, which perform better than their traditional counterparts.
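A minimal sketch of a trajectory covariance descriptor in the spirit described above: build a per-point feature matrix along the trajectory (position and velocity here, an illustrative feature choice) and encode their co-variations as the upper triangle of the covariance matrix.

```python
import numpy as np

def covariance_descriptor(traj):
    """traj: (traj_len, 2) array of (x, y) positions along one trajectory."""
    vel = np.gradient(traj, axis=0)        # per-point velocity feature
    feats = np.hstack([traj, vel])         # (traj_len, 4) feature matrix
    cov = np.cov(feats, rowvar=False)      # second-order encoding of co-variations
    iu = np.triu_indices(cov.shape[0])
    return cov[iu]                         # vectorised upper triangle as descriptor
```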
An egocentric or first-person video is captured from the perspective of the person-of-interest (POI), who wears a camera and moves around doing his/her activities. The camera records the events and activities as seen by the POI, but never sees the POI performing them. Activities performed by the POI are called first-person actions, while third-person actions are those done by others and observed by the POI. The third part of the thesis explores action recognition in egocentric videos. Differentiating first-person and third-person actions is important when summarising or analysing the behaviour of the POI; thus, the goal is to recognise both the action and the perspective from which it is observed. Trajectory descriptors are adapted to recognise actions, with the motion-trajectory ranking method of segmentation used as a pre-processing step to identify the camera motion. This motion segmentation step is necessary to remove unintended head motion (camera motion) introduced during video capture. To recognise actions and their perspectives in a multi-camera setup, a novel inter-view causality descriptor based on the causal dependencies between trajectories in different views is explored. Since this is a new problem, two first-person datasets are created with eight actions in third-person and first-person perspectives. The first is a single-camera dataset with action instances from first-person and third-person views; the second is a multi-camera dataset in which each action instance has multiple first-person and third-person views.
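As a rough sketch of scoring causal dependencies between two trajectory signals (for instance, displacements of corresponding trajectories observed in two views), the snippet below uses the standard Granger causality test from statsmodels. It is a stand-in for the thesis's generalised inter-view causality descriptor, which it does not reproduce.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def causality_score(signal_a, signal_b, maxlag=3):
    """F-statistic for 'signal_b Granger-causes signal_a', best over tested lags."""
    data = np.column_stack([signal_a, signal_b])
    results = grangercausalitytests(data, maxlag=maxlag)
    # Keep the strongest F-statistic across lags 1..maxlag.
    return max(results[lag][0]['ssr_ftest'][0] for lag in results)
```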
In the final part of the thesis, a feature encoding scheme and a subvolume sampling scheme for recognising actions in videos are proposed. The proposed Hyper-Fisher Vector encoding embeds the Bag-of-Words encoding into the Fisher Vector encoding. The resulting encoding is simple, effective, and improves classification performance over state-of-the-art techniques; it can be used in place of the traditional Fisher Vector encoding in other recognition approaches. The proposed subvolume sampling scheme, used to generate second-layer features in "deep" approaches to action recognition, iteratively increases the size of the valid subvolumes in the temporal direction to generate new subvolumes. The proposed sampling requires fewer subvolumes to represent the actions well and is thus less computationally intensive than the original sampling scheme. The techniques are evaluated on large-scale, challenging, publicly available datasets. The Hyper-Fisher Vector combined with the proposed sampling scheme performs better than state-of-the-art techniques for action classification in videos.
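For context, here is a minimal sketch of the standard first-order Fisher Vector encoding that the Hyper-Fisher Vector builds on; the Bag-of-Words embedding step itself is not reproduced. The GMM is assumed to be fit with covariance_type='diag'.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_means(descriptors, gmm):
    """First-order Fisher Vector; `gmm` must be fit with covariance_type='diag'."""
    n, d = descriptors.shape
    q = gmm.predict_proba(descriptors)                 # (n, K) soft assignments
    fv = []
    for k in range(gmm.n_components):
        diff = (descriptors - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        grad = (q[:, k:k + 1] * diff).sum(axis=0) / (n * np.sqrt(gmm.weights_[k]))
        fv.append(grad)
    fv = np.concatenate(fv)
    # Power- and L2-normalisation, standard practice for Fisher Vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

In a typical pipeline the vocabulary would be trained once on pooled local descriptors, e.g. `GaussianMixture(n_components=64, covariance_type='diag').fit(train_descriptors)`, and the resulting per-video vectors fed to an SVM.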
3. A video descriptor using orientation tensors and shape-based trajectory clustering. Caetano, Felipe Andrade, 29 August 2014.
Dense trajectories have been shown to be a very promising method in the human action recognition area. Based on that, we propose a new kind of video descriptor, calculated from the relationship between the trajectory's optical flow, the gradient field in its neighborhood, and its spatio-temporal location. Orientation tensors are used to accumulate relevant information over the video, representing the tendency of direction for that kind of movement. Furthermore, a method to cluster trajectories using their shape as the metric is proposed. This allows us to accumulate different motion patterns in separate tensors and to distinguish more easily the trajectories created by real movements from those generated by camera movement. The proposed method achieves the best known recognition rates for methods under the self-descriptor constraint on popular datasets: Hollywood2 (up to 46%) and KTH (up to 94%).
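A minimal sketch of the orientation-tensor accumulation described above: each motion vector v contributes its rank-1 outer product v vᵀ, so the accumulated tensor captures the dominant direction tendencies of the vectors assigned to one trajectory cluster.

```python
import numpy as np

def accumulate_orientation_tensor(vectors):
    """vectors: (num_vectors, dim) motion/gradient vectors from one cluster."""
    # Sum of rank-1 outer products v v^T, computed as a single matrix product.
    tensor = vectors.T @ vectors
    # Normalise so tensors are comparable across videos of different lengths.
    return tensor / (np.linalg.norm(tensor) + 1e-12)
```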
4. A video self-descriptor based on sparse trajectory clustering. Figueiredo, Ana Mara de Oliveira, 10 September 2015.
Human action recognition is a challenging problem in computer vision with many potential applications. In order to describe the main movement of a video, a new motion descriptor is proposed in this work. We combine two methods for estimating the motion between frames: block matching and the brightness gradient of the image. A variable-size block matching algorithm is used to extract displacement vectors as motion information, and the cross product between each block matching vector and the gradient yields the final displacement vectors. These vectors are computed over a frame sequence, giving block trajectories that carry the temporal information. The block matching vectors are also used to cluster the sparse trajectories according to their shape. The proposed method combines this information into orientation tensors to generate the final descriptor. It is called a self-descriptor because it depends only on the input video. The global tensor descriptor is evaluated by classifying the KTH, UCF11, and Hollywood2 video datasets with a non-linear SVM classifier. Results indicate that our sparse-trajectory method is competitive with the well-known dense-trajectories approach using orientation tensors, while requiring less computational effort.
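As a rough sketch of the block-matching half of the descriptor, the snippet below estimates per-block displacement vectors by exhaustive SAD search over grayscale frames. The fixed block size and search radius are simplifications; the thesis uses a variable-size block matching scheme.

```python
import numpy as np

def block_matching(prev, curr, block=16, radius=4):
    """Return (H/block, W/block, 2) displacement vectors via SAD matching."""
    h, w = prev.shape
    out = np.zeros((h // block, w // block, 2), dtype=np.int32)
    for by in range(h // block):
        for bx in range(w // block):
            y0, x0 = by * block, bx * block
            ref = prev[y0:y0 + block, x0:x0 + block].astype(np.int32)
            best, best_dv = None, (0, 0)
            # Exhaustive search over all candidate offsets within the radius.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    if 0 <= y1 and y1 + block <= h and 0 <= x1 and x1 + block <= w:
                        cand = curr[y1:y1 + block, x1:x1 + block].astype(np.int32)
                        sad = np.abs(ref - cand).sum()   # sum of absolute differences
                        if best is None or sad < best:
                            best, best_dv = sad, (dx, dy)
            out[by, bx] = best_dv
    return out
```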