11

Geometric Invariance In The Analysis Of Human Motion In Video Data

Shen, Yuping 01 January 2009 (has links)
Human motion analysis is one of the major problems in computer vision research. It deals with the study of the motion of the human body in video data from different aspects, ranging from the tracking of body parts and the reconstruction of 3D human body configuration to the higher-level interpretation of human actions and activities in image sequences. When human motion is observed through a video camera, it is perspectively distorted and may appear totally different from different viewpoints. It is therefore highly challenging to establish correct relationships between human motions across video sequences with different camera settings. In this work, we investigate geometric invariance in the motion of the human body, which is critical to accurately understanding human motion in video data regardless of variations in camera parameters and viewpoints. In human action analysis, the representation of human action is a very important issue, and it usually determines the nature of the solutions, including their limits in resolving the problem. Unlike existing research that studies human motion as a whole 2D/3D object or a sequence of postures, we study human motion as a sequence of body pose transitions. We further decompose a human body pose into a number of body point triplets, and break down a pose transition into the transitions of a set of body point triplets. In this way the study of the complex non-rigid motion of the human body is reduced to that of the motion of rigid body point triplets, i.e. a collection of planes in motion. As a result, projective geometry and linear algebra can be applied to explore the geometric invariance in human motion. Based on this formulation, we have discovered the fundamental ratio invariant and the eigenvalue equality invariant in human motion. We also propose solutions based on these geometric invariants to the problems of view-invariant recognition of human postures and actions, as well as the analysis of human motion styles. These invariants and their applicability have been validated by experimental results supporting their effectiveness in understanding human motion under various camera parameters and viewpoints.
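The triplet formulation rests on standard two-view projective relations, which may help to recall here (textbook background from multiple-view geometry, not the thesis's own derivation): corresponding image points satisfy the epipolar constraint, and three rigid body points define a scene plane that induces a homography between views.

```latex
% Background relations (assumed notation: K, K' intrinsics; R, t relative
% pose; scene plane n^T X = d). Not the thesis's specific invariants.
x'^{\top} F \, x = 0
  \quad \text{(epipolar constraint for corresponding points } x \leftrightarrow x'\text{)}
\qquad
x' \simeq H \, x, \quad
H = K' \!\left( R - \frac{t \, n^{\top}}{d} \right) K^{-1}
  \quad \text{(homography induced by the plane)}
```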
12

A tensor motion descriptor based on multiple gradient estimators / Um descritor tensorial de movimento baseado em múltiplos estimadores de gradiente

Sad, Dhiego Cristiano Oliveira da Silva 22 February 2013 (has links)
This work presents a novel approach for motion description in videos, using multiple band-pass filters that act as first-order derivative estimators. The filter responses on each frame are coded into individual histograms of gradients to reduce their dimensionality, and the histograms are combined using orientation tensors. Unlike most approaches in the literature, no local features are extracted and no prior learning is performed, i.e., the descriptor depends only on the input video. Motion description can be enhanced even when using multiple filters with similar or overlapping frequency responses. For the problem of human action recognition on the KTH database, our descriptor achieved a recognition rate of 93.3% using three filters from the Daubechies family plus one extra filter designed to correlate them; the resulting descriptor is classified with an SVM under a two-fold protocol. This performance is superior to most approaches that use global descriptors and fairly comparable to state-of-the-art methods.
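To make the histogram-plus-tensor idea concrete, here is a hedged Python sketch that accumulates per-frame gradient histograms as rank-one orientation tensors. The spatial gradient of the temporal difference stands in for the Daubechies filter bank, so this illustrates the descriptor's structure rather than the thesis's exact pipeline.

```python
import numpy as np

def frame_gradient_histogram(frame_a, frame_b, n_bins=16):
    """Histogram of gradient orientations of the temporal difference,
    weighted by magnitude (a stand-in for the band-pass filter responses)."""
    diff = frame_b.astype(np.float64) - frame_a.astype(np.float64)
    gy, gx = np.gradient(diff)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                         # in [-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def video_orientation_tensor(frames, n_bins=16):
    """Accumulate each per-frame histogram h as a rank-one tensor h h^T."""
    T = np.zeros((n_bins, n_bins))
    for a, b in zip(frames[:-1], frames[1:]):
        h = frame_gradient_histogram(a, b, n_bins)
        T += np.outer(h, h)
    T /= np.linalg.norm(T)
    # T is symmetric, so the upper triangle suffices as the final descriptor.
    return T[np.triu_indices(n_bins)]

# Usage: descriptors = [video_orientation_tensor(v) for v in videos],
# then feed the descriptors to an SVM as in the evaluation protocol above.
```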
13

A video descriptor using orientation tensors and shape-based trajectory clustering

Caetano, Felipe Andrade 29 August 2014 (has links)
Dense trajectories have been shown to be an extremely promising method in the human action recognition area. Building on this, we propose a new kind of video descriptor, computed from the relationship between the trajectory's optical flow, the gradient field in its neighborhood, and its spatio-temporal location. Orientation tensors are used to accumulate relevant information over the video, representing the tendency of direction for each kind of movement. Furthermore, a method to cluster trajectories using their shape is proposed. This allows us to accumulate different motion patterns in separate tensors and to more easily distinguish trajectories created by real movements from those generated by camera movement. The proposed method achieves the best known recognition rates among methods restricted to self-descriptors on popular datasets: Hollywood2 (up to 46%) and KTH (up to 94%).
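A hedged sketch of shape-based trajectory clustering follows: each trajectory is reduced to its normalized displacement sequence, so clustering groups similar motion shapes regardless of image position. The representation and the use of k-means are illustrative assumptions; the thesis's actual shape metric may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def trajectory_shape(traj):
    """traj: (L+1, 2) array of tracked points -> flattened, scale-normalized
    displacement sequence. All trajectories are assumed to share length L."""
    disp = np.diff(traj, axis=0)                    # L displacement vectors
    scale = np.linalg.norm(disp, axis=1).sum()
    return (disp / scale).ravel() if scale > 0 else disp.ravel()

def cluster_trajectories(trajs, k=5, seed=0):
    """Group trajectories by shape; one tensor can then be accumulated
    per cluster, separating real motion patterns from camera motion."""
    shapes = np.stack([trajectory_shape(t) for t in trajs])
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(shapes)
```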
14

Multi-view Geometric Constraints For Human Action Recognition And Tracking

Gritai, Alexei 01 January 2007 (has links)
Human actions are the essence of human life and a natural product of the human mind. The analysis of human activities by machine has attracted the attention of many researchers and is important in a variety of domains, including surveillance, video retrieval, human-computer interaction, and athlete performance investigation. This dissertation makes three major contributions to the automatic analysis of human actions. First, we conjecture that the relationship between the body joints of two actors in the same posture can be described by a 3D rigid transformation that simultaneously captures different poses and various sizes and proportions. As a consequence of this conjecture, we show that there exists a fundamental matrix between the imaged positions of the body joints of two actors if they are in the same posture. Second, we propose a novel projection model for cameras moving at a constant velocity in 3D space, termed Galilean cameras, derive the Galilean fundamental matrix, and apply it to human action recognition. Third, we propose a novel use of the invariance of the ratio of areas under an affine transformation, together with the epipolar geometry between two cameras, for 2D model-based tracking of human body joints.

In the first part of the thesis, we propose an approach to match human actions using semantic correspondences between human bodies. These correspondences provide geometric constraints between multiple anatomical landmarks (e.g., hands, shoulders, and feet) to match actions observed from different viewpoints and performed at different rates by actors of differing anthropometric proportions. The fact that the human body has approximately consistent anthropometric proportions allows an innovative use of the machinery of epipolar geometry to constrain the analysis of actions performed by people of different sizes, while ensuring that changes in viewpoint do not affect matching. A novel measure, based on the rank of a matrix constructed only from image measurements of the locations of anatomical landmarks, is proposed to ensure that similar actions are accurately recognized. Finally, we describe how dynamic time warping can be used in conjunction with the proposed measure to match actions in the presence of nonlinear time warps. We demonstrate the versatility of our algorithm in a number of challenging sequences and applications, including action synchronization, odd one out, following the leader, and periodicity analysis.

Next, we extend the conventional model of image projection to video captured by a camera moving at constant velocity; we term such a moving camera a Galilean camera. To that end, we derive the spacetime projection and develop the corresponding epipolar geometry between two Galilean cameras. Both perspective imaging and linear pushbroom imaging are specializations of the proposed model, and we show how six different "fundamental" matrices, including the classic fundamental matrix, the Linear Pushbroom (LP) fundamental matrix, and a fundamental matrix relating Epipolar Plane Images (EPIs), are related and can be directly recovered from a Galilean fundamental matrix. We provide linear algorithms for estimating the parameters of the mapping between videos in the case of planar scenes. To apply the fundamental matrix between Galilean cameras to human action recognition, we propose a measure with two important properties: the first makes it possible to recognize similar actions whose execution rates are linearly related, and the second allows recognizing actions in video captured by Galilean cameras. Thus, the proposed algorithm guarantees that actions can be correctly matched despite changes in view, execution rate, and anthropometric proportions of the actor, even if the camera moves with constant velocity.

Finally, we propose a novel 2D model-based approach for tracking human body parts during articulated motion. The human body is modeled as a 2D stick figure of thirteen body joints, and an action is considered a sequence of these stick figures. Given the locations of these joints in every frame of a model video and in the first frame of a test video, the joint locations are automatically estimated throughout the test video using two geometric constraints: the invariance of the ratio of areas under an affine transformation gives an initial estimate of the joint locations, and the epipolar geometry between the two cameras refines these estimates. Using the estimated joint locations, the tracking algorithm determines the exact location of each landmark in the test video using the foreground silhouettes. The novelty of the proposed approach lies in the geometric formulation of human action models, the combination of the two geometric constraints for body joint prediction, and the handling of deviations in anthropometry, viewpoint, execution rate, and style. The proposed approach does not require extensive training and can easily adapt to a wide variety of articulated actions.
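As a rough illustration of the rank-based posture measure described above, consider the standard eight-point formulation: if the imaged joints of two actors in the same posture are related by some fundamental matrix, the design matrix built from their correspondences is rank-deficient. The sketch below scores posture similarity by the smallest singular value of that matrix; it is an illustration under that standard formulation, not the dissertation's exact measure.

```python
import numpy as np

def epipolar_design_matrix(x1, x2):
    """x1, x2: (N, 2) joint locations in two images. Returns the (N, 9)
    matrix A with rows [u'u, u'v, u', v'u, v'v, v', u, v, 1], so that a
    consistent fundamental matrix F (as a 9-vector f) satisfies A f = 0."""
    u, v = x1[:, 0], x1[:, 1]
    up, vp = x2[:, 0], x2[:, 1]
    ones = np.ones_like(u)
    return np.stack([up*u, up*v, up, vp*u, vp*v, vp, u, v, ones], axis=1)

def posture_distance(x1, x2):
    """Smallest singular value of A: near zero when the two joint sets
    are consistent with a single epipolar geometry (same posture)."""
    A = epipolar_design_matrix(x1, x2)
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # scale-insensitive rows
    return np.linalg.svd(A, compute_uv=False)[-1]

# Usage: with 13 joints, A is 13x9; a rank <= 8 matrix (tiny last singular
# value) supports the same-posture hypothesis.
```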
15

Modeling Scenes And Human Activities In Videos

Basharat, Arslan 01 January 2009 (has links)
In this dissertation, we address the problem of understanding human activities in videos through a two-pronged approach: coarse-level modeling of scene activities and fine-level modeling of individual activities. At the coarse level, where the resolution of the video is low, we rely on person tracks. At the fine level, richer features are available to identify different parts of the human body, so we rely on body joint tracks. This dissertation has three main goals: (1) identify unusual activities at the coarse level, (2) recognize different activities at the fine level, and (3) predict behavior for synthesizing and tracking activities at the fine level. The first goal is addressed by modeling activities at the coarse level through two novel and complementary approaches. The first approach learns the behavior of individuals by capturing the patterns of motion and size of objects in a compact model: the probability density function (pdf) at each pixel is modeled as a multivariate Gaussian Mixture Model (GMM), learnt using unsupervised expectation maximization (EM). In contrast, the second approach learns the interaction of object pairs concurrently present in the scene, which can be useful for detecting more complex activities than those modeled by the first approach; we use a 14-dimensional Kernel Density Estimation (KDE) that captures the motion and size of concurrently tracked objects. The proposed models have been successfully used to automatically detect activities such as unusual person drop-off and pick-up and jaywalking. The second and third goals of modeling human activities at the fine level are addressed by employing concepts from chaos theory and non-linear dynamical systems, and we show that the proposed model is useful for recognition and prediction of the underlying dynamics of human activities. We treat the trajectories of human body joints as an observed time series generated by an underlying dynamical system. The observed data is used to reconstruct a phase (or state) space of appropriate dimension via the delay-embedding technique. This transformation is performed without assuming an exact model of the underlying dynamics and provides a characteristic representation that proves vital for recognition and prediction tasks. For recognition, properties of the phase space are captured in terms of dynamical and metric invariants, including the Lyapunov exponent, correlation integral, and correlation dimension; a composite feature vector containing these invariants represents the action and is used for classification. For prediction, kernel regression is used in the phase space to compute predictions from a specified initial condition. This approach has the advantage of modeling dynamics without making any assumption about the exact form (polynomial, radial basis, etc.) of the mapping function. We demonstrate the utility of these predictions for human activity synthesis and tracking.
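The delay-embedding step mentioned above is straightforward to state concretely. A minimal sketch follows; the embedding dimension and lag here are illustrative placeholders, whereas the dissertation selects them from the data.

```python
import numpy as np

def delay_embed(x, m=3, tau=5):
    """x: 1-D time series of length T (e.g., one coordinate of a body-joint
    track). Returns (T - (m-1)*tau, m) points in the reconstructed phase
    space, each row being (x_t, x_{t+tau}, ..., x_{t+(m-1)tau})."""
    n = len(x) - (m - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(m)], axis=1)

# Usage: points = delay_embed(joint_y_coordinate); invariants such as the
# correlation dimension are then estimated on `points`.
```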
16

Computer Vision in Fitness: Exercise Recognition and Repetition Counting / Datorseende i fitness: Träningsigenkänning och upprepningsräkning

Barysheva, Anna January 2022 (has links)
Motion classification and action localization have rapidly become essential tasks in computer vision and video analytics. In particular, Human Action Recognition (HAR), which has important applications in clinical assessment, activity monitoring, and sports performance evaluation, has drawn a lot of attention in research communities. Nevertheless, the high-dimensional and time-continuous nature of motion data creates non-trivial challenges in action detection and action recognition. In this degree project, on a set of recorded, unannotated mixed workouts, we test and evaluate unsupervised and semi-supervised machine learning models to identify the correct location, i.e., a timestamp, of various exercises in videos, and we study different approaches to clustering the detected actions. This is done by modelling the data via a two-step clustering pipeline using the Bag-of-Visual-Words (BoVW) approach; repetition counting is considered as a parallel task. We find that clustering alone tends to produce cluster solutions containing a mixture of exercises and is not sufficient to solve the exercise recognition problem. Instead, we use clustering as an initial step to aggregate similar exercises, which allows us to efficiently find many repetitions of similar exercises for further annotation. When combined with a subsequent Support Vector Machine (SVM) classifier, the BoVW concept proved itself, achieving an accuracy score of 95.5% on the labelled subset. Much attention has also been paid to various methods of dimensionality reduction, benchmarking their ability to encode the original data into a lower-dimensional latent space.
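For readers unfamiliar with the pipeline, a hedged Python sketch of a generic BoVW encoder followed by an SVM classifier appears below. Local descriptor extraction, vocabulary size, and kernel choice are assumptions for illustration, not the thesis's exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def fit_vocabulary(all_local_descriptors, k=200, seed=0):
    """Learn a visual vocabulary by clustering local motion descriptors
    pooled from the training videos ((total_count, d) array)."""
    return KMeans(n_clusters=k, n_init=4, random_state=seed).fit(all_local_descriptors)

def encode(video_descriptors, vocab):
    """Encode one video's (n, d) local descriptors as a normalized
    codeword histogram."""
    words = vocab.predict(video_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Usage sketch:
# vocab = fit_vocabulary(np.vstack(train_descs))
# X = np.stack([encode(d, vocab) for d in train_descs])
# clf = SVC(kernel="rbf").fit(X, y_train)
```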
17

A video self-descriptor based on sparse trajectory clustering

Figueiredo, Ana Mara de Oliveira 10 September 2015 (has links)
Human action recognition is a challenging problem in computer vision with many potential applications. In order to describe the main movement of a video, a new motion descriptor is proposed in this work. We combine two methods for estimating the motion between frames: block matching and the brightness gradient of the image. We use a variable-size block matching algorithm to extract displacement vectors as motion information, and the cross product between the block matching vector and the gradient is used to obtain the displacement vectors. These vectors are computed over a frame sequence, yielding the block trajectory, which carries the temporal information. The block matching vectors are also used to cluster the sparse trajectories according to their shape. The proposed method combines this information to obtain orientation tensors and to generate the final descriptor. It is called a self-descriptor because it depends only on the input video. The global tensor descriptor is evaluated by classification of the KTH, UCF11 and Hollywood2 video datasets with a non-linear SVM classifier. Results indicate that our sparse trajectories method is competitive with the well-known dense trajectories approach using orientation tensors, while requiring less computational effort.
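As background for the block matching stage, the sketch below shows a fixed-size exhaustive-search matcher that returns one displacement vector per block. The thesis uses a variable-size scheme; the fixed block size and SAD cost here are our simplifications.

```python
import numpy as np

def block_displacement(prev, curr, top, left, size=16, search=8):
    """Find the displacement (dy, dx) of the block at (top, left) in `prev`
    by exhaustively searching a (2*search+1)^2 window in `curr`, using the
    sum of absolute differences (SAD) as the matching cost."""
    block = prev[top:top+size, left:left+size].astype(np.float64)
    best, best_dy, best_dx = np.inf, 0, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y+size > curr.shape[0] or x+size > curr.shape[1]:
                continue                      # candidate falls outside frame
            cand = curr[y:y+size, x:x+size].astype(np.float64)
            cost = np.abs(cand - block).sum()
            if cost < best:
                best, best_dy, best_dx = cost, dy, dx
    return best_dy, best_dx

# Chaining these per-block displacements across consecutive frames yields
# the block trajectories described above.
```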
18

Video motion description based on histograms of sparse trajectories

Oliveira, Fábio Luiz Marinho de 05 September 2016 (has links)
Motion description has been a challenging and popular theme over many years within computer vision and signal processing, and it is closely related to machine learning and pattern recognition. Very frequently, to address this task, one extracts motion information from image sequences and encodes it into a descriptor. This work presents a simple and fast method to extract this information and encode it into descriptors based on histograms of relative displacements. Our descriptors are compact and global, meaning they aggregate information from whole frames, and they are what we call self-descriptors, meaning they do not depend on information from any sequence other than the one we want to describe. To validate these descriptors and compare them to other works, we use them in the context of human action recognition, where scenes are classified according to the action portrayed. In this validation, we achieve results comparable to the state-of-the-art for the KTH dataset. We also evaluate our method on the UCF11 and Hollywood2 datasets, with lower recognition rates, considering their higher complexity. Our approach is promising, given the fairly good recognition rates obtained with a method much less complex than those of the state-of-the-art in terms of speed of computation and final descriptor compactness. Additionally, we experiment with the use of metric learning in the classification of our descriptors, aiming to improve their separability and compactness. Our results with metric learning show inferior recognition rates, but a great improvement in descriptor compactness.
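A minimal sketch of a histogram-of-relative-displacements frame descriptor follows, assuming displacement vectors (e.g., from block matching) are already available. The joint orientation-magnitude binning scheme is an assumption for illustration.

```python
import numpy as np

def displacement_histogram(vectors, n_ang=8, n_mag=4, max_mag=16.0):
    """vectors: (N, 2) displacement vectors (dy, dx) for one frame.
    Returns a flattened, normalized (n_ang * n_mag) joint histogram of
    displacement orientation and magnitude."""
    dy, dx = vectors[:, 0], vectors[:, 1]
    ang = np.arctan2(dy, dx)                          # in [-pi, pi]
    mag = np.hypot(dy, dx)
    a = ((ang + np.pi) / (2 * np.pi) * n_ang).astype(int) % n_ang
    m = np.clip((mag / max_mag * n_mag).astype(int), 0, n_mag - 1)
    hist = np.zeros((n_ang, n_mag))
    np.add.at(hist, (a, m), 1.0)                      # scatter-add counts
    h = hist.ravel()
    s = h.sum()
    return h / s if s > 0 else h

# Per-frame histograms can then be aggregated over the whole video to form
# a compact global self-descriptor, in the spirit of the abstract above.
```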
19

Motion Based Event Analysis

Biswas, Sovan January 2014 (has links) (PDF)
Motion is an important cue in videos that captures the dynamics of moving objects. It enables effective analysis of various event-related tasks such as human action recognition, anomaly detection, tracking, crowd behavior analysis, and traffic monitoring. Generally, accurate motion information is computed using various optical flow estimation techniques. On the other hand, coarse motion information is readily available in the form of motion vectors in compressed videos. Utilizing these encoded motion vectors reduces the computational burden of flow estimation and enables rapid analysis of video streams. In this work, the focus is on analyzing motion patterns, retrieved from either motion vectors or optical flow, for various event analysis tasks such as video classification, anomaly detection and crowd flow segmentation. In the first section, we utilize the motion vectors from H.264 compressed videos, a compression standard widely used due to its high compression ratio, to address the following problems. (i) Video classification: we propose an approach to classify videos based on human action by capturing the spatio-temporal motion pattern of the actions using a Histogram of Oriented Motion Vectors (HOMV). (ii) Crowd flow segmentation: we address the problem of segmenting the dominant motion patterns of crowds, combining multi-scale super-pixel segmentation of the motion vectors to obtain the final flow segmentation. (iii) Anomaly detection: this problem is addressed by locally modeling usual behavior, capturing features such as the magnitude and orientation of each moving object. In all the above approaches, the focus is on reducing computation while retaining accuracy comparable to pixel-domain processing. In the second section, we propose two approaches for anomaly detection using optical flow. The first uses spatio-temporal low-level motion features and detects anomalies based on the reconstruction error of the sparse representation of a candidate feature over a dictionary of usual-behavior features; the main contribution is in enhancing each local dictionary by applying an appropriate transformation to the dictionaries of neighboring regions. The second algorithm aims to improve the accuracy of anomaly localization through short local trajectories of super-pixels belonging to moving objects, which capture both spatial and temporal information effectively. In contrast to compressed-domain analysis, these pixel-level approaches focus on improving detection accuracy at a reasonable detection speed.
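The reconstruction-error test in the first optical-flow approach can be sketched concretely. Below, sparse coding over a dictionary of usual-behavior features is approximated with orthogonal matching pursuit; the thesis's dictionary construction and the transformation of neighboring dictionaries are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def anomaly_score(candidate, dictionary, n_nonzero=5):
    """candidate: (d,) feature; dictionary: (n_atoms, d) rows of
    usual-behavior features. Sparse-code the candidate over the dictionary
    and return the reconstruction error: large residual -> anomalous."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero)
    omp.fit(dictionary.T, candidate)     # solve candidate ~ D^T w, w sparse
    recon = dictionary.T @ omp.coef_ + omp.intercept_
    return np.linalg.norm(candidate - recon)

# Usage: thresholding anomaly_score(f, D_region) flags features that the
# region's usual-behavior dictionary cannot reconstruct well.
```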
20

Spatial information and end-to-end learning for visual recognition / Informations spatiales et apprentissage bout-en-bout pour la reconnaissance visuelle

Jiu, Mingyuan 03 April 2014 (has links)
In this thesis, we present our research on visual recognition and machine learning. Two types of visual recognition problems are investigated: action recognition and human body part segmentation. Our objective is to integrate spatial information, such as label configuration in feature space or the spatial layout of labels, into an end-to-end framework to improve recognition performance.

For human action recognition, we apply the bag-of-words model and reformulate it as a neural network for end-to-end learning. We propose two algorithms that make use of label configuration in feature space to optimize the codebook. One is based on classical error backpropagation, where the codewords are adjusted by gradient descent. The other is based on cluster reassignment, where the cluster labels are reassigned for all the feature vectors in a Voronoi diagram computed in feature space. As a result, the codebook is learned in a supervised way. We demonstrate the effectiveness of the proposed algorithms on the standard KTH human action dataset.

For human body part segmentation, we treat segmentation as a classification problem, where a classifier acts on each pixel. Two machine learning frameworks are adopted: randomized decision forests and convolutional neural networks. We integrate a priori information on the spatial part layout, in terms of pairs of labels or pairs of pixels, into both frameworks during training to make the classifier more discriminative, while pixelwise classification is still performed at test time; this improves the classification rate without increasing computational complexity at test time. Three algorithms are proposed: (i) spatial part layout is integrated into the randomized decision forest training procedure; (ii) spatial pre-training is proposed for feature learning in ConvNets; (iii) spatial learning is proposed for logistic regression (LR) or multilayer perceptron (MLP) classification.
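As one concrete reading of the training-only spatial idea, the sketch below adds a pairwise smoothness penalty on neighbouring pixels' logits to a softmax classifier's training loss; test-time prediction remains plain pixelwise classification. The loss form, penalty weight, and optimizer are our assumptions, not the thesis's formulation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_spatial_softmax(X, y, pairs, n_classes, lam=0.1, lr=0.1, steps=500):
    """X: (N, d) pixel features; y: (N,) part labels; pairs: (M, 2) indices
    of neighbouring pixels. Minimizes cross-entropy plus a training-only
    penalty (lam/2) * mean ||z_i - z_j||^2 on neighbouring logits."""
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((X.shape[1], n_classes))
    Y = np.eye(n_classes)[y]
    i, j = pairs[:, 0], pairs[:, 1]
    for _ in range(steps):
        Z = X @ W
        P = softmax(Z)
        G = X.T @ (P - Y) / len(X)              # cross-entropy gradient
        dz = Z[i] - Z[j]                        # neighbouring logit gaps
        G += lam * (X[i] - X[j]).T @ dz / len(pairs)  # exact penalty gradient
        W -= lr * G
    return W  # test time: softmax(X_test @ W).argmax(1), no pairs needed
```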
