About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

The Optimal Design for Action Recognition Algorithm on Cell Processor Architecture

Pan, Po-Hsun 23 August 2011 (has links)
In recent years, automatic human action recognition has been widely researched within the computer vision and image processing communities. Automated video analysis that identifies human behavior is of great help to surveillance in areas such as home care, personal property protection, and homeland security. Action recognition must balance many factors, primarily accuracy and real-time performance. Parallelizing the action recognition algorithm can greatly improve its real-time processing capability. To meet real-time demands, we study how to parallelize an action recognition algorithm on the CELL B.E. platform. Our design is faster than the original algorithm, achieving a 231× speedup. We found that the action recognition algorithm contains many repeated operations between blocks, which can be parallelized using a single-instruction multiple-data (SIMD) architecture. The algorithm comprises four major stages: DMASKS, HMHHb, MGD, and SVM. The SIMD instructions on the CELL B.E. platform can process 128 bits of data at once. SIMD parallelism reaches 16× for DMASKS, up to 128× for HMHHb, up to 8× for MGD, and 4× for SVM. Based on the CELL B.E. acceleration mechanisms, we achieve high-performance computing models with multi-threading and multiple streaming. Our study shows that the action recognition algorithm is very well suited to multi-core systems with parallel SIMD processing. Parallelization gives the algorithm a more immediate response when identifying human actions. With this real-time advantage, more complex algorithms can be expected to be incorporated in the future to improve accuracy, achieving both immediacy and accuracy.
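As a quick sketch of where these per-stage parallelism figures come from (the element widths below are inferred from the quoted factors, not stated in the abstract), the lane count of a 128-bit SIMD register is simply 128 divided by the element width in bits:

```latex
\text{lanes} = \frac{128}{\text{element width}}:\qquad
\tfrac{128}{8} = 16\ (\text{DMASKS}),\quad
\tfrac{128}{1} = 128\ (\text{HMHHb}),\quad
\tfrac{128}{16} = 8\ (\text{MGD}),\quad
\tfrac{128}{32} = 4\ (\text{SVM})
```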
2

Im2Vid: Future Video Prediction for Static Image Action Recognition

AlBahar, Badour A Sh A. 20 June 2018 (has links)
Static image action recognition aims at identifying the action performed in a given image. Most existing static image action recognition approaches use high-level cues present in the image, such as objects, object-human interaction, or human pose, to better capture the action performed. Unlike images, videos have temporal information that greatly improves action recognition by resolving potential ambiguity. We propose to leverage a large amount of readily available unlabeled video to transfer temporal information from the video domain to the static image domain and hence improve static image action recognition. Specifically, we propose a video prediction model to predict the future video of a static image and use the predicted future video to improve static image action recognition. Our experimental results on four datasets validate that the idea of transferring temporal information from videos to static images is promising and can enhance static image action recognition performance. / Master of Science
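A minimal sketch of the two-stage inference this abstract describes, with stand-in models (the function names and the frame-repeating "predictor" are placeholders for illustration, not the thesis's actual networks):

```python
import numpy as np

def predict_future_video(image, num_frames=8):
    """Stand-in for the learned video predictor: it merely repeats the
    frame, whereas the real model would synthesize future motion."""
    return np.stack([image] * num_frames, axis=0)  # (T, H, W, C)

def classify_video(video, num_classes=10):
    """Stand-in video action classifier returning uniform class scores."""
    return np.full(num_classes, 1.0 / num_classes)

image = np.zeros((224, 224, 3), dtype=np.float32)  # a dummy static image
video = predict_future_video(image)                # stage 1: hallucinate motion
scores = classify_video(video)                     # stage 2: classify the video
print("predicted action id:", int(np.argmax(scores)))
```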
3

TOWARDS IMPROVED REPRESENTATIONS ON HUMAN ACTIVITY UNDERSTANDING

Hyung-gun Chi (17543172) 04 December 2023 (has links)
Human action recognition stands as a cornerstone in the domain of computer vision, with its utility spanning emergency response, sign language interpretation, and the burgeoning fields of augmented and virtual reality. The transition from conventional video-based recognition to skeleton-based methodologies has been a transformative shift, offering a robust alternative less susceptible to environmental noise and more focused on the dynamics of human movement.

This body of work encapsulates the evolution of action recognition, emphasizing the pivotal role of Graph Convolution Network (GCN) based approaches, particularly through the innovative InfoGCN framework. InfoGCN has set a new precedent in the field by introducing an information bottleneck-based learning objective, a self-attention graph convolution module, and a multi-modal representation of the human skeleton. These advancements have collectively elevated the accuracy and efficiency of action recognition systems.

Addressing the prevalent challenge of occlusions, particularly in single-camera setups, the Pose Relation Transformer (PORT) framework has been introduced. Inspired by the principles of Masked Language Modeling in natural language processing, PORT refines the detection of occluded joints, thereby enhancing the reliability of pose estimation under visually obstructive conditions.

Building upon the foundations laid by InfoGCN, the Skeleton ODE framework has been developed for online action recognition, enabling real-time inference without the need for complete action observation. By integrating Neural Ordinary Differential Equations, Skeleton ODE facilitates the prediction of future movements, thus reducing latency and paving the way for real-time applications.

The implications of this research are vast, indicating a future where real-time, efficient, and accurate human action recognition systems could significantly impact various sectors, including healthcare, autonomous vehicles, and interactive technologies. Future research directions point towards the integration of multi-modal data, the application of transfer learning for enhanced generalization, the optimization of models for edge computing, and the ethical deployment of action recognition technologies. The potential for these systems to contribute to healthcare, particularly in patient monitoring and disease detection, underscores the need for continued interdisciplinary collaboration and innovation.
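As an illustration of the skeleton-based GCN machinery underpinning this line of work, here is a minimal graph-convolution step on a toy five-joint skeleton (a generic GCN layer in the spirit of these approaches, not InfoGCN's self-attention module):

```python
import numpy as np

edges = [(0, 1), (1, 2), (1, 3), (1, 4)]  # toy skeleton: spine joint linked to head and limbs
num_joints, in_dim, out_dim = 5, 3, 16

A = np.eye(num_joints)                    # adjacency with self-loops
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt       # symmetrically normalized adjacency

rng = np.random.default_rng(0)
X = rng.normal(size=(num_joints, in_dim))  # per-joint features, e.g. 3-D coordinates
W = rng.normal(size=(in_dim, out_dim))     # learnable layer weights

H = np.maximum(A_hat @ X @ W, 0.0)         # one GCN layer: ReLU(A_hat X W)
print(H.shape)  # (5, 16): joint features mixed along the skeleton graph
```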
4

Multi-view Geometric Constraints For Human Action Recognition And Tracking

Gritai, Alexei 01 January 2007 (has links)
Human actions are the essence of a human life and a natural product of the human mind. Analysis of human activities by a machine has attracted the attention of many researchers. This analysis is very important in a variety of domains, including surveillance, video retrieval, human-computer interaction, and athlete performance investigation. This dissertation makes three major contributions to the automatic analysis of human actions. First, we conjecture that the relationship between the body joints of two actors in the same posture can be described by a 3D rigid transformation. This transformation simultaneously captures different poses and various sizes and proportions. As a consequence of this conjecture, we show that there exists a fundamental matrix between the imaged positions of the body joints of two actors if they are in the same posture. Second, we propose a novel projection model for cameras moving at a constant velocity in 3D space, Galilean cameras, derive the corresponding Galilean fundamental matrix, and apply it to human action recognition. Third, we propose a novel use of the invariance of the ratio of areas under an affine transformation, together with the epipolar geometry between two cameras, for 2D model-based tracking of human body joints. In the first part of the thesis, we propose an approach to match human actions using semantic correspondences between human bodies. These correspondences are used to provide geometric constraints between multiple anatomical landmarks (e.g., hands, shoulders, and feet) to match actions observed from different viewpoints and performed at different rates by actors of differing anthropometric proportions. The fact that the human body has approximately consistent anthropometric proportions allows for innovative use of the machinery of epipolar geometry to provide constraints for analyzing actions performed by people of different anthropometric sizes, while ensuring that changes in viewpoint do not affect matching. A novel measure, based on the rank of a matrix constructed only from image measurements of the locations of anatomical landmarks, is proposed to ensure that similar actions are accurately recognized. Finally, we describe how dynamic time warping can be used in conjunction with the proposed measure to match actions in the presence of nonlinear time warps. We demonstrate the versatility of our algorithm in a number of challenging sequences and applications, including action synchronization, odd-one-out detection, following the leader, and periodicity analysis. Next, we extend the conventional model of image projection to video captured by a camera moving at constant velocity. We term such a moving camera a Galilean camera. To that end, we derive the spacetime projection and develop the corresponding epipolar geometry between two Galilean cameras. Both perspective imaging and linear pushbroom imaging are specializations of the proposed model, and we show how six different "fundamental" matrices, including the classic fundamental matrix, the Linear Pushbroom (LP) fundamental matrix, and a fundamental matrix relating Epipolar Plane Images (EPIs), are related and can be directly recovered from a Galilean fundamental matrix. We provide linear algorithms for estimating the parameters of the mapping between videos in the case of planar scenes. To apply the fundamental matrix between Galilean cameras to human action recognition, we propose a measure with two important properties. The first property makes it possible to recognize similar actions if their execution rates are linearly related. The second property allows recognizing actions in video captured by Galilean cameras. Thus, the proposed algorithm guarantees that actions can be correctly matched despite changes in view, execution rate, and anthropometric proportions of the actor, even if the camera moves with constant velocity. Finally, we propose a novel 2D model-based approach for tracking human body parts during articulated motion. The human body is modeled as a 2D stick figure of thirteen body joints, and an action is considered a sequence of these stick figures. Given the locations of these joints in every frame of a model video and in the first frame of a test video, the joint locations are automatically estimated throughout the test video using two geometric constraints. First, the invariance of the ratio of areas under an affine transformation is used for an initial estimate of the joint locations in the test video. Second, the epipolar geometry between the two cameras is used to refine these estimates. Using the estimated joint locations, the tracking algorithm determines the exact location of each landmark in the test video using foreground silhouettes. The novelty of the proposed approach lies in the geometric formulation of human action models, the combination of two geometric constraints for body-joint prediction, and the handling of deviations in the anthropometry of individuals, viewpoints, execution rates, and styles of performing an action. The proposed approach does not require extensive training and can easily adapt to a wide variety of articulated actions.
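A small sketch of the epipolar constraint at the heart of the first contribution: imaged landmarks x and x' of two actors in the same posture should satisfy x'ᵀFx = 0 for some fundamental matrix F (the matrix below is an arbitrary rank-2 example for illustration, not one estimated from data):

```python
import numpy as np

# An arbitrary skew-symmetric 3x3 matrix; skew symmetry in odd dimension
# guarantees rank 2, the defining rank of a fundamental matrix.
F = np.array([[ 0.0, -1.0,  2.0],
              [ 1.0,  0.0, -3.0],
              [-2.0,  3.0,  0.0]])

def epipolar_residual(F, x1, x2):
    """|x2^T F x1| for homogeneous image points; 0 for an exact match."""
    return abs(x2 @ F @ x1)

x1 = np.array([1.0, 2.0, 1.0])   # landmark in view 1 (homogeneous coords)
x2 = np.array([3.0, 1.0, 1.0])   # candidate corresponding landmark in view 2
print(epipolar_residual(F, x1, x2))  # large residual => not the same posture
```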
5

Interactive tracking and action retrieval to support human behavior analysis

Ciptadi, Arridhana 27 May 2016 (has links)
The goal of this thesis is to develop a set of tools for continuous tracking of behavioral phenomena in videos to support human behavior studies. Current standard practices for extracting useful behavioral information from a video are typically difficult to replicate and require a lot of human time. For example, extensive training is typically required for a human coder to reliably code a particular behavior or interaction. Manual coding also typically takes far longer than the actual length of the video (e.g., human-assisted single-object tracking can take up to 6 times the video's length). The time-intensive nature of this process (due to the need to train experts and code manually) puts a strong burden on the research process. In fact, it is not uncommon for an institution that heavily uses videos for behavioral research to have a massive backlog of unprocessed video data. To address this issue, I have developed an efficient behavior-retrieval and interactive-tracking system. These tools allow behavioral researchers and clinicians to more easily extract relevant behavioral information and more objectively analyze behavioral data from videos. I have demonstrated that my behavior retrieval system achieves state-of-the-art performance for retrieving stereotypical behaviors of individuals with autism in real-world video data captured in a classroom setting. I have also demonstrated that my interactive tracking system produces high-precision tracking results with less human effort than the state-of-the-art. I further show that, by leveraging the tracking results, we can extract an objective measure based on proximity between people that is useful for analyzing certain social interactions. I validated this new measure by showing that it can predict qualitative expert ratings in the Strange Situation (a procedure for studying infant attachment security), a quantity that is difficult to obtain due to the difficulty of training the human expert.
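A minimal sketch of the kind of proximity measure described above, computed from two tracked centroid trajectories (the toy tracks here are illustrative, not the thesis's data):

```python
import numpy as np

# Per-frame (x, y) centroids of two tracked people over four frames.
track_a = np.array([[0, 0], [1, 0], [2, 1], [3, 2]], dtype=float)
track_b = np.array([[5, 5], [4, 4], [3, 3], [3, 2]], dtype=float)

proximity = np.linalg.norm(track_a - track_b, axis=1)  # distance per frame
print(proximity)  # a decreasing distance suggests an approach interaction
```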
6

Action recognition using deep learning

Palasek, Petar January 2017 (has links)
In this thesis we study deep learning architectures for the problem of human action recognition in image sequences, i.e., the problem of automatically recognizing what people are doing in a given video. As unlabeled video data is easily accessible these days, we first explore models that can learn meaningful representations of sequences without having to know what is happening in the sequences at hand. More specifically, we first explore the convolutional restricted Boltzmann machine (RBM) and show how a stack of convolutional RBMs can be used to learn and extract features from sequences in an unsupervised way. Using the classical Fisher vector pipeline to encode the extracted features, we apply them to the task of action classification. We then move on to feature extraction using larger, deep convolutional neural networks and propose a novel architecture which expresses the processing steps of the classical Fisher vector pipeline as network layers. In contrast to other methods, where these steps are performed consecutively and the corresponding parameters are learned in an unsupervised manner, defining them as a single neural network allows us to refine the whole model discriminatively in an end-to-end fashion. We show that our method achieves significant improvements over the classical Fisher vector extraction chain and performs comparably to other convolutional networks, while largely reducing the number of required trainable parameters. Finally, we explore how the proposed architecture can be modified into a hybrid network that combines the benefits of both unsupervised and supervised training methods, resulting in a model that learns a semi-supervised Fisher vector descriptor of the input data. We evaluate the proposed model on image classification and action recognition problems and show how the model's classification performance improves as the amount of unlabeled data increases during training.
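For concreteness, here is a simplified Fisher-vector encoding (mean-gradient terms only, under a diagonal-covariance GMM) of the kind the thesis re-expresses as network layers; all sizes and parameters below are toy values, not the thesis's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 8, 4                      # descriptors, feature dims, GMM components
X = rng.normal(size=(N, D))              # local features extracted from a video
mu = rng.normal(size=(K, D))             # GMM component means
sigma = np.ones((K, D))                  # diagonal standard deviations
w = np.full(K, 1.0 / K)                  # mixture weights

# Soft assignments gamma[n, k] ∝ w_k * N(x_n | mu_k, sigma_k); constant terms
# cancel in the softmax since sigma is all-ones here.
log_p = -0.5 * (((X[:, None, :] - mu) / sigma) ** 2).sum(-1) + np.log(w)
gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
gamma /= gamma.sum(axis=1, keepdims=True)

# Gradient of the log-likelihood w.r.t. the means, normalized per component.
G = (gamma[:, :, None] * (X[:, None, :] - mu) / sigma).sum(axis=0)
fv = (G / (N * np.sqrt(w)[:, None])).ravel()  # the Fisher-vector descriptor
print(fv.shape)                                # (K * D,) = (32,)
```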
7

Biologically Plausible Neural Model for the Recognition of Biological Motion and Actions

Giese, Martin Alexander, Poggio, Tomaso 01 August 2002 (has links)
The visual recognition of complex movements and actions is crucial for communication and survival in many species. Remarkable sensitivity and robustness of biological motion perception have been demonstrated in psychophysical experiments. In recent years, neurons and cortical areas involved in action recognition have been identified in neurophysiological and imaging studies. However, the detailed neural mechanisms that underlie the recognition of such complex movement patterns remain largely unknown. This paper reviews the experimental results and summarizes them in terms of a biologically plausible neural model. The model is based on the key assumption that action recognition is based on learned prototypical patterns and exploits information from the ventral and the dorsal pathway. The model makes specific predictions that motivate new experiments.
8

Surveillance of Time-varying Geometry Objects using a Multi-camera Active-vision System

Mackay, Matthew Donald 10 January 2012 (has links)
The recognition of time-varying geometry (TVG) objects (in particular, humans) and their actions is a complex task due to common real-world sensing challenges, such as obstacles and environmental variations, as well as issues specific to TVG objects, such as self-occlusion. Herein, it is proposed that a multi-camera active-vision system, which dynamically selects camera poses in real time, be used to improve TVG action-sensing performance by selecting camera views on-line for near-optimal sensing-task performance. Active vision for TVG objects requires an on-line sensor-planning strategy that incorporates information about the object itself, including its current action, and about the state of the environment, including obstacles, into the pose-selection process. Thus, the focus of this research is the development of a novel methodology for real-time sensing-system reconfiguration (active vision), designed specifically for the recognition of a single TVG object and its actions in a cluttered, dynamic environment, which may contain multiple other dynamic (maneuvering) obstacles. The proposed methodology was developed as a complete, customizable sensing-system framework, readily modifiable to suit a variety of specific TVG action-sensing tasks: a 10-stage real-time pipeline architecture. Sensor Agents capture and correct camera images, removing noise and lens distortion, and segment the images into regions of interest. A Synchronization Agent aligns multiple images from different cameras to a single 'world time.' Point Tracking and De-Projection Agents detect, identify, and track points of interest in the resultant 2-D images, and form constraints in normalized camera coordinates using the tracked pixel coordinates. A 3-D Solver Agent combines all constraints to estimate world-coordinate positions for all visible features of the object-of-interest (OoI) 3-D articulated model. A Form-Recovery Agent uses an iterative process to combine model constraints, detected feature points, and other contextual information to produce an estimate of the OoI's current form. This estimate is used by an Action-Recognition Agent to determine which action the OoI is performing, if any, from a library of known actions, using a feature-vector descriptor for identification. A Prediction Agent provides estimates of future OoI and obstacle poses, given past detected locations, and of future OoI forms, given the current action and past forms. Using all of the data accumulated in the pipeline, a Central Planning Agent implements a formal mathematical optimization developed from the general sensing problem. The agent seeks to optimize a visibility metric, which is positively related to sensing-task performance, to select desirable, feasible, and achievable camera poses for the next sensing instant. Finally, a Referee Agent examines the complete set of chosen poses for consistency, enforces global rules not captured by the optimization, and maintains system functionality if a suitable solution cannot be determined. To validate the proposed methodology, rigorous experiments are also presented herein. They confirm the basic assumptions of active vision for TVG objects and characterize the gains in sensing-task performance. Simulated experiments provide a method for rapid evaluation of new sensing tasks. These experiments demonstrate a tangible increase in single-action recognition performance over the use of a static-camera sensing system.
Furthermore, they illustrate the need for feedback in the pose-selection process, allowing the system to incorporate knowledge of the OoI’s form and action. Later real-world, multi-action and multi-level action experiments demonstrate the same tangible increase when sensing real-world objects that perform multiple actions which may occur simultaneously, or at differing levels of detail. A final set of real-world experiments characterizes the real-time performance of the proposed methodology in relation to several important system design parameters, such as the number of obstacles in the environment, and the size of the action library. Overall, it is concluded that the proposed system tangibly increases TVG action-sensing performance, and can be generalized to a wide range of applications, including human-action sensing. Future research is proposed to develop similar methods to address deformable objects and multiple objects of interest.
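A toy sketch of the Central Planning Agent's core decision, choosing a feasible candidate camera pose that maximizes a visibility metric (the metric and feasibility test below are stand-ins for illustration, not the thesis's formal optimization):

```python
import numpy as np

rng = np.random.default_rng(1)
candidate_poses = rng.uniform(-1, 1, size=(20, 3))  # toy (pan, tilt, zoom) candidates

def visibility(pose):
    """Toy visibility score: favors centered pan/tilt and moderate zoom."""
    return -np.sum(pose[:2] ** 2) - abs(pose[2] - 0.5)

def feasible(pose):
    """Toy feasibility: the pose must be reachable before the next sensing instant."""
    return np.all(np.abs(pose) < 0.9)

# Score only the feasible poses and pick the best for the next instant.
scores = [visibility(p) if feasible(p) else -np.inf for p in candidate_poses]
best = candidate_poses[int(np.argmax(scores))]
print("selected camera pose:", best)
```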
9

Implementation of Action Recognition Algorithm on Multiple-Streaming Multimedia Unit

Lin, Tzu-chun 03 August 2010 (has links)
Action recognition has seen rapid development and broad application across several sectors: homeland security, personal property protection, home care, and even smart environments and motion-sensing games are among its territories. This paper analyzes an action recognition algorithm for embedded systems and finds that many blocks can be computed more efficiently in parallel. It implements the action recognition algorithm on the Multiple-Streaming Multimedia Unit (MSMU), an MMX-like SIMD architecture with SIMD operation and data storage. By introducing the concept of multiple streaming, MSMU can dynamically modulate the number of parallel data streams by switching instruction modes. With mode switching and a newly added transfer instruction for 2D image processing, we study the benefit of instruction-mode switching. Comparing the 128-bit SSE architecture with the MSMU architecture on a practical example highlights the problems faced when exploiting subword parallelism and brings out the advantage of multistreaming. For the algorithm itself, we study slicing data to the minimum element size and using bitwise operations to improve efficiency. Compared to the embedded SIMD architecture WMMX, MSMU achieves a 3.49× overall speedup.
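A small analogy for MSMU-style instruction-mode switching, assuming the usual SIMD convention that the contents of one 128-bit register can be reinterpreted at different lane widths (NumPy views stand in for hardware modes here):

```python
import numpy as np

reg = np.arange(16, dtype=np.uint8)   # one 128-bit register's worth of data

lanes8 = reg                          # "8-bit mode": 16 parallel lanes
lanes16 = reg.view(np.uint16)         # "16-bit mode": the same bytes as 8 lanes

print(lanes8 + 1)    # one SIMD-style add updates 16 elements at once
print(lanes16 + 1)   # after the mode switch, the same data forms 8 wider lanes
```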
