1

Trajectory-based Descriptors for Action Recognition in Real-world Videos

Narayan, Sanath January 2015 (has links) (PDF)
This thesis explores motion trajectory-based approaches to recognising human actions in real-world, unconstrained videos. Action recognition is an important task in applications such as video retrieval, surveillance, human-robot interaction, sports video analysis, video summarisation and behaviour monitoring, and it has received considerable research attention. Earlier work focused on videos captured by static cameras, where actions are relatively easy to recognise. With more videos now captured by moving cameras, recognising actions under irregular camera motion remains a challenge in unconstrained settings, with variations in scale, view, illumination, occlusion and unrelated background motion. With the growing use of wearable and head-mounted cameras, this thesis also explores action recognition in egocentric videos. First, an effective motion segmentation method is developed to identify the camera motion in videos captured by moving cameras. Next, action recognition in videos captured from the normal third-person perspective is discussed. The thesis then investigates action recognition in first-person (egocentric) views, which are often affected by frequent unintended camera motion caused by movement of the head on which the camera is mounted, followed by recognition of actions in egocentric videos in a multi-camera setting. Lastly, novel feature encoding and subvolume sampling techniques (for "deep" approaches) are explored in the context of action recognition in videos.

The first part of the thesis explores two effective segmentation approaches to identify the motion due to the camera. The first approach fits curves to the motion trajectories and selects the model that best fits the camera motion; it works when the generated trajectories are sufficiently smooth. To overcome this drawback and segment trajectories under non-smooth conditions, a second approach based on trajectory scoring and grouping is proposed: by identifying the instantaneous dominant background motion and aggregating the corresponding scores (denoting "foregroundness") along each trajectory, the motion associated with the camera can be separated from the motion due to foreground objects. The segmentation result has additionally been used to align videos from moving cameras, producing videos that appear to be captured by nearly static cameras.

In the second part of the thesis, recognising actions in normal videos captured from third-person cameras is investigated. To this end, two kinds of descriptors are explored. The first is the covariance descriptor adapted for motion trajectories: the covariance descriptor of a trajectory encodes the co-variations of different features along the trajectory's length. Being a second-order encoding, it captures information about the trajectory that a first-order encoding does not.
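The abstract does not fix the per-point feature set or any normalisation, so the following is a minimal sketch of the construction under an assumed feature choice: each trajectory point contributes its position and finite-difference velocity, and the descriptor is the vectorised upper triangle of the covariance matrix of those features.

    import numpy as np

    def trajectory_covariance_descriptor(points, fps=30.0):
        """Second-order descriptor of one motion trajectory.

        points: (T, 2) array of tracked (x, y) positions.
        The per-point features (an assumed choice, not taken from the
        thesis) are x, y and the finite-difference velocities vx, vy.
        """
        points = np.asarray(points, dtype=np.float64)
        vel = np.gradient(points, axis=0) * fps        # velocity in px/s
        feats = np.hstack([points, vel])               # (T, 4) feature matrix
        cov = np.cov(feats, rowvar=False)              # (4, 4) covariance
        iu = np.triu_indices(cov.shape[0])             # symmetric, so keep
        return cov[iu]                                 # the upper triangle

    # Example: a noisy diagonal trajectory of 15 tracked points.
    rng = np.random.default_rng(0)
    traj = np.cumsum(rng.normal(1.0, 0.1, size=(15, 2)), axis=0)
    print(trajectory_covariance_descriptor(traj).shape)   # (10,)

Because the covariance couples every pair of features, the descriptor records, for instance, how horizontal velocity co-varies with vertical position along the trajectory: exactly the second-order information that a first-order (mean- or histogram-based) encoding discards.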
The second descriptor is based on Granger causality. This novel causality descriptor encodes the "cause and effect" relationships between the motion trajectories of the actions. Such interaction descriptors capture the causal inter-dependencies among the motion trajectories and encode complementary information different from that of descriptors based on the occurrence of features. Causal dependencies are traditionally computed on time-varying signals; we extend them to capture dependencies between spatio-temporal signals and compute generalised causality descriptors that perform better than their traditional counterparts.

An egocentric or first-person video is captured from the perspective of the person-of-interest (POI), who wears a camera and moves around performing his/her activities. The camera records the events and activities as seen by the POI, while the POI performing the actions is not seen by the camera he/she wears. Activities performed by the POI are called first-person actions; third-person actions are those performed by others and observed by the POI. The third part of the thesis explores action recognition in egocentric videos. Differentiating first-person and third-person actions is important when summarising or analysing the behaviour of the POI, so the goal is to recognise both the action and the perspective from which it is observed. Trajectory descriptors are adapted to recognise actions, with the trajectory-scoring segmentation method applied as a pre-processing step to identify and remove the unintended head motion (camera motion) introduced during capture. To recognise actions and the corresponding perspectives in a multi-camera setup, a novel inter-view causality descriptor based on the causal dependencies between trajectories in different views is explored. Since this is a newly addressed problem, two first-person datasets were created with eight actions in third-person and first-person perspectives: the first is a single-camera dataset with action instances from first-person and third-person views, and the second is a multi-camera dataset in which each action instance has multiple first-person and third-person views.

In the final part of the thesis, a feature encoding scheme and a subvolume sampling scheme for recognising actions in videos are proposed. The proposed Hyper-Fisher Vector encoding embeds the Bag-of-Words encoding into the Fisher Vector encoding; the resulting encoding is simple, effective, improves classification performance over state-of-the-art techniques, and can be used in place of the traditional Fisher Vector encoding in other recognition approaches. The proposed subvolume sampling scheme, used to generate second-layer features in "deep" approaches to action recognition in videos, iteratively grows the valid subvolumes in the temporal direction to generate new subvolumes. The proposed sampling requires fewer subvolumes to "better represent" the actions and is thus less computationally intensive than the original sampling scheme. The techniques are evaluated on large-scale, challenging, publicly available datasets, where the Hyper-Fisher Vector combined with the proposed sampling scheme performs better than state-of-the-art techniques for action classification in videos.
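The abstract does not spell out how the Bag-of-Words encoding is embedded into the Fisher Vector. One plausible reading, sketched below purely as an illustration, computes BoW histograms over temporal slices of a video's local descriptors and then Fisher-encodes those histograms under a diagonal-covariance GMM; the slice length, codebook and GMM are assumed ingredients, not details taken from the thesis.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def bow_histograms(descriptors, codebook, slice_len=50):
        """L1-normalised BoW histograms over temporal slices.

        descriptors: (N, D) local descriptors of one video, in temporal
        order; codebook: (K, D) visual words (e.g. from k-means).
        """
        d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        words = d2.argmin(axis=1)                      # nearest codeword ids
        K = len(codebook)
        hists = [np.bincount(words[s:s + slice_len], minlength=K)
                 for s in range(0, len(words), slice_len)]
        hists = np.array(hists, dtype=np.float64)
        return hists / np.maximum(hists.sum(axis=1, keepdims=True), 1)

    def fisher_vector(xs, gmm):
        """First-order Fisher Vector of the rows of xs under a fitted
        diagonal-covariance GaussianMixture."""
        q = gmm.predict_proba(xs)                      # (N, C) posteriors
        parts = []
        for c in range(gmm.n_components):
            diff = (xs - gmm.means_[c]) / np.sqrt(gmm.covariances_[c])
            g = (q[:, c:c + 1] * diff).sum(axis=0)
            parts.append(g / (len(xs) * np.sqrt(gmm.weights_[c])))
        fv = np.concatenate(parts)
        fv = np.sign(fv) * np.sqrt(np.abs(fv))         # power normalisation
        return fv / max(np.linalg.norm(fv), 1e-12)     # L2 normalisation

Under this reading, a video's Hyper-Fisher Vector would be fisher_vector(bow_histograms(descs, codebook), gmm), with the GMM fitted beforehand (covariance_type="diag") on slice-level histograms from the training set.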
2

A deep learning model for scene recognition

Meng, Zhaoxin January 2019 (has links)
Scene recognition is an active research topic in the field of image recognition. It deserves attention because it supports scene understanding and provides important contextual information for object recognition, while traditional approaches to scene recognition still have many shortcomings. In recent years, deep learning methods based on convolutional neural networks (CNNs) have achieved state-of-the-art results in this area. This thesis constructs a model based on multi-layer CNN feature extraction and transfer learning for scene recognition tasks. Because scene images often contain multiple objects, the convolutional layers of the network may hold useful local semantic information that is lost in the fully connected layers. The thesis therefore improves the traditional CNN architecture, adopting an existing enhancement of the convolutional-layer information and extracting it with Fisher Vector encoding. It then introduces transfer learning, drawing on knowledge from two different domains, scenes and objects, and combines the outputs of the two networks to achieve better results. Finally, the method is implemented in Python with PyTorch and applied to two well-known scene datasets, UIUC-Sports and Scene-15. Compared with the traditional AlexNet CNN architecture, the result improves from 81% to 93% on UIUC-Sports and from 79% to 91% on Scene-15, showing that the method performs well on scene recognition tasks.
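The abstract names the ingredients (convolutional-layer features, Fisher Vector encoding, and a scene network combined with an object network) but not the exact wiring. The PyTorch sketch below shows one plausible arrangement: the last conv-layer activations of an ImageNet-trained AlexNet serve as local descriptors for downstream Fisher Vector encoding, and scene_net (a hypothetical Places-style scene CNN assumed to be fine-tuned to the same classes) is fused by averaging softmax outputs.

    import torch
    import torchvision.models as models

    # Object network: AlexNet pre-trained on ImageNet, as in the baseline.
    object_net = models.alexnet(pretrained=True).eval()

    def conv_local_descriptors(net, images):
        """Treat each spatial position of the last conv layer as one
        local descriptor, to be encoded (e.g. with a Fisher Vector)."""
        with torch.no_grad():
            fmap = net.features(images)        # (N, 256, 6, 6) for AlexNet
        n, c, h, w = fmap.shape
        # -> (N, h*w, c): h*w local descriptors of dimension c per image
        return fmap.reshape(n, c, h * w).permute(0, 2, 1)

    def fused_scores(object_net, scene_net, images):
        """Late fusion of the object and scene domains by averaging the
        softmax outputs; the fusion rule is an assumption, since the
        abstract does not specify how the two outputs are combined."""
        with torch.no_grad():
            p_obj = object_net(images).softmax(dim=1)
            p_scene = scene_net(images).softmax(dim=1)
        return (p_obj + p_scene) / 2

    imgs = torch.randn(4, 3, 224, 224)         # dummy batch
    print(conv_local_descriptors(object_net, imgs).shape)  # (4, 36, 256)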
3

Video content analysis for intelligent forensics

Fraz, Muhammad January 2014 (has links)
The networks of surveillance cameras installed in public places and private territories continuously record video data with the aim of detecting and preventing unlawful activities. This heightens the importance of video content analysis applications, whether for real-time (i.e. analytic) or post-event (i.e. forensic) analysis. This thesis focuses on four key aspects of video content analysis:

1. moving object detection and recognition;
2. correction of colours in video frames and recognition of the colours of moving objects;
3. make and model recognition of vehicles and identification of their type;
4. detection and recognition of text information in outdoor scenes.

To address the first issue, the first part of the thesis presents a framework that efficiently detects and recognises moving objects in videos, targeting the problem of object detection in the presence of complex backgrounds. The detection stage of the framework relies on a background modelling technique and a novel post-processing step in which the contours of the foreground regions (i.e. moving objects) are refined by classifying edge segments as belonging either to the background or to the foreground. Further, a novel feature descriptor is devised for classifying moving objects into humans, vehicles and background; it captures the texture information present in the silhouettes of foreground objects.
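The abstract does not name the specific background model used. The sketch below is a minimal stand-in for this detect-then-refine stage, using OpenCV's MOG2 background subtractor; the input filename and the area threshold are illustrative assumptions, and the thesis's own edge-segment refinement is only indicated in a comment.

    import cv2

    # A standard background-modelling choice; the thesis does not name
    # its exact model in the abstract.
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                    detectShadows=True)

    cap = cv2.VideoCapture("surveillance.mp4")   # hypothetical input file
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)           # 255 = foreground, 127 = shadow
        # drop shadow pixels and binarise
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
        # remove small speckle noise before contour extraction
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        movers = [c for c in contours if cv2.contourArea(c) > 500]
        # `movers` are candidate moving objects; the thesis additionally
        # refines these contours by classifying edge segments as background
        # or foreground before recognising the object class.
    cap.release()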
To address the second issue, a framework for the correction and recognition of the true colours of objects in videos is presented, with novel noise reduction, colour enhancement and colour recognition stages. The colour recognition stage uses temporal information to reliably recognise the true colours of moving objects across multiple frames. The framework is specifically designed to perform robustly on videos whose quality is poor because of surrounding illumination, camera sensor imperfections and artefacts due to heavy compression.

In the third part of the thesis, a framework for vehicle make and model recognition and type identification is presented. As part of this work, a novel feature representation technique for distinctively representing vehicle images was developed: it uses dense feature description and a mid-level feature encoding scheme to capture the texture in the frontal view of vehicles, and it is insensitive to minor in-plane rotation and skew within the image. The framework can be extended to any number of vehicle classes without re-training. Another important contribution of this work is the publication of a comprehensive, up-to-date dataset of vehicle images to support future research in this domain.

The problem of text detection and recognition in images is addressed in the last part of the thesis. A novel technique is proposed that exploits colour information in the image to identify text regions; the colour information is also used to segment characters from words. The identified characters are recognised using shape features and supervised learning, and a lexicon-based alignment procedure finalises the recognition of the strings present in word images.

Extensive experiments have been conducted on benchmark datasets to analyse the performance of the proposed algorithms. The results show that the proposed moving object detection and recognition technique surpassed well-known baseline techniques, and the framework for the correction and recognition of object colours in video frames achieved all the aforementioned goals. Performance analysis of the vehicle make and model recognition framework on multiple datasets has shown the strength and reliability of the technique across various scenarios. Finally, the experimental results of the text detection and recognition framework on benchmark datasets reveal the potential of the proposed scheme for accurate detection and recognition of text in the wild.
