One of the major research topics in computer vision is automatic video scene understanding, where the ultimate goal is to build artificial intelligence systems comparable to humans in understanding video content. Automatic video scene understanding covers many applications, including (i) semantic functional complex scene categorization, (ii) human body-pose estimation in videos, (iii) human fine-grained daily living action recognition, and (iv) video retrieval and genre recognition. In this thesis, we introduce computer vision and pattern analysis techniques that outperform the state of the art for the above-mentioned applications on several publicly available datasets. Our major research contributions towards automatic video scene understanding are (i) an efficient approach to combining low-level and high-level information content of videos, (ii) modeling the temporal variation of frame-based descriptors in videos, and (iii) a multi-task learning framework that leverages the large amount of available unlabeled video. The first category covers a method for enriching visual words that contain local motion information but lack information about the cause of the motion. Our proposed approach embeds the source of the observed motion in video descriptors and hence injects semantic information into the visual words employed in the pattern analysis task. The approach is validated on traffic scene analysis as well as human body-pose estimation. When an already-trained, off-the-shelf model is applied to an unseen dataset, its accuracy usually drops significantly. We present an approach that uses low-level cues, such as the optical flow in the foreground of a video, to make an already-trained, off-the-shelf pictorial deformable model work well for body-pose estimation on an unseen dataset. The second category covers methods that inject temporal variation information into video descriptors. Many video descriptors are based on global representations, in which frame-based descriptors are combined into a unified video descriptor without preserving much of the temporal information. To include temporal information in video descriptors, we introduce the Hard and Soft Cluster Encoding, a descriptor that captures how similar frames are distributed over the timespan of a video. We show that this approach yields significant improvements on the human fine-grained daily living action recognition task. The third category introduces a novel Multi-Task Clustering (MTC) approach that leverages the information in unlabeled videos. The proposed method is applied to human fine-grained daily living action recognition. People tend to perform similar activities in similar environments; therefore, a suitable clustering approach can discover patterns of fine-grained activities during learning. Rather than clustering the data of each individual separately, our MTC approach captures more generic patterns across users in the training data and hence achieves remarkable recognition rates. Finally, we discuss opportunities for future applications of our research and conclude with a summary of our contributions to video understanding.
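The Hard and Soft Cluster Encoding is described only at a high level in this abstract. The sketch below is a minimal illustration of that general idea, assuming k-means clustering of per-frame descriptors, a fixed number of temporal bins, and a softmax-of-negative-distance soft assignment; the function name and all parameters are illustrative assumptions rather than the thesis' actual formulation.

```python
# Illustrative sketch (assumed formulation): encode how frames assigned to each
# cluster are distributed over the video timespan, with both hard and soft
# cluster assignments.
import numpy as np
from sklearn.cluster import KMeans

def hard_soft_cluster_encoding(frame_descriptors, n_clusters=8, n_time_bins=4):
    """frame_descriptors: (n_frames, dim) array of per-frame features."""
    n_frames = frame_descriptors.shape[0]
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(frame_descriptors)

    # Temporal bin index of every frame (which part of the video it falls in).
    bins = np.minimum((np.arange(n_frames) * n_time_bins) // n_frames,
                      n_time_bins - 1)

    # Hard encoding: per temporal bin, a histogram of hard cluster assignments.
    hard = np.zeros((n_time_bins, n_clusters))
    for t, c in zip(bins, kmeans.labels_):
        hard[t, c] += 1

    # Soft encoding: per temporal bin, accumulate soft assignment weights
    # derived from distances to the cluster centres.
    dists = kmeans.transform(frame_descriptors)   # (n_frames, n_clusters)
    weights = np.exp(-dists)
    weights /= weights.sum(axis=1, keepdims=True)
    soft = np.zeros((n_time_bins, n_clusters))
    for t, w in zip(bins, weights):
        soft[t] += w

    # L1-normalise each temporal bin and concatenate into one video descriptor.
    hard /= np.maximum(hard.sum(axis=1, keepdims=True), 1e-8)
    soft /= np.maximum(soft.sum(axis=1, keepdims=True), 1e-8)
    return np.concatenate([hard.ravel(), soft.ravel()])
```

In this reading, the hard part records which temporal segments of the video each visual pattern occurs in, while the soft part smooths that evidence across nearby clusters; the two are concatenated into a single fixed-length video descriptor.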
Identifier | oai:union.ndltd.org:unitn.it/oai:iris.unitn.it:11572/368321 |
Date | January 2017 |
Creators | Rostamzadeh, Negar |
Contributors | Rostamzadeh, Negar |
Publisher | Università degli studi di Trento, place:TRENTO |
Source Sets | Università di Trento |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/doctoralThesis |
Rights | info:eu-repo/semantics/openAccess |
Relation | firstpage:1, lastpage:125, numberofpages:125 |