Towards Label Efficiency and Privacy Preservation in Video Understanding
Dave, Ishan Rajendrakumar, 01 January 2024
Video understanding involves tasks like action recognition, video retrieval, and human pose propagation, which are essential for applications such as surveillance, surgical videos, sports analysis, and content recommendation. Progress in this domain has been largely driven by advancements in deep learning, facilitated by large-scale labeled datasets. However, video annotation presents significant challenges due to its time-consuming and expensive nature. This limitation underscores the importance of developing methods that can learn effectively from unlabeled or limited-labeled data, which makes self-supervised learning (SSL) and semi-supervised learning particularly relevant for video understanding. Another significant challenge in video understanding is privacy preservation, as methods often inadvertently leak private information, a growing concern in the field. In this dissertation, we present methods to improve the label efficiency of deep video models through self-supervised and semi-supervised learning, and a self-supervised method designed to mitigate privacy leakage in the action recognition task.

Our first contribution is the Temporal Contrastive Learning framework for Video Representation (TCLR). Unlike prior contrastive self-supervised learning methods, which aim to learn temporal similarity between different clips of the same video, TCLR encourages learning the differences, rather than the similarities, between clips of the same video. TCLR consists of two novel losses that improve upon existing contrastive self-supervised video representations by contrasting temporal segments of the same video at two different temporal aggregation steps: the clip level and the temporal pooling level. Although TCLR offers an effective solution for video-level downstream tasks, it does not encourage a framewise video representation suited to low-level temporal correspondence-based downstream tasks.
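The clip-level contrast described above can be illustrated with a minimal sketch. This is not the dissertation's implementation: the InfoNCE form, array shapes, and temperature value are assumptions chosen only to show how clips of the same video can serve as each other's negatives.

```python
import numpy as np

def temporal_contrastive_loss(clips_a, clips_b, temperature=0.1):
    """Sketch of a clip-level temporal contrast (illustrative only).

    clips_a, clips_b: (V, T, D) arrays holding two augmented views of
    T clip embeddings from each of V videos. For each anchor clip, the
    positive is the same clip's other view, while the OTHER clips of
    the same video act as negatives, pushing clips of one video apart
    and encouraging temporal distinctiveness.
    """
    a = clips_a / np.linalg.norm(clips_a, axis=-1, keepdims=True)
    b = clips_b / np.linalg.norm(clips_b, axis=-1, keepdims=True)
    losses = []
    for v in range(a.shape[0]):
        sim = a[v] @ b[v].T / temperature           # (T, T) similarities
        sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
        log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
        losses.append(-np.mean(np.diag(log_prob)))  # positives on diagonal
    return float(np.mean(losses))
```

With temporally distinctive (near-orthogonal) clip embeddings this loss approaches zero, while identical clips incur the maximal loss of log(T), which is the behavior the TCLR losses reward.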
To promote a more effective framewise video representation, we first eliminate learning shortcuts present in existing temporal pretext tasks by introducing framewise spatial jittering and proposing more challenging frame-level temporal pretext tasks. Our approach, "No More Shortcuts" (NMS), achieves state-of-the-art performance across a wide range of downstream tasks, encompassing both high-level semantic and low-level temporal correspondence tasks.

While the self-supervised approaches TCLR and NMS learn only from unlabeled videos, in practice some labeled data often exists. Our next focus is therefore semi-supervised action recognition, where a small set of labeled videos is accompanied by a large pool of unlabeled videos. Building on observations from our self-supervised representations, we leverage the unlabeled videos through the complementary strengths of temporally-invariant and temporally-distinctive contrastive self-supervised video representations. Our proposed semi-supervised method, "TimeBalance", introduces a student-teacher framework that dynamically combines the knowledge of two self-supervised teachers based on the nature of each unlabeled video, using the proposed reweighting strategy. Although TimeBalance performs well on coarse-grained actions, it struggles with fine-grained actions. To address this, we propose the "FinePseudo" framework, which leverages temporal alignability to learn phase-aware distances. It also introduces collaborative pseudo-labeling between the video-level and alignability encoders, refining the pseudo-labeling process for fine-grained actions.

Although the above-mentioned video representations are useful for various downstream applications, they often leak a considerable amount of the private information present in the videos. To mitigate such privacy leaks, we propose SPAct, a self-supervised framework that removes private information from input videos without requiring privacy labels.
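The per-video teacher combination that TimeBalance describes can be sketched as follows. The weighting signal, threshold, and function names here are illustrative assumptions, not the dissertation's exact reweighting strategy.

```python
import numpy as np

def blend_teacher_pseudo_labels(p_invariant, p_distinctive, distinctiveness,
                                confidence_threshold=0.8):
    """Blend class probabilities from two self-supervised teachers (sketch).

    p_invariant, p_distinctive: (N, C) softmax outputs from the
    temporally-invariant and temporally-distinctive teachers.
    distinctiveness: (N,) score in [0, 1], a hypothetical stand-in for
    the reweighting signal: videos whose content varies over time lean
    on the distinctive teacher, near-static videos on the invariant one.
    Returns hard pseudo-labels and a confidence mask for the student.
    """
    w = np.clip(distinctiveness, 0.0, 1.0)[:, None]        # (N, 1)
    blended = (1.0 - w) * p_invariant + w * p_distinctive  # (N, C)
    pseudo_labels = blended.argmax(axis=1)
    keep = blended.max(axis=1) >= confidence_threshold     # confident only
    return pseudo_labels, keep
```

The student would then be trained on the kept pseudo-labels together with the small labeled set, in the usual semi-supervised fashion.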
SPAct exhibits competitive performance compared to supervised methods and introduces new evaluation protocols to assess the generalization capability of the anonymization across novel action and privacy attributes. Overall, this dissertation contributes to the advancement of label-efficient and privacy-preserving video understanding by exploring novel self-supervised and semi-supervised learning approaches and their applications in privacy-preserving action recognition.
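The label-free anonymization idea can be illustrated with a toy minimax objective. This is a sketch under assumptions; the actual SPAct losses, branches, and weighting differ in detail.

```python
def anonymizer_objective(action_loss, privacy_ssl_loss, privacy_weight=1.0):
    """Toy objective for an SPAct-style anonymization function (sketch).

    The anonymizer is updated to keep the downstream action branch
    accurate (minimize action_loss) while making a self-supervised
    privacy branch fail (maximize privacy_ssl_loss), so no privacy
    labels are needed. The privacy branch itself is trained
    adversarially in an alternating step (not shown). `privacy_weight`
    is an assumed trade-off hyperparameter.
    """
    return action_loss - privacy_weight * privacy_ssl_loss
```

Minimizing this quantity drives the anonymized video toward retaining action-relevant cues while destroying the self-supervised signal a privacy attacker could exploit.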