Global ETD Search

1	A Multimodal Framework for Automated Content Moderation of Children's Videos
Ahmed, Syed Hammad, 01 January 2024
Online video platforms receive hundreds of hours of uploads every minute, making manual moderation of inappropriate content impossible. The most vulnerable consumers of malicious video content are children aged 1 to 5, whose attention is easily captured by bursts of color and sound. Prominent video hosting platforms like YouTube have taken measures to mitigate malicious content, but these videos often go undetected by current automated content moderation tools, which focus on removing explicit or copyrighted content. Scammers attempting to monetize their content may craft malicious children's videos that are superficially similar to educational videos but include scary and disgusting characters, violent motions, loud music, and disturbing noises. Robust classification of malicious videos requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. Additionally, there is a dearth of comprehensive datasets for content moderation tasks that include these audio-visual feature annotations. This dissertation addresses these challenges and makes several contributions to the problem of content moderation for children's videos. The first contribution is identifying a set of malicious features that are harmful to preschool children but remain unaddressed, and publishing MOB (Malicious or Benign), a labeled dataset of cartoon video clips that include these features. We provide a user-friendly web-based video annotation tool that can easily be customized and used for video classification tasks with any number of ground truth classes. The second contribution is adapting state-of-the-art Vision-Language models to perform content moderation on the MOB benchmark.
We perform prompt engineering and an in-depth analysis of how context-specific language prompts affect the content moderation performance of different CLIP (Contrastive Language-Image Pre-training) variants. This dissertation introduces new benchmark natural language prompt templates for cartoon videos that can be used with Vision-Language models. Finally, we introduce a multimodal framework that includes the audio modality for more robust content moderation of children's cartoon videos and extend our dataset to include audio labels. We present ablations that demonstrate the performance gain from adding audio. The audio modality and prompt learning are incorporated while keeping the backbone modules of each modality frozen. Experiments were conducted on a multimodal version of the MOB (Malicious or Benign) dataset in both supervised and few-shot settings.
Keywords: Automated content moderation; Vision-Language models; CLIP; Prompt engineering
2	Efficient Localization of Human Actions and Moments in Videos
Escorcia, Victor, 07 1900
We are stumbling across a video tsunami flooding our communication channels. The ubiquity of digital cameras and social networks has increased the amount of visual media content generated and shared by people, in particular videos. Cisco reports that 82% of internet traffic would be in the form of video by 2022. The computer vision community has embraced this challenge by offering the first building blocks to translate the visual data in segmented video clips into semantic tags. However, users usually need to go beyond tagging at the video level. For example, someone may want to retrieve important moments such as the "first steps of her child" from a large collection of untrimmed videos, or retrieve all the instances of a home run from an unsegmented video of a baseball game. In the face of this data deluge, it becomes crucial to develop efficient and scalable algorithms that can intelligently localize semantic visual content in untrimmed videos. In this work, I address three different challenges in the localization of actions in videos. First, I develop deep learning-based action proposal and detection models that take a video and generate action-agnostic and class-specific temporal segments, respectively. These models retrieve temporal locations with high accuracy and run faster than real time. Second, I propose the new task of retrieving and localizing temporal moments from a collection of videos given a natural language query. To tackle this challenge, I introduce an efficient and effective model that aligns the text query to individual clips of fixed length while still retrieving moments that span multiple clips. This approach not only allows smooth interaction with users via natural language queries but also reduces the index size and search time for retrieving the moments.
Lastly, I introduce the concept of actor-supervision, which exploits the inherent compositionality of actions, in terms of transformations of actors, to achieve spatiotemporal localization of actions without the need for action box annotations. By designing efficient models that scan a single video in real time, retrieve and localize moments of interest from multiple videos, and localize actions without resorting to action box annotations, this thesis provides insights that bring us closer to the goal of general video understanding.
Keywords: action localization; video understanding; human activities understanding; computer vision; vision-language models; deep learning
