1 |
A Multimodal Framework for Automated Content Moderation of Children's Videos
Ahmed, Syed Hammad 01 January 2024
Online video platforms receive hundreds of hours of uploads every minute, making manual moderation of inappropriate content impossible. The most vulnerable consumers of malicious video content are children aged 1 to 5, whose attention is easily captured by bursts of color and sound. Prominent video hosting platforms like YouTube have taken measures to mitigate malicious content, but these videos often go undetected by current automated content moderation tools that are focused on removing explicit or copyrighted content. Scammers attempting to monetize their content may craft malicious children's videos that are superficially similar to educational videos, but include scary and disgusting characters, violent motions, loud music, and disturbing noises. A robust classification of malicious videos requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. Additionally, there is a dearth of comprehensive datasets for content moderation tasks which include these audio-visual feature annotations. This dissertation addresses these challenges and makes several contributions to the problem of content moderation for children’s videos. The first contribution is identifying a set of malicious features that are harmful to preschool children but remain unaddressed, and publishing MOB (Malicious or Benign), a labeled dataset of cartoon video clips that include these features. We provide a user-friendly web-based video annotation tool which can easily be customized and used for video classification tasks with any number of ground truth classes. The second contribution is adapting state-of-the-art Vision-Language models to perform content moderation on the MOB benchmark.
We perform prompt engineering and an in-depth analysis of how context-specific language prompts affect the content moderation performance of different CLIP (Contrastive Language-Image Pre-training) variants. This dissertation introduces new benchmark natural language prompt templates for cartoon videos that can be used with Vision-Language models. Finally, we introduce a multimodal framework that includes the audio modality for more robust content moderation of children's cartoon videos and extend our dataset to include audio labels. We present ablations to demonstrate the enhanced performance of adding audio. The audio modality and prompt learning are incorporated while keeping the backbone modules of each modality frozen. Experiments were conducted on a multimodal version of the MOB (Malicious or Benign) dataset in both supervised and few-shot settings.
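The prompt-based classification described in this abstract can be pictured with a minimal sketch. It is not the dissertation's implementation: the template wording, the toy three-dimensional embeddings, the function names, and the temperature value are all assumptions made for illustration. In practice the embeddings would come from the image and text encoders of a CLIP variant; here they are hard-coded so the scoring step can be read on its own.

```python
import math

# Hypothetical prompt template and labels; the dissertation's actual
# templates are not given in this abstract.
TEMPLATE = "a cartoon frame showing {} content"
LABELS = ["benign", "malicious"]


def build_prompts(template, labels):
    """Fill the template once per class label."""
    return [template.format(label) for label in labels]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classify(frame_embedding, prompt_embeddings, labels, temperature=100.0):
    """Pick the label whose prompt embedding is most similar to the frame.

    The temperature is an assumed scaling factor for the similarities.
    """
    sims = [cosine(frame_embedding, p) for p in prompt_embeddings]
    probs = softmax([temperature * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy embeddings standing in for encoder outputs (assumed values).
frame = [0.9, 0.1, 0.0]
prompt_vectors = [[0.0, 1.0, 0.0],   # stands in for the "benign" prompt
                  [1.0, 0.0, 0.0]]   # stands in for the "malicious" prompt

prompts = build_prompts(TEMPLATE, LABELS)
label, probs = classify(frame, prompt_vectors, LABELS)
```

Changing only the template string changes the text embeddings and therefore the scores, which is the kind of effect that an analysis of context-specific prompts would measure.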
|
2 |
Efficient Localization of Human Actions and Moments in Videos
Escorcia, Victor 07 1900
We are stumbling across a video tsunami flooding our communication channels.
The ubiquity of digital cameras and social networks has increased the amount of visual
media content generated and shared by people, in particular videos. Cisco reports
that 82% of the internet traffic would be in the form of videos by 2022. The computer
vision community has embraced this challenge by offering the first building blocks to
translate the visual data in segmented video clips into semantic tags. However, users
usually need to go beyond tagging at the video level. For example, someone may
want to retrieve important moments such as the “first steps of her child” from a large
collection of untrimmed videos, or to retrieve all the instances of a home run from an
unsegmented video of baseball. In the face of this data deluge, it becomes crucial
to develop efficient and scalable algorithms that can intelligently localize semantic
visual content in untrimmed videos.
In this work, I address three different challenges on the localization of actions in
videos. First, I develop deep learning-based action proposal and detection models that take a
video and generate action-agnostic and class-specific temporal segments, respectively.
These models retrieve temporal locations with high accuracy in an efficient manner,
faster than real-time. Second, I propose the new task of retrieving and localizing temporal
moments from a collection of videos given a natural language query. To tackle this
challenge, I introduce an efficient and effective model that aligns the text query to
individual clips of fixed length while still retrieving moments spanning multiple clips.
This approach not only allows smooth interactions with users via natural language queries
but also reduces the index size and search time for retrieving the moments.
Lastly, I introduce the concept of actor-supervision that exploits the inherent
compositionality of actions, in terms of transformations of actors, to achieve spatiotemporal
localization of actions without the need for action box annotations. By designing
efficient models that scan a single video in real time and that retrieve and localize moments of
interest from multiple videos, together with an effective strategy to localize actions without
resorting to action box annotations, this thesis provides insights that put us closer to
the goal of general video understanding.
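One way to picture how alignment against fixed-length clips can still return a moment spanning several clips is sketched below. This is not the model from the thesis. It assumes that a query-to-clip similarity score is already available for each clip, and it picks the run of consecutive clips whose scores most exceed a threshold (a maximum-sum contiguous run). The function name, the threshold, and the example scores are hypothetical.

```python
def best_moment(clip_scores, clip_length, threshold=0.5):
    """Return (start_sec, end_sec, score) for the best run of consecutive clips.

    clip_scores: one query-to-clip similarity per fixed-length clip (assumed given).
    clip_length: duration of each clip in seconds.
    threshold:   assumed margin; clips scoring below it count against a run.
    """
    best_start, best_end, best_score = 0, 0, float("-inf")
    run_start, run_sum = 0, 0.0
    for i, score in enumerate(clip_scores):
        # Start a new run when the current one no longer helps.
        if run_sum <= 0:
            run_start, run_sum = i, 0.0
        run_sum += score - threshold
        if run_sum > best_score:
            best_start, best_end, best_score = run_start, i + 1, run_sum
    return best_start * clip_length, best_end * clip_length, best_score


# Hypothetical similarities of one text query to six 5-second clips.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1]
start, end, score = best_moment(scores, clip_length=5.0)
# The three high-scoring clips (indices 2 to 4) are merged into one moment.
```

Because only one score per fixed-length clip has to be stored and compared, an index built this way grows with the number of clips rather than with the number of candidate moments, which is consistent with the reduced index size the abstract mentions.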
|