1. A Multimodal Framework for Automated Content Moderation of Children's Videos
Ahmed, Syed Hammad, 01 January 2024
Online video platforms receive hundreds of hours of uploads every minute, making manual moderation of inappropriate content impossible. The most vulnerable consumers of malicious video content are children aged 1 to 5, whose attention is easily captured by bursts of color and sound. Prominent video hosting platforms like YouTube have taken measures to mitigate malicious content, but these videos often go undetected by current automated content moderation tools, which are focused on removing explicit or copyrighted content. Scammers attempting to monetize their content may craft malicious children's videos that are superficially similar to educational videos but include scary and disgusting characters, violent motions, loud music, and disturbing noises. A robust classification of malicious videos requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. Additionally, there is a dearth of comprehensive datasets for content moderation tasks that include these audio-visual feature annotations. This dissertation addresses these challenges and makes several contributions to the problem of content moderation for children's videos. The first contribution is identifying a set of malicious features that are harmful to preschool children but remain unaddressed, and publishing a labeled dataset, Malicious or Benign (MOB), of cartoon video clips that include these features. We provide a user-friendly web-based video annotation tool that can easily be customized and used for video classification tasks with any number of ground-truth classes. The second contribution is adapting state-of-the-art Vision-Language models to apply content moderation techniques on the MOB benchmark. We perform prompt engineering and an in-depth analysis of how context-specific language prompts affect the content moderation performance of different CLIP (Contrastive Language-Image Pre-training) variants. This dissertation introduces new benchmark natural language prompt templates for cartoon videos that can be used with Vision-Language models. Finally, we introduce a multimodal framework that includes the audio modality for more robust content moderation of children's cartoon videos and extend our dataset to include audio labels. We present ablations to demonstrate the performance gains from adding audio. The audio modality and prompt learning are incorporated while keeping the backbone modules of each modality frozen. Experiments were conducted on a multimodal version of the MOB dataset in both supervised and few-shot settings.
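To make the prompt-based moderation idea concrete, the following minimal sketch scores one sampled cartoon frame against two natural-language prompts with an off-the-shelf CLIP checkpoint. The model name, prompt wording, and frame path are illustrative assumptions, not the benchmark prompt templates or the multimodal framework introduced in this dissertation.

```python
# Sketch: zero-shot frame scoring with CLIP and context-specific prompts.
# The prompts, checkpoint, and file path below are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a benign cartoon scene that is safe for preschool children",
    "a disturbing cartoon scene with scary characters or violent motion",
]

frame = Image.open("cartoon_frame.png")  # one sampled video frame (hypothetical path)
inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)

with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, num_prompts)
probs = logits_per_image.softmax(dim=-1).squeeze(0).tolist()

label = "benign" if probs[0] > probs[1] else "malicious"
print(f"benign={probs[0]:.3f} malicious={probs[1]:.3f} -> {label}")
```

In a full pipeline, per-frame scores like these would be aggregated over a clip and fused with audio cues before a final decision is made.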
2. Efficient Localization of Human Actions and Moments in Videos
Escorcia, Victor
We are stumbling across a video tsunami flooding our communication channels.
The ubiquity of digital cameras and social networks has increased the amount of visual
media content generated and shared by people, in particular videos. Cisco reported
that 82% of internet traffic would be in the form of video by 2022. The computer
vision community has embraced this challenge by offering the first building blocks to
translate the visual data in segmented video clips into semantic tags. However, users
usually need to go beyond tagging at the video level. For example, someone may
want to retrieve important moments such as the “first steps of her child” from a large
collection of untrimmed videos, or to retrieve all the instances of a home run from an
unsegmented video of baseball. In the face of this data deluge, it becomes crucial
to develop efficient and scalable algorithms that can intelligently localize semantic
visual content in untrimmed videos.
In this work, I address three different challenges in the localization of actions in videos. First, I develop deep learning-based action proposal and detection models that take a video and generate action-agnostic and class-specific temporal segments, respectively. These models retrieve temporal locations with high accuracy in an efficient manner, faster than real time. Second, I propose the new task of retrieving and localizing temporal moments from a collection of videos given a natural language query. To tackle this challenge, I introduce an efficient and effective model that aligns the text query to individual clips of fixed length while still retrieving moments spanning multiple clips. This approach not only allows smooth interaction with users via natural language queries but also reduces the index size and search time for retrieving the moments. Lastly, I introduce the concept of actor supervision, which exploits the inherent compositionality of actions, in terms of transformations of actors, to achieve spatiotemporal localization of actions without the need for action box annotations. By designing efficient models that scan a single video in real time and retrieve and localize moments of interest from multiple videos, and an effective strategy that localizes actions without resorting to action box annotations, this thesis provides insights that bring us closer to the goal of general video understanding.
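As a rough picture of the clip-aligned moment retrieval described above, the sketch below compares a query embedding against fixed-length clip embeddings and merges contiguous high-scoring clips into a candidate moment. The random embeddings, clip length, and threshold are stand-in assumptions rather than the model proposed in this thesis.

```python
# Sketch: retrieving a moment that spans multiple fixed-length clips by
# aligning a text-query embedding against precomputed clip embeddings.
# Embeddings are random stand-ins; a real system would use trained
# video and text encoders.
import numpy as np

rng = np.random.default_rng(0)
clip_len_sec = 5.0
clip_embs = rng.normal(size=(120, 256))   # 120 clips of one untrimmed video
query_emb = rng.normal(size=256)          # embedding of a natural language query

# Cosine similarity between the query and every clip.
clip_embs /= np.linalg.norm(clip_embs, axis=1, keepdims=True)
query_emb /= np.linalg.norm(query_emb)
scores = clip_embs @ query_emb

# Merge contiguous clips whose score clears a threshold into candidate moments.
threshold = scores.mean() + scores.std()
candidates, start = [], None
for i, above in enumerate(scores > threshold):
    if above and start is None:
        start = i
    elif not above and start is not None:
        candidates.append((start, i - 1, scores[start:i].mean()))
        start = None
if start is not None:
    candidates.append((start, len(scores) - 1, scores[start:].mean()))

# Report the best-scoring moment in seconds.
if candidates:
    s, e, sc = max(candidates, key=lambda c: c[2])
    print(f"moment: {s * clip_len_sec:.0f}s - {(e + 1) * clip_len_sec:.0f}s (score {sc:.3f})")
```

Because only one embedding per fixed-length clip is indexed, the search cost grows with the number of clips rather than with every possible moment boundary, which is the efficiency argument made in the abstract.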
3. Learning without Expert Labels for Multimodal Data
Maruf, Md Abdullah Al, 09 January 2025
While advancements in deep learning have been largely possible due to the availability of large-scale labeled datasets, obtaining labeled datasets at the required granularity is challenging in many real-world applications, especially in scientific domains, due to the costly and labor-intensive nature of generating annotations. Hence, there is a need to develop new paradigms for learning that do not rely on expert-labeled data and can work even with indirect supervision. Approaches for learning with indirect supervision include unsupervised learning, self-supervised learning, weakly supervised learning, few-shot learning, and knowledge distillation. This thesis addresses these opportunities in the context of multimodal data through three main contributions. First, this thesis proposes a novel Distance-aware Negative Sampling method for self-supervised Graph Representation Learning (GRL) that learns node representations directly from the graph structure by maximizing separation between distant nodes and maximizing cohesion among nearby nodes. Second, this thesis introduces effective modifications to weakly supervised semantic segmentation (WS3) models, such as applying stochastic aggregation to saliency maps, that improve the learning of pseudo-ground truths from class-level coarse-grained labels and address the limitations of class activation maps. Finally, this thesis evaluates whether pre-trained Vision-Language Models (VLMs) contain the necessary scientific knowledge to identify and reason about biological traits from scientific images. The zero-shot performance of 12 large VLMs is evaluated on a novel VLM4Bio dataset, and the effects of prompting and reasoning hallucinations are explored.

/ Doctor of Philosophy / While advancements in machine learning (ML), such as deep learning, have been largely possible due to the availability of large-scale labeled datasets, obtaining high-quality and high-resolution labels is challenging in many real-world applications due to the costly and labor-intensive nature of generating annotations. This thesis explores new ways of training ML models without relying heavily on expert-labeled data, using indirect supervision. First, it introduces a novel way of using the structure of graphs for learning representations of graph-based data. Second, it analyzes the effect of weak supervision using coarse labels for image-based data. Third, it evaluates whether current ML models can recognize and reason about scientific images on their own, aiming to make learning more efficient and less dependent on exhaustive labeling.
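As a loose illustration of the distance-aware negative sampling idea, the sketch below draws negatives for an anchor node with probability proportional to hop distance, so distant nodes are chosen as negatives more often than nearby ones. The graph, weighting scheme, and sample size are illustrative assumptions and not the actual formulation used in the dissertation.

```python
# Sketch: distance-aware negative sampling on a graph, where nodes farther
# from the anchor are more likely to be drawn as negatives.
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
rng = np.random.default_rng(0)

def sample_negatives(graph, anchor, k=5, unreachable_dist=10):
    # Hop distance from the anchor to every reachable node.
    dists = nx.single_source_shortest_path_length(graph, anchor)
    nodes = [n for n in graph.nodes if n != anchor]
    # Weight each candidate by its hop distance, so distant nodes dominate.
    weights = np.array([dists.get(n, unreachable_dist) for n in nodes], dtype=float)
    probs = weights / weights.sum()
    return list(rng.choice(nodes, size=k, replace=False, p=probs))

print(sample_negatives(G, anchor=0))
```

A contrastive loss would then pull the anchor's representation toward nearby (positive) nodes and push it away from the sampled distant negatives.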