1. A Multimodal Framework for Automated Content Moderation of Children's Videos. Ahmed, Syed Hammad, 01 January 2024.
Online video platforms receive hundreds of hours of uploads every minute, making manual moderation of inappropriate content impossible. The most vulnerable consumers of malicious video content are children aged 1-5, whose attention is easily captured by bursts of color and sound. Prominent video hosting platforms like YouTube have taken measures to mitigate malicious content, but these videos often go undetected by current automated content moderation tools, which are focused on removing explicit or copyrighted content. Scammers attempting to monetize their content may craft malicious children's videos that are superficially similar to educational videos but include scary and disgusting characters, violent motions, loud music, and disturbing noises. A robust classification of malicious videos requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. Additionally, there is a dearth of comprehensive datasets for content moderation tasks that include these audio-visual feature annotations. This dissertation addresses these challenges and makes several contributions to the problem of content moderation for children's videos. The first contribution is identifying a set of malicious features that are harmful to preschool children but remain unaddressed, and publishing a labeled dataset, Malicious or Benign (MOB), of cartoon video clips that include these features. We provide a user-friendly web-based video annotation tool which can easily be customized and used for video classification tasks with any number of ground truth classes. The second contribution is adapting state-of-the-art Vision-Language models to apply content moderation techniques on the MOB benchmark. We perform prompt engineering and an in-depth analysis of how context-specific language prompts affect the content moderation performance of different CLIP (Contrastive Language-Image Pre-training) variants. This dissertation introduces new benchmark natural language prompt templates for cartoon videos that can be used with Vision-Language models. Finally, we introduce a multimodal framework that includes the audio modality for more robust content moderation of children's cartoon videos and extend our dataset to include audio labels. We present ablations to demonstrate the performance gained by adding audio. The audio modality and prompt learning are incorporated while keeping the backbone modules of each modality frozen. Experiments were conducted on a multimodal version of the MOB dataset in both supervised and few-shot settings.
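As an illustration of the kind of prompt-based zero-shot classification studied here, the sketch below scores a single cartoon frame against two context-specific text prompts with an off-the-shelf CLIP model from the `transformers` library. The prompt wordings, the checkpoint, and the file name `frame.jpg` are illustrative placeholders, not the benchmark templates introduced in the dissertation.

```python
# Minimal zero-shot CLIP sketch: score one cartoon frame against two
# illustrative prompts (placeholders, not the dissertation's templates).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a benign cartoon scene suitable for preschool children",        # hypothetical wording
    "a disturbing cartoon scene with scary characters or violence",  # hypothetical wording
]
frame = Image.open("frame.jpg")  # one frame sampled from a video clip

inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_prompts)
probs = logits.softmax(dim=-1).squeeze(0)
label = "benign" if probs[0] > probs[1] else "malicious"
print(label, probs.tolist())
```

A clip-level decision could then aggregate such frame-level scores, for example by averaging over frames sampled from the clip.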
2. Efficient Localization of Human Actions and Moments in Videos. Escorcia, Victor, 07 1900.
We are stumbling across a video tsunami flooding our communication channels. The ubiquity of digital cameras and social networks has increased the amount of visual media content generated and shared by people, in particular videos. Cisco reported that 82% of internet traffic would be in the form of video by 2022. The computer vision community has embraced this challenge by offering the first building blocks to translate the visual data in segmented video clips into semantic tags. However, users usually need to go beyond tagging at the video level. For example, someone may want to retrieve important moments such as the "first steps of her child" from a large collection of untrimmed videos, or retrieve all the instances of a home run from an unsegmented video of baseball. In the face of this data deluge, it becomes crucial to develop efficient and scalable algorithms that can intelligently localize semantic visual content in untrimmed videos.

In this work, I address three different challenges in the localization of actions in videos. First, I develop deep-learning-based action proposal and detection models that take a video and generate action-agnostic and class-specific temporal segments, respectively. These models retrieve temporal locations with high accuracy in an efficient manner, faster than real time. Second, I propose the new task of retrieving and localizing temporal moments from a collection of videos given a natural language query. To tackle this challenge, I introduce an efficient and effective model that aligns the text query to individual clips of fixed length while still retrieving moments spanning multiple clips. This approach not only allows smooth interactions with users via natural language queries but also reduces the index size and search time for retrieving the moments. Lastly, I introduce the concept of actor supervision, which exploits the inherent compositionality of actions, in terms of transformations of actors, to achieve spatiotemporal localization of actions without the need for action box annotations. By designing efficient models that scan a single video faster than real time, retrieve and localize moments of interest from multiple videos, and localize actions without resorting to action box annotations, this thesis provides insights that put us closer to the goal of general video understanding.
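The clip-based alignment described above can be pictured with a small sketch: given a precomputed text-query embedding and fixed-length clip embeddings for one video, score each clip and merge runs of high-scoring consecutive clips into candidate moments. The cosine-similarity scoring, the threshold, and the embedding shapes are assumptions for illustration only, not the thesis's actual architecture.

```python
import torch
import torch.nn.functional as F

def retrieve_moments(query_emb, clip_embs, threshold=0.3):
    """Return (start_clip, end_clip, score) spans for one video.

    query_emb: (d,) text-query embedding; clip_embs: (num_clips, d) embeddings
    of fixed-length clips. Both are assumed precomputed by frozen encoders.
    """
    scores = F.cosine_similarity(query_emb.unsqueeze(0), clip_embs, dim=-1)
    moments, start = [], None
    for i, s in enumerate(scores.tolist()):
        if s > threshold and start is None:
            start = i                                   # open a new moment
        elif s <= threshold and start is not None:
            moments.append((start, i - 1, scores[start:i].mean().item()))
            start = None                                # close the moment
    if start is not None:                               # moment runs to the end
        moments.append((start, len(scores) - 1, scores[start:].mean().item()))
    return sorted(moments, key=lambda m: m[2], reverse=True)

# Toy usage with random embeddings standing in for real encoder outputs.
query = F.normalize(torch.randn(512), dim=0)
clips = F.normalize(torch.randn(40, 512), dim=-1)
print(retrieve_moments(query, clips)[:3])
```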
3. Domain-Aware Continual Zero-Shot Learning. Yi, Kai, 29 November 2021.
We introduce Domain-Aware Continual Zero-Shot Learning (DACZSL), the task of visually recognizing images of unseen categories in unseen domains sequentially. We created DACZSL on top of the DomainNet dataset by dividing it into a sequence of tasks, where classes are incrementally provided on seen domains during training and evaluation is conducted on unseen domains for both seen and unseen classes. We also propose a novel Domain-Invariant CZSL Network (DIN), which outperforms state-of-the-art baseline models that we adapted to the DACZSL setting. We adopt a structure-based approach to alleviate forgetting of knowledge from previous tasks, with a small per-task private network in addition to a global shared network. To encourage the private network to capture domain- and task-specific representations, we train our model with a novel adversarial knowledge disentanglement setting that makes our global network task-invariant and domain-invariant over all the tasks. Our method also learns class-wise learnable prompts to obtain better class-level text representations, which serve as side information to enable zero-shot prediction of future unseen classes. Our code and benchmarks are made available at https://zero-shot-learning.github.io/daczsl.
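The class-wise learnable prompts can be illustrated with the minimal sketch below: each class receives a few trainable context vectors that are prepended to its fixed class-name token embeddings before being passed through a frozen text encoder, and only the context vectors are optimized. The encoder interface, embedding width, and prompt length here are hypothetical placeholders, not DIN's actual implementation.

```python
import torch
import torch.nn as nn

class ClasswisePrompts(nn.Module):
    """Per-class learnable context vectors prepended to fixed class-name
    embeddings; only the context vectors are trained (sketch, not DIN)."""

    def __init__(self, class_name_embs, n_ctx=4):
        # class_name_embs: (num_classes, name_len, dim), kept frozen.
        super().__init__()
        num_classes, _, dim = class_name_embs.shape
        self.register_buffer("name_embs", class_name_embs)
        self.ctx = nn.Parameter(torch.randn(num_classes, n_ctx, dim) * 0.02)

    def forward(self, frozen_text_encoder):
        # Prepend learned context to each class's name embeddings, then encode.
        prompts = torch.cat([self.ctx, self.name_embs], dim=1)
        return frozen_text_encoder(prompts)  # (num_classes, feature_dim)

# Toy usage: a stand-in "encoder" that just mean-pools token embeddings.
name_embs = torch.randn(10, 3, 512)          # 10 classes, 3 name tokens each
prompts = ClasswisePrompts(name_embs)
class_feats = prompts(lambda x: x.mean(dim=1))
print(class_feats.shape)                     # torch.Size([10, 512])
```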
4. APPLYING CLIP FOR LAND COVER CLASSIFICATION USING AERIAL AND SATELLITE IMAGERY. Kexin Meng, 04 December 2023.
<p dir="ltr">Land cover classification has always been a crucial topic in the remote sensing domain. Utilizing data collected by unmanned aerial vehicles and satellites, researchers can detect land degradation, monitor environmental changes, and provide insights for urban planning. Recent advancements in large multi-modal models have enabled open-vocabulary classification, which is particularly beneficial in this field. Becuase of the pre-training method, these models can perform zero-shot inference on unseen data, significantly reducing the costs associated with data collection and model training. This open-vocabulary feature of large-scale vision-language pre-training aligns well with the requirements of land cover classification, where benchmark datasets in the remote sensing domain comprise various categories, and transferring results from one dataset to another through supervised learning methods is challenging.</p><p dir="ltr">In this thesis, the author explored the performance of zero-shot CLIP and linear probe CLIP to assess the feasibility of using the CLIP model for land cover classification tasks. Further, the author fine-tuned CLIP by creating hierarchical label sets for the datasets, leading to better zero-shot classification results and improving overall accuracy by 2.5%. Regarding data engineering, the author examined the performance of zero-shot CLIP and linear probe CLIP across different categories and proposed a categorization method for land cover datasets. In summary, this work evaluated CLIP's overall performance on land cover datasets of varying spatial resolutions and proposed a hierarchical classification method to enhance its zero-shot performance. The thesis also offers a practical approach for modifying current dataset categorizations to better align with the model.</p>
5. Toward Robust Class-Agnostic Object Counting. Jiban, Md Jibanul Haque, 01 January 2024.
Object counting is the process of determining the quantity of specific objects in images. Accurate object counting is key for various applications in image understanding. Common applications include traffic monitoring, crowd management, wildlife migration monitoring, cell counting in medical images, and plant and insect counting in agriculture. Occlusions, complex backgrounds, changes in scale, and variations in object appearance in real-world settings make object counting challenging. This dissertation explores a progression of techniques to achieve robust localization and counting across diverse image modalities.
The exploration begins by addressing the challenges of vehicular target localization in cluttered environments using infrared (IR) imagery. We propose a network, called TCRNet-2, that processes target and clutter information in two parallel channels and then combines them to optimize the target-to-clutter ratio (TCR) metric. Next, we explore class-agnostic object counting in RGB images using vision transformers. The primary motivation for this work is that most current methods excel at counting known object types but struggle with unseen categories. To address this drawback, we propose a class-agnostic object counting method. We introduce a dual-branch architecture with interconnected cross-attention that generates feature pyramids for robust object representations, and a dedicated feature aggregator module that further improves performance. Finally, we propose a novel framework that leverages vision-language models (VLMs) for zero-shot object counting. While our earlier class-agnostic counting method demonstrates high efficacy in generalized counting tasks, it relies on user-defined exemplars of target objects, which is a limitation. Additionally, previous zero-shot counting methods were reference-less, which limits the ability to control the selection of the target object of interest in multi-class scenarios. To address these shortcomings, we propose to utilize vision-language models for zero-shot counting, where object categories of interest can be specified by text prompts.
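The text-prompted interface can be conveyed with a deliberately crude toy: tile the image into patches, score each patch against the category prompt with CLIP, and count patches above a threshold. This is only a proxy to show how a text prompt specifies what to count; the patch size, stride, arbitrary logit threshold, and file name are assumptions, and the dissertation's framework does not work this way.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def crude_text_prompted_count(image, prompt, patch=224, stride=224, threshold=26.0):
    """Count image patches whose CLIP logit for `prompt` exceeds a threshold.
    A toy proxy for text-specified counting; the threshold is arbitrary."""
    w, h = image.size
    crops = [image.crop((x, y, x + patch, y + patch))
             for y in range(0, max(h - patch, 0) + 1, stride)
             for x in range(0, max(w - patch, 0) + 1, stride)]
    inputs = processor(text=[prompt], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image.squeeze(-1)  # (num_patches,)
    return int((logits > threshold).sum())

count = crude_text_prompted_count(Image.open("scene.jpg"), "a photo of an apple")
print(count)
```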
6. VISION-LANGUAGE MODEL FOR ROBOT GRASPING. Abhinav Kaushal Keshari, 01 May 2023.
Robot grasping is emerging as an active area of research in robotics as interest in human-robot interaction grows worldwide, driven by diverse industrial settings in which tasks and workplaces are shared. Research has mainly focused on the quality of generated grasps for object manipulation. However, despite these advancements, existing methods have yet to consider human-robot collaboration settings in which robots and humans must grasp the same objects concurrently. Therefore, generating robot grasps compatible with human preferences for simultaneously holding an object is necessary to ensure a safe and natural collaboration experience. In this work, we propose a novel deep neural network-based method called CoGrasp that generates human-aware robot grasps by contextualizing human preference models of object grasping into the robot grasp selection process. We validate our approach against existing state-of-the-art robot grasping methods through simulated and real-robot experiments and user studies. In real-robot experiments, our method achieves about an 88% success rate in producing stable grasps that allow humans to interact with and grasp objects simultaneously in a socially compliant manner. Furthermore, our user study with 10 independent participants indicated that our approach enables a safe, natural, and socially aware human-robot object co-grasping experience compared to a standard robot grasping technique.
To facilitate the grasping process, we also introduce a vision-language model that works as a pre-processing system before the grasping action takes place. In most settings, the robot is equipped with sensors that allow it to capture the scene, on which the vision model performs a detection task to identify the visible objects in the environment. The language model is used to program the robot so that it can understand and execute the required sequence of tasks. Using the object detection results, we build a set of object queries from the sensor image and allow the user to provide an input query for a task to be performed. We then compute similarity scores between the object queries and the user query to localize the object that needs attention, and once it is identified, we can run a grasping process for the task at hand.
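A minimal sketch of this query-matching step: embed the detected objects' labels and the user's task query with a frozen CLIP text encoder and pick the detection with the highest cosine similarity. The label list, the example query, and the choice of CLIP text features are illustrative assumptions, not the exact pipeline built in this thesis.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical labels returned by the object detector for the current scene.
detected = ["coffee mug", "screwdriver", "cardboard box", "scissors"]
user_query = "hand me something to tighten this screw"

inputs = processor(text=detected + [user_query], return_tensors="pt", padding=True)
with torch.no_grad():
    feats = F.normalize(model.get_text_features(**inputs), dim=-1)
obj_feats, query_feat = feats[:-1], feats[-1]

scores = obj_feats @ query_feat                  # cosine similarities
target = detected[int(scores.argmax())]
print(target, scores.tolist())                   # object to localize and grasp
```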
7. Image-Text context relation using Machine Learning: Research on performance of different datasets. Sun, Yuqi, January 2022.
Building on progress in the Computer Vision and Natural Language Processing fields, Vision-Language (VL) models are designed to process information from images and texts. This thesis focuses on the performance of one such model, Oscar, on different datasets. Oscar is a state-of-the-art VL representation learning model built on a pre-trained object detection model and a pre-trained BERT model. By comparing performance across datasets, we can understand the relationship between the properties of datasets and the performance of models, and the conclusions can provide direction for future work on VL datasets and models. In this thesis, I collected five VL datasets that each differ from the others in at least one main property and generated eight subsets from these datasets. I trained the same model on the different subsets to classify whether an image is related to a text. Intuitively, clean datasets perform better because their images show everyday scenes and are annotated by human annotators; as a result, the size of clean datasets is always limited. However, an interesting finding in this thesis is that a dataset generated by models trained on different datasets achieved performance as good as the clean datasets, which encourages research on models for data collection. The experimental results also indicate that future work on VL models could focus on improving feature extraction from images, as the images have a great influence on the performance of VL models.