1

Efficient Localization of Human Actions and Moments in Videos

Escorcia, Victor 07 1900 (has links)
We are stumbling across a video tsunami flooding our communication channels. The ubiquity of digital cameras and social networks has increased the amount of visual media content generated and shared by people, in particular videos. Cisco reports that 82% of internet traffic will be in the form of video by 2022. The computer vision community has embraced this challenge by offering the first building blocks to translate the visual data in segmented video clips into semantic tags. However, users usually need to go beyond tagging at the video level. For example, someone may want to retrieve important moments such as the “first steps of her child” from a large collection of untrimmed videos, or retrieve all the instances of a home run from an unsegmented video of baseball. In the face of this data deluge, it becomes crucial to develop efficient and scalable algorithms that can intelligently localize semantic visual content in untrimmed videos. In this work, I address three different challenges in the localization of actions in videos. First, I develop deep-learning-based action proposal and detection models that take a video and generate action-agnostic and class-specific temporal segments, respectively. These models retrieve temporal locations with high accuracy in an efficient manner, faster than real time. Second, I propose the new task of retrieving and localizing temporal moments from a collection of videos given a natural language query. To tackle this challenge, I introduce an efficient and effective model that aligns the text query to individual clips of fixed length while still retrieving moments spanning multiple clips. This approach not only allows smooth interactions with users via natural language queries but also reduces the index size and search time for retrieving the moments. Lastly, I introduce the concept of actor supervision, which exploits the inherent compositionality of actions, in terms of transformations of actors, to achieve spatiotemporal localization of actions without the need for action box annotations. By designing efficient models to scan a single video in real time, retrieving and localizing moments of interest from multiple videos, and devising an effective strategy to localize actions without resorting to action box annotations, this thesis provides insights that bring us closer to the goal of general video understanding.
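The clip-level alignment described in this abstract can be illustrated with a small sketch: encode each fixed-length clip and the text query into a shared feature space, rank clips by similarity, and merge adjacent high-scoring clips into moments that span multiple clips. The encoders below are random placeholders and the function names, dimensions, and threshold are assumptions for illustration, not the thesis's actual models.

```python
# Minimal sketch of clip-level moment retrieval; feature extractors are stand-ins.
import numpy as np

def encode_clips(video_clips):
    """Placeholder: map each fixed-length clip to a d-dimensional feature vector."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(video_clips), 512))

def encode_query(text):
    """Placeholder: map a natural-language query into the same feature space."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(512)

def retrieve_moments(video_clips, text, threshold=0.2):
    clip_feats = encode_clips(video_clips)            # (num_clips, d)
    query_feat = encode_query(text)                   # (d,)
    # Cosine similarity between every clip and the query.
    sims = clip_feats @ query_feat
    sims /= (np.linalg.norm(clip_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8)
    # Merge consecutive clips above the threshold into multi-clip moments.
    moments, start = [], None
    for i, s in enumerate(sims):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            moments.append((start, i - 1))
            start = None
    if start is not None:
        moments.append((start, len(sims) - 1))
    return moments

print(retrieve_moments(["clip%d" % i for i in range(20)], "first steps of her child"))
```

Because the query is matched against clips of fixed length rather than every candidate segment, the index stays compact and retrieval over many videos remains fast, which is the efficiency argument made in the abstract.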
2

Domain-Aware Continual Zero-Shot Learning

Yi, Kai 29 November 2021 (has links)
We introduce Domain-Aware Continual Zero-Shot Learning (DACZSL), the task of visually recognizing images of unseen categories in unseen domains sequentially. We create DACZSL on top of the DomainNet dataset by dividing it into a sequence of tasks, where classes are incrementally provided on seen domains during training and evaluation is conducted on unseen domains for both seen and unseen classes. We also propose a novel Domain-Invariant CZSL Network (DIN), which outperforms state-of-the-art baseline models that we adapted to the DACZSL setting. We adopt a structure-based approach to alleviate forgetting of knowledge from previous tasks, with a small per-task private network in addition to a global shared network. To encourage the private networks to capture domain- and task-specific representations, we train our model with a novel adversarial knowledge-disentanglement setting that makes the global network task-invariant and domain-invariant over all the tasks. Our method also learns a class-wise learnable prompt to obtain better class-level text representations, which serve as side information to enable zero-shot prediction of future unseen classes. Our code and benchmarks are available at https://zero-shot-learning.github.io/daczsl.
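A hedged sketch of the shared-global / per-task-private structure and the prompt-based zero-shot scoring described above is given below. Layer sizes, names, and the stand-in text embeddings are assumptions for illustration; they are not the DIN architecture or its adversarial training.

```python
# Sketch: a global network shared across tasks plus a small private network per task,
# with class-level text embeddings used for zero-shot scoring.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPrivateNet(nn.Module):
    def __init__(self, feat_dim=512, emb_dim=128, num_tasks=6):
        super().__init__()
        # Global network shared across all tasks (trained to be task- and domain-invariant).
        self.global_net = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                        nn.Linear(emb_dim, emb_dim))
        # A small private network per task to capture task/domain-specific cues.
        self.private_nets = nn.ModuleList(
            [nn.Linear(feat_dim, emb_dim) for _ in range(num_tasks)])

    def forward(self, x, task_id):
        shared = self.global_net(x)
        private = self.private_nets[task_id](x)
        return shared + private  # combined image embedding

def zero_shot_logits(image_emb, class_text_emb):
    """Score images against class-level text embeddings (e.g. from learnable prompts)."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(class_text_emb, dim=-1)
    return img @ txt.t()  # (batch, num_classes)

# Toy usage with random tensors standing in for backbone and prompt-conditioned text features.
model = SharedPrivateNet()
images = torch.randn(4, 512)
class_prompts = torch.randn(10, 128)
logits = zero_shot_logits(model(images, task_id=0), class_prompts)
print(logits.shape)  # torch.Size([4, 10])
```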
3

APPLYING CLIP FOR LAND COVER CLASSIFICATION USING AERIAL AND SATELLITE IMAGERY

Kexin Meng (17541795) 04 December 2023 (has links)
<p dir="ltr">Land cover classification has always been a crucial topic in the remote sensing domain. Utilizing data collected by unmanned aerial vehicles and satellites, researchers can detect land degradation, monitor environmental changes, and provide insights for urban planning. Recent advancements in large multi-modal models have enabled open-vocabulary classification, which is particularly beneficial in this field. Becuase of the pre-training method, these models can perform zero-shot inference on unseen data, significantly reducing the costs associated with data collection and model training. This open-vocabulary feature of large-scale vision-language pre-training aligns well with the requirements of land cover classification, where benchmark datasets in the remote sensing domain comprise various categories, and transferring results from one dataset to another through supervised learning methods is challenging.</p><p dir="ltr">In this thesis, the author explored the performance of zero-shot CLIP and linear probe CLIP to assess the feasibility of using the CLIP model for land cover classification tasks. Further, the author fine-tuned CLIP by creating hierarchical label sets for the datasets, leading to better zero-shot classification results and improving overall accuracy by 2.5%. Regarding data engineering, the author examined the performance of zero-shot CLIP and linear probe CLIP across different categories and proposed a categorization method for land cover datasets. In summary, this work evaluated CLIP's overall performance on land cover datasets of varying spatial resolutions and proposed a hierarchical classification method to enhance its zero-shot performance. The thesis also offers a practical approach for modifying current dataset categorizations to better align with the model.</p>
4

VISION-LANGUAGE MODEL FOR ROBOT GRASPING

Abhinav Kaushal Keshari (15348490) 01 May 2023 (has links)
Robot grasping is emerging as an active area of research in robotics as interest in human-robot interaction grows worldwide, driven by diverse industrial settings in which tasks and workplaces are shared. It mainly focuses on the quality of generated grasps for object manipulation. However, despite advancements, these methods have yet to consider human-robot collaboration settings where robots and humans must grasp the same objects concurrently. Therefore, generating robot grasps compatible with human preferences for simultaneously holding an object is necessary to ensure a safe and natural collaboration experience. In this work, we propose a novel deep neural network-based method called CoGrasp that generates human-aware robot grasps by contextualizing human preference models of object grasping into the robot grasp selection process. We validate our approach against existing state-of-the-art robot grasping methods through simulated and real-robot experiments and user studies. In real-robot experiments, our method achieves about an 88% success rate in producing stable grasps that allow humans to interact with and grasp objects simultaneously in a socially compliant manner. Furthermore, our user study with 10 independent participants indicated that our approach enables a safe, natural, and socially aware human-robot object co-grasping experience compared to a standard robot grasping technique.

To facilitate the grasping process, we also introduce a vision-language model that works as a pre-processing system before the grasping action takes place. In most settings, the robots are equipped with sensors that allow them to capture the scene, on which the vision model performs a detection task to identify the visible objects in the environment. The language model is used to program the robot, making it possible for the robot to understand and execute the required sequence of tasks. Using object detection, we build a set of object queries from the sensor image and allow the user to provide an input query for a task to be performed. We then compute a similarity score among these queries to localize the object that needs attention, and once it is identified, we can use a grasping process for the task at hand.
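The query-matching step in the second paragraph can be sketched as follows: detected object labels are compared against the user's task query in a shared text-embedding space, and the best match is handed to the grasp planner. The detector and text encoder here are random stubs and the object labels are made up; a real system would use trained components, so this only shows the data flow, not CoGrasp itself.

```python
# Sketch of selecting the target object by label-query similarity before grasping.
import numpy as np

def embed_text(text):
    """Placeholder text encoder; a real system would use a language model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

def detect_objects(sensor_image):
    """Placeholder detector returning (label, bounding_box) pairs."""
    return [("mug", (40, 60, 120, 140)), ("screwdriver", (200, 30, 260, 90))]

def localize_target(sensor_image, user_query):
    detections = detect_objects(sensor_image)
    query_vec = embed_text(user_query)
    # Rank detected objects by similarity between their label and the query.
    scores = [float(embed_text(label) @ query_vec) for label, _ in detections]
    best = int(np.argmax(scores))
    return detections[best]  # this (label, box) pair would be passed to the grasp planner

label, box = localize_target(sensor_image=None, user_query="hand me the mug")
print(label, box)
```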
5

Image-Text context relation using Machine Learning : Research on performance of different datasets

Sun, Yuqi January 2022 (has links)
Based on progress in the Computer Vision and Natural Language Processing fields, Vision-Language (VL) models are designed to process information from images and texts. This thesis focuses on the performance of one model, Oscar, on different datasets. Oscar is a state-of-the-art VL representation learning model built on a pre-trained object detection model and a pre-trained BERT model. By comparing performance across datasets, we can understand the relationship between the properties of datasets and the performance of models. The conclusions could provide direction for future work on VL datasets and models. In this thesis, I collected five VL datasets that each differ from the others in at least one main respect and generated 8 subsets from these datasets. I trained the same model with different subsets to classify whether an image is related to a text. Intuitively, clean datasets perform better because their images depict everyday scenes and are annotated by human annotators; consequently, the size of clean datasets is always limited. However, an interesting finding of the thesis is that a dataset generated by models trained on other datasets achieved performance as good as the clean datasets. This encourages research on models for data collection. The experimental results also indicate that future work on VL models could focus on improving feature extraction from images, as the images have a great influence on the performance of VL models.
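The image-text relation task described here (predicting whether an image is related to a text) can be sketched as a binary classifier over fused image and text features. The dimensions, fusion scheme, and random inputs below are assumptions for illustration; they are not the Oscar architecture or its training setup.

```python
# Sketch: binary image-text matching head over pooled image and text features.
import torch
import torch.nn as nn

class ImageTextMatcher(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))  # related / not related

    def forward(self, image_feats, text_feats):
        # image_feats: pooled region features from an object detector
        # text_feats:  pooled token features from a BERT-style encoder
        return self.fuse(torch.cat([image_feats, text_feats], dim=-1))

# Toy usage with random features standing in for detector and BERT outputs.
model = ImageTextMatcher()
logits = model(torch.randn(8, 2048), torch.randn(8, 768))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
print(logits.shape, float(loss))
```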
