1

Unified Approaches for Multi-Task Vision-Language Interactions

You, Haoxuan January 2024 (has links)
Vision and language are two major modalities that humans rely on to perceive the environment and understand the world. Recent advances in Artificial Intelligence (AI) have facilitated the development of a variety of vision-language tasks derived from diverse multimodal interactions in daily life, such as image captioning, image-text matching, visual question answering (VQA), and text-to-image generation. Despite their remarkable performance, most previous state-of-the-art models are specialized for a single vision-language task and lack generalizability across tasks. Moreover, such specialized models complicate algorithm design and introduce redundancy at deployment time when dealing with complex scenes. In this study, we investigate unified approaches capable of solving various vision-language interactions in a multi-task manner. We argue that unified multi-task methods offer several potential advantages: (1) a single framework for multiple tasks reduces the human effort of designing a separate model for each task; (2) reusing and sharing parameters across tasks improves efficiency; (3) some tasks are complementary, so multi-tasking can boost performance; and (4) such methods can handle complex tasks that require the joint collaboration of multiple basic tasks, enabling new applications.

In the first part of this thesis, we explore unified multi-task models with the goal of sharing and reusing as many parameters as possible across tasks. We start by unifying a range of vision-language question-answering tasks, such as visual entailment, outside-knowledge VQA, and visual commonsense reasoning, in a simple iterative divide-and-conquer framework. Specifically, the framework iteratively decomposes the original text question into sub-questions, solves each sub-question, and derives the answer to the original question, allowing it to uniformly handle reasoning of various types and semantic levels. In the next work, we take a step further and unify image-to-text generation, text-to-image generation, vision-language understanding, and image-text matching in a single large-scale Transformer-based model. These two works demonstrate the feasibility, effectiveness, and efficiency of sharing parameters across different tasks in a single model. Nevertheless, they still need to switch between tasks and can only conduct one task at a time.

In the second part of this thesis, we introduce our efforts toward simultaneous multi-task models that conduct multiple tasks at the same time with a single model. This brings additional advantages: the model learns to perform different tasks, or combinations of tasks, automatically according to user queries, and the joint interaction of tasks enables new applications. We begin by compounding spatial understanding and semantic understanding in a single multimodal Transformer-based model. To enable the model to understand and localize local regions, we propose a hybrid region representation that seamlessly bridges regions with images and text. Coupled with a carefully collected training dataset, our model can perform joint spatial and semantic understanding in the same iteration and enables a new application: spatial reasoning. Building on this project, we further introduce an effective module for encoding high-resolution images and propose a pre-training method that aligns semantic and spatial understanding at high resolution. In addition, we couple Optical Character Recognition (OCR) capability with spatial understanding in the model and study techniques to improve the compatibility of the various tasks.
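To make the iterative divide-and-conquer idea concrete, the following is a minimal sketch of how such a loop could be organized. The functions decompose, answer_sub_question, and compose are hypothetical placeholders standing in for model calls; this is an illustration of the control flow, not the dissertation's actual implementation.

```python
# Sketch of an iterative divide-and-conquer loop for vision-language question
# answering. The three helper functions are hypothetical placeholders that
# would be backed by vision-language model calls in a real system.
from typing import List, Tuple


def decompose(question: str) -> List[str]:
    """Split a complex question into simpler sub-questions (model call in practice)."""
    return [question]  # placeholder: a real system would prompt a model here


def answer_sub_question(image, sub_question: str) -> str:
    """Answer one sub-question about the image (model call in practice)."""
    return "unknown"


def compose(question: str, qa_pairs: List[Tuple[str, str]]) -> str:
    """Derive the answer to the original question from the sub-question answers."""
    return qa_pairs[-1][1] if qa_pairs else "unknown"


def divide_and_conquer_vqa(image, question: str, max_rounds: int = 3) -> str:
    """Iteratively decompose, solve, and recompose until an answer is reached."""
    qa_pairs: List[Tuple[str, str]] = []
    for _ in range(max_rounds):
        for sq in decompose(question):
            qa_pairs.append((sq, answer_sub_question(image, sq)))
        answer = compose(question, qa_pairs)
        if answer != "unknown":
            return answer
    return compose(question, qa_pairs)
```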
2

Multimodal Representations for Video

Suris Coll-Vinent, Didac January 2024 (has links)
My thesis explores the fields of multimodal and video analysis in computer vision, aiming to bridge the gap between human perception and machine understanding. Recognizing the interplay among signals such as text, audio, and visual data, my research develops novel frameworks that integrate these diverse modalities in order to achieve a deeper understanding of complex scenes, with a particular emphasis on video analysis. As part of this exploration, I study diverse tasks such as translation, future prediction, and visual question answering, all connected through the lens of multimodal and video representations. I present novel approaches to each of these challenges, contributing across different facets of computer vision, from dataset creation to algorithmic innovations, and from achieving state-of-the-art results on established benchmarks to introducing new tasks. Methodologically, my thesis embraces two key approaches: self-supervised learning and the integration of structured representations. Self-supervised learning, a technique that allows computers to learn from unlabeled data, helps uncover inherent connections within multimodal and video inputs. Structured representations, on the other hand, serve as a means to capture the complex temporal patterns and uncertainties inherent in video analysis. By employing these techniques, I offer novel insights into modeling multimodal representations for video analysis, showing improved performance over prior work in all studied scenarios.
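As an illustration of the self-supervised learning ingredient, the sketch below shows a generic InfoNCE-style contrastive objective that pulls paired video and text embeddings together and pushes mismatched pairs apart. It is an assumed, textbook-style instantiation of self-supervised multimodal alignment, not the thesis's specific model or loss.

```python
# Illustrative contrastive alignment between video and text embeddings
# (an InfoNCE-style objective) as a generic example of self-supervised
# multimodal learning.
import torch
import torch.nn.functional as F


def contrastive_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) embeddings of paired clips and captions."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature      # similarity of every clip to every caption
    targets = torch.arange(v.size(0))   # the i-th clip matches the i-th caption
    # Symmetric loss: clip-to-caption and caption-to-clip retrieval.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2


# Example usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    video = torch.randn(8, 256)
    text = torch.randn(8, 256)
    print(contrastive_loss(video, text).item())
```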
3

High-level, part-based features for fine-grained visual categorization

Berg, Thomas January 2017 (has links)
Object recognition ("What is in this image?") is one of the basic problems of computer vision. Most work in this area has focused on basic-level object categories such as plant, car, and bird, but recently there has been an increasing amount of work in fine-grained visual categorization, in which the task is to recognize subcategories of a basic-level category, such as blue jay and bluebird. Experimental psychology has found that while basic-level categories are distinguished by the presence or absence of parts (a bird has a beak but a car does not), subcategories are more often distinguished by the characteristics of their parts (a starling has a narrow, yellow beak while a cardinal has a wide, red beak). In this thesis we tackle fine-grained visual categorization, guided by this observation. We develop alignment procedures that let us compare corresponding parts, build classifiers tailored to finding the interclass differences at each part, and then combine the per-part classifiers into subcategory classifiers. Using this approach, we outperform previous work in several fine-grained categorization settings: bird species identification, face recognition, and face attribute classification. In addition, constructing subcategory classifiers from part classifiers allows us to automatically determine which parts are most relevant when distinguishing between any two subcategories. We can use this to generate illustrations of the differences between subcategories. To demonstrate this, we have built a digital field guide to North American birds which includes automatically generated images highlighting the key differences between visually similar species. This guide, "Birdsnap," also identifies bird species in users' uploaded photos using our subcategory classifiers. We have released Birdsnap as a web site and iPhone application.
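A minimal sketch of the combination step follows, assuming part alignment and feature extraction have already produced one feature vector per part. Scikit-learn's LogisticRegression stands in for the thesis's per-part classifiers, and combining evidence by summing per-part log-probabilities is an illustrative choice rather than the exact method used.

```python
# Sketch of combining per-part classifiers into a subcategory classifier.
# Part alignment and feature extraction are assumed to have already run.
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_part_classifiers(part_features, labels):
    """part_features: list of (n_samples, dim) arrays, one array per aligned part."""
    return [LogisticRegression(max_iter=1000).fit(X, labels) for X in part_features]


def predict_subcategory(part_classifiers, part_features):
    """Combine per-part evidence by summing log-probabilities across parts."""
    log_probs = sum(np.log(clf.predict_proba(X) + 1e-12)
                    for clf, X in zip(part_classifiers, part_features))
    return log_probs.argmax(axis=1)


# Toy example: two parts (e.g., beak and wing), three subcategories.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=60)
parts_train = [rng.normal(size=(60, 16)) + labels[:, None] * 0.5 for _ in range(2)]
clfs = train_part_classifiers(parts_train, labels)
print(predict_subcategory(clfs, parts_train)[:10])
```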
4

Vision-based Manipulation In-the-Wild

Chi, Cheng January 2024 (has links)
Deploying robots in real-world environments involves immense engineering complexity, potentially surpassing the resources required for autonomous vehicles due to the increased dimensionality and task variety. To maximize the chances of successful real-world deployment, finding a simple solution that minimizes engineering complexity at every level, from hardware to algorithm to operations, is crucial. In this dissertation, we consider a vision-based manipulation system that can be deployed in-the-wild when trained to imitate sufficient quantity and diversity of human demonstration data on the desired task. At deployment time, the robot is driven by a single diffusion-based visuomotor policy, with raw RGB images as input and robot end-effector pose as output. Compared to existing policy representations, Diffusion Policy handles multimodal action distributions gracefully, being scalable to high-dimensional action spaces and exhibiting impressive training stability. These properties allow a single software system to be used for multiple tasks, with data collected by multiple demonstrators, deployed to multiple robot embodiments, and without significant hyper-parameter tuning. We developed a Universal Manipulation Interface (UMI), a portable, low-cost, and information-rich data collection system to enable direct manipulation skill learning from in-the-wild human demonstrations. UMI provides an intuitive interface for non-expert users by using hand-held grippers with mounted GoPro cameras. Compared to existing robotic data collection systems, UMI enables robotic data collection without needing a robot, drastically reducing the engineering and operational complexity. Trained with UMI data, the resulting diffusion policies can be deployed across multiple robot platforms in unseen environments for novel objects and to complete dynamic, bimanual, precise, and long-horizon tasks. The Diffusion Policy and UMI combination provides a simple full-stack solution to many manipulation problems. The turn-around time of building a single-task manipulation system (such as object tossing and cloth folding) can be reduced from a few months to a few days.
