1. Unified Approaches for Multi-Task Vision-Language Interactions
You, Haoxuan (January 2024)
Vision and language are two major modalities that humans rely on to perceive the environment and understand the world. Recent advances in Artificial Intelligence (AI) have enabled a variety of vision-language tasks derived from the diverse multimodal interactions of daily life, such as image captioning, image-text matching, visual question answering (VQA), and text-to-image generation. Despite their remarkable performance, most previous state-of-the-art models are specialized for a single vision-language task and lack generalizability across tasks. Moreover, such specialized models complicate algorithm design and introduce redundancy in model deployment when dealing with complex scenes.
In this study, we investigate unified approaches capable of solving various vision-language interactions in a multi-task manner. We argue that unified multi-task methods offer several potential advantages: (1) a unified framework for multiple tasks reduces the human effort of designing a separate model for each task; (2) reusing and sharing parameters across tasks improves efficiency; (3) some tasks are complementary to others, so multi-tasking can boost performance; and (4) a unified model can handle complex tasks that require the joint collaboration of multiple basic tasks, enabling new applications.
In the first part of this thesis, we explore unified multi-task models with the goal of sharing and reusing as many parameters as possible across tasks. We start by unifying a range of vision-language question-answering tasks, such as visual entailment, outside-knowledge VQA, and visual commonsense reasoning, in a simple iterative divide-and-conquer framework: it iteratively decomposes the original question into sub-questions, solves each sub-question, and derives the answer to the original question, thereby handling reasoning of various types and semantic levels uniformly within one framework. In the next work, we go one step further and unify image-to-text generation, text-to-image generation, vision-language understanding, and image-text matching in a single large-scale Transformer-based model. These two works demonstrate the feasibility, effectiveness, and efficiency of sharing parameters across different tasks in a single model. Nevertheless, they must still switch between tasks and can conduct only one task at a time.
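To make the iterative divide-and-conquer idea concrete, the sketch below shows one plausible control flow. The `decompose`, `answer_sub`, and `resolve` callables stand in for learned components (for example, prompted vision-language models); their names, the stopping rule, and the step limit are illustrative assumptions rather than the thesis's actual modules.

```python
def divide_and_conquer_vqa(image, question, decompose, answer_sub, resolve, max_steps=3):
    """Iteratively split a question into sub-questions, answer each one,
    and fold the collected evidence back into an answer for the original question."""
    evidence = []  # accumulated (sub-question, sub-answer) pairs
    for _ in range(max_steps):
        sub_questions = decompose(question, evidence)  # propose simpler sub-questions
        if not sub_questions:                          # nothing left to break down
            break
        for sq in sub_questions:
            evidence.append((sq, answer_sub(image, sq)))
    # derive the final answer from the original question plus the gathered evidence
    return resolve(image, question, evidence)
```

Any concrete system would plug in its own decomposer, sub-question answerer, and answer resolver; the point is only that one loop can cover question types of different reasoning depths.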
In the second part of this thesis, we introduce our efforts toward simultaneous multi-task models that conduct multiple tasks at the same time with a single model. This setting has additional advantages: the model can learn to perform different tasks, or combinations of tasks, automatically according to user queries, and the joint interaction of tasks can enable new applications. We begin by combining spatial understanding and semantic understanding in a single multimodal Transformer-based model. To enable the model to understand and localize local regions, we propose a hybrid region representation that seamlessly bridges regions with both image and text. Coupled with a carefully curated training dataset, our model performs joint spatial and semantic understanding in the same iteration and enables a new application: spatial reasoning. Building on this project, we further introduce an effective module for encoding high-resolution images and propose a pre-training method that aligns semantic and spatial understanding at high resolution. In addition, we couple Optical Character Recognition (OCR) capability with spatial understanding in the model and study techniques to improve the compatibility of the various tasks.
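As one illustration of how a region could be bridged to both the text and the image sides, the sketch below pairs discretized coordinate tokens (for the language stream) with a pooled continuous feature from the image feature map (for the visual stream). The bin count, quantization, and mean pooling are assumptions made for illustration, not the thesis's exact hybrid representation.

```python
import torch

def region_to_tokens(box, num_bins=100):
    """Map a normalized box (x1, y1, x2, y2) in [0, 1] to discrete coordinate tokens."""
    return [f"<bin_{int(round(c * (num_bins - 1)))}>" for c in box]

def region_to_feature(feature_map, box):
    """Pool a continuous region feature from a (C, H, W) image feature map."""
    c, h, w = feature_map.shape
    x1, y1, x2, y2 = box
    # convert normalized coordinates to feature-map indices, keeping at least one cell
    i1, i2 = int(y1 * h), max(int(y1 * h) + 1, int(y2 * h))
    j1, j2 = int(x1 * w), max(int(x1 * w) + 1, int(x2 * w))
    return feature_map[:, i1:i2, j1:j2].mean(dim=(1, 2))  # (C,) region embedding

feature_map = torch.randn(256, 24, 24)         # placeholder visual backbone features
box = (0.25, 0.40, 0.60, 0.85)                 # normalized region coordinates
tokens = region_to_tokens(box)                 # coordinate tokens for the text side
feature = region_to_feature(feature_map, box)  # continuous embedding for the region
```

The tokens let the language model refer to the region in its input and output text, while the pooled feature carries the fine-grained visual content of the same region.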
2. Learning to Remember, Summarize, and Answer Questions about Robot Actions
DeChant, Chad (January 2025)
Robots and other systems using deep learning are increasingly common. It is essential to accurately keep track of what they have done and determine if they are operating as we intend. In this dissertation, I introduce several approaches to enable better monitoring and understanding of these systems.
I propose and demonstrate robot action summarization and question answering in natural language. The first step in understanding and controlling robots is knowing what they are doing. Much research has been done on training robots to learn to follow natural language instructions; robot action summarization is a complement to this research, enabling robots to report back succinctly and accurately what they have done after completing a task. I demonstrate that such summaries can be generated from multimodal inputs using recurrent and transformer networks.
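A minimal sketch of this kind of architecture is given below: per-timestep frame and action features are fused and passed through a Transformer encoder, and the pooled episode encoding would then condition a text decoder that writes the summary. All dimensions, the additive fusion, and the mean pooling are assumptions for illustration, not the dissertation's actual model.

```python
import torch
import torch.nn as nn

class EpisodeEncoder(nn.Module):
    """Encode a robot action episode from precomputed frame features and actions."""
    def __init__(self, frame_dim=512, action_dim=7, d_model=256, n_layers=2):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, d_model)
        self.action_proj = nn.Linear(action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frames, actions):
        # frames: (B, T, frame_dim) visual features; actions: (B, T, action_dim) commands
        x = self.frame_proj(frames) + self.action_proj(actions)  # fuse per timestep
        h = self.encoder(x)                                      # (B, T, d_model)
        return h.mean(dim=1)  # episode-level encoding for a downstream summary decoder

frames = torch.randn(2, 50, 512)   # two 50-step episodes of precomputed frame features
actions = torch.randn(2, 50, 7)    # e.g., joint or end-effector commands
episode_repr = EpisodeEncoder()(frames, actions)  # (2, 256)
```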
I then investigate question answering about robot action episodes using a dataset of questions and answers that I introduce. I show that learning to answer questions can help a model summarize, by enabling it to learn about some objects solely during question answering and then transfer that representational knowledge to summarization.
If robots are to summarize and answer questions about their past actions, they must be able to store and recall episodes of action. I introduce a technique for forming compact memory representations that can be used for these tasks, as well as for guiding choices about which actions to take during an action sequence. In addition to helping users keep track of what robots or other machine learning systems are doing, such artificial episodic memory representations could also pose undesirable risks. I therefore propose a set of principles to guide the safe development of artificial episodic memory.
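One plausible way to compress a long episode into a fixed-size memory is sketched below: a small set of learned query slots cross-attends over the per-step features, so that hundreds of timesteps are distilled into a handful of memory vectors. This is an illustrative assumption about how such a representation could be formed, not the dissertation's actual technique.

```python
import torch
import torch.nn as nn

class CompactMemory(nn.Module):
    """Distill a variable-length episode into a fixed number of memory vectors."""
    def __init__(self, d_model=256, num_slots=8, n_heads=4):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)  # learned queries
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, step_features):
        # step_features: (B, T, d_model) encodings of a possibly long action episode
        queries = self.slots.unsqueeze(0).expand(step_features.size(0), -1, -1)
        memory, _ = self.attn(queries, step_features, step_features)
        return memory  # (B, num_slots, d_model): fixed-size episodic memory

episode = torch.randn(2, 300, 256)    # 300 timesteps of per-step encodings
memory = CompactMemory()(episode)     # (2, 8, 256) compact representation
```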
Finally, I introduce a method for predicting a neural network’s accuracy on particular inputs by training a second network to examine the outputs of its intermediate hidden layers or its final outputs.
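The general idea can be sketched as follows: a small monitor network reads a hidden activation (or the logits) of a frozen, trained classifier and is trained to predict whether the classifier's answer on that input is correct. The layer choice, monitor architecture, and training loop here are illustrative assumptions, not the dissertation's exact method.

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(            # stand-in for an already trained network
    nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)
)

activations = {}
def save_hidden(module, inputs, output):
    activations["hidden"] = output.detach()        # capture the intermediate output
classifier[2].register_forward_hook(save_hidden)   # hook on the hidden ReLU layer

monitor = nn.Sequential(               # predicts P(classifier is correct | hidden state)
    nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1)
)
opt = torch.optim.Adam(monitor.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def monitor_step(x, y):
    with torch.no_grad():
        preds = classifier(x).argmax(dim=1)         # frozen classifier's answers
    correct = (preds == y).float().unsqueeze(1)     # per-input correctness labels
    score = monitor(activations["hidden"])          # read the captured hidden state
    loss = loss_fn(score, correct)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

loss = monitor_step(torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,)))
```

At test time the monitor's score can flag inputs on which the first network is likely to be wrong.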