Global ETD Search

1	Beyond Labels and Captions: Contextualizing Grounded Semantics for Explainable Visual Interpretation
Aakur, Sathyanarayanan Narasimhan, 28 June 2019
One of the long-standing problems in artificial intelligence is the development of intelligent agents with complete visual understanding. Understanding entails recognition of scene attributes such as actors, objects and actions, as well as reasoning about the common semantic structure that combines these attributes into a coherent description. While significant milestones have been achieved in the field of computer vision, the majority of the work has concentrated on supervised visual recognition, where complex visual representations are learned and a few discrete categories or labels are assigned to these representations. This implies a closed world, where the underlying assumption is that all environments contain the same objects and events, which are in one-to-one correspondence with the ground evidence in the image. Hence, the learned knowledge is limited to the annotated training set. An open world, on the other hand, does not assume the distribution of semantics and requires generalization beyond the training annotations. Increasingly complex models require massive amounts of training data and offer little to no explainability due to the lack of transparency in the decision-making process. The ability of artificial intelligence systems to offer explanations for their decisions is central to building user confidence and structuring smart human-machine interactions. In this dissertation, we develop an inherently explainable approach for generating rich interpretations of visual scenes. We move towards open-world, open-domain visual understanding by decoupling the ideas of recognition and reasoning.
We integrate common sense knowledge from large knowledge bases such as ConceptNet and the representation learning capabilities of deep learning approaches in a pattern theory formalism to interpret a complex visual scene. To be specific, we first define and develop the idea of contextualization to model and establish complex semantic relationships among concepts grounded in visual data. The resulting semantic structures, called interpretations, allow us to represent the visual scene in an intermediate representation that can then be used as the source of knowledge for various modes of expression such as labels, captions and even question answering. Second, we explore the inherent explainability of such visual interpretations and define key components for extending the notion of explainability to intelligent agents for visual recognition. Finally, we describe a self-supervised model for segmenting untrimmed videos into their constituent events. We show that this approach can segment videos without any supervision, implicit or explicit. Combined, we argue that these approaches offer an elegant path to inherently explainable, open-domain visual understanding while negating the need for human supervision in the form of labels and/or captions. We show that the proposed approach can advance state-of-the-art results on complex benchmarks, handling data imbalance, complex semantics, and complex visual scenes without the need for vast amounts of domain-specific training data. Extensive experiments on several publicly available datasets show the efficacy of the proposed approaches. We show that the proposed approaches outperform weakly-supervised and unsupervised baselines by up to 24% and achieve competitive segmentation results compared to fully supervised baselines.
The self-supervised approach for video segmentation complements this top-down inference with efficient bottom-up processing, resulting in an elegant formalism for open-domain visual understanding.
Common sense knowledge; Visual understanding; ConceptNet; Computer Sciences
2	Label-Efficient Visual Understanding with Consistency Constraints
Zou, Yuliang, 24 May 2022
Modern deep neural networks are proficient at solving various visual recognition and understanding tasks, as long as a sufficiently large labeled dataset is available at training time. However, progress on these visual tasks is limited by the number of manual annotations. Annotating visual data is usually time-consuming and error-prone, which makes it hard to scale up human labeling for many visual tasks. Fortunately, it is easy to collect large-scale, diverse unlabeled visual data from the Internet, and a large amount of synthetic visual data with annotations can be acquired effortlessly from game engines. In this dissertation, we explore how to utilize unlabeled data and synthetic labeled data for various visual tasks, aiming to replace or reduce the direct supervision from manual annotations. The key idea is to encourage deep neural networks to produce consistent predictions across different transformations (e.g., geometric, temporal, photometric). We organize the dissertation as follows. In Part I, we propose to use the consistency over different geometric formulations and a cycle consistency over time to tackle low-level scene geometry perception tasks in a self-supervised learning setting. In Part II, we tackle high-level semantic understanding tasks in a semi-supervised learning setting, with the constraint that different augmented views of the same visual input maintain consistent semantic information. In Part III, we tackle the cross-domain image segmentation problem. By encouraging an adaptive segmentation model to output consistent results for a diverse set of strongly-augmented synthetic data, the model learns to perform test-time adaptation on unseen target domains with a single forward pass, without model training or optimization at inference time.
/ Doctor of Philosophy /
Recently, deep learning has emerged as one of the most powerful tools for solving various visual understanding tasks. However, the development of deep learning methods is significantly limited by the amount of manually labeled data. Annotating visual data is usually time-consuming and error-prone, so the human labeling process does not scale easily. Fortunately, it is easy to collect large-scale, diverse raw visual data from the Internet (e.g., search engines, YouTube, Instagram), and a large amount of synthetic visual data with annotations can be acquired effortlessly from game engines. In this dissertation, we explore how to utilize raw visual data and synthetic data for various visual tasks, aiming to replace or reduce the direct supervision from manual annotations. The key idea is to encourage deep neural networks to produce consistent predictions for the same visual input across different transformations (e.g., geometric, temporal, photometric). We organize the dissertation as follows. In Part I, we propose using the consistency over different geometric formulations and a forward-backward cycle consistency over time to tackle low-level scene geometry perception tasks, using unlabeled visual data only. In Part II, we tackle high-level semantic understanding tasks using a small amount of labeled data and a large amount of unlabeled data jointly, with the constraint that different augmented views of the same visual input maintain consistent semantic information. In Part III, we tackle the cross-domain image segmentation problem. By encouraging an adaptive segmentation model to output consistent results for a diverse set of strongly-augmented synthetic data, the model learns to perform test-time adaptation on unseen target domains.
Label-Efficient; Consistency Regularization; Visual Understanding; Self-Supervised Learning; Semi-Supervised Learning; Pseudo Labeling; Test-Time Adaptation; BatchNorm Calibration; Cross-Domain Generalization
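
As a rough illustration of the consistency idea described in the abstract of record 2, the following minimal Python sketch penalizes disagreement between a model's class predictions for two transformed views of the same input. The names `softmax` and `consistency_loss`, the squared-difference penalty, and the example logits are assumptions made for illustration only; the record does not specify the losses actually used in the dissertation.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(logits_a, logits_b):
    """Mean squared difference between the predicted class distributions
    for two views of the same input (hypothetical penalty, for illustration)."""
    p = softmax(logits_a)
    q = softmax(logits_b)
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)


# Hypothetical logits for one image under two different augmentations.
view_1 = [2.0, 0.5, -1.0]
view_2 = [1.5, 0.7, -0.8]
print(consistency_loss(view_1, view_2))
```

The loss is zero when both views yield the same distribution and grows as the predictions diverge, so minimizing it on unlabeled inputs pushes the model toward transformation-consistent outputs without requiring any labels.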
