1

Learning visually grounded meaning representations

Silberer, Carina Helga. January 2015.
Humans possess a rich semantic knowledge of words and concepts which captures the perceivable physical properties of their real-world referents and their relations. Encoding this knowledge or some of its aspects is the goal of computational models of semantic representation and has been the subject of considerable research in cognitive science, natural language processing, and related areas. Existing models have placed emphasis on different aspects of meaning, depending ultimately on the task at hand. Typically, such models have been used in tasks addressing the simulation of behavioural phenomena, e.g., lexical priming or categorisation, as well as in natural language applications, such as information retrieval, document classification, or semantic role labelling. A major strand of research popular across disciplines focuses on models which induce semantic representations from text corpora. These models are based on the hypothesis that the meaning of words is established by their distributional relation to other words (Harris, 1954). Despite their widespread use, distributional models of word meaning have been criticised as ‘disembodied’ in that they are not grounded in perception and action (Perfetti, 1998; Barsalou, 1999; Glenberg and Kaschak, 2002). This lack of grounding contrasts with many experimental studies suggesting that meaning is acquired not only from exposure to the linguistic environment but also from our interaction with the physical world (Landau et al., 1998; Bornstein et al., 2004). This criticism has led to the emergence of new models aiming at inducing perceptually grounded semantic representations. Essentially, existing approaches learn meaning representations from multiple views corresponding to different modalities, i.e. linguistic and perceptual input. To approximate the perceptual modality, previous work has relied largely on semantic attributes collected from humans (e.g., is round, is sour), or on automatically extracted image features. Semantic attributes have a long-standing tradition in cognitive science and are thought to represent salient psychological aspects of word meaning including multisensory information. However, their elicitation from human subjects limits the scope of computational models to a small number of concepts for which attributes are available. In this thesis, we present an approach which draws inspiration from the successful application of attribute classifiers in image classification, and represent images and the concepts depicted by them by automatically predicted visual attributes. To this end, we create a dataset comprising nearly 700K images and a taxonomy of 636 visual attributes and use it to train attribute classifiers. We show that their predictions can act as a substitute for human-produced attributes without any critical information loss. In line with the attribute-based approximation of the visual modality, we represent the linguistic modality by textual attributes which we obtain with an off-the-shelf distributional model. Having first established this core contribution of a novel modelling framework for grounded meaning representations based on semantic attributes, we show that these can be integrated into existing approaches to perceptually grounded representations. We then introduce a model which is formulated as a stacked autoencoder (a variant of multilayer neural networks), which learns higher-level meaning representations by mapping words and images, represented by attributes, into a common embedding space. 
In contrast to most previous approaches to multimodal learning using different variants of deep networks and data sources, our model is defined at a finer level of granularity—it computes representations for individual words and is unique in its use of attributes as a means of representing the textual and visual modalities. We evaluate the effectiveness of the representations learnt by our model by assessing its ability to account for human behaviour on three semantic tasks, namely word similarity, concept categorisation, and typicality of category members. With respect to the word similarity task, we focus on the model’s ability to capture similarity in both the meaning and appearance of the words’ referents. Since existing benchmark datasets on word similarity do not distinguish between these two dimensions and often contain abstract words, we create a new dataset in a large-scale experiment where participants are asked to give two ratings per word pair expressing their semantic and visual similarity, respectively. Experimental results show that our model learns meaningful representations which are more accurate than models based on individual modalities or different modality integration mechanisms. The presented model is furthermore able to predict textual attributes for new concepts given their visual attribute predictions only, which we demonstrate by comparing model output with human generated attributes. Finally, we show the model’s effectiveness in an image-based task on visual category learning, in which images are used as a stand-in for real-world objects.
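As a rough illustration of the stacked-autoencoder idea described above, the sketch below (in PyTorch) encodes textual and visual attribute vectors into a shared embedding and reconstructs both modalities from it. The layer sizes, activations, and the textual attribute dimensionality are assumptions chosen for illustration; only the count of 636 visual attributes comes from the abstract, and this is not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Sketch of a stacked autoencoder fusing textual and visual attribute
    vectors into a shared embedding (all dimensions here are assumptions)."""

    def __init__(self, n_text_attrs=500, n_vis_attrs=636, hidden=300, embed=200):
        super().__init__()
        # Modality-specific encoders (first layer of the stack).
        self.enc_text = nn.Sequential(nn.Linear(n_text_attrs, hidden), nn.Sigmoid())
        self.enc_vis = nn.Sequential(nn.Linear(n_vis_attrs, hidden), nn.Sigmoid())
        # Joint layer mapping both modalities into a common embedding space.
        self.enc_joint = nn.Sequential(nn.Linear(2 * hidden, embed), nn.Sigmoid())
        # Decoders reconstruct the original attribute vectors from the embedding.
        self.dec_joint = nn.Sequential(nn.Linear(embed, 2 * hidden), nn.Sigmoid())
        self.dec_text = nn.Linear(hidden, n_text_attrs)
        self.dec_vis = nn.Linear(hidden, n_vis_attrs)

    def forward(self, text_attrs, vis_attrs):
        h = torch.cat([self.enc_text(text_attrs), self.enc_vis(vis_attrs)], dim=-1)
        z = self.enc_joint(h)                        # shared multimodal embedding
        h_text, h_vis = self.dec_joint(z).chunk(2, dim=-1)
        return z, self.dec_text(h_text), self.dec_vis(h_vis)

# Training would minimise reconstruction error on both modalities, e.g.
# loss = mse(text_rec, text_attrs) + mse(vis_rec, vis_attrs).
```

In a setup like this, the middle-layer code `z` would serve as the grounded word representation evaluated on similarity and categorisation tasks.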
2

Interactive image search with attributes

Kovashka, Adriana Ivanova. 18 September 2014.
An image retrieval system needs to be able to communicate with people using a common language, if it is to serve its user's information need. I propose techniques for interactive image search with the help of visual attributes, which are high-level semantic visual properties of objects (like "shiny" or "natural"), and are understandable by both people and machines. My thesis explores attributes as a novel form of user input for search. I show how to use attributes to provide relevance feedback for image search; how to optimally choose what to seek feedback on; how to ensure that the attribute models learned by a system align with the user's perception of these attributes; how to automatically discover the shades of meaning that users employ when applying an attribute term; and how attributes can help learn object category models. I use attributes to provide a channel on which the user of an image retrieval system can communicate her information need precisely and with as little effort as possible. One-shot retrieval is generally insufficient, so interactive retrieval systems seek feedback from the user on the currently retrieved results, and adapt their relevance ranking function accordingly. In traditional interactive search, users mark some images as "relevant" and others as "irrelevant", but this form of feedback is limited. I propose a novel mode of feedback where a user directly describes how high-level properties of retrieved images should be adjusted in order to more closely match her envisioned target images, using relative attribute feedback statements. For example, when conducting a query on a shopping website, the user might state: "I want shoes like these, but more formal." I demonstrate that relative attribute feedback is more powerful than traditional binary feedback. The images believed to be most relevant need not be most informative for reducing the system's uncertainty, so it might be beneficial to seek feedback on something other than the top-ranked images. I propose to guide the user through a coarse-to-fine search using a relative attribute image representation. At each iteration of feedback, the user provides a visual comparison between the attribute in her envisioned target and a "pivot" exemplar, where a pivot separates all database images into two balanced sets. The system actively determines along which of multiple such attributes the user's comparison should next be requested, based on the expected information gain that would result. The proposed attribute search trees allow us to limit the scan for candidate images on which to seek feedback to just one image per attribute, so it is efficient both for the system and the user. No matter what potentially powerful form of feedback the system offers the user, search efficiency will suffer if there is noise on the communication channel between the user and the system. Therefore, I also study ways to capture the user's true perception of the attribute vocabulary used in the search. In existing work, the underlying assumption is that an image has a single "true" label for each attribute that objective viewers could agree upon. However, multiple objective viewers frequently have slightly different internal models of a visual property. I pose user-specific attribute learning as an adaptation problem in which the system leverages any commonalities in perception to learn a generic prediction function. Then, it uses a small number of user-labeled examples to adapt that model into a user-specific prediction function. 
To further lighten the labeling load, I introduce two ways to extrapolate beyond the labels explicitly provided by a given user. While users differ in how they use the attribute vocabulary, there exist some commonalities and groupings of users around their attribute interpretations. Automatically discovering and exploiting these groupings can help the system learn more robust personalized models. I propose an approach to discover the latent factors behind how users label images with the presence or absence of a given attribute, from a sparse label matrix. I then show how to cluster users in this latent space to expose the underlying "shades of meaning" of the attribute, and subsequently learn personalized models for these user groups. Discovering the shades of meaning also serves to disambiguate attribute terms and expand a core attribute vocabulary with finer-grained attributes. Finally, I show how attributes can help learn object categories faster. I develop an active learning framework where the computer vision learning system actively solicits annotations from a pool of both object category labels and the objects' shared attributes, depending on which will most reduce total uncertainty for multi-class object predictions in the joint object-attribute model. Knowledge of an attribute's presence in an image can immediately influence many object models, since attributes are by definition shared across subsets of the object categories. The resulting object category models can be used when the user initiates a search via keywords such as "Show me images of cats" and then (optionally) refines that search with the attribute-based interactions I propose. My thesis exploits properties of visual attributes that allow search to be both effective and efficient, in terms of both user time and computation time. Further, I show how the search experience for each individual user can be improved, by modeling how she uses attributes to communicate with the retrieval system. I focus on the modes in which an image retrieval system communicates with its users by integrating the computer vision perspective and the information retrieval perspective to image search, so the techniques I propose are a promising step in closing the semantic gap.
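As a hedged sketch of the relative attribute feedback idea ("like these, but more formal"), the snippet below re-ranks a database using predicted attribute strengths. The function name, array layout, and ranking rule are illustrative assumptions, not the system built in the thesis.

```python
import numpy as np

def relative_attribute_feedback(db_attr_scores, candidates, ref_idx, attr_idx, direction):
    """Illustrative sketch of relative attribute feedback (all names are assumptions).

    db_attr_scores : (n_images, n_attributes) predicted attribute strengths,
                     e.g. outputs of learned ranking functions.
    candidates     : indices of images currently considered relevant.
    ref_idx        : reference image the user compared against ("like these ...").
    attr_idx       : attribute named in the feedback (e.g. "formal").
    direction      : +1 for "more", -1 for "less".
    """
    ref_score = db_attr_scores[ref_idx, attr_idx]
    # Keep only images whose attribute strength satisfies the comparison.
    kept = [i for i in candidates
            if direction * (db_attr_scores[i, attr_idx] - ref_score) > 0]
    # Order the survivors by how strongly they satisfy it (one simple choice).
    kept.sort(key=lambda i: direction * db_attr_scores[i, attr_idx], reverse=True)
    return kept

# Example: "I want shoes like image 42, but more formal" (attribute index 3 assumed).
scores = np.random.rand(1000, 10)
ranking = relative_attribute_feedback(scores, range(1000), ref_idx=42, attr_idx=3, direction=+1)
```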
3

Commonsense for Zero-Shot Natural Language Video Localization

Holla, Meghana. 07 July 2023.
Zero-shot Natural Language-Video Localization (NLVL) has shown promising results in training NLVL models solely with raw video data through dynamic video segment proposal generation and pseudo-query annotations. However, existing pseudo-queries lack grounding in the source video and suffer from a lack of common ground due to their unstructured nature. In this work, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that utilizes commonsense information to bridge the gap between videos and generated pseudo-queries through a commonsense enhancement module. Our approach employs Graph Convolutional Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query vectors prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that our model surpasses both zero-shot and weakly supervised baselines. These results underscore the significance of leveraging commonsense reasoning abilities in multimodal understanding tasks.

Master of Science

Natural Language Video Localization (NLVL) is the task of retrieving relevant video segments from an untrimmed video given a user text query. To train an NLVL system, traditional methods demand annotations on the input videos, which include video segment spans (i.e., start and end timestamps) and the accompanying text query describing the segment. These annotations are laborious to collect for any domain and video length. To alleviate this, zero-shot NLVL methods generate the aforementioned annotations dynamically. However, current zero-shot NLVL approaches suffer from poor alignment between the video and the dynamically generated query, which can introduce noise in the localization process. To this end, this work aims to investigate the impact of implicit commonsensical knowledge, which humans innately possess, on zero-shot NLVL. We introduce CORONET, a zero-shot NLVL framework that utilizes commonsense information to bridge the gap between videos and generated pseudo-queries. Experiments on two benchmark datasets, containing diverse themes of videos, highlight the effectiveness of leveraging commonsense information.
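A minimal sketch, under stated assumptions, of how a commonsense enhancement module of this kind could combine a GCN over knowledge-graph nodes with cross-attention from video features (the same pattern would apply to the pseudo-query vectors). Dimensions, layer counts, and the residual fusion are illustrative choices rather than CORONET's actual architecture.

```python
import torch
import torch.nn as nn

class CommonsenseEnhancer(nn.Module):
    """Rough sketch: a GCN encodes knowledge-graph nodes, then cross-attention
    lets video features attend to them (sizes and fusion are assumptions)."""

    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.gcn = nn.Linear(dim, dim)  # single GCN layer for illustration
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, node_feats, adj, video_feats):
        # node_feats: (n_nodes, dim); adj: normalised (n_nodes, n_nodes) adjacency;
        # video_feats: (batch, n_segments, dim).
        nodes = torch.relu(self.gcn(adj @ node_feats))             # GCN propagation
        nodes = nodes.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        enhanced, _ = self.cross_attn(video_feats, nodes, nodes)   # video attends to commonsense
        return video_feats + enhanced                              # residual fusion
```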
4

Leveraging Multimodal Perspectives to Learn Common Sense for Vision and Language Tasks

Lin, Xiao. 05 October 2017.
Learning and reasoning with common sense is a challenging problem in Artificial Intelligence (AI). Humans have the remarkable ability to interpret images and text from different perspectives in multiple modalities, and to use large amounts of commonsense knowledge while performing visual or textual tasks. Inspired by that ability, we approach commonsense learning as leveraging perspectives from multiple modalities for images and text in the context of vision and language tasks. Given a target task (e.g., textual reasoning, matching images with captions), our system first represents input images and text in multiple modalities (e.g., vision, text, abstract scenes and facts). Those modalities provide different perspectives to interpret the input images and text. Then, based on those perspectives, the system performs reasoning to make a joint prediction for the target task. Surprisingly, we show that interpreting textual assertions and scene descriptions in the modality of abstract scenes improves performance on various textual reasoning tasks, and interpreting images in the modality of Visual Question Answering improves performance on caption retrieval, which is a visual reasoning task. With grounding, imagination and question-answering approaches to interpret images and text in different modalities, we show that learning commonsense knowledge from multiple modalities effectively improves the performance of downstream vision and language tasks, improves interpretability of the model, and makes more efficient use of training data. Complementary to the model aspect, we also study the data aspect of commonsense learning in vision and language. We study active learning for Visual Question Answering (VQA), where a model iteratively grows its knowledge through querying informative questions about images for answers. Drawing analogies from human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a new goal-driven scoring function for deep VQA models under the Bayesian Neural Network framework. Once trained with a large initial training set, a deep VQA model is able to efficiently query informative question-image pairs for answers to improve itself through active learning, saving human effort on commonsense annotations.

Ph. D.
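The cramming (entropy) acquisition strategy mentioned above can be sketched as follows; the function names and tensor shapes are assumptions for illustration, and the thesis's proposed goal-driven scoring function is not reproduced here.

```python
import torch

def entropy_scores(answer_logits):
    """Cramming-style (entropy) acquisition sketch: score each unlabelled
    question-image pair by the entropy of the model's answer distribution.
    Shapes and names are illustrative: answer_logits is (n_pairs, n_answers)."""
    probs = torch.softmax(answer_logits, dim=-1)
    return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)

def select_queries(answer_logits, k=10):
    # Request annotations for the k pairs the model is most uncertain about.
    return torch.topk(entropy_scores(answer_logits), k).indices

# e.g. logits = vqa_model(images, questions); to_label = select_queries(logits, k=32)
```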
5

Multicultural Emotional Reasoning in Vision Language Models

Mohamed, Youssef Sherif Mansour. 03 1900.
Human intelligence, with its many components, has been elusive. Until recently, the emphasis has been on facts and how humans perceive them. Now, it is time to embellish these facts with emotions and commentary. Emotional experiences and expressions play a critical role in human behavior and are influenced by language and cultural diversity. In this thesis, we explore the importance of emotions across multiple languages, such as Arabic, Chinese, and Spanish. In addition, we argue for the importance of collecting diverse emotional experiences, including negative ones. We aim to develop AI systems that have a deeper understanding of emotional experiences. We open-source two datasets that emphasize diversity across emotions, language, and culture. ArtELingo contains affective annotations in the aforementioned languages, revealing valuable insights into how linguistic backgrounds shape emotional perception and expression, while ArtEmis 2.0 has a balanced distribution of positive and negative emotional experiences. Studying emotional experiences in AI is crucial for creating applications that genuinely understand and resonate with users. To tackle the main challenges in popular existing affective captioning datasets, namely unbalanced emotion distributions and generic captions, we propose a contrastive data collection method. This approach results in a dataset with a balanced distribution of emotions, significantly enhancing the quality of trained neural speakers and emotion recognition models. Consequently, our trained speakers generate emotionally accurate and relevant captions, demonstrating the advantages of using a linguistically and emotionally diverse dataset in AI systems. In addition, we explore the cultural aspects of emotional experiences and expressions, highlighting the importance of considering cultural differences in the development of AI applications. By incorporating these insights, this thesis establishes a foundation for future research in emotionally and culturally diverse affective computing, contributing to the development of AI applications capable of effectively understanding and engaging with humans on a deeper emotional level, regardless of their cultural background.
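As one hedged illustration of what a balanced emotion distribution means in practice, the sketch below subsamples an affective captioning dataset so that each emotion label is equally represented. This is an illustrative example under stated assumptions, not the contrastive data collection method proposed in the thesis, which gathers new annotations rather than subsampling existing ones.

```python
import random
from collections import defaultdict

def balance_by_emotion(captions, per_emotion=None, seed=0):
    """Illustrative only: subsample an affective captioning dataset so every
    emotion label is equally represented. `captions` is assumed to be a list
    of (caption_text, emotion_label) pairs."""
    random.seed(seed)
    by_emotion = defaultdict(list)
    for text, emotion in captions:
        by_emotion[emotion].append((text, emotion))
    # Default: cap every emotion at the size of the rarest class.
    n = per_emotion or min(len(items) for items in by_emotion.values())
    balanced = []
    for items in by_emotion.values():
        balanced.extend(random.sample(items, min(n, len(items))))
    return balanced
```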
