Visual Question Answering in the Medical Domain

Sharma, Dhruv 21 July 2020
Medical images are extremely difficult for a person without expertise to interpret. The limited number of practitioners across the globe often faces fatigue due to the high number of cases. This fatigue, physical and mental, can induce human errors during diagnosis. In such scenarios, an additional opinion can help boost the confidence of the decision-maker. Thus, it becomes crucial to have a reliable Visual Question Answering (VQA) system that can provide a "second opinion" on medical cases. However, most VQA systems in use today cater to real-world problems and are not specifically tailored to medical images. Moreover, a VQA system for medical images must cope with the limited amount of training data available in this domain. In this thesis, we develop a deep learning-based model for VQA on medical images that takes these challenges into account. Our MedFuseNet system aims to maximize learning with minimal complexity by breaking the problem into simpler tasks and weaving everything together to predict the answer. We tackle two types of answer prediction: categorization and generation. We conduct an extensive set of quantitative and qualitative analyses to evaluate the performance of MedFuseNet. Our results show that MedFuseNet outperforms other state-of-the-art methods available in the literature for these tasks. / Master of Science / Medical images are extremely difficult for a person without expertise to interpret. The limited number of practitioners across the globe often faces fatigue due to the high number of cases. This fatigue, physical and mental, can induce human errors during diagnosis. In such scenarios, an additional opinion can help boost the confidence of the decision-maker. Thus, it becomes crucial to have a reliable Visual Question Answering (VQA) system that can provide a "second opinion" on medical cases. 
However, most VQA systems in use today cater to real-world problems and are not specifically tailored to medical images. In this thesis, we propose an end-to-end deep learning-based system, MedFuseNet, for predicting the answer to an input query associated with an image. We handle both closed-ended and open-ended question-answer pairs. We conduct an extensive analysis to evaluate the performance of MedFuseNet. Our results show that MedFuseNet outperforms other state-of-the-art methods available in the literature for these tasks.

Advancing Chart Question Answering with Robust Chart Component Recognition

Zheng, Hanwen 13 August 2024
The task of comprehending charts [1, 2, 3] presents significant challenges for machine learning models due to the diverse and intricate shapes of charts. The chart extraction task ensures the precise identification of key components, while the chart question answering (ChartQA) task integrates visual and textual information, facilitating accurate responses to queries based on the chart's content. To approach ChartQA, this research focuses on two main aspects. Firstly, we introduce ChartFormer, an integrated framework that simultaneously identifies and classifies every chart element. ChartFormer extends beyond traditional data visualization by identifying descriptive components such as the chart title, legend, and axes, providing a comprehensive understanding of the chart's content. ChartFormer is particularly effective for complex instance segmentation tasks that involve a wide variety of class objects with unique visual structures. It utilizes an end-to-end transformer architecture, which enhances its ability to handle the intricacies of diverse and distinct object features. Secondly, we present Question-guided Deformable Co-Attention (QDCAt), which facilitates multimodal fusion by incorporating question information into a deformable offset network and enhancing visual representation from ChartFormer through a deformable co-attention block. / Master of Science / Real-world data often encompasses multimodal information, blending textual descriptions with visual representations. Charts, in particular, pose a significant challenge for machine learning models due to their condensed and complex structure. Existing multimodal methods often neglect these graphics, failing to integrate them effectively. To address this gap, we introduce ChartFormer, a unified framework designed to enhance chart understanding through instance segmentation, and a novel Question-guided Deformable Co-Attention (QDCAt) mechanism. 
This approach integrates visual and textual features for chart question answering (ChartQA), allowing for more comprehensive reasoning. ChartFormer identifies and classifies chart components such as bars, lines, pies, titles, legends, and axes. The QDCAt mechanism further enhances multimodal fusion by aligning textual information with visual cues, thereby improving answer accuracy. By dynamically adjusting attention based on the question context, QDCAt ensures that the model focuses on the most relevant parts of the chart. Extensive experiments demonstrate that ChartFormer and QDChart significantly outperform their baseline models in chart component recognition and ChartQA tasks, by 3.2% in mAP and 15.4% in accuracy, respectively. These results highlight the efficacy of our approach as a robust solution for detailed visual data interpretation, applicable to a wide range of domains, from scientific research to financial analysis and beyond.

Vision and language understanding with localized evidence

Xu, Huijuan 16 February 2019
Enabling machines to solve computer vision tasks with natural language components can greatly improve human interaction with computers. In this thesis, we address vision and language tasks with deep learning methods that explicitly localize relevant visual evidence. Spatial evidence localization in images enhances the interpretability of the model, while temporal localization in video is necessary to remove irrelevant content. We apply our methods to various vision and language tasks, including visual question answering, temporal activity detection, dense video captioning and cross-modal retrieval. First, we tackle the problem of image question answering, which requires the model to predict answers to questions posed about images. We design a memory network with a question-guided spatial attention mechanism which assigns higher weights to regions that are more relevant to the question. The visual evidence used to derive the answer can be shown by visualizing the attention weights in images. We then address the problem of localizing temporal evidence in videos. For most language/vision tasks, only part of the video is relevant to the linguistic component, so we need to detect these relevant events in videos. We propose an end-to-end model for temporal activity detection, which can detect arbitrary length activities by coordinate regression with respect to anchors and contains a proposal stage to filter out background segments, saving computation time. We further extend activity category detection to event captioning, which can express richer semantic meaning compared to a class label. This derives the problem of dense video captioning, which involves two sub-problems: localizing distinct events in long video and generating captions for the localized events. We propose an end-to-end hierarchical captioning model with vision and language context modeling in which the captioning training affects the activity localization. 
Lastly, the task of text-to-clip video retrieval requires one to localize the specified query instead of detecting and captioning all events. We propose a model based on the early fusion of words and visual features, outperforming standard approaches which embed the whole sentence before performing late feature fusion. Furthermore, we use queries to regulate the proposal network so that it generates query-related proposals. In conclusion, our proposed visual localization mechanism applies across a variety of vision and language tasks and achieves state-of-the-art results. Together with the inference module, our work can contribute to solving other tasks such as video question answering in future research.
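
The question-guided spatial attention described in this abstract can be illustrated with a minimal sketch. It is not the thesis's memory network: it assumes each image region and the question are already embedded as vectors of the same size, scores each region by a dot product with the question, and normalizes the scores with a softmax. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_guided_attention(region_features, question_vector):
    """Weight image regions by their relevance to the question.

    Assumption: relevance is the dot product between a region embedding
    and the question embedding (one simple choice among many).
    Returns the attention weights and the weighted sum of region features.
    """
    scores = [
        sum(r * q for r, q in zip(region, question_vector))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(question_vector)
    attended = [
        sum(w * region[k] for w, region in zip(weights, region_features))
        for k in range(dim)
    ]
    return weights, attended
```

Plotting the returned weights over the image grid gives the kind of visualization the abstract refers to: regions with larger weights are the visual evidence the model relied on.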

Using Deep Learning to Answer Visual Questions from Blind People / Användning av Deep Learning för att Svara på Visuella Frågor från Blinda

Dushi, Denis January 2019
A natural application of artificial intelligence is to help blind people overcome their daily visual challenges through AI-based assistive technologies. In this regard, one of the most promising tasks is Visual Question Answering (VQA): the model is presented with an image and a question about this image, and must then predict the correct answer. The VizWiz dataset, a collection of images and questions originating from blind people, was recently introduced. Being the first VQA dataset deriving from a natural setting, VizWiz presents many limitations and peculiarities. More specifically, the characteristics observed are the high uncertainty of the answers, the conversational aspect of the questions, the relatively small size of the dataset and, ultimately, the imbalance between answerable and unanswerable classes. These characteristics could be observed, individually or jointly, in other VQA datasets, making the VQA task harder to solve. Data science pre-processing techniques are particularly suitable for addressing these aspects of the data. Therefore, to provide a solid contribution to the VQA task, we answered the research question “Can data science pre-processing techniques improve the VQA task?” by proposing and studying the effects of four different pre-processing techniques. To address the high uncertainty of answers, we employed a pre-processing step in which the uncertainty of each answer is computed, and used this measure to weight the soft scores of our model during training. The adoption of an “uncertainty-aware” training procedure boosted the predictive accuracy of our model by 10%, providing a new state of the art when evaluated on the test split of the VizWiz dataset. To overcome the limited amount of data, we designed and tested a new pre-processing procedure that augments the training set and almost doubles its data points by computing the cosine similarity between answer representations. 
We also addressed the conversational aspect of questions collected from real-world verbal conversations by proposing an alternative question pre-processing pipeline in which conversational terms are removed. This led to a further improvement: from a predictive accuracy of 0.516 with the standard question processing pipeline, we were able to achieve a predictive accuracy of 0.527 with the new pre-processing pipeline. Ultimately, we addressed the imbalance between answerable and unanswerable classes when predicting the answerability of a visual question. We tested two standard pre-processing techniques to adjust the dataset class distribution: oversampling and undersampling. Oversampling provided a small improvement in both average precision and F1 score. / A natural application of artificial intelligence is to help blind people with their daily visual challenges through AI-based assistive technology. In this regard, one of the most promising tasks is Visual Question Answering (VQA): the model is presented with an image and a question about this image, and must then predict the correct answer. The VizWiz dataset, a collection of images and accompanying questions from blind people, was recently introduced. As this is the first VQA dataset that originates from a natural setting, it has many limitations and peculiarities. More specifically, the observed characteristics are: high uncertainty in the answers, an informal conversational tone in the questions, a relatively small amount of data and, finally, an imbalance between answerable and unanswerable classes. These characteristics can also be observed, individually or together, in other VQA datasets, which poses particular challenges when solving the VQA task. Pre-processing techniques from the field of data science are particularly suitable for handling these aspects of the data. 
To contribute to the VQA task, we therefore answered the question “Can pre-processing techniques from the field of data science contribute to solving the VQA task?” by proposing and studying the effect of four different pre-processing techniques. To handle the high uncertainty in the answers, we used a pre-processing step in which we computed the uncertainty of each answer and used this measure to weight the model's output values during training. The use of an “uncertainty-aware” training procedure increased the predictive accuracy of our model by 10%. With this we reached a top result when the model was evaluated on the test split of the VizWiz dataset. To overcome the problem of the limited amount of data, we designed and tested a new pre-processing procedure that almost doubles the data points by computing the cosine similarity between the answers' vectors. We also handled the problem of the informal conversational tone in the questions, which were collected from real-world verbal conversations, by proposing an alternative way of pre-processing the questions in which conversational terms are removed. This led to a further improvement: from a predictive accuracy of 0.516 with the usual way of processing the questions, we were able to reach a predictive accuracy of 0.527 when using the new way of pre-processing the questions. Finally, we handled the imbalance between answerable and unanswerable classes by predicting whether a visual question has a possible answer. We tested two standard pre-processing techniques to adjust the class distribution of the dataset: oversampling and undersampling. Oversampling gave a small improvement in both average precision and F1 score.
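
Of the pre-processing techniques listed in this abstract, random oversampling is the simplest to show in code. The sketch below is a generic version, not the thesis's pipeline: it assumes each training example carries a class label (for instance "answerable" or "unanswerable") and duplicates randomly chosen examples from the smaller classes until every class is as large as the largest one. The function name and the `label_of` callback are hypothetical.

```python
import random


def oversample(examples, label_of, rng):
    """Balance classes by duplicating random minority-class examples.

    examples: list of training examples (any type)
    label_of: function mapping an example to its class label
    rng:      a random.Random instance, passed in for reproducibility
    """
    groups = {}
    for example in examples:
        groups.setdefault(label_of(example), []).append(example)

    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw with replacement until this class reaches the target size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))

    rng.shuffle(balanced)
    return balanced
```

Undersampling is the mirror image: instead of duplicating minority examples, it discards majority examples until the classes match in size.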

Transfer Learning and Attention Mechanisms in a Multimodal Setting

Greco, Claudio 13 May 2022
Humans are able to develop a solid knowledge of the world around them: they can leverage information coming from different sources (e.g., language, vision), focus on the most relevant information from the input they receive in a given life situation, and exploit what they have learned before without forgetting it. In the field of Artificial Intelligence and Computational Linguistics, replicating these human abilities in artificial models is a major challenge. Recently, models based on pre-training and on attention mechanisms, namely pre-trained multimodal Transformers, have been developed. They seem to perform tasks surprisingly well compared to other computational models in multiple contexts. They simulate a human-like cognition in that they supposedly rely on previously acquired knowledge (transfer learning) and focus on the most important information (attention mechanisms) of the input. Nevertheless, we still do not know whether these models can deal with multimodal tasks that require merging different types of information simultaneously to be solved, as humans would do. This thesis attempts to fill this crucial gap in our knowledge of multimodal models by investigating the ability of pre-trained Transformers to encode multimodal information; and the ability of attention-based models to remember how to deal with previously-solved tasks. With regards to pre-trained Transformers, we focused on their ability to rely on pre-training and on attention while dealing with tasks requiring to merge information coming from language and vision. 
More precisely, we investigate if pre-trained multimodal Transformers are able to understand the internal structure of a dialogue (e.g., organization of the turns); to effectively solve complex spatial questions requiring to process different spatial elements (e.g., regions of the image, proximity between elements, etc.); and to make predictions based on complementary multimodal cues (e.g., guessing the most plausible action by leveraging the content of a sentence and of an image). The results of this thesis indicate that pre-trained Transformers outperform other models. Indeed, they are able to some extent to integrate complementary multimodal information; they manage to pinpoint both the relevant turns in a dialogue and the most important regions in an image. These results suggest that pre-training and attention play a key role in pre-trained Transformers’ encoding. Nevertheless, their way of processing information cannot be considered as human-like. Indeed, when compared to humans, they struggle (as non-pre-trained models do) to understand negative answers, to merge spatial information in difficult questions, and to predict actions based on complementary linguistic and visual cues. With regards to attention-based models, we found out that these kinds of models tend to forget what they have learned in previously-solved tasks. However, training these models on easy tasks before more complex ones seems to mitigate this catastrophic forgetting phenomenon. These results indicate that, at least in this context, attention-based models (and, supposedly, pre-trained Transformers too) are sensitive to tasks’ order. A better control of this variable may therefore help multimodal models learn sequentially and continuously as humans do.

Leveraging Multimodal Perspectives to Learn Common Sense for Vision and Language Tasks

Lin, Xiao 05 October 2017
Learning and reasoning with common sense is a challenging problem in Artificial Intelligence (AI). Humans have the remarkable ability to interpret images and text from different perspectives in multiple modalities, and to use large amounts of commonsense knowledge while performing visual or textual tasks. Inspired by that ability, we approach commonsense learning as leveraging perspectives from multiple modalities for images and text in the context of vision and language tasks. Given a target task (e.g., textual reasoning, matching images with captions), our system first represents input images and text in multiple modalities (e.g., vision, text, abstract scenes and facts). Those modalities provide different perspectives to interpret the input images and text. And then based on those perspectives, the system performs reasoning to make a joint prediction for the target task. Surprisingly, we show that interpreting textual assertions and scene descriptions in the modality of abstract scenes improves performance on various textual reasoning tasks, and interpreting images in the modality of Visual Question Answering improves performance on caption retrieval, which is a visual reasoning task. With grounding, imagination and question-answering approaches to interpret images and text in different modalities, we show that learning commonsense knowledge from multiple modalities effectively improves the performance of downstream vision and language tasks, improves interpretability of the model and is able to make more efficient use of training data. Complementary to the model aspect, we also study the data aspect of commonsense learning in vision and language. We study active learning for Visual Question Answering (VQA) where a model iteratively grows its knowledge through querying informative questions about images for answers. 
Drawing analogies from human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a new goal-driven scoring function for deep VQA models under the Bayesian Neural Network framework. Once trained with a large initial training set, a deep VQA model is able to efficiently query informative question-image pairs for answers to improve itself through active learning, saving human effort on commonsense annotations. / Ph. D. / Designing systems that learn and reason with common sense is a challenging problem in Artificial Intelligence (AI). Humans have the remarkable ability to interpret images and text from different perspectives in multiple modalities, and to use large amounts of commonsense knowledge while performing visual or textual tasks. Inspired by that ability, we approach commonsense learning as leveraging perspectives from multiple modalities for images and text in the context of vision and language tasks. Given a target task, our system first represents the input information (e.g., images and text) in multiple modalities (e.g., vision, text, abstract scenes and facts). Those modalities provide different perspectives to interpret the input information. Based on those perspectives, the system performs reasoning to make a joint prediction to solve the target task. Perhaps surprisingly, we show that imagining (generating) abstract scenes behind input textual scene descriptions improves performance on various textual reasoning tasks such as answering fill-in-the-blank and paraphrasing questions, and answering questions about images improves performance on retrieving image captions. Through the use of perspectives from multiple modalities, our system also makes use of training data more efficiently and has a reasoning process that is easy to understand. 
Complementary to the system design aspect, we also study the data aspect of commonsense learning in vision and language. We study active learning for Visual Question Answering (VQA). VQA is the task of answering open-ended natural language questions about images. In active learning for VQA, a model iteratively grows its knowledge through querying informative questions about images for answers. Inspired by human learning, we explore cramming (entropy), curiosity-driven (expected model change), and goal-driven (expected error reduction) active learning approaches, and propose a new goal-driven query selection function. We show that once initialized with a large training set, a VQA model is able to efficiently query informative question-image pairs for answers to improve itself through active learning, saving human effort on commonsense annotations.
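
The entropy-based ("cramming") selection strategy named in this abstract can be sketched as follows. This is a generic uncertainty-sampling example, not the thesis's scoring function: it assumes the model has already produced a probability distribution over answers for each unlabeled question-image pair, and it requests labels for the pairs with the highest predictive entropy. The function names and the `pool` layout are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_queries(pool, k):
    """Pick the k unlabeled items the model is least certain about.

    pool: dict mapping an item id to the model's predicted answer
          distribution for that question-image pair
    """
    ranked = sorted(pool, key=lambda item: entropy(pool[item]), reverse=True)
    return ranked[:k]
```

The curiosity-driven and goal-driven strategies mentioned in the abstract replace the entropy score with expected model change and expected error reduction, respectively; only the scoring function changes, while the select-label-retrain loop stays the same.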

Zodpovídání dotazů o obrázcích / Visual Question Answering

Hajič, Jakub January 2017
Visual Question Answering (VQA) is a recently proposed multimodal task in the general area of machine learning. The input to this task consists of a single image and an associated natural language question, and the output is the answer to that question. In this thesis we propose two incremental modifications to an existing model which won the VQA Challenge in 2016 using multimodal compact bilinear pooling (MCB), a novel way of combining modalities. First, we add a language attention mechanism, and on top of that we introduce an image attention mechanism focusing on objects detected in the image ("region attention"). We also experiment with ways of combining these in a single end-to-end model. The thesis describes the MCB model, our extensions, and their two different implementations, and evaluates them on the original VQA challenge dataset for direct comparison with the original work.
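
The idea behind compact bilinear pooling can be shown with a small sketch. It assumes the usual count-sketch formulation: each input vector is projected to `d` dimensions with a random hash and random signs, and the two sketches are combined by circular convolution, which equals the count sketch of the full outer product. Practical implementations typically compute the convolution with FFTs; the direct double loop here is for clarity only. All function names are hypothetical and nothing below is taken from the thesis's code.

```python
import random


def make_hash(n, d, rng):
    """Draw a random bucket in [0, d) and a random sign for each of n inputs."""
    buckets = [rng.randrange(d) for _ in range(n)]
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    return buckets, signs


def count_sketch(x, buckets, signs, d):
    """Project x to d dimensions by adding signed entries into hashed buckets."""
    out = [0.0] * d
    for i, value in enumerate(x):
        out[buckets[i]] += signs[i] * value
    return out


def circular_convolution(a, b):
    """Direct O(d^2) circular convolution of two equal-length vectors."""
    d = len(a)
    out = [0.0] * d
    for i in range(d):
        for j in range(d):
            out[(i + j) % d] += a[i] * b[j]
    return out


def compact_bilinear(x, y, hx, sx, hy, sy, d):
    """Combine two modalities without forming the len(x) * len(y) outer product."""
    return circular_convolution(count_sketch(x, hx, sx, d),
                                count_sketch(y, hy, sy, d))


def sketch_of_outer_product(x, y, hx, sx, hy, sy, d):
    """Reference: count sketch applied to the explicit outer product of x and y."""
    out = [0.0] * d
    for i, xv in enumerate(x):
        for j, yv in enumerate(y):
            out[(hx[i] + hy[j]) % d] += sx[i] * sy[j] * xv * yv
    return out
```

`compact_bilinear` and `sketch_of_outer_product` return the same vector for the same hashes, which is why the pooled feature keeps the pairwise interactions of the outer product while staying at a fixed size `d`.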

Reducing Training Time in Text Visual Question Answering

Behboud, Ghazale 15 July 2022
Artificial Intelligence (AI) and Computer Vision (CV) have brought the promise of many applications along with many challenges to solve. The majority of current AI research has been dedicated to single-modal data processing, meaning that only one modality is used, as in visual recognition or text recognition. However, real-world challenges often combine different modalities of data such as text, audio and images. This thesis focuses on the Visual Question Answering (VQA) problem, which is a significant multi-modal challenge. A VQA system is a computer vision system that, when given a question about an image, answers based on an understanding of both the question and the image. The goal is to reduce the training time of VQA models. In this thesis, Look, Read, Reason and Answer (LoRRA), which is a state-of-the-art architecture, is used as the base model. Then, Reduce Uni-modal Biases (RUBi) is applied to this model to reduce the importance of uni-modal biases in training. Finally, an early stopping strategy is employed to stop the training process once the model accuracy has converged, to prevent the model from overfitting. Numerical results are presented which show that training LoRRA with RUBi and early stopping can converge in less than 5 hours. The impact of the batch size, learning rate and warm-up hyperparameters is also investigated, and experimental results are presented. / Graduate
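
An early stopping rule of the kind described in this abstract can be sketched as follows. The abstract does not state the exact criterion used, so this example assumes a common patience-based rule: training stops once validation accuracy has not improved by more than `min_delta` for `patience` consecutive epochs. The function name and both parameters are hypothetical.

```python
def early_stopping_epoch(validation_accuracies, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch, best_accuracy) for a run.

    validation_accuracies: accuracy measured after each epoch, in order
    patience:  epochs to wait without improvement before stopping (assumed)
    min_delta: smallest increase that counts as an improvement (assumed)
    """
    best_accuracy = float("-inf")
    best_epoch = -1
    for epoch, accuracy in enumerate(validation_accuracies):
        if accuracy > best_accuracy + min_delta:
            best_accuracy, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch, best_accuracy
    # Ran out of epochs before the patience budget was exhausted.
    return len(validation_accuracies) - 1, best_epoch, best_accuracy
```

In a real training loop the checkpoint saved at `best_epoch` would be restored after stopping, so that later, possibly overfitted, weights are discarded.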

Visual question answering with modules and language modeling

Pahuja, Vardaan 04 1900
No description available.

On sample efficiency and systematic generalization of grounded language understanding with deep learning

Bahdanau, Dzmitry 01 1900
Using the methodology of deep learning, which advocates relying more on data and flexible neural models than on the domain expert's knowledge, the research community has recently made remarkable progress in natural language understanding and generation. Nevertheless, it remains unclear whether a simple extension of existing deep learning methods will be sufficient to achieve the goal of using natural language for human-machine interaction. We focus on two related aspects in which current methods seem to require major improvements. The first of these aspects is the statistical inefficiency of deep learning systems: they are known to require large amounts of data to work well. The second aspect is their limited ability to generalize systematically, namely to understand language in situations where the data distribution changes but the principles of syntax and semantics remain the same. In this thesis, we present four case studies in which we seek to bring more clarity regarding the aforementioned statistical efficiency and systematic generalization aspects of deep learning approaches to language understanding, as well as to facilitate further work on these topics. In order to separate the problem of representing real-world knowledge from the problem of learning a language, we conduct all these studies using synthetic languages grounded in simple visual environments. In the first article, we study how to train agents to follow compositional instructions in environments with a restricted form of supervision. 
Namely, for each instruction and initial configuration of the environment, we provide only a target state instead of a complete trajectory with actions at every step. We adapt adversarial imitation learning methods to this setting and demonstrate that such a restricted form of data is sufficient to learn the compositional meanings of the instructions. Our second article also focuses on agents that learn to execute instructions. We develop the BabyAI platform to facilitate more extensive and more rigorous studies of this learning setup. The platform provides a compositional BabyAI language with $10^{19}$ instructions, whose semantics is precisely defined in a partially observable environment. We report baseline results on the amount of supervision needed to teach the agent certain subsets of the BabyAI language with different training methods, such as reinforcement learning and imitation learning. In the third article, we study the systematic generalization of visual question answering (VQA) models. In the VQA scenario, the system must answer compositional questions about images. We build a dataset of spatial questions about pairs of objects and evaluate the performance of different models on questions about pairs of objects that never occurred in the same question in the training distribution. We show that models in which word meanings are represented by separate modules that perform independent computations generalize much better than models whose design is not explicitly modular. 
However, modular models generalize well only when the modules are connected in an appropriate layout, and our experiments highlight the challenges of learning the layout through end-to-end learning on the training distribution. In our fourth and final article, we also study the generalization of VQA models to questions outside the training distribution, but this time using the CLEVR dataset, which is used for complex questions about 3D-rendered scenes. We generate new CLEVR-style questions using similarity-based references (for example "the ball that has the same color as ...") in contexts that occur in CLEVR questions but only with location-based references (for example "the ball that is to the left of ..."). We analyze zero-shot and few-shot generalization to CLOSURE after training on CLEVR for a number of existing models as well as a new model. / By using the methodology of deep learning that advocates relying more on data and flexible neural models rather than on the expert's knowledge of the domain, the research community has recently achieved remarkable progress in natural language understanding and generation. Nevertheless, it remains unclear whether simply scaling up existing deep learning methods will be sufficient to achieve the goal of using natural language for human-computer interaction. We focus on two related aspects in which current methods appear to require major improvements. The first such aspect is the data inefficiency of deep learning systems: they are known to require extreme amounts of data to perform well. The second aspect is their limited ability to generalize systematically, namely to understand language in situations when the data distribution changes yet the principles of syntax and semantics remain the same. 
In this thesis, we present four case studies in which we seek to provide more clarity regarding the aforementioned data efficiency and systematic generalization aspects of deep learning approaches to language understanding, as well as to facilitate further work on these topics. In order to separate the problem of representing open-ended real-world knowledge from the problem of core language learning, we conduct all these studies using synthetic languages that are grounded in simple visual environments. In the first article, we study how to train agents to follow compositional instructions in environments with a restricted form of supervision. Namely, for every instruction and initial environment configuration we only provide a goal state instead of a complete trajectory with actions at all steps. We adapt adversarial imitation learning methods to this setting and demonstrate that such a restricted form of data is sufficient to learn compositional meanings of the instructions. Our second article also focuses on instruction following. We develop the BabyAI platform to facilitate further, more extensive and rigorous studies of this setup. The platform features a compositional Baby language with $10^{19}$ instructions, whose semantics is precisely defined in a partially observable gridworld environment. We report baseline results on how much supervision is required to teach the agent certain subsets of Baby language with different training methods, such as reinforcement learning and imitation learning. In the third article, we study systematic generalization of visual question answering (VQA) models. In the VQA setting, the system must answer compositional questions about images. We construct a dataset of spatial questions about object pairs and evaluate how well different models perform on questions about pairs of objects that never occurred in the same question in the training distribution. 
We show that models in which word meanings are represented by separate modules that perform independent computation generalize much better than models whose design is not explicitly modular. The modular models, however, generalize well only when the modules are connected in an appropriate layout, and our experiments highlight the challenges of learning the layout by end-to-end learning on the training distribution. In our fourth and final article, we also study generalization of VQA models to questions outside of the training distribution, but this time using the popular CLEVR dataset of complex questions about 3D-rendered scenes as the platform. We generate novel CLEVR-like questions by using similarity-based references (e.g. "the ball that has the same color as ...") in contexts that occur in CLEVR questions but only with location-based references (e.g. "the ball that is to the left of ..."). We analyze zero-shot and few-shot generalization to CLOSURE after training on CLEVR for a number of existing models as well as a novel one.
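
The evaluation protocol of the third article, testing on object pairs that never appeared together in a training question, can be illustrated with a small split routine. This is only a sketch under simple assumptions: every unordered pair of object names is assigned at random to either the training or the test side, and the held-out fraction is a free parameter. The function name, its parameters and the object names used with it are hypothetical and are not taken from the thesis's dataset.

```python
import random
from itertools import combinations


def split_by_object_pairs(objects, held_out_fraction, rng):
    """Partition all unordered object pairs into train and test sets.

    Questions would then be generated only from train pairs for training
    and only from test pairs for evaluation, so every test question
    combines two objects that were never asked about together.
    """
    pairs = list(combinations(sorted(objects), 2))
    rng.shuffle(pairs)
    held_out_count = int(len(pairs) * held_out_fraction)
    test_pairs = set(pairs[:held_out_count])
    train_pairs = set(pairs[held_out_count:])
    return train_pairs, test_pairs
```

Because each individual object can still appear in training questions through other pairs, a model that fails on the test pairs is failing to recombine known words, which is the systematic generalization gap the abstract describes.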

