As research in Artificial Intelligence (AI) advances, it is crucial to focus on seamless communication between humans and machines so that tasks can be accomplished effectively. Smooth human-machine communication requires the machine to be sensible and human-like when interacting with people, while also being able to extract the information it needs to accomplish the desired task. Since many of the tasks machines are asked to solve today involve understanding images, training machines to hold human-like, effective image-grounded conversations is an important step toward this goal. Although we now have agents that can answer questions asked about images, they are prone to failure from confusing input and cannot, in turn, ask clarification questions to extract the information they need from humans. Hence, as a first step, we direct our efforts toward making Visual Question Answering agents more human-like by making them resilient to confusing inputs that would not confuse humans. It is not only crucial for a machine to answer questions reasonably; it should also know how to ask questions sequentially to extract the information it needs from a human. To this end, we introduce a novel game, the Visual 20 Questions game, in which a machine tries to figure out a secret image a human has picked by holding a natural language conversation with the human. Using deep learning techniques such as recurrent neural networks and sequence-to-sequence learning, we demonstrate scalable and reasonable performance on both tasks.

Master of Science

Research in Artificial Intelligence has reached a point where computers can answer free-form natural language questions about arbitrary images in a somewhat reasonable manner. These systems are called Visual Question Answering agents. However, they are prone to failure from even slightly confusing input. For example, given an obviously irrelevant question about an image, they answer something nonsensical instead of recognizing that the question is irrelevant. Furthermore, they cannot ask questions of humans in turn, either for clarification or for more information. These shortcomings harm not only their effectiveness but also the trust human users place in them. To remedy these problems, we first direct our efforts toward making Visual Question Answering agents capable of identifying when a question is irrelevant to an image. Next, we train machines to ask questions that extract more information from humans so that they can make informed decisions. We do this by introducing a novel game, the Visual 20 Questions game, in which a machine tries to figure out a secret image a human has picked by holding a natural language conversation with the human. Deep learning techniques such as sequence-to-sequence learning with recurrent neural networks allow machines to learn how to converse from a series of conversational exchanges between two humans. Techniques such as reinforcement learning allow machines to improve based on the rewards they receive for accomplishing a task in a certain way. Using such algorithms, we demonstrate promising progress toward scalable and reasonable performance on both tasks.
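As a rough illustration of the first task (identifying irrelevant questions), the sketch below shows one way a question-relevance classifier could be built: a recurrent encoder summarizes the question, a linear layer projects precomputed CNN image features, and a small classifier decides whether the question is relevant to the image. This is a minimal PyTorch sketch with hypothetical names and dimensions, not the architecture used in the thesis.

```python
# Minimal sketch (hypothetical names/dimensions): is a question relevant to an image?
import torch
import torch.nn as nn

class QuestionRelevanceClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, img_feat_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)               # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # question encoder
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)            # project CNN image features
        self.classifier = nn.Sequential(                               # fused features -> relevance
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),                                  # {irrelevant, relevant} logits
        )

    def forward(self, question_tokens, image_features):
        _, (h, _) = self.lstm(self.embed(question_tokens))             # final hidden state of question
        q = h[-1]                                                      # (batch, hidden_dim)
        v = torch.relu(self.img_proj(image_features))                  # (batch, hidden_dim)
        return self.classifier(torch.cat([q, v], dim=1))               # relevance logits
```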
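For the Visual 20 Questions setting, a sequence-to-sequence model trained on human-human exchanges could generate the next question to ask given the dialogue so far. The sketch below is a minimal, hypothetical encoder-decoder illustration rather than the thesis's actual model; a reward signal such as eventually identifying the secret image could then be used to fine-tune it with policy-gradient reinforcement learning.

```python
# Minimal sketch (hypothetical names/dimensions): encode dialogue history, decode the next question.
import torch
import torch.nn as nn

class NextQuestionGenerator(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # dialogue-history encoder
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # question decoder
        self.out = nn.Linear(hidden_dim, vocab_size)                    # per-step word logits

    def forward(self, history_tokens, target_question_tokens):
        # Encode the conversation so far into a single hidden state.
        _, h = self.encoder(self.embed(history_tokens))
        # Teacher-forced decoding of the ground-truth next question during training.
        dec_out, _ = self.decoder(self.embed(target_question_tokens), h)
        return self.out(dec_out)                                        # (batch, steps, vocab_size)

# Training would minimize cross-entropy against the human's next question; a task reward
# (e.g., whether the secret image is ultimately identified) could drive later RL fine-tuning.
```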
Identifier | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/78335 |
Date | 12 July 2017 |
Creators | Ray, Arijit |
Contributors | Electrical and Computer Engineering, Huang, Jia-Bin, Parikh, Devi, Abbott, A. Lynn |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |