1
Data Augmentation with Seq2Seq Models. Granstedt, Jason Louis, 06 July 2017.
Paraphrase sparsity is an issue that complicates the training process of question answering systems: syntactically diverse but semantically equivalent sentences can have significant disparities in predicted output probabilities. We propose a method for generating an augmented paraphrase corpus for a visual question answering system to make it more robust to paraphrases. This corpus is generated by chaining two sequence-to-sequence models. To generate diverse paraphrases, we sample from the network using diverse beam search. We evaluate the results on the standard VQA validation set.
Our approach results in a significantly expanded training dataset and vocabulary size, but performs slightly worse when tested on the validation split. Although not as fruitful as we had hoped, our work highlights additional avenues for investigation into selecting more optimal model parameters and developing a more sophisticated paraphrase filtering algorithm. The primary contributions of this work are the demonstration that decent paraphrases can be generated from sequence-to-sequence models and the development of a pipeline for constructing an augmented dataset. / Master of Science / For a machine, processing language is hard. All possible combinations of words in a language far exceed a computer's ability to directly memorize them. Thus, generalizing language into a form that a computer can reason with is necessary for a machine to understand raw human input. Various advancements in machine learning have been particularly impressive in this regard. However, they require a corpus, or a body of information, in order to learn. Collecting this corpus is typically expensive and time-consuming, and it does not necessarily contain all of the information that a system would need to know: the machine would not know how to handle a word it has never seen before, for example.
This thesis examines the possibility of using a large, general corpus to expand the vocabulary size of a specialized corpus in order to improve performance on a specific task. We use Seq2Seq models, a recent development in neural networks that has seen great success in translation tasks, to do so. The Seq2Seq model is trained on the general corpus to learn the language and is then applied to the specialized corpus to generate paraphrases in the format of the specialized corpus. With this approach we were able to significantly expand the volume and vocabulary size of the specialized corpus, we demonstrated that decent paraphrases can be generated from Seq2Seq models, and we developed a pipeline for augmenting other specialized datasets.
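As an illustration of the kind of paraphrase-generation pipeline described in this abstract, the sketch below decodes a pretrained sequence-to-sequence model with diverse beam search. The checkpoint name, beam settings, and the naive duplicate filter are assumptions for illustration; they are not the models, corpora, or filtering rules used in the thesis.

```python
# Hedged sketch: paraphrase generation by decoding a sequence-to-sequence model
# with diverse beam search. The checkpoint and settings are illustrative only.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "t5-small"  # assumed placeholder; any seq2seq paraphraser could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def diverse_paraphrases(question: str, n: int = 4) -> list[str]:
    """Return up to n diverse paraphrase candidates for one question."""
    inputs = tokenizer("paraphrase: " + question, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=2 * n,          # total beams, split evenly across groups
        num_beam_groups=n,        # diverse beam search: one group per returned sequence
        diversity_penalty=0.5,    # discourage groups from repeating each other's tokens
        num_return_sequences=n,
        max_new_tokens=32,
    )
    texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Naive filter: drop exact duplicates and copies of the original question.
    return [t for t in dict.fromkeys(texts)
            if t.strip().lower() != question.strip().lower()]

if __name__ == "__main__":
    print(diverse_paraphrases("What color is the umbrella?"))
```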
2
Robustness Analysis of Visual Question Answering Models by Basic Questions. Huang, Jia-Hong, 11 1900.
Visual Question Answering (VQA) models should have both high robustness and high accuracy. Unfortunately, most current VQA research focuses only on accuracy because there is a lack of proper methods to measure the robustness of VQA models. Our algorithm has two main modules. Given a natural language question about an image, the first module takes the question as input and outputs the basic questions of the given main question, ranked by similarity score. The second module takes the main question, the image, and these basic questions as input and outputs the text-based answer to the main question about the given image. We claim that a robust VQA model is one whose performance does not change much when related basic questions are also made available to it as input. We formulate basic question generation as a LASSO optimization problem, and we also propose a large-scale Basic Question Dataset (BQD) and Rscore, a novel robustness measure, for analyzing the robustness of VQA models. We hope our BQD will be used as a benchmark to evaluate the robustness of VQA models, helping the community build more robust and accurate VQA models.
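The abstract only names a LASSO optimization, so the following is a plausible sketch rather than the paper's exact formulation: it assumes each question is embedded as a fixed-length vector and that candidate basic questions are ranked by the sparse non-negative weights that best reconstruct the main question's embedding. The `embed` helper and the regularization strength are hypothetical.

```python
# Hedged sketch: rank candidate basic questions against a main question via LASSO.
# Assumption: scores are the sparse non-negative weights that best reconstruct the
# main question's embedding from the candidates' embeddings.
import numpy as np
from sklearn.linear_model import Lasso

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence embedder; stands in for whatever encoder is used."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.normal(size=300)

def rank_basic_questions(main_q: str, candidates: list[str], alpha: float = 0.1):
    b = embed(main_q)                                      # target vector, shape (d,)
    A = np.stack([embed(q) for q in candidates], axis=1)   # dictionary, shape (d, K)
    # Solve min_x 1/(2d) * ||A x - b||^2 + alpha * ||x||_1  subject to x >= 0.
    lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False, max_iter=10_000)
    lasso.fit(A, b)
    scores = lasso.coef_                                   # one weight per candidate
    order = np.argsort(scores)[::-1]
    return [(candidates[i], float(scores[i])) for i in order if scores[i] > 0]

if __name__ == "__main__":
    basics = ["Is there a man?", "Is it raining?", "What is the man holding?"]
    print(rank_basic_questions("What color is the man's umbrella?", basics))
```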
3
[en] A DATA ANNOTATION APPROACH USING LARGE LANGUAGE MODELS / [pt] UMA ABORDAGEM PARA ANOTAÇÃO DE DADOS UTILIZANDO GRANDES MODELOS DE LINGUAGEM. Carlos Vinicios Martins Rocha, 17 October 2024.
[pt] Documents are essential to the economic and academic system; however, exploring them can be a complex and time-consuming task. One approach to getting around this problem is the use of Visual Question Answering (VQA) models to extract information from documents through natural language prompts. In VQA, as in the development of the most varied models, annotated data is required for the training and validation stage. However, creating these datasets is challenging due to the high cost involved in the process. Based on this, we propose a four-step process that combines computer vision models and Large Language Models (LLMs) for annotating VQA data in financial reports. The proposed method begins by recognizing the textual structure of the documents through Document Layout Analysis and Table Structure Extraction models. It then uses two distinct LLMs for the generation and evaluation stage of the generated question-and-answer pairs, automating the construction and selection of the best pairs to compose the final dataset. To evaluate the proposed method, we generate a dataset to train and evaluate VQA-specialist models. / [en] Documents are essential for the economic and academic system; however, exploring them can be complex and time-consuming. One approach to overcoming this problem is the use of Visual Question Answering (VQA) models to extract information from documents through natural language prompts. In VQA, as for the development of many other models, it is necessary to have annotated data for training and validation. However, creating these datasets is challenging due to the high cost involved in the process. To face this challenge, we propose a four-step process that combines computer vision models and Large Language Models (LLMs) for VQA data annotation in financial reports. The proposed method starts with recognizing the textual structure of documents through Document Layout Analysis and Table Structure Extraction models. Then, it uses two distinct LLMs for the generation and evaluation of question and answer pairs, automating the construction and selection of the best pairs to compose the final dataset. To evaluate the proposed method, we generate a dataset to train and evaluate VQA-specialized models.
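As a rough illustration of the generate-then-evaluate step described above, the sketch below pairs a "generator" LLM with a second "judge" LLM to build and filter question-answer pairs from extracted document text. The `generator` and `judge` callables, the prompts, and the 0.7 acceptance threshold are hypothetical placeholders, not the thesis's actual models or prompts; in practice the excerpt would come from the Document Layout Analysis and Table Structure Extraction stages mentioned in the abstract.

```python
# Hedged sketch of a generate-then-judge annotation loop for VQA pairs.
# `generator` and `judge` are hypothetical stand-ins for two distinct LLM clients;
# prompts, the JSON schema, and the acceptance threshold are illustrative only.
import json
from typing import Callable

def build_vqa_pairs(page_text: str,
                    generator: Callable[[str], str],
                    judge: Callable[[str], str],
                    threshold: float = 0.7) -> list[dict]:
    gen_prompt = (
        "Given this financial-report excerpt, propose question-answer pairs as a "
        'JSON list of objects with "question" and "answer" fields:\n' + page_text
    )
    candidates = json.loads(generator(gen_prompt))

    accepted = []
    for pair in candidates:
        judge_prompt = (
            "Rate from 0 to 1 how well the answer is supported by the excerpt. "
            "Reply with only the number.\n"
            f"Excerpt:\n{page_text}\n"
            f"Question: {pair['question']}\nAnswer: {pair['answer']}"
        )
        score = float(judge(judge_prompt))
        if score >= threshold:  # keep only pairs the second LLM endorses
            accepted.append({**pair, "judge_score": score})
    return accepted
```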
4
Improving Visual Question Answering by Leveraging Depth and Adapting Explainability / Förbättring av Visual Question Answering (VQA) genom utnyttjandet av djup och anpassandet av förklaringsförmågan. Panesar, Amrita Kaur, January 2022.
To produce smooth human-robot interactions, it is important for robots to be able to answer users' questions accurately and to provide a suitable explanation for why they arrive at the answer they provide. In the wild, however, the user may ask the robot questions about aspects of the scene that the robot is unfamiliar with, so the robot will not always be able to answer correctly. To build trust in the robot and to resolve failure cases where an incorrect answer is provided, we propose a method that uses Grad-CAM explainability on RGB-D data. Depth is a critical component in producing more intelligent robots that can respond correctly more of the time, since some questions rely on spatial relations within the scene for which 2D RGB data alone is insufficient. To our knowledge, this work is the first of its kind to leverage depth and an explainability module to produce an explainable Visual Question Answering (VQA) system. Furthermore, we introduce a new dataset for the task of VQA on RGB-D data, VQA-SUNRGBD. We evaluate our explainability method against Grad-CAM on RGB data and find that ours produces better visual explanations. When we compare our proposed model on RGB-D data against the baseline VQN network on RGB data alone, we show that ours performs better, particularly on questions relating to depth, such as the proximity of objects and the relative positions of objects to one another. / To create smooth interactions between human and robot, it is important for robots to be able to answer users' questions correctly and to give a suitable explanation for why they arrive at the answer they give. In the wild, however, the user may ask the robot questions concerning aspects of the environment that the robot is unfamiliar with, and it will therefore not be able to answer correctly all of the time. To gain trust in the robot and resolve the failure cases where an incorrect answer is given, we propose a method that uses Grad-CAM explainability on RGB-D data. Depth is a critical component for producing more intelligent robots that can answer correctly most of the time, since some questions may rely on spatial relations within the scene for which 2D RGB data alone would be insufficient. As far as we know, this work is the first of its kind to exploit depth and an explanation module to produce an explainable Visual Question Answering (VQA) system. In addition, we introduce a new dataset for the task of VQA on RGB-D data, VQA-SUNRGBD. We evaluate our explanation method against Grad-CAM on RGB data and find that our model gives better visual explanations. When we compare our proposed model on RGB-D data against the baseline VQN network on RGB data alone, we show that our model performs better, especially on questions concerning depth, such as asking about the proximity of objects and the relative positions of objects with respect to one another.
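For context, a minimal Grad-CAM sketch in the spirit of the explainability module described above follows; the model's call signature (an RGB-D image tensor plus question tokens in, answer logits out), the chosen target layer, and the four-channel input are assumptions for illustration, not the thesis's actual architecture.

```python
# Hedged sketch of Grad-CAM over a convolutional layer of an RGB-D VQA model.
# The model's call signature, the chosen target layer, and the 4-channel input
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def grad_cam(model: torch.nn.Module, target_layer: torch.nn.Module,
             image_rgbd: torch.Tensor, question_tokens: torch.Tensor,
             answer_index: int) -> torch.Tensor:
    """Return an (H, W) heatmap of regions supporting the chosen answer logit."""
    activations, gradients = [], []
    fwd = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    bwd = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    try:
        logits = model(image_rgbd, question_tokens)   # assumed output: (1, num_answers)
        model.zero_grad()
        logits[0, answer_index].backward()
    finally:
        fwd.remove()
        bwd.remove()

    acts, grads = activations[0], gradients[0]             # both (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)         # per-channel importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image_rgbd.shape[-2:],   # upsample to input size
                        mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8))[0, 0].detach()       # normalize to [0, 1]
```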