Global ETD Search

Explainable Multimodal Fusion

Explainable predictions have recently attracted considerable interest, with new explainability approaches developed for specific data modalities such as images and text. However, explainability remains poorly understood and little explored in multimodal machine learning, where diverse data modalities are fused within the model. In this thesis project, we examine two multimodal model architectures, single-stream and dual-stream, for the Visual Entailment (VE) task, which comprises image and text modalities. The models considered are UNiversal Image-TExt Representation Learning (UNITER), Visual-Linguistic BERT (VLBERT), Vision-and-Language BERT (ViLBERT), and Learning Cross-Modality Encoder Representations from Transformers (LXMERT). Furthermore, we conduct three experiments on multimodal explainability by applying the Local Interpretable Model-agnostic Explanations (LIME) technique. Our results show that UNITER achieves the best accuracy among these models on the VE task. However, the explainability of all these models is similar.

http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-305766

Computer and Information Sciences

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-305766
Date	January 2021
Creators	Alvi, Jaweriah
Publisher	KTH, Skolan för elektroteknik och datavetenskap (EECS)
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	Swedish
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	TRITA-EECS-EX ; 2021:771
