About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Commonsense for Zero-Shot Natural Language Video Localization

Holla, Meghana 07 July 2023
Zero-shot Natural Language Video Localization (NLVL) has shown promising results in training NLVL models solely with raw video data through dynamic video segment proposal generation and pseudo-query annotations. However, existing pseudo-queries lack grounding in the source video and suffer from a lack of common ground due to their unstructured nature. In this work, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that utilizes commonsense information to bridge the gap between videos and generated pseudo-queries through a commonsense enhancement module. Our approach employs Graph Convolutional Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query vectors prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that our model surpasses both zero-shot and weakly supervised baselines. These results underscore the significance of leveraging commonsense reasoning abilities in multimodal understanding tasks. / Master of Science / Natural Language Video Localization (NLVL) is the task of retrieving relevant video segments from an untrimmed video given a user text query. To train an NLVL system, traditional methods demand annotations on the input videos, which include video segment spans (i.e., start and end timestamps) and the accompanying text query describing the segment. These annotations are laborious to collect for any domain and video length. To alleviate this, zero-shot NLVL methods generate the aforementioned annotations dynamically. However, current zero-shot NLVL approaches suffer from poor alignment between the video and the dynamically generated query, which can introduce noise in the localization process. To address this, this work investigates the impact of implicit commonsense knowledge, which humans innately possess, on zero-shot NLVL. We introduce CORONET, a zero-shot NLVL framework that utilizes commonsense information to bridge the gap between videos and generated pseudo-queries. Experiments on two benchmark datasets, containing videos with diverse themes, highlight the effectiveness of leveraging commonsense information.
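To make the cross-attention enhancement concrete, below is a minimal sketch in which video segment features attend to commonsense embeddings (e.g., GCN outputs over knowledge-graph nodes). The module name, the single attention block, and the tensor shapes are illustrative assumptions, not the actual CORONET implementation.

```python
import torch
import torch.nn as nn

class CrossModalEnhancer(nn.Module):
    """Illustrative cross-attention block: video features attend to
    commonsense node embeddings (e.g., GCN outputs over a knowledge graph)."""

    def __init__(self, dim: int = 512, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, commonsense: torch.Tensor) -> torch.Tensor:
        # video:       (batch, num_segments, dim)
        # commonsense: (batch, num_kg_nodes, dim)
        enhanced, _ = self.attn(query=video, key=commonsense, value=commonsense)
        # Residual connection keeps the original visual evidence intact.
        return self.norm(video + enhanced)

# Toy usage with random features.
video = torch.randn(2, 128, 512)        # 128 segment/proposal features
commonsense = torch.randn(2, 40, 512)   # 40 knowledge-graph node embeddings
out = CrossModalEnhancer()(video, commonsense)
print(out.shape)  # torch.Size([2, 128, 512])
```

The same pattern can be applied symmetrically to the pseudo-query vectors before localization.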
2

Multimodal Machine Learning in Human Motion Analysis

Fu, Jia January 2022
Currently, most long-term human motion classification and prediction tasks are driven by spatio-temporal data of the human trunk. In addition, data from multiple modalities, such as electromyography (EMG) of specific muscles and respiratory rhythm, can change idiosyncratically with human motion. On the other hand, progress in Artificial Intelligence research on the collaborative understanding of image, video, audio, and semantics relies mainly on MultiModal Machine Learning (MMML). This work explores human motion classification strategies that combine multiple modalities using MMML. The research is conducted on the Unige-Maastricht Dance dataset. Attention-based Deep Learning architectures are proposed for modal fusion on three levels: 1) feature fusion with a Component Attention Network (CANet); 2) model fusion by combining a Graph Convolution Network (GCN) with CANet; and 3) late fusion by simple voting. All of these exceed the single-modality benchmark. Moreover, the effect of each modality in each fusion method is analyzed through comprehensive comparison experiments. Finally, statistical analysis and visualization of the attention scores are performed to help distill the most informative temporal and component cues characterizing two qualities of motion.
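As a concrete illustration of the third fusion level, late fusion can be as simple as a majority vote over per-modality predictions. The modality names and labels below are hypothetical placeholders, not values from the Unige-Maastricht Dance dataset.

```python
from collections import Counter

def late_fusion_vote(per_modality_preds: dict[str, list[int]]) -> list[int]:
    """Majority vote across modality-specific classifiers, sample by sample."""
    modalities = list(per_modality_preds)
    num_samples = len(per_modality_preds[modalities[0]])
    fused = []
    for i in range(num_samples):
        votes = Counter(per_modality_preds[m][i] for m in modalities)
        fused.append(votes.most_common(1)[0][0])  # ties resolved by first-seen label
    return fused

# Hypothetical predictions from three unimodal classifiers (0/1 = two motion qualities).
preds = {
    "skeleton": [0, 1, 1, 0],
    "emg":      [0, 1, 0, 0],
    "breath":   [1, 1, 1, 0],
}
print(late_fusion_vote(preds))  # [0, 1, 1, 0]
```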
3

Product Matching through Multimodal Image and Text Combined Similarity Matching / Produktmatchning Genom Multimodal Kombinerad Bild- och Textlikhetsmatchning

Ko, E Soon January 2021
Product matching in e-commerce faces growing challenges as the e-commerce marketplace expands and the quality of product data available online varies widely. By identifying identical products from different sources, product matching provides competitive possibilities for vendors and flexibility for customers. Traditional product matching is often rule-based, and machine learning approaches usually tackle the problem with unimodal systems. Moreover, existing methods often rely on product identifiers, which are not always consistent for a product across sources. This thesis proposes multimodal approaches to product matching, based on product name, description, and image, that outperform unimodal approaches. Three multimodal approaches were taken: one unsupervised and two supervised. The unsupervised approach performs a straightforward nearest-neighbor search in the embedding space and provides better results than unimodal approaches. The first supervised multimodal approach uses a Siamese network on the embedding space, which outperforms the unsupervised multimodal approach. Finally, the last supervised approach instead exploits distance differences in each modality through logistic regression and a decision system, which provided the best results.
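A minimal sketch of the unsupervised approach under simplifying assumptions: normalize a text and an image embedding per product, concatenate them, and run a nearest-neighbor search over the joint vectors. The embedding dimensions and random inputs are placeholders standing in for real encoder outputs.

```python
import numpy as np

def match_products(text_emb: np.ndarray, image_emb: np.ndarray, k: int = 1) -> np.ndarray:
    """Return, for each product, the indices of its k nearest neighbors (excluding itself),
    based on concatenated, L2-normalized text + image embeddings."""
    # Normalize each modality so neither dominates the joint distance.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    joint = np.concatenate([text_emb, image_emb], axis=1)

    # Cosine similarity via dot products of normalized joint vectors.
    joint = joint / np.linalg.norm(joint, axis=1, keepdims=True)
    sim = joint @ joint.T
    np.fill_diagonal(sim, -np.inf)           # do not match a product with itself
    return np.argsort(-sim, axis=1)[:, :k]   # top-k most similar products

# Toy example: 4 products with 384-d text and 512-d image embeddings (placeholder sizes).
rng = np.random.default_rng(0)
matches = match_products(rng.normal(size=(4, 384)), rng.normal(size=(4, 512)))
print(matches.ravel())
```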
4

Approches jointes texte/image pour la compréhension multimodale de documents / Text/image joint approaches for multimodal understanding of documents

Delecraz, Sébastien 10 December 2018
The human faculties of understanding are essentially multimodal: to understand the world around them, human beings fuse the information coming from all of their sensory receptors. Most of the documents used in automatic information processing contain multimodal information, for example text and images in textual documents or images and sound in video documents, yet the processing applied to them is most often monomodal. The aim of this thesis is to propose joint processing, applying mainly to text and image, for multimodal documents, through two studies: one on multimodal fusion for speaker role recognition in television broadcasts, the other on the complementarity of modalities for a linguistic analysis task on corpora of captioned images. In the first study, we focus on the analysis of audiovisual documents from television news channels. We propose an approach that uses deep neural networks to build a joint multimodal representation and to fuse the modalities. In the second part of this thesis, we investigate approaches that use several sources of multimodal information for a monomodal natural language processing task, in order to study their complementarity. We propose a complete system for correcting prepositional attachments using visual information, trained on a multimodal corpus of captioned images.
5

Context-based Multimodal Machine Learning on Game Oriented Data for Affective State Recognition / Kontextbaserad multimodal maskininlärning på spelorienterad data för affektivt tillståndsigenkänning

Corneliussen, Ilian January 2021
Affective computing is an essential part of Human-Robot Interaction, where knowing the human's emotional state is crucial for creating an interactive and adaptive social robot. Previous work has mainly focused on unimodal or multimodal sequential models for Affective State Recognition, but few have included context-based information in their models to boost performance. In this thesis, context-based features are tested on a multimodal Gated Recurrent Unit model with late fusion on game-oriented data. The results show that using context-based features such as the game state can significantly increase the performance of sequential multimodal models on game-oriented data.
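A minimal sketch of a late-fusion multimodal Gated Recurrent Unit model in which a game-state context vector is concatenated with the per-modality hidden states before classification. The modalities, feature sizes, and the way context is injected are illustrative assumptions rather than the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class LateFusionGRU(nn.Module):
    """Each modality gets its own GRU; the final hidden states are concatenated
    with a game-state context vector and classified jointly (late fusion)."""

    def __init__(self, audio_dim=40, video_dim=128, context_dim=8, hidden=64, classes=3):
        super().__init__()
        self.audio_gru = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_gru = nn.GRU(video_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden + context_dim, classes)

    def forward(self, audio, video, context):
        # audio: (B, T, audio_dim), video: (B, T, video_dim), context: (B, context_dim)
        _, h_a = self.audio_gru(audio)
        _, h_v = self.video_gru(video)
        fused = torch.cat([h_a[-1], h_v[-1], context], dim=-1)
        return self.head(fused)

model = LateFusionGRU()
logits = model(torch.randn(4, 50, 40), torch.randn(4, 50, 128), torch.randn(4, 8))
print(logits.shape)  # torch.Size([4, 3])
```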
6

Automated Multimodal Emotion Recognition / Automatiserad multimodal känsloigenkänning

Fernández Carbonell, Marcos January 2020
Being able to read and interpret affective states plays a significant role in human society. However, this is difficult in some situations, especially when information is limited to either vocal or visual cues. Many researchers have investigated the so-called basic emotions in a supervised way. This thesis presents the results of a supervised and unsupervised multimodal study of a more realistic number of emotions. To that end, audio and video features are extracted from the GEMEP dataset using openSMILE and OpenFace, respectively. The supervised approach compares multiple solutions and shows that multimodal pipelines can outperform unimodal ones, even with a higher number of affective states. The unsupervised approach combines a traditional and an exploratory method to find meaningful patterns in the multimodal dataset. It also contains an innovative procedure for better understanding the output of clustering techniques.
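A minimal sketch of one feature-level fusion baseline of the kind such a supervised comparison might include: concatenate per-clip audio and video feature vectors and train a standard classifier. The arrays below are random placeholders standing in for features exported by openSMILE and OpenFace, and the dimensions and classifier choice are assumptions, not the thesis's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_clips = 240
# Placeholder arrays standing in for per-clip features from openSMILE (audio)
# and OpenFace (video); real feature dimensions differ.
audio_feats = rng.normal(size=(n_clips, 88))    # e.g., an eGeMAPS-sized audio vector
video_feats = rng.normal(size=(n_clips, 136))   # e.g., flattened facial landmarks
labels = rng.integers(0, 12, size=n_clips)      # a larger, "more realistic" emotion set

# Early (feature-level) fusion: concatenate the audio and video vectors per clip.
X = np.hstack([audio_feats, video_feats])

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print("5-fold accuracy:", cross_val_score(clf, X, labels, cv=5).mean())
```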
7

Messing With The Gap: On The Modality Gap Phenomenon In Multimodal Contrastive Representation Learning

Al-Jaff, Mohammad January 2023
In machine learning, a sub-field of computer science, a two-tower architecture model is a specialised type of neural network model that encodes paired data from different modalities (like text and images, sound and video, or proteomics and gene expression profiles) into a shared latent representation space. However, training these models with a specific contrastive loss function, known as the multimodal infoNCE loss, often seems to lead to a distinctive geometric phenomenon known as the modality gap: a clear geometric separation of the embeddings of the two modalities in the joint contrastive latent space. This thesis investigates the modality gap in multimodal machine learning, specifically in two-tower neural networks trained with the multimodal infoNCE loss. We examine the adequacy of the current definition of the modality gap, the conditions under which the phenomenon manifests, and its impact on representation quality and downstream task performance. The approach to these questions consists of a two-phase experimental strategy. Phase I involves a series of experiments, ranging from toy synthetic simulations to true multimodal machine learning with complex datasets, to explore and characterise the modality gap under varying conditions. Phase II focuses on modifying the modality gap and analysing representation quality, evaluating different loss functions and their impact on the gap. This methodical exploration allows us to systematically dissect the emergence and implications of the modality gap phenomenon, providing insights into its impact on downstream tasks, measured with proxy metrics based on semantic clustering in the shared latent representation space and modality-specific linear probe evaluation. Our findings reveal that the modality gap definition proposed by W. Liang et al. (2022) is insufficient. We demonstrate that embeddings with similar modality gap magnitudes can exhibit varying linear separability between the modalities in the contrastive latent space and varying embedding topologies, indicating the need for additional metrics to capture the true essence of the gap. Furthermore, our experiments show that the temperature hyperparameter in the multimodal infoNCE loss plays a crucial role in the emergence of the modality gap, and that this effect varies across datasets, suggesting that individual dataset characteristics significantly influence the gap's manifestation. A key finding is that modality gaps consistently emerge with small temperatures in the fixed-temperature mode of the loss and almost invariably in the learned-temperature mode, regardless of the initial temperature value. Additionally, we observe that the magnitude of the modality gap is influenced by distribution shifts, with the gap increasing progressively from the training set to the validation set, then to the test set, and finally to more distributionally shifted datasets. The choice of contrastive learning method, temperature mode, and temperature value is thus crucial in shaping the modality gap. However, reducing the gap does not consistently improve downstream task performance, suggesting that its role is more nuanced than previously understood and that the modality gap may be a geometric by-product of the learning method rather than a critical determinant of representation quality.
Our results underscore the need to reevaluate the modality gap's significance in multimodal contrastive learning, emphasising the importance of dataset characteristics and contrastive learning methodology.
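A minimal sketch of the two ingredients referenced above: the symmetric multimodal infoNCE loss with a temperature hyperparameter, and a simple centroid-distance summary of the modality gap (the kind of single-number definition the thesis argues is insufficient on its own). The shapes and the specific gap metric are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multimodal_info_nce(x_emb, y_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings from two towers."""
    x = F.normalize(x_emb, dim=-1)
    y = F.normalize(y_emb, dim=-1)
    logits = x @ y.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(x.size(0))           # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def modality_gap(x_emb, y_emb):
    """Distance between the per-modality embedding centroids on the unit sphere,
    one common (but, per the thesis, incomplete) single-number summary of the gap."""
    x = F.normalize(x_emb, dim=-1).mean(dim=0)
    y = F.normalize(y_emb, dim=-1).mean(dim=0)
    return torch.norm(x - y).item()

# Toy usage with random 256-d embeddings for 8 paired items.
a, b = torch.randn(8, 256), torch.randn(8, 256)
print(multimodal_info_nce(a, b).item(), modality_gap(a, b))
```

Lowering the temperature sharpens the softmax in the loss, which is one of the conditions the thesis links to a larger gap.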
