Multimodal Machine Learning in Human Motion Analysis

Most long-term human motion classification and prediction tasks are currently driven by spatio-temporal data of the human trunk. Data from other modalities, such as electromyography (EMG) of specific muscles and respiratory rhythm, also change idiosyncratically with human motion. Meanwhile, progress in Artificial Intelligence research on the joint understanding of image, video, audio, and semantics relies mainly on Multimodal Machine Learning (MMML). This work explores human motion classification strategies that use multimodal information through MMML. The research is conducted on the Unige-Maastricht Dance dataset. Attention-based Deep Learning architectures are proposed for modal fusion on three levels: 1) feature fusion by a Component Attention Network (CANet); 2) model fusion by a novel fusion of a Graph Convolution Network (GCN) with CANet; and 3) late fusion by simple voting. All three exceed the benchmark set by the single motion modality. Moreover, the effect of each modality in each fusion method is analyzed through comprehensive comparison experiments. Finally, statistical analysis and visualization of the attention scores are performed to help distill the most informative temporal and component cues characterizing two qualities of motion.
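The third fusion level, late fusion by simple voting, can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so the code below assumes a plain majority vote over per-modality predictions, with ties resolved in favor of the earliest-listed modality. The function name `late_fusion_vote`, the modality names, and the placeholder labels `quality_A` and `quality_B` are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def late_fusion_vote(per_modality_predictions):
    """Fuse per-modality label sequences by majority vote (assumed rule).

    per_modality_predictions: list of equal-length label lists, one per modality.
    Ties go to the label predicted by the earliest-listed modality.
    """
    fused = []
    # zip(*) groups the votes that all modalities cast for the same sample
    for sample_votes in zip(*per_modality_predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # pick the first label, in modality order, that reaches the top count
        winner = next(label for label in sample_votes if counts[label] == top)
        fused.append(winner)
    return fused


# Hypothetical predictions for four motion segments from three classifiers
motion_preds = ["quality_A", "quality_B", "quality_A", "quality_B"]
emg_preds = ["quality_A", "quality_A", "quality_A", "quality_B"]
respiration_preds = ["quality_B", "quality_A", "quality_A", "quality_A"]

fused_labels = late_fusion_vote([motion_preds, emg_preds, respiration_preds])
print(fused_labels)
```

With an odd number of modalities and two classes, as in this example, ties cannot occur; the tie-break only matters for other configurations.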

http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-321741

Multimodal machine learning

Modal fusion

Human motion classification

Computer and Information Sciences

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-321741
Date	January 2022
Creators	Fu, Jia
Publisher	KTH, Skolan för elektroteknik och datavetenskap (EECS)
Source Sets	DiVA Archive at Uppsala University
Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	TRITA-EECS-EX ; 2022:747
