1

Multimodal Machine Learning in Human Motion Analysis

Fu, Jia January 2022
Currently, most long-term human motion classification and prediction tasks are driven by spatio-temporal data of the human trunk. In addition, data from multiple modalities, such as electromyography (EMG) of specific muscles and respiratory rhythm, change idiosyncratically with human motion. Meanwhile, progress in Artificial Intelligence research on the joint understanding of image, video, audio, and semantics relies mainly on MultiModal Machine Learning (MMML). This work explores human motion classification strategies that use multi-modality information via MMML. The research is conducted on the Unige-Maastricht Dance dataset. Attention-based Deep Learning architectures are proposed for modal fusion at three levels: 1) feature fusion with a Component Attention Network (CANet); 2) model fusion that combines a Graph Convolution Network (GCN) with CANet; and 3) late fusion by simple voting. All of these exceed the single-modality benchmark. Moreover, the effect of each modality within each fusion method is analyzed through comprehensive comparison experiments. Finally, statistical analysis and visualization of the attention scores help distill the most informative temporal/component cues characterizing two qualities of motion.
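
A minimal sketch of the third fusion level mentioned in this abstract, late fusion by voting: each modality gets its own classifier, and per-sample predictions are combined by majority vote. The module names, feature dimensions, and class count are illustrative assumptions, not details from the thesis.

```python
import torch
import torch.nn as nn

class ModalityClassifier(nn.Module):
    """Stand-in per-modality head (e.g., trunk kinematics, EMG, respiration)."""
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def forward(self, x):
        return self.net(x)  # unnormalized class logits

def late_fusion_vote(logits_per_modality):
    """Hard majority vote over per-modality predictions."""
    votes = torch.stack([l.argmax(dim=-1) for l in logits_per_modality])  # (M, B)
    fused, _ = votes.mode(dim=0)  # most frequent class per sample
    return fused

# Usage: three hypothetical modalities, batch of 8, 4 motion classes.
dims = (96, 32, 8)
heads = [ModalityClassifier(d, 4) for d in dims]
feats = [torch.randn(8, d) for d in dims]
pred = late_fusion_vote([h(f) for h, f in zip(heads, feats)])  # shape (8,)
```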
2

Exploiting Multi-Modal Fusion for Urban Autonomous Driving Using Latent Deep Reinforcement Learning

Khalil, Yasser 29 April 2022
Human driving decisions are the leading cause of road fatalities. Autonomous driving naturally eliminates such incompetent decisions and can thus improve traffic safety and efficiency. Deep reinforcement learning (DRL) has shown great potential in learning complex tasks, and researchers have recently investigated various DRL-based approaches to autonomous driving. However, exploiting multi-modal fusion to generate pixel-wise perception and motion prediction, and then leveraging these predictions to train a latent DRL agent, has not yet been explored. Unlike other DRL algorithms, latent DRL separates representation learning from task learning, enhancing the sampling efficiency of reinforcement learning. In addition, supplying the latent DRL algorithm with accurate perception and motion prediction simplifies the surrounding urban scene, improving training and thus yielding a better driving policy. To that end, this Ph.D. research first develops LiCaNext, a novel real-time multi-modal fusion network that produces accurate joint perception and motion prediction at the pixel level. Our proposed approach relies solely on a LIDAR sensor, whose multi-modal input is composed of bird's-eye view (BEV), range view (RV), and range residual images. Further, this Ph.D. thesis proposes leveraging these predictions, together with another simple BEV image, to train a sequential latent maximum entropy reinforcement learning (MaxEnt RL) algorithm. A sequential latent model is deployed to learn a compact latent representation from high-dimensional inputs, and the MaxEnt RL model then trains on this latent space to learn a driving policy. The proposed LiCaNext is trained on the public nuScenes dataset. Results demonstrate that LiCaNext operates in real time and outperforms the state of the art in perception and motion prediction, especially for small and distant objects. Furthermore, simulation experiments on CARLA evaluate our approach of exploiting LiCaNext predictions to train the sequential latent MaxEnt RL algorithm. These experiments show that our approach learns a better driving policy, outperforming other prevalent DRL-based algorithms. The learned driving policy achieves the objectives of safety, efficiency, and comfort, and experiments reveal that it remains effective across different environments and varying weather conditions.
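
The abstract's central design choice, separating representation learning from policy learning, can be illustrated with a hedged sketch: an encoder compresses stacked BEV perception/motion maps into a compact latent vector, and a MaxEnt-style stochastic policy acts on that latent rather than on raw pixels. All module names, channel counts, and sizes below are illustrative assumptions; the thesis's sequential latent model and training losses are not reproduced here.

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Compresses stacked BEV maps (perception, motion, raw BEV) to a latent."""
    def __init__(self, in_channels: int = 4, latent_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64, latent_dim)

    def forward(self, bev_stack):
        return self.fc(self.conv(bev_stack))

class GaussianPolicy(nn.Module):
    """Stochastic policy over the latent state (e.g., steer, throttle)."""
    def __init__(self, latent_dim: int = 128, act_dim: int = 2):
        super().__init__()
        self.mu = nn.Linear(latent_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, z):
        dist = torch.distributions.Normal(self.mu(z), self.log_std.exp())
        a = dist.rsample()  # reparameterized sample; log-probs feed the entropy term
        return torch.tanh(a), dist.log_prob(a).sum(-1)

enc, pi = LatentEncoder(), GaussianPolicy()
z = enc(torch.randn(1, 4, 128, 128))  # hypothetical 4-channel BEV input
action, logp = pi(z)
```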
3

Modeling for Continuous Cued Speech Recognition in French using Advanced Machine Learning Methods

Liu, Li 11 September 2018
This PhD thesis deals with automatic continuous Cued Speech (CS) recognition based on images of subjects, without marking any artificial landmarks. To achieve this objective, we extract high-level features from three information flows (lips, hand positions, and hand shapes) and find an optimal approach to merging them for a robust CS recognition system. We first introduce a novel and powerful deep learning method based on Convolutional Neural Networks (CNNs) for extracting hand-shape and lip features from raw images. Adaptive background mixture models (ABMMs) are also applied, for the first time, to obtain hand-position features. Meanwhile, building on an advanced machine learning method, Constrained Local Neural Fields (CLNF), we propose the Modified CLNF to extract the inner-lip parameters (A and B), as well as another method named the adaptive ellipse model. All these methods make significant contributions to feature extraction in CS.
Then, because of the asynchrony of the three feature flows (i.e., lips, hand shape, and hand position) in CS, their fusion is a challenging issue. To resolve it, we propose several approaches, including feature-level and model-level fusion strategies combined with context-dependent HMMs. To achieve CS recognition, we propose three tandem CNNs-HMM architectures with different fusion types. All these architectures are evaluated on a corpus recorded without any artifice, and the CS recognition performance confirms the efficiency of our proposed methods. The results are comparable with the state of the art, which used corpora with artifices. In parallel, we conduct a specific study of the temporal organization of hand movements in CS, especially their temporal segmentation, and the evaluations confirm the superior performance of our methods. In summary, this PhD thesis applies advanced machine learning methods from computer vision, and deep learning methodologies, to CS recognition, a significant step toward the general problem of automatically converting CS to sound. Future work will focus on an end-to-end CNN-RNN system that incorporates a language model and an attention mechanism for multi-modal fusion.
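
As one concrete illustration of the feature-level fusion strategy described above, the sketch below concatenates per-frame CNN features from the lip and hand-shape streams with hand-position coordinates, yielding the fused observation sequence a context-dependent HMM decoder would consume. The network sizes, crop sizes, and two-dimensional position vector are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Tiny stand-in for the per-stream CNN feature extractors."""
    def __init__(self, out_dim: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, out_dim),
        )

    def forward(self, x):
        return self.body(x)

lips_cnn, hand_cnn = SmallCNN(), SmallCNN()
lips = torch.randn(16, 1, 32, 32)  # 16 frames of lip ROI crops
hand = torch.randn(16, 1, 32, 32)  # 16 frames of hand-shape crops
pos = torch.randn(16, 2)           # (x, y) hand position per frame (e.g., from ABMMs)

# Feature-level fusion: one concatenated observation vector per frame.
fused = torch.cat([lips_cnn(lips), hand_cnn(hand), pos], dim=-1)  # (16, 66)
# `fused` would then serve as the frame-wise observation sequence for HMM decoding.
```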
4

Robust Multi-Modal Fusion for 3D Object Detection : Using multiple sensors of different types to robustly detect, classify, and position objects in three dimensions.

Kårefjärd, Viktor January 2023
The computer vision task of 3D object detection is fundamentally necessary for autonomous driving perception systems. These vehicles typically feature a multitude of sensors, such as cameras, radars, and light detection and ranging (LiDAR) sensors. A neural network approach that makes use of these sensor modalities is a multi-modal 3D object detection network with a fusion step that combines information from multiple data streams to jointly predict the bounding boxes of detected objects. How this step should be performed, however, remains largely an open question, as the literature is still young. Thus, the question arises: how can information from different sensors be combined to perform 3D object detection for a real-world application, such as a mobile delivery robot with robustness requirements, and how should a fusion step be performed as part of a larger multi-modal fusion network? This work explores state-of-the-art multi-modal fusion models by testing them with sub-optimal sensor data augmentations, including LiDAR point cloud subsampling and low-resolution LiDAR data, to quantify robustness. Sensor-to-sensor misalignments arising from poor calibration, decalibration, or spatio-temporal mis-synchronization are also simulated, and a set of fusion steps is compared and evaluated.
Three novel fusion steps are proposed, of which the best-performing is a convolutional fusion with an encoder-decoder and a squeeze-and-excitation block. The results indicate that early and late fusion methods are sensitive to sub-optimal LiDAR sensor conditions and are thus not suitable for applications with robust-detection requirements; deep-fusion-based models are preferred instead. Furthermore, a bird's-eye-view fusion model is shown not to be overly sensitive to small sensor-to-sensor misalignments, and the proposed fusion step with an encoder-decoder structure and a squeeze-and-excitation block can further limit misalignment-related performance deficits. Introducing sensor misalignment as a training augmentation is also shown to alleviate degradation and help the fusion step generalize under heavy misalignment.
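
A hedged sketch of the best-performing fusion step named above: concatenated camera/LiDAR BEV feature maps pass through a small encoder-decoder, and a squeeze-and-excitation (SE) block then reweights the fused channels. Channel counts and layer sizes are illustrative assumptions, not the thesis's exact configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pool, bottleneck MLP, per-channel gating."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))  # squeeze to (B, C)
        return x * w[:, :, None, None]   # excite: gate each channel

class ConvFusion(nn.Module):
    """Concatenate modality maps, encode-decode, then SE-reweight."""
    def __init__(self, cam_ch: int = 64, lidar_ch: int = 64, out_ch: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, 128, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, out_ch, 2, stride=2), nn.ReLU())
        self.se = SEBlock(out_ch)

    def forward(self, cam_bev, lidar_bev):
        x = torch.cat([cam_bev, lidar_bev], dim=1)  # fuse along channels
        return self.se(self.decoder(self.encoder(x)))

fused = ConvFusion()(torch.randn(2, 64, 100, 100), torch.randn(2, 64, 100, 100))
```

Training with randomly perturbed extrinsics, as the thesis's misalignment augmentation suggests, would expose this fusion step to shifted cam_bev/lidar_bev pairs so the SE gating can learn to downweight unreliable channels.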
