Spelling suggestions: "subject:"multimodal Fusion"" "subject:"multimodala Fusion""
11 |
Improving 3D Point Cloud Segmentation Using Multimodal Fusion of Projected 2D Imagery Data : Improving 3D Point Cloud Segmentation Using Multimodal Fusion of Projected 2D Imagery DataHe, Linbo January 2019 (has links)
Semantic segmentation is a key approach to comprehensive image data analysis. It can be applied to analyze 2D images, videos, and even point clouds that contain 3D data points. On the first two problems, CNNs have achieved remarkable progress, but on point cloud segmentation, the results are less satisfactory due to challenges such as limited memory resource and difficulties in 3D point annotation. One of the research studies carried out by the Computer Vision Lab at Linköping University was aiming to ease the semantic segmentation of 3D point cloud. The idea is that by first projecting 3D data points to 2D space and then focusing only on the analysis of 2D images, we can reduce the overall workload for the segmentation process as well as exploit the existing well-developed 2D semantic segmentation techniques. In order to improve the performance of CNNs for 2D semantic segmentation, the study has used input data derived from different modalities. However, how different modalities can be optimally fused is still an open question. Based on the above-mentioned study, this thesis aims to improve the multistream framework architecture. More concretely, we investigate how different singlestream architectures impact the multistream framework with a given fusion method, and how different fusion methods contribute to the overall performance of a given multistream framework. As a result, our proposed fusion architecture outperformed all the investigated traditional fusion methods. Along with the best singlestream candidate and few additional training techniques, our final proposed multistream framework obtained a relative gain of 7.3\% mIoU compared to the baseline on the semantic3D point cloud test set, increasing the ranking from 12th to 5th position on the benchmark leaderboard.
|
12 |
Event Boundary Detection Using Web-cating Texts And Audio-visual FeaturesBayar, Mujdat 01 September 2011 (has links) (PDF)
We propose a method to detect events and event boundaries in soccer videos by using web-casting texts and audio-visual features. The events and their inaccurate time information given in web-casting texts need to be aligned with the visual content of the video. Most match reports presented by popular organizations such as uefa.com (the official site of Union of European Football Associations) provide the time information in minutes rather than seconds. We propose a robust method which is able to handle uncertainties in the time points of the events. As a result of our experiments, we claim that our method detects event boundaries satisfactorily for uncertain web-casting texts, and that the use of audio-visual features improves the performance of event boundary detection.
|
13 |
Semantic Scene Segmentation using RGB-D & LRF fusionLilja, Harald January 2020 (has links)
In the field of robotics and autonomous vehicles, the use of RGB-D data and LiDAR sensors is a popular practice for applications such as SLAM[14], object classification[19] and scene understanding[5]. This thesis explores the problem of semantic segmentation using deep multimodal fusion of LRF and depth data. Two data set consisting of 1080 and 108 data points from two scenes is created and manually labeled in 2D space and transferred to 1D using a proposed label transfer method utilizing hierarchical clustering. The data set is used to train and validate the suggested method for segmentation using a proposed dual encoder-decoder network based on SalsaNet [1] with gradual fusion in the decoder. Applying the suggested method yielded an improvement in the scenario of an unseen circuit when compared to uni-modal segmentation using depth, RGB, laser, and a naive combination of RGB-D data. A suggestion of feature extraction in the form of PCA or stacked auto-encoders is suggested as a further improvement for this type of fusion. The source code and data set are made publicly available at https://github.com/Anguse/salsa_fusion.
|
14 |
Détection multimodale du stress pour la conception de logiciels de remédiation / Multimodal stress detection for remediation software designSoury, Mariette 28 October 2014 (has links)
Ces travaux de thèse portent sur la reconnaissance automatique du stress chez des humains en interaction dans des situations anxiogènes: prise de parole en public, entretiens et jeux sérieux à partir d'indices audio et visuels.Afin de concevoir des modèles de reconnaissance automatique du stress, nous utilisons : des indices audio calculés à partir de la voix des sujets, capturée par un micro cravate; et des indices visuels calculés soit à partir de l'expression faciale des sujets capturés par une webcam, soit à partir de la posture des sujets capturée par une Kinect. Une partie des travaux portent sur la fusion des informations apportées par les différentes modalités.L'expression et la gestion du stress sont influencées à la fois par des différences interpersonnelles (traits de personnalité, expériences passées, milieu culturel) et contextuelles (type de stresseur, enjeux de la situation). Nous évaluons le stress sur différents publics à travers des corpus de données collectés pendant la thèse: un public sociophobe en situation anxiogène, face à une machine et face à des humains; un public non pathologique en simulation d'entretien d'embauche; et un public non pathologique en interaction face à un ordinateur ou face au robot humanoïde Nao. Les comparaisons inter- individus, et inter-corpus révèlent la diversité de l'expression du stress.Une application de ces travaux pourrait être la conception d'outils thérapeutiques pour la maitrise du stress, notamment à destination des populations phobiques.Mots clé : stress, phobie sociale, détection multimodale du stress , indices audio du stress, indices faciaux du stress, indices posturaux du stress, fusion multimodale / This thesis focuses on the automatic recognition of human stress during stress-inducing interactions (public speaking, job interview and serious games), using audio and visual cues.In order to build automatic stress recognition models, we used audio cues computed from subjects' voice captured via a lapel microphone, and visual cues computed either form subjects' facial expressions captured via a webcam, or subjects' posture captured via a Kinect. Part of this work is dedicated to the study of information fusion form those various modalities.Stress expression and coping are influenced both by interpersonal differences (personality traits, past experiences, cultural background) and contextual differences (type of stressor, situation's stakes). We evaluated stress in various populations in data corpora collected during this thesis: social phobics in anxiety-inducing situations in interaction with a machine and with humans; apathologic subjects in a mock job interview; and apathologic subjects interaction with a computer and with the humanoid robot Nao. Inter-individual and inter-corpora comparisons highlight the variability of stress expression.A possible application of this work could be the elaboration of therapeutic software to learn stress coping strategies, particularly for social phobics.Key words: stress, social phobia, multimodal stress detection, stress audio cues, stress facial cues, stress postural cues, multimodal fusion
|
15 |
Modélisation statistique et segmentation d'images TEP : application à l'hétérogénéité et au suivi de tumeurs / Statistical model and segmentation of PET images : application to tumor heterogeneity and trackingIrace, Zacharie 08 October 2014 (has links)
Cette thèse étudie le traitement statistique des images TEP. Plus particulièrement, la distribution binomiale négative est proposée pour modéliser l’activité d’une région mono-tissulaire. Cette représentation a l’avantage de pouvoir prendre en compte les variations d’activité biologique (ou hétérogénéité) d’un même tissu. A partir de ces résultats, il est proposé de modéliser la distribution de l’image TEP entière comme un mélange spatialement cohérent de lois binomiales négatives. Des méthodes Bayésiennes sont considérées pour la segmentation d’images TEP et l’estimation conjointe des paramètres du modèle. La cohérence spatiale inhérente aux tissus biologiques est modélisée par un champ aléatoire de Potts-Markov pour représenter la dépendance locale entre les composantes du mélange. Un algorithme original de Monte Carlo par Chaîne de Markov (MCMC) est utilisé, faisant appel aux notions d’échantillonnage dans un espace Riemannien et d’opérateurs proximaux. L’approche proposée est appliquée avec succès à la segmentation de tumeurs en imagerie TEP. Cette méthode est ensuite étendue d’une part en intégrant au processus de segmentation des informations anatomiques acquises par tomodensitométrie (TDM), et d’autre part en traitant une série temporelle d’images correspondant aux différentes phases de respiration. Un modèle de mélange de distributions bivariées binomiale négative - normale est proposé pour représenter les images dynamiques TEP et TDM fusionnées. Un modèle Bayésien hiérarchique a été élaboré comprenant un champ de Potts-Markov à quatre dimensions pour respecter la cohérence spatiale et temporelle des images PET-TDM dynamiques. Le modèle proposé montre une bonne qualité d’ajustement aux données et les résultats de segmentation obtenus sont visuellement en concordance avec les structures anatomiques et permettent la délimitation et le suivi de la tumeur. / This thesis studies statistical image processing of PET images. More specifically, the negative binomial distribution is proposed to model the activity of a single tissue. This representation has the advantage to take into account the variations of biological activity (or heterogeneity) within a single tissue. Based on this, it is proposed to model the data of the entire PET image as a spatially coherent finite mixture of negative binomial distributions. Bayesian methods are considered to jointly perform the segmentation and estimate the model parameters. The inherent spatial coherence of the biological tissue is modeled by a Potts-Markov random field to represent the local dependence between the components of the mixture. An original Markov Chain Monte Carlo (MCMC) algorithm is proposed, based on sampling in a Riemannian space and proximal operators. The proposed approach is successfully applied to the segmentation of tumors in PET imaging. This method is further extended by incorporating anatomical information acquired by computed tomography (CT) and processing a time series of images corresponding to the phases of respiration. A mixture model of bivariate negative binomial - normal distributions is proposed to represent the dynamic PET and CT fused images. A hierarchical Bayesian model was developed including a four dimensional Potts-Markov field to enforce the spatiotemporal coherence of dynamic PET-CT images. The proposed model shows a good fit to the data and the segmentation results obtained are visually consistent with the anatomical structures and allow accurate tumor delineation and tracking.
|
16 |
Audiovisual voice activity detection and localization of simultaneous speech sources / Detecção de atividade de voz e localização de fontes sonoras simultâneas utilizando informações audiovisuaisMinotto, Vicente Peruffo January 2013 (has links)
Em vista da tentência de se criarem intefaces entre humanos e máquinas que cada vez mais permitam meios simples de interação, é natural que sejam realizadas pesquisas em técnicas que procuram simular o meio mais convencional de comunicação que os humanos usam: a fala. No sistema auditivo humano, a voz é automaticamente processada pelo cérebro de modo efetivo e fácil, também comumente auxiliada por informações visuais, como movimentação labial e localizacão dos locutores. Este processamento realizado pelo cérebro inclui dois componentes importantes que a comunicação baseada em fala requere: Detecção de Atividade de Voz (Voice Activity Detection - VAD) e Localização de Fontes Sonoras (Sound Source Localization - SSL). Consequentemente, VAD e SSL também servem como ferramentas mandatórias de pré-processamento em aplicações de Interfaces Humano-Computador (Human Computer Interface - HCI), como no caso de reconhecimento automático de voz e identificação de locutor. Entretanto, VAD e SSL ainda são problemas desafiadores quando se lidando com cenários acústicos realísticos, particularmente na presença de ruído, reverberação e locutores simultâneos. Neste trabalho, são propostas abordagens para tratar tais problemas, para os casos de uma e múltiplas fontes sonoras, através do uso de informações audiovisuais, explorando-se variadas maneiras de se fundir as modalidades de áudio e vídeo. Este trabalho também emprega um arranjo de microfones para o processamento de som, o qual permite que as informações espaciais dos sinais acústicos sejam exploradas através do algoritmo estado-da-arte SRP (Steered Response Power). Por consequência adicional, uma eficiente implementação em GPU do SRP foi desenvolvida, possibilitando processamento em tempo real do algoritmo. Os experimentos realizados mostram uma acurácia média de 95% ao se efetuar VAD de até três locutores simultâneos, e um erro médio de 10cm ao se localizar tais locutores. / Given the tendency of creating interfaces between human and machines that increasingly allow simple ways of interaction, it is only natural that research effort is put into techniques that seek to simulate the most conventional mean of communication humans use: the speech. In the human auditory system, voice is automatically processed by the brain in an effortless and effective way, also commonly aided by visual cues, such as mouth movement and location of the speakers. This processing done by the brain includes two important components that speech-based communication require: Voice Activity Detection (VAD) and Sound Source Localization (SSL). Consequently, VAD and SSL also serve as mandatory preprocessing tools for high-end Human Computer Interface (HCI) applications in a computing environment, as the case of automatic speech recognition and speaker identification. However, VAD and SSL are still challenging problems when dealing with realistic acoustic scenarios, particularly in the presence of noise, reverberation and multiple simultaneous speakers. In this work we propose some approaches for tackling these problems using audiovisual information, both for the single source and the competing sources scenario, exploiting distinct ways of fusing the audio and video modalities. Our work also employs a microphone array for the audio processing, which allows the spatial information of the acoustic signals to be explored through the stateof- the art method Steered Response Power (SRP). As an additional consequence, a very fast GPU version of the SRP is developed, so that real-time processing is achieved. Our experiments show an average accuracy of 95% when performing VAD of up to three simultaneous speakers and an average error of 10cm when locating such speakers.
|
17 |
Audiovisual voice activity detection and localization of simultaneous speech sources / Detecção de atividade de voz e localização de fontes sonoras simultâneas utilizando informações audiovisuaisMinotto, Vicente Peruffo January 2013 (has links)
Em vista da tentência de se criarem intefaces entre humanos e máquinas que cada vez mais permitam meios simples de interação, é natural que sejam realizadas pesquisas em técnicas que procuram simular o meio mais convencional de comunicação que os humanos usam: a fala. No sistema auditivo humano, a voz é automaticamente processada pelo cérebro de modo efetivo e fácil, também comumente auxiliada por informações visuais, como movimentação labial e localizacão dos locutores. Este processamento realizado pelo cérebro inclui dois componentes importantes que a comunicação baseada em fala requere: Detecção de Atividade de Voz (Voice Activity Detection - VAD) e Localização de Fontes Sonoras (Sound Source Localization - SSL). Consequentemente, VAD e SSL também servem como ferramentas mandatórias de pré-processamento em aplicações de Interfaces Humano-Computador (Human Computer Interface - HCI), como no caso de reconhecimento automático de voz e identificação de locutor. Entretanto, VAD e SSL ainda são problemas desafiadores quando se lidando com cenários acústicos realísticos, particularmente na presença de ruído, reverberação e locutores simultâneos. Neste trabalho, são propostas abordagens para tratar tais problemas, para os casos de uma e múltiplas fontes sonoras, através do uso de informações audiovisuais, explorando-se variadas maneiras de se fundir as modalidades de áudio e vídeo. Este trabalho também emprega um arranjo de microfones para o processamento de som, o qual permite que as informações espaciais dos sinais acústicos sejam exploradas através do algoritmo estado-da-arte SRP (Steered Response Power). Por consequência adicional, uma eficiente implementação em GPU do SRP foi desenvolvida, possibilitando processamento em tempo real do algoritmo. Os experimentos realizados mostram uma acurácia média de 95% ao se efetuar VAD de até três locutores simultâneos, e um erro médio de 10cm ao se localizar tais locutores. / Given the tendency of creating interfaces between human and machines that increasingly allow simple ways of interaction, it is only natural that research effort is put into techniques that seek to simulate the most conventional mean of communication humans use: the speech. In the human auditory system, voice is automatically processed by the brain in an effortless and effective way, also commonly aided by visual cues, such as mouth movement and location of the speakers. This processing done by the brain includes two important components that speech-based communication require: Voice Activity Detection (VAD) and Sound Source Localization (SSL). Consequently, VAD and SSL also serve as mandatory preprocessing tools for high-end Human Computer Interface (HCI) applications in a computing environment, as the case of automatic speech recognition and speaker identification. However, VAD and SSL are still challenging problems when dealing with realistic acoustic scenarios, particularly in the presence of noise, reverberation and multiple simultaneous speakers. In this work we propose some approaches for tackling these problems using audiovisual information, both for the single source and the competing sources scenario, exploiting distinct ways of fusing the audio and video modalities. Our work also employs a microphone array for the audio processing, which allows the spatial information of the acoustic signals to be explored through the stateof- the art method Steered Response Power (SRP). As an additional consequence, a very fast GPU version of the SRP is developed, so that real-time processing is achieved. Our experiments show an average accuracy of 95% when performing VAD of up to three simultaneous speakers and an average error of 10cm when locating such speakers.
|
18 |
Audiovisual voice activity detection and localization of simultaneous speech sources / Detecção de atividade de voz e localização de fontes sonoras simultâneas utilizando informações audiovisuaisMinotto, Vicente Peruffo January 2013 (has links)
Em vista da tentência de se criarem intefaces entre humanos e máquinas que cada vez mais permitam meios simples de interação, é natural que sejam realizadas pesquisas em técnicas que procuram simular o meio mais convencional de comunicação que os humanos usam: a fala. No sistema auditivo humano, a voz é automaticamente processada pelo cérebro de modo efetivo e fácil, também comumente auxiliada por informações visuais, como movimentação labial e localizacão dos locutores. Este processamento realizado pelo cérebro inclui dois componentes importantes que a comunicação baseada em fala requere: Detecção de Atividade de Voz (Voice Activity Detection - VAD) e Localização de Fontes Sonoras (Sound Source Localization - SSL). Consequentemente, VAD e SSL também servem como ferramentas mandatórias de pré-processamento em aplicações de Interfaces Humano-Computador (Human Computer Interface - HCI), como no caso de reconhecimento automático de voz e identificação de locutor. Entretanto, VAD e SSL ainda são problemas desafiadores quando se lidando com cenários acústicos realísticos, particularmente na presença de ruído, reverberação e locutores simultâneos. Neste trabalho, são propostas abordagens para tratar tais problemas, para os casos de uma e múltiplas fontes sonoras, através do uso de informações audiovisuais, explorando-se variadas maneiras de se fundir as modalidades de áudio e vídeo. Este trabalho também emprega um arranjo de microfones para o processamento de som, o qual permite que as informações espaciais dos sinais acústicos sejam exploradas através do algoritmo estado-da-arte SRP (Steered Response Power). Por consequência adicional, uma eficiente implementação em GPU do SRP foi desenvolvida, possibilitando processamento em tempo real do algoritmo. Os experimentos realizados mostram uma acurácia média de 95% ao se efetuar VAD de até três locutores simultâneos, e um erro médio de 10cm ao se localizar tais locutores. / Given the tendency of creating interfaces between human and machines that increasingly allow simple ways of interaction, it is only natural that research effort is put into techniques that seek to simulate the most conventional mean of communication humans use: the speech. In the human auditory system, voice is automatically processed by the brain in an effortless and effective way, also commonly aided by visual cues, such as mouth movement and location of the speakers. This processing done by the brain includes two important components that speech-based communication require: Voice Activity Detection (VAD) and Sound Source Localization (SSL). Consequently, VAD and SSL also serve as mandatory preprocessing tools for high-end Human Computer Interface (HCI) applications in a computing environment, as the case of automatic speech recognition and speaker identification. However, VAD and SSL are still challenging problems when dealing with realistic acoustic scenarios, particularly in the presence of noise, reverberation and multiple simultaneous speakers. In this work we propose some approaches for tackling these problems using audiovisual information, both for the single source and the competing sources scenario, exploiting distinct ways of fusing the audio and video modalities. Our work also employs a microphone array for the audio processing, which allows the spatial information of the acoustic signals to be explored through the stateof- the art method Steered Response Power (SRP). As an additional consequence, a very fast GPU version of the SRP is developed, so that real-time processing is achieved. Our experiments show an average accuracy of 95% when performing VAD of up to three simultaneous speakers and an average error of 10cm when locating such speakers.
|
19 |
Robust Multi-Modal Fusion for 3D Object Detection : Using multiple sensors of different types to robustly detect, classify, and position objects in three dimensions. / Robust multi-modal fusion för 3D-objektdetektion : Använda flera sensorer av olka typer för att robust detektera, klassificera och positionera objekt i tre dimensioner.Kårefjärd, Viktor January 2023 (has links)
The computer vision task of 3D object detection is fundamentally necessary for autonomous driving perception systems. These vehicles typically feature a multitude of sensors, such as cameras, radars, and light detection and ranging sensors. A neural network architecture approach to make use of these sensor modalities is a multi-modal 3D object detection network with a fusion step that combines the information from multiple data streams to jointly predicted bounding boxes of detected objects. How this step should be performed, however, remains largely an open question due to the contemporary nature of this literature space. Thus, the question arises: How can sensor information from different sensors be combined to perform 3D object detection for a real-world application such as a mobile delivery robot with robustness requirements and how should a fusion step be performed as a part of a larger multi-modal fusion network? This work explores state-of-the-art multi-modal fusion models by testing with sub-optimal sensor data augmentations to quantify robustness including LiDAR point cloud subsampling and low-resolution LiDAR data. Sensor-to-sensor misalignments from poor calibration, decalibration, or spatial-temporal mis-synchronization problems are also simulated and a set of fusion steps are compared and evaluated. Three novel fusion steps are proposed where the best-performing fusion step is a convolution fusion with an encode-decoder and a squeeze and excitation block. The results indicate how early and late fusion methods are sensitive to sub-optimal LiDAR sensor conditions, and thus not suitable for an application with requirements of robust detection. Instead, Deep-fusion based models are preferred. Furthermore, a bird’s eye fusion model is demonstrated to not be overly sensitive to small sensor-to-sensor misalignments, and how the proposed fusion step with an encoder-decoder structure and a squeeze and excitation block can further limit misalignment-related performance deficits. The introduction of sensor misalignment as a training augmentation is also proven to alleviate and generalize the fusion step under heavy misalignment. / Datorseende uppgiften 3D-objektdetektering är i grunden nödvändig för autonomt körande system. Dessa fordon har vanligtvis ett flertal sensorer, såsom kameror, radar och ljusdetekterings- och avståndssensorer. Ett tillvägagångssätt med neural nätverksarkitektur för att använda dessa sensormodaliteter är ett multimodalt 3D-objektdetekteringsnätverk med ett fusionssteg som kombinerar informationen från flera dataströmmar för att gemensamt föreslå beggrränsade boxar för upptäckta objekt. Hur detta steg bör utformas förblir dock till stor del en öppen fråga på grund av litteraturutrymmes obestämda karaktär. Därför uppstår frågan: Hur kan sensorinformation från olika sensorer kombineras för att utföra 3D-objektdetektering för en verklig applikation som en mobil leveransrobot med robusthetskrav och hur ska ett fusionssteg utföras som en del av i ett större multimodalt fusionsnätverk? Detta arbete utforskar moderna multimodala fusionsmodeller genom att testa med suboptimala sensordataaugmenteringar för att kvantifiera robusthet inklusive LiDAR punktmolnsdelsampling och lågupplöst LiDAR-data. Sensor-till-sensor feljusteringar från dålig kalibrering, dekalibrering eller rumsliga-temporala felsynkroniseringsproblem simuleras också och en uppsättning fusionssteg jämförs och utvärderas. Tre nya fusionssteg föreslås där det bästa fusionssteget av de presterande är en convolution med en inkodare-avkodare och ett kläm- och exciteringsblock. Resultaten indikerar hur tidiga och sena fusionsmetoder är känsliga för suboptimala LiDAR-sensorförhållanden och därför inte lämpar sig för en applikation med krav på robust detektion. Istället föredras djupfusion modeller. Dessutom har en fusionsmodell av fågelvy typ visat sig inte vara känslig för små sensor-till-sensor feljusteringar, och hur det föreslagna fusionssteget med en inkodare-avkodarestruktur och ett kläm- och exciteringsblock ytterligare kan begränsa feljusteringsrelaterade prestandabrister. Införandet av sensorfeljustering som en träningsaugmentering har också visat sig lindra och generalisera fusionssteget under kraftig feljustering.
|
Page generated in 0.0786 seconds