1 |
Evidential Reasoning for Multimodal Fusion in Human Computer Interaction. Reddy, Bakkama Srinath. January 2007.
Fusion of information from multiple modalities in Human Computer Interfaces
(HCI) has gained considerable attention in recent years and has far-reaching
implications in many areas of human-machine interaction. However, a major
limitation of current HCI fusion systems is that the fusion process tends to
ignore the semantic nature of modalities, which may reinforce, complement or
contradict each other over time. Also, most systems are not robust in
representing the ambiguity inherent in human gestures. In this work, we
investigate an evidential-reasoning-based approach to intelligent multimodal
fusion, and apply it to a proposed multimodal system consisting of
a hand-gesture sensor and a Brain-Computer Interface (BCI). There are three
major contributions of this work to the area of human computer interaction.
First, we propose an algorithm for reconstruction of the 3D hand pose given a
2D input video. Second, we develop a BCI using Steady State Visually Evoked
Potentials, and show how a multimodal system consisting of the two sensors can
improve the efficiency and reduce the complexity of the system while retaining the
same level of accuracy. Finally, we propose a semantic fusion algorithm based
on Transferable Belief Models, which can successfully fuse information from
these two sensors, to form meaningful concepts and resolve ambiguity. We also
analyze this system for robustness under various operating scenarios.
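As a concrete illustration of the fusion step, the sketch below implements the Transferable Belief Model's unnormalized conjunctive combination rule for two sources. The frame of discernment ("select" vs. "move") and the masses assigned to the gesture sensor and the BCI are illustrative assumptions, not values from the thesis.

```python
from itertools import product

def tbm_conjunctive(m1, m2):
    """Unnormalized conjunctive combination of two basic belief assignments.
    The TBM keeps the mass assigned to the empty set as a measure of conflict
    between the sources instead of renormalizing it away."""
    combined = {}
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        combined[inter] = combined.get(inter, 0.0) + wa * wb
    return combined

# Illustrative frame of discernment: two commands the system may recognize.
SELECT, MOVE = frozenset({"select"}), frozenset({"move"})
EITHER = SELECT | MOVE

# Hypothetical masses from the hand-gesture sensor and the SSVEP BCI.
m_gesture = {SELECT: 0.6, EITHER: 0.4}             # gesture is fairly confident
m_bci = {MOVE: 0.5, SELECT: 0.2, EITHER: 0.3}      # BCI partly contradicts it

fused = tbm_conjunctive(m_gesture, m_bci)
for hypothesis, mass in fused.items():
    label = ",".join(sorted(hypothesis)) or "conflict (empty set)"
    print(f"{label}: {mass:.2f}")
```

The mass left on the empty set quantifies how strongly the two sensors contradict each other, which is one way such a scheme can represent and resolve ambiguity.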
|
2 |
The Fusion of Multimodal Brain Imaging Data from Geometry Perspectives. January 2020.
The rapid development of multimodal neuroimaging acquisition provides opportunities to systematically characterize human brain structure and function. For example, in brain magnetic resonance imaging (MRI), a typical non-invasive imaging technique, different acquisition sequences (modalities) lead to different descriptions of brain functional activity or anatomical biomarkers. Nowadays, in addition to the traditional voxel-level analysis of images, there is a trend toward processing and investigating cross-modality relationships at higher-level image representations, e.g. surfaces and networks.
In this study, I aim to achieve multimodal brain image fusion by drawing on intrinsic properties of the data, e.g. the geometry of the embedding structures in which the commonly used image features reside. Since the image features investigated in this study share an identical embedding space, i.e. they are defined on either a brain surface or a brain atlas on which a graph structure is easy to define, it is straightforward to consider the mathematically meaningful properties of the shared structures from a geometry perspective.
I first introduce the background of multimodal fusion of brain image data and the insight that geometric properties can play a role in linking different modalities. Then, several proposed computational frameworks, using either solid and efficient geometric algorithms or current geometric deep learning models, are fully discussed. I show how these frameworks deal with distinct geometric properties and how they apply in real healthcare scenarios, e.g. enhanced detection of fetal brain diseases or abnormal brain development. / Doctoral Dissertation, Computer Science, 2020
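To make the shared-geometry idea concrete, the sketch below (a simplified illustration, not code from the dissertation) builds a graph Laplacian on a hypothetical atlas graph and uses it both to measure how smoothly two modalities vary over the shared structure and to project them into a common spectral basis. The adjacency matrix and the per-region signals are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical atlas graph: n regions, symmetric adjacency (1 = regions share a border).
n = 8
A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
A = A + A.T

# Combinatorial graph Laplacian of the shared atlas.
L = np.diag(A.sum(axis=1)) - A

# Two "modalities" defined on the same regions, e.g. a structural and a
# functional measurement per region (random stand-ins here).
x_struct = rng.normal(size=n)
x_func = rng.normal(size=n)

# Smoothness of each signal with respect to the shared geometry:
# x^T L x is small when connected regions carry similar values.
print("structural smoothness:", float(x_struct @ L @ x_struct))
print("functional smoothness:", float(x_func @ L @ x_func))

# Spectral embedding: project both modalities onto the low-frequency
# eigenvectors of L, giving coordinates in a common geometric basis.
eigvals, eigvecs = np.linalg.eigh(L)
k = 3
print("joint low-frequency coordinates:\n",
      np.stack([eigvecs[:, :k].T @ x_struct, eigvecs[:, :k].T @ x_func]))
```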
|
3 |
A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition. Makkook, Mustapha. January 2007.
A key requirement for developing any innovative system in a
computing environment is to provide a sufficiently friendly
interface for the average end user. Careful design of such a
user-centered interface, however, means more than just the
ergonomics of the panels and displays. It also requires that
designers precisely define what information to use and how, where,
and when to use it. Recent advances in user-centered design of
computing systems have suggested that multimodal integration can
provide different types and levels of intelligence to the user
interface. The work of this thesis aims at improving speech
recognition-based interfaces by making use of the visual modality
conveyed by the movements of the lips.
Designing a good visual front end is a major part of this framework.
For this purpose, this work derives the optical flow fields for
consecutive frames of people speaking. Independent Component
Analysis (ICA) is then used to derive basis flow fields. The
coefficients of these basis fields comprise the visual features of
interest. It is shown that using ICA on optical flow fields yields
better classification results than the traditional approaches based
on Principal Component Analysis (PCA). In fact, ICA can capture
higher order statistics that are needed to understand the motion of
the mouth. This is because lip movement is inherently complex,
involving large image velocities, self-occlusion (due to the
appearance and disappearance of the teeth), and considerable
non-rigidity.
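A minimal sketch of this feature-extraction idea is shown below, using OpenCV's Farnebäck dense optical flow and scikit-learn's FastICA as convenient stand-ins; the synthetic frames, the specific flow algorithm, and the number of independent components are illustrative assumptions rather than the thesis's actual pipeline.

```python
import numpy as np
import cv2
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Synthetic stand-in for a lip-region sequence: T grayscale frames of size H x W.
T, H, W = 40, 32, 48
frames = (rng.random((T, H, W)) * 255).astype(np.uint8)

# Dense optical flow between consecutive frames (Farnebäck), flattened to vectors.
flows = []
for t in range(T - 1):
    flow = cv2.calcOpticalFlowFarneback(frames[t], frames[t + 1], None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flows.append(flow.reshape(-1))            # one (H*W*2,) vector per frame pair
X = np.stack(flows)                           # shape (T-1, H*W*2)

# ICA yields basis flow fields; the mixing coefficients are the visual features.
ica = FastICA(n_components=8, random_state=0)
features = ica.fit_transform(X)               # (T-1, 8) coefficients per frame pair
basis_flows = ica.components_.reshape(8, H, W, 2)

print("feature matrix shape:", features.shape)
print("one basis flow field shape:", basis_flows[0].shape)
```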
Another issue that is of great interest to audio-visual speech
recognition systems designers is the integration (fusion) of the
audio and visual information into an automatic speech recognizer.
For this purpose, a reliability-driven sensor fusion scheme is
developed. A statistical approach is developed to account for the
dynamic changes in reliability. This is done in two steps. The first
step derives suitable statistical reliability measures for the
individual information streams. These measures are based on the
dispersion of the N-best hypotheses of the individual stream
classifiers. The second step finds an optimal mapping between the
reliability measures and the stream weights that maximizes the
conditional likelihood. For this purpose, genetic algorithms are
used.
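The sketch below illustrates the stream-weighting idea under simplifying assumptions: a normalized-entropy dispersion of each stream's N-best scores serves as the reliability measure, and a weight derived from it combines the streams' log-likelihoods. The hand-coded mapping from reliability to weight is only a placeholder for the mapping the thesis learns with genetic algorithms by maximizing conditional likelihood, and all scores are hypothetical.

```python
import numpy as np

def nbest_dispersion(scores):
    """Dispersion (normalized entropy) of an N-best list of posterior-like
    scores; a peaked list indicates a confident, reliable stream."""
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return entropy / np.log(len(p))           # 0 = fully confident, 1 = uniform

def fuse_streams(audio_loglik, visual_loglik, audio_nbest, visual_nbest):
    """Weighted log-likelihood combination with a reliability-derived weight."""
    conf_a = 1.0 - nbest_dispersion(audio_nbest)
    conf_v = 1.0 - nbest_dispersion(visual_nbest)
    lam = conf_a / (conf_a + conf_v + 1e-12)  # illustrative mapping, not the learned one
    return lam * np.asarray(audio_loglik) + (1.0 - lam) * np.asarray(visual_loglik)

# Hypothetical per-candidate log-likelihoods from the audio and visual classifiers.
audio_loglik = np.array([-12.0, -15.5, -16.0])
visual_loglik = np.array([-10.0, -10.2, -14.0])
audio_nbest = [0.5, 0.3, 0.2]                 # noisy audio: flat N-best list
visual_nbest = [0.8, 0.15, 0.05]              # confident visual stream

combined = fuse_streams(audio_loglik, visual_loglik, audio_nbest, visual_nbest)
print("fused scores:", combined, "-> best candidate index:", int(combined.argmax()))
```

In this toy case the flat audio N-best list pushes the weight toward the visual stream, which is the behaviour a reliability-driven scheme is meant to capture when acoustic noise increases.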
The addressed issues are challenging problems that are essential
for developing an audio-visual speech recognition framework that can
maximize the information gathered about the uttered words and minimize
the impact of noise.
|
4 |
Estimating Real Estate Selling Prices using Multimodal Neural Networks / Estimering av fastigheters försäljningspriser med hjälp av multimodala neurala nätverk. Öijar Jansson, Agnes. January 2023.
This thesis examines whether housing price estimations can be improved by combining several modalities of data through the utilization of neural networks. The analysis is limited to apartments in the Stockholm municipality, and the applied modalities are residential attributes (tabular data) and photo montages (image data). The tabular data includes living area, number of rooms, age, latitude, longitude and ocean distance, while the image data contains montages of four images representing the kitchen, bathroom, living space and, via satellite imagery, the neighborhood. Furthermore, the dataset comprises a total of 1154 apartments sold within a time frame of approximately six months, ending in June 2023. The analysis is conducted by designing three artificial neural networks and comparing their performances: a multilayer perceptron that predicts selling prices using tabular data, a convolutional neural network that predicts selling prices using image data, and a multimodal neural network that estimates sold prices taking both modalities as inputs.
To facilitate the construction process, the multimodal neural network is designed by integrating the other models into its architecture. This is achieved through the concatenation of their outputs, which is then fed into a joint hidden layer. Before initiating the network development phase, the data is preprocessed appropriately, for example by excluding duplicates and dealing with missing values. In addition, images are categorized into room types via object detection, satellite images are collected, and photo montages are created. To obtain well-performing models, hyperparameter tuning is performed using methods such as grid search or random search. Moreover, the models are evaluated through three repetitions of 5-fold cross-validation with the mean absolute percentage error as the performance metric. The analysis shows that the multimodal neural network exhibits a marginal but significant performance advantage compared to the multilayer perceptron, both in terms of cross-validation scores and test set outcomes. This result underscores the potential benefits of utilizing both image data and tabular data for predicting apartment selling prices through the application of neural networks. Furthermore, this work motivates a deeper investigation into these prediction methods using larger datasets, for which the multimodal neural network may achieve even stronger predictive capacity.
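A minimal PyTorch sketch of the fusion architecture described above is given below: a small multilayer perceptron for the six tabular attributes and a small convolutional branch for the photo montage are concatenated and fed into a joint hidden layer. The layer sizes, montage resolution, and random inputs are illustrative assumptions, not the tuned values from the thesis.

```python
import torch
import torch.nn as nn

class MultimodalPricePredictor(nn.Module):
    """Tabular MLP branch + image CNN branch, concatenated into a joint head."""
    def __init__(self, n_tabular=6, img_channels=3):
        super().__init__()
        self.mlp = nn.Sequential(                       # residential attributes
            nn.Linear(n_tabular, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
        )
        self.cnn = nn.Sequential(                       # photo montage
            nn.Conv2d(img_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Sequential(                      # joint hidden layer
            nn.Linear(16 + 32, 32), nn.ReLU(),
            nn.Linear(32, 1),                           # predicted selling price
        )

    def forward(self, tabular, montage):
        fused = torch.cat([self.mlp(tabular), self.cnn(montage)], dim=1)
        return self.head(fused)

# Illustrative forward pass: batch of 4 apartments, 128x128 montages.
model = MultimodalPricePredictor()
prices = model(torch.randn(4, 6), torch.randn(4, 3, 128, 128))
print(prices.shape)   # torch.Size([4, 1])
```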
|
5 |
Um método de segmentação de vídeo em cenas baseado em aprendizagem profunda / A video scene segmentation method based on deep learning. Trojahn, Tiago Henrique. 27 June 2019.
Automatic video scene segmentation is a current and relevant problem given its application in various services related to multimedia. Among the different techniques reported in the literature, multimodal ones are considered the most promising, given their ability to extract information from different media in a potentially complementary way, allowing for more meaningful segmentations. By processing information of different natures, such techniques face difficulties in modeling and obtaining a combined representation of the information, and they incur high costs when processing each source of information individually. Finding a suitable combination of information that increases the effectiveness of segmentation at a relatively low computational cost becomes a challenge.
At the same time, approaches based on Deep Learning have proven effective on a wide range of tasks, including the classification of images and video. Techniques based on Deep Learning, such as Convolutional Neural Networks (CNNs), have achieved impressive results in related tasks by extracting significant patterns from data, including multimodal data. However, CNNs cannot properly learn the relationships between data temporally distributed across the shots of the same scene, which can leave the network unable to properly segment scenes whose characteristics change between shots. On the other hand, Recurrent Neural Networks (RNNs) have been successfully employed in text processing, since they are designed to analyze variable-length data sequences and can better explore the temporal relationships between the low-level characteristics of related shots, potentially increasing the effectiveness of scene segmentation. There is a lack of multimodal segmentation methods that explore Deep Learning. Thus, this thesis proposes an automatic video scene segmentation method that models segmentation as a classification problem. The method relies on a model that combines the pattern-extraction potential of CNNs with the sequence-processing potential of RNNs. Unlike related work, the proposed model eliminates the difficulty of modeling multimodal representations of the different input information and allows different approaches to multimodal fusion (early or late) to be instantiated. The method was evaluated on the scene segmentation task using a public video database, comparing its results with those of state-of-the-art techniques using different approaches. The results show a significant advance in the effectiveness obtained.
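A minimal PyTorch sketch of the CNN-plus-RNN idea is given below: a small convolutional encoder turns each shot's keyframe into a feature vector, and a bidirectional LSTM over the shot sequence classifies, per shot, whether a scene boundary follows. The keyframe-per-shot simplification, the layer sizes, and the unimodal input are illustrative assumptions; the thesis's model additionally supports early or late multimodal fusion.

```python
import torch
import torch.nn as nn

class SceneSegmenter(nn.Module):
    """Per-shot CNN features fed to an LSTM that labels scene boundaries."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(             # keyframe -> feature vector
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)   # boundary / no boundary

    def forward(self, shots):                     # shots: (batch, n_shots, 3, H, W)
        b, n, c, h, w = shots.shape
        feats = self.encoder(shots.view(b * n, c, h, w)).view(b, n, -1)
        seq, _ = self.lstm(feats)                 # temporal context across shots
        return self.classifier(seq)               # (batch, n_shots, 2) logits

# Illustrative forward pass: 2 videos, 10 shots each, 96x96 keyframes.
model = SceneSegmenter()
logits = model(torch.randn(2, 10, 3, 96, 96))
print(logits.shape)   # torch.Size([2, 10, 2])
```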
|
6 |
Multimodal Speech-Gesture Interaction with 3D Objects in Augmented Reality Environments. Lee, Minkyung. January 2010.
Augmented Reality (AR) makes it possible to interact with virtual and real objects at the same time, since it seamlessly combines the real world with computer-generated content. However, most AR interface research uses general Virtual Reality (VR) interaction techniques without modification. In this research we develop a multimodal interface (MMI) for AR with speech and 3D hand gesture input. We develop a multimodal signal fusion architecture, based on user behaviour while interacting with the MMI, that provides more effective and natural multimodal signal fusion. Speech and 3D vision-based free-hand gestures are used as the multimodal input channels. Two user observation studies were conducted: (1) a Wizard of Oz study and (2) gesture modelling. In the Wizard of Oz study, we observed how users behave while interacting with our MMI. Gesture modelling was undertaken to explore whether different types of gestures can be described by pattern curves. Based on these observations, we designed our own multimodal fusion architecture and developed an MMI. User evaluations were conducted to assess the usability of our MMI. We found that the MMI is more efficient, and that users are more satisfied with it, compared to the unimodal interfaces. We also describe design guidelines derived from our findings in the user studies.
|
7 |
MMF-DRL: Multimodal Fusion-Deep Reinforcement Learning Approach with Domain-Specific Features for Classifying Time Series Data. Sharma, Asmita. 1 June 2023.
This research focuses on addressing two pertinent problems in machine learning (ML): (a) the supervised classification of time series and (b) the need for large amounts of labeled images for training supervised classifiers. The novel contributions are two-fold. The first problem, time series classification, is addressed by proposing to transform time series into domain-specific 2D features such as scalograms and recurrence plot (RP) images. The second problem, the need for large amounts of labeled image data, is tackled by proposing a new way of using a reinforcement learning (RL) technique as a supervised classifier on multimodal (joint-representation) scalograms and RP images. The motivation for using such domain-specific features is that they provide additional information to the ML models by capturing domain-specific patterns, and they make it possible to take advantage of state-of-the-art image classifiers for learning patterns from these textured images. Thus, this research proposes a multimodal fusion (MMF) - deep reinforcement learning (DRL) approach as an alternative to traditional supervised image classifiers for the classification of time series. The proposed MMF-DRL approach produces improved accuracy over state-of-the-art supervised learning models while needing less training data. Results show the merit of using multiple modalities and RL, achieving better performance than training on a single modality. Moreover, the proposed approach yields accuracies of 90.20% and 89.63% on two physiological time series datasets with less training data, in contrast to the state-of-the-art supervised learning model ChronoNet, which gave 87.62% and 88.02% accuracy, respectively, on the two datasets with more training data.
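The transformation of a time series into a recurrence-plot image can be sketched in a few lines; the toy signal, the fixed threshold, and the absence of time-delay embedding are simplifying assumptions (scalograms would analogously come from a continuous wavelet transform).

```python
import numpy as np

def recurrence_plot(series, threshold=0.1):
    """Binary recurrence plot: R[i, j] = 1 when samples i and j are closer
    than a fraction of the signal range, yielding a textured 2D image."""
    x = np.asarray(series, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])        # pairwise distance matrix
    return (dist <= threshold * (x.max() - x.min())).astype(np.uint8)

# Toy physiological-like signal: a noisy oscillation.
t = np.linspace(0, 4 * np.pi, 256)
signal = np.sin(t) + 0.05 * np.random.default_rng(0).normal(size=t.size)

rp_image = recurrence_plot(signal)
print(rp_image.shape, rp_image.dtype)             # (256, 256) uint8, ready for a CNN
```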
|
8 |
Wavelet-enhanced 2D and 3D Lightweight Perception Systems for Autonomous Driving. Alaba, Simegnew Yihunie. 10 May 2024.
Autonomous driving requires lightweight and robust perception systems that can rapidly and accurately interpret the complex driving environment. This dissertation investigates the transformative capacity of the discrete wavelet transform (DWT), the inverse DWT, CNNs, and transformers as foundational elements for developing lightweight perception architectures for autonomous vehicles. The inherent properties of the DWT, including its invertibility, sparsity, time-frequency localization, and ability to capture multi-scale information, provide a useful inductive bias. Similarly, transformers capture long-range dependencies between features. By harnessing these attributes, novel wavelet-enhanced deep learning architectures are introduced. The first contribution is a lightweight backbone network that can be employed for real-time processing. This network balances processing speed and accuracy, outperforming established models like ResNet-50 and VGG16 in accuracy while remaining computationally efficient. Moreover, a multiresolution attention mechanism is introduced for CNNs to enhance feature extraction; it directs the network's focus toward crucial features while suppressing less significant ones. Likewise, a transformer model is proposed that combines the properties of the DWT with vision transformers. The proposed wavelet-based transformer uses the convolution theorem in the frequency domain to mitigate the computational burden that multi-head self-attention places on vision transformers. Furthermore, a proposed wavelet-multiresolution-analysis-based 3D object detection model exploits the DWT's invertibility, ensuring comprehensive capture of environmental information. Lastly, a multimodal fusion model is presented to use information from multiple sensors. Sensors have individual limitations, and no single sensor fits all applications, so multimodal fusion is proposed to make the best use of different sensors. Using a transformer to capture long-range feature dependencies, this model effectively fuses the depth cues from LiDAR with the rich texture derived from cameras. The multimodal fusion model is a promising approach that integrates backbone networks and transformers to achieve lightweight yet competitive results for 3D object detection. Moreover, the proposed model utilizes various network optimization methods, including pruning, quantization, and quantization-aware training, to minimize the computational load while maintaining performance. Experimental results across various datasets for classification networks, attention mechanisms, 3D object detection, and multimodal fusion indicate a promising direction for developing lightweight and robust perception systems for robotics, particularly autonomous driving.
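The role of the DWT as an invertible, multiresolution feature extractor can be sketched with PyWavelets; the random image and the Haar wavelet below are illustrative assumptions, not the dissertation's configuration.

```python
import numpy as np
import pywt

# Stand-in for a camera image or feature map: one 256x256 channel.
rng = np.random.default_rng(0)
image = rng.random((256, 256))

# Single-level 2D DWT: a low-frequency approximation plus three detail bands
# (horizontal, vertical, diagonal), each at half resolution.
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")
print("approximation:", cA.shape, "details:", cH.shape, cV.shape, cD.shape)

# Invertibility: the original image is recovered exactly from the subbands,
# so no information is lost when a network operates in the wavelet domain.
reconstructed = pywt.idwt2((cA, (cH, cV, cD)), "haar")
print("max reconstruction error:", float(np.abs(reconstructed - image).max()))
```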
|