• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 13
  • 2
  • 1
  • Tagged with
  • 17
  • 17
  • 9
  • 6
  • 6
  • 5
  • 4
  • 4
  • 4
  • 3
  • 3
  • 3
  • 3
  • 3
  • 3
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Evidential Reasoning for Multimodal Fusion in Human Computer Interaction

Reddy, Bakkama Srinath January 2007 (has links)
Fusion of information from multiple modalities in Human Computer Interfaces (HCI) has gained a lot of attention in recent years, and has far reaching implications in many areas of human-machine interaction. However, a major limitation of current HCI fusion systems is that the fusion process tends to ignore the semantic nature of modalities, which may reinforce, complement or contradict each other over time. Also, most systems are not robust in representing the ambiguity inherent in human gestures. In this work, we investigate an evidential reasoning based approach for intelligent multimodal fusion, and apply this algorithm to a proposed multimodal system consisting of a Hand Gesture sensor and a Brain Computing Interface (BCI). There are three major contributions of this work to the area of human computer interaction. First, we propose an algorithm for reconstruction of the 3D hand pose given a 2D input video. Second, we develop a BCI using Steady State Visually Evoked Potentials, and show how a multimodal system consisting of the two sensors can improve the efficiency and the complexity of the system, while retaining the same levels of accuracy. Finally, we propose an semantic fusion algorithm based on Transferable Belief Models, which can successfully fuse information from these two sensors, to form meaningful concepts and resolve ambiguity. We also analyze this system for robustness under various operating scenarios.
2

Evidential Reasoning for Multimodal Fusion in Human Computer Interaction

Reddy, Bakkama Srinath January 2007 (has links)
Fusion of information from multiple modalities in Human Computer Interfaces (HCI) has gained a lot of attention in recent years, and has far reaching implications in many areas of human-machine interaction. However, a major limitation of current HCI fusion systems is that the fusion process tends to ignore the semantic nature of modalities, which may reinforce, complement or contradict each other over time. Also, most systems are not robust in representing the ambiguity inherent in human gestures. In this work, we investigate an evidential reasoning based approach for intelligent multimodal fusion, and apply this algorithm to a proposed multimodal system consisting of a Hand Gesture sensor and a Brain Computing Interface (BCI). There are three major contributions of this work to the area of human computer interaction. First, we propose an algorithm for reconstruction of the 3D hand pose given a 2D input video. Second, we develop a BCI using Steady State Visually Evoked Potentials, and show how a multimodal system consisting of the two sensors can improve the efficiency and the complexity of the system, while retaining the same levels of accuracy. Finally, we propose an semantic fusion algorithm based on Transferable Belief Models, which can successfully fuse information from these two sensors, to form meaningful concepts and resolve ambiguity. We also analyze this system for robustness under various operating scenarios.
3

The Fusion of Multimodal Brain Imaging Data from Geometry Perspectives

January 2020 (has links)
abstract: The rapid development in acquiring multimodal neuroimaging data provides opportunities to systematically characterize human brain structures and functions. For example, in the brain magnetic resonance imaging (MRI), a typical non-invasive imaging technique, different acquisition sequences (modalities) lead to the different descriptions of brain functional activities, or anatomical biomarkers. Nowadays, in addition to the traditional voxel-level analysis of images, there is a trend to process and investigate the cross-modality relationship in a high dimensional level of images, e.g. surfaces and networks. In this study, I aim to achieve multimodal brain image fusion by referring to some intrinsic properties of data, e.g. geometry of embedding structures where the commonly used image features reside. Since the image features investigated in this study share an identical embedding space, i.e. either defined on a brain surface or brain atlas, where a graph structure is easy to define, it is straightforward to consider the mathematically meaningful properties of the shared structures from the geometry perspective. I first introduce the background of multimodal fusion of brain image data and insights of geometric properties playing a potential role to link different modalities. Then, several proposed computational frameworks either using the solid and efficient geometric algorithms or current geometric deep learning models are be fully discussed. I show how these designed frameworks deal with distinct geometric properties respectively, and their applications in the real healthcare scenarios, e.g. to enhanced detections of fetal brain diseases or abnormal brain development. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2020
4

A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition

Makkook, Mustapha January 2007 (has links)
A key requirement for developing any innovative system in a computing environment is to integrate a sufficiently friendly interface with the average end user. Accurate design of such a user-centered interface, however, means more than just the ergonomics of the panels and displays. It also requires that designers precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems have suggested that multimodal integration can provide different types and levels of intelligence to the user interface. The work of this thesis aims at improving speech recognition-based interfaces by making use of the visual modality conveyed by the movements of the lips. Designing a good visual front end is a major part of this framework. For this purpose, this work derives the optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields. The coefficients of these basis fields comprise the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than the traditional approaches based on Principal Component Analysis (PCA). In fact, ICA can capture higher order statistics that are needed to understand the motion of the mouth. This is due to the fact that lips movement is complex in its nature, as it involves large image velocities, self occlusion (due to the appearance and disappearance of the teeth) and a lot of non-rigidity. Another issue that is of great interest to audio-visual speech recognition systems designers is the integration (fusion) of the audio and visual information into an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed. A statistical approach is developed to account for the dynamic changes in reliability. This is done in two steps. The first step derives suitable statistical reliability measures for the individual information streams. These measures are based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood. For this purpose, genetic algorithms are used. The addressed issues are challenging problems and are substantial for developing an audio-visual speech recognition framework that can maximize the information gather about the words uttered and minimize the impact of noise.
5

A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition

Makkook, Mustapha January 2007 (has links)
A key requirement for developing any innovative system in a computing environment is to integrate a sufficiently friendly interface with the average end user. Accurate design of such a user-centered interface, however, means more than just the ergonomics of the panels and displays. It also requires that designers precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems have suggested that multimodal integration can provide different types and levels of intelligence to the user interface. The work of this thesis aims at improving speech recognition-based interfaces by making use of the visual modality conveyed by the movements of the lips. Designing a good visual front end is a major part of this framework. For this purpose, this work derives the optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields. The coefficients of these basis fields comprise the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than the traditional approaches based on Principal Component Analysis (PCA). In fact, ICA can capture higher order statistics that are needed to understand the motion of the mouth. This is due to the fact that lips movement is complex in its nature, as it involves large image velocities, self occlusion (due to the appearance and disappearance of the teeth) and a lot of non-rigidity. Another issue that is of great interest to audio-visual speech recognition systems designers is the integration (fusion) of the audio and visual information into an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed. A statistical approach is developed to account for the dynamic changes in reliability. This is done in two steps. The first step derives suitable statistical reliability measures for the individual information streams. These measures are based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood. For this purpose, genetic algorithms are used. The addressed issues are challenging problems and are substantial for developing an audio-visual speech recognition framework that can maximize the information gather about the words uttered and minimize the impact of noise.
6

Um método de segmentação de vídeo em cenas baseado em aprendizagem profunda / A vídeo scene segmentation method based on deep learnig

Trojahn, Tiago Henrique 27 June 2019 (has links)
A segmentação automática de vídeo em cenas é um problema atual e relevante dado sua aplicação em diversos serviços ligado à área de multimídia. Dentre as diferentes técnicas reportadas pela literatura, as multimodais são consideradas mais promissoras, dado a capacidade de extrair informações de diferentes mídias de maneira potencialmente complementar, possibilitando obter segmentações mais significativas. Ao usar informações de diferentes naturezas, tais técnicas enfrentam dificuldades para modelar e obter uma representação combinada das informações ou com elevado custo ao processar cada fonte de informação individualmente. Encontrar uma combinação adequada de informação que aumente a eficácia da segmentação a um custo computacional relativamente baixo torna-se um desafio. Paralelamente, abordagens baseadas em Aprendizagem Profunda mostraram-se eficazes em uma ampla gama de tarefas, incluindo classificação de imagens e vídeo. Técnicas baseadas em Aprendizagem Profunda, como as Redes Neurais Convolucionais (CNNs), têm alcançado resultados impressionantes em tarefas relacionadas por conseguirem extrair padrões significativos dos dados, incluindo multimodais. Contudo, CNNs não podem aprender adequadamente os relacionamentos entre dados que estão temporalmente distribuídos entre as tomadas de uma mesma cena. Isto pode tornar a rede incapaz de segmentar corretamente cenas cujas características mudam entre tomadas. Por outro lado, Redes Neurais Recorrentes (RNNs) têm sido empregadas com sucesso em processamento textual, pois foram projetadas para analisar sequências de dados de tamanho variável e podem melhor explorar as relações temporais entre as características de tomadas relacionadas, potencialmente aumentando a eficácia da segmentação em cenas. Há uma carência de métodos de segmentação multimodais que explorem Aprendizagem Profunda. Assim, este trabalho de doutorado propõe um método automático de segmentação de vídeo em cenas que modela o problema de segmentação como um problema de classificação. O método conta com um modelo que combina o potencial de extração de padrões das CNNs com o processamento de sequencias das RNNs. O modelo proposto elimina a dificuldade de modelar representações multimodais das diferentes informações de entrada além de permitir instanciar diferentes abordagens para fusão multimodal (antecipada ou tardia). Tal método foi avaliado na tarefa de segmentação em cenas utilizando uma base de vídeos pública, comparando os resultados obtidos com os resultados de técnicas em estado-da-arte usando diferentes abordagens. Os resultados mostram um avanço significativo na eficácia obtida. / Automatic video scene segmentation is a current and relevant problem given its application in various services related to multimedia. Among the different techniques reported in the literature, the multimodal ones are considered more promising, given the ability to extract information from different media in a potentially complementary way, allowing for more significant segmentations. By processing information of different natures, such techniques faces difficulties on modeling and obtaining a combined representation of information and cost problems when processing each source of information individually. Finding a suitable combination of information that increases the effectiveness of segmentation at a relatively low computational cost becomes a challenge. At the same time, approaches based on Deep Learning have proven effective on a wide range of tasks, including classification of images and video. Techniques based on Deep Learning, such as Convolutional Neural Networks (CNNs), have achieved impressive results in related tasks by being able to extract significant patterns from data, including multimodal data. However, CNNs can not properly learn the relationships between data temporarily distributed among the shots of the same scene. This can lead the network to become unable to properly segment scenes whose characteristics change among shots. On the other hand, Recurrent Neural Networks (RNNs) have been successfully employed in textual processing since they are designed to analyze variable-length data sequences and can be developed to better explore the temporal relationships between low-level characteristics of related shots, potentially increasing the effectiveness of scene segmentation. There is a lack of multimodal segmentation methods exploring Deep Learning. Thus, this thesis proposes an automatic method for video scene segmentation that models the problem of segmentation as a classification problem. The method relies on a model developed to combine the potential for extracting patterns from CNNs with the potential for sequence processing of the RNNs. The proposed model, different from related works, eliminates the difficulty of modeling multimodal representations of the different input information, besides allowing to instantiate different approaches for multimodal (early or late) fusion. This method was evaluated in the scene segmentation task using a public video database, comparing the results obtained with the results of state-of-the-art techniques using different approaches. The results show a significant advance in the efficiency obtained.
7

Multimodal Speech-Gesture Interaction with 3D Objects in Augmented Reality Environments

Lee, Minkyung January 2010 (has links)
Augmented Reality (AR) has the possibility of interacting with virtual objects and real objects at the same time since it combines the real world with computer-generated contents seamlessly. However, most AR interface research uses general Virtual Reality (VR) interaction techniques without modification. In this research we develop a multimodal interface (MMI) for AR with speech and 3D hand gesture input. We develop a multimodal signal fusion architecture based on the user behaviour while interacting with the MMI that provides more effective and natural multimodal signal fusion. Speech and 3D vision-based free hand gestures are used as multimodal input channels. There were two user observations (1) a Wizard of Oz study and (2)Gesture modelling. With the Wizard of Oz study, we observed user behaviours of interaction with our MMI. Gesture modelling was undertaken to explore whether different types of gestures can be described by pattern curves. Based on the experimental observations, we designed our own multimodal fusion architecture and developed an MMI. User evaluations have been conducted to evaluate the usability of our MMI. As a result, we found that MMI is more efficient and users are more satisfied with it when compared to the unimodal interfaces. We also describe design guidelines which were derived from our findings through the user studies.
8

MMF-DRL: Multimodal Fusion-Deep Reinforcement Learning Approach with Domain-Specific Features for Classifying Time Series Data

Sharma, Asmita 01 June 2023 (has links) (PDF)
This research focuses on addressing two pertinent problems in machine learning (ML) which are (a) the supervised classification of time series and (b) the need for large amounts of labeled images for training supervised classifiers. The novel contributions are two-fold. The first problem of time series classification is addressed by proposing to transform time series into domain-specific 2D features such as scalograms and recurrence plot (RP) images. The second problem which is the need for large amounts of labeled image data, is tackled by proposing a new way of using a reinforcement learning (RL) technique as a supervised classifier by using multimodal (joint representation) scalograms and RP images. The motivation for using such domain-specific features is that they provide additional information to the ML models by capturing domain-specific features (patterns) and also help in taking advantage of state-of-the-art image classifiers for learning the patterns from these textured images. Thus, this research proposes a multimodal fusion (MMF) - deep reinforcement learning (DRL) approach as an alternative technique to traditional supervised image classifiers for the classification of time series. The proposed MMF-DRL approach produces improved accuracy over state-of-the-art supervised learning models while needing fewer training data. Results show the merit of using multiple modalities and RL in achieving improved performance than training on a single modality. Moreover, the proposed approach yields the highest accuracy of 90.20% and 89.63% respectively for two physiological time series datasets with fewer training data in contrast to the state-of-the-art supervised learning model ChronoNet which gave 87.62% and 88.02% accuracy respectively for the two datasets with more training data.
9

Improving 3D Point Cloud Segmentation Using Multimodal Fusion of Projected 2D Imagery Data : Improving 3D Point Cloud Segmentation Using Multimodal Fusion of Projected 2D Imagery Data

He, Linbo January 2019 (has links)
Semantic segmentation is a key approach to comprehensive image data analysis. It can be applied to analyze 2D images, videos, and even point clouds that contain 3D data points. On the first two problems, CNNs have achieved remarkable progress, but on point cloud segmentation, the results are less satisfactory due to challenges such as limited memory resource and difficulties in 3D point annotation. One of the research studies carried out by the Computer Vision Lab at Linköping University was aiming to ease the semantic segmentation of 3D point cloud. The idea is that by first projecting 3D data points to 2D space and then focusing only on the analysis of 2D images, we can reduce the overall workload for the segmentation process as well as exploit the existing well-developed 2D semantic segmentation techniques. In order to improve the performance of CNNs for 2D semantic segmentation, the study has used input data derived from different modalities. However, how different modalities can be optimally fused is still an open question. Based on the above-mentioned study, this thesis aims to improve the multistream framework architecture. More concretely, we investigate how different singlestream architectures impact the multistream framework with a given fusion method, and how different fusion methods contribute to the overall performance of a given multistream framework. As a result, our proposed fusion architecture outperformed all the investigated traditional fusion methods. Along with the best singlestream candidate and few additional training techniques, our final proposed multistream framework obtained a relative gain of 7.3\% mIoU compared to the baseline on the semantic3D point cloud test set, increasing the ranking from 12th to 5th position on the benchmark leaderboard.
10

Event Boundary Detection Using Web-cating Texts And Audio-visual Features

Bayar, Mujdat 01 September 2011 (has links) (PDF)
We propose a method to detect events and event boundaries in soccer videos by using web-casting texts and audio-visual features. The events and their inaccurate time information given in web-casting texts need to be aligned with the visual content of the video. Most match reports presented by popular organizations such as uefa.com (the official site of Union of European Football Associations) provide the time information in minutes rather than seconds. We propose a robust method which is able to handle uncertainties in the time points of the events. As a result of our experiments, we claim that our method detects event boundaries satisfactorily for uncertain web-casting texts, and that the use of audio-visual features improves the performance of event boundary detection.

Page generated in 0.0898 seconds