About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Multimodal Representation Learning for Visual Reasoning and Text-to-Image Translation

January 2018 (has links)
abstract: Multimodal representation learning is a multi-disciplinary research field that aims to integrate information from multiple communicative modalities in a meaningful manner to help solve a downstream task. These modalities can be visual, acoustic, linguistic, haptic, etc. What counts as a "meaningful integration of information from different modalities" remains modality and task dependent. The downstream task can range from understanding one modality in the presence of information from other modalities to translating input from one modality to another. This thesis investigates the utility of multimodal representation learning both for understanding one modality given corresponding information in others, namely image understanding for visual reasoning, and for translating from one modality to another, specifically text-to-image translation. Visual reasoning has been an active area of research in computer vision. It encompasses advanced image processing and artificial intelligence techniques to locate, characterize, and recognize objects, regions, and their attributes in an image in order to comprehend the image itself. One way of building a visual reasoning system is to ask it questions about an image that require attribute identification, counting, comparison, multi-step attention, and reasoning. An intelligent system is thought to have a proper grasp of the image if it can answer such questions correctly and provide valid reasoning for its answers. This work investigates how such a system can be built by learning a multimodal representation between the image and the questions, and demonstrates how background knowledge, specifically scene-graph information, if available, can be incorporated into existing image understanding models. Multimodal learning provides an intuitive way of learning a joint representation between different modalities. Such a joint representation can be used to translate from one modality to the other; it also opens the way to learning a shared representation between these modalities and to specifying what that shared representation should capture. Using the surrogate task of text-to-image translation, this work investigates neural-network-based architectures for learning a shared representation between the two modalities, and proposes that such a shared representation can capture the parts of different modalities that are equivalent in some sense. Specifically, given an image and a semantic description of certain objects present in it, a shared representation between the text and image modalities capable of capturing the parts of the image mentioned in the text was demonstrated on a publicly available dataset. / Dissertation/Thesis / Masters Thesis Computer Engineering 2018
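[Editor's illustration] The joint image-question representation described in this abstract can be pictured with a minimal sketch: image features and an encoded question are projected into a common space, fused, and fed to an answer classifier. This is a generic baseline for illustration only, not the architecture developed in the thesis; all dimensions, names, and the random inputs are placeholder assumptions.

```python
# Minimal sketch of a joint image-question representation for visual question
# answering. Illustrative baseline only; dimensions and inputs are placeholders.
import torch
import torch.nn as nn

class JointVQAModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512,
                 img_feat_dim=2048, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)  # project CNN features
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, img_feats, question_tokens):
        # question_tokens: (batch, seq_len) integer ids
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q_repr = h_n[-1]                               # (batch, hidden_dim)
        v_repr = torch.relu(self.img_proj(img_feats))  # (batch, hidden_dim)
        joint = q_repr * v_repr                        # element-wise fusion
        return self.classifier(joint)                  # answer logits

# Toy forward pass with random inputs.
model = JointVQAModel()
logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 1000])
```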
2

Data-Driven Representation Learning in Multimodal Feature Fusion

January 2018 (has links)
abstract: Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is central to achieving improved model robustness and inference performance. This dissertation focuses on representation learning approaches as the fusion strategy. Specifically, the objective is to learn a shared latent representation that jointly exploits the structural information encoded in all modalities, so that a straightforward learning model can be adopted to obtain the prediction. We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described that supports both multiple sensors and multiple descriptors for activity recognition. Designed to learn the optimal combination of kernels, Multiple Kernel Learning (MKL) algorithms have been successfully applied to numerous fusion problems in computer vision and related fields. Utilizing the MKL formulation, we next describe an auto-context algorithm for learning image context via fusion with low-level descriptors. Furthermore, a principled fusion algorithm that uses deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems. In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently special design of the learning architecture is needed. To improve temporal modeling for multivariate sequences, we developed two architectures centered on attention models. A novel clinical time-series analysis model is proposed for several critical problems in healthcare, and another model, coupled with a triplet ranking loss as a metric learning framework, is described for speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance with lower computational complexity. Finally, to perform community detection on multilayer graphs, a fusion algorithm is described that derives node embeddings from word embedding techniques and exploits the complementary relational information contained in each layer of the graph. / Dissertation/Thesis / Doctoral Dissertation Electrical Engineering 2018
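[Editor's illustration] The kernel-level fusion idea behind MKL, as described in this abstract, can be sketched minimally: one kernel per modality, combined as a weighted sum and passed to a kernel machine. In real MKL the combination weights are learned jointly with the classifier; here they are fixed, and the data, feature dimensions, and labels are random placeholders rather than the dissertation's sensor datasets.

```python
# Minimal sketch of kernel-level fusion in the spirit of Multiple Kernel Learning:
# one RBF kernel per modality, combined with fixed weights, fed to a precomputed-
# kernel SVM. Real MKL learns the weights; data here are random placeholders.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_accel = rng.normal(size=(200, 30))   # e.g., accelerometer descriptors
X_audio = rng.normal(size=(200, 40))   # e.g., audio descriptors
y = rng.integers(0, 2, size=200)       # activity labels

weights = [0.5, 0.5]                   # fixed combination weights (MKL would learn these)
K = weights[0] * rbf_kernel(X_accel) + weights[1] * rbf_kernel(X_audio)

clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))
```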
3

Multimodal Learning and Single Source WiFi Based Indoor Localization

Wu, Hongyu 15 June 2020 (has links)
No description available.
4

Audio-tactile displays to improve learnability and perceived urgency of alarming stimuli

Momenipour, Amirmasoud 01 August 2019 (has links)
Based on cross-modal learning and multiple resource theory, human performance can be improved by receiving and processing additional streams of information from the environment. In alarm situations, alarm meanings need to be distinguishable from each other and learnable for users. In audible alarms, different signals can be generated by manipulating the temporal characteristics of sounds. However, in some cases, such as discrete medical alarms, when there are too many audible signals to manage, changes in temporal characteristics may not generate discriminable signals that are easy for listeners to learn. Multimodal displays can be developed to generate additional auditory, visual, and tactile stimuli, helping humans benefit from cross-modal learning and multiple attentional resources for a better understanding of alarm situations. In work domains where alarms are predominantly auditory and visual displays cannot be accessed at all times, tactile displays can enhance the effectiveness of alarms by providing additional streams of information for understanding them. However, because of the low information density of tactile presentation, the use of tactile alarms has been limited. In this thesis, the learnability of auditory and tactile alarms, separately and together in an audio-tactile display, was studied with human subjects. The objective was to test cross-modal learning when the messages of an alarm (i.e., its meaning and urgency level) were conveyed simultaneously in audible, tactile, and audio-tactile alarm displays. The alarm signals were designed using spatial characteristics of tactile signals and temporal characteristics of audible signals, separately in audible and tactile displays as well as together in an audio-tactile display. This study explored whether using multimodal (audio-tactile) alarms would help with learning unimodal (audible or tactile) alarm meanings and urgency levels. The findings can inform the design of more efficient discrete audio-tactile alarms that promote learnability of alarm meanings and urgency levels.
5

MULTIMODAL VIRTUAL LEARNING ENVIRONMENTS: THE EFFECTS OF VISUO-HAPTIC SIMULATIONS ON CONCEPTUAL LEARNING

Mayari Serrano Anazco (8790932) 03 May 2020 (has links)
Presently, it is possible to use virtual learning environments to simulate abstract and/or complex scientific concepts. Multimodal virtual learning environments use multiple sensory stimuli, including haptic feedback, in the representation of concepts. Past research on the utilization of haptics for learning has shown inconsistent results when gains in conceptual knowledge were assessed. This research focused on two abstract phenomena: electricity and magnetism, and buoyancy. These concepts were experienced by students through either visual, visuo-haptic, or hands-on learning activities. Embodied cognition theory was used as a framework for the implementation of the learning environments. Both phenomena were assessed using qualitative and quantitative data analysis techniques. Results suggested that the haptic, visual, and physical modalities positively affected the acquisition of conceptual knowledge of both concepts.
6

Multimodal Image Classification In Fluoropolymer AFM And Chest X-Ray Images

Meshnick, David Chad 26 May 2023 (has links)
No description available.
7

Modeling Complex Networks via Graph Neural Networks

Yella, Jaswanth 05 June 2023 (has links)
No description available.
8

Transfer Learning and Attention Mechanisms in a Multimodal Setting

Greco, Claudio 13 May 2022 (has links)
Humans are able to develop a solid knowledge of the world around them: they can leverage information coming from different sources (e.g., language, vision), focus on the most relevant information in the input they receive in a given situation, and exploit what they have learned before without forgetting it. In the fields of Artificial Intelligence and Computational Linguistics, replicating these human abilities in artificial models is a major challenge. Recently, models based on pre-training and attention mechanisms, namely pre-trained multimodal Transformers, have been developed. They seem to perform surprisingly well compared to other computational models in multiple contexts. They simulate human-like cognition in that they supposedly rely on previously acquired knowledge (transfer learning) and focus on the most important parts of the input (attention mechanisms). Nevertheless, we still do not know whether these models can deal with multimodal tasks that require merging different types of information simultaneously, as humans would do. This thesis attempts to fill this gap in our knowledge of multimodal models by investigating the ability of pre-trained Transformers to encode multimodal information, and the ability of attention-based models to remember how to deal with previously solved tasks. With regard to pre-trained Transformers, we focus on their ability to rely on pre-training and on attention while dealing with tasks that require merging information from language and vision. More precisely, we investigate whether pre-trained multimodal Transformers are able to understand the internal structure of a dialogue (e.g., the organization of turns); to solve complex spatial questions that require processing different spatial elements (e.g., regions of the image, proximity between elements, etc.); and to make predictions based on complementary multimodal cues (e.g., guessing the most plausible action by leveraging the content of a sentence and of an image). The results of this thesis indicate that pre-trained Transformers outperform other models: they are able, to some extent, to integrate complementary multimodal information, and they manage to pinpoint both the relevant turns in a dialogue and the most important regions of an image. These results suggest that pre-training and attention play a key role in how pre-trained Transformers encode their input. Nevertheless, their way of processing information cannot be considered human-like. When compared to humans, they struggle (as non-pre-trained models do) to understand negative answers, to merge spatial information in difficult questions, and to predict actions from complementary linguistic and visual cues. With regard to attention-based models, we found that they tend to forget what they have learned in previously solved tasks; however, training these models on easy tasks before more complex ones seems to mitigate this catastrophic forgetting. These results indicate that, at least in this context, attention-based models (and, presumably, pre-trained Transformers too) are sensitive to task order. Better control of this variable may therefore help multimodal models learn sequentially and continually, as humans do.
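[Editor's illustration] The attention-over-modalities mechanism this abstract discusses can be sketched as a single cross-attention step in which text tokens attend over image region features. This is a generic illustration of the idea, not the pre-trained multimodal Transformers evaluated in the thesis; the module name, dimensions, and random inputs are assumptions.

```python
# Minimal sketch of cross-modal attention: text tokens attend over image
# region features so language and vision are merged into one representation.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_regions):
        # text_tokens: (batch, n_tokens, dim); image_regions: (batch, n_regions, dim)
        attended, weights = self.attn(query=text_tokens,
                                      key=image_regions,
                                      value=image_regions)
        # Residual connection keeps the linguistic signal; the weights show
        # which image regions each word attends to.
        return self.norm(text_tokens + attended), weights

fused, attn_weights = CrossModalAttention()(torch.randn(2, 10, 256),
                                            torch.randn(2, 36, 256))
print(fused.shape, attn_weights.shape)  # (2, 10, 256) (2, 10, 36)
```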
9

Storytime at Irish Libraries : How public libraries can boost early literacy through reading promotion events

O'Driscoll, Mariana January 2019 (has links)
The purpose of this Master's thesis is to explore how libraries' storytime for babies and toddlers can be construed as a reading promotion event that boosts early literacy, by way of a multiple-case study of storytime at four public libraries in western Ireland. The study also explores, through a sociocultural lens, how the different libraries design these events and include elements of traditional reading and storytelling, multimodal reading and technology, participation, and accessibility and inclusion. The theoretical framework is built on concepts and theory related to literacy development and reading promotion, and serves as the analytical tool through which the empirical data are examined. Data were collected through observations of storytimes at four public libraries in Ireland, as well as interviews with the librarians involved. The results show that although the librarians do not actively work to implement national and EU storytime templates, they offer programmes that are in tune with their participants' needs and foster engagement and excitement about reading among children and parents or guardians alike.
10

Visual Interactive Labeling of Large Multimedia News Corpora

Han, Qi, John, Markus, Kurzhals, Kuno, Messner, Johannes, Ertl, Thomas 25 January 2019 (has links)
The semantic annotation of large multimedia corpora is essential for numerous tasks. Be it for training classification algorithms, for efficient content retrieval, or for analytical reasoning, appropriate labels are often a prerequisite before automatic processing becomes efficient. However, manual labeling of large datasets is time-consuming and tedious. Hence, we present a new visual approach for labeling and retrieving reports in multimedia news corpora. It combines automatic classifier training, based on caption text from news reports, with human interpretation to ease the annotation process. In our approach, users can initialize labels with keyword queries and iteratively annotate examples to train a classifier. The proposed visualization displays representative results in an overview that allows users to follow different annotation strategies (e.g., active learning) and to assess the quality of the classifier. Based on a usage scenario, we demonstrate the successful application of our approach: users label several topics of interest and retrieve related documents with high confidence from three years of news reports.
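[Editor's illustration] The label-and-retrain loop described in this abstract can be sketched minimally: captions matching a keyword query seed the labels, a text classifier is trained, and the least confident items are surfaced for human annotation (an uncertainty-based active-learning strategy). The captions, query, and classifier choice below are toy assumptions, not the news corpus or system from the paper.

```python
# Minimal sketch of keyword-initialized labeling plus an uncertainty-sampling
# step. Toy data; the real approach works on caption text from news reports.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

captions = [
    "election results announced in the capital",
    "local team wins the championship final",
    "parliament debates new election law",
    "storm causes flooding in coastal towns",
    "voters head to the polls for the election",
    "new stadium opens ahead of the season",
]

# 1. Initialize labels from a keyword query.
query = "election"
labels = np.array([int(query in c) for c in captions])

# 2. Train a classifier on the current labels.
X = TfidfVectorizer().fit_transform(captions)
clf = LogisticRegression().fit(X, labels)

# 3. Rank examples by uncertainty for the next human annotation round.
confidence = np.abs(clf.predict_proba(X)[:, 1] - 0.5)
for idx in np.argsort(confidence)[:3]:
    print(f"ask annotator about: {captions[idx]!r}")
```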
