1 |
Learning video preferences using visual features and closed captions. Brezeale, Darin.
Thesis (Ph.D.)--The University of Texas at Arlington, 2007. / Adviser: Diane J. Cook. Includes bibliographical references.
|
2 |
Representing Emotions with Animated Text. Rashid, Raisa (25 July 2008)
Closed captioning has not improved since the early 1970s, while film and television technology has changed dramatically. Closed captioning conveys only verbatim dialogue to the audience, ignoring music, sound effects and speech prosody. Thus, caption viewers receive limited and often erroneous information. My thesis research attempts to add some of the missing sounds and emotions back into captioning using animated text.
The study involved two animated caption styles and one conventional style: enhanced, extreme and closed. All styles were applied to two clips, with animations for the emotions of happiness, sadness, anger, fear and disgust. Twenty-five hard-of-hearing and hearing participants viewed and commented on the three caption styles and also identified the character's emotions. The study revealed that participants preferred the enhanced, animated captions. Enhanced captions appeared to improve access to the emotive information in the content, and the animation for fear appeared to be the most easily understood by participants.
|
3 |
Learning to read from television: the effects of closed captioning and narration. Linebarger, Deborah Lorraine (1998)
Thesis (Ph. D.)--University of Texas at Austin, 1998. / Vita. Includes bibliographical references (leaves 145-157). Available also in a digital version from Dissertation Abstracts.
|
4 |
CCTV use by visually impaired seniors living independently in community settings. Ellingsberg, Carol E. (2002)
Thesis--Plan B (M.S.)--University of Wisconsin--Stout, 2002. / Includes bibliographical references.
|
5 |
Towards Affective Vision and Language. Haydarov, Kilichbek (30 November 2021)
Developing intelligent systems that can recognize and express human affect is essential to bridge the gap between human and artificial intelligence. This thesis explores the creative and emotional frontiers of artificial intelligence. Specifically, we investigate the relation between the affective impact of visual stimuli and natural language by collecting and analyzing a new dataset called ArtEmis. Furthermore, capitalizing on this dataset, we demonstrate affective AI models that can talk about artwork in emotional terms and generate artwork from affective descriptions. For the text-to-image generation task, we present HyperCGAN: a conceptually simple and general approach to text-to-image synthesis that uses hypernetworks to condition a GAN model on text. In our setting, the generator and discriminator weights are controlled by their corresponding hypernetworks, which modulate the weight parameters based on the provided text query. We explore different mechanisms for modulating the layers depending on the underlying architecture of the target network and the structure of the conditioning variable.
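The abstract does not spell out the HyperCGAN architecture, but the general idea of a hypernetwork modulating a target layer's weights from a text embedding can be sketched roughly as follows. This is a minimal, hypothetical PyTorch illustration, not the thesis's actual design: the class name, the per-output-channel scaling scheme, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperLinear(nn.Module):
    """Linear layer whose output is modulated by a hypernetwork
    conditioned on a text embedding (hypothetical sketch)."""

    def __init__(self, in_dim, out_dim, text_dim, hyper_hidden=128):
        super().__init__()
        self.base_weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        # Hypernetwork: maps the text embedding to one scale per output unit.
        self.hyper = nn.Sequential(
            nn.Linear(text_dim, hyper_hidden),
            nn.ReLU(),
            nn.Linear(hyper_hidden, out_dim),
        )

    def forward(self, x, text_emb):
        # x: (batch, in_dim), text_emb: (batch, text_dim)
        scale = 1.0 + self.hyper(text_emb)               # (batch, out_dim)
        out = F.linear(x, self.base_weight, self.bias)   # shared base transform
        return out * scale                               # text-conditioned modulation


# Toy usage: a generator block conditioned on a caption embedding.
layer = HyperLinear(in_dim=128, out_dim=256, text_dim=512)
noise_features = torch.randn(4, 128)
caption_embedding = torch.randn(4, 512)
modulated = layer(noise_features, caption_embedding)     # (4, 256)
```

A full hypernetwork could instead emit the entire weight tensor of the target layer; per-channel scaling is used here only to keep the sketch short.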
|
6 |
The Effects of Captioning and Viewing Original Versions in English on Long-term Acquisition and Comprehension of the English Language. Martínez Copete, Antonio (10 July 2020)
The way English as a Foreign Language (EFL) students build their skills while acquiring the language has recently been affected by the rapid spread of broadband Internet, and more particularly by the appearance of Original Version (OV) video streaming, now available to the many English language teachers and students who take advantage of this verbal medium and the new opportunities it provides for education and culture. In the present work, we investigate empirically the transition from CEFR B1, through B2, towards C1 of different EFL students, examining how unadapted captioned videos affect the way they eventually perform when: (1) gist understanding is evaluated with the captions left on (B1 students in Study 1); (2) learners' listening-skill gains are compared according to whether they are in the habit of watching OV content on a regular basis (B1-B2 students in Study 2); and (3) advanced students' (C1 students in Study 3) reading and listening skills are observed as they improve depending on their level of voluntary long-term exposure to OV videos. The results lead us to conclude that B2 is the tipping point at which students can really enjoy and take advantage of video-streamed TV shows in English, as lower levels showed inconsistent results. Complementarily, C1 students who had been independently watching OV TV programmes on a regular basis scored better on the listening comprehension tests than those who had not, while the reading comprehension results showed no difference between the two groups.
|
7 |
Multicultural Emotional Reasoning in Vision Language Models. Mohamed, Youssef Sherif Mansour.
Human intelligence, with its many components, has been elusive. Until recently, the emphasis has been on facts and how humans perceive them; now it is time to embellish these facts with emotions and commentary. Emotional experiences and expressions play a critical role in human behavior and are influenced by language and cultural diversity. In this thesis, we explore the importance of emotions across multiple languages, such as Arabic, Chinese, and Spanish, and we argue for the importance of collecting diverse emotional experiences, including negative ones. We aim to develop AI systems that have a deeper understanding of emotional experiences. We open-source two datasets that emphasize diversity in emotion, language, and culture. ArtELingo contains affective annotations in the aforementioned languages, revealing valuable insights into how linguistic backgrounds shape emotional perception and expression, while ArtEmis 2.0 provides a balanced distribution of positive and negative emotional experiences. Studying emotional experiences in AI is crucial for creating applications that genuinely understand and resonate with users.
We identify and tackle challenges in popular existing affective captioning datasets, mainly unbalanced emotion distributions and generic captions, by proposing a contrastive data collection method. This approach results in a dataset with a balanced distribution of emotions, significantly enhancing the quality of trained neural speakers and emotion recognition models. Consequently, our trained speakers generate emotionally accurate and relevant captions, demonstrating the advantages of using a linguistically and emotionally diverse dataset in AI systems.
In addition, we explore the cultural aspects of emotional experiences and expressions, highlighting the importance of considering cultural differences in the development of AI applications. By incorporating these insights, our research lays the groundwork for future advancements in culturally diverse affective computing.
This thesis establishes a foundation for future research in emotionally and culturally diverse affective computing, contributing to the development of AI applications capable of effectively understanding and engaging with humans on a deeper emotional level, regardless of their cultural background.
|
8 |
Image Captioning for Remote Sensing Image Analysis. Hoxha, Genc (9 August 2022)
Image Captioning (IC) aims to generate a coherent and comprehensive textual description that summarizes the complex content of an image. It combines computer vision and natural language processing techniques to encode the visual features of an image and translate them into a sentence. In the context of remote sensing (RS) analysis, IC has emerged as a research area of high interest, since it not only recognizes the objects within an image but also describes their attributes and relationships. In this thesis, we propose several IC methods for RS image analysis, focusing on approaches that take into consideration the peculiarities of RS images (e.g. spectral, temporal and spatial properties) and studying the benefits of IC in challenging RS applications.
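As a rough illustration of the encoder-decoder pipeline described above (not the specific models developed in this thesis), a minimal captioner that conditions a recurrent language model on pooled image features might look like the following hypothetical PyTorch sketch; the class name, dimensions, and teacher-forcing setup are assumptions.

```python
import torch
import torch.nn as nn

class SimpleCaptioner(nn.Module):
    """Minimal encoder-decoder captioner: pooled image features initialize
    an LSTM language model trained with teacher forcing (hypothetical sketch)."""

    def __init__(self, feat_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image features -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, caption_tokens):
        # image_feats: (batch, feat_dim), e.g. pooled CNN features of the RS image
        # caption_tokens: (batch, seq_len) ground-truth word indices (teacher forcing)
        h0 = torch.tanh(self.init_h(image_feats)).unsqueeze(0)   # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        emb = self.embed(caption_tokens)                          # (batch, seq_len, embed)
        states, _ = self.lstm(emb, (h0, c0))
        return self.to_vocab(states)                              # per-step word logits
```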
In particular, we focus our attention on developing a new decoder based on support vector machines. Compared to traditional decoders based on deep learning, the proposed decoder is particularly interesting in situations where only a few training samples are available, alleviating the problem of overfitting. The appeal of the proposed decoder is its simplicity and efficiency: it has only one hyperparameter, does not require expensive processing units, and is very fast in terms of training and testing time, making it suitable for real-life applications.
Despite the efforts made in developing reliable and accurate IC systems, the task is far from being solved. The generated descriptions are affected by several errors related to the attributes and objects present in an RS scene, and once an error occurs, it propagates through the recurrent layers of the decoder, leading to inaccurate descriptions. To cope with this issue, we propose two post-processing techniques that improve the generated sentences by detecting and correcting potential errors. They are based on a Hidden Markov Model and the Viterbi algorithm: the former generates a set of possible states, while the latter finds the optimal sequence of states. The proposed post-processing techniques can be injected into any IC system at test time to improve the quality of the generated sentences.
While all the captioning systems developed in the RS community are devoted to single RGB images, we propose two captioning systems that can be applied to multitemporal and multispectral RS images. The proposed systems are able to describe the changes that have occurred in a given geographical area through time; we refer to this new paradigm of analysing multitemporal and multispectral images as change captioning (CC). To test the proposed CC systems, we construct two novel datasets composed of bitemporal RS images: the first consists of very high-resolution RGB images, the second of medium-resolution multispectral satellite images. To advance the task of CC, the constructed datasets are publicly available at the following link: https://disi.unitn.it/~melgani/datasets.html.
Finally, we analyse the potential of IC for content-based image retrieval (CBIR) and show its applicability and advantages compared to traditional techniques. Specifically, we develop a CBIR system that represents an image with generated descriptions and uses sentence similarity to search for and retrieve relevant RS images. Compared to traditional CBIR systems, the proposed system can search for and retrieve images using either an image or a sentence as a query, making it more convenient for end users. The achieved results show the promising potential of our proposed methods compared to the baselines and state-of-the-art methods.
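The abstract does not give the exact HMM formulation used for caption post-processing, but the Viterbi step it refers to, finding the optimal sequence of states given candidate states per word, can be illustrated with a generic sketch. The state space, emission probabilities, and transition probabilities below are placeholders, not the thesis's actual model.

```python
import numpy as np

def viterbi(init_probs, trans_probs, emit_probs):
    """Most likely state sequence for a sequence of observations.

    init_probs:  (S,)   initial state probabilities
    trans_probs: (S, S) trans_probs[i, j] = P(state j at t+1 | state i at t)
    emit_probs:  (T, S) emit_probs[t, s]  = P(observation at step t | state s)
    """
    T, S = emit_probs.shape
    log_delta = np.log(init_probs) + np.log(emit_probs[0])   # best log-score per state
    backptr = np.zeros((T, S), dtype=int)

    for t in range(1, T):
        scores = log_delta[:, None] + np.log(trans_probs)    # (S, S): previous -> current
        backptr[t] = scores.argmax(axis=0)                   # best predecessor per state
        log_delta = scores.max(axis=0) + np.log(emit_probs[t])

    # Backtrack from the best final state to recover the optimal path.
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]


# Toy usage: 3 candidate "word states" per step over a 4-word caption.
rng = np.random.default_rng(0)
emissions = rng.dirichlet(np.ones(3), size=4)                # (T=4, S=3)
transitions = rng.dirichlet(np.ones(3), size=3)              # (S, S)
initial = np.ones(3) / 3
print(viterbi(initial, transitions, emissions))
```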
|
9 |
Vision and language understanding with localized evidence. Xu, Huijuan (16 February 2019)
Enabling machines to solve computer vision tasks with natural language components can greatly improve human interaction with computers. In this thesis, we address vision and language tasks with deep learning methods that explicitly localize relevant visual evidence. Spatial evidence localization in images enhances the interpretability of the model, while temporal localization in video is necessary to remove irrelevant content. We apply our methods to various vision and language tasks, including visual question answering, temporal activity detection, dense video captioning and cross-modal retrieval.
First, we tackle the problem of image question answering, which requires the model to predict answers to questions posed about images. We design a memory network with a question-guided spatial attention mechanism that assigns higher weights to regions more relevant to the question; the visual evidence used to derive the answer can be shown by visualizing the attention weights over the image. We then address the problem of localizing temporal evidence in videos. For most language/vision tasks, only part of the video is relevant to the linguistic component, so we need to detect these relevant events. We propose an end-to-end model for temporal activity detection that can detect activities of arbitrary length by coordinate regression with respect to anchors and that contains a proposal stage to filter out background segments, saving computation time. We further extend activity category detection to event captioning, which can express richer semantic meaning than a class label. This leads to the problem of dense video captioning, which involves two sub-problems: localizing distinct events in long videos and generating captions for the localized events. We propose an end-to-end hierarchical captioning model with vision and language context modeling, in which the captioning training affects the activity localization. Lastly, the task of text-to-clip video retrieval requires localizing the specified query rather than detecting and captioning all events. We propose a model based on the early fusion of words and visual features, outperforming standard approaches that embed the whole sentence before performing late feature fusion. Furthermore, we use the query to regulate the proposal network so that it generates query-related proposals.
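The precise memory-network design is not detailed in this abstract; as a rough illustration of question-guided spatial attention (assigning higher weights to question-relevant regions), a minimal PyTorch module might look like the following sketch, where the projection dimensions and the additive scoring function are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Scores each image region against a question embedding and returns
    a question-weighted summary of the regions (hypothetical sketch)."""

    def __init__(self, region_dim, question_dim, attn_dim=512):
        super().__init__()
        self.proj_regions = nn.Linear(region_dim, attn_dim)
        self.proj_question = nn.Linear(question_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, question):
        # regions: (batch, num_regions, region_dim), question: (batch, question_dim)
        joint = torch.tanh(self.proj_regions(regions)
                           + self.proj_question(question).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, num_regions)
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)    # weighted region summary
        return attended, weights  # weights can be visualized as spatial evidence
```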
In conclusion, our proposed visual localization mechanism applies across a variety of vision and language tasks and achieves state-of-the-art results. Together with the inference module, our work can contribute to solving other tasks such as video question answering in future research.
|