11

Forced Attention for Image Captioning

Hemanth Devarapalli (5930603) 17 January 2019 (has links)
Automatic generation of captions for a given image is an active research area in Artificial Intelligence. Captioning architectures have evolved from classical machine learning applied to image metadata to neural networks. Two styles of neural architecture have emerged for image captioning: the Encoder-Attention-Decoder architecture and the transformer architecture. This study attempts to modify the attention mechanism so that any object can be specified. An archetypical Encoder-Attention-Decoder architecture, Show, Attend, and Tell (Xu et al., 2015), is employed as the baseline for this study, and a modification of the Show, Attend, and Tell architecture is proposed. Both architectures are evaluated on the MSCOCO dataset (Lin et al., 2014), and seven metrics are calculated: BLEU-1, 2, 3, 4 (Papineni, Roukos, Ward & Zhu, 2002), METEOR (Banerjee & Lavie, 2005), ROUGE-L (Lin, 2004), and CIDEr (Vedantam, Lawrence & Parikh, 2015). Finally, the statistical significance of the results is evaluated by performing paired t-tests.
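As a rough illustration of the significance-testing step described above (a sketch, not code from the thesis), a paired t-test over per-image metric scores can be run with SciPy; the score arrays below are hypothetical placeholders, not results from the study.

```python
# Sketch: paired t-test over per-image caption metric scores (e.g., CIDEr)
# for a baseline model vs. a modified model. The arrays are hypothetical
# placeholders, not values reported in the thesis.
import numpy as np
from scipy import stats

baseline_scores = np.array([0.92, 1.10, 0.78, 1.35, 0.88])   # baseline model, one score per image
modified_scores = np.array([0.95, 1.18, 0.80, 1.31, 0.97])   # modified model, same images

# ttest_rel performs a paired (dependent-samples) t-test: each image is
# scored by both models, so the samples are paired by image.
t_stat, p_value = stats.ttest_rel(modified_scores, baseline_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```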
12

Construction of linefeed insertion rules for lecture transcript and their evaluation

Matsubara, Shigeki, Ohno, Tomohiro, Murata, Masaki January 2010 (has links)
No description available.
13

L1/L2 Eye Movement Reading of Closed Captioning: A Multimodal Analysis of Multimodal Use

Specker, Elizabeth January 2008 (has links)
Learning in a multimodal environment entails the presentation of information in a combination of more than one mode (i.e. written words, illustrations, and sound). Past research regarding the benefits of multimodal presentation of information includes both school-age children and adult learners (e.g. Koolstra, van der Voort & d'Ydewalle, 1999; Neuman & Koskinen, 1992), as well as both native and non-native language learners (e.g. d'Ydewalle & Gielen, 1992; Kothari et al., 2002). This dissertation focuses on how various modalities are used in combination by learners of differing English proficiency to gain better comprehension (cf. Mayer, 1997, 2005; Graber, 1990; Slykhuis et al., 2005). The addition of the written mode (closed captioning) to the already multimodal environment of film and video presentations is analyzed. A Multimodal Multimedia Communicative Event is used to situate the language learner. The research questions focus on the eye movements of the participants as they read moving text both with and without the audio and video modes of information. Small case studies also give context to four participants by bringing their individual backgrounds and observations to bear on the use of multimodal texts as language-learning tools in a second or foreign language learning environment. It was found that Non-Native English Speakers (NNS) (L1 Arabic) show longer eye-movement patterns in reading dynamic text (closed captioning), echoing past research with static texts, while Native Speakers of English (NS) tend to have quicker eye movements. In a multimodal environment the two groups also differed: NNS looked longer at the closed captioning, and NS were able to navigate the text presentation quickly. While associative activation (Paivio, 2007) between the audio and print modalities was not found to alter the eye-movement patterns of the NNS, participants did alternate between the modalities in search of supplementary information. Other research on closed captioning and subtitling has shown that viewing a video program with added written text turns the activity into a reading activity (Jensema, 2000; d'Ydewalle, 1987). The current study found this to be the case, but the results differed with respect to proficiency and strategy.
14

Aika painaa: oopperan tekstilaitekäännöksen toiminnalliset rajat [Time presses: the functional limits of opera surtitle translation]

Virkkunen, Riitta. January 2004 (has links)
Thesis (Ph.D.)--Tampereen yliopisto, 2004. Includes bibliographical references (p. 253-263) and discography (p. 251-252). Also available online.
15

Describing and retrieving visual content using natural language

Ramanishka, Vasili 11 February 2021 (has links)
Modern deep learning methods have boosted research progress in visual recognition and text understanding, but it is a non-trivial task to unite these advances from both disciplines. In this thesis, we develop models and techniques that allow us to connect natural language and visual content, enabling automatic video subtitling, visual grounding, and text-based image search. Such models could be useful in a wide range of applications in robotics and human-computer interaction, bridging the gap in vision and language understanding. First, we develop a model that generates natural language descriptions of the main activities and scenes depicted in short videos. While previous methods were constrained to a predefined list of objects, actions, or attributes, our model learns to generate descriptions directly from raw pixels. The model exploits available audio information and the video’s category (e.g., cooking, movie, education) to generate more relevant and coherent sentences. Then, we introduce a technique for visual grounding of generated sentences using the same video description model. Our approach allows for explaining the model’s prediction by localizing salient video regions for corresponding words in the generated sentence. Lastly, we address the problem of image retrieval. Existing cross-modal retrieval methods work by learning a common embedding space for different modalities using parallel data such as images and their accompanying descriptions. Instead, we focus on the case when images are connected by relative annotations: given the context set as an image and its metadata, the user can specify desired semantic changes using natural language instructions. The model needs to capture distinctive visual differences between image pairs as described by the user. Our approach enables interactive image search such that the natural language feedback significantly improves the efficacy of image retrieval. We show that the proposed methods advance the state of the art for video captioning and image retrieval tasks in terms of both accuracy and interpretability.
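A minimal sketch of the common-embedding-space idea mentioned in this abstract, assuming generic image and text feature dimensions; it is not the author's model, just a dual encoder trained with a symmetric contrastive loss over parallel image-caption pairs.

```python
# Minimal dual-encoder sketch for cross-modal retrieval: project image and
# text features into one embedding space and train with a symmetric
# contrastive (InfoNCE-style) loss. All dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # e.g., pooled CNN features -> shared space
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # e.g., sentence-encoder output -> shared space

    def forward(self, img_feats, txt_feats):
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img, txt

def contrastive_loss(img, txt, temperature=0.07):
    # Similarity matrix between every image and every caption in the batch;
    # the diagonal holds the matching (parallel) pairs.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with random stand-in features (batch of 8 image-caption pairs).
model = DualEncoder()
img_feats, txt_feats = torch.randn(8, 2048), torch.randn(8, 768)
loss = contrastive_loss(*model(img_feats, txt_feats))
```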
16

Going Deeper with Images and Natural Language

Ma, Yufeng 29 March 2019 (has links)
One aim in the area of artificial intelligence (AI) is to develop a smart agent with high intelligence that is able to perceive and understand the complex visual environment around us. More ambitiously, it should be able to interact with us about its surroundings in natural language. Thanks to the progress made in deep learning, we've seen huge breakthroughs towards this goal over the last few years. The developments have been extremely rapid in visual recognition, where machines can now categorize images into multiple classes and detect various objects within an image, with an ability that is competitive with or even surpasses that of humans. Meanwhile, we have witnessed similar strides in natural language processing (NLP): computers are now able to perform text classification, machine translation, and similar tasks almost perfectly. However, despite much inspiring progress, most of these achievements are still confined to a single domain and do not handle inter-domain situations. The interaction between the visual and textual areas is still quite limited, although there has been progress in image captioning, visual question answering, etc.
In this dissertation, we design models and algorithms that enable us to build in-depth connections between images and natural languages, which help us to better understand their inner structures. In particular, first we study how to make machines generate image descriptions that are indistinguishable from those written by humans, which as a result also achieves better quantitative evaluation performance. Second, we devise a novel algorithm for measuring review congruence, which takes an image and review text as input and quantifies the relevance of each sentence to the image. The whole model is trained without any supervised ground-truth labels. Finally, we propose a brand new AI task called Image Aspect Mining, to detect visual aspects in images and identify aspect-level ratings within the review context. On the theoretical side, this research contributes to multiple research areas in Computer Vision (CV), Natural Language Processing (NLP), interactions between CV and NLP, and Deep Learning. Regarding impact, these techniques will benefit related users such as the visually impaired, customers reading reviews, merchants, and AI researchers in general. / Doctor of Philosophy
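As a loose illustration of the review-congruence idea (scoring each review sentence against the image), one could compute cosine similarities in a shared embedding space; the embeddings below are stand-ins, and the thesis's actual model is trained without supervised labels.

```python
# Sketch: quantify how relevant each review sentence is to an image by
# cosine similarity in a shared embedding space. Embeddings are stand-ins,
# not outputs of the thesis's unsupervised congruence model.
import numpy as np

def congruence_scores(image_vec, sentence_vecs):
    img = image_vec / np.linalg.norm(image_vec)
    sents = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    return sents @ img                      # one relevance score per sentence

image_vec = np.random.randn(256)            # hypothetical image embedding
sentence_vecs = np.random.randn(5, 256)     # hypothetical embeddings for 5 review sentences
print(congruence_scores(image_vec, sentence_vecs))
```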
17

A Netflix Original Closed Captioning Study: How Netflix Closed Captions Make Audiovisual Content Accessible to Deaf Audiences

Gomizelj, Anna 21 December 2022 (has links)
Netflix is currently the world's largest subscription-based streaming platform, with 221.8 million subscribers worldwide (Maglio, 2022). Part of Netflix's enormous global appeal is its Netflix Original brand of films and TV shows - content it produces specifically for broadcast on its streaming platform. To make its content accessible to deaf and hard-of-hearing audiences, Netflix subcontracts the creation of closed captioning to vendors, instructing them to follow the Timed Text Style Guide (TTSG), which it makes freely available online. My study examines how closed captions for Netflix Original content endeavour to make audiovisual content accessible to deaf audiences, and I demonstrate how the Platonic ideal of "equal access" is out of reach due to the limitations of timed text. The objective of my study is to highlight and critique the transformations of meaning that occur when captions translate sound and spoken dialogue into timed text. Drawing on D'Acci's circuit model of media studies (2004), my thesis links the sociohistorical conditions from which captioning techniques and technologies were developed, the conditions of caption production, and the way in which the needs of deaf audiences are articulated in the TTSG. I explore how these three forces affect the content of closed captions. To this end, I engage in a close reading of the TTSG and a selection of closed captions for Netflix Original series and films, borrowing from Berman's (2000) theories regarding the deforming tendencies of translation to describe the changes that result from the intralingual and intersemiotic translation involved in captioning (Jakobson, 2004). My study is informed and inspired by my personal experience as a professional captioner.
18

A New Framework and Novel Techniques to Multimodal Concept Representation and Fusion

Lin, Xudong January 2024 (has links)
To solve real-world problems, machines are required to perceive multiple modalities and fuse the information from them. This thesis studies learning to understand and fuse multimodal information. Existing approaches follow a three-stage learning paradigm. The first stage is to train models for each modality. For video understanding models, this process is usually based on supervised training, which is not scalable. Moreover, these modality-specific models are updated rather frequently as single-modality perception abilities improve. The second stage is crossmodal pretraining, which trains a model to align and fuse multiple modalities based on paired multimodal data, such as video-caption pairs. This process is resource-consuming and expensive. The third stage is to further fine-tune or prompt the resulting model from the second stage towards certain downstream tasks. The key bottleneck of conventional methods lies in the continuous feature representation used for non-textual modalities, which is usually costly to align and fuse with text. In this thesis, we investigate representation and fusion based on textual concepts. We propose to map non-textual modalities to textual concepts and then fuse these textual concepts using text models. We systematically study various mapping methods and different fusion architectures. The proposed methods include an end-to-end video-based text generation model with differentiable tokenization for video and audio concepts, a contrastive-model-based architecture with a zero-shot concept extractor, a deep concept injection algorithm that enables language models to solve multimodal tasks without any training, and a distant supervision framework that learns concepts over a long temporal span. With our concept representation, we empirically demonstrate that, without several orders of magnitude more cost for the crossmodal pretraining stage, our models achieve competitive or even superior performance on downstream tasks such as video question answering, video captioning, text-video retrieval, and audio-video dialogue. We also examine possible limitations of concept representations, such as when the text quality of a dataset is poor. We believe this work shows a potential path towards upgradable multimodal intelligence, whose components can be easily updated with new models or new modalities of data.
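A simplified sketch of the concept-based representation described above, under the assumption that frame and concept-word embeddings from some contrastive vision-language model are already available; the vocabulary, dimensions, and prompt format are illustrative, not the thesis's actual pipeline.

```python
# Sketch: map non-textual inputs (video frames) to textual concepts by
# nearest-neighbour search in a shared vision-language embedding space,
# then fuse the concepts with the task text for a text-only model.
# The concept vocabulary and embeddings are hypothetical placeholders.
import numpy as np

concept_vocab = ["dog", "kitchen", "guitar", "running", "beach"]   # assumed vocabulary
concept_emb = np.random.randn(len(concept_vocab), 512)             # stand-in concept-word embeddings
frame_emb = np.random.randn(16, 512)                               # stand-in frame embeddings

def top_k_concepts(frame_emb, concept_emb, k=3):
    # Cosine similarity between every frame and every concept word.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    c = concept_emb / np.linalg.norm(concept_emb, axis=1, keepdims=True)
    sims = f @ c.T
    # Aggregate over frames (max-pool), then pick the k highest-scoring concepts.
    scores = sims.max(axis=0)
    return [concept_vocab[i] for i in np.argsort(-scores)[:k]]

concepts = top_k_concepts(frame_emb, concept_emb)
prompt = f"Video concepts: {', '.join(concepts)}. Question: what is happening in the video?"
# `prompt` would then be handed to a text model for answering or captioning.
```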
19

Image Captioning On General Data And Fashion Data: An Attribute-Image-Combined Attention-Based Network for Image Captioning on Multi-Object Images and Single-Object Images

Tu, Guoyun January 2020 (has links)
Image captioning is a crucial field spanning computer vision and natural language processing. It can be widely applied to high-volume web images, for example to convey image content to visually impaired users. Many methods have been adopted in this area, such as attention-based methods and semantic-concept-based models. These achieve excellent performance on general image datasets such as the MS COCO dataset; however, single-object images remain largely unexplored.
In this paper, we propose a new attribute-information-combined attention-based network (AIC-AB Net). At each time step, attribute information is added as a supplement to visual information. For sequential word generation, spatial attention determines specific regions of the image to pass to the decoder. The sentinel gate decides whether to attend to the image or to the visual sentinel (what the decoder already knows, including the attribute information). Text attribute information is fed in synchronously to help image recognition and reduce uncertainty.
We build a new fashion dataset consisting of fashion images to establish a benchmark for single-object images. This dataset consists of 144,422 images from 24,649 fashion products, with one description sentence per image. Our method is tested on the MS COCO dataset and the proposed Fashion dataset. The results show the superior performance of the proposed model on both multi-object and single-object images. Our AIC-AB Net outperforms the state-of-the-art Adaptive Attention Network by 0.017, 0.095, and 0.095 CIDEr on the COCO dataset, the Fashion dataset (Bestsellers), and the Fashion dataset (all vendors), respectively. The results also reveal the complementary roles of the attention architecture and the attribute information.
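The sentinel-gate mechanism described in this abstract follows the adaptive-attention idea of attending over image regions plus a visual sentinel; the sketch below is a generic single decoding step with assumed tensor shapes and layer sizes, not the AIC-AB Net implementation.

```python
# Sketch of one decoding step of adaptive attention with a visual sentinel:
# the model attends over spatial image regions plus a sentinel slot, and the
# weight on the sentinel slot acts as a gate between image evidence and what
# the decoder already knows. Shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttentionStep(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, att_dim=256):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, att_dim)      # project region features
        self.w_s = nn.Linear(hidden_dim, att_dim)    # project the visual sentinel
        self.w_h = nn.Linear(hidden_dim, att_dim)    # project the decoder state
        self.w_a = nn.Linear(att_dim, 1)             # attention scores

    def forward(self, regions, sentinel, h):
        # regions: (B, R, feat_dim), sentinel: (B, hidden_dim), h: (B, hidden_dim)
        cand = torch.cat([self.w_v(regions), self.w_s(sentinel).unsqueeze(1)], dim=1)
        scores = self.w_a(torch.tanh(cand + self.w_h(h).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)            # attention over R regions + 1 sentinel slot
        beta = alpha[:, -1:]                         # gate: weight placed on the sentinel
        ctx = (alpha[:, :-1].unsqueeze(-1) * regions).sum(dim=1)   # spatial context, scaled by (1 - beta)
        adaptive_ctx = ctx + beta * sentinel         # blend image evidence and sentinel knowledge
        return adaptive_ctx, alpha[:, :-1], beta

# Usage with random stand-in tensors (batch of 4, 7x7 grid of region features).
step = AdaptiveAttentionStep()
regions, sentinel, h = torch.randn(4, 49, 512), torch.randn(4, 512), torch.randn(4, 512)
ctx, spatial_alpha, beta = step(regions, sentinel, h)
```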
20

A Multitask Learning Encoder-N-Decoder Framework for Movie and Video Description

Nina, Oliver A. 11 October 2018 (has links)
No description available.
