1. Multimodal Representation Learning for Visual Reasoning and Text-to-Image Translation. January 2018.
Multimodal representation learning is a multi-disciplinary research field that aims to integrate information from multiple communicative modalities in a meaningful manner to help solve some downstream task. These modalities can be visual, acoustic, linguistic, haptic, etc. What counts as a 'meaningful integration of information from different modalities' remains modality- and task-dependent. The downstream task can range from understanding one modality in the presence of information from other modalities to translating input from one modality to another. This thesis investigates the utility of multimodal representation learning for both: understanding one modality given corresponding information in other modalities, namely image understanding for visual reasoning, and translating from one modality to another, specifically text-to-image translation.
Visual reasoning has been an active area of research in computer vision. It encompasses advanced image processing and artificial intelligence techniques to locate, characterize, and recognize objects, regions, and their attributes in an image in order to comprehend the image itself. One way of building a visual reasoning system is to ask it questions about the image that require attribute identification, counting, comparison, multi-step attention, and reasoning. An intelligent system is considered to have a proper grasp of the image if it can answer such questions correctly and provide valid reasoning for its answers. This work investigates how such a system can be built by learning a multimodal representation between the image and the questions, and demonstrates how background knowledge, specifically scene-graph information, can be incorporated into existing image understanding models when available.
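As a rough illustration of what learning a joint representation between an image and a question can look like in practice (a generic sketch, not the architecture used in the thesis; all module names and layer sizes below are assumptions), a minimal PyTorch example:

```python
# Minimal sketch of a joint image-question representation for visual
# question answering. Layer sizes and names are illustrative only.
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Question branch: embed tokens, encode with an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Image branch: project pre-extracted CNN features (e.g. 2048-d
        # pooled ResNet features) into the same space as the question code.
        self.img_proj = nn.Linear(2048, hidden_dim)
        # Fused representation -> answer scores over a fixed vocabulary.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, img_feats, question_tokens):
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                  # (batch, hidden_dim)
        v = torch.relu(self.img_proj(img_feats))   # (batch, hidden_dim)
        joint = q * v                              # element-wise fusion
        return self.classifier(joint)

# Usage with random stand-in data:
model = SimpleVQA(vocab_size=10000, num_answers=1000)
scores = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
print(scores.shape)  # torch.Size([4, 1000])
```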
Multimodal learning provides an intuitive way of learning a joint representation between different modalities. Such a joint representation can be used to translate from one modality to the other; it also opens the way to a shared representation between these varied modalities and to deciding what that shared representation should capture. Using the surrogate task of text-to-image translation, this work investigates neural-network architectures for learning a shared representation between these two modalities, and proposes that such a shared representation can capture parts of the different modalities that are equivalent in some sense. Specifically, given an image and a semantic description of certain objects present in the image, a shared representation between the text and image modalities is demonstrated that captures the parts of the image mentioned in the text. This capability is showcased on a publicly available dataset. (Masters Thesis, Computer Engineering, 2018.)
2. Towards Affective Vision and Language. Haydarov, Kilichbek. 30 November 2021.
Developing intelligent systems that can recognize and express human affect is essential to bridging the gap between human and artificial intelligence. This thesis explores the creative and emotional frontiers of artificial intelligence. Specifically, we investigate the relation between the affective impact of visual stimuli and natural language by collecting and analyzing a new dataset called ArtEmis. Capitalizing on this dataset, we demonstrate affective AI models that can talk emotionally about artwork and generate artwork from affective descriptions. For the text-to-image generation task, we present HyperCGAN: a conceptually simple and general approach for text-to-image synthesis that uses hypernetworks to condition a GAN model on text. In our setting, the generator and discriminator weights are controlled by their corresponding hypernetworks, which modulate the weight parameters based on the provided text query. We explore different mechanisms for modulating the layers depending on the underlying architecture of the target network and the structure of the conditioning variable.
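To illustrate the general hypernetwork idea described here, the toy PyTorch sketch below maps a text embedding to the weight and bias of a single linear layer, so the layer's behaviour changes with the text query. It is not the HyperCGAN implementation; all shapes and names are assumptions.

```python
# Toy weight modulation by a hypernetwork: a small MLP maps a text embedding
# to the parameters of one linear layer of the target network.
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    def __init__(self, text_dim, in_features, out_features):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # Hypernetwork: text embedding -> flattened weight matrix + bias.
        self.hyper = nn.Linear(text_dim, in_features * out_features + out_features)

    def forward(self, x, text_emb):
        params = self.hyper(text_emb)                           # (batch, W + b)
        w = params[:, : self.in_features * self.out_features]
        b = params[:, self.in_features * self.out_features :]
        w = w.view(-1, self.out_features, self.in_features)
        # Per-sample linear layer whose weights depend on the text query.
        return torch.bmm(x.unsqueeze(1), w.transpose(1, 2)).squeeze(1) + b

layer = HyperLinear(text_dim=256, in_features=128, out_features=64)
out = layer(torch.randn(8, 128), torch.randn(8, 256))
print(out.shape)  # torch.Size([8, 64])
```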
3. Text to Image Synthesis via Mask Anchor Points and Aesthetic Assessment. Baraheem, Samah Saeed. 15 June 2020.
No description available.
4. Case Study: Can Midjourney produce visual character design ideas for Dota 2 that meet the game's art guidelines? Liu, Dong. January 2023.
This case study investigates whether the AI text-to-image generator Midjourney can generate visual game character idea images for Dota 2 that meet the game's art guidelines. The author defines "visual game character ideas" as the idea images produced at the early stage of the visual character design process to help artists find inspiration. To this end, an experiment was designed and conducted in which the author developed backgrounds for three new Dota 2 heroes and generated 32 images per hero with the Midjourney bot. These 96 images were evaluated to examine Midjourney's performance on seven aspects: accurate content, readability and identifiability, value gradient, value patterning, the number of colors, areas of rest and detail, and directionality. A "YES" was given for each criterion an image met. The value of this case study is to present the strengths and weaknesses of text-to-image generators for visual character design ideas, which can show game artists when and how to use them in the visual character design process. The results suggest that Midjourney can generate visual character design ideas for Dota 2 only unreliably, and that this instability is mainly caused by one identified flaw: content accuracy. Furthermore, it performed better on non-color-related aspects, while its performance on color-related items was significantly worse.
5. Implementation of AI tools in 3D game art. Diamond, Gregory Frederic; Lindberg, Alexander. January 2023.
AI in art saw a huge spike in popularity with text-to-image models like Midjourney and Stable Diffusion. These models aid in the creation of 2D art and can at times save massive amounts of time. The creation of 3D assets is an incredibly time-consuming task, but the field currently lacks research pertaining to artificial intelligence. The goal of this study was to produce an AI-aided workflow to be compared with the standard workflow of 3D art students. Participants were given one hour per workflow to produce a game-ready sci-fi chair asset: one with their standard workflow and one with the study's AI workflow, in which AI tools supplemented or replaced parts of their regular process. They began with concepting and research, moved on to modeling, sorted out the model's UVs, and finally textured the asset. Data from semi-structured post-experiment interviews were analyzed with thematic analysis to produce a vivid picture of the participants' thoughts on the experience. The tools proved lackluster in both quality and user experience. The tool most likely to see future use was a text-to-image tool for concepting; however, almost all of the tools, and the ideas behind them, showed great potential if developed further. The concept of AI in art was met with mixed emotions: excitement over the improvements it might provide, and some fear of being replaced. Considering how fast AI has developed in recent years, there is no doubt that further research on the topic is important. Even as the study was being conducted, new tools were being developed and released that could have found their way into the study or could prove useful for the next one.
6. Adversarial approaches to remote sensing image analysis. Bejiga, Mesay Belete. 17 April 2020.
Recent advances in generative modeling, in particular the unsupervised learning of data distributions, are attributed to the invention of models with new learning algorithms. Among the methods proposed, generative adversarial networks (GANs) have been shown to be the most efficient approaches to estimating data distributions. The core idea of GANs is the adversarial training of two deep neural networks, called the generator and the discriminator, to learn an implicit approximation of the true data distribution. The distribution is approximated through the weights of the generator network, and interaction with the distribution happens through sampling. GANs have been found useful in applications such as image-to-image translation, in-painting, and text-to-image synthesis. In this thesis, we propose to capitalize on the power of GANs for different remote sensing problems.
The first problem is a new research track for the remote sensing community that aims to generate remote sensing images from text descriptions. More specifically, we focus on exploiting ancient text descriptions of geographical areas, inherited from previous civilizations, and converting them into the equivalent remote sensing images. The proposed method is composed of a text encoder and an image synthesis module. The text encoder is tasked with converting a text description into a vector. To this end, we explore two encoding schemes: a multilabel encoder and a doc2vec encoder. The multilabel encoder takes into account the presence or absence of objects in the encoding process, whereas the doc2vec method encodes additional information available in the text. The encoded vectors are then used as conditioning information for a GAN and guide the synthesis process. To evaluate the efficacy of the proposed method, we collected satellite images and ancient text descriptions for training. The qualitative and quantitative results obtained suggest that the doc2vec encoder-based model yields better images in terms of semantic agreement with the input description. In addition, we present open research directions that we believe are important to further advance this new research area.
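As a sketch of the doc2vec encoding step under assumed hyperparameters (not the configuration used in the thesis), the snippet below trains a gensim Doc2Vec model on a toy corpus of descriptions and infers a fixed-length vector that could then serve as the conditioning input to the GAN:

```python
# Doc2Vec encoding of text descriptions into fixed-length vectors
# (assumed hyperparameters; the corpus here is a toy placeholder).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

descriptions = [
    "a walled city on a hill surrounded by olive groves",
    "a river delta with scattered fishing villages",
]
corpus = [
    TaggedDocument(words=text.split(), tags=[i])
    for i, text in enumerate(descriptions)
]
encoder = Doc2Vec(corpus, vector_size=128, window=5, min_count=1, epochs=50)

# Encode an unseen description into a 128-d conditioning vector for the GAN.
cond_vector = encoder.infer_vector("a harbour town with a stone bridge".split())
print(cond_vector.shape)  # (128,)
```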
The second problem we address is semi-supervised domain adaptation. The goal of domain adaptation is to learn a generic classifier for multiple related problems, thereby reducing the cost of labeling. To that end, we propose two methods. The first uses GANs in the context of image-to-image translation to adapt source-domain images into target-domain images and trains a classifier on the adapted images. We evaluated this method on two remote sensing datasets. Though we have not explored this avenue extensively due to computational challenges, the results obtained show that the method is promising and worth exploring further. The second domain adaptation strategy borrows the adversarial property of GANs to learn a new representation space in which the domain discrepancy is negligible and the new features are discriminative. The method is composed of a feature extractor, a class predictor, and a domain classifier. Contrary to traditional methods that perform representation and classifier learning in separate stages, this method combines both into a single stage, thereby learning a representation of the input data that is domain-invariant and discriminative. After training, the classifier is used to predict both source- and target-domain labels. We apply this method to large-scale land cover classification and cross-sensor hyperspectral classification problems. Experimental results show that the proposed method provides a performance gain of up to 40%, indicating its efficacy.
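The single-stage strategy described here follows the general pattern of domain-adversarial training; a minimal PyTorch sketch with a gradient-reversal layer is shown below. Layer sizes and data are placeholders, not the thesis implementation.

```python
# Minimal sketch of single-stage domain-adversarial training: a gradient-
# reversal layer lets the feature extractor be trained to fool the domain
# classifier while the class predictor stays discriminative.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # reverse gradient for features

feature_extractor = nn.Sequential(nn.Linear(100, 64), nn.ReLU())
class_predictor = nn.Linear(64, 10)    # e.g. land-cover classes
domain_classifier = nn.Linear(64, 2)   # source vs. target domain

# Toy batch; in practice the class loss uses labeled source samples only.
x, y, d = torch.randn(32, 100), torch.randint(0, 10, (32,)), torch.randint(0, 2, (32,))
feats = feature_extractor(x)
class_loss = nn.functional.cross_entropy(class_predictor(feats), y)
domain_loss = nn.functional.cross_entropy(
    domain_classifier(GradReverse.apply(feats, 1.0)), d
)
(class_loss + domain_loss).backward()  # one backward pass trains all three blocks
```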
7. Are AI-Photographers Ready for Hire? Investigating the possibilities of AI generated images in journalism. Breuer, Andrea; Jonsson, Isac. January 2023.
In today's information era, many news outlets are competing for attention. One way to cut through the noise is to use images. Obtaining images can be both time-consuming and expensive for smaller news agencies. In collaboration with the Swedish news agency Newsworthy, we investigate the possibilities of using AI-generated images in a journalistic context. Using images generated with the text-to-image generation model Stable Diffusion, we aim to answer the research question: "How do the parameters in Stable Diffusion affect the applicability of the generated images for journalistic purposes?" A total of 511 images are generated with different Stable Diffusion parameter settings and rated on a scale of 1-5 by three journalists at Newsworthy. The data is analyzed using ordinal logistic regression. The results suggest that the optimal value for the Stable Diffusion parameter classifier-free guidance is around 10-12, the default 50 iterations are sufficient, and keywords do not significantly affect the image outcome. The parameter that has the single greatest effect on the outcome is the prompt. Thus, to generate photo-realistic images that can be used in a journalistic context, most thought and effort should be put towards formulating a suitable prompt.
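For readers who want to reproduce this kind of parameter sweep, the snippet below shows how the prompt, classifier-free guidance scale, and iteration count are typically set with the Hugging Face diffusers library; the model ID, prompt, and values are illustrative assumptions, not the study's exact setup.

```python
# Illustrative generation call with the diffusers library, exposing the
# parameters the study varies: prompt, guidance scale, and step count.
# Requires a GPU and: pip install diffusers transformers accelerate
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="press photo of a flooded street in a small Swedish town, overcast light",
    guidance_scale=11.0,     # the study found roughly 10-12 to work best
    num_inference_steps=50,  # the default 50 iterations were sufficient
).images[0]
image.save("generated.png")
```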
8. Photography with AI in the picture: An interview study with Swedish photographers on the photographic profession and artificial intelligence technology. Avelin Belin, Adam; Geidemark, Oscar. January 2024.
At the turn of the century, photographers began a transition from analogue to digital photography, a transition that took years to complete. The photographic landscape is now in a new period of development, with new AI programs based on machine learning. These programs include (I) editing tools that aim to streamline the photographic workflow or correct pictures with technical flaws, such as reducing grain from a high ISO or enlarging pictures, and (II) AI-generated images produced from prompts written by an AI artist, in so-called text-to-image programs such as DALL-E 3, Midjourney, Stable Diffusion, and Firefly 2. In this study we conducted seven semi-structured interviews with Swedish photographers, used a postphenomenological framework based on Don Ihde's philosophy, and analyzed the interview material with narrative analysis. The results showed that photographers who worked in advertising or with organizations and clients had a higher tolerance for image manipulation with (II); these photographers valued efficiency in their workflow and saw a greater need to adapt to new technology, while nature photographers valued authenticity and used (I) sparingly. Another result of the study is that Swedish photographers do not consider (II) to be photography and do not think AI-generated images should be allowed to compete in photo competitions, except in categories restricted to AI-generated images.
9. Quantitative and Qualitative Analysis of Text-to-Image models. Masrourisaadat, Nila. 30 August 2023.
The field of image synthesis has seen significant progress recently, including great strides with generative models like Generative Adversarial Networks (GANs), Diffusion Models, and Transformers. These models have shown they can create high-quality images from a variety of text prompts. However, a comprehensive analysis that examines both their performance and possible biases is often missing from existing research.
In this thesis, I undertake a thorough examination of several leading text-to-image models, namely Stable Diffusion, DALL-E Mini, Lafite, and Ernie-ViLG. I assess their performance in generating accurate images of human faces, groups, and specified numbers of objects, using both Frechet Inception Distance (FID) scores and R-precision as my evaluation metrics. Moreover, I uncover inherent gender or social biases these models may possess.
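As a sketch of how the FID part of such an evaluation can be set up with torchmetrics (not necessarily the implementation used in the thesis; the toy batches below are far smaller than a real evaluation would use):

```python
# Compute FID between real and generated images with torchmetrics.
# Images are uint8 tensors of shape (N, 3, H, W).
# Requires: pip install torchmetrics torch-fidelity
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Placeholder random batches; in practice use thousands of dataset images
# and model outputs.
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(float(fid.compute()))  # lower is better
```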
My research reveals a noticeable bias in these models, which show a tendency towards generating images of white males, thus under-representing minorities in their output of human faces. This finding contributes to the broader dialogue on ethics in AI and sets the stage for further research aimed at developing more equitable AI systems.
Furthermore, based on the metrics I used for evaluation, the Stable Diffusion model outperforms the others in generating images from text prompts. This information could be particularly useful for researchers and practitioners trying to choose the most effective model for their future projects.
To facilitate further research in this field, I have made my findings, the related data, and the source code publicly available.

Master of Science (general audience abstract): In my research, I explored how cutting-edge computer models, namely Stable Diffusion, DALL-E Mini, Lafite, and Ernie-ViLG, can create images from text descriptions, a process that holds exciting possibilities for the future. However, these technologies aren't without their challenges. An important finding from my study is that these models exhibit bias, e.g., they often generate images of white males more than they do of other races and genders. This suggests they're not representing our diverse society fairly. Among these models, Stable Diffusion outperforms the others at creating images from text prompts, which is valuable information for anyone choosing a model for their projects. To help others learn from my work and build upon it, I've made all my data, findings, and the code I used in this study publicly available. By sharing this work, I hope to contribute to improving this technology, making it even better and fairer for everyone in the future.
10. Assisted Prompt Engineering: Making Text-to-Image Models Available Through Intuitive Prompt Applications. Björnler, Zimone. January 2024.
This thesis explores the application of prompt engineering combined with human-AI interaction (HAII) to make text-to-image (TTI) models more accessible and intuitive for non-expert users. The research focuses on developing an application with an intuitive interface that enables users to generate images without extensive knowledge of prompt engineering. A pre-post study was conducted to evaluate the application, demonstrating significant improvements in user satisfaction and ease of use. The findings suggest that such tailored interfaces can make AI technologies more accessible, empowering users to engage creatively with minimal technical barriers. This study contributes to the fields of media technology and AI by showcasing how simplifying prompt engineering can enhance the accessibility of generative AI tools.