11

Practicality in Generative Modeling & Synthetic Data

Daniel Antonio Cardona (19339264) 07 August 2024 (has links)
<p dir="ltr">As machine learning continues to grow and surprise us, its complexity grows as well. Indeed, many machine learning models have become black boxes. Yet, there is a prevailing need for practicality. This dissertation offers some practicality on generative modeling and synthetic data, a recently popular application of generative models. First, Lightweight Chained Universal Approximators (LiCUS) is proposed. Motivated by statistical sampling principles, LiCUS tackles a simplified generative task with its universal approximation property while having a minimal computational bottleneck. When compared to a generative adversarial network (GAN) and variational auto-encoder (VAE), LiCUS empirically yields synthetic data with greater utility for a classifier on the Modified National Institute of Standards and Technology (MNIST) dataset. Second, following on its potential for informative synthetic data, LiCUS undergoes an extensive synthetic data supplementation experiment. The experiment largely serves as an informative starting point for practical use of synthetic data via LiCUS. In addition, by proposing a gold standard of reserved data, the experimental results suggest that additional data collection may generally outperform models supplemented with synthetic data, at least when using LiCUS. Given that the experiment was conducted on two datasets, future research could involve further experimentation on a greater number and variety of datasets, such as images. Lastly, generative machine learning generally demands large datasets, which is not guaranteed in practice. To alleviate this demand, one could offer expert knowledge. This is demonstrated by applying an expert-informed Wasserstein GAN with gradient penalty (WGAN-GP) on network flow traffic from NASA's Operational Simulation for Small Satellites (NOS3). If one were to directly apply a WGAN-GP, it would fail to respect the physical limitations between satellite components and permissible communications amongst them. By arming a WGAN-GP with cyber-security software Argus, the informed WGAN-GP could produce permissible satellite network flows when given as little as 10,000 flows. In all, this dissertation illustrates how machine learning processes could be modified under a more practical lens and incorporate pre-existing statistical principles and expert knowledge. </p>
12

A Generalized Framework for Representing Complex Networks

Viplove Arora (8086250) 06 December 2019 (has links)
Complex systems are often characterized by a large collection of components interacting in nontrivial ways. Self-organization among these individual components often leads to the emergence of a macroscopic structure that is neither completely regular nor completely random. In order to understand what we observe at a macroscopic scale, conceptual, mathematical, and computational tools are required for modeling and analyzing these interactions. A principled approach to understanding these complex systems (and the processes that give rise to them) is to formulate generative models and infer their parameters from given data, which is typically stored in the form of networks (or graphs). The increasing availability of network data from a wide variety of sources, such as the Internet, online social networks, collaboration networks, and biological networks, has fueled the rapid development of network science.

A variety of generative models have been designed to synthesize networks having specific properties (such as power-law degree distributions, small-worldness, etc.), but the structural richness of real-world network data calls for researchers to posit new models that are capable of keeping pace with the empirical observations about the topological properties of real networks. The mechanistic approach to modeling networks aims to identify putative mechanisms that can explain the dependence, diversity, and heterogeneity in the interactions responsible for creating the topology of an observed network. A successful mechanistic model can highlight the principles by which a network is organized and potentially uncover the mechanisms by which it grows and develops. While it is difficult to intuit appropriate mechanisms for network formation, machine learning and evolutionary algorithms can be used to automatically infer appropriate network generation mechanisms from the observed network structure.

Building on these philosophical foundations and a series of (not new) observations based on first principles, we extrapolate an action-based framework that creates a compact probabilistic model for synthesizing real-world networks. Our action-based perspective assumes that the generative process is composed of two main components: (1) a set of actions that expresses link formation potential using different strategies capturing the collective behavior of nodes, and (2) an algorithmic environment that provides opportunities for nodes to create links. Optimization and machine learning methods are used to learn an appropriate low-dimensional action-based representation for an observed network in the form of a row-stochastic matrix, which can subsequently be used for simulating the system at various scales. We also show that, in addition to being practically relevant, the proposed model is relatively exchangeable up to relabeling of the node-types.

Such a model can facilitate handling many of the challenges of understanding real data, including accounting for noise and missing values, and connecting theory with data by providing interpretable results. To demonstrate the practicality of the action-based model, we utilize it within domain-specific contexts. We used the model as a centralized approach for designing resilient supply chain networks while incorporating appropriate constraints, a rare feature of most network models. Similarly, a new variant of the action-based model was used for understanding the relationship between the structural organization of human brains and the cognitive ability of subjects. Finally, our analysis of the ability of state-of-the-art network models to replicate the expected topological variations in network populations highlighted the need to rethink the way we evaluate the goodness-of-fit of new and existing network models, thus exposing significant gaps in the literature.
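The row-stochastic matrix described above lends itself to a very small simulation loop: each node mixes a handful of link-formation strategies according to its row of the matrix. The sketch below is purely illustrative of that idea; the action set and growth rule are hypothetical stand-ins, not the thesis's framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_attach(u, edges):
    # Link to a uniformly random earlier node
    return int(rng.integers(0, u))

def preferential_attach(u, edges):
    # Link proportionally to current degree among earlier nodes
    deg = np.zeros(u)
    for a, b in edges:
        for w in (a, b):
            if w < u:
                deg[w] += 1
    if deg.sum() == 0:
        return uniform_attach(u, edges)
    return int(rng.choice(u, p=deg / deg.sum()))

def grow_network(n_nodes, action_probs, actions):
    """Each arriving node samples one action from its row of the
    row-stochastic matrix and forms a link accordingly."""
    edges = set()
    for u in range(1, n_nodes):
        a = rng.choice(len(actions), p=action_probs[u])
        v = actions[a](u, edges)
        if v != u:
            edges.add((min(u, v), max(u, v)))
    return edges

actions = [uniform_attach, preferential_attach]
probs = np.tile([0.3, 0.7], (100, 1))  # every node: 30% uniform, 70% preferential
print(len(grow_network(100, probs, actions)))
```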
13

Evaluation of generative machine learning models : Judging the quality of generated data with the use of neural networks

Yousefzadegan Hedin, Sam January 2022 (has links)
Generative machine learning models are capable of generating remarkably realistic samples. Some models generate images that look entirely natural, and others generate text that reads as if a human wrote it. However, judging the quality of these models is a major challenge. Today, the most convincing method is to use humans to evaluate the quality of generated samples. However, humans are biased, costly, and inefficient. Therefore, there is a great need for automatic methods. MAUVE is a recent advancement in the evaluation of generative text models. It compares generated data with real data and returns a score that quantifies their similarity. This is accomplished with the help of a neural network, which provides the understanding of text required to evaluate its quality. MAUVE is motivated by its correspondence with human judgment, as shown in multiple experiments. This thesis contributes in two significant ways. First, we complement experiments and discussions made in the original paper. Importantly, we demonstrate that MAUVE sometimes fails to recognize quality differences between generative models. This failure is due to the choice of neural network. We then demonstrate that MAUVE can be used for more than just text evaluation. Specifically, we show that it can be applied to images. This is accomplished by using a neural network specialized in image recognition. However, the steps can be repeated for any data type, meaning that MAUVE can potentially become a more generalized measurement than suggested in the original paper. Our second contribution is an extension to MAUVE called Sequence-MAUVE (S-MAUVE). The score MAUVE produces can be seen as an average of the overall quality of generated text. However, some generative models initially produce excellent text but see drops in quality as the sequences grow longer. Therefore, a single score that represents entire sequences is likely to omit important details. Instead, S-MAUVE evaluates generated text at the smallest possible level. The result is a sequence of scores, giving users more detailed feedback about the behavior of a generative model.
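For orientation, MAUVE's reference implementation is distributed as the mauve-text package; a typical call might look like the sketch below. The keyword names (featurize_model_name, p_features/q_features) are recalled from the package's documentation and should be checked against the current release:

```python
# pip install mauve-text
from mauve import compute_mauve

# Text-vs-text comparison; the embedding model is exactly the knob the
# thesis probes when MAUVE fails to separate generative models.
out = compute_mauve(p_text=human_texts, q_text=generated_texts,
                    featurize_model_name="gpt2-large", device_id=0)
print(out.mauve)  # score in (0, 1]; higher means the distributions are closer

# The image extension amounts to passing precomputed embeddings from an
# image-recognition network instead of letting MAUVE featurize text:
out_img = compute_mauve(p_features=real_image_embeddings,
                        q_features=generated_image_embeddings)
```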
14

Self-supervised learning of predictive segmentation models from video

Luc, Pauline 25 June 2019 (has links)
Predictive models of the environment hold promise for allowing the transfer of recent reinforcement learning successes to many real-world contexts, by decreasing the number of interactions needed with the real world. Video prediction has been studied in recent years as a particular case of such predictive models, with broad applications in robotics and navigation systems. While RGB frames are easy to acquire and hold a lot of information, they are extremely challenging to predict, and cannot be directly interpreted by downstream applications. Here we introduce the novel tasks of predicting the semantic and instance segmentation of future frames. The abstract feature spaces we consider are better suited for recursive prediction and allow us to develop models which convincingly predict segmentations up to half a second into the future. Predictions are more easily interpretable by downstream algorithms and remain rich, spatially detailed, and easy to obtain, relying on state-of-the-art segmentation methods. We first focus on the task of semantic segmentation, for which we propose a discriminative approach based on adversarial training. Then, we introduce the novel task of predicting future semantic segmentation, and develop an autoregressive convolutional neural network to address it. Finally, we extend our method to the more challenging problem of predicting future instance segmentation, which additionally segments out individual objects. To deal with a varying number of output labels per image, we develop a predictive model in the space of high-level convolutional image features of the Mask R-CNN instance segmentation model. We are able to produce visually pleasing segmentations at a high resolution for complex scenes involving a large number of instances, with convincing accuracy up to half a second ahead.
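The feature-space, autoregressive prediction idea reads roughly as follows in schematic PyTorch; all module names (encoder, predictor, seg_head) are hypothetical placeholders rather than the thesis's architecture:

```python
import torch

@torch.no_grad()
def predict_future_segmentations(encoder, predictor, seg_head, frames, n_future):
    """Encode observed frames, roll the predictor forward on its own
    outputs, and decode each predicted feature map into a segmentation.
    Schematic only: assumes at least 4 context frames and fixed shapes."""
    feats = [encoder(f) for f in frames]
    outputs = []
    for _ in range(n_future):
        nxt = predictor(torch.cat(feats[-4:], dim=1))  # condition on last 4 steps
        feats.append(nxt)                              # feed the prediction back in
        outputs.append(seg_head(nxt).argmax(dim=1))    # per-pixel class labels
    return outputs
```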
15

Latent Space Manipulation of GANs for Seamless Image Compositing

Fruehstueck, Anna 04 1900 (has links)
Generative Adversarial Networks (GANs) are a very successful method for high-quality image synthesis and a powerful tool for generating realistic images by learning their visual properties from a dataset of exemplars. However, the controllability of the generator output still poses many challenges. We propose several methods for achieving larger and/or visually higher-quality GAN outputs by combining latent space manipulations with image compositing operations: (1) GANs are inherently suitable for small-scale texture synthesis due to the generator's capability to learn the image properties of a limited domain, such as the properties of a specific texture type at a desired level of detail. A rich variety of suitable texture tiles can be synthesized from the trained generator. Due to the convolutional nature of GANs, we can achieve large-scale texture synthesis by tiling intermediate latent blocks, allowing the generation of (almost) arbitrarily large texture images that are seamlessly merged. (2) We notice that generators trained on heterogeneous data perform worse than specialized GANs, and we demonstrate that we can optimize multiple independently trained generators in such a way that a specialized network can fill in high-quality details for specific image regions, or insets, of a lower-quality canvas generator. Multiple generators can collaborate to improve the visual output quality, and through careful optimization, seamless transitions between different generators can be achieved. (3) GANs can also be used to semantically edit facial images and videos, with novel 3D GANs even allowing for camera changes, enabling unseen views of the target. However, the GAN output must be merged with the surrounding image or video in a spatially and temporally consistent way, which we demonstrate in our method.
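Contribution (1), tiling intermediate latent blocks, can be sketched as below. The split of the generator into a head and a convolutional tail, and all shapes, are illustrative assumptions rather than the paper's exact procedure:

```python
import torch

@torch.no_grad()
def tile_texture(g_head, g_tail, n_rows, n_cols, z_dim=512):
    """Run the generator's early layers once per tile, concatenate the
    intermediate activations into one large grid, and let the fully
    convolutional tail render the grid as a single seamless texture."""
    rows = []
    for _ in range(n_rows):
        row = [g_head(torch.randn(1, z_dim)) for _ in range(n_cols)]
        rows.append(torch.cat(row, dim=3))   # concatenate along width
    grid = torch.cat(rows, dim=2)            # concatenate along height
    return g_tail(grid)                      # (almost) arbitrarily large output
```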
16

Generative models : a critical review

Lamb, Alexander 07 1900 (has links)
No description available.
17

Locality and compositionality in representation learning for complex visual tasks

Sylvain, Tristan 03 1900 (has links)
The use of deep neural architectures coupled with specific innovations such as adversarial methods, pre-training on large datasets, and mutual information estimation has in recent years allowed rapid progress in many complex vision tasks such as zero-shot learning, scene generation, and multi-modal classification. Despite such progress, it is still not clear whether current representation learning methods will be enough to attain human-level performance on arbitrary visual tasks, and if not, what direction future research should take. In this thesis, we focus on two aspects of representations that seem necessary to achieve good downstream performance in representation learning: locality and compositionality. Locality can be understood as a representation's ability to retain local information. This is relevant in many cases and specifically benefits computer vision, where natural images inherently feature local information, e.g. relevant patches of an image or multiple objects present in a scene. On the other hand, a compositional representation can be understood as one that arises from a combination of simpler parts. Convolutional neural networks are inherently compositional, and many complex images can be seen as compositions of relevant sub-components: individual objects and attributes in a scene, and semantic attributes in zero-shot learning, are two examples. We believe both properties hold the key to designing better representation learning methods. In this thesis, we present three articles dealing with locality and/or compositionality, and their application to representation learning for complex visual tasks. In the first article, we introduce ways of measuring locality and compositionality for image representations, and demonstrate that local and compositional representations perform better at zero-shot learning. We also use these two notions as the basis for designing class-matching deep info-max, a novel representation learning algorithm that achieves state-of-the-art performance in our proposed "zero-shot from scratch" setting, a harder zero-shot setting where external information, e.g. pre-training on other image datasets, is not allowed. In the second article, we show that by encouraging a generator to retain local object-level information, using a scene-graph similarity module, we can improve scene generation performance. This model also showcases the importance of compositionality, as many components operate individually on each object present. To fully demonstrate the reach of our approach, we perform a detailed analysis and propose a new framework to evaluate scene generation models. Finally, in the third article, we show that encouraging high mutual information between local and global multi-modal representations of 2D and 3D medical images can lead to improvements in image classification and segmentation. This general framework can be applied to a wide variety of settings and demonstrates the benefits not only of locality but also of compositionality, as multi-modal representations are combined to obtain a more general one.
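The local-global mutual information objective in the third article is commonly estimated with an InfoNCE-style bound; the sketch below is a generic version of such an estimator, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def local_global_infonce(local_feats, global_feats, temperature=0.1):
    """InfoNCE lower bound on MI between local patches (B, N, D) and
    global summaries (B, D); other images in the batch act as negatives."""
    B, N, D = local_feats.shape
    l = F.normalize(local_feats, dim=-1).reshape(B * N, D)
    g = F.normalize(global_feats, dim=-1)
    logits = l @ g.t() / temperature                 # (B*N, B) similarity scores
    targets = torch.arange(B).repeat_interleave(N)   # each patch matches its image
    return F.cross_entropy(logits, targets)
```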
18

Integrating Trait and Neurocognitive Mechanisms of Externalizing Psychopathology: A Joint Modeling Framework for Measuring Impulsive Behavior

Haines, Nathaniel January 2021 (has links)
No description available.
19

Synthesis of Tabular Financial Data using Generative Adversarial Networks

Karlsson, Anton, Sjöberg, Torbjörn January 2020 (has links)
Digitalization has led to vast amounts of available customer data and possibilities for data-driven innovation. However, the data needs to be handled carefully to protect the privacy of the customers. Generative Adversarial Networks (GANs) are a promising recent development in generative modeling. They can be used to create synthetic data which facilitate analysis while ensuring that customer privacy is maintained. Prior research on GANs has shown impressive results on image data. In this thesis, we investigate the viability of using GANs within the financial industry. We examine two state-of-the-art GAN models for synthesizing tabular data, TGAN and CTGAN, along with a simpler GAN model that we call WGAN. A comprehensive evaluation framework is developed to facilitate comparison of the synthetic datasets. The results indicate that GANs are able to generate quality synthetic datasets that preserve the statistical properties of the underlying data and enable a viable and reproducible subsequent analysis. It was, however, found that all of the investigated models had problems with reproducing numerical data.
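Two inexpensive checks of the kind such an evaluation framework typically includes, per-column Kolmogorov-Smirnov distances and the gap between correlation matrices, can be sketched as follows (illustrative only; the thesis's framework is broader):

```python
import numpy as np
from scipy import stats

def marginal_fidelity(real, synth):
    """Per-column KS statistic (marginal match) and mean absolute gap
    between the real and synthetic correlation matrices (dependence match)."""
    ks = [stats.ks_2samp(real[:, j], synth[:, j]).statistic
          for j in range(real.shape[1])]
    corr_gap = np.abs(np.corrcoef(real, rowvar=False)
                      - np.corrcoef(synth, rowvar=False)).mean()
    return float(np.mean(ks)), float(corr_gap)

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 5))
synth = real + rng.normal(scale=0.1, size=real.shape)  # stand-in "GAN output"
print(marginal_fidelity(real, synth))
```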
20

Conditional generative modeling for images, 3D animations, and video

Voleti, Vikram 07 1900 (has links)
Generative modeling for computer vision has shown immense progress in the last few years, revolutionizing the way we perceive, understand, and manipulate visual data. This rapidly evolving field has witnessed advancements in image generation, 3D animation, and video prediction that unlock diverse applications across multiple fields including entertainment, design, healthcare, and education. As the demand for sophisticated computer vision systems continues to grow, this dissertation attempts to drive innovation in the field by exploring novel formulations of conditional generative models, and innovative applications in images, 3D animations, and video. Our research focuses on architectures that offer reversible transformations of noise and visual data, and on the application of encoder-decoder architectures for generative tasks and 3D content manipulation. In all instances, we incorporate conditional information to enhance the synthesis of visual data, improving the efficiency of the generation process as well as the generated content. Prior successful generative techniques which are reversible between noise and data include normalizing flows and denoising diffusion models. The continuous variant of normalizing flows is powered by Neural Ordinary Differential Equations (Neural ODEs) and has shown some success in modeling the real image distribution. However, it often involves a huge number of parameters and a high training time. Denoising diffusion models have recently gained huge popularity for their generalization capabilities, especially in text-to-image applications. In this dissertation, we introduce the use of Neural ODEs to model video dynamics using an encoder-decoder architecture, demonstrating their ability to predict future video frames despite being trained solely to reconstruct current frames. In our next contribution, we propose a conditional variant of continuous normalizing flows that enables higher-resolution image generation based on lower-resolution input. This allows us to achieve image quality comparable to regular normalizing flows, while significantly reducing the number of parameters and the training time. Our next contribution focuses on a flexible encoder-decoder architecture for accurate estimation and editing of full 3D human pose. We present a comprehensive pipeline that takes human images as input, automatically aligns a user-specified 3D human/non-human character with the pose of the human, and facilitates pose editing based on partial input information. We then proceed to use denoising diffusion models for image and video generation. Regular diffusion models use a Gaussian process to add noise to clean images. In our next contribution, we derive the relevant mathematical details for denoising diffusion models that use non-isotropic Gaussian processes, present non-isotropic noise, and show that the quality of the generated images is comparable with the original formulation. In our final contribution, we devise a novel framework building on denoising diffusion models that is capable of solving all three video tasks of prediction, generation, and interpolation. We perform ablation studies using this framework and show state-of-the-art results on multiple datasets. Our contributions have been published as articles at peer-reviewed venues. Overall, our research aims to make a meaningful contribution to the pursuit of more efficient and flexible generative models, with the potential to shape the future of computer vision.
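The non-isotropic forward process mentioned above has a simple closed form when the covariance is diagonal; the sketch below illustrates the formulation (sigma_diag is a hypothetical per-dimension noise scale, and all-ones recovers the standard isotropic DDPM):

```python
import torch

def noisy_sample(x0, alpha_bar_t, sigma_diag):
    """Closed-form forward diffusion q(x_t | x_0) with diagonal covariance:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * sigma_diag * eps."""
    eps = torch.randn_like(x0)
    return (alpha_bar_t ** 0.5) * x0 + ((1.0 - alpha_bar_t) ** 0.5) * sigma_diag * eps
```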
