Global ETD Search

1	Im2Vid: Future Video Prediction for Static Image Action Recognition
AlBahar, Badour A Sh A. 20 June 2018
Static image action recognition aims at identifying the action performed in a given image. Most existing static image action recognition approaches use high-level cues present in the image, such as objects, object-human interaction, or human pose, to better capture the action performed. Unlike images, videos have temporal information that greatly improves action recognition by resolving potential ambiguity. We propose to leverage a large amount of readily available unlabeled videos to transfer the temporal information from the video domain to the static image domain and hence improve static image action recognition. Specifically, we propose a video prediction model to predict the future video of a static image and use the predicted future video to improve static image action recognition. Our experimental results on four datasets validate that the idea of transferring the temporal information from videos to static images is promising and can enhance static image action recognition performance. / Master of Science / Static image action recognition is the problem of identifying the action performed in a given image. Most existing approaches use the high-level cues present in the image, such as objects, object-human interaction, or human pose, to better capture the action performed. Unlike images, videos have temporal information that greatly improves action recognition. A static image of a man who is about to sit on a chair might be misread as an image of a man who is standing up from the chair. Because of the temporal information in videos, such ambiguity is not present.
To transfer the temporal information and action features from the video domain to the static image domain and hence improve static image action recognition, we propose a model that learns a mapping from a static image to its future video by looking at a large number of existing images and their future videos. We then use this model to predict the future video of a static image to improve its action recognition. Our experimental results on four datasets show that the idea of transferring the temporal information from videos to static images is promising and can enhance static image action recognition performance.
Human Action Recognition; Static Image Action Recognition; Video Action Recognition; Future Video Prediction
2	Diffusion Models for Video Prediction and Infilling : Training a conditional video diffusion model for arbitrary video completion tasks / Diffusionsmodeller för videoförutsägelse och ifyllnad : Träning av en villkorlig videodiffusionsmodell för slumpmässiga videokompletteringsuppgifter
Höppe, Tobias January 2022
Predicting and anticipating future outcomes, or reasoning about missing information in a sequence, is a key ability for agents to make intelligent decisions. This requires strong, temporally coherent generative capabilities. Diffusion models have recently shown great success in several generative tasks, but have not been extensively explored in the video domain. We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions and introduces a new conditioning technique during training. By varying the mask we condition on, the model is able to perform video prediction, infilling and upsampling. Since we do not use concatenation to condition on a mask, as is done in most conditionally trained diffusion models, we are able to use the same architecture as for unconditional training, which allows us to train the model in a conditional and unconditional fashion at the same time. We evaluated the model on two benchmark datasets for video prediction, on which we achieve state-of-the-art results, and on one for video generation. / Predicting future outcomes or reasoning about missing information in a sequence is an important prerequisite for agents to make intelligent decisions. This requires robust, temporally coherent generative capabilities. Diffusion models have recently shown great success in several generative tasks, but this potential has not been thoroughly explored for video. We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to video using 3D convolutions and introduces a new conditioning technique during training. By varying the mask we train with, the model can perform video prediction and video infilling. Since we do not use concatenation to condition on a mask, as is done in most conditionally trained diffusion models, we can use the same architecture as for unconditional training, which in turn lets us train the model conditionally and unconditionally at the same time. We evaluated the model on two benchmark datasets for video prediction and one for video generation; on the first of these we achieved the best quantitative results among contemporary methods.
Diffusion; Video prediction and infilling; Conditional generation; Diffusion; Videoförutsägelse och ifyllnad; Villkorad generation; Computer Sciences; Datavetenskap (datalogi); Computer Engineering; Datorteknik; Computer and Information Sciences; Data- och informationsvetenskap
3	Generating lightning bolt videos perceived as real in images using machine learning
Johansson, Henrik January 2022
Background. Weather and weather effects are important features when trying to immerse the viewer in a virtual world. Lightning and thunder are among the effects used to create rough weather. Realistic lightning, however, requires heavy computations involving physics, weather systems, and knowledge of the 3D world. Objectives. This thesis investigates the possibility of leveraging the predictive power of machine learning to generate animated lightning bolts inside images, and then investigates the possibility of generating the animated lightning bolts in real time. Methods. A new data-set for training will be created, consisting of videos of lightning bolts. Four image-to-video machine learning architectures will be investigated and two will be tested in an attempt to find a suitable model for generating the animated lightning bolts. The selected model will be used to generate videos for a questionnaire to collect qualitative data regarding the perceived realism of the animated lightning bolts. To determine whether it is possible to generate the animated lightning bolts in real time, the performance of the final model will be measured and compared to the real-time requirements of video games and video editing software. Results. For the training data-set, 106 curated and pre-processed videos were collected. By gathering four and testing two different machine learning architectures, it was found that the architecture based on stochastic image-to-video synthesis using conditional invertible neural networks was the best suited for generating animated lightning bolts. The questionnaire received a 77% positive rating for the generated lightning bolts; at a 1% significance level, a p-value of 0.00005 was obtained.
The performance of the selected machine learning model was measured to be inadequate for real-time applications like video games but more than sufficient for video editing software. Conclusions. The goal of generating animated lightning bolts perceived as real was achieved by creating a new data-set and investigating multiple machine learning architectures. Real-time generation is achievable for video editing applications, but real-time generation for video games is not yet possible unless the background is static. / Background. Weather and weather effects are important tools for creating a virtual world that immerses the user. Thunder and lightning can be used to create an experience of bad weather, but realistic representations require heavy mathematical computations and knowledge of the virtual world. Purpose. This thesis investigates the possibility of harnessing the power of machine learning to generate lightning bolts that look realistic. The thesis also investigates whether it is possible to generate lightning bolts in real time. Method. A new data-set consisting of videos of lightning bolts is created for training the models. Four image-to-video machine learning architectures will be investigated and two will be tested in an attempt to find a suitable model for generating the animated lightning bolts. Videos from the selected model will be used in a questionnaire. This questionnaire will be used to collect qualitative data on the perceived realism of the generated lightning bolts. To find out whether it is possible to generate the lightning bolts in real time, the performance of the final model is measured and compared with the real-time requirements of games and video editing tools. Results. The model's training data consists of 106 collected video clips that have been pre-processed. By testing two different machine learning architectures, it was found that the stochastic image-to-video architecture based on the cINN concept was the best suited for generating videos of lightning bolts. The questionnaire received a positive rating of 77% for the generated lightning bolts; at a 1% significance level, a p-value of 0.00005 was obtained. The performance of the selected machine learning model was measured to be inadequate for a real-time application such as digital games but sufficient for video editing. Conclusions. The goal of generating animated lightning bolts that are perceived as realistic was achieved by creating new training data and investigating several image-to-video machine learning architectures. Real-time generation is achievable for video editing applications, but for applications such as games the real-time requirements are not currently met unless the in-game camera is stationary.
Animated Lightning; Machine learning; Video prediction; image-to-video; Animerade blixtar; maskininlärning; video förutsägelse; bild-till-video; Computer Sciences; Datavetenskap (datalogi)
4	Advances in generative models for dynamic scenes
Castrejon Subira, Lluis Enric 05 1900
Neural networks are a type of machine learning (ML) model that solves complex artificial intelligence (AI) tasks without requiring handcrafted data representations. Although they have achieved impressive results in tasks requiring speech, image, and language processing, neural networks still have difficulty solving dynamic scene understanding tasks. Moreover, training neural networks generally requires large amounts of manually annotated data, which can be a long and costly process. This thesis consists of four articles proposing generative models for dynamic scenes. Generative modelling is a field of ML that studies how to learn the mechanisms by which data is produced. The main motivation behind generative models is to be able to learn useful data representations without using labels; this is a by-product of approximating the data generation process. Moreover, generative models are useful for a wide range of applications such as image super-resolution, speech synthesis, or text summarization. The first article focuses on improving the performance of previous variational autoencoders (VAEs) for video prediction. This is the task of generating the future frames of a dynamic scene given some earlier observations. VAEs are a family of latent variable models that can be used to sample data points. Compared with other generative models, VAEs are easy to train and tend to cover all modes of the data, but often produce lower-quality results. In video prediction, VAEs were the first models able to produce plausible future frames from a given context, a notable advance over earlier models because, for most dynamic scenes, the future is not a deterministic function of the past. However, the first VAEs for video prediction produced results with visible visual artifacts and did not work on complex realistic datasets. In this article, we identify some of the limiting factors of these models and propose, for each of them, a solution to mitigate its impact. With these modifications, we show that VAEs for video prediction can obtain results of markedly higher quality than previous baselines, and that they can be used to model autonomous driving scenes. In the second article, we propose a new cascaded model for video generation based on generative adversarial networks (GANs). After the success of VAEs for video prediction, GANs were shown to produce higher-quality video samples for class-conditional video generation. However, GANs require very large batch sizes as well as high-capacity models, which makes training GANs for video generation computationally expensive, in terms of both memory and computation time. We propose to split the generative process into a cascade of submodels, each solving a simpler problem. This division allows us to considerably reduce the computational cost while preserving sample quality, and we demonstrate that this model can scale to very large datasets as well as to high-resolution videos. In the third article, we design a model based on the principle that a scene is composed of different objects but that frame transitions (also called dynamic rules) are shared among objects. To implement this modelling assumption, we design a model that first extracts the different entities in a frame. The model then learns to update the object representation from one frame to the next by choosing among different possible transitions, all of which are shared among the different objects. We show that, when such a model is learned, the transition rules are semantically grounded and can be applied to objects not seen during training. Moreover, we can use this model to predict multimodal future observations of a dynamic scene by choosing different transitions. In the last article, we propose a generative model based on 3D rendering techniques that can generate scenes with multiple objects. We design an inference mechanism to learn representations that can be rendered with our model, and we simultaneously optimize this inference mechanism and the renderer. We show that this model has an interpretable representation in which semantic changes applied to the scene representation are rendered in the generated scene. Moreover, we show that, following the training process, our model learns to segment the objects in a scene without annotations and that the learned representation can be used to solve dynamic scene understanding tasks by inferring the representation of each observation. / Neural networks are a type of Machine Learning (ML) model that solves complex Artificial Intelligence (AI) tasks without requiring handcrafted data representations.
Although they have achieved impressive results in tasks requiring speech, image and language processing, neural networks still struggle to solve dynamic scene understanding tasks. Furthermore, training neural networks usually demands large amounts of manually annotated data, which can be an expensive and time-consuming process. This thesis comprises four articles proposing generative models for dynamic scenes. Generative modelling is an area of ML that investigates how to learn the mechanisms by which data is produced. The main motivation for generative models is to learn useful data representations without labels as a by-product of approximating the data generation process. Furthermore, generative models are useful for a wide range of applications such as image super-resolution, voice synthesis or text summarization. The first article focuses on improving the performance of previous Variational AutoEncoders (VAEs) for video prediction, which is the task of generating future frames of a dynamic scene given some previous observations. VAEs are a family of latent variable models that can be used to sample data points. Compared to other generative models, VAEs are easy to train and tend to cover all data modes, but often produce lower quality results. In video prediction, VAEs were the first models able to produce multiple plausible future outcomes given a context, marking an advance over previous models, since for most dynamic scenes the future is not a deterministic function of the past. However, the first VAEs for video prediction produced results with visible visual artifacts and could not operate on complex realistic datasets. In this article we identify some of the limiting factors for these models, and for each of them we propose a solution to ease its impact.
With our proposed modifications, we show that VAEs for video prediction can obtain significantly higher quality results than previous baselines and that they can be used to model autonomous driving scenes. In the second article we propose a new cascaded model for video generation based on Generative Adversarial Networks (GANs). After the success of VAEs in video prediction, GANs were shown to produce higher quality video samples for class-conditional video generation. However, GANs require very large batch sizes and high capacity models, which makes training GANs for video generation computationally expensive, both in terms of memory and training time. We propose to split the generative process into a cascade of submodels, each of them solving a smaller generative problem. This split allows us to significantly reduce the computational requirements while retaining sample quality, and we show that this model can scale to very large datasets and video resolutions. In the third article we design a model based on the premise that a scene is composed of different objects but that frame transitions (also known as dynamic rules) are shared among objects. To implement this modeling assumption we design a model that first extracts the different entities in a frame, and then learns to update the object representation from one frame to another by choosing among different possible transitions, all shared among objects. We show that, when learning such a model, the transition rules are semantically grounded and can be applied to objects not seen during training. Further, we can use this model for predicting multimodal future observations of a dynamic scene by choosing different transitions. In the last article we propose a generative model based on 3D rendering techniques that can generate scenes with multiple objects.
We design an inference mechanism to learn representations that can be rendered with our model and we simultaneously optimize this inference mechanism and the renderer. We show that this model has an interpretable representation in which semantic changes to the scene representation are shown in the output. Furthermore, we show that, as a by-product of the training process, our model learns to segment the objects in a scene without annotations and that the learned representation can be used to solve dynamic scene understanding tasks by inferring the representation of each observation.
Neural networks; Deep learning; Video generation; Generative models; Variational autoencoders; Generative adversarial networks; Video prediction; Neural radiance fields; Réseaux de neurones; Apprentissage profond; Auto-encodeurs variationnels; Réseaux antagonistes génératifs; Prédiction vidéo; Génération de vidéo; Champs de rayonnement neuronal
5	Human Body Scattering Effects at Millimeter Waves Frequencies for Future 5G Systems and Beyond
Romero Peña, Johan Samuel 13 January 2023
[ES] Future mobile communications are expected to undergo a technical revolution that goes beyond Gbps data rates and reduces data rate latencies to levels very close to one millisecond. New enabling technologies have been investigated to meet these demanding specifications, and the use of the millimeter-wave bands, where a great deal of spectrum is available, is one of them. Because of the numerous technical difficulties associated with using this frequency band, complicated channel models are needed to anticipate the characteristics of the radio channel and to accurately evaluate the performance of millimeter-wave cellular systems. In particular, the most accurate propagation models are those based on deterministic ray tracing techniques. But these techniques carry the stigma of being computationally demanding, which makes them difficult to use for characterizing the radio channel in complex and dynamic indoor scenarios. The complexity of characterizing these scenarios depends largely on the interaction of the human body with the radio environment, which at millimeter waves is usually destructive and highly unpredictable. On the other hand, in recent years the video game industry has developed powerful tools for hyper-realistic environments, where most of the advances in this emulation of reality have to do with the handling of light. As a result, the graphics engines of these platforms have become increasingly efficient at handling large volumes of information, which makes them ideal for emulating the behavior of radio wave propagation as well as for reconstructing a complex indoor scenario. For this reason, this Thesis takes advantage of the computational capacity of this type of tool to evaluate the millimeter-wave radio channel as efficiently as possible. This Thesis offers guidelines for optimizing millimeter-wave signal propagation in a dynamic and complex indoor environment, for which three main objectives are proposed. The first objective is to evaluate the scattering effects of the human body when it interacts with the propagation channel. Once these were evaluated, a simplified mathematical and geometrical model was proposed to calculate this effect reliably and quickly. Another objective was the design of a modular passive millimeter-wave reflector that optimizes coverage in indoor environments, avoiding human interference in the propagation. Finally, a real-time predictive beam steering system was designed to operate with the millimeter-wave radiation system, with the aim of avoiding the propagation losses caused by the human body in dynamic and complex indoor environments. / [CA] Future mobile communications are expected to undergo a technical revolution that goes beyond Gbps data rates and reduces data rate latencies to levels very close to one millisecond. New enabling technologies have been investigated to meet these demanding specifications, and the use of the millimeter-wave bands, where a great deal of spectrum is available, is one of them. Because of the numerous technical difficulties associated with using this frequency band, complicated channel models are needed to anticipate the characteristics of the radio channel and to accurately evaluate the performance of millimeter-wave cellular systems. In particular, the most accurate propagation models are those based on deterministic ray tracing techniques. But these techniques carry the stigma of being computationally demanding, which makes them difficult to use for characterizing the radio channel in complex and dynamic indoor scenarios. The complexity of characterizing these scenarios depends largely on the interaction of the human body with the radio environment, which at millimeter waves is usually destructive and highly unpredictable. On the other hand, in recent years the video game industry has developed powerful tools for hyper-realistic environments, where most of the advances in this emulation of reality have to do with the handling of light. As a result, the graphics engines of these platforms have become increasingly efficient at handling large volumes of information, which makes them ideal for emulating the behavior of radio wave propagation as well as for reconstructing a complex indoor scenario. For this reason, this Thesis takes advantage of the computational capacity of this type of tool to evaluate the millimeter-wave radio channel as efficiently as possible. This Thesis offers guidelines for optimizing millimeter-wave signal propagation in a dynamic and complex indoor environment, for which three main objectives are proposed. The first objective is to evaluate the scattering effects of the human body when it interacts with the propagation channel. Once these were evaluated, a simplified mathematical and geometrical model was proposed to calculate this effect reliably and quickly. Another objective was the design of a modular passive millimeter-wave reflector that optimizes coverage in indoor environments, avoiding human interference in the propagation so as to avoid additional propagation losses. Finally, a real-time predictive beam steering system was designed to operate with the millimeter-wave radiation system, with the aim of avoiding the propagation losses caused by the human body in dynamic and complex indoor environments. / [EN] Future mobile communications are expected to experience a technical revolution that goes beyond Gbps data rates and reduces data rate latencies to levels very close to a millisecond. New enabling technologies have been researched to achieve these demanding specifications. The utilization of mmWave bands, where a lot of spectrum is available, is one of them. Due to the numerous technical difficulties associated with using this frequency band, complicated channel models are necessary to anticipate the radio channel characteristics and to accurately evaluate the performance of cellular systems in mmWave. In particular, the most accurate propagation models are those based on deterministic ray tracing techniques. But these techniques have the stigma of being computationally intensive, which makes it difficult to use them to characterize the radio channel in complex and dynamic indoor scenarios. The complexity of characterizing these scenarios depends largely on the interaction of the human body with the radio environment, which at mmWaves is often destructive and highly unpredictable. On the other hand, in recent years, the video game industry has developed powerful tools for hyper-realistic environments, where most of the progress in this reality emulation has to do with the handling of light. As a result, the graphics engines of these platforms have become more and more efficient at handling large volumes of information, making them ideal for emulating radio wave propagation behavior as well as for reconstructing a complex indoor scenario.
Therefore, this Thesis takes advantage of the computational capacity of this type of tool to evaluate the mmWave radio channel in the most efficient way possible. This Thesis offers some guidelines to optimize mmWave signal propagation in a dynamic and complex indoor environment, for which three main objectives are proposed. The first objective has been to evaluate the scattering effects of the human body when it interacts with the propagation channel. Once these were evaluated, a simplified mathematical and geometrical model was proposed to calculate this effect reliably and quickly. Another objective has been the design of a modular passive mmWave reflector that optimizes coverage in indoor environments, avoiding human interference in the propagation and its harmful scattering effects. Finally, a real-time predictive beam steering system has been designed for the mmWave radiation system, in order to avoid propagation losses caused by the human body in dynamic and complex indoor environments. / Romero Peña, JS. (2022). Human Body Scattering Effects at Millimeter Waves Frequencies for Future 5G Systems and Beyond [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/191325
Pérdidas por difracción; Ondas milimétricas; Modelos de propagación; Reflexión; Difracción; Transmisión; Trazado de rayos; Seguimiento de vídeo; Formación del haz; Orientación del haz; Modelos de canal; Predicción de vídeo; Inteligencia artificial; Double knife edge diffraction; Human Blocking; Millimeter waves; Propagation models; Reflection; Diffraction; Transmission; Ray tracing; Video tracking; Beam forming; Beam steering; Channel models; Artificial intelligence; video prediction; Difracción de doble filo; TEORÍA DE LA SEÑAL Y COMUNICACIONES
