1

Video Retargeting using Vision Transformers : Utilizing deep learning for video aspect ratio change / Video Retargeting med hjälp av Vision Transformers : Användning av djupinlärning för ändring av videobildförhållanden

Laufer, Gil January 2022
The diversity of video material, where a video is shot and produced in a single aspect ratio, and the variety of devices whose screens use different aspect ratios make video retargeting a relevant topic. Video retargeting is the process of fitting a video filmed in one aspect ratio to a screen with another aspect ratio; the retargeted video should ideally preserve the important content and structure of the original video and be free of visual artifacts. "Important content" and "important structure" are vague, subjective notions, which makes the problem harder to solve. Video retargeting has challenged researchers in computer vision, computer graphics and human-computer interaction, and successful retargeting can improve the viewing experience and the content's aesthetic value. Retargeting is performed with four operators: cropping, scaling, seam carving and seam adding, and previous research has shown that one of the keys to successful retargeting is a suitable combination of these operators. This study uses a vision transformer, a deep learning model trained to discriminate between original and retargeted videos. By solving an optimization problem with beam search, the transformer helps choose the combination of operators that yields the best possible retargeted video.
The retargeted videos were evaluated in an A/B user test, where participants chose their preferred variant of each video shot: the transformer's beam-search output, or a version produced by a single retargeting operation. The model's and the users' preferences were compared to check whether the model can indeed make retargeting decisions that humans find appealing to watch. A significance test showed that no conclusion could be drawn, most likely because of insufficient test data. However, the study revealed patterns in the users' and the model's preferences that could be further fine-tuned or combined with other computer vision mechanisms to produce better retargeted videos.
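To make the operator-search idea concrete, below is a minimal sketch of beam search over sequences of retargeting operators. The helpers `score_fn` (standing in for a vision-transformer discriminator, with higher scores assumed better) and `apply_op`, as well as the step count and beam width, are illustrative assumptions rather than the thesis's actual implementation.

```python
OPERATORS = ["crop", "scale", "seam_carve", "seam_add"]

def beam_search_retarget(video, target_ratio, score_fn, apply_op,
                         steps=4, beam_width=3):
    """Expand operator sequences step by step, keeping the top-scoring beams."""
    # Each beam is (retargeted_clip, operator_sequence, score).
    beams = [(video, [], score_fn(video))]
    for _ in range(steps):
        candidates = []
        for clip, ops, _ in beams:
            for op in OPERATORS:
                # apply_op performs one incremental step of the chosen operator
                # toward the target aspect ratio (an assumed helper).
                new_clip = apply_op(clip, op, target_ratio)
                candidates.append((new_clip, ops + [op], score_fn(new_clip)))
        # Keep only the beam_width highest-scoring partial solutions.
        candidates.sort(key=lambda c: c[2], reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=lambda b: b[2])
```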
2

Computational video: post-processing methods for stabilization, retargeting and segmentation

Grundmann, Matthias 05 April 2013
In this thesis, we address a variety of challenges in the analysis and enhancement of computational video. We present novel post-processing methods to bridge the gap between professionally produced videos and the casually shot videos that dominate online sites. Our research presents solutions to three well-defined problems: (1) video stabilization and rolling shutter removal in casually shot, uncalibrated videos; (2) content-aware video retargeting; and (3) spatio-temporal video segmentation to enable efficient video annotation. We showcase several real-world applications building on these techniques.
We start by proposing a novel algorithm for video stabilization that generates stabilized videos by employing L1-optimal camera paths to remove undesirable motions. We compute camera paths that are optimally partitioned into constant, linear and parabolic segments, mimicking the camera motions employed by professional cinematographers. To achieve this, we propose a linear programming framework that minimizes the first, second and third derivatives of the resulting camera path. Our method allows for video stabilization beyond conventional filtering, which only suppresses high-frequency jitter. An additional challenge in videos shot on mobile phones is rolling shutter distortion: modern CMOS cameras capture each frame one scanline at a time, which results in non-rigid image distortions such as shear and wobble. We propose a solution based on a novel mixture model of homographies, parametrized by scanline blocks, to correct these rolling shutter distortions. Our method neither relies on a priori knowledge of the readout time nor requires prior camera calibration. Our novel video stabilization and calibration-free rolling shutter removal have been deployed on YouTube, where they have successfully stabilized millions of videos. We also discuss several extensions to the stabilization algorithm and present technical details behind the widely used YouTube Video Stabilizer.
We address the challenge of changing the aspect ratio of videos by proposing algorithms that retarget videos to fit the form factor of a given device without stretching or letter-boxing. Our approaches use all of the screen's pixels while striving to deliver as much of the original video content as possible. First, we introduce a new algorithm that uses discontinuous seam carving in both space and time for resizing videos. The algorithm relies on a novel appearance-based temporal coherence formulation that allows for frame-by-frame processing and results in temporally discontinuous seams, as opposed to geometrically smooth and continuous seams. Second, we present a technique that builds on the above-mentioned video stabilization approach: we effectively automate classical pan-and-scan techniques by smoothly guiding a virtual crop window via saliency constraints.
Finally, we introduce an efficient and scalable technique for spatio-temporal segmentation of long video sequences using a hierarchical graph-based algorithm. We begin by over-segmenting a volumetric video graph into space-time regions grouped by appearance. We then construct a "region graph" over the obtained segmentation and iteratively repeat this process over multiple levels to create a tree of spatio-temporal segmentations. This hierarchical approach generates high-quality segmentations and allows subsequent applications to choose from varying levels of granularity. We demonstrate the use of spatio-temporal segmentation as users interact with the video, enabling efficient annotation of objects within the video.
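The L1-optimal camera path idea above lends itself to a compact convex-optimization sketch. The following 1-D example, assuming the `cvxpy` package is available, minimizes the L1 norms of the first, second and third derivatives of the smoothed path while keeping it within a crop margin of the shaky input; the weights and margin are illustrative, and the actual method operates on 2-D (e.g. affine) camera paths with additional constraints.

```python
import numpy as np
import cvxpy as cp

def l1_smooth_path(raw_path, margin=20.0, w1=10.0, w2=1.0, w3=100.0):
    """Sketch of an L1-optimal smoothed camera path for a 1-D trajectory."""
    n = len(raw_path)
    p = cp.Variable(n)          # smoothed camera path
    d1 = cp.diff(p, 1)          # velocity     -> encourages constant segments
    d2 = cp.diff(p, 2)          # acceleration -> encourages linear segments
    d3 = cp.diff(p, 3)          # jerk         -> encourages parabolic segments
    objective = cp.Minimize(w1 * cp.norm1(d1) + w2 * cp.norm1(d2) + w3 * cp.norm1(d3))
    # The virtual crop window must stay close to the original path so that
    # no undefined (out-of-frame) pixels become visible.
    constraints = [cp.abs(p - raw_path) <= margin]
    cp.Problem(objective, constraints).solve()
    return p.value

# Example: smooth a synthetic, jittery horizontal camera trajectory.
raw = np.cumsum(np.random.randn(200)) * 3.0
smooth = l1_smooth_path(raw)
```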
3

Automatic rush generation with application to theatre performances / Cadrage et montage automatique de films de théâtre par analyse sémantique de vidéo

Gandhi, Vineet 18 December 2014
Professional-quality videos of live staged performances are created by recording them from several appropriate viewpoints, which are then edited together to portray an eloquent story able to draw the intended emotion from viewers. Creating such videos involves combining multiple high-quality cameras with skilled camera operators. This thesis aims to make even low-budget productions capable of producing pleasant, professional-quality videos without a fully and expensively equipped crew of camera operators. A single high-resolution static camera replaces the camera crew, and efficient camera movements are then simulated by virtually panning, tilting and zooming within the original recordings. We show that multiple virtual cameras can be simulated by choosing different trajectories of cropping windows inside the original recording. One of the key novelties of this work is an optimization framework for computing the virtual camera trajectories, using information extracted from the original video with computer vision techniques.
The actors present on stage are considered the most important elements of the scene. For the task of localizing and naming actors, we introduce generative models for learning view-independent, person- and costume-specific detectors from a set of labeled examples. We explain how to learn the models from a small number of labeled keyframes or video tracks, and how to detect novel appearances of the actors in a maximum likelihood framework. We demonstrate that such actor-specific models can accurately localize actors despite changes in viewpoint and occlusions, and significantly improve detection recall over generic detectors. The dissertation then presents an offline algorithm for tracking objects and actors in long video sequences using these actor-specific models. Detections are first performed independently to select candidate locations of the actor or object in each frame of the video. The candidate detections are then combined into smooth trajectories in an optimization step that minimizes a cost function accounting for false detections and occlusions.
Using the actor tracks, we propose a framework for automatically generating multiple clips suitable for video editing by simulating pan-tilt-zoom camera movements within the frame of a single static camera. Our method requires only minimal user input to define the subject of each sub-clip. The composition of each sub-clip is automatically computed in a novel L1-norm optimization framework. Our approach encodes several common cinematographic practices into a single convex cost-function minimization problem, resulting in aesthetically pleasing sub-clips that can easily be edited together using off-the-shelf multi-clip video editing software.
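As a rough illustration of the offline tracking step described above, the dynamic-programming sketch below links per-frame candidate detections into a smooth track, with a dummy "missed" state for occlusions. The cost terms (negative log confidence, a motion-smoothness penalty and a fixed miss penalty) are simplified assumptions and not the thesis's exact formulation.

```python
import numpy as np

def link_detections(frames, smooth_weight=0.05, miss_penalty=5.0):
    """frames: list of per-frame candidate detections, each a list of (x, y, confidence)."""
    def unary(det):
        # Confident detections are cheap; a missed frame pays a fixed penalty.
        return miss_penalty if det is None else -np.log(det[2] + 1e-6)

    def pairwise(a, b):
        # Penalize large jumps between consecutive positions; transitions
        # into or out of the missed state are free here for simplicity.
        if a is None or b is None:
            return 0.0
        return smooth_weight * np.hypot(a[0] - b[0], a[1] - b[1])

    states = [list(dets) + [None] for dets in frames]   # None == occluded/missed
    costs = [np.array([unary(s) for s in states[0]])]
    back = []
    for t in range(1, len(states)):
        prev, cur = states[t - 1], states[t]
        c = np.empty(len(cur))
        b = np.empty(len(cur), dtype=int)
        for j, s in enumerate(cur):
            trans = [costs[-1][i] + pairwise(p, s) for i, p in enumerate(prev)]
            b[j] = int(np.argmin(trans))
            c[j] = trans[b[j]] + unary(s)
        costs.append(c)
        back.append(b)
    # Backtrack the minimum-cost assignment (detection or miss) per frame.
    j = int(np.argmin(costs[-1]))
    track = [states[-1][j]]
    for t in range(len(back) - 1, -1, -1):
        j = back[t][j]
        track.append(states[t][j])
    return track[::-1]
```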
4

Algoritmy pro automatický ořez sférické fotografie a videa / Algorithms for Automatic Spherical Image and Video Cropping

Ivančo, Martin January 2020
The aim of this thesis is to provide a detailed overview of existing research in the field of spherical video. Specifically, the thesis focuses on the problem of generating video with a normal field of view from spherical video for viewing purposes, and it also provides implementations of some of the available methods. Three methods addressing this problem have been introduced so far, across four papers. All of them produced interesting results, and this thesis examines two of them in greater depth. The thesis also introduces a baseline method built on well-established automatic image cropping techniques. This baseline is used for comparison with the studied methods, highlighting both their improvements and their shortcomings. Based on a comparison of the methods in a user study, the thesis concludes that the best of the examined methods for this task is a modified variant of the method by Pavel et al. [14], introduced in this thesis.
5

Dynamic Headpose Classification and Video Retargeting with Human Attention

Anoop, K R January 2015
Over the years, extensive research has been devoted to the study of people's head pose due to its relevance in security, human-computer interaction and advertising, as well as in cognitive, neuro- and behavioural psychology. One of the main goals of this thesis is to estimate people's 3D head orientation as they move freely in naturalistic settings such as parties and supermarkets. Head pose classification from surveillance images acquired with distant, large field-of-view cameras is difficult because faces are captured at low resolution with a blurred appearance. Labelling sufficient training data for head pose estimation in such settings is also difficult due to the motion of targets and the large possible range of head orientations. Domain adaptation approaches are useful for transferring knowledge from the training (source) data to test (target) data with different attributes, minimizing target labelling effort in the process. This thesis examines the use of transfer learning for efficient multi-view head pose classification. The relationship between head pose and facial appearance is first learned from many labelled examples in the source data; domain adaptation techniques are then employed to transfer this knowledge to the target data. Three challenging situations are addressed: (I) the ranges of head poses in the source and target images differ; (II) the source images capture a stationary person while the target images capture a moving person whose facial appearance varies due to changing perspective and scale; and (III) a combination of (I) and (II). All proposed transfer learning methods are thoroughly tested and benchmarked on DPOSE, a newly compiled dataset for head pose classification. This thesis also proposes Covariance Profiles (CPs), a novel signature representation for describing sets of objects with covariance descriptors. A CP is well suited for representing a set of similarly related objects: CPs posit that the covariance matrices pertaining to a specific entity share the same eigen-structure. Such a representation is not only compact but also eliminates the need to store all the training data. Experiments on images as well as videos are presented, using CPs for applications such as object-track clustering and head pose estimation.
In the second part, human gaze is explored for interest point detection in video retargeting. Regions of a video stream that attract human interest contribute significantly to human understanding of the video, and predicting salient, informative Regions of Interest (ROIs) from a sequence of eye movements is a challenging problem. This thesis proposes an interactive human-in-the-loop framework that models eye movements and predicts visual saliency in yet-unseen frames. Eye-tracking data and video content are used to model visual attention in a manner that accounts for temporal discontinuities due to sudden eye movements, noise and behavioural artefacts. Gaze buffering is proposed for eye-gaze analysis, along with its fusion with content-based features; the method uses eye-gaze information together with bottom-up and top-down saliency to boost the importance of image pixels. Our robust visual saliency prediction is instantiated for content-aware video retargeting.
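As a simple illustration of combining buffered gaze with content saliency, the sketch below accumulates recent fixations into a soft gaze map and uses it to boost a content-based saliency map. The Gaussian spread, buffer contents and fusion weight are illustrative assumptions, not the thesis's exact model.

```python
import numpy as np

def gaze_map(fixations, shape, sigma=30.0):
    """Accumulate buffered (x, y) fixations into a soft attention map."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    m = np.zeros(shape)
    for x, y in fixations:
        m += np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
    return m / (m.max() + 1e-6)

def fuse_saliency(content_saliency, fixations, gaze_weight=0.6):
    """Boost content saliency where recent gaze samples landed."""
    g = gaze_map(fixations, content_saliency.shape)
    fused = (1 - gaze_weight) * content_saliency + gaze_weight * g * content_saliency
    return fused / (fused.max() + 1e-6)

# Example: fuse a dummy saliency map with two fixations near the frame centre.
saliency = np.random.rand(180, 320)
fused = fuse_saliency(saliency, fixations=[(160, 90), (170, 95)])
```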
