  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
81

Knowledge Distillation for Semantic Segmentation and Autonomous Driving: A study on the influence of hyperparameters, initialization of a student network and the distillation method on the semantic segmentation of urban scenes.

Sanchez Nieto, Juan January 2022
Reducing the size of a neural network while maintaining comparable performance is an important problem, since the resource constraints of small devices make it impossible to deploy large models in many real-life scenarios. A prominent example is autonomous driving, where computer vision tasks such as object detection and semantic segmentation must be performed in real time on mobile devices. In this thesis, knowledge distillation and spherical knowledge distillation are used to train a small model (PSPNet50) under the supervision of a large model (PSPNet101) to perform semantic segmentation of urban scenes. The importance of the distillation hyperparameters is studied first, namely the influence of the temperature and of the loss-function weights on the performance of the distilled model; this shows no decisive advantage over training the student on its own. Distillation is then performed with a pretrained student, which yields a good improvement in performance. Contrary to expectations, the pretrained student benefits from a high learning rate when training resumes under distillation, especially with spherical knowledge distillation, which gives better and more stable performance than regular knowledge distillation. These findings are validated by several experiments on the Cityscapes dataset. The best distilled model achieves 87.287% pixel accuracy and 42.0% mean Intersection-over-Union (mIoU) on the validation set, compared with 86.356% pixel accuracy and 39.6% mIoU for the baseline student. On the test set, the official evaluation obtained by submission to the Cityscapes website yields 42.213% mIoU for the distilled model and 41.085% for the baseline student.
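The abstract above names two distillation hyperparameters, the temperature and the weights of the loss function. As a rough illustration of how they typically interact, here is a minimal sketch of the standard (Hinton-style) distillation loss for a single pixel. It is an assumption that the thesis uses this exact form; the function names, the single weight `alpha` and the default values are hypothetical, and the spherical variant is not shown.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student).

    Hypothetical weighting scheme: alpha balances the hard-label term
    against the soft-target term; T^2 rescales the soft-target term.
    """
    # Hard-label cross-entropy, computed at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])
    # KL divergence between softened teacher and student distributions.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(q_teacher, q_student) if t > 0)
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl

# Toy usage: three classes, teacher and student disagree slightly.
loss = distillation_loss([1.0, 2.0, 3.0], [0.5, 2.5, 3.5], label=2)
```

A higher temperature flattens both distributions, so the student is pushed to match the teacher's relative ranking of the wrong classes as well as its top prediction. For segmentation the loss would be averaged over all pixels.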
82

Towards a Smart Food Diary : Evaluating semantic segmentation models on a newly annotated dataset: FoodSeg103

Reibel, Yann January 2024
Automatic food recognition is emerging as a tool for diet control: it relieves users of the burden of self-assessing their diet by offering a simple process that immediately detects the food items in a picture. Accurately segmenting the image into the correct food categories is crucial for an accurate calorie estimate. In this thesis, we use the PREVENT project as the background for building a model capable of segmenting food. We carry out the research on FoodSeg103, a newly annotated dataset that provides more realistic data for this study. Most work on FoodSeg103 focuses on Vision Transformer models, which are popular but computationally demanding. We instead chose DeepLabV3, a dilation-based semantic segmentation model, with the main objective of training it on the dataset and the additional hope of improving on the state-of-the-art results. We set up an iterative optimization process to maximize the results and attained 48.27% mIOU (also referred to as "mIOU all" in the thesis). We also observed a significant difference in average mIOU between the random search experiments and the Bayesian optimization experiments. This study does not surpass the state-of-the-art performance but comes within about 1% of it, with BEIT v2 Large remaining in first position at 49.4% mIOU.
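Several abstracts in this listing report pixel accuracy and mean Intersection-over-Union (mIoU). As a reminder of what these figures measure, the sketch below computes both from a confusion matrix over flattened label lists. The function names are hypothetical, and benchmark protocols often differ in details such as ignored labels or how absent classes are handled, so this is only an illustration of the definition.

```python
def confusion_matrix(truth, pred, num_classes):
    """cm[t][p] counts pixels of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm

def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class matches the true class."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def mean_iou(cm):
    """Average over classes of TP / (TP + FP + FN)."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                  # true class c, predicted otherwise
        fp = sum(row[c] for row in cm) - tp   # predicted c, true class differs
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both truth and prediction
            ious.append(tp / union)
    return sum(ious) / len(ious)

# Toy example with six pixels and three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(truth, pred, 3)
print(pixel_accuracy(cm), mean_iou(cm))
```

In the toy example four of six pixels are correct, while the per-class IoUs are 1/3, 2/3 and 1/2, giving an mIoU of 0.5. The gap between the two numbers shows why mIoU is usually the stricter metric.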
83

Extracting Topography from Historic Topographic Maps Using GIS-Based Deep Learning

Pierce, Briar 01 May 2023
Historical topographic maps are valuable resources for studying past landscapes, but they are unsuitable for geospatial analysis. Cartographic map elements must be extracted and digitized for use in GIS. This can be accomplished by sophisticated image processing and pattern recognition techniques, and more recently, artificial intelligence. While these methods are generally effective, they require high levels of technical expertise. This study presents a straightforward method to digitally extract historical topographic map elements from within popular GIS software, using new and rapidly evolving toolsets. A convolutional neural network deep learning model was used to extract elevation contour lines from a 1940 United States Geological Survey (USGS) quadrangle in Sevier County, TN, ultimately producing a Digital Elevation Model (DEM). The topographically derived DEM (TOPO-DEM) is compared to a modern LiDAR-derived DEM to analyze its quality and utility. GIS-capable historians, archaeologists, geographers, and others can use this method in research and land management.
84

From interactive to semantic image segmentation

Gulshan, Varun January 2011
This thesis investigates two well defined problems in image segmentation, viz. interactive and semantic image segmentation. Interactive segmentation involves power assisting a user in cutting out objects from an image, whereas semantic segmentation involves partitioning pixels in an image into object categories. We investigate various models and energy formulations for both these problems in this thesis. In order to improve the performance of interactive systems, low level texture features are introduced as a replacement for the more commonly used RGB features. To quantify the improvement obtained by using these texture features, two annotated datasets of images are introduced (one consisting of natural images, and the other consisting of camouflaged objects). A significant improvement in performance is observed when using texture features for the case of monochrome images and images containing camouflaged objects. We also explore adding mid-level cues such as shape constraints into interactive segmentation by introducing the idea of geodesic star convexity, which extends the existing notion of a star convexity prior in two important ways: (i) It allows for multiple star centres as opposed to single stars in the original prior and (ii) It generalises the shape constraint by allowing for Geodesic paths as opposed to Euclidean rays. Global minima of our energy function can be obtained subject to these new constraints. We also introduce Geodesic Forests, which exploit the structure of shortest paths in implementing the extended constraints. These extensions to star convexity allow us to use such constraints in a practical segmentation system. This system is evaluated by means of a “robot user” to measure the amount of interaction required in a precise way, and it is shown that having shape constraints reduces user effort significantly compared to existing interactive systems. 
We also introduce a new and harder dataset which augments the existing GrabCut dataset with more realistic images and ground truth taken from the PASCAL VOC segmentation challenge. In the latter part of the thesis, we bring in object-category-level information in order to make the interactive segmentation tasks easier, and move towards fully automated semantic segmentation. An algorithm to automatically segment humans from cluttered images given their bounding boxes is presented. A top-down segmentation of the human is obtained using classifiers trained to predict segmentation masks from local HOG descriptors. These masks are then combined with bottom-up image information in a local GrabCut-like procedure. This algorithm is later completely automated to segment humans without requiring a bounding box, and is quantitatively compared with other semantic segmentation methods. We also introduce a novel way to acquire large quantities of segmented training data relatively effortlessly using the Kinect. In the final part of this work, we explore various semantic segmentation methods based on learning using bottom-up super-pixelisations. Different methods of combining multiple super-pixelisations are discussed and quantitatively evaluated on two segmentation datasets. We observe that simple combinations of independently trained classifiers on single super-pixelisations perform almost as well as complex methods based on jointly learning across multiple super-pixelisations. We also explore CRF-based formulations for semantic segmentation, and introduce a novel object boundary description based on visual words in the energy formulation. The object appearance and boundary parameters are trained jointly using structured output learning methods, and the benefit of adding pairwise terms is quantified on two different datasets.
85

Mise en correspondance robuste et détection de modèles visuels appliquées à l'analyse de façades / Robust feature correspondence and pattern detection for façade analysis

Ok, David 25 March 2013
In recent years, with the emergence of large image databases such as Google Street View, designing efficient, scalable, robust and accurate strategies for processing very large volumes of data, which are often heavily contaminated by false positives and massively ambiguous, has become a critical issue. This is of particular interest for property management and for diagnosing the condition of building façades. Scientifically, this concern is likely to advance the state of the art in fundamental computer vision problems. In this thesis we address two of them: (1) robust and scalable feature correspondence and (2) grammar-based façade image parsing. First, we propose a mathematical formalization of geometric consistency, which plays a key role in robust feature correspondence. From this formalization, we derive a novel match propagation method. The method is shown experimentally to be robust, efficient, scalable and accurate on highly contaminated and massively ambiguous sets of correspondences. Our experiments show that it performs well in deformable object matching and in the large-scale, accurate matching problems that arise in camera calibration. Building on this feature correspondence method, we then derive a search method for repeated patterns such as windows. It proves effective for accurate window localization and robust to difficult conditions such as the large photometric variability of repeated patterns and occlusions. It also produces very few hallucinations. Finally, we propose methodological contributions that exploit the repeated-pattern detection results, making façade image parsing substantially more robust and more accurate.
86

Segmentation sémantique d'images fortement structurées et faiblement structurées / Semantic Segmentation of Highly Structured and Weakly Structured Images

Gadde, Raghu Deep 30 June 2017
The aim of this thesis is to develop techniques for segmenting strongly structured scenes (e.g. building images and urban environments) and weakly structured scenes (e.g. landscapes or natural objects). Building images can naturally be described in terms of a shape grammar, and a derivation of this grammar can be inferred to obtain a segmentation of an image. However, writing such grammars is difficult and time-consuming. To alleviate this problem, a novel method is developed that automatically learns grammars from a training set of images and their ground-truth segmentations. Experiments suggest that such learned grammars lead to faster inference and better segmentations. Next, the effect of using grammars for strongly structured scenes is explored. To this end, a very simple technique based on Auto-Context is used to segment building images. Surprisingly, even without using any domain-specific knowledge, we observe significant improvements in performance on several benchmark datasets. Lastly, a novel technique based on convolutional neural networks (CNNs) is developed to segment images without any high-level structure. Image-adaptive filtering is performed within the CNN architecture to facilitate long-range connections between distant image regions. Experiments on several large-scale benchmarks show significant improvements in performance.
87

Restoring the balance between stuff and things in scene understanding

Caesar, Holger January 2018
Scene understanding is a central field in computer vision that attempts to detect objects in a scene and reason about their spatial, functional and semantic relations. While many works focus on things (objects with a well-defined shape), less attention has been given to stuff classes (amorphous background regions). However, stuff classes are important because they help explain many aspects of an image, including the scene type, the thing classes likely to be present and the physical attributes of all objects in the scene. The goal of this thesis is to restore the balance between stuff and things in scene understanding. In particular, we investigate how the recognition of stuff differs from that of things and develop methods that are suitable for both. We use stuff to find things and annotate a large-scale dataset to study stuff and things in context. First, we present two methods for semantic segmentation of stuff and things. Most methods require manual class weighting to counter imbalanced class frequency distributions, particularly on datasets with stuff and thing classes. We develop a novel joint calibration technique that takes into account class imbalance, class competition and overlapping regions by calibrating for the pixel-level evaluation criterion. The second method shows how to unify the advantages of region-based approaches (accurately delineated object boundaries) and fully convolutional approaches (end-to-end training). Both are combined in a universal framework that is equally suitable for stuff and things. Second, we propose to help weakly supervised object localization for classes where location annotations are not available, by transferring things and stuff knowledge from a source set with available annotations. This is particularly important if we want to scale scene understanding to real-world applications with thousands of classes, without having to exhaustively annotate millions of images.
Finally, we present COCO-Stuff - the largest existing dataset with dense stuff and thing annotations. Existing datasets are much smaller and were made with expensive polygon-based annotation. We use a very efficient stuff annotation protocol to densely annotate 164K images. Using this new dataset, we provide a detailed analysis of the dataset and visualize how stuff and things co-occur spatially in an image. We revisit the question whether stuff or things are easier to detect and which is more important based on visual and linguistic analysis.
88

Réseaux de neurones convolutifs pour la segmentation sémantique et l'apprentissage d'invariants de couleur / Convolutional neural networks for semantic segmentation and color constancy

Fourure, Damien 12 December 2017
Computer vision is an interdisciplinary field that investigates how computers can gain a high-level understanding from digital images or videos. In artificial intelligence, and more precisely in machine learning, the field in which this thesis is positioned, computer vision involves extracting features from images and then generalizing concepts related to these features. This field of research has become very popular in recent years, particularly thanks to the results of convolutional neural networks, which form the basis of so-called deep learning methods. Today, neural networks make it possible, among other things, to recognize the different objects present in an image, to generate very realistic images, or even to beat champions at the game of Go. Their performance is not limited to the image domain, since they are also used in other fields such as natural language processing (e.g. machine translation) or sound recognition. In this thesis, we study convolutional neural networks in order to develop specialized architectures and loss functions for low-level tasks (color constancy) as well as high-level tasks (semantic segmentation). Color constancy is the ability of the human visual system to perceive constant colors for a surface despite changes in the spectrum of the illumination. In computer vision, the main approach consists of estimating the color of the illuminant and then removing its impact on the perceived color of objects. We approach color constancy with neural networks by developing a new architecture that includes a subsampling operator inspired by traditional methods. Our experiments show that the method achieves performance competitive with the state of the art. Nevertheless, the architecture requires a large amount of training data. To partially address this problem and improve the training of neural networks, we present several techniques for artificial data augmentation. We also make two contributions to a high-level problem: semantic segmentation. This task, which consists of assigning a semantic class to each pixel of an image, is a challenge in computer vision because of its complexity. On the one hand, it requires many training examples whose ground truth is costly to obtain. On the other hand, it requires adapting traditional convolutional neural networks to produce a so-called dense prediction, i.e. a prediction for each pixel of the input image. To ease the difficulty of acquiring training data, we propose an approach that simultaneously uses several databases annotated with different labels. To do this, we define a selective loss function that allows a convolutional neural network to be trained on data from multiple databases. We also develop an auto-context approach that better captures the correlations between the labels of the different databases. Finally, we present our third contribution: a new convolutional neural network architecture called GridNet, specialized for semantic segmentation. Unlike traditional networks, which are implemented with a single path from the input (image) to the output (prediction), our architecture is implemented as a 2D grid that allows several interconnected streams to operate at different resolutions. To exploit all the paths of the grid, we propose a training technique inspired by dropout. In addition, we show empirically that our architecture generalizes many well-known state-of-the-art networks. We conclude with an analysis of the empirical results obtained with our architecture which, although trained from scratch with randomly initialized weights, performs very well and exceeds popular approaches that are often pre-trained.
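The abstract above mentions a selective loss function for training on several databases annotated with different labels, without defining it. One plausible reading, shown here purely as an assumption, is a cross-entropy whose softmax is restricted to the classes of the database a sample comes from, so that classes the database never annotates neither contribute to the loss nor get penalized. The function name and the toy class layout are hypothetical.

```python
import math

def selective_cross_entropy(logits, label, dataset_classes):
    """Cross-entropy with the softmax restricted to one dataset's classes.

    logits: scores over the union of all datasets' classes.
    label: index of the true class (must be annotated in this dataset).
    dataset_classes: indices of the classes this dataset annotates.
    Assumed form for illustration; not taken from the thesis itself.
    """
    if label not in dataset_classes:
        raise ValueError("label must belong to the sample's dataset")
    # Log-sum-exp over the dataset's own classes only (stabilized by the max).
    m = max(logits[c] for c in dataset_classes)
    log_norm = m + math.log(sum(math.exp(logits[c] - m) for c in dataset_classes))
    return log_norm - logits[label]

# Toy setup: classes 0-1 come from dataset A, classes 2-3 from dataset B.
logits = [2.0, 0.0, 5.0, 1.0]
loss_a = selective_cross_entropy(logits, label=0, dataset_classes=[0, 1])
```

Under this reading, the large score for class 2 does not affect `loss_a`, because class 2 is not part of dataset A's label set. A plain softmax over all four classes would instead push that score down even though dataset A says nothing about class 2.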
89

Apprentissage autosupervisé de modèles prédictifs de segmentation à partir de vidéos / Self-supervised learning of predictive segmentation models from video

Luc, Pauline 25 June 2019
Predictive models of the environment hold promise for transferring recent reinforcement learning successes to many real-world contexts by decreasing the number of interactions needed with the real world. Video prediction has attracted growing interest in recent years as a particular case of such predictive models, with broad applications in robotics and navigation systems. While RGB frames are easy to acquire and hold a lot of information, they are extremely challenging to predict and cannot be directly interpreted by downstream applications. Here we introduce the novel tasks of predicting the semantic and instance segmentation of future frames. The abstract feature spaces we consider are better suited to recursive prediction and allow us to develop models that convincingly predict segmentations up to half a second into the future. The predictions are more easily interpretable by downstream algorithms and remain rich, spatially detailed and easy to obtain, relying on state-of-the-art segmentation methods. We first focus on the task of semantic segmentation, for which we propose a discriminative approach based on adversarial training. Then, we introduce the novel task of predicting future semantic segmentation and develop an autoregressive convolutional neural network to address it. Finally, we extend our method to the more challenging problem of predicting future instance segmentation, which additionally separates individual objects. To deal with a varying number of output labels per image, we develop a predictive model in the space of high-level convolutional image features of the Mask R-CNN instance segmentation model. We are able to produce visually pleasing segmentations at high resolution for complex scenes involving a large number of instances, with convincing accuracy up to half a second ahead.
90

Segmentation and structuring of video documents for indexing applications

Tapu, Ruxandra Georgina 07 December 2012
Recent advances in telecommunications, combined with the development of image and video processing and acquisition devices, have led to a spectacular growth in the amount of visual content stored, transmitted and exchanged over the Internet. Within this context, developing efficient tools to access, browse and retrieve video content has become a crucial challenge. In Chapter 2 we introduce and validate a novel shot boundary detection algorithm able to identify abrupt and gradual transitions. The technique is based on an enhanced graph partition model, combined with a multi-resolution analysis and a non-linear filtering operation. The global computational complexity is reduced by implementing a two-pass strategy. In Chapter 3 the video abstraction problem is considered. Here we develop a keyframe representation system that extracts a variable number of images from each detected shot, depending on the variation in visual content. Chapter 4 deals with high-level semantic segmentation into scenes. Here, a novel scene/DVD chapter detection method is introduced and validated. Spatio-temporally coherent shots are clustered into the same scene based on a set of temporal constraints, adaptive thresholds and neutralized shots. Chapter 5 considers object detection and segmentation. Here we introduce a novel spatio-temporal visual saliency system based on region contrast, interest point correspondence, geometric transforms, motion class estimation and the temporal consistency of regions. The proposed technique is extended to 3D videos by representing the stereoscopic perception as a 2D video and its associated depth.
