11

On Fundamental Elements of Visual Navigation Systems

Siddiqui, Abujawad Rafid January 2014 (has links)
Visual navigation is a ubiquitous yet complex task performed by many species for the purpose of survival. Although visual navigation is actively studied within the robotics community, determining the elemental constituents of a robust visual navigation system remains a challenge. Motion estimation is often mistakenly considered the sole ingredient of a robust autonomous visual navigation system, and efforts are therefore concentrated on improving the accuracy of motion estimates. On the contrary, other factors are as important as motion, and their absence can make seamless visual navigation, such as that exhibited by humans, impossible. A general model of a visual navigation system is therefore needed which describes it in terms of a set of elemental units. In this regard, this thesis proposes a set of visual navigation elements (i.e. spatial memory, motion memory, scene geometry, context and scene semantics) as the building blocks of a visual navigation system, along with a set of methods that investigate the existence and role of these elements. A quantitative research methodology, in the form of a series of systematic experiments, is applied to these methods. The thesis formulates, implements and analyzes the proposed methods in the context of the visual navigation elements, arranged into three major groupings: a) spatial memory, b) motion memory, c) Manhattan structure, context and scene semantics. The investigations are carried out on multiple image datasets obtained from robot-mounted cameras (2D/3D) moving in different environments. Spatial memory is investigated by evaluating the proposed place recognition methods. The recognized places and inter-place associations are then used to represent a visited set of places in the form of a topological map. Such a representation of places and their spatial associations models the concept of spatial memory and resembles the human ability to represent and map places in large environments (e.g. cities). Motion memory in a visual navigation system is analyzed through a thorough investigation of various motion estimation methods. This leads to proposals of direct motion estimation methods that compute accurate motion estimates by basing the estimation process on dominant surfaces. In the everyday world, planar surfaces, especially ground planes, are ubiquitous, so the motion models are built upon this constraint. Manhattan structure provides geometric cues that help solve navigation problems: a few distinctive geometric primitives (e.g. planes) make up an indoor environment. A plane detection method is therefore proposed as a result of investigations of scene structure. The method uses supervised learning to classify the segmented clusters in 3D point-cloud datasets. In addition to geometry, the context of a scene also plays an important role in the robustness of a visual navigation system. The context in which navigation is performed imposes a set of constraints on objects and sections of the scene, and enforcing these constraints enables the observer to robustly segment the scene and classify its objects. A contextually aware scene segmentation method is proposed which classifies the image of a scene into a set of geometric classes sufficient for most navigation tasks.
However, in order to facilitate the cognitive visual decision-making process, the scene ought to be semantically segmented. The semantics of indoor scenes and of outdoor scenes are dealt with separately, and separate methods are proposed for visual mapping of environments of each type. An indoor scene consists of a corridor structure, which is modeled as a cubic space in order to build a map of the environment. A "flash-n-extend" strategy is proposed which controls the map update frequency. The semantics of outdoor scenes are also investigated, and a scene classification method is proposed. The method employs a Markov Random Field (MRF) based classification framework which generates a set of semantic maps.
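The abstract does not give the thesis's actual algorithms, but the spatial-memory element it describes — places recognized by appearance and linked into a topological map — lends itself to a small sketch. The following is a minimal illustration only; the class name, cosine-similarity matching, and the 0.8 threshold are assumptions, not details from the thesis.

```python
import numpy as np

class TopologicalMap:
    """Minimal topological map: nodes are recognized places (stored as
    appearance descriptors), edges record traversals between them."""

    def __init__(self, match_threshold=0.8):
        self.descriptors = []     # one descriptor per place node
        self.edges = set()        # undirected (i, j) traversal links
        self.match_threshold = match_threshold

    def _best_match(self, descriptor):
        """Return (index, similarity) of the most similar stored place."""
        if not self.descriptors:
            return None, 0.0
        sims = [float(np.dot(d, descriptor) /
                      (np.linalg.norm(d) * np.linalg.norm(descriptor) + 1e-9))
                for d in self.descriptors]
        best = int(np.argmax(sims))
        return best, sims[best]

    def observe(self, descriptor, previous_place=None):
        """Recognize the current place or create a new node, and link it
        to the previously visited place (the spatial association)."""
        idx, sim = self._best_match(descriptor)
        if idx is None or sim < self.match_threshold:
            self.descriptors.append(np.asarray(descriptor, dtype=float))
            idx = len(self.descriptors) - 1
        if previous_place is not None and previous_place != idx:
            self.edges.add(tuple(sorted((previous_place, idx))))
        return idx
```

In use, one would compute a global image descriptor (e.g. a bag-of-visual-words histogram) per frame and carry the returned place index forward as `previous_place`, so the edge set gradually captures the inter-place associations the abstract refers to.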
12

Learning Structured and Deep Representations for Traffic Scene Understanding

Yu, Zhiding 01 December 2017 (has links)
Recent advances in representation learning have led to an increasing variety of vision-based approaches in traffic scene understanding. These include general vision problems such as object detection, depth estimation, edge/boundary/contour detection, semantic segmentation and scene classification, as well as application-driven problems such as pedestrian detection, vehicle detection, lane marker detection and road segmentation. In this thesis, we approach some of these problems by exploring structured and invariant representations from the visual input. Our research is mainly motivated by two facts: 1. Traffic scenes often contain highly structured layouts, so exploring structured priors is expected to help considerably in improving scene understanding performance. 2. A major challenge of traffic scene understanding lies in the diverse and changing nature of the contents; it is therefore important to find robust visual representations that are invariant to such variability. We start from highway scenarios, where we are interested in detecting the hard road borders and estimating the drivable space ahead of such physical boundaries. To this end, we treat the task as a joint detection and tracking problem and formulate it with structured Hough voting (SHV): a conditional random field model that explores both intra-frame geometric and inter-frame temporal information to generate more accurate and stable predictions. Turning from highway scenes to urban scenes, we consider dense prediction problems such as category-aware semantic edge detection and semantic segmentation. Category-aware semantic edge detection is challenging, as the model is required to jointly localize object contours and classify each edge pixel into one or multiple predefined classes. We propose CASENet, a multi-label deep network with state-of-the-art edge detection performance. To address the label misalignment problem in edge learning, we also propose SEAL, a framework for simultaneous edge alignment and learning. Failure across different domains has been a common bottleneck of semantic segmentation methods. In this thesis, we address the problem of adapting a segmentation model trained on a source domain to a different target domain without knowing the target domain labels, and propose a class-balanced self-training approach for such unsupervised domain adaptation. We adopt the "synthetic-to-real" setting, where a model is pre-trained on GTA-5 and adapted to real-world datasets such as Cityscapes and Nexar, as well as the "cross-city" setting, where a model is pre-trained on Cityscapes and adapted to unseen data from Rio, Tokyo, Rome and Taipei. Experiments show the superior performance of our method compared to state-of-the-art methods such as adversarial-training-based domain adaptation.
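As a rough illustration of the class-balanced self-training idea described above — pseudo-labels on the target domain selected with per-class confidence thresholds, so that easy, frequent classes cannot crowd out rare ones — here is a minimal NumPy sketch. The function name, the keep fraction, and the quantile-based thresholding are illustrative assumptions; the thesis's exact formulation may differ.

```python
import numpy as np

def class_balanced_pseudo_labels(probs, keep_fraction=0.2, ignore_label=255):
    """probs: (N, C) softmax outputs of the source-trained model on
    target-domain pixels. Selects pseudo-labels with a *per-class*
    confidence threshold instead of one global threshold."""
    preds = probs.argmax(axis=1)
    confs = probs.max(axis=1)
    labels = np.full(preds.shape, ignore_label, dtype=np.int64)
    for c in range(probs.shape[1]):
        mask = preds == c
        if not mask.any():
            continue
        # class-wise threshold: keep only the top `keep_fraction` most
        # confident predictions *of this class*
        thresh = np.quantile(confs[mask], 1.0 - keep_fraction)
        labels[mask & (confs >= thresh)] = c
    return labels  # retrain on these, then re-estimate the thresholds
```

Self-training would alternate between generating such pseudo-labels and fine-tuning the segmentation model on them, typically growing `keep_fraction` over rounds.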
13

Depth Estimation Using Adaptive Bins via Global Attention at High Resolution

Bhat, Shariq 21 April 2021 (has links)
We address the problem of estimating a high-quality dense depth map from a single RGB input image. We start out with a baseline encoder-decoder convolutional neural network architecture and pose the question of how the global processing of information can help improve overall depth estimation. To this end, we propose a transformer-based architecture block that divides the depth range into bins whose center values are estimated adaptively per image. The final depth values are estimated as linear combinations of the bin centers. We call our new building block AdaBins. Our results show a decisive improvement over the state-of-the-art on several popular depth datasets across all metrics. We also validate the effectiveness of the proposed block with an ablation study.
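The final step the abstract describes — depth as a linear combination of adaptively placed bin centers — is concrete enough to sketch. Below is a minimal NumPy version of that step only; the depth range bounds and normalization details are illustrative assumptions, not the network itself.

```python
import numpy as np

def adaptive_bin_depth(bin_widths, probs, d_min=1e-3, d_max=10.0):
    """bin_widths: (B,) positive values predicted per image -- the
    adaptive division of the depth range. probs: (H, W, B) per-pixel
    scores over the bins. Returns an (H, W) depth map as the linear
    combination of bin centers."""
    widths = bin_widths / bin_widths.sum()             # normalize to 1
    edges = d_min + (d_max - d_min) * np.cumsum(widths)
    edges = np.concatenate([[d_min], edges])
    centers = 0.5 * (edges[:-1] + edges[1:])           # (B,) bin centers
    probs = probs / probs.sum(axis=-1, keepdims=True)  # per-pixel weights
    return probs @ centers                             # (H, W) depth map
```

Because the bin widths are predicted per image, the same number of bins can concentrate resolution where a given scene's depths actually lie, which is the core idea behind the adaptive bins.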
14

Spatio-temporal reasoning for semantic scene understanding and its application in recognition and prediction of manipulation actions in image sequences

Ziaeetabar, Fatemeh 07 May 2019 (has links)
No description available.
15

A Multi-camera based Next Best View Approach for Semantic Scene Understanding

Persson, Anton January 2023 (has links)
Robots are becoming more common; robotics has gone from bleeding-edge technology to an everyday topic that families discuss around the dinner table. The number of robots in industry is growing, which means that the demand and need for robots to understand the environment they are working in is also growing. The standard method for a robot to gather information about a scene involves moving to different pre-determined poses from which it can view and analyze the scene. However, this approach does not consider the topology of the scene that the robot should explore. This thesis aims to create a two-dimensional approach to determine the next best view (2D-NBV) to view and explore the scene, introduced in the method section. The 2D-NBV method converts a point cloud of the scene to an elevation map. A segmenting network is used to get the positions of pre-trained objects. The positions are then used to generate a 2D Gaussian kernel heatmap of the scene. Using the 2D elevation and Gaussian maps, the NBV pose is then calculated. The NBV pose is then converted back to a 6D pose that the robot moves to in order to capture a new point cloud and register it to the scene. The 2D-NBV method is compared to a baseline and a state-of-the-art method. The baseline method captures four different point clouds from pre-determined positions and registers them together. The state-of-the-art method finds a point of interest and declares a set of view candidates on a sphere around the point; ray casting is used to find the pose with the highest information gain, and this pose is set as the NBV for the robot to move to. The goal of this thesis is that the method should perform better than the baseline method, described further in the method section. The evaluation metric used in this thesis is how well the different methods could estimate the bounding boxes of pre-trained items using an off-the-shelf semantic scene segmentation method. Six scenes with varying difficulty were constructed to test the methods. The results showed that the 2D-NBV method successfully complemented the scene with information about its empty cells. The 2D-NBV outperforms the state-of-the-art on occluded scenes, and performed overall just as well as the baseline. The reason that the 2D-NBV did not outperform the baseline is seen as a consequence of the information loss in going from 3D to 2D.
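As a loose sketch of the 2D scoring step (not the thesis's implementation), the following function combines an elevation map's unobserved cells with the object heatmap to rank candidate viewpoints. The scoring rule, candidate set, and visibility radius are all assumptions made for illustration.

```python
import numpy as np

def next_best_view_2d(elevation, object_heatmap, candidates, radius=15):
    """elevation: (H, W) grid with np.nan marking unobserved cells.
    object_heatmap: (H, W) Gaussian-kernel map peaking at detected
    objects. candidates: list of (row, col) camera positions on the
    grid. Scores each candidate by how much unknown space near objects
    it would reveal -- a rough 2D stand-in for NBV scoring."""
    unknown = np.isnan(elevation).astype(float)
    gain = unknown * object_heatmap     # unknown cells near objects
    H, W = elevation.shape
    rows, cols = np.mgrid[0:H, 0:W]
    best, best_score = None, -np.inf
    for (r, c) in candidates:
        within = (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
        score = gain[within].sum()
        if score > best_score:
            best, best_score = (r, c), score
    return best  # later lifted back to a 6D camera pose for the robot
```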
16

Incorporating spatial relationship information in signal-to-text processing

Davis, Jeremy Elon 13 May 2022 (has links) (PDF)
This dissertation outlines the development of a signal-to-text system that incorporates spatial relationship information to generate scene descriptions. Existing signal-to-text systems generate accurate descriptions with regard to the information contained in an image; however, to date, no signal-to-text system incorporates spatial relationship information. A survey of related work in the fields of object detection, signal-to-text, and spatial relationships in images is presented first. Three methodologies, each followed by an evaluation, were conducted in order to create the signal-to-text system: 1) generation of object localization results from a set of input images, 2) derivation of Level One Summaries from an input image, and 3) inference of Level Two Summaries from the derived Level One Summaries. Validation processes are described for the second and third evaluations, as the first has been previously validated in the original related works. The goal of this research is to show that a signal-to-text system that incorporates spatial information produces more informative descriptions of the content of an image. An additional goal is to demonstrate that the signal-to-text system can easily be applied to data sets other than those used to train the system and achieve results similar to those on the training sets. To achieve this goal, a validation study was conducted and is presented to the reader.
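A toy example of the kind of spatial-relationship information involved — deriving a coarse relational phrase from two detected bounding boxes — is sketched below. The dissertation's Level One Summaries are richer than this; the function here is purely illustrative.

```python
def spatial_relation(box_a, box_b):
    """Derive a coarse spatial relation of object A relative to object B.
    Each box is (x_min, y_min, x_max, y_max) in image coordinates with
    the origin at the top-left, so larger y means lower in the image."""
    ax = (box_a[0] + box_a[2]) / 2.0
    ay = (box_a[1] + box_a[3]) / 2.0
    bx = (box_b[0] + box_b[2]) / 2.0
    by = (box_b[1] + box_b[3]) / 2.0
    dx, dy = bx - ax, by - ay
    if abs(dx) >= abs(dy):
        # B lies to the right of A, so A is "left of" B, and vice versa
        return "left of" if dx > 0 else "right of"
    # B lies lower in the image than A, so A is "above" B
    return "above" if dy > 0 else "below"
```

A description generator could then combine object labels from the localization step with such relations, e.g. "a cup left of a laptop".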
17

Semantic Segmentation of Urban Scene Images Using Recurrent Neural Networks

Daliparthi, Venkata Satya Sai Ajay January 2020 (has links)
Background: In autonomous driving vehicles, the vehicle receives pixel-wise sensor data from RGB cameras, point-wise depth information from the cameras, and other sensor data as input. The computer inside the autonomous driving vehicle processes the input data and produces the desired output, such as steering angle, torque, and brake. To make accurate decisions, the computer inside the vehicle must be completely aware of its surroundings and understand each pixel in the driving scene. Semantic segmentation is the task of assigning a class label (such as car, road, pedestrian, or sky) to each pixel in a given image, so a better-performing semantic segmentation algorithm contributes to the advancement of the autonomous driving field. Research Gap: Traditional methods, such as handcrafted features and feature extraction methods, were previously the main tools for semantic segmentation. Since the rise of deep learning, most works use deep learning to deal with semantic segmentation, most commonly with Convolutional Neural Networks (CNNs). Even though some works have made use of Recurrent Neural Networks (RNNs), the effect of RNNs on semantic segmentation has not yet been thoroughly studied. Our study addresses this research gap. Idea: After going through the existing literature, we came up with the idea of "using RNNs as an add-on module, to augment the skip connections in semantic segmentation networks through residual connections." Objectives and Method: The main objective of our work is to improve the performance of semantic segmentation networks by using RNNs; experimentation was chosen as the methodology for our study. We propose three novel architectures, called UR-Net, UAR-Net, and DLR-Net, by applying our idea to the existing networks U-Net, Attention U-Net, and DeepLabV3+ respectively. Results and Findings: We show empirically that our proposed architectures improve the segmentation of edges and boundaries. Through our study, we found a trade-off between using RNNs and the inference time of the model: using RNNs to improve the performance of a semantic segmentation network costs some extra seconds during inference. Conclusion: Our findings will not directly benefit the autonomous driving field, where better performance is needed in real time, but they will contribute to the advancement of biomedical image segmentation, where doctors can trade those extra seconds of inference for better performance.
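A minimal PyTorch sketch of the stated idea — an RNN add-on that refines a skip-connection feature and merges back through a residual connection — follows. The module name, the simple convolutional recurrent cell, and the number of unrolled steps are assumptions; this is not the thesis's UR-Net/UAR-Net/DLR-Net block.

```python
import torch
import torch.nn as nn

class RecurrentSkip(nn.Module):
    """Add-on recurrent refinement of a U-Net skip connection, combined
    with the original feature through a residual connection."""

    def __init__(self, channels, steps=3):
        super().__init__()
        self.steps = steps
        # a simple convolutional recurrent cell: the new state is
        # computed from (input feature, previous state)
        self.cell = nn.Conv2d(2 * channels, channels,
                              kernel_size=3, padding=1)

    def forward(self, skip):
        state = torch.zeros_like(skip)
        for _ in range(self.steps):  # recurrence unrolled over steps
            state = torch.tanh(self.cell(torch.cat([skip, state], dim=1)))
        return skip + state          # residual combination
```

Because the recurrence reuses one cell over several steps, such a module adds relatively few parameters but lengthens the forward pass, which is consistent with the inference-time trade-off the abstract reports.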
18

Information fusion for scene understanding

Xu, Philippe 28 November 2014 (has links)
Image understanding is a key issue in modern robotics, computer vision and machine learning. In particular, driving scene understanding is very important in the context of advanced driver assistance systems for intelligent vehicles. In order to recognize the large number of objects that may be found on the road, several sensors and decision algorithms are necessary. To make the most of existing state-of-the-art methods, we address the issue of scene understanding from an information fusion point of view. The combination of many diverse detection modules, which may deal with distinct classes of objects and different data representations, is handled by reasoning in the image space. We consider image understanding at two levels: object detection and semantic segmentation. The theory of belief functions is used to model and combine the outputs of these detection modules. We emphasize the need for a fusion framework flexible enough to easily include new classes, new sensors and new object detection algorithms. In this thesis, we propose a general method to model the outputs of classical machine learning techniques as belief functions. Next, we apply our framework to the combination of pedestrian detectors using the Caltech Pedestrian Detection Benchmark. The KITTI Vision Benchmark Suite is then used to validate our approach in a semantic segmentation context using multi-modal information.
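For readers unfamiliar with belief functions, the sketch below shows the standard Dempster's rule of combination, plus one simple way to discount a calibrated detector score into a mass function. The discounting scheme and names are illustrative assumptions; the thesis proposes its own, more general transformation.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions by Dempster's rule. Each mass function
    maps a focal set (frozenset of labels) to a mass in [0, 1], with
    masses summing to 1; the set of all labels encodes ignorance.
    Assumes the two sources are not totally conflicting."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb          # mass on empty intersection
    norm = 1.0 - conflict                    # renormalize the rest
    return {s: m / norm for s, m in combined.items()}

def score_to_mass(p, reliability=0.9,
                  frame=frozenset({"ped", "not_ped"})):
    """Discount a calibrated pedestrian score p into a mass function,
    leaving (1 - reliability) on the whole frame as ignorance."""
    return {frozenset({"ped"}): reliability * p,
            frozenset({"not_ped"}): reliability * (1 - p),
            frame: 1 - reliability}
```

Combining two detectors then reads `dempster_combine(score_to_mass(p1), score_to_mass(p2))`; the explicit ignorance mass is what makes adding a new, less reliable sensor to the fusion framework straightforward.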
19

Geometric modeling of indoor scenes from acquired point data

Oesau, Sven 24 June 2015 (has links)
Geometric modeling and semantization of indoor scenes from sampled point data is an emerging research topic. Recent advances in acquisition technologies provide highly accurate laser scanners and low-cost handheld RGB-D cameras for real-time acquisition. However, the processing of large data sets is hampered by high amounts of clutter and various defects such as missing data, outliers and anisotropic sampling. This thesis investigates three novel methods for efficient geometric modeling and semantization from unstructured point data: shape detection, classification and geometric modeling. Chapter 2 introduces two methods for abstracting the input point data with primitive shapes. First, we propose a line extraction method to detect wall segments from a horizontal cross-section of the input point cloud. Second, we introduce a region growing method that progressively detects and reinforces regularities of planar shapes. This method utilizes regularities common to man-made architecture, i.e. coplanarity, parallelism and orthogonality, to reduce complexity and improve data fitting in defect-laden data. Chapter 3 introduces a method based on statistical analysis for separating clutter from structure. We also contribute a supervised machine learning method for object classification based on sets of planar shapes. Chapter 4 introduces a method for 3D geometric modeling of indoor scenes. We first partition the space using primitive shapes detected from permanent structures. An energy formulation is then used to solve an inside/outside labeling of the space partition, providing robustness to missing data and outliers.
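In the spirit of the Chapter 2 shape detection, here is a minimal sketch of region growing for a single planar shape: greedy absorption of neighbors whose normals agree and that lie close to the seed plane. The thresholds and neighbor structure are assumptions, and the thesis's regularity reinforcement (coplanarity, parallelism, orthogonality across shapes) is not modeled here.

```python
import numpy as np

def grow_planar_region(points, normals, neighbors, seed,
                       angle_thresh_deg=10.0, dist_thresh=0.02):
    """Greedy region growing for one planar shape. points, normals:
    (N, 3) arrays with unit normals; neighbors[i] lists the indices
    adjacent to point i (e.g. from a k-d tree query); seed: starting
    index. Returns the set of point indices in the grown region."""
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    n0, p0 = normals[seed], points[seed]
    region, frontier, visited = {seed}, [seed], {seed}
    while frontier:
        i = frontier.pop()
        for j in neighbors[i]:
            if j in visited:
                continue
            visited.add(j)
            # accept points that are both near the seed plane and
            # oriented consistently with it
            coplanar = abs(np.dot(points[j] - p0, n0)) < dist_thresh
            aligned = abs(np.dot(normals[j], n0)) > cos_thresh
            if coplanar and aligned:
                region.add(j)
                frontier.append(j)
    return region
```

Running this from many seeds and keeping large regions yields the set of planar shapes that later stages (classification, space partitioning) operate on.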
20

Modeling and recognizing interactions between people, objects and scenes

Delaitre, Vincent 07 April 2015 (has links)
In this thesis, we focus on modeling interactions between people, objects and scenes, and show the benefits of combining the corresponding cues for improving both action classification and scene understanding. In the first part, we seek to exploit the scene and object context to improve action classification in still images. We explore alternative bag-of-features models and propose a method that takes advantage of the scene context. We then propose a new model exploiting the object context for action classification, based on pairs of body part and object detectors. We evaluate our methods on our newly collected still-image dataset as well as three other datasets for action classification, and show performance close to the state of the art.
In the second part of this thesis, we address the reverse problem and aim at using the contextual information provided by people to help object localization and scene understanding. We collect a new dataset of time-lapse videos involving people interacting with indoor scenes. We develop an approach to describe image regions by the distribution of human co-located poses, and use this pose-based representation to improve object localization. We further demonstrate that people cues can improve several steps of existing pipelines for indoor scene understanding. Finally, we extend the annotation of our time-lapse dataset to 3D and show how to infer object labels for occupied 3D volumes of a scene. To summarize, the contributions of this thesis are the following: (i) we design action classification models for still images that take advantage of the scene and object context, and we gather a new dataset to evaluate their performance; (ii) we develop a new model that improves object localization thanks to observations of people interacting with an indoor scene, and test it on a new dataset centered on person, object and scene interactions; (iii) we propose the first method to evaluate the volumes occupied by different object classes in a room, which allows us to analyze the current 3D scene understanding pipeline and identify its main sources of errors.
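As a rough illustration of describing image regions by the distribution of co-located human poses, the sketch below accumulates per-cell histograms of quantized pose clusters over a time-lapse. The grid representation and the assumption of pre-clustered poses are illustrative simplifications of the thesis's representation.

```python
import numpy as np

def pose_histograms(grid_shape, observations, n_pose_clusters):
    """Describe each cell of an image grid by the distribution of human
    poses observed there over a time-lapse. grid_shape: (rows, cols)
    tuple. observations: list of (row, col, pose_cluster) triples, one
    per detected person, where pose_cluster indexes a quantized pose.
    Returns normalized per-cell histograms: regions where people often
    sit vs. reach vs. stand get distinct signatures, usable as context
    for object localization (chairs, shelves, floors, ...)."""
    hists = np.zeros(grid_shape + (n_pose_clusters,))
    for r, c, k in observations:
        hists[r, c, k] += 1.0
    totals = hists.sum(axis=-1, keepdims=True)
    # avoid dividing cells that never saw a person
    return np.divide(hists, totals, out=np.zeros_like(hists),
                     where=totals > 0)
```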
