11

Bayesian Data Association for Temporal Scene Understanding

Brau Avila, Ernesto January 2013 (has links)
Understanding the content of a video sequence is not a particularly difficult problem for humans. We can easily identify objects, such as people, and track their position and pose within the 3D world. A computer system that could understand the world through videos would be extremely beneficial in applications such as surveillance, robotics, and biology. Despite significant advances in areas like tracking and, more recently, 3D static scene understanding, such a vision system does not yet exist. In this work, I present progress on this problem, restricted to videos of objects that move smoothly and are relatively easy to detect, such as people. Our goal is to identify all the moving objects in the scene and track their physical state (e.g., their 3D position or pose) in the world throughout the video. We develop a Bayesian generative model of a temporal scene, where we separately model data association, the 3D scene and imaging system, and the likelihood function. Under this model, the video data is the result of capturing the scene with the imaging system and noisily detecting video features. This formulation is very general and can be used to model a wide variety of scenarios, including videos of people walking and time-lapse images of pollen tubes growing in vitro. Importantly, we model the scene in world coordinates and units, as opposed to pixels, allowing us to reason about the world in a natural way, e.g., explaining occlusion and perspective distortion. We use Gaussian processes to model motion, and propose that they are a general and effective way to characterize smooth, but otherwise arbitrary, trajectories. We perform inference using MCMC sampling, where we fit our model of the temporal scene to data extracted from the videos. We address the problem of variable dimensionality by estimating data association and integrating out all scene variables. Our experiments show our approach is competitive, producing results comparable to those of state-of-the-art methods.
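A minimal sketch of how a Gaussian process can act as a smoothness prior over a trajectory, assuming a squared-exponential (RBF) kernel and a zero-mean GP over one spatial coordinate; the kernel choice, hyperparameters, and all names below are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

def rbf_kernel(t1, t2, scale=1.0, length=2.0):
    # Squared-exponential kernel: nearby time points give correlated positions.
    d = t1[:, None] - t2[None, :]
    return scale**2 * np.exp(-0.5 * (d / length)**2)

def gp_posterior_mean(t_obs, x_obs, t_query, noise=0.1):
    # Posterior mean of a zero-mean GP over one coordinate of a track:
    # E[x(t*)] = K(t*, t) [K(t, t) + sigma^2 I]^{-1} x
    K = rbf_kernel(t_obs, t_obs) + noise**2 * np.eye(len(t_obs))
    K_star = rbf_kernel(t_query, t_obs)
    return K_star @ np.linalg.solve(K, x_obs)

t_obs = np.array([0.0, 1.0, 2.0, 4.0, 5.0])      # frame times with detections
x_obs = np.array([0.0, 0.9, 2.1, 3.9, 5.2])      # noisy x-positions (world units)
t_query = np.linspace(0.0, 5.0, 11)              # times to interpolate
print(gp_posterior_mean(t_obs, x_obs, t_query))  # smooth trajectory estimate
```

The posterior mean interpolates the noisy detections smoothly; in a full data-association model such as the one described above, the scene variables would be integrated out rather than evaluated pointwise like this.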
12

On Fundamental Elements of Visual Navigation Systems

Siddiqui, Abujawad Rafid January 2014 (has links)
Visual navigation is a ubiquitous yet complex task which is performed by many species for the purpose of survival. Although visual navigation is actively being studied within the robotics community, determining the elemental constituents of a robust visual navigation system remains a challenge. Motion estimation is mistakenly considered the sole ingredient of a robust autonomous visual navigation system, and efforts are therefore made to improve the accuracy of motion estimates. On the contrary, there are other factors which are as important as motion and whose absence could result in an inability to perform seamless visual navigation of the kind exhibited by humans. A general model of a visual navigation system is therefore needed, one that describes it in terms of a set of elemental units. In this regard, this thesis suggests a set of visual navigation elements (i.e. spatial memory, motion memory, scene geometry, context and scene semantics) as the building blocks of a visual navigation system. A set of methods is proposed which investigates the existence and role of these elements in a visual navigation system. A quantitative research methodology, in the form of a series of systematic experiments, is applied to these methods. The thesis formulates, implements and analyzes the proposed methods in the context of the visual navigation elements, which are arranged into three major groupings: a) spatial memory, b) motion memory, and c) Manhattan structure, context and scene semantics. The investigations are carried out on multiple image datasets obtained by robot-mounted cameras (2D/3D) moving in different environments. Spatial memory is investigated by evaluating proposed place recognition methods. The recognized places and inter-place associations are then used to represent a visited set of places in the form of a topological map. Such a representation of places and their spatial associations models the concept of spatial memory. It resembles humans’ ability to represent and map places in large environments (e.g. cities). Motion memory in a visual navigation system is analyzed through a thorough investigation of various motion estimation methods. This leads to proposals of direct motion estimation methods which compute accurate motion estimates by basing the estimation process on dominant surfaces. In the everyday world, planar surfaces, especially ground planes, are ubiquitous, so the motion models are built upon this constraint. Manhattan structure provides geometric cues which are helpful in solving navigation problems. There are some unique geometric primitives (e.g. planes) which make up an indoor environment. Therefore, a plane detection method is proposed as a result of investigations performed on scene structure. The method uses supervised learning to successfully classify the segmented clusters in 3D point-cloud datasets. In addition to geometry, the context of a scene also plays an important role in the robustness of a visual navigation system. The context in which navigation is being performed imposes a set of constraints on objects and sections of the scene. Enforcing such constraints enables the observer to robustly segment the scene and to classify various objects in it. A contextually aware scene segmentation method is proposed which classifies the image of a scene into a set of geometric classes. The geometric classes are sufficient for most navigation tasks.
However, in order to facilitate the cognitive visual decision-making process, the scene ought to be semantically segmented. The semantics of indoor scenes and of outdoor scenes are dealt with separately, and separate methods are proposed for visual mapping of environments of each type. An indoor scene consists of a corridor structure which is modeled as a cubic space in order to build a map of the environment. A “flash-n-extend” strategy is proposed which is responsible for controlling the map update frequency. The semantics of outdoor scenes are also investigated and a scene classification method is proposed. The method employs a Markov Random Field (MRF) based classification framework which generates a set of semantic maps.
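As one concrete illustration of the geometry step above, the following hedged sketch fits a plane to a segmented 3D point-cloud cluster via SVD and extracts simple features (normal direction, fit residual) of the kind a supervised classifier could consume; the thesis's actual features and classifier are not specified in the abstract, so everything here is an assumption:

```python
import numpy as np

def plane_features(cluster):
    # Fit a plane to a 3D point cluster via SVD; the direction of least
    # variance is the plane normal, and the mean absolute projection onto
    # it measures how planar the cluster is.
    centered = cluster - cluster.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    residual = np.abs(centered @ normal).mean()
    return normal, residual

rng = np.random.default_rng(0)
# A roughly horizontal, slightly noisy patch of 200 points:
pts = np.column_stack([rng.uniform(0, 1, 200),
                       rng.uniform(0, 1, 200),
                       0.02 * rng.standard_normal(200)])
normal, residual = plane_features(pts)
print(normal, residual)  # normal near (0, 0, +/-1); small residual => planar
```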
13

Learning Structured and Deep Representations for Traffic Scene Understanding

Yu, Zhiding 01 December 2017 (has links)
Recent advances in representation learning have led to an increasing variety of vision-based approaches in traffic scene understanding. This includes general vision problems such as object detection, depth estimation, edge/boundary/contour detection, semantic segmentation and scene classification, as well as application-driven problems such as pedestrian detection, vehicle detection, lane marker detection and road segmentation. In this thesis, we approach some of these problems by exploring structured and invariant representations from the visual input. Our research is mainly motivated by two facts: (1) Traffic scenes often contain highly structured layouts, so exploring structured priors is expected to help considerably in improving scene understanding performance. (2) A major challenge of traffic scene understanding lies in the diverse and changing nature of the contents; it is therefore important to find robust visual representations that are invariant against such variability. We start from highway scenarios, where we are interested in detecting the hard road borders and estimating the drivable space before such a physical boundary. To this end, we treat the task as a joint detection and tracking problem, and formulate it with structured Hough voting (SHV): a conditional random field model that explores both intra-frame geometric and inter-frame temporal information to generate more accurate and stable predictions. Turning from highway scenes to urban scenes, we consider dense prediction problems such as category-aware semantic edge detection and semantic segmentation. Category-aware semantic edge detection is challenging, as the model is required to jointly localize object contours and classify each edge pixel into one or multiple predefined classes. We propose CASENet, a multi-label deep network with state-of-the-art edge detection performance. To address the label misalignment problem in edge learning, we also propose SEAL, a framework for simultaneous edge alignment and learning. Failure across different domains has been a common bottleneck of semantic segmentation methods. In this thesis, we address the problem of adapting a segmentation model trained on a source domain to a different target domain without knowing the target domain labels, and propose a class-balanced self-training approach for such unsupervised domain adaptation. We adopt the "synthetic-to-real" setting, where a model is pre-trained on GTA-5 and adapted to real-world datasets such as Cityscapes and Nexar, as well as the "cross-city" setting, where a model is pre-trained on Cityscapes and adapted to unseen data from Rio, Tokyo, Rome and Taipei. Experiments show the superior performance of our method compared to state-of-the-art methods such as adversarial-training-based domain adaptation.
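A hedged sketch of the class-balanced self-training idea: pseudo-labels are kept only where the source-trained model is confident, with the confidence threshold chosen per class so that rare classes still contribute. The quantile-based selection rule and all names below are illustrative assumptions, not the thesis's exact procedure:

```python
import numpy as np

def class_balanced_pseudo_labels(probs, portion=0.5, ignore=255):
    # probs: (C, H, W) softmax output of the source-trained segmenter on a
    # target-domain image. Keep a pseudo-label only where confidence exceeds
    # a *class-specific* threshold, so rare classes also get selected.
    conf = probs.max(axis=0)
    pred = probs.argmax(axis=0)
    labels = np.full(pred.shape, ignore, dtype=np.int64)
    for c in range(probs.shape[0]):
        mask = pred == c
        if not mask.any():
            continue
        # Threshold at the (1 - portion) quantile of this class's confidences.
        thresh = np.quantile(conf[mask], 1.0 - portion)
        labels[mask & (conf >= thresh)] = c
    return labels

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(3), size=(4, 4)).transpose(2, 0, 1)  # (3, 4, 4)
print(class_balanced_pseudo_labels(p))  # 255 marks pixels left unlabeled
```

The retrained model would then treat the surviving labels as ground truth on the target domain and iterate; a single global threshold would instead let frequent classes dominate the pseudo-label pool.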
14

Depth Estimation Using Adaptive Bins via Global Attention at High Resolution

Bhat, Shariq 21 April 2021 (has links)
We address the problem of estimating a high-quality dense depth map from a single RGB input image. We start out with a baseline encoder-decoder convolutional neural network architecture and pose the question of how the global processing of information can help improve overall depth estimation. To this end, we propose a transformer-based architecture block that divides the depth range into bins whose center value is estimated adaptively per image. The final depth values are estimated as linear combinations of the bin centers. We call our new building block AdaBins. Our results show a decisive improvement over the state-of-the-art on several popular depth datasets across all metrics. We also validate the effectiveness of the proposed block with an ablation study.
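The bin-center readout can be sketched directly from the description above: predicted bin widths partition a fixed depth range per image, and each pixel's depth is the probability-weighted combination of the resulting bin centers. Tensor shapes, the depth range, and the normalization below are assumptions:

```python
import torch

def depth_from_bins(bin_widths, probs, d_min=0.1, d_max=10.0):
    # bin_widths: (B, N) positive widths predicted per image by the
    # transformer head; probs: (B, N, H, W) per-pixel distribution over bins.
    widths = (d_max - d_min) * bin_widths / bin_widths.sum(dim=1, keepdim=True)
    edges = d_min + torch.cumsum(widths, dim=1)
    centers = edges - 0.5 * widths                          # (B, N) bin centers
    # Final depth: linear combination of adaptive bin centers.
    return (probs * centers[:, :, None, None]).sum(dim=1)   # (B, H, W)

B, N, H, W = 2, 8, 4, 4
bin_widths = torch.rand(B, N).softmax(dim=1)
probs = torch.rand(B, N, H, W).softmax(dim=1)
print(depth_from_bins(bin_widths, probs).shape)  # torch.Size([2, 4, 4])
```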
15

Spatio-temporal reasoning for semantic scene understanding and its application in recognition and prediction of manipulation actions in image sequences

Ziaeetabar, Fatemeh 07 May 2019 (has links)
No description available.
16

A Multi-camera based Next Best View Approach for Semantic Scene Understanding

Persson, Anton January 2023 (has links)
Robots are becoming more common; robotics has gone from bleeding-edge technology to an everyday topic that families discuss around the dinner table. The number of robots in industry is growing, which means that the demand and need for robots to understand the environment they are working in is also growing. The standard method for a robot to gather information about a scene involves moving to different pre-determined poses from which it can view and analyze the scene. However, this approach does not consider the topology of the scene that the robot should explore. This thesis aims to create a two-dimensional approach to determine the next best view (2D-NBV) to view and explore the scene, introduced in the method section. The 2D-NBV method converts a point cloud of the scene to an elevation map. A segmentation network is used to get the positions of pre-trained objects. The positions are then used to generate a 2D Gaussian kernel heatmap of the scene. Using the 2D elevation and Gaussian maps, the NBV pose is then calculated. The NBV pose is then converted back to a 6D pose that the robot moves to, capturing a new point cloud and registering it to the scene. The 2D-NBV method is compared to a baseline and a state-of-the-art method. The baseline method captures four different point clouds from pre-determined positions and registers them together. The state-of-the-art method finds a point of interest and declares a set of view candidates on a sphere around the point; ray casting is used to find the pose with the highest information gain, which is set as the NBV for the robot to move to. The goal of this thesis is that the method should perform better than the baseline method, described further in the method section. The evaluation metric used in this thesis is how well the different methods can estimate the bounding boxes of pre-trained items using an off-the-shelf semantic scene segmentation method. Six scenes with varying difficulty were constructed to test the methods. The results showed that the 2D-NBV method successfully complemented the scene with information about its empty cells. The 2D-NBV outperforms the state-of-the-art on occluded scenes and performed overall just as well as the baseline. That the 2D-NBV did not outperform the baseline is seen as a consequence of the information loss in going from 3D to 2D.
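A hedged sketch of the 2D pipeline as described: object positions from the segmentation network become a 2D Gaussian kernel heatmap, which is combined with the elevation map to score candidate cells. The abstract does not give the exact scoring rule that yields the NBV pose, so treating unknown (unobserved) cells with high object heat as the target is an assumption:

```python
import numpy as np

def gaussian_heatmap(shape, centers, sigma=2.0):
    # 2D Gaussian kernel heatmap around detected object cells.
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    heat = np.zeros(shape)
    for cy, cx in centers:
        heat += np.exp(-((ys - cy)**2 + (xs - cx)**2) / (2 * sigma**2))
    return heat

def next_best_view_cell(elevation, centers):
    # Score only unknown cells (NaN in the elevation map): the best cell is
    # unobserved but close to an object of interest.
    heat = gaussian_heatmap(elevation.shape, centers)
    scores = np.where(np.isnan(elevation), heat, -np.inf)
    return np.unravel_index(np.argmax(scores), elevation.shape)

elev = np.zeros((20, 20))
elev[8:12, 8:12] = np.nan              # occluded cells behind a detected object
print(next_best_view_cell(elev, centers=[(10, 10)]))  # a cell inside the gap
```

The chosen 2D cell would then be lifted back to a 6D camera pose for the robot, which is where the 3D-to-2D information loss noted in the results would come into play.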
17

Incorporating spatial relationship information in signal-to-text processing

Davis, Jeremy Elon 13 May 2022 (has links)
This dissertation outlines the development of a signal-to-text system that incorporates spatial relationship information to generate scene descriptions. Existing signal-to-text systems generate accurate descriptions with regard to the information contained in an image. However, to date, no signal-to-text system incorporates spatial relationship information. A survey of related work in the fields of object detection, signal-to-text, and spatial relationships in images is presented first. Three methodologies, each followed by an evaluation, were conducted in order to create the signal-to-text system: 1) generation of object localization results from a set of input images, 2) derivation of Level One Summaries from an input image, and 3) inference of Level Two Summaries from the derived Level One Summaries. Validation processes are described for the second and third evaluations, as the first has been previously validated in the related original works. The goal of this research is to show that a signal-to-text system that incorporates spatial information produces more informative descriptions of the content contained in an image. An additional goal is to demonstrate that the signal-to-text system can be easily applied to data sets other than those used to train the system and achieve similar results. To achieve this goal, a validation study was conducted and is presented to the reader.
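A minimal sketch of the kind of pairwise spatial relationship a Level One Summary might encode, derived from two detected bounding boxes; the relation vocabulary and box format here are illustrative assumptions, not the dissertation's definitions:

```python
def spatial_relation(box_a, box_b):
    # Boxes as (x_min, y_min, x_max, y_max) in image coordinates (y grows
    # downward). Compare box centers and keep the dominant axis.
    acx, acy = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bcx, bcy = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = bcx - acx, bcy - acy
    if abs(dx) >= abs(dy):
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"

person = (100, 50, 150, 200)   # hypothetical detection
car = (300, 120, 500, 220)     # hypothetical detection
print("car is", spatial_relation(person, car), "the person")
```

Relations like this, attached to the object localization results, are what would let a generated description say more than a bare list of detected objects.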
18

Collaborative Unmanned Air and Ground Vehicle Perception for Scene Understanding, Planning and GPS-denied Localization

Christie, Gordon A. 05 January 2017 (has links)
Autonomous robot missions in unknown environments are challenging. In many cases, the systems involved are unable to use a priori information about the scene (e.g. road maps). This is especially true in disaster response scenarios, where existing maps are often out of date. Areas without GPS are another concern, especially when the involved systems are tasked with navigating a path planned by a remote base station. Scene understanding via robots' perception data (e.g. images) can greatly assist in overcoming these challenges. This dissertation makes three contributions that help overcome them, with a focus on the application of autonomously searching for radiation sources with unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) in unknown and unstructured environments. The three main contributions of this dissertation are: (1) an approach to overcome the challenges associated with simultaneously trying to understand 2D and 3D information about the environment; (2) algorithms and experiments involving scene understanding for real-world autonomous search tasks, in which a UAV and a UGV search for potentially hazardous sources of radiation in an unknown environment; and (3) an approach to the registration of a UGV in areas without GPS using 2D image data and 3D data, where localization is performed in an overhead map generated from imagery captured in the air. / Ph. D.
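As a hedged illustration of contribution (3), the sketch below localizes a UGV's local top-down patch inside a UAV-built overhead map by exhaustive normalized cross-correlation; the dissertation's actual registration pipeline (features, 2D/3D fusion) is not reproduced here, and this brute-force matcher is only meant to convey the idea:

```python
import numpy as np

def localize_in_overhead_map(overhead, local_patch):
    # Slide the UGV's local top-down patch over the UAV-built overhead map
    # and return the offset with the highest normalized cross-correlation.
    H, W = overhead.shape
    h, w = local_patch.shape
    p = (local_patch - local_patch.mean()) / (local_patch.std() + 1e-8)
    best, best_pos = -np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            win = overhead[y:y + h, x:x + w]
            q = (win - win.mean()) / (win.std() + 1e-8)
            score = (p * q).mean()
            if score > best:
                best, best_pos = score, (y, x)
    return best_pos

rng = np.random.default_rng(2)
m = rng.random((40, 40))
print(localize_in_overhead_map(m, m[10:18, 22:30]))  # recovers (10, 22)
```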
19

Semantic Segmentation of Urban Scene Images Using Recurrent Neural Networks

Daliparthi, Venkata Satya Sai Ajay January 2020 (has links)
Background: In autonomous driving vehicles, the vehicle receives pixel-wise sensor data from RGB cameras, point-wise depth information from the cameras, and other sensor data as input. The computer inside the autonomous driving vehicle processes the input data and provides the desired output, such as steering angle, torque, and brake. To make accurate decisions, the computer inside the vehicle should be completely aware of its surroundings and understand each pixel in the driving scene. Semantic Segmentation is the task of assigning a class label (such as car, road, pedestrian, or sky) to each pixel in a given image. A better-performing Semantic Segmentation algorithm will therefore contribute to the advancement of the autonomous driving field. Research Gap: Traditional methods, such as handcrafted features and feature extraction methods, were mainly used to solve Semantic Segmentation. Since the rise of deep learning, most works use deep learning to deal with Semantic Segmentation. The most commonly used neural network architecture for Semantic Segmentation is the Convolutional Neural Network (CNN). Even though some works made use of Recurrent Neural Networks (RNNs), the effect of RNNs on Semantic Segmentation has not yet been thoroughly studied. Our study addresses this research gap. Idea: After going through the existing literature, we came up with the idea of “Using RNNs as an add-on module, to augment the skip-connections in Semantic Segmentation Networks through residual connections.” Objectives and Method: The main objective of our work is to improve Semantic Segmentation networks’ performance by using RNNs. An experiment was chosen as the methodology for our study. We propose three novel architectures, called UR-Net, UAR-Net, and DLR-Net, by applying our idea to the existing networks U-Net, Attention U-Net, and DeepLabV3+, respectively. Results and Findings: We empirically show that our proposed architectures improve the segmentation of edges and boundaries. Through our study, we found that there is a trade-off between using RNNs and the inference time of the model: if RNNs are used to improve the performance of Semantic Segmentation networks, some extra seconds must be traded off during inference. Conclusion: Our findings will not directly benefit the autonomous driving field, where real-time performance is needed, but they will contribute to the advancement of biomedical image segmentation, where doctors can trade those extra seconds of inference for better performance.
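The quoted idea, RNNs as an add-on module augmenting skip connections through residual connections, might look roughly like the sketch below, here with a GRU sweeping each row of a skip feature map. The specific RNN type, scan direction, and projection are assumptions rather than the thesis's exact UR-Net/UAR-Net/DLR-Net designs:

```python
import torch
import torch.nn as nn

class RNNSkip(nn.Module):
    """Augment a U-Net-style skip connection with an RNN, added back
    residually, so the skip path gains long-range spatial context."""
    def __init__(self, channels, hidden=None):
        super().__init__()
        hidden = hidden or channels
        self.rnn = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Conv2d(2 * hidden, channels, kernel_size=1)

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(B * H, W, C)    # rows as sequences
        out, _ = self.rnn(seq)                              # (B*H, W, 2*hidden)
        out = out.reshape(B, H, W, -1).permute(0, 3, 1, 2)  # back to (B, *, H, W)
        return x + self.proj(out)                           # residual connection

x = torch.randn(1, 32, 16, 16)
print(RNNSkip(32)(x).shape)  # torch.Size([1, 32, 16, 16])
```

The sequential sweep is also where the reported inference-time cost would come from: unlike a convolution, the GRU cannot process all positions in a row in parallel.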
20

Restoring the balance between stuff and things in scene understanding

Caesar, Holger January 2018 (has links)
Scene understanding is a central field in computer vision that attempts to detect objects in a scene and reason about their spatial, functional and semantic relations. While many works focus on things (objects with a well-defined shape), less attention has been given to stuff classes (amorphous background regions). However, stuff classes are important, as they help explain many aspects of an image, including the scene type, the thing classes likely to be present, and the physical attributes of all objects in the scene. The goal of this thesis is to restore the balance between stuff and things in scene understanding. In particular, we investigate how the recognition of stuff differs from that of things and develop methods suitable for dealing with both. We use stuff to find things and annotate a large-scale dataset to study stuff and things in context. First, we present two methods for semantic segmentation of stuff and things. Most methods require manual class weighting to counter imbalanced class-frequency distributions, particularly on datasets with stuff and thing classes. We develop a novel joint calibration technique that takes into account class imbalance, class competition and overlapping regions by calibrating for the pixel-level evaluation criterion. The second method shows how to unify the advantages of region-based approaches (accurately delineated object boundaries) and fully convolutional approaches (end-to-end training). Both are combined in a universal framework that is equally suitable for stuff and things. Second, we propose to help weakly supervised object localization for classes where location annotations are not available by transferring things and stuff knowledge from a source set with available annotations. This is particularly important if we want to scale scene understanding to real-world applications with thousands of classes, without having to exhaustively annotate millions of images. Finally, we present COCO-Stuff, the largest existing dataset with dense stuff and thing annotations. Existing datasets are much smaller and were made with expensive polygon-based annotation. We use a very efficient stuff annotation protocol to densely annotate 164K images. We provide a detailed analysis of this new dataset and visualize how stuff and things co-occur spatially in an image. We revisit the question of whether stuff or things are easier to detect, and which is more important, based on visual and linguistic analysis.
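A loose sketch of the class-imbalance side of joint calibration: per-class scores are rescaled before the per-pixel argmax so frequent stuff classes do not swamp rare thing classes. The thesis calibrates directly for the pixel-level evaluation criterion; the inverse-frequency heuristic below is only an assumed stand-in to convey the flavor:

```python
import numpy as np

def calibrated_prediction(scores, class_freq, alpha=0.5):
    # scores: (C, H, W) per-class scores; class_freq: (C,) training-set pixel
    # frequencies. Rescale by inverse frequency (softened by alpha) before
    # the per-pixel argmax so rare classes can win against frequent ones.
    weights = (1.0 / np.asarray(class_freq)) ** alpha
    weights /= weights.sum()
    return (scores * weights[:, None, None]).argmax(axis=0)

rng = np.random.default_rng(3)
scores = rng.random((4, 8, 8))
freq = [0.6, 0.3, 0.08, 0.02]   # e.g. road, building, person, bicycle
print(calibrated_prediction(scores, freq))
```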
