About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

A theory of scene understanding and object recognition.

Dillon, Craig January 1996 (has links)
This dissertation presents a new approach to image interpretation which can produce hierarchical descriptions of visually sensed scenes based on an incrementally learnt hierarchical knowledge base. Multiple segmentation and labelling hypotheses are generated with local constraint satisfaction being achieved through a hierarchical form of relaxation labelling. The traditionally unidirectional segmentation-matching process is recast into a dynamic closed-loop system where the current interpretation state is used to drive the lower level image processing functions. The theory presented in this dissertation is applied to a new object recognition and scene understanding system called Cite which is described in detail.
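As an illustration of the relaxation-labelling idea this abstract relies on, the sketch below implements a minimal, flat (non-hierarchical) relaxation-labelling update. The compatibility matrix, neighbourhood structure and example labels are illustrative assumptions, not the actual Cite system described in the dissertation.

```python
import numpy as np

def relaxation_labelling(p, r, neighbours, iters=20):
    """Iteratively refine per-node label probabilities using neighbour support.

    p          -- (n_nodes, n_labels) array of initial label probabilities
    r          -- (n_labels, n_labels) compatibility matrix with values in [-1, 1]
    neighbours -- list of neighbour-index lists, one per node
    """
    p = p.copy()
    for _ in range(iters):
        support = np.zeros_like(p)
        for i, nbrs in enumerate(neighbours):
            for j in nbrs:
                support[i] += r @ p[j]        # support from neighbour j's beliefs
            if nbrs:
                support[i] /= len(nbrs)
        p = p * (1.0 + support)               # classic multiplicative update
        p = np.clip(p, 1e-9, None)
        p = p / p.sum(axis=1, keepdims=True)  # renormalise to probabilities
    return p

# Toy example: two adjacent regions, labels {0: sky, 1: grass}, and a
# compatibility matrix that rewards neighbouring regions sharing a label.
p0 = np.array([[0.6, 0.4],
               [0.45, 0.55]])
r = np.array([[ 0.5, -0.5],
              [-0.5,  0.5]])
print(relaxation_labelling(p0, r, neighbours=[[1], [0]]))
```

In a hierarchical variant of this scheme, the support term would also include compatibilities between a node and its parent and child hypotheses, which is the flavour of constraint satisfaction the abstract describes.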
2

Coordination of vision and language in cross-modal referential processing

Coco, Moreno Ignazio January 2011 (has links)
This thesis investigates the mechanisms underlying the formation, maintenance, and sharing of reference in tasks in which language and vision interact. Previous research in psycholinguistics and visual cognition has provided insights into the formation of reference in cross-modal tasks. The conclusions reached are largely independent, with the focus on mechanisms pertaining to either linguistic or visual processing. In this thesis, we present a series of eye-tracking experiments that aim to unify these distinct strands of research by identifying and quantifying factors that underlie the cross-modal interaction between scene understanding and sentence processing. Our results show that both low-level (image-based) and high-level (object-based) visual information interacts actively with linguistic information during situated language processing tasks. In particular, during language understanding (Chapter 3), image-based information, i.e., saliency, is used to predict the upcoming arguments of the sentence, when the linguistic material alone is not sufficient to make such predictions. During language production (Chapter 4), visual attention has the active role of sourcing referential information for sentence encoding. We show that two important factors influencing this process are the visual density of the scene, i.e., clutter, and the animacy of the objects described. Both factors influence the type of linguistic encoding observed and the associated visual responses. We uncover a close relationship between linguistic descriptions and visual responses, triggered by the cross-modal interaction of scene and object properties, which implies a general mechanism of cross-modal referential coordination. Further investigation (Chapter 5) shows that visual attention and sentence processing are closely coordinated during sentence production: similar sentences are associated with similar scan patterns. This finding holds across different scenes, which suggests that coordination goes beyond the well-known scene-based effects guiding visual attention, again supporting the existence of a general mechanism for the cross-modal coordination of referential information. The extent to which cross-modal mechanisms are activated depends on the nature of the task performed. We compare the three tasks of visual search, object naming, and scene description (Chapter 6) and explore how the modulation of cross-modal reference is reflected in the visual responses of participants. Our results show that the cross-modal coordination required in naming and description triggers longer visual processing and higher scan pattern similarity than in search. This difference is due to the coordination required to integrate and organize visual and linguistic referential processing. Overall, this thesis unifies explanations of distinct cognitive processes (visual and linguistic) based on the principle of cross-modal referentiality, and provides a new framework for unraveling the mechanisms that allow scene understanding and sentence processing to share and integrate information during cross-modal processing.
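The notion of scan-pattern similarity used above can be made concrete with a small sketch. The normalised edit-distance measure and the example region labels below are illustrative assumptions, not necessarily the measure used in the thesis.

```python
def scan_pattern_similarity(seq_a, seq_b):
    """Similarity between two fixation sequences over labelled scene regions,
    based on normalised edit (Levenshtein) distance; one of many possible
    sequence measures, used here only for illustration."""
    m, n = len(seq_a), len(seq_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return 1.0 - d[m][n] / max(m, n, 1)

# Two hypothetical scan patterns over labelled scene regions
print(scan_pattern_similarity(["man", "dog", "leash", "man"],
                              ["man", "dog", "man", "park"]))
```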
3

On Fundamental Elements of Visual Navigation Systems

Siddiqui, Rafid January 2014 (has links)
Visual navigation is a ubiquitous yet complex task which is performed by many species for the purpose of survival. Although visual navigation is actively being studied within the robotics community, determining the elemental constituents of a robust visual navigation system remains a challenge. Motion estimation is often mistakenly considered the sole ingredient of a robust autonomous visual navigation system, and efforts are therefore concentrated on improving the accuracy of motion estimates. Yet other factors are just as important as motion, and their absence can make seamless visual navigation of the kind exhibited by humans impossible. A general model of a visual navigation system is therefore needed, one that describes it in terms of a set of elemental units. In this regard, a set of visual navigation elements (i.e. spatial memory, motion memory, scene geometry, context and scene semantics) is suggested as the building blocks of a visual navigation system in this thesis. A set of methods is proposed which investigate the existence and role of these elements in a visual navigation system, and a quantitative research methodology, in the form of a series of systematic experiments, is applied to them. The thesis formulates, implements and analyzes the proposed methods in the context of the visual navigation elements, which are arranged into three major groupings: a) spatial memory, b) motion memory, and c) Manhattan structure, context and scene semantics. The investigations are carried out on multiple image datasets obtained by robot-mounted cameras (2D/3D) moving in different environments. Spatial memory is investigated by evaluating the proposed place recognition methods. The recognized places and inter-place associations are then used to represent a visited set of places in the form of a topological map. Such a representation of places and their spatial associations models the concept of spatial memory; it resembles the human ability to represent and map places in large environments (e.g. cities). Motion memory in a visual navigation system is analyzed through a thorough investigation of various motion estimation methods. This leads to proposals of direct motion estimation methods which compute accurate motion estimates by basing the estimation process on dominant surfaces. In the everyday world, planar surfaces, especially ground planes, are ubiquitous, so the motion models are built upon this constraint. Manhattan structure provides geometrical cues which are helpful in solving navigation problems: a small set of geometric primitives (e.g. planes) makes up a typical indoor environment. A plane detection method is therefore proposed as a result of the investigations performed on scene structure; the method uses supervised learning to successfully classify the segmented clusters in 3D point-cloud datasets. In addition to geometry, the context of a scene also plays an important role in the robustness of a visual navigation system. The context in which navigation is performed imposes a set of constraints on objects and sections of the scene, and enforcing these constraints enables the observer to robustly segment the scene and classify the objects in it. A contextually aware scene segmentation method is proposed which classifies the image of a scene into a set of geometric classes; these geometric classes are sufficient for most navigation tasks.
However, in order to facilitate the cognitive visual decision-making process, the scene ought to be semantically segmented as well. The semantics of indoor scenes and of outdoor scenes are dealt with separately, and separate methods are proposed for visual mapping of each type of environment. An indoor scene consists of a corridor structure which is modeled as a cubic space in order to build a map of the environment, and a "flash-n-extend" strategy is proposed to control the map update frequency. The semantics of outdoor scenes are investigated through a proposed scene classification method, which employs a Markov Random Field (MRF) based classification framework to generate a set of semantic maps.
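As a concrete illustration of the spatial-memory element described in this abstract, the sketch below builds a topological map by matching place descriptors and linking consecutively visited places. The descriptor form and matching threshold are assumptions for illustration, not the thesis's actual place recognition methods.

```python
import numpy as np

class TopologicalMap:
    """Minimal sketch of 'spatial memory': recognised places become graph nodes,
    traversals between them become edges."""

    def __init__(self, match_threshold=0.3):
        self.places = []          # stored place descriptors (np arrays)
        self.edges = set()        # undirected links between place indices
        self.threshold = match_threshold

    def _match(self, descriptor):
        # index of the closest stored place, or None if nothing is close enough
        if not self.places:
            return None
        dists = [np.linalg.norm(descriptor - p) for p in self.places]
        best = int(np.argmin(dists))
        return best if dists[best] < self.threshold else None

    def observe(self, descriptor, previous_place):
        """Add an observation; return the index of the place the robot is now at."""
        idx = self._match(descriptor)
        if idx is None:                       # new place discovered
            self.places.append(descriptor)
            idx = len(self.places) - 1
        if previous_place is not None and previous_place != idx:
            self.edges.add(tuple(sorted((previous_place, idx))))
        return idx

# Feed a sequence of hypothetical image descriptors to build the map
tmap, current = TopologicalMap(), None
for desc in [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.05, 0.0])]:
    current = tmap.observe(desc, current)
print(len(tmap.places), tmap.edges)   # 2 places, edges {(0, 1)}
```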
4

Reasoning scene geometry from single images

Liu, Yixian January 2014 (has links)
Holistic scene understanding is one of the major goals of recent research in computer vision. Most popular recognition algorithms focus on semantic understanding and are incapable of providing the global depth information of the scene structure from the 2D projection of the world. Yet it is obvious that recovery of the scene surface layout could help many practical 3D-based applications, including 2D-to-3D movie re-production, robotic navigation, view synthesis, etc. Therefore, we identify scene geometric reasoning as the key problem of scene understanding. This PhD work contributes to the problem of reconstructing the 3D shape of scenes from monocular images. We propose an approach to recognise and reconstruct the geometric structure of the scene from a single image. We have investigated several typical scene geometries and built a few corresponding reference models in a hierarchical order for scene representation. The framework is set up based on the analysis of image statistical features and scene geometric features, and correlation is introduced to theoretically integrate these two types of features. Firstly, an image is categorized into one of the reference geometric models using spatial pattern classification. Then, we estimate the depth profile of the specific scene using a proposed algorithm for adaptive automatic scene reconstruction, which employs specifically developed reconstruction approaches for the different geometric models. The theory and algorithms are instantiated in a system for scene classification and visualization. The system is able to find the best-fit model for most of the images from several benchmark datasets. Our experiments show that uncalibrated low-quality monocular images can be efficiently and realistically reconstructed in simulated 3D space. With our approach, computers can interpret a single still image as its underlying geometry straightforwardly, avoiding the usual object occlusion, semantic overlap and deficiency problems.
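To illustrate the reference-model idea, the sketch below produces a coarse depth profile for one hypothetical reference geometry: a flat ground plane meeting a distant backdrop at the horizon row. The real system chooses among several reference models and applies model-specific reconstruction, so this is only a schematic stand-in.

```python
import numpy as np

def coarse_depth_profile(height, width, horizon_row, max_depth=50.0):
    """Coarse per-pixel depth for a 'ground plane + distant backdrop' model.

    Rows above the horizon get the backdrop depth; below the horizon, depth
    decreases linearly towards the camera at the bottom of the image.
    """
    depth = np.full((height, width), max_depth, dtype=float)
    for row in range(horizon_row, height):
        frac = (row - horizon_row + 1) / (height - horizon_row)
        depth[row, :] = max_depth * (1.0 - frac) + 1.0 * frac
    return depth

d = coarse_depth_profile(height=240, width=320, horizon_row=100)
print(d.shape, d[0, 0], d[239, 0])   # far backdrop at top, near ground at bottom
```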
5

Scene Understanding for Mobile Robots exploiting Deep Learning Techniques

Rangel, José Carlos 05 September 2017 (has links)
Robots are becoming more common in society every day. Consequently, they must have certain basic skills in order to interact with humans and the environment. One of these skills is the capacity to understand the places in which they are able to move. Computer vision is one of the means commonly used to achieve this purpose. Current technologies in this field offer outstanding solutions applied to improving data quality every day, therefore producing more accurate results in the analysis of an environment. With this in mind, the main goal of this research is to develop and validate an efficient object-based scene understanding method that will be able to help solve problems related to scene identification for mobile robotics. We seek to analyze state-of-the-art methods to find the most suitable one for our goals, as well as to select the kind of data most convenient for dealing with this issue. Another primary goal of the research is to determine the most suitable data input for analyzing scenes in order to find an accurate representation of the scenes by means of semantic labels or point-cloud feature descriptors. As a secondary goal we will show the benefits of using semantic descriptors generated with pre-trained models for mapping and scene classification problems, as well as the use of deep learning models in conjunction with 3D feature description procedures to build a 3D object classification model that is directly related to the representation goal of this work. The research described in this thesis was motivated by the need for a robust system capable of understanding the locations in which a robot usually interacts. At the same time, the advent of better computational resources has made it possible to implement some already defined techniques that demand high computational capacity and that offer a possible solution for dealing with scene understanding issues. One of these techniques is Convolutional Neural Networks (CNNs). These networks have the capacity to classify an image based on its visual appearance. They generate a list of lexical labels and a probability for each label, representing the likelihood of the presence of that object in the scene. The labels are derived from the training sets that the networks learned to recognize. We can therefore use this list of labels and probabilities as an efficient representation of the environment, assign a semantic category to the regions in which a mobile robot is able to navigate, and at the same time construct a semantic or topological map based on this semantic representation of the place. After analyzing the state of the art in scene understanding, we identified a set of approaches for developing a robust scene understanding procedure, and among them an almost unexplored gap in the topic of understanding scenes based on the objects present in them. Consequently, we propose an experimental study of this approach aimed at fully describing a scene in terms of the objects lying in it. As the scene understanding task involves object detection and annotation, one of the first steps is to determine the kind of data to use as input in our proposal. With this in mind, our proposal evaluates the use of 3D data. This kind of data suffers from the presence of noise, so we propose to use the Growing Neural Gas (GNG) algorithm to reduce the effect of noise on the object recognition procedure.
GNGs have the capacity to grow and adapt their topology to represent 2D information, producing a smaller representation that carries only a slight noise influence from the input data. Applied to 3D data, the GNG is a good approach for coping with noise. However, using 3D data poses a set of problems, such as the lack of a 3D object dataset with enough models to generalize methods and adapt them to real situations, as well as the fact that processing three-dimensional data is computationally expensive and requires huge storage space. These problems led us to explore new approaches for the object recognition task. Considering the outstanding results obtained by CNNs in the latest ImageNet challenge, we therefore propose to evaluate them as an object detection system. These networks were initially proposed in the 90s and are nowadays easily implementable thanks to hardware improvements in recent years. CNNs have shown satisfying results when tested on problems such as detection of objects, pedestrians and traffic signals, sound wave classification, and medical image processing, among others. Moreover, an added value of CNNs is the semantic description capability provided by the categories/labels that the network is able to identify, which can be read as a semantic explanation of the input image. Consequently, we propose to use these semantic labels as a scene descriptor for building a supervised scene classification model. We also propose using the semantic descriptors to generate topological maps and to test the description capabilities of lexical labels. In addition, semantic descriptors could be suitable for unsupervised place or environment labeling, so we propose using them to deal with this kind of problem in order to achieve a robust scene labeling method. Finally, to tackle the object recognition problem we propose an experimental study of unsupervised object labeling: the objects present in a point cloud are labeled using a lexical labeling tool and then used as the training instances of a classifier that mixes their 3D features with the labels assigned by the external tool.
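A minimal sketch of the central idea, i.e. using a pretrained CNN's label probabilities as a scene descriptor for a supervised scene classifier, is given below. The specific network (ResNet-18), classifier (a linear SVM) and file names are illustrative assumptions, not the pipeline actually used in the thesis.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.svm import SVC

# Any ImageNet-pretrained CNN works for the sketch; the thesis's exact networks
# and label sets may differ.
cnn = models.resnet18(weights="IMAGENET1K_V1").eval()
preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def semantic_descriptor(image_path):
    """Probability vector over the CNN's object labels, used as the scene descriptor."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return torch.softmax(cnn(img), dim=1).squeeze(0).numpy()

# Supervised scene classification on top of the label-probability descriptors.
# 'train_images' and 'train_rooms' are hypothetical paths and room categories.
train_images = ["kitchen_01.jpg", "corridor_01.jpg", "office_01.jpg"]
train_rooms = ["kitchen", "corridor", "office"]
X = [semantic_descriptor(p) for p in train_images]
clf = SVC(kernel="linear").fit(X, train_rooms)
print(clf.predict([semantic_descriptor("unknown_view.jpg")]))
```

The same descriptors can be clustered instead of classified, which is the unsupervised place-labeling variant the abstract mentions.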
6

Detecting Behavioral Zones in Local and Global Camera Views

Nedrich, Matthew 22 July 2011 (has links)
No description available.
7

3D Object Detection from Images

Simonelli, Andrea 28 September 2022 (has links)
Remarkable advancements in the fields of Computer Vision, Artificial Intelligence and Machine Learning have led to unprecedented breakthroughs in what machines are able to achieve. In many tasks, such as Image Classification, they are now capable of even surpassing human performance. While this is truly outstanding, there are still many tasks in which machines lag far behind: walking in a room, driving on a highway, or grabbing some food, for example. These are all actions that feel natural to us but can be quite unfeasible for machines. Such actions require identifying and localizing objects in the environment, effectively building a robust understanding of the scene. Humans easily gain this understanding thanks to their binocular vision, which provides a high-resolution and continuous stream of information to the brain, which efficiently processes it. Unfortunately, things are much different for machines. With cameras instead of eyes and artificial neural networks instead of a brain, gaining this understanding is still an open problem. In this thesis we do not focus on solving this problem as a whole, but instead delve into a very relevant part of it: how to make machines able to identify and precisely localize objects in 3D space by relying only on visual input, i.e. 3D Object Detection from Images. One of the most complex aspects of image-based 3D Object Detection is that it inherently requires the solution of many different sub-tasks, e.g. the estimation of an object's distance and its rotation. A first contribution of this thesis is an analysis of how these sub-tasks are usually learned, highlighting a destructive behavior which limits the overall performance, and the proposal of an alternative learning method that avoids it. A second contribution is the discovery of a flaw in the computation of the metric widely used in the field, which required the re-computation of the performance of all published methods, and the introduction of a novel un-flawed metric which has now become the official one. A third contribution focuses on one particular sub-task, the estimation of an object's distance, which is demonstrated to be the most challenging. Thanks to the introduction of a novel approach which normalizes the appearance of objects with respect to their distance, detection performance can be greatly improved. A last contribution of the thesis is the critical analysis of the recently proposed Pseudo-LiDAR methods: two flaws in their training protocol are identified and analyzed, and on top of this a novel method able to achieve state-of-the-art results in image-based 3D Object Detection is developed.
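One possible reading of the distance-normalization contribution is sketched below: an object crop is rescaled so that it appears as if observed at a fixed reference distance. The scaling rule, reference distance and use of OpenCV are assumptions made purely for illustration, not the method proposed in the thesis.

```python
import numpy as np
import cv2

def normalize_crop_to_reference_distance(crop, est_distance, ref_distance=20.0):
    """Rescale an object crop so its apparent size matches a fixed reference distance.

    Farther objects (small in the image) are enlarged, nearer ones shrunk, so that
    downstream appearance cues become less entangled with distance.
    """
    scale = est_distance / ref_distance
    h, w = crop.shape[:2]
    new_size = (max(1, int(round(w * scale))), max(1, int(round(h * scale))))
    return cv2.resize(crop, new_size, interpolation=cv2.INTER_LINEAR)

# A toy 32x48 crop of an object estimated to be 40 m away
toy_crop = np.zeros((32, 48, 3), dtype=np.uint8)
normalized = normalize_crop_to_reference_distance(toy_crop, est_distance=40.0)
print(normalized.shape)   # roughly twice the original resolution
```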
8

DEEP LEARNING MODELS FOR IMAGE-BASED DISEASE CLASSIFICATION AND ASSISTIVE TECHNOLOGY RELATED TO ALZHEIMER’S DISEASE

Ke Xu (7023074) 16 August 2019 (has links)
Alzheimer's disease (AD) is a devastating neurodegenerative disorder that destroys the patient's ability to perform daily living tasks and eventually takes their life. Currently, 5.8 million people in North America suffer from AD, a number projected to reach 13.8 million by the year 2050. For many years, researchers have been dedicated to performing automated diagnosis based on neuroimaging. There are critical needs in two aspects of AD: 1) computer-based AD classification with MRI images; 2) computer-based tools/systems to enhance AD patients' quality of life. We address these two gaps via two specific objectives in this study.

For objective 1, the task is to develop a machine-learning based intelligent model for classification of AD conditions (Normal Control [NC], Mild Cognitive Impairment [MCI], Alzheimer's disease [AD]) based on MRI images. Specifically, four different deep learning models were developed and assessed. The overall average accuracy for AD classification is 81.5%, provided by the Multi-Layer-Output model.

For objective 2, a deep learning model was developed and evaluated to recognize three specific types of indoor scenes (bedroom, living room and dining room). An accuracy of 97% was obtained.

This study showed the potential of deep learning models for two different aspects of AD: disease classification and intelligent model-based assistive devices for AD patients. Further research and development activities are recommended to validate these findings on larger and different datasets.
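A hedged sketch of how the indoor-scene objective (objective 2) could be set up with transfer learning is shown below. The three classes come from the abstract, but the backbone, optimizer and training loop are illustrative assumptions, not the models actually evaluated in the study.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Three indoor classes relevant to the assistive-technology objective
CLASSES = ["bedroom", "living_room", "dining_room"]

# Start from an ImageNet-pretrained backbone and replace the final layer.
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, len(CLASSES))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One optimisation step; 'images' is a (B, 3, 224, 224) tensor and
    'labels' holds class indices into CLASSES."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Smoke test with random tensors standing in for real scene photos
dummy_images = torch.randn(4, 3, 224, 224)
dummy_labels = torch.tensor([0, 1, 2, 1])
print(train_step(dummy_images, dummy_labels))
```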
9

Learning Statistical Features of Scene Images

Lee, Wooyoung 01 September 2014 (has links)
Scene perception is a fundamental aspect of vision. Humans are capable of analyzing behaviorally-relevant scene properties such as spatial layouts or scene categories very quickly, even from low resolution versions of scenes. Although humans perform these tasks effortlessly, they are very challenging for machines. Developing methods that faithfully capture the properties of the representation used by the visual system will be useful for building computational models that are more consistent with perception. While it is common to use hand-engineered features that extract information from predefined dimensions, such features require careful tuning of parameters and do not generalize well to other tasks or larger datasets. This thesis is driven by the hypothesis that perceptual representations are adapted to the statistical properties of natural visual scenes. To develop statistical features for global-scale structures (low spatial frequency information that encompasses entire scenes), I propose to train hierarchical probabilistic models on whole scene images. I first investigate statistical clusters of scene images by training a mixture model under the assumption that each image can be decoded by sparse and independent coefficients. Each cluster discovered by the unsupervised classifier is consistent with high-level semantic categories (such as indoor, outdoor-natural and outdoor-manmade) as well as perceptual layout properties (mean depth, openness and perspective). To address the limitation of mixture models, namely their assumption of a discrete number of underlying clusters, I further investigate a continuous representation for the distributions of whole scenes. The model parameters optimized for natural visual scenes reveal a compact representation that encodes their global-scale structures. I develop a probabilistic similarity measure based on the model and demonstrate its consistency with perceptual similarities. Lastly, to learn representations that better encode the manifold structures of general high-dimensional image space, I develop an image normalization process that finds a set of canonical images which anchor the probabilistic distributions around the real data manifolds. The canonical images are employed as the centers of conditional multivariate Gaussian distributions. This approach allows learning more detailed structures of the local manifolds, resulting in an improved representation of the high-level properties of scene images.
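The probabilistic similarity measure mentioned above can be illustrated with a much simpler stand-in: a single multivariate Gaussian fitted to whole-scene feature vectors, with similarity defined through its covariance structure. The thesis's hierarchical models are considerably richer, so treat this as a schematic only; the feature dimensionality and toy data are assumptions.

```python
import numpy as np

def fit_scene_gaussian(features):
    """Fit a multivariate Gaussian to whole-scene feature vectors (one per row).
    A stand-in for the learned probabilistic scene models in the thesis."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mean, np.linalg.inv(cov)

def probabilistic_similarity(x, y, cov_inv):
    """Higher when two scenes are close under the model's covariance structure
    (negative squared Mahalanobis distance)."""
    diff = x - y
    return -float(diff @ cov_inv @ diff)

# Toy example: 100 'scenes' described by 8-dimensional low-frequency features
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 8))
_, cov_inv = fit_scene_gaussian(feats)
print(probabilistic_similarity(feats[0], feats[1], cov_inv))
```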
10

Top-Down Bayesian Modeling and Inference for Indoor Scenes

Del Pero, Luca January 2013 (has links)
People can understand the content of an image without effort. We can easily identify the objects in it, and figure out where they are in the 3D world. Automating these abilities is critical for many applications, such as robotics, autonomous driving and surveillance. Unfortunately, despite recent advancements, fully automated vision systems for image understanding do not exist. In this work, we present progress restricted to the domain of images of indoor scenes, such as bedrooms and kitchens. These environments typically have the "Manhattan" property that most surfaces are parallel to three principal ones. Further, the 3D geometry of a room and the objects within it can be approximated with simple geometric primitives, such as 3D blocks. Our goal is to reconstruct the 3D geometry of an indoor environment while also understanding its semantic meaning, by identifying the objects in the scene, such as beds and couches. We separately model the 3D geometry, the camera, and an image likelihood, to provide a generative statistical model for image data. Our representation captures the rich structure of an indoor scene by explicitly modeling the contextual relationships among its elements, such as the typical size of objects and their arrangement in the room, and simple physical constraints, such as the fact that 3D objects do not intersect. This ensures that the predicted image interpretation will be globally coherent geometrically and semantically, which allows tackling the ambiguities caused by projecting a 3D scene onto an image, such as occlusions and foreshortening. We fit this model to images using MCMC sampling. Our inference method combines bottom-up evidence from the data and top-down knowledge from the 3D world in order to explore the vast output space efficiently. Comprehensive evaluation confirms our intuition that global inference of the entire scene is more effective than estimating its individual elements independently. Further, our experiments show that our approach is competitive with, and often exceeds, the results of state-of-the-art methods.
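The MCMC fitting step can be illustrated with a generic Metropolis-Hastings loop over a scene-hypothesis vector. The toy one-parameter "room depth" posterior below is an assumption for demonstration; the thesis's sampler additionally uses data-driven and model-specific moves over a much richer scene representation.

```python
import numpy as np

def metropolis_hastings(log_posterior, init, n_samples=5000, step=0.1, seed=0):
    """Generic Metropolis-Hastings sampler over a real-valued hypothesis vector.
    Only the basic accept/reject loop is shown here."""
    rng = np.random.default_rng(seed)
    current, current_lp = np.asarray(init, dtype=float), log_posterior(init)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.normal(scale=step, size=current.shape)
        proposal_lp = log_posterior(proposal)
        if np.log(rng.uniform()) < proposal_lp - current_lp:   # accept?
            current, current_lp = proposal, proposal_lp
        samples.append(current.copy())
    return np.array(samples)

# Toy 'scene model': one parameter (room depth), Gaussian prior and likelihood
def log_posterior(theta):
    prior = -0.5 * ((theta[0] - 4.0) / 2.0) ** 2          # rooms are roughly 4 m deep
    likelihood = -0.5 * ((theta[0] - 5.2) / 0.5) ** 2     # image evidence suggests 5.2 m
    return float(prior + likelihood)

samples = metropolis_hastings(log_posterior, init=[3.0])
print(samples[1000:].mean())   # posterior mean blends prior and evidence
```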
