
Geometric context from single and multiple views

For computers to interact with and understand the visual world, they must be equipped with reasoning systems that include high-level quantities such as objects, actions, and scenes. This thesis is concerned with extracting such representations of the world from visual input. The first part of this thesis describes an approach to scene understanding in which texture characteristics of the visual world are used to infer scene categories. We show that, in the context of a moving camera, it is common to observe images containing very few individually salient image regions, yet the overall texture structure often allows our system to derive powerful contextual cues about the environment. Our approach builds on ideas from texture recognition, and we show that our algorithm outperforms the well-known Gist descriptor on several classification tasks.

In the second part of this thesis we are interested in scene understanding in the context of multiple calibrated views of a scene, as might be obtained from a Structure-from-Motion or Simultaneous Localization and Mapping (SLAM) system. Though such systems are capable of localizing the camera robustly and efficiently, the maps produced are typically sparse point clouds that are difficult to interpret and of little use for higher-level reasoning tasks such as scene understanding or human-machine interaction. In this thesis we begin to address this deficiency, presenting progress towards modeling scenes using semantically meaningful primitives such as floor, wall, and ceiling planes. To this end we adopt the indoor Manhattan representation, which was recently proposed for single-view reconstruction. This thesis presents the first in-depth description and analysis of this model in the literature. We describe a probabilistic model relating photometric features, stereo photo-consistencies, and 3D point clouds to Manhattan scene structure in a Bayesian framework. We then present a fast dynamic programming algorithm that performs exact MAP inference in this model in time linear in image size. We show detailed comparisons with the state of the art in both the single- and multiple-view contexts.

Finally, we present a framework for learning within the indoor Manhattan hypothesis class. Our system is capable of extrapolating from labelled training examples to predict scene structure for unseen images. We cast learning as a structured prediction problem and show how to optimize with respect to two realistic loss functions. We present experiments in which we learn to recover scene structure from both single and multiple views; from the perspective of our learning algorithm, these problems differ only by a change of feature space. This work addresses one of the most complicated output spaces (in terms of internal constraints) yet considered within a structured prediction framework.

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:644877
Date: January 2012
Creators: Flint, Alexander John
Contributors: Reid, Ian ; Murray, David
Publisher: University of Oxford
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
Source: http://ora.ox.ac.uk/objects/uuid:f6c11e50-c059-4254-9dfc-5cbd2ee8147f
