Learning Language-vision Correspondences

Given an unstructured collection of captioned images of cluttered scenes featuring a variety of objects, our goal is to simultaneously learn the names and appearances of the objects. Only a small fraction of local features within any given image are associated with a particular caption word, and captions may contain irrelevant words not associated with any image object. We propose a novel algorithm that uses the repetition of feature neighborhoods across training images and a measure of correspondence with caption words to learn meaningful feature configurations (representing named objects). We also introduce a graph-based appearance model that captures some of the structure of an object by encoding the spatial relationships among the local visual features. In an iterative procedure we use language (the words) to drive a perceptual grouping process that assembles an appearance model for a named object. We also exploit co-occurrences among appearance models to learn hierarchical appearance models. Results of applying our method to three data sets in a variety of conditions demonstrate that from complex, cluttered, real-world scenes with noisy captions, we can learn both the names and appearances of objects, resulting in a set of models invariant to translation, scale, orientation, occlusion, and minor changes in viewpoint or articulation. These named models, in turn, are used to automatically annotate new, uncaptioned images, thereby facilitating keyword-based image retrieval.

http://hdl.handle.net/1807/26192

image annotation

object recognition

0984

Identifier	oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OTU.1807/26192
Date	15 February 2011
Creators	Jamieson, Michael
Contributors	Dickinson, Sven; Stevenson, Suzanne
Source Sets	Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada
Language	en_ca
Detected Language	English
Type	Thesis
