
Tell me what to track: visual object tracking and retrieval by natural language descriptions

Natural Language (NL) descriptions can be one of the most convenient ways to initialize a visual tracker. NL descriptions can also provide information that remains invariant over longer time spans, helping the tracker cope with typical visual tracking challenges such as occlusion and motion blur. However, deriving a formulation that combines the strengths of appearance-based tracking with the NL modality is not straightforward. In this thesis, we use deep neural networks to learn a joint representation of language and vision that can perform various tasks, such as visual tracking by NL, tracked-object retrieval by NL, and spatio-temporal video grounding by NL.

First, we study the task of Single Object Tracking (SOT) by NL descriptions, which requires spatially localizing a target object in a video sequence. We propose two novel approaches. The first is a tracking-by-detection approach, which performs object detection in the video sequence via similarity matching between the pooled visual representations of potential objects and the NL descriptions. The second approach uses a novel Siamese Natural Language Region Proposal Network with a depth-wise cross correlation operation to replace the visual template with a language template in Siamese trackers such as SiamFC and SiamRPN++, and achieves state-of-the-art results on standard benchmarks for single object tracking by NL.
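
As an illustration of the depth-wise cross correlation operation named above, the following is a minimal sketch in plain Python, not the network described in the thesis. It assumes the search-region features are a nested list of shape [C][H][W] and that the language template is a length-C sentence embedding used as one 1x1 kernel per channel; the function names `depthwise_xcorr` and `language_kernel` and that 1x1 kernel layout are hypothetical choices made only for this example.

```python
def depthwise_xcorr(search, kernel):
    """Correlate each channel of `search` with the matching channel of `kernel`.

    search: nested list [C][H][W]; kernel: nested list [C][kh][kw].
    Returns a response map [C][H-kh+1][W-kw+1] (valid correlation, stride 1).
    """
    if len(search) != len(kernel):
        raise ValueError("search and kernel must have the same channel count")
    response = []
    for s_chan, k_chan in zip(search, kernel):
        height, width = len(s_chan), len(s_chan[0])
        kh, kw = len(k_chan), len(k_chan[0])
        chan_out = []
        for i in range(height - kh + 1):
            row = []
            for j in range(width - kw + 1):
                # Sum of element-wise products over the kernel window.
                total = 0.0
                for di in range(kh):
                    for dj in range(kw):
                        total += s_chan[i + di][j + dj] * k_chan[di][dj]
                row.append(total)
            chan_out.append(row)
        response.append(chan_out)
    return response


def language_kernel(embedding):
    """Assumption for this sketch: a length-C sentence embedding becomes
    C kernels of size 1x1, one per visual feature channel."""
    return [[[float(v)]] for v in embedding]


if __name__ == "__main__":
    search_features = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]  # C=2, H=W=2
    template = language_kernel([2.0, -1.0])
    print(depthwise_xcorr(search_features, template))
```

Because each channel is correlated only with its own kernel, the output keeps the channel dimension. In Siamese trackers a downstream head typically turns such a response map into classification and box-regression outputs; that part is omitted here.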

Second, based on experimental results and findings from the SOT by NL task, we propose the Tracked-object Retrieval by NL (TRNL) descriptions task and collect the CityFlow-NL Benchmark for it. CityFlow-NL contains more than 6,500 precise NL descriptions of tracked vehicle targets, making it the first densely annotated dataset of tracked objects paired with NL descriptions. To highlight the novelty of our dataset, we propose two models for the retrieval by NL task: a single-stream model based on cross-modality similarity matching, and a quad-stream retrieval model that models the similarity between language features and visual features, including local visual features, frame-level features, motions, and relationships between visually similar targets. We release the CityFlow-NL Benchmark together with our models as challenges in the 5th and the 6th AI City Challenge.
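
To make the idea of retrieval by cross-modality similarity matching concrete, here is a minimal sketch, assuming that a language encoder and a visual encoder (neither shown) have already mapped the query and each tracked object into the same embedding space. The function names, the track IDs, and the choice of cosine similarity as the score are assumptions for illustration and are not taken from the thesis.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tracks(query_embedding, track_embeddings):
    """Return track IDs sorted from most to least similar to the query.

    track_embeddings: dict mapping a (hypothetical) track ID to its embedding.
    """
    scores = {
        track_id: cosine_similarity(query_embedding, emb)
        for track_id, emb in track_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    query = [1.0, 0.0]  # stand-in for an encoded NL description
    tracks = {"a": [0.0, 1.0], "b": [2.0, 0.0], "c": [1.0, 1.0]}
    print(rank_tracks(query, tracks))
```

The ranked list is the natural output of a retrieval task: a model is judged by how high the ground-truth track appears for each query description.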

Lastly, we focus on the most challenging yet practical task of Spatio-Temporal Video Grounding (STVG), which aims to spatially and temporally localize a target in videos with NL descriptions. We propose new evaluation protocols for the STVG task to adapt to the new challenges of CityFlow-NL that are not well-represented in prior STVG benchmarks. We propose and study three intuitive and novel approaches to the STVG task in this thesis: a Multi-Object Tracking (MOT) + Retrieval by NL approach, an approach based on Single Object Tracking (SOT) by NL, and a direct localization approach that uses a transformer network to learn a joint representation from both the NL and vision modalities.

Identifier: oai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/45236
Date: 05 October 2022
Creators: Feng, Qi
Contributors: Sclaroff, Stanley
Source Sets: Boston University
Language: en_US
Detected Language: English
Type: Thesis/Dissertation
Rights: Attribution 4.0 International, http://creativecommons.org/licenses/by/4.0/