
Tell me what to track: visual object tracking and retrieval by natural language descriptions

Feng, Qi, 05 October 2022
Natural Language (NL) descriptions can be one of the most convenient ways to initialize a visual tracker. NL descriptions can also provide information about longer-term invariances, helping the tracker cope with typical visual tracking challenges such as occlusion and motion blur. However, deriving a formulation that combines the strengths of appearance-based tracking with the NL modality is not straightforward. In this thesis, we use deep neural networks to learn a joint representation of language and vision that can perform various tasks, such as visual tracking by NL, tracked-object retrieval by NL, and spatio-temporal video grounding by NL. First, we study the Single Object Tracking (SOT) by NL descriptions task, which requires spatial localization of a target object in a video sequence. We propose two novel approaches. The first is a tracking-by-detection approach, which performs object detection in the video sequence via similarity matching between the pooled visual representations of candidate objects and the NL descriptions. The second approach uses a novel Siamese Natural Language Region Proposal Network with a depth-wise cross-correlation operation to replace the visual template with a language template in Siamese trackers, e.g. SiamFC and SiamRPN++, and achieves state-of-the-art performance on standard single object tracking by NL benchmarks. Second, based on experimental results and findings from the SOT by NL task, we propose the Tracked-object Retrieval by NL (TRNL) descriptions task and collect the CityFlow-NL Benchmark for it. CityFlow-NL contains more than 6,500 precise NL descriptions of tracked vehicle targets, making it the first densely annotated dataset of tracked objects paired with NL descriptions.
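The depth-wise cross-correlation used in the Siamese formulation above can be sketched as follows. This is an illustrative NumPy implementation with toy shapes, not code from the thesis; the function and array names are hypothetical, and in the actual trackers the template would come from an encoded NL description rather than random numbers.

```python
import numpy as np

def depthwise_xcorr(template, search):
    """Depth-wise cross-correlation: each channel of the template is
    correlated only with the matching channel of the search features,
    producing one response map per channel.

    template: (C, th, tw) -- e.g. a language-derived template
    search:   (C, sh, sw) -- visual features of the search region
    returns:  (C, sh - th + 1, sw - tw + 1) response maps
    """
    c, th, tw = template.shape
    _, sh, sw = search.shape
    out = np.empty((c, sh - th + 1, sw - tw + 1))
    for ch in range(c):
        for i in range(sh - th + 1):
            for j in range(sw - tw + 1):
                out[ch, i, j] = np.sum(
                    search[ch, i:i + th, j:j + tw] * template[ch]
                )
    return out

# Toy example: 2 channels, a 4x4 search region, and a 2x2 template.
rng = np.random.default_rng(0)
t = rng.standard_normal((2, 2, 2))
s = rng.standard_normal((2, 4, 4))
resp = depthwise_xcorr(t, s)
print(resp.shape)  # (2, 3, 3)
```

In practice this operation is expressed as a grouped convolution on the GPU; the per-channel response maps are then fed to the region proposal head to predict boxes.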
To highlight the novelty of our dataset, we propose two models for the retrieval by NL task: a single-stream model based on cross-modality similarity matching, and a quad-stream retrieval model that measures the similarity between language features and visual features, including local visual features, frame-level features, motion, and relationships between visually similar targets. We released the CityFlow-NL Benchmark together with our models as challenge tracks in the 5th and 6th AI City Challenges. Lastly, we focus on the most challenging yet practical task of Spatio-Temporal Video Grounding (STVG), which aims to localize a target both spatially and temporally in videos with NL descriptions. We propose new evaluation protocols for the STVG task to address the challenges of CityFlow-NL that are not well represented in prior STVG benchmarks. Three intuitive and novel approaches to the STVG task are proposed and studied in this thesis: a Multi-Object Tracking (MOT) plus retrieval-by-NL approach, a Single Object Tracking (SOT) by NL approach, and a direct localization approach that uses a transformer network to learn a joint representation from the NL and vision modalities.
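The cross-modality similarity matching underlying the single-stream retrieval model can be illustrated with a minimal sketch: given an embedding of the NL query and one embedding per tracked target, rank the targets by cosine similarity. The function names and toy embeddings below are hypothetical placeholders, not the thesis's actual model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_by_nl(query_emb, track_embs):
    """Rank tracked-object embeddings against an NL query embedding;
    returns track indices ordered best match first."""
    scores = [cosine_similarity(query_emb, t) for t in track_embs]
    return sorted(range(len(track_embs)), key=lambda i: -scores[i])

# Toy example: a 2-D query embedding against three candidate tracks.
query = np.array([1.0, 0.0])
tracks = [np.array([0.9, 0.1]),   # nearly aligned with the query
          np.array([0.0, 1.0]),   # orthogonal to the query
          np.array([-1.0, 0.0])]  # opposite direction
print(retrieve_by_nl(query, tracks))  # [0, 1, 2]
```

In a real system both embeddings would be produced by learned encoders (e.g. a text encoder for the description and a visual encoder over the track's crops) trained so that matching pairs score highest.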
