
Multi-Template Temporal Siamese Network for Visual Object Tracking

Visual object tracking is the task of assigning a unique ID to an object in a video, determining whether the object is present in the current frame and, if it is, precisely localizing it. Object tracking faces numerous challenges, such as changes in illumination, partial or full occlusion, changes in target appearance, blurring caused by camera movement, the presence of objects similar to the target, and changes in video image quality over time. Because of these challenges, traditional computer vision techniques cannot deliver high-quality tracking, especially long-term tracking. Almost all state-of-the-art object tracking methods now use artificial intelligence, and more specifically Convolutional Neural Networks. In this work, we present a Siamese-based tracker that differs from previous work in two ways. First, most Siamese-based trackers take the target in the first frame as the ground truth. Despite the success of such methods in previous years, this does not guarantee robust tracking, because a single template cannot handle many of the challenges that change the target's appearance, such as blurring caused by camera movement, occlusion, and pose variation. In this work, while keeping the first frame as a template, we add five additional templates that are dynamically updated and replaced based on the target classification score in different frames. We call this set a bag of dynamic templates; diversity, similarity and recency are the criteria for choosing its members. Second, many Siamese-based trackers are vulnerable to mistakenly tracking a similar-looking object instead of the intended target. Many researchers have proposed computationally expensive approaches, such as tracking all the distractors together with the given target and discriminating between them in every frame. In this work, we propose to handle this issue by estimating the target's position in the next frame from its bounding box coordinates in previous frames. We use a temporal network that takes the history of several previous frames, measure the classification scores of candidates against the templates in the bag of dynamic templates, and use a tracker sequential confidence value that reflects how confident the tracker has been in previous frames. We call this component the robustifier; it prevents the tracker from continuously switching between the target and possible distractors. Extensive experiments on the OTB 50, OTB 100 and UAV20L datasets demonstrate the superiority of our work over state-of-the-art methods.
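
As an illustration of the bag of dynamic templates described above, the following is a minimal Python sketch, not the thesis implementation. The names `Template` and `TemplateBag`, the `min_score` threshold of 0.8 and the replace-the-oldest rule are all assumptions made for this example. The sketch covers only the classification score and recency criteria; the diversity and similarity criteria mentioned in the abstract are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """Hypothetical record for one template crop."""
    frame_index: int   # frame the template was taken from
    score: float       # target classification score in that frame


@dataclass
class TemplateBag:
    """Fixed first-frame template plus up to `capacity` dynamic ones."""
    first: Template
    capacity: int = 5          # five additional templates, as in the abstract
    min_score: float = 0.8     # assumed acceptance threshold
    dynamic: list = field(default_factory=list)

    def update(self, candidate: Template) -> bool:
        """Try to add a candidate; return True if the bag changed."""
        if candidate.score < self.min_score:
            return False  # low-confidence frames never enter the bag
        if len(self.dynamic) < self.capacity:
            self.dynamic.append(candidate)
            return True
        # Assumed rule: replace the oldest dynamic template (recency only).
        oldest = min(range(len(self.dynamic)),
                     key=lambda i: self.dynamic[i].frame_index)
        self.dynamic[oldest] = candidate
        return True

    def templates(self) -> list:
        """All templates to match against; the first frame is always kept."""
        return [self.first, *self.dynamic]
```

In this sketch the first-frame template is never replaced, so the bag always holds between one and six templates.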

Identifier: oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/44461
Date: 04 January 2023
Creators: Sekhavati, Ali
Contributors: Lee, Wonsook
Publisher: Université d'Ottawa / University of Ottawa
Source Sets: Université d’Ottawa
Language: English
Detected Language: English
Type: Thesis
Format: application/pdf
Rights: Attribution 4.0 International, http://creativecommons.org/licenses/by/4.0/
