With the growing demand for intelligent video surveillance systems, person re-identification (re-ID), which aims to match person images across non-overlapping camera views, plays an important role in intelligent video analysis and has gained increasing attention in the computer vision community. With advanced deep neural networks, existing methods have achieved promising performance on the widely used re-ID benchmarks, even outperforming human-level rank-1 matching accuracy. However, most research efforts are conducted in closed-world settings, with large-scale, well-annotated training data and with all person images drawn from the same visible modality. Although re-ID is a prerequisite for practical video surveillance applications, there is still a large gap between the closed-world, research-oriented setting and practical open-world settings. In this thesis, we try to narrow the gap by studying three important issues in open-world person re-identification: 1) unsupervised learning with large-scale unlabelled training data; 2) learning a robust re-ID model from label-corrupted training data; and 3) cross-modality visible-thermal person re-identification with multi-modality data.

For unsupervised learning with unlabelled training data, we mainly focus on video-based person re-identification, since video data is usually easy to obtain with tracking algorithms and each video sequence provides rich weakly labelled samples under the assumption that the image frames within a tracked sequence belong to the same person identity. Following the cross-camera label estimation approach, we formulate cross-camera label estimation as a one-to-one graph matching problem and propose a novel dynamic graph matching framework to estimate cross-camera labels. However, in a practical wild scenario, the unlabelled training data usually cannot satisfy the one-to-one matching constraint, which results in a large proportion of false positives.
To address this issue, we further propose a novel robust anchor embedding method for unsupervised video re-ID. In the proposed method, some anchor sequences are first selected to initialize the CNN feature representation. A robust anchor embedding method is then proposed to measure the relationship between the unlabelled sequences and the anchor sequences, taking both scalability and efficiency into account. After that, a top-$k$ counts label prediction strategy is proposed to predict the labels of the unlabelled sequences. With the newly estimated labels, the CNN representation can be further updated.

For robust re-ID model learning with label-corrupted training data, we propose a two-stage learning method to handle the label noise. Rather than simply filtering out the falsely annotated samples, we propose a joint learning method that simultaneously refines the falsely annotated labels and optimizes the neural network. To address the limited number of training samples for each identity, we further propose a novel hard-aware instance re-weighting strategy to fine-tune the learned model, which assigns larger weights to hard samples with correct labels.

Cross-modality visible-thermal person re-identification addresses an important issue in night-time surveillance applications by matching person images across different modalities. We propose a dual-path network that learns multi-modality sharable feature representations by simultaneously considering the modality discrepancy and commonness. To guide the feature representation learning process, we propose a dual-constrained top-ranking loss, which contains both cross-modality and intra-modality top-ranking constraints to reduce the large cross-modality and intra-modality variations.

Besides open-world person re-identification, we have also studied the unsupervised embedding learning problem for general image classification and retrieval.
Motivated by supervised embedding learning, we propose a data augmentation invariant and instance spread-out feature. To learn the feature embedding, we propose an instance feature-based softmax embedding, which optimizes the embedding directly on top of the real-time instance features. It achieves much faster learning speed and better accuracy than existing methods.

The major contributions of this thesis are summarized as follows:
- A dynamic graph matching framework is proposed to estimate cross-camera labels for unsupervised video-based person re-identification.
- A robust anchor embedding method with top-$k$ counts label prediction is proposed to efficiently estimate the cross-camera labels for unsupervised video-based person re-identification under wild settings.
- A two-stage PurifyNet is introduced to handle the label noise problem in person re-identification, which jointly refines the falsely annotated labels and mines hard samples with correct labels.
- A dual-constrained top-ranking loss with a dual-path network is proposed for cross-modality visible-thermal person re-identification, which simultaneously addresses the cross-modality and intra-modality variations.
- A data augmentation invariant and instance spread-out feature is proposed for unsupervised embedding learning, which directly optimizes the learned embedding on top of real-time instance features with a softmax function.
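
To make the idea of a softmax computed directly on instance features more concrete, the following is a minimal Python sketch and not the thesis's actual formulation. It assumes L2-normalised features, a temperature of 0.1, and a loss that only rewards each augmented sample for being recognised as its own instance within the current batch. The function names (`instance_softmax_prob`, `invariance_loss`) and the toy vectors are hypothetical and chosen purely for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def instance_softmax_prob(aug_feature, batch_features, index, temperature=0.1):
    """Softmax probability that `aug_feature` is assigned to instance `index`.

    The logits are cosine similarities to the features of the current batch,
    so no classifier weights are involved. The temperature is an assumed
    hyperparameter.
    """
    q = l2_normalize(aug_feature)
    logits = [dot(l2_normalize(f), q) / temperature for f in batch_features]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return exps[index] / sum(exps)


def invariance_loss(batch_features, aug_features, temperature=0.1):
    """Mean negative log-probability of each augmented sample matching its
    own instance. Other instances enter through the softmax denominator,
    which pushes their features apart (the spread-out effect)."""
    n = len(batch_features)
    total = 0.0
    for i in range(n):
        p = instance_softmax_prob(aug_features[i], batch_features, i, temperature)
        total -= math.log(p)
    return total / n


# Toy 2-D features: three instances and slightly perturbed "augmented" views.
features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
augmented = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Pair every instance with the augmented view of a different instance.
shuffled = [augmented[1], augmented[2], augmented[0]]

loss_matched = invariance_loss(features, augmented)
loss_shuffled = invariance_loss(features, shuffled)
```

In this toy setup, `loss_matched` is lower than `loss_shuffled`, since each augmented view is closest to its own instance. A real training loop would backpropagate such a loss through the network that produces the features, which this sketch leaves out.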
Identifier | oai:union.ndltd.org:hkbu.edu.hk/oai:repository.hkbu.edu.hk:etd_oa-1691 |
Date | 30 August 2019 |
Creators | Ye, Mang |
Publisher | HKBU Institutional Repository |
Source Sets | Hong Kong Baptist University |
Language | English |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | Open Access Theses and Dissertations |