1 |
Multi-Template Temporal Siamese Network for Visual Object Tracking
Sekhavati, Ali 04 January 2023 (has links)
Visual object tracking is the task of assigning a unique ID to an object in a video frame, determining whether it is present in the current frame and, if so, precisely localizing its position. Object tracking faces numerous challenges, such as changes in illumination, partial or full occlusion, changes in target appearance, blurring caused by camera movement, the presence of objects similar to the target, and changes in video image quality over time. Due to these challenges, traditional computer vision techniques cannot deliver high-quality tracking, especially long-term tracking. Almost all state-of-the-art object tracking methods now use artificial intelligence, more specifically convolutional neural networks. In this work, we present a Siamese-based tracker that differs from previous work in two ways. Firstly, most Siamese-based trackers take the target in the first frame as the ground truth. Despite the success of such methods in previous years, this does not guarantee robust tracking, as it cannot handle many of the challenges that change the target's appearance, such as blurring caused by camera movement, occlusion, and pose variation. In this work, while keeping the first frame as a template, we add five additional templates that are dynamically updated and replaced based on the target classification score in different frames. Diversity, similarity, and recency are the criteria for choosing the members, and we call this set the bag of dynamic templates. Secondly, many Siamese-based trackers are prone to mistakenly tracking another similar-looking object instead of the intended target. Many researchers have proposed computationally expensive approaches, such as tracking all the distractors along with the given target and discriminating between them in every frame. In this work, we handle this issue by estimating the target's position in the next frame from its bounding box coordinates in previous frames.
We use a temporal network over a history of several previous frames, measure the classification scores of candidates against the templates in the bag of dynamic templates, and use a sequential confidence value that reflects how confident the tracker has been in previous frames. We call this component the robustifier; it prevents the tracker from continuously switching between the target and possible distractors. Extensive experiments on the OTB 50, OTB 100, and UAV20L datasets demonstrate the superiority of our work over state-of-the-art methods.
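The two ideas described in this abstract, the bag of dynamic templates and motion-based position estimation, can be sketched as follows. This is an illustrative sketch only: the thesis code is not reproduced here, and the class and function names, the capacity of five, and the replacement heuristic are assumptions based on the description above (in particular, the real diversity/similarity/recency criteria and the temporal network are approximated by simpler stand-ins).

```python
class DynamicTemplateBag:
    """First-frame template plus up to `capacity` dynamically replaced templates."""

    def __init__(self, first_template, capacity=5):
        self.first = first_template            # kept for the whole sequence
        self.capacity = capacity
        self.members = []                      # list of (frame_idx, template, score)

    def templates(self):
        """All templates currently used for matching."""
        return [self.first] + [t for _, t, _ in self.members]

    def update(self, frame_idx, template, cls_score, threshold=0.8):
        """Admit a new template only on confident detections."""
        if cls_score < threshold:
            return
        if len(self.members) < self.capacity:
            self.members.append((frame_idx, template, cls_score))
            return
        # Stand-in for the diversity/similarity/recency criteria: replace the
        # member with the lowest score, breaking ties toward older frames.
        worst = min(range(len(self.members)),
                    key=lambda i: (self.members[i][2], self.members[i][0]))
        if cls_score > self.members[worst][2]:
            self.members[worst] = (frame_idx, template, cls_score)


def predict_next_center(history):
    """Linear extrapolation of the target center from the last two frames:
    a crude stand-in for the temporal network described in the abstract."""
    (x1, y1), (x2, y2) = history[-2], history[-1]
    return (2 * x2 - x1, 2 * y2 - y1)
```

A candidate far from `predict_next_center` could then be down-weighted, which is the intuition behind rejecting distractors without tracking each of them explicitly.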
|
2 |
A Computational Approach to Relative Image Aesthetics
January 2016 (has links)
abstract: Computational visual aesthetics has recently become an active research area. Existing state-of-the-art methods formulate this as a binary classification task in which a given image is predicted to be beautiful or not. In many applications, such as image retrieval and enhancement, it is more important to rank images by their aesthetic quality than to categorize them into two classes. Furthermore, in such applications all images may belong to the same category, so determining their aesthetic ranking is more appropriate. To this end, this work formulates the novel problem of ranking images with respect to their aesthetic quality. A new dataset of image pairs with relative labels is constructed by carefully selecting images from the popular AVA dataset. Unlike in aesthetics classification, there is no single threshold that would determine the ranking order of the images across the entire dataset.
This problem is tackled with a deep neural network trained on image pairs by incorporating principles from relative learning. Results show that such a relative training procedure allows the network to rank images with higher accuracy than a state-of-the-art network trained on the same set of images using binary labels. Further analysis shows that training on image pairs yields better aesthetic features than training on the same number of individually binary-labelled images.
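The relative-learning idea can be illustrated with a hinge-style margin ranking loss, a common choice for training on ordered pairs. The exact loss used in the thesis is not reproduced here; the function below is a generic sketch, and the margin value is an assumption.

```python
def margin_ranking_loss(score_preferred, score_other, margin=1.0):
    """Hinge-style pairwise ranking loss: zero once the image labelled as
    more aesthetic outscores the other by at least `margin`."""
    return max(0.0, margin - (score_preferred - score_other))

# A correctly ordered pair with a comfortable gap incurs no loss:
# margin_ranking_loss(2.0, 0.5) -> 0.0
# A mis-ordered pair is penalized in proportion to the violation:
# margin_ranking_loss(0.5, 2.0) -> 2.5
```

Training on such a loss only constrains score differences within pairs, which is why no single global threshold is needed, unlike in binary aesthetics classification.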
Additionally, an attempt is made to enhance the performance of the system by incorporating saliency-related information. Given an image, humans might fixate on particular parts of it to which they are subconsciously drawn. I therefore tried to utilize the saliency information both stand-alone and in combination with the global and local aesthetic features, in two separate sets of experiments. In both cases, a standard saliency model is chosen and the generated saliency maps are convolved with the images before passing them to the network, giving higher importance to the salient regions than to the remaining ones. The saliency-weighted images thus generated are used either independently or together with the global and local features to train the network. Empirical results suggest that the saliency-related aesthetic features may already be learnt by the network as a subset of the global features through automatic feature extraction, indicating the redundancy of the additional saliency module. / Dissertation/Thesis / Masters Thesis Computer Science 2016
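The saliency-weighting step described above can be sketched as a pixel-wise product of the image with a saliency map in [0, 1]; this is an assumption about the exact operation, since the abstract does not spell it out, and the function name is illustrative.

```python
def saliency_weight(image, saliency):
    """Attenuate each pixel by its saliency value: salient regions keep
    their intensity, non-salient regions are suppressed.

    `image` and `saliency` are same-shaped 2D lists; saliency values in [0, 1]."""
    return [[px * s for px, s in zip(img_row, sal_row)]
            for img_row, sal_row in zip(image, saliency)]

# A fully salient pixel is unchanged, a half-salient one is attenuated:
# saliency_weight([[10, 20]], [[1.0, 0.5]]) -> [[10.0, 10.0]]
```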
|
3 |
Increasing speaker invariance in unsupervised speech learning by partitioning probabilistic models using linear siamese networks / Ökad talarinvarians i obevakad talinlärning genom partitionering av probabilistiska modeller med hjälp av linjära siamesiska nätverk
Fahlström Myrman, Arvid January 2017 (has links)
Unsupervised learning of speech is concerned with automatically finding patterns such as words or speech sounds, without supervision in the form of orthographic transcriptions or a priori knowledge of the language. However, a fundamental problem is that unsupervised speech learning methods tend to discover highly speaker-specific and context-dependent representations of speech. We propose a method for improving the quality of posteriorgrams generated by an unsupervised model through partitioning of the latent classes discovered by the model. We do this by training a sparse siamese model to find a linear transformation of input posteriorgrams, extracted from the unsupervised model, into lower-dimensional posteriorgrams. The siamese model makes use of same-category and different-category speech fragment pairs obtained through unsupervised term discovery. After training, the model is converted into an exact partitioning of the posteriorgrams. We evaluate the model on the minimal-pair ABX task in the context of the Zero Resource Speech Challenge. We demonstrate that our method significantly reduces the dimensionality of standard Gaussian mixture model posteriorgrams while also making them more speaker-invariant. This suggests that the model may be viable as a general post-processing step to improve probabilistic acoustic features obtained by unsupervised learning.
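Once the trained siamese model has been converted into an exact partitioning, applying it reduces to summing posterior probabilities within each block of latent classes. The sketch below illustrates only this final step under that assumption; the partition itself would come from the trained linear model, which is not reproduced here, and the function name is illustrative.

```python
def partition_posteriorgram(posteriors, assignment):
    """Map a high-dimensional posteriorgram frame to a lower-dimensional one
    by summing the probabilities of latent classes assigned to the same block.

    `posteriors` is one frame of class probabilities (sums to 1);
    `assignment[i]` is the block index of latent class i."""
    out = [0.0] * (max(assignment) + 1)
    for p, block in zip(posteriors, assignment):
        out[block] += p
    return out
```

Because each latent class belongs to exactly one block, the output is still a valid probability distribution, just over fewer, ideally more speaker-invariant, classes.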
|