1. Cross-view learning. Zhang, Li (January 2018)
Key to achieving more efficient machine intelligence is the capability to analyse and understand data across different views, which can be camera views or modality views (such as visual and textual). One generic learning paradigm for automatically understanding data from different views is cross-view learning, which includes cross-view matching, cross-view fusion and cross-view generation. Specifically, this thesis investigates two of them, cross-view matching and cross-view generation, by developing new methods for the following computer vision problems.

The first problem is cross-view matching for person re-identification, in which a person is captured by multiple non-overlapping camera views and the objective is to match him/her across views among a large number of imposters. Typically a person's appearance is represented using features of thousands of dimensions, whilst only hundreds of training samples are available due to the difficulties in collecting matched training samples. With the number of training samples much smaller than the feature dimension, existing methods face the classic small sample size (SSS) problem and have to resort to dimensionality reduction techniques and/or matrix regularisation, which lead to a loss of discriminative power for cross-view matching. To overcome the SSS problem in subspace learning, this thesis proposes to match cross-view data in a discriminative null space of the training data.

The second problem is cross-view matching for zero-shot learning, where data are drawn from different modalities, each forming a different view (e.g. visual or textual), versus the single-modal data considered in the first problem. This is inherently more challenging as the gap between different views becomes larger. Specifically, the zero-shot learning problem can be solved if the visual representation/view of the data (object) and its textual view are matched. Moreover, it requires learning a joint embedding space into which data from different views can be projected for nearest neighbour search. This thesis argues that the key to making zero-shot learning models succeed is choosing the right embedding space. Different from most existing zero-shot learning models, which use a textual or an intermediate space as the embedding space for achieving cross-view matching, the proposed method uniquely explores the visual space as the embedding space. This thesis finds that in the visual space, the subsequent nearest neighbour search suffers much less from the hubness problem and thus becomes more effective. Moreover, the model provides a natural mechanism for multiple textual modalities to be optimised jointly in an end-to-end manner, which demonstrates significant advantages over existing methods.
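A minimal sketch of the null-space idea behind the first problem, assuming high-dimensional features X (one row per image, far more feature dimensions than rows) and integer identity labels y; the function and variable names are illustrative, and this is a generic null Foley-Sammon-style construction under those assumptions, not the thesis's exact formulation:

```python
# Sketch: match cross-view data in a null space of the within-class
# scatter, where all samples of one identity collapse to a single point.
import numpy as np
from scipy.linalg import null_space

def fit_null_space(X, y):
    classes = np.unique(y)
    mean = X.mean(axis=0)
    # Within-class deviations; the null space of this matrix is the
    # null space of the within-class scatter S_w.
    Xw = np.vstack([X[y == c] - X[y == c].mean(axis=0) for c in classes])
    N = null_space(Xw)                       # d x k basis with Xw @ N == 0
    # Inside the null space, keep directions with between-class spread.
    Xb = np.vstack([X[y == c].mean(axis=0) - mean for c in classes])
    _, _, Vt = np.linalg.svd(Xb @ N, full_matrices=False)
    return N @ Vt.T                          # final d x r projection

# Usage sketch: project gallery and probe features with the learned
# projection, then match identities by nearest neighbour distance.
```

Because the feature dimension far exceeds the number of training samples in the SSS setting, this null space is guaranteed to be non-trivial.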
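For the second problem, a sketch of using the visual space as the embedding space: a small network regresses class semantic vectors (e.g. attributes or word vectors) onto visual features, and test images are classified by nearest neighbour among projected class prototypes. Layer sizes and names here are illustrative assumptions, not the thesis's actual architecture:

```python
# Sketch: semantic-to-visual embedding for zero-shot learning.
import torch
import torch.nn as nn

class SemanticToVisual(nn.Module):
    def __init__(self, sem_dim, vis_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sem_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, vis_dim), nn.ReLU(),
        )

    def forward(self, s):
        return self.net(s)

def zsl_loss(model, vis_feats, sem_vecs):
    # The regression loss lives in visual space; searching for nearest
    # neighbours in that space suffers less from the hubness problem.
    return ((model(sem_vecs) - vis_feats) ** 2).sum(dim=1).mean()

def classify(model, vis_feat, unseen_sem_vecs):
    protos = model(unseen_sem_vecs)          # unseen-class prototypes
    dists = torch.cdist(vis_feat.unsqueeze(0), protos).squeeze(0)
    return dists.argmin().item()             # nearest prototype wins
```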
The last problem is cross-view generation for image captioning, which aims to automatically generate textual sentences from visual images. Most existing image captioning studies are limited to investigating variants of deep learning-based image encoders, improving the inputs for the subsequent deep sentence decoders. Existing methods have two limitations: (i) They are trained to maximise the likelihood of each ground-truth word given the previous ground-truth words and the image, termed Teacher-Forcing. This strategy may cause a mismatch between training and testing, since at test time the model uses the previously generated words from the model distribution to predict the next word. This exposure bias can result in error accumulation in sentence generation at test time, since the model has never been exposed to its own predictions. (ii) The training supervision metric, such as the widely used cross-entropy loss, is different from the evaluation metrics used at test time; in other words, the model is not directly optimised towards the task expectation, and the learned model is therefore suboptimal. One main underlying reason is that the evaluation metrics are non-differentiable and therefore much harder to optimise against. This thesis overcomes these problems by exploring ideas from reinforcement learning. Specifically, a novel actor-critic based learning approach is formulated to directly maximise the reward: the actual natural language processing quality metrics of interest. Compared to existing reinforcement learning based captioning models, the new method has the unique advantage of enabling per-token advantage and value computation, leading to better model training.
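A simplified sketch of one per-token actor-critic training step, assuming an actor that returns per-token log-probabilities for a sampled caption, a critic that predicts a value at every token position, and a sentence-level reward function (e.g. a CIDEr score against the references). All interfaces here are illustrative assumptions, not the thesis's actual code:

```python
# Sketch: per-token advantages from a sentence-level caption reward.
import torch

def actor_critic_step(actor, critic, images, references, reward_fn):
    captions, log_probs = actor.sample(images)     # log_probs: (B, T)
    values = critic(images, captions)              # values:    (B, T)
    rewards = reward_fn(captions, references)      # (B,) sentence scores
    # Per-token advantage: how much better the final reward turned out
    # than the critic's estimate at each generation step.
    advantages = rewards.unsqueeze(1) - values     # (B, T)
    actor_loss = -(advantages.detach() * log_probs).mean()
    critic_loss = advantages.pow(2).mean()         # regress values to reward
    return actor_loss + critic_loss
```

The per-token advantage gives each generated word its own credit assignment, rather than spreading a single sentence-level reward uniformly across all tokens.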
2. UAV geolocalization in Swedish fields and forests using Deep Learning. Rohlén, Andreas (January 2021)
The ability of unmanned autonomous aerial vehicles (UAVs) to localize themselves in an environment is fundamental for them to be able to function, even when they do not have access to a global positioning system. Recently, with the success of deep learning in vision based tasks, some methods have been proposed for absolute geolocalization using vision based deep learning with satellite and UAV images. Most of these are only tested in urban environments, which begs the question: how well do they work in non-urban areas like forests and fields? One drawback of deep learning is that models are often regarded as black boxes, as it is hard to know why a model makes the predictions it does, i.e. what information is important and is used for the prediction. To address this, several neural network interpretation methods have been developed, providing explanations so that we may understand these models better. This thesis investigates the localization accuracy of one geolocalization method in both urban and non-urban environments, and applies neural network interpretation to see whether it can explain the potential difference in the method's localization accuracy between these environments. The results show that the method performs best in urban environments, achieving a mean absolute horizontal error of 38.30 m and a mean absolute vertical error of 16.77 m, while it performs significantly worse in non-urban environments, with a mean absolute horizontal error of 68.11 m and a mean absolute vertical error of 22.83 m. Further, the results show that if the satellite images and the UAV images are collected during different seasons of the year, the localization accuracy is worse still, with a mean absolute horizontal error of 86.91 m and a mean absolute vertical error of 23.05 m. The neural network interpretation did not help explain why the method performs worse in non-urban environments and is not suitable for this kind of problem.
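A small sketch of how the reported localization metrics can be computed, assuming predicted and ground-truth UAV positions are available as (east, north, up) coordinates in metres; the representation and names are illustrative assumptions:

```python
# Sketch: mean absolute horizontal and vertical geolocalization errors.
import numpy as np

def localization_errors(pred, true):
    """pred, true: (N, 3) arrays of (east, north, up) positions in metres."""
    horizontal = np.linalg.norm(pred[:, :2] - true[:, :2], axis=1)
    vertical = np.abs(pred[:, 2] - true[:, 2])
    return horizontal.mean(), vertical.mean()   # mean absolute errors
```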