The goal of this thesis has been global geolocalization using only visual input and a 3D database for reference. In recent years, Convolutional Neural Networks (CNNs) have seen huge success in image classification. The flattened tensors at the final layers of a CNN can be viewed as vectors describing different features of the input image. Two networks were trained so that satellite and aerial images taken from different views of the same location had similar feature vectors. The networks were also trained so that images taken from different locations had different feature vectors. After training, the position of a given aerial image can be estimated by finding the satellite image whose feature vector is most similar to that of the aerial image. A previous method called Where-CNN was used as a baseline model. Batch-Hard triplet loss, the Adam optimizer, and a different CNN backbone were tested as possible augmentations to this method. The models were trained on 2640 different locations in Linköping and Norrköping. The models were then tested on a sequence of 4411 query images along a path in Jönköping. The search region had 1449 different locations covering a total area of 24 km². Top-1% accuracy improved significantly over the baseline, from 61.62% to 88.62%. The environment was modeled as a Hidden Markov Model to filter the sequence of guesses. The Viterbi algorithm was then used to find the most probable path. This filtering procedure reduced the average error along the path from 2328.0 m to just 264.4 m for the best model. By comparison, the baseline had an average error of 563.0 m after filtering. A few different 3D methods were also tested. One drawback was that no pretrained weights existed for these models, as opposed to the 2D models, which were pretrained on the ImageNet dataset. The best 3D model achieved a Top-1% accuracy of 70.41%.
It should be noted that the best 2D model without any pretraining achieved a lower Top-1% accuracy of 49.38%. In addition, a 3D method for efficiently performing convolution on sparse 3D data was presented. Compared to the straightforward method, it was almost 2.5 times faster while still having comparable accuracy for individual query predictions. While there was a significant improvement over the baseline, it was not large enough to provide reliable and accurate localization for individual images. For global navigation, using the entire Earth as search space, the information in a 2D image might not be enough to be uniquely identifiable. However, the 3D CNN techniques tested did not improve on the results of the pretrained 2D models. Using more data and experimenting with different 3D CNN architectures is a promising direction for further research.
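The retrieval and filtering steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the Euclidean distance measure, the ceiling rule used for the Top-1% cutoff, the distance-based transition scores, and all function and variable names are assumptions made here for illustration only.

```python
import numpy as np

def rank_locations(query_vec, ref_vecs):
    """Sort reference indices from most to least similar.
    Assumption: similarity is measured by Euclidean distance."""
    dists = np.linalg.norm(ref_vecs - query_vec, axis=1)
    return np.argsort(dists)

def top_percent_hit(ranking, true_idx, percent=1.0):
    """True if the correct location is among the top `percent` of candidates.
    Assumption: the cutoff is rounded up and is at least one candidate."""
    k = max(1, int(np.ceil(len(ranking) * percent / 100.0)))
    return bool(true_idx in ranking[:k])

def viterbi(log_emission, log_transition, log_prior):
    """Most probable state sequence of an HMM, computed in log space.
    log_emission: (T, N) score of each location for each query image.
    log_transition: (N, N) score of moving from location i to location j.
    log_prior: (N,) score of starting at each location."""
    T, N = log_emission.shape
    score = log_prior + log_emission[0]
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_transition   # cand[i, j]: best path ending i -> j
        backptr[t] = np.argmax(cand, axis=0)
        score = cand.max(axis=0) + log_emission[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):                # follow back-pointers to the start
        path[t - 1] = backptr[t, path[t]]
    return path

# Toy retrieval example: three reference locations with 2D feature vectors.
ref_vecs = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
ranking = rank_locations(np.array([0.9, 0.1]), ref_vecs)

# Toy filtering example: three locations on a line, three query images.
# The second image, taken alone, points at the distant location 2.
positions = np.array([0.0, 1.0, 2.0])
log_transition = -2.0 * np.abs(positions[:, None] - positions[None, :])
log_emission = np.array([[0.0, -5.0, -5.0],
                         [-1.0, -5.0, 0.0],
                         [0.0, -5.0, -5.0]])
raw_guesses = np.argmax(log_emission, axis=1)    # per-image guesses: [0, 2, 0]
path = viterbi(log_emission, log_transition, np.zeros(3))  # filtered: [0, 0, 0]
```

In the toy sequence, the transition scores penalize the jump to location 2 and back, so the filtered path stays at location 0. This mirrors, at a very small scale, how filtering a sequence of guesses can reduce the average error along a path.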
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-185719 |
Date | January 2022 |
Creators | Karlsson, Justus |
Publisher | Linköpings universitet, Institutionen för systemteknik |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |