111 |
Evaluation of Monocular Visual SLAM Methods on UAV Imagery to Reconstruct 3D Terrain
Johansson, Fredrik, Svensson, Samuel, January 2021
When reconstructing the Earth in 3D, the imagery can come from various sources, including satellites, planes, and drones. One significant benefit of using drones in combination with a Visual Simultaneous Localization and Mapping (V-SLAM) system is that specific areas of the world can be accurately mapped in real time at a low cost. Drones can be equipped with essentially any camera sensor, but most commercially available drones use a monocular rolling shutter camera sensor. Therefore, on behalf of Maxar Technologies, multiple monocular V-SLAM systems were studied during this thesis, and ORB-SLAM3 and LDSO were selected for further evaluation. In order to provide accurate and reproducible results, the methods were benchmarked on the public datasets EuRoC MAV and TUM monoVO, which include drone imagery and outdoor sequences, respectively. A third dataset was collected with a DJI Mavic 2 Enterprise Dual drone to evaluate how the methods would perform with a consumer-friendly drone. The datasets were used to evaluate the two V-SLAM systems regarding the generated 3D map (point cloud) and estimated camera trajectory. The results showed that ORB-SLAM3 is less impacted by the artifacts caused by a rolling shutter camera sensor than LDSO. However, ORB-SLAM3 generates a sparse point cloud where depth perception can be challenging since it abstracts the images using feature descriptors. In comparison, LDSO produces a semi-dense 3D map where each point includes the pixel intensity, which improves the depth perception. Furthermore, LDSO is more suitable for dark environments and low-texture surfaces. Depending on the use case, either method can be used as long as the required prerequisites are provided. In conclusion, monocular V-SLAM systems are highly dependent on the type of sensor being used. 
The differences in the accuracy and robustness of the systems using a global shutter and a rolling shutter are significant, as the geometric artifacts caused by a rolling shutter are devastating for a pure visual pipeline. / The thesis work was carried out at the Department of Science and Technology (ITN), Faculty of Science and Engineering, Linköping University.
|
112 |
Fusion of Stationary Monocular and Stereo Camera Technologies for Traffic Parameters Estimation
Ali, Syed Musharaf, 07 March 2017
Modern intelligent transportation systems (ITS) rely on reliable and accurately estimated traffic parameters. Travel speed, traffic flow, and traffic state classification are the main traffic parameters of interest. These parameters can be estimated through efficient vision-based algorithms and appropriate camera sensor technology.
With the advances in camera technologies and increasing computing power, the use of monocular vision, stereo vision, and camera sensor fusion technologies has been an active research area in the field of ITS. In this thesis, we investigated stationary monocular and stereo camera technology for traffic parameters estimation. Stationary camera sensors provide extensive spatio-temporal information about the road section with relatively low installation costs.
Two novel scientific contributions for vehicle detection and recognition are proposed. The first is the use of stationary stereo camera technology, and the second is the fusion of monocular and stereo camera technologies.
A vision-based ITS consists of several hardware and software components. The overall performance of such a system depends not only on these individual modules but also on their interaction. Therefore, a systematic approach considering all essential modules was chosen instead of focusing on one element of the complete system chain. This leads to detailed investigations of several core algorithms, e.g. background subtraction, histogram-based fingerprints, and data fusion methods.
From experimental results on standard datasets, we concluded that the proposed fusion-based approach, which combines monocular and stereo camera technologies, performs better than either technology alone for vehicle detection and vehicle recognition. Moreover, this research work has the potential to provide a low-cost vision-based solution for online traffic monitoring systems in urban and rural environments.
|
113 |
Estimation de profondeur à partir d'images monoculaires par apprentissage profond / Depth estimation from monocular images by deep learning
Moukari, Michel, 01 July 2019
La vision par ordinateur est une branche de l'intelligence artificielle dont le but est de permettre à une machine d'analyser, de traiter et de comprendre le contenu d'images numériques. La compréhension de scène en particulier est un enjeu majeur en vision par ordinateur. Elle passe par une caractérisation à la fois sémantique et structurelle de l'image, permettant d'une part d'en décrire le contenu et, d'autre part, d'en comprendre la géométrie. Cependant tandis que l'espace réel est de nature tridimensionnelle, l'image qui le représente, elle, est bidimensionnelle. Une partie de l'information 3D est donc perdue lors du processus de formation de l'image et il est d'autant plus complexe de décrire la géométrie d'une scène à partir d'images 2D de celle-ci. Il existe plusieurs manières de retrouver l'information de profondeur perdue lors de la formation de l'image. Dans cette thèse nous nous intéressons à l’estimation d'une carte de profondeur étant donné une seule image de la scène. Dans ce cas, l'information de profondeur correspond, pour chaque pixel, à la distance entre la caméra et l'objet représenté en ce pixel. L'estimation automatique d'une carte de distances de la scène à partir d'une image est en effet une brique algorithmique critique dans de très nombreux domaines, en particulier celui des véhicules autonomes (détection d’obstacles, aide à la navigation). Bien que le problème de l'estimation de profondeur à partir d'une seule image soit un problème difficile et intrinsèquement mal posé, nous savons que l'Homme peut apprécier les distances avec un seul œil. Cette capacité n'est pas innée mais acquise et elle est possible en grande partie grâce à l'identification d'indices reflétant la connaissance a priori des objets qui nous entourent. Par ailleurs, nous savons que des algorithmes d'apprentissage peuvent extraire ces indices directement depuis des images. 
Nous nous intéressons en particulier aux méthodes d’apprentissage statistique basées sur des réseaux de neurones profonds qui ont récemment permis des percées majeures dans de nombreux domaines et nous étudions le cas de l'estimation de profondeur monoculaire. / Computer vision is a branch of artificial intelligence whose purpose is to enable a machine to analyze, process and understand the content of digital images. Scene understanding in particular is a major issue in computer vision. It requires both a semantic and a structural characterization of the image, on the one hand to describe its content and, on the other hand, to understand its geometry. However, while real space is three-dimensional, the image representing it is two-dimensional. Part of the 3D information is thus lost during the process of image formation, which makes it all the more difficult to describe the geometry of a scene from 2D images of it. There are several ways to retrieve the depth information lost during image formation. In this thesis we are interested in estimating a depth map given a single image of the scene. In this case, the depth information corresponds, for each pixel, to the distance between the camera and the object represented in this pixel. The automatic estimation of a distance map of the scene from an image is indeed a critical algorithmic building block in a very large number of domains, in particular that of autonomous vehicles (obstacle detection, navigation aids). Although the problem of estimating depth from a single image is a difficult and inherently ill-posed problem, we know that humans can judge distances with one eye. This capacity is not innate but acquired, and it is made possible mostly by the identification of cues reflecting prior knowledge of the surrounding objects. Moreover, we know that learning algorithms can extract these cues directly from images. 
We are particularly interested in statistical learning methods based on deep neural networks, which have recently led to major breakthroughs in many fields, and we study the case of monocular depth estimation.
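As an illustration of what a per-pixel depth map encodes, the following minimal sketch back-projects one pixel into a 3D point under a pinhole camera model. The function names and the intrinsics (`fx`, `fy`, `cx`, `cy`) are assumptions made for this example and do not come from the thesis. The abstract describes depth as the camera-to-object distance, so a second helper converts such a Euclidean range into the depth along the optical axis, which many datasets store instead.

```python
import math
from typing import Tuple


def backproject(u: float, v: float, z_depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Turn pixel (u, v) with depth along the optical axis into a camera-frame 3D point."""
    x = (u - cx) / fx * z_depth
    y = (v - cy) / fy * z_depth
    return (x, y, z_depth)


def range_to_z_depth(u: float, v: float, distance: float,
                     fx: float, fy: float, cx: float, cy: float) -> float:
    """Convert a camera-to-object distance at pixel (u, v) into depth along the optical axis."""
    xn = (u - cx) / fx  # normalized image coordinates
    yn = (v - cy) / fy
    return distance / math.sqrt(1.0 + xn * xn + yn * yn)


if __name__ == "__main__":
    # Hypothetical intrinsics for a 640x480 camera.
    intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    print(backproject(420.0, 240.0, 2.0, **intrinsics))
    print(range_to_z_depth(695.0, 240.0, 2.5, **intrinsics))
```

The two conventions coincide only at the principal point, so it matters which one a given depth map uses before comparing predictions with ground truth.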
|
114 |
Robust Learning of a depth map for obstacle avoidance with a monocular stabilized flying camera / Apprentissage robuste d'une carte de profondeur pour l'évitement d'obstacle dans le cas des cameras volantes, monoculaires et stabilisées
Pinard, Clément, 24 June 2019
Le drone orienté grand public est principalement une caméra volante, stabilisée et de bonne qualité. Ceux-ci ont démocratisé la prise de vue aérienne, mais avec leur succès grandissant, la notion de sécurité est devenue prépondérante. Ce travail s'intéresse à l'évitement d'obstacle, tout en conservant un vol fluide pour l'utilisateur. Dans ce contexte technologique, nous utilisons seulement une caméra stabilisée, par contrainte de poids et de coût. Pour leur efficacité connue en vision par ordinateur et leur performance avérée dans la résolution de tâches complexes, nous utilisons des réseaux de neurones convolutionnels (CNN). Notre stratégie repose sur un système de plusieurs niveaux de complexité dont les premières étapes sont de mesurer une carte de profondeur depuis la caméra. Cette thèse étudie les capacités d'un CNN à effectuer cette tâche. La carte de profondeur, étant particulièrement liée au flot optique dans le cas d'images stabilisées, nous adaptons un réseau connu pour cette tâche, FlowNet, afin qu'il calcule directement la carte de profondeur à partir de deux images stabilisées. Ce réseau est appelé DepthNet. Cette méthode fonctionne en simulateur avec un entraînement supervisé, mais n'est pas assez robuste pour des vidéos réelles. Nous étudions alors les possibilités d'auto-apprentissage basées sur la reprojection différentiable d'images. Cette technique est particulièrement nouvelle sur les CNNs et nécessite une étude détaillée afin de ne pas dépendre de paramètres heuristiques. Finalement, nous développons un algorithme de fusion de cartes de profondeurs pour utiliser DepthNet sur des vidéos réelles. Plusieurs paires différentes sont données à DepthNet afin d'avoir une grande plage de profondeurs mesurées. / Consumer unmanned aerial vehicles (UAVs) are mainly flying cameras. They democratized aerial footage, but with their success came safety concerns. This work aims to improve UAV safety through obstacle avoidance, while keeping a smooth flight. 
In this context, we use only one stabilized camera, because of weight and cost constraints. For their robustness in computer vision and their capacity to solve complex tasks, we chose to use convolutional neural networks (CNNs). Our strategy is based on incrementally learning tasks of increasing complexity, whose first steps are to construct a depth map from the stabilized camera. This thesis focuses on studying the ability of CNNs to learn this task. In the case of stabilized footage, the depth map is closely linked to optical flow. We thus adapt FlowNet, a CNN known for optical flow, to directly output depth from two stabilized frames. This network is called DepthNet. This experiment succeeded with synthetic footage, but is not robust enough to be used directly on real videos. Consequently, we consider self-supervised training with real videos, based on differentiable image reprojection. Since this training method for CNNs is rather novel in the literature, a thorough study is needed in order not to depend too much on heuristics. Finally, we developed a depth fusion algorithm to use DepthNet efficiently on real videos. Multiple frame pairs are fed to DepthNet to obtain a wide depth sensing range.
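The link between depth and optical flow for stabilized (rotation-free) footage can be made concrete with the classical motion-field equations for a purely translating camera. The sketch below is only that geometric relation, written under the assumption of a static scene, small inter-frame motion, and normalized image coordinates; it is not DepthNet, and the function name and arguments are hypothetical.

```python
import math
from typing import Tuple


def depth_from_flow(x: float, y: float,
                    flow_x: float, flow_y: float,
                    translation: Tuple[float, float, float]) -> float:
    """Depth of a static point seen by a camera that only translates between two frames.

    (x, y) are normalized image coordinates, (flow_x, flow_y) is the optical flow in
    the same units, and translation is the camera displacement (tx, ty, tz).
    First-order motion field: flow = ((x*tz - tx) / Z, (y*tz - ty) / Z).
    """
    tx, ty, tz = translation
    parallax = math.hypot(x * tz - tx, y * tz - ty)
    flow_mag = math.hypot(flow_x, flow_y)
    if flow_mag == 0.0:
        # No measurable flow: the point is effectively at infinity, or it sits
        # at the focus of expansion where depth cannot be observed.
        return math.inf
    return parallax / flow_mag


if __name__ == "__main__":
    # Camera moves 0.1 m sideways; a point on the optical axis shifts by 0.01.
    print(depth_from_flow(0.0, 0.0, -0.01, 0.0, (0.1, 0.0, 0.0)))
```

Because depth is inversely proportional to flow magnitude for a fixed displacement, frame pairs with different displacements are sensitive to different depth ranges, which is consistent with the abstract's remark about feeding multiple frame pairs to the network.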
|
115 |
Monocular Depth Prediction in Deep Neural Networks
Tang, Guanqian, January 2019
With the development of artificial neural networks (ANNs), they have been introduced into more and more computer vision tasks. Convolutional neural networks (CNNs) are widely used in object detection, object tracking, and semantic segmentation, achieving large performance improvements over traditional algorithms. As a classical topic in computer vision, depth recovery from monocular images with deep CNNs is a popular research direction, since single-view images are more common than stereo image pairs and video. However, due to the lack of motion and geometry information, monocular depth estimation is much more difficult. This thesis aims at investigating depth prediction from single images by exploiting state-of-the-art deep CNN models. Two neural networks are studied: the first network uses the idea of a global and local network, and the other one adopts a deeper fully convolutional network by using a pre-trained backbone CNN (ResNet or DenseNet). We compare the performance of the two networks and the result shows that the deeper convolutional neural network with the pre-trained backbone can achieve better performance. The pre-trained model can significantly accelerate the training process. We also find that the amount of training data is essential for CNN-based monocular depth prediction. / Utvecklingen av artificiella neurala nätverk (ANN) har gjort att det nu använts i flertal datorseende tekniker för att förbättra prestandan. Convolutional Neural Networks (CNN) används ofta inom objektdetektering, objektspårning och semantisk segmentering, och har en bättre prestanda än de föregående algoritmerna. Användningen av CNNs för djup prediktering för single-image har blivit populärt, på grund av att single-image är vanligare än stereo-image och filmer. På grund av avsaknaden av rörelse och geometrisk information, är det mycket svårare att veta djupet i en bild än för en film. 
Syftet med masteruppsatsen är att implementera en ny algoritm för djup prediktering, specifikt för bilder genom att använda CNN modeller. Två olika neurala nätverk analyserades; det första använder sig av lokalt och globalt nätverk och det andra består av ett avancerat Convolutional Neural Network som använder en pretrained backbone CNN (ResNet eller DenseNet). Våra analyser visar att avancerat Convolutional Neural Network som använder en pre-trained backbone CNN har en bättre prestanda som påskyndade inlärningsprocessen avsevärt. Vi kom även fram till att mängden data för inlärningsprocessen var avgörande för CNN-baserad monokulär djup prediktering.
|
116 |
Rolling shutter in feature-based Visual-SLAM : Robustness through rectification in a wearable and monocular context
Norée Palm, Caspar, January 2023
This thesis analyzes the impact of and implements compensation for rolling shutter distortions in the state-of-the-art feature-based visual SLAM system ORB-SLAM3. The compensation method involves rectifying the detected features, and the evaluation was conducted on the "Rolling-Shutter Visual-Inertial Odometry Dataset" from TUM, which comprises ten sequences recorded with side-by-side synchronized global and rolling shutter cameras in a single room. Without the implemented rectification algorithms, the accuracy and robustness of ORB-SLAM3 on rolling shutter footage decreased substantially. The global shutter camera achieved centimeter or even sub-centimeter accuracy, while the rolling shutter camera's error could reach the decimeter range in the more challenging sequences. Also, specific individual executions using a rolling shutter camera could not track the trajectory effectively, indicating a degradation in robustness. The effects of rolling shutter in inertial ORB-SLAM3 were even more pronounced, with higher trajectory errors and outright failure to track in some sequences. This was the case even though using inertial measurements with the global shutter camera resulted in better accuracy and robustness compared to the non-inertial case. The rectification algorithms implemented in this thesis yielded significant accuracy increases of up to a 7x relative improvement for the non-inertial case, which turned trajectory errors back to the centimeter scale from the decimeter one for the more challenging sequences. For the inertial case, the rectification scheme was even more crucial. It resulted in better trajectory accuracies, better than the non-inertial case for the less challenging sequences, and made tracking possible for the more challenging ones.
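To make the idea of rectifying detected features concrete, here is a deliberately simplified sketch. It assumes each image row is exposed slightly later than the previous one and that a feature moves with constant image-plane velocity during readout, then shifts the feature back to where it would have appeared at the time the first row was read. The abstract does not describe the thesis's actual rectification algorithms, so the model, the function name, and all parameters are assumptions for illustration.

```python
from typing import Tuple


def rectify_feature(u: float, v: float,
                    vel_u: float, vel_v: float,
                    image_height: int, readout_time_s: float) -> Tuple[float, float]:
    """First-order rolling-shutter correction of one feature position.

    (u, v): observed pixel position; (vel_u, vel_v): image velocity in px/s;
    readout_time_s: time needed to read all image rows, top to bottom.
    """
    # Row v is captured this long after row 0.
    dt = (v / image_height) * readout_time_s
    # Undo the motion that accumulated during that delay.
    return (u - vel_u * dt, v - vel_v * dt)


if __name__ == "__main__":
    # Feature halfway down a 480-row image, moving 100 px/s to the right,
    # with a hypothetical 30 ms readout time.
    print(rectify_feature(320.0, 240.0, 100.0, 0.0, 480, 0.03))
```

A real system would typically derive the per-feature motion from the estimated camera velocity (or inertial measurements) rather than take an image velocity as given.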
|
117 |
Improving deep monocular depth predictions using dense narrow field of view depth images
Möckelind, Christoffer, January 2018
In this work we study a depth prediction problem where we provide a narrow field of view depth image and a wide field of view RGB image to a deep network tasked with predicting the depth for the entire RGB image. We show that by providing a narrow field of view depth image, we improve results for the area outside the provided depth compared to an earlier approach that uses only a single RGB image for depth prediction. We also show that larger depth maps provide a greater advantage than smaller ones and that the accuracy of the model decreases with the distance from the provided depth. Further, we investigate several architectures and study the effect of adding noise to and lowering the resolution of the provided depth image. Our results show that models provided with low-resolution, noisy depth perform on par with models provided with unaltered depth. / I det här arbetet studerar vi ett djupapproximationsproblem där vi tillhandahåller en djupbild med smal synvinkel och en RGB-bild med bred synvinkel till ett djupt nätverk med uppgift att förutsäga djupet för hela RGB-bilden. Vi visar att genom att ge djupbilden till nätverket förbättras resultatet för området utanför det tillhandahållna djupet jämfört med en existerande metod som använder en RGB-bild för att förutsäga djupet. Vi undersöker flera arkitekturer och storlekar på djupbildssynfält och studerar effekten av att lägga till brus och sänka upplösningen på djupbilden. Vi visar att större synfält för djupbilden ger en större fördel och även att modellens noggrannhet minskar med avståndet från det angivna djupet. Våra resultat visar också att modellerna som använde sig av det brusiga lågupplösta djupet presterade på samma nivå som de modeller som använde sig av det omodifierade djupet.
|
118 |
Registration and Localization of Unknown Moving Objects in Markerless Monocular SLAM
Troutman, Blake, 05 1900
Indiana University-Purdue University Indianapolis (IUPUI) / Simultaneous localization and mapping (SLAM) is a general device localization technique that uses realtime sensor measurements to develop a virtualization of the sensor's environment while also using this growing virtualization to determine the position and orientation of the sensor. This is useful for augmented reality (AR), in which a user looks through a head-mounted display (HMD) or viewfinder to see virtual components integrated into the real world. Visual SLAM (i.e., SLAM in which the sensor is an optical camera) is used in AR to determine the exact device/headset movement so that the virtual components can be accurately redrawn to the screen, matching the perceived motion of the world around the user as the user moves the device/headset. However, many potential AR applications may need access to more than device localization data in order to be useful; they may need to leverage environment data as well. Additionally, most SLAM solutions make the naive assumption that the environment surrounding the system is completely static (non-moving). Given these circumstances, it is clear that AR may benefit substantially from utilizing a SLAM solution that detects objects that move in the scene and ultimately provides localization data for each of these objects. This problem is known as the dynamic SLAM problem. Current attempts to address the dynamic SLAM problem often use machine learning to develop models that identify the parts of the camera image that belong to one of many classes of potentially-moving objects. The limitation with these approaches is that it is impractical to train models to identify every possible object that moves; additionally, some potentially-moving objects may be static in the scene, which these approaches often do not account for. 
Some other attempts to address the dynamic SLAM problem also localize the moving objects they detect, but these systems almost always rely on depth sensors or stereo camera configurations, which have significant limitations in real-world use cases. This dissertation presents a novel approach for registering and localizing unknown moving objects in the context of markerless, monocular, keyframe-based SLAM with no required prior information about object structure, appearance, or existence. This work also details a novel deep learning solution for determining SLAM map initialization suitability in structure-from-motion-based initialization approaches. This dissertation goes on to validate these approaches by implementing them in a markerless, monocular SLAM system called LUMO-SLAM, which is built from the ground up to demonstrate this approach to unknown moving object registration and localization. Results are collected for the LUMO-SLAM system, which address the accuracy of its camera localization estimates, the accuracy of its moving object localization estimates, and the consistency with which it registers moving objects in the scene. These results show that this solution to the dynamic SLAM problem, though it does not act as a practical solution for all use cases, has an ability to accurately register and localize unknown moving objects in such a way that makes it useful for some applications of AR without thwarting the system's ability to also perform accurate camera localization.
|
119 |
Monocular 3D Human Pose Estimation / Monokulär 3D-människans hållningsuppskattning
Rey, Robert, January 2023
The focus of this work is the task of 3D human pose estimation, more specifically by making use of key points located in single monocular images in order to estimate the location of human body joints in 3D space. It was done in association with Tracab, a company based in Stockholm that specialises in advanced sports tracking and analytics solutions. Tracab’s core product is their optical tracking system for football, which involves installing multiple high-speed cameras around the sports venue. One of the main benefits of this work will be to reduce the number of cameras required to create the 3D skeletons of the players, hence reducing production costs as well as making the whole process of creating the 3D skeletons much simpler in the future. The main problem we are tackling consists of taking a set of 2D joint locations and lifting them to 3D space, which adds depth information to the joint locations. One problem with this task is the limited availability of in-the-wild datasets with corresponding 3D ground truth labels. We hope to tackle this issue by making use of the restricted Human3.6m dataset along with the Tracab dataset in order to achieve adequate results. Since the Tracab dataset is very large, i.e. millions of unique poses and skeletons, we have focused our experiments on a single football game. Although extensive research has been done in the field using architectures such as convolutional neural networks, transformers, spatial-temporal architectures and more, we tackle this issue with a simple feedforward neural network developed by Martinez et al.; this is mainly possible due to the abundance of data available at Tracab. / Fokus för detta arbete är att estimera 3D kroppspositioner, genom att använda detekterade punkter på människokroppen i enskilda monokulära bilder för att uppskatta 3D positionen av dessa ledpunkter. 
Detta arbete genomfördes i samarbete med Tracab, ett företag baserat i Stockholm, som specialiserar sig på avancerade lösningar för följning och analys inom idrott. Tracabs huvudprodukt är deras optiska följningssystem, som innebär att flera synkroniserade höghastighetskameror installeras runt arenan. En av de främsta fördelarna med detta arbete kommer att vara att minska antalet kameror som krävs för att skapa 3D-skelett av spelarna, vilket minskar produktionskostnaderna och förenklar hela processen för att skapa 3D-skelett i framtiden. Huvudproblemet vi angriper är att gå från en uppsättning 2D-ledpunkter och lyfta dem till 3D-utrymme. Ett problem är den begränsade tillgången till datamängder med 3D ground truth från realistiska miljöer. Vi angriper detta problem genom att använda den begränsade Human3.6m-datasetet tillsammans med Tracab-datasetet för att uppnå tillräckliga resultat. Eftersom Tracab-datamängden är mycket stor, med miljontals unika poser och skelett, har vi begränsat våra experiment till en fotbollsmatch. Omfattande forskning har gjorts inom området med användning av arkitekturer som konvolutionella neurala nätverk, transformerare, rumslig-temporala arkitekturer med mera. Här använder vi ett enkelt framåtriktat neuralt nätverk utvecklat av Martinez et al, vilket är möjligt tack vare den stora mängden data som är tillgänglig hos Tracab.
|
120 |
Inferring 3D trajectory from monocular data using deep learning / Inferens av 3D bana utifrån 2D data med djupa arkitekturer
Sellstedt, Victor, January 2021
Trajectory estimation, in the sense of reconstructing a 3D trajectory from a 2D trajectory, is commonly achieved using stereo or multi-camera setups. Although projections from 3D to 2D suffer significant information loss, some methods approach this problem from a monocular perspective to address limitations of multi-camera systems, such as requiring points to be observed by more than one camera. This report explores how deep learning methodology can be applied to the estimation of golf balls’ 3D trajectories using features from synthetically generated monocular data. Three neural network architectures for time series analysis, Long Short-Term Memory (LSTM), Bidirectional LSTM (BLSTM), and Temporal Convolutional Network (TCN), are compared to a simpler Multi Layer Perceptron (MLP) baseline and to the theoretical stereo error. The results show that the models’ performance varies, with median results often significantly better than the mean because a few predictions have very large errors. Overall the BLSTM performed best of all models both quantitatively and qualitatively, for some ranges with a lower error than a stereo estimate with an estimated disparity error of 1. Although the proposed monocular approaches do not outperform a stereo system with a lower disparity error, they could be good alternatives where stereo solutions are not possible. / Lösningar för inferens av 3D banor utifrån 2D sekvenser använder sig ofta av två eller fler kameror som datakällor. Trots att mycket information förloras i projektionen till kamerabilden använder sig vissa lösningar sig av endast en kamera. En sådan monokulär lösning kan vara mer fördelaktiga än multikamera lösningar i vissa fall, såsom när ett objekt endast är synligt av ena kamera. Denna rapport undersöker hur metoder baserade på djupa arkitekturer kan användas för att uppskatta golfbollars 3D banor med variabler som skapas utifrån syntetiskt genererad monokulär data. 
Tre olika arkitekturer för tidsserieanalys Long Short-Term Memory (LSTM), Bidirectional LSTM (BLSTM) och Temporal Convolutional Neural Network (TCN) jämförs mot en enklare Multi Layer Perceptron (MLP) och teoretiska stereo-fel. Resultaten visar att modellerna har en varierad prestation med median resultaten ofta mycket bättre än medelvärdena, på grund av några förutsägelser med stora fel. Överlag var den bästa modellen BLSTM:en både kvantitativt och kvalitativt samt bättre än stereo lösningen med högre fel för vissa intervall. Resultaten visar dock på att modellerna är tydligt sämre en stereo systemet med lägre fel. Trots detta kan de föreslagna metoderna utgöra bra alternativ för lösningar där stereo system inte kan användas.
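The "theoretical stereo error" used as a reference in the abstract above can be illustrated with the standard rectified-stereo relation between depth and disparity. The sketch assumes the usual textbook model, Z = f·B/d, and its first-order error propagation; the abstract does not state which formula or which camera parameters the thesis used, so the focal length, baseline, and the interpretation of the disparity error as one pixel are assumptions.

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point from its disparity in a rectified stereo pair: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px


def stereo_depth_error(focal_px: float, baseline_m: float, depth_m: float,
                       disparity_error_px: float = 1.0) -> float:
    """First-order depth uncertainty caused by a disparity error: dZ ~ Z^2 * dd / (f * B)."""
    return depth_m ** 2 * disparity_error_px / (focal_px * baseline_m)


if __name__ == "__main__":
    # Hypothetical rig: 1000 px focal length, 0.5 m baseline.
    f, b = 1000.0, 0.5
    for depth in (10.0, 20.0, 40.0):
        print(depth, stereo_depth_error(f, b, depth))
```

The error grows with the square of the depth, which is one reason a monocular model can plausibly match or beat a stereo estimate in some ranges while being worse in others.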
|