Global ETD Search

1	Learning object boundary detection from motion data Ross, Michael G., Kaelbling, Leslie P. 01 1900 (has links) This paper describes the initial results of a project to create a self-supervised algorithm for learning object segmentation from video data. Developmental psychology and computational experience have demonstrated that the motion segmentation of objects is a simpler, more primitive process than the detection of object boundaries by static image cues. Therefore, motion information provides a plausible supervision signal for learning the static boundary detection task and for evaluating performance on a test set. A video camera and previously developed background subtraction algorithms can automatically produce a large database of motion-segmented images for minimal cost. The purpose of this work is to use the information in such a database to learn how to detect the object boundaries in novel images using static information, such as color, texture, and shape. / Singapore-MIT Alliance (SMA) machine learning self-supervised algorithm motion segmentation object boundary detection
2	Contributions on 3D Human Computer-Interaction using Deep approaches Castro-Vargas, John Alejandro 16 March 2023 (has links) There are many challenges facing society today, both socially and industrially. Whether it is to improve productivity in factories or with the intention of improving the quality of life of people in their homes, technological advances in robotics and computing have led to solutions to many problems in modern society. These areas are of great interest and are in constant development, especially in societies with a relatively ageing population. In this thesis, we address different challenges in which robotics, artificial intelligence and computer vision are used as tools to propose solutions oriented to home assistance. These tools can be organised into three main groups: “Grasping Challenges”, where we have addressed the problem of performing robot grasping in domestic environments; “Hand Interaction Challenges”, where we have addressed the detection of static and dynamic hand gestures, using approaches based on DeepLearning and GeometricLearning; and finally, “Human Behaviour Recognition”, where using a machine learning model based on hyperbolic geometry, we seek to group the actions that performed in a video sequence. CNN Hand gesture Hand pose Self-supervised learning
3	Self-Supervised Remote Sensing Image Change Detection and Data Fusion Chen, Yuxing 27 November 2023 (has links) Self-supervised learning models, which are called foundation models, have achieved great success in computer vision. Meanwhile, the limited access to labeled data has driven the development of self-supervised methods in remote sensing tasks. In remote sensing image change detection, the generative models are extensively utilized in unsupervised binary change detection tasks, while they overly focus on pixels rather than on abstract feature representations. In addition, the state-of-the-art satellite image time series change detection approaches fail to effectively leverage the spatial-temporal information of image time series or generalize well to unseen scenarios. Similarly, in the context of multimodal remote sensing data fusion, the recent successes of deep learning techniques mainly focus on specific tasks and complete data fusion paradigms. These task-specific models lack of generalizability to other remote sensing tasks and become overfitted to the dominant modalities. Moreover, they fail to handle incomplete modalities inputs and experience severe degradation in downstream tasks. To address these challenges associated with individual supervised learning models, this thesis presents two novel contributions to self-supervised learning models on remote sensing image change detection and multimodal remote sensing data fusion. The first contribution proposes a bi-temporal / multi-temporal contrastive change detection framework, which employs contrastive loss on image patches or superpixels to get fine-grained change maps and incorporates an uncertainty method to enhance the temporal robustness. In the context of satellite image time series change detection, the proposed approach improves the consistency of pseudo labels through feature tracking and tackles the challenges posed by seasonal changes in long-term remote sensing image time series using supervised contrastive loss and the random walk loss in ConvLSTM. The second contribution develops a self-supervised multimodal RS data fusion framework, with a specific focus on addressing the incomplete multimodal RS data fusion challenges in downstream tasks. Within this framework, multimodal RS data are fused by applying a multi-view contrastive loss at the pixel level and reconstructing each modality using others in a generative way based on MultiMAE. In downstream tasks, the proposed approach leverages a random modality combination training strategy and an attention block to enable fusion across modal-incomplete inputs. The thesis assesses the effectiveness of the proposed self-supervised change detection approach on single-sensor and cross-sensor datasets of SAR and multispectral images, and evaluates the proposed self-supervised multimodal RS data fusion approach on the multimodal RS dataset with SAR, multispectral images, DEM and also LULC maps. The self-supervised change detection approach demonstrates improvements over state-of-the-art unsupervised change detection methods in challenging scenarios involving multi-temporal and multi-sensor RS image change detection. Similarly, the self-supervised multimodal remote sensing data fusion approach achieves the best performance by employing an intermediate fusion strategy on SAR and optical image pairs, outperforming existing unsupervised data fusion approaches. Notably, in incomplete multimodal fusion tasks, the proposed method exhibits impressive performance on all modal-incomplete and single modality inputs, surpassing the performance of vanilla MultiViT, which tends to overfit on dominant modality inputs and fails in tasks with single modality inputs.
4	From Pixels to Prices with ViTMAE : Integrating Real Estate Images through Masked Autoencoder Vision Transformers (ViTMAE) with Conventional Real Estate Data for Enhanced Automated Valuation / Från pixlar till priser med ViTMAE : Integrering av bostadsbilder genom Masked Autoencoder Vision Transformers (ViTMAE) med konventionell fastighetsdata för förbättrad automatiserad värdering Ekblad Voltaire, Fanny January 2024 (has links) The integration of Vision Transformers (ViTs) using Masked Autoencoder pre-training (ViTMAE) into real estate valuation is investigated in this Master’s thesis, addressing the challenge of effectively analyzing visual information from real estate images. This integration aims to enhance the accuracy and efficiency of valuation, a task traditionally dependent on realtor expertise. The research involved developing a model that combines ViTMAE-extracted visual features from real estate images with traditional property data. Focusing on residential properties in Sweden, the study utilized a dataset of images and metadata from online real estate listings. An adapted ViTMAE model, accessed via the Hugging Face library, was trained on the dataset for feature extraction, which was then integrated with metadata to create a comprehensive multimodal valuation model. Results indicate that including ViTMAE-extracted image features improves prediction accuracy in real estate valuation models. The multimodal approach, merging visual and traditional metadata, improved accuracy over metadata-only models. This thesis contributes to real estate valuation by showcasing the potential of advanced image processing techniques in enhancing valuation models. It lays the groundwork for future research in more refined holistic valuation models, incorporating a wider range of factors beyond visual data. / Detta examensarbete undersöker integrationen av Vision Transformers (ViTs) med Masked Autoencoder pre-training (ViTMAE) i bostadsvärdering, genom att addressera utmaningen att effektivt analysera visuell information från bostadsannonser. Denna integration syftar till att förbättra noggrannheten och effektiviteten i fastighetsvärdering, en uppgift som traditionellt är beroende av en fysisk besiktning av mäklare. Arbetet innefattade utvecklingen av en modell som kombinerar bildinformation extraherad med ViTMAE från fastighetsbilder med traditionella fastighetsdata. Med fokus på bostadsfastigheter i Sverige använde studien en databas med bilder och metadata från bostadsannonser. Den anpassade ViTMAE-modellen, tillgänglig via Hugging Face-biblioteket, tränades på denna databas för extraktion av bildinformation, som sedan integrerades med metadata för att skapa en omfattande värderingsmodell. Resultaten indikerar att inklusion av ViTMAE-extraherad bildinformation förbättrar noggranheten av bostadssvärderingsmodeller. Den multimodala metoden, som kombinerar visuell och traditionell metadata, visade en förbättring i noggrannhet jämfört med modeller som endast använder metadata. Denna uppsats bidrar till bostadsvärdering genom att visa på potentialen hos avancerade bildanalys för att förbättra värderingsmodeller. Den lägger grunden för framtida forskning i mer raffinerade holistiska värderingsmodeller som inkluderar ett bredare spektrum av faktorer utöver visuell data. Computer Vision Transformer Self-supervised Learning Masked Autoencoder Automated Real Estate Appraisal Datorseende Transformer Self-supervised Learning Masked Autoencoder Automatiserad Bostadsvärdering Computer and Information Sciences Data- och informationsvetenskap
5	Supervision Beyond Manual Annotations for Learning Visual Representations Doersch, Carl 01 April 2016 (has links) For both humans and machines, understanding the visual world requires relating new percepts with past experience. We argue that a good visual representation for an image should encode what makes it similar to other images, enabling the recall of associated experiences. Current machine implementations of visual representations can capture some aspects of similarity, but fall far short of human ability overall. Even if one explicitly labels objects in millions of images to tell the computer what should be considered similar—a very expensive procedure—the labels still do not capture everything that might be relevant. This thesis shows that one can often train a representation which captures similarity beyond what is labeled in a given dataset. That means we can begin with a dataset that has uninteresting labels, or no labels at all, and still build a useful representation. To do this, we propose to using pretext tasks: tasks that are not useful in and of themselves, but serve as an excuse to learn a more general-purpose representation. The labels for a pretext task can be inexpensive or even free. Furthermore, since this approach assumes training labels differ from the desired outputs, it can handle output spaces where the correct answer is ambiguous, and therefore impossible to annotate by hand. The thesis explores two broad classes of supervision. The first isweak image-level supervision, which is exploited to train mid-level discriminative patch classifiers. For example, given a dataset of street-level imagery labeled only with GPS coordinates, patch classifiers are trained to differentiate one specific geographical region (e.g. the city of Paris) from others. The resulting classifiers each automatically collect and associate a set of patches which all depict the same distinctive architectural element. In this way, we can learn to detect elements like balconies, signs, and lamps without annotations. The second type of supervision requires no information about images other than the pixels themselves. Instead, the algorithm is trained to predict the context around image patches. The context serves as a sort of weak label: to predict well, the algorithm must associate similar-looking patches which also have similar contexts. After training, the feature representation learned using this within-image context indeed captures visual similarity across images, which ultimately makes it useful for real tasks like object detection and geometry estimation. pretext tasks self-supervised learning computer vision unsupervised learning weakly-supervised learning context
6	Robots that Anticipate Pain: Anticipating Physical Perturbations from Visual Cues through Deep Predictive Models January 2017 (has links) abstract: To ensure system integrity, robots need to proactively avoid any unwanted physical perturbation that may cause damage to the underlying hardware. In this thesis work, we investigate a machine learning approach that allows robots to anticipate impending physical perturbations from perceptual cues. In contrast to other approaches that require knowledge about sources of perturbation to be encoded before deployment, our method is based on experiential learning. Robots learn to associate visual cues with subsequent physical perturbations and contacts. In turn, these extracted visual cues are then used to predict potential future perturbations acting on the robot. To this end, we introduce a novel deep network architecture which combines multiple sub- networks for dealing with robot dynamics and perceptual input from the environment. We present a self-supervised approach for training the system that does not require any labeling of training data. Extensive experiments in a human-robot interaction task show that a robot can learn to predict physical contact by a human interaction partner without any prior information or labeling. Furthermore, the network is able to successfully predict physical contact from either depth stream input or traditional video input or using both modalities as input. / Dissertation/Thesis / Masters Thesis Computer Science 2017 Robotics Artificial intelligence Anticipating Physical Perturbations Deep Learning Safe Robotics Self-Supervised Learning
7	Object Detection and Semantic Segmentation Using Self-Supervised Learning Gustavsson, Simon January 2021 (has links) In this thesis, three well known self-supervised methods have been implemented and trained on road scene images. The three so called pretext tasks RotNet, MoCov2, and DeepCluster were used to train a neural network self-supervised. The self-supervised trained networks where then evaluated on different amount of labeled data on two downstream tasks, object detection and semantic segmentation. The performance of the self-supervised methods are compared to networks trained from scratch on the respective downstream task. The results show that it is possible to achieve a performance increase using self-supervision on a dataset containing road scene images only. When only a small amount of labeled data is available, the performance increase can be substantial, e.g., a mIoU from 33 to 39 when training semantic segmentation on 1750 images with a RotNet pre-trained backbone compared to training from scratch. However, it seems that when a large amount of labeled images are available (>70000 images), the self-supervised pretraining does not increase the performance as much or at all. Self-supervised learning Computer vision
8	Self-supervised učení v aplikacích počítačového vidění / Self-supervised learning in computer vision applications Vančo, Timotej January 2021 (has links) The aim of the diploma thesis is to make research of the self-supervised learning in computer vision applications, then to choose a suitable test task with an extensive data set, apply self-supervised methods and evaluate. The theoretical part of the work is focused on the description of methods in computer vision, a detailed description of neural and convolution networks and an extensive explanation and division of self-supervised methods. Conclusion of the theoretical part is devoted to practical applications of the Self-supervised methods in practice. The practical part of the diploma thesis deals with the description of the creation of code for working with datasets and the application of the SSL methods Rotation, SimCLR, MoCo and BYOL in the role of classification and semantic segmentation. Each application of the method is explained in detail and evaluated for various parameters on the large STL10 dataset. Subsequently, the success of the methods is evaluated for different datasets and the limiting conditions in the classification task are named. The practical part concludes with the application of SSL methods for pre-training the encoder in the application of semantic segmentation with the Cityscapes dataset.
9	An Approach to Self-Supervised Object Localisation through Deep Learning Based Classification Politov, Andrei 28 December 2021 (has links) Deep learning has become ubiquitous in science and industry for classifying images or identifying patterns in data. The most widely used approach to training convolutional neural networks is supervised learning, which requires a large set of annotated data. To elude the high cost of collecting and annotating datasets, selfsupervised learning methods represent a promising way to learn the common functions of images and videos from large-scale unlabeled data without using humanannotated labels. This thesis provides the results of using self-supervised learning and explainable AI to localise objects in images from electron microscopes. The work used a synthetic geometric dataset and a synthetic pollen dataset. The classification was used as a pretext task. Different methods of explainable AI were applied: Grad-CAM and backpropagation-based approaches showed the lack of prospects; at the same time, the Extremal Perturbation function has shown efficiency. As a result of the downstream localisation task, the objects of interest were detected with competitive accuracy for one-class images. The advantages and limitations of the approach have been analysed. Directions for further work are proposed. info:eu-repo/classification/ddc/004 ddc:004
10	Indoor 3D Scene Understanding Using Depth Sensors Lahoud, Jean 09 1900 (has links) One of the main goals in computer vision is to achieve a human-like understanding of images. Nevertheless, image understanding has been mainly studied in the 2D image frame, so more information is needed to relate them to the 3D world. With the emergence of 3D sensors (e.g. the Microsoft Kinect), which provide depth along with color information, the task of propagating 2D knowledge into 3D becomes more attainable and enables interaction between a machine (e.g. robot) and its environment. This dissertation focuses on three aspects of indoor 3D scene understanding: (1) 2D-driven 3D object detection for single frame scenes with inherent 2D information, (2) 3D object instance segmentation for 3D reconstructed scenes, and (3) using room and floor orientation for automatic labeling of indoor scenes that could be used for self-supervised object segmentation. These methods allow capturing of physical extents of 3D objects, such as their sizes and actual locations within a scene. Depth sensors 3D Understanding 3D instance segmentation 3D object detection self-supervised pertaining object recognition

Search results