601 |
Deep Learning for Computer Vision and Its Application to Machine Perception of Hand and Object. Sangpil Kim (9745326), 15 December 2020 (has links)
Advances in computing power and artificial intelligence have made applications such as augmented reality/virtual reality (AR/VR) and smart factories possible. In smart factories, robots interact with workers, and AR/VR devices are used for skill transfer. To enable these applications, a computer needs to recognize the user's hand and body movement together with objects and their interactions. Machine perception of hands and objects is thus the first step toward human-computer integration, because personal activity is represented by the interaction of hands and objects. For machine perception of hands and objects, vision sensors are used across a wide range of industrial applications, since visual information provides a non-contact input signal. For these reasons, computer vision-oriented machine perception has been researched extensively. However, due to the complexity of object space and hand movement, machine perception of hands and objects remains a challenging problem.

Recently, deep learning has produced groundbreaking results in the computer vision domain, addressing many challenging problems and significantly improving AI performance on many tasks. The success of deep learning algorithms depends on the learning strategy and on the quality and quantity of the training data. Therefore, in this thesis, we tackle machine perception of hands and objects from four aspects: learning the underlying structure of 2D data, fusing the surface and volume content of a 3D object, developing an annotation tool for mechanical components, and using thermal information of bare hands. More broadly, we improve the machine perception of interacting hands and objects by developing a learning strategy and a framework for large-scale dataset creation.

For the learning strategy, we use a conditional generative model, which learns the conditional distribution of the dataset by minimizing the gap between the data distribution and the model distribution for hands and objects. First, we propose an efficient conditional generative model for 2D images that can traverse the latent space given a conditional vector. Subsequently, we develop a conditional generative model for 3D space that fuses volume and surface representations and learns the association of functional parts. These methods improve machine perception of objects and hands not only in 2D images but also in 3D space. However, the performance of deep learning algorithms is positively correlated with the quality and quantity of the datasets, which motivates us to develop a large-scale dataset creation framework.

To leverage the learning strategies of deep learning algorithms, we develop annotation tools that can establish large-scale datasets for objects and hands, and we evaluate existing deep learning methods with extensive performance analysis. For object dataset creation, we establish a taxonomy of mechanical components and a web-based annotation tool, and with this framework we create a large-scale mechanical components dataset. Using this dataset, we benchmark seven machine perception algorithms for 3D objects. For hand annotation, we propose a novel data curation method for creating a pixel-wise hand segmentation dataset, which uses thermal information and hand geometry to identify and segment hands from objects and backgrounds. We also introduce a data fusion method that fuses thermal information and RGB-D data for machine perception of hands while they interact with objects.
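The abstract describes the learning strategy only at a high level. As a hedged illustration of the family of objectives it alludes to, a standard conditional GAN minimizes the gap between the conditional data distribution and the model distribution via:

```latex
\min_G \max_D \;
  \mathbb{E}_{(x,c) \sim p_{\mathrm{data}}} \big[ \log D(x \mid c) \big]
+ \mathbb{E}_{z \sim p_z,\; c \sim p_{\mathrm{data}}} \big[ \log \big( 1 - D\big(G(z \mid c) \mid c\big) \big) \big]
```

Here the condition c would encode, for example, a hand or object attribute, and the generator G maps a latent code z under that condition to a sample that the discriminator D scores against real data. This is the textbook conditional GAN formulation, not necessarily the thesis's exact loss.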
|
602 |
Learning Structured Representations for Understanding Visual and Multimedia Data. Zareian, Alireza. January 2021 (has links)
Recent advances in Deep Learning (DL) have achieved impressive performance in a variety of Computer Vision (CV) tasks, leading to an exciting wave of academic and industrial efforts to develop Artificial Intelligence (AI) facilities for every aspect of human life. Nevertheless, there are inherent limitations in the understanding ability of DL models, which limit the potential of AI in real-world applications, especially in the face of complex multimedia input. Despite tremendous progress in solving basic CV tasks, such as object detection and action recognition, state-of-the-art CV models can merely extract a partial summary of visual content, which falls short of a comprehensive understanding of what happens in the scene. This is partly due to the oversimplified definition of CV tasks, which often ignores the compositional nature of semantics and scene structure. Even less studied is how to understand the content of multiple modalities, which requires processing visual and textual information in a holistic and coordinated manner, and extracting interconnected structures despite the semantic gap between the two modalities.
In this thesis, we argue that a key to improving the understanding capacity of DL models in visual and multimedia domains is to use structured, graph-based representations that extract and convey semantic information more comprehensively. To this end, we explore a variety of ideas for defining more realistic DL tasks in both visual and multimedia domains, and propose novel methods to solve those tasks by addressing several fundamental challenges, such as weak supervision, discovery and incorporation of commonsense knowledge, and scaling up the vocabulary. More specifically, inspired by the rich literature on semantic graphs in Natural Language Processing (NLP), we explore innovative scene understanding tasks and methods that describe images using semantic graphs, which reflect the scene structure and interactions between objects. In the first part of this thesis, we present progress towards such graph-based scene understanding solutions, which are more accurate, need less supervision, and have more human-like common sense than the state of the art.
In the second part of this thesis, we extend our results on graph-based scene understanding to the multimedia domain, by incorporating the recent advances in NLP and CV, and developing a new task and method from the ground up, specialized for joint information extraction in the multimedia domain. We address the inherent semantic gap between visual content and text by creating high-level graph-based representations of images, and developing a multitask learning framework to establish a common, structured semantic space for representing both modalities. In the third part of this thesis, we explore another extension of our scene understanding methodology, to open-vocabulary settings, in order to make scene understanding methods more scalable and versatile. We develop visually grounded language models that use naturally supervised data to learn the meaning of all words, and transfer that knowledge to CV tasks such as object detection with little supervision. Collectively, the proposed solutions and empirical results set a new state of the art for the semantic comprehension of visual and multimedia content in a structured way, in terms of accuracy, efficiency, scalability, and robustness.
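To make the notion of a semantic graph concrete, here is a minimal sketch of the kind of graph-based scene representation the thesis builds on: objects as nodes, pairwise relationships as labeled edges. The concrete fields are illustrative assumptions, not the thesis's actual data model.

```python
# A scene graph for an image: nodes are detected objects, edges are
# (subject, predicate, object) triples describing their interactions.
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)    # e.g. ["person", "horse"]
    relations: list = field(default_factory=list)  # (subj_idx, predicate, obj_idx)

g = SceneGraph()
g.objects += ["person", "horse"]
g.relations.append((0, "riding", 1))  # person -riding-> horse
```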
|
603 |
Automatic Gait Recognition: using deep metric learning / Automatisk gångstilsigenkänning. Persson, Martin. January 2020 (has links)
Recent improvements in pose estimation have opened up the possibility of new areas of application. One of them is gait recognition, the task of identifying persons based on their unique style of walking, which is increasingly being recognized as an important method of biometric identification. This thesis has explored the possibilities of using a pose estimation system, OpenPose, together with deep Recurrent Neural Networks (RNNs) to see if there is sufficient information in sequences of 2D poses for gait recognition. To make this possible, a new multi-camera dataset of persons walking on a treadmill was gathered, dubbed the FOI dataset. The results show that this approach has some promise: it achieved an overall classification accuracy of 95.5% on classes it had seen during training and 83.8% on classes it had not. It was, however, unable to recognize sequences from angles it had not seen during training. For that to be possible, more data pre-processing will likely be required.
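The thesis's exact architecture is not given in this abstract; the following is a minimal sketch of the general recipe it describes: OpenPose 2D keypoint sequences fed to an RNN and trained with deep metric learning. The keypoint count (25, as in OpenPose's BODY_25 output), hidden size, and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaitEncoder(nn.Module):
    """Encode a sequence of 2D poses into a fixed-size gait embedding."""
    def __init__(self, n_keypoints=25, hidden=128, emb_dim=64):
        super().__init__()
        # Each frame: n_keypoints (x, y) pairs flattened into one vector.
        self.rnn = nn.LSTM(input_size=2 * n_keypoints,
                           hidden_size=hidden, batch_first=True)
        self.embed = nn.Linear(hidden, emb_dim)

    def forward(self, poses):           # poses: (batch, frames, 2*n_keypoints)
        _, (h_n, _) = self.rnn(poses)   # h_n: (1, batch, hidden)
        return F.normalize(self.embed(h_n[-1]), dim=1)

encoder = GaitEncoder()
# Metric learning: pull same-person sequences together, push others apart.
anchor, pos, neg = (encoder(torch.randn(4, 60, 50)) for _ in range(3))
loss = nn.TripletMarginLoss(margin=0.2)(anchor, pos, neg)
```

At test time, identification would compare embeddings by distance to enrolled sequences, which matches the metric-learning framing in the title.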
|
604 |
Development of Dropwise Additive Manufacturing with non-Brownian Suspensions: Applications of Computer Vision and Bayesian Modeling to Process Design, Monitoring and Control: Video Files in Chapter 5 and Appendix E. Andrew J. Radcliffe (9080312), 24 July 2020 (has links)
Video files found in Chapter 5: AUTOMATED OBJECT TRACKING, EVENT DETECTION AND RECOGNITION FOR HIGH-SPEED VIDEO OF DROP FORMATION PHENOMENA.

Video files found in Appendix E: Chapter 5, Resource 2.
|
605 |
FROM SEEING BETTER TO UNDERSTANDING BETTER: DEEP LEARNING FOR MODERN COMPUTER VISION APPLICATIONS. Tianqi Guo (12890459), 17 June 2022 (has links)
In this dissertation, we document a few of our recent attempts at bridging the gap between fast-evolving deep learning research and the vast industry need for solutions to computer vision challenges. More specifically, we developed novel deep-learning-based techniques for the following application-driven computer vision challenges: image super-resolution with quality restoration, motion estimation by optical flow, object detection for shape reconstruction, and object segmentation for motion tracking. These four topics cover the computer vision hierarchy: from the low level, where digital images are processed to restore missing information for better human perception; to the middle level, where certain objects of interest are recognized and their motions analyzed; and finally to the high level, where the scene captured in video footage is interpreted for further analysis. In building whole-package, ready-to-deploy solutions, we center our efforts on designing and training the most suitable convolutional neural networks for the particular computer vision problem at hand. Complementary procedures for data collection, data annotation, post-processing of network outputs tailored to specific application needs, and deployment details are also discussed where necessary. We hope our work demonstrates the applicability and versatility of convolutional neural networks for real-world computer vision tasks across a broad spectrum, from seeing better to understanding better.
|
606 |
Intelligent Collision Prevention System For SPECT Detectors by Implementing Deep Learning Based Real-Time Object Detection. Tahrir Ibraq Siddiqui (11173185), 23 July 2021 (has links)
The SPECT-CT machines manufactured by Siemens consist of two heavy detector heads (~1500 lbs each) that are moved into various configurations for radionuclide imaging. These detectors are driven by high-torque motors in the gantry that enable linear and rotational motion. If the detectors collide with large objects – stools, tables, patient extremities, etc. – they are very likely to damage the objects and get damaged as well. This research work proposes an intelligent real-time object detection system to prevent collisions between the detector heads and external objects in the path of the detectors' motion by implementing an end-to-end deep learning object detector. The research extensively documents the work done in identifying the most suitable object detection framework for this use case; collecting and processing the image dataset of target objects; training the deep neural net to detect target objects; deploying the trained deep neural net in live demos through a real-time object detection (RTOD) application written in Python; improving the model's performance; and finally investigating methods to stop detector motion upon detecting external objects in the collision region. We successfully demonstrated that a Caffe version of MobileNet-SSD can be trained and deployed to detect target objects entering the collision region in real time by following the methodologies outlined in this paper. We then laid out the future work that must be done to bring this system into production, such as training the model to detect all possible objects that may be found in the collision region, controlling the activation of the RTOD application, and efficiently stopping the detector motion.
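The detection loop described above can be illustrated with OpenCV's DNN module, which loads a Caffe MobileNet-SSD directly. This is a hedged sketch, not the thesis's actual RTOD application: the file names, confidence threshold, and collision-region coordinates are placeholder assumptions.

```python
import cv2

# Standard Caffe MobileNet-SSD weights; file names are assumptions.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")
COLLISION_REGION = (100, 200, 540, 480)  # (x1, y1, x2, y2) in pixels, assumed

def objects_in_collision_region(frame, conf_thresh=0.5):
    h, w = frame.shape[:2]
    # Standard MobileNet-SSD preprocessing: 300x300, scale 1/127.5, mean 127.5.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()           # shape: (1, 1, N, 7)
    hits = []
    for i in range(detections.shape[2]):
        conf = detections[0, 0, i, 2]
        if conf < conf_thresh:
            continue
        x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * [w, h, w, h]).astype(int)
        rx1, ry1, rx2, ry2 = COLLISION_REGION
        if x1 < rx2 and x2 > rx1 and y1 < ry2 and y2 > ry1:  # boxes overlap
            hits.append((float(conf), (x1, y1, x2, y2)))
    return hits  # non-empty -> signal the gantry to stop detector motion
```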
|
607 |
Evaluating DCNN architectures for multinomial area classification using satellite data / Utvärdering av DCNN-arkitekturer för multinomial arealklassificering med hjälp av satellitdata. Wojtulewicz, Karol; Agbrink, Viktor. January 2020 (has links)
The most common approach to analysing satellite imagery is building or object segmentation, which expects an algorithm to find and segment objects with specific boundaries that are present in the satellite imagery. The company Vricon takes satellite imagery analysis further, with the goal of reproducing the entire world as a 3D mesh. This 3D reconstruction is performed by a set of complex algorithms, each excelling at different object reconstructions, which need sufficient labeling in the original 2D satellite imagery to ensure valid transformations. Vricon believes that the labeling of areas can be used to further improve the algorithm selection process. Therefore, the company wants to investigate whether multinomial large-area classification can be performed successfully using the satellite image data available at the company. To enable this type of classification, the company's gold-standard dataset, containing labeled objects such as individual buildings, single trees, and roads, among others, has been transformed into a large-area gold-standard dataset in an unsupervised manner. This dataset was then used to evaluate large-area classification with several state-of-the-art Deep Convolutional Neural Network (DCNN) semantic segmentation architectures, on RGB data alone as well as RGB combined with Digital Surface Model (DSM) height data. The results yield close to 63% mIoU and close to 80% pixel accuracy on validation data without using the DSM height data. This thesis additionally contributes a novel approach for creating a large-area gold standard from existing object-labeled datasets.
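For reference, the two figures quoted above can be computed as follows. This is a generic implementation of mean IoU and pixel accuracy for label maps, not the thesis's evaluation code.

```python
import numpy as np

def miou_and_pixel_acc(pred, target, n_classes):
    """pred, target: integer label maps of equal shape."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                    # skip classes absent from both maps
            ious.append(inter / union)
    pixel_acc = (pred == target).mean()
    return float(np.mean(ious)), float(pixel_acc)

pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
print(miou_and_pixel_acc(pred, target, n_classes=2))  # (~0.583, 0.75)
```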
|
608 |
Efficient Multi-Object Tracking On Unmanned Aerial Vehicle. Xiao Hu (12469473), 27 April 2022 (has links)
Multi-object tracking has been well studied in the field of computer vision. Meanwhile, with the advancement of Unmanned Aerial Vehicle (UAV) technology, the flexibility and accessibility of UAVs have drawn research attention to deploying multi-object tracking on UAVs. Conventional solutions usually adopt the "tracking-by-detection" paradigm, in which tracking is achieved by detecting objects in consecutive frames and then associating them through re-identification. However, the dynamic background, crowded small objects, and limited computational resources make multi-object tracking on UAVs more challenging. Energy-efficient multi-object tracking solutions for drone-captured video are critically demanded by the research community.
To stimulate innovation in both industry and academia, we organized the 2021 Low-Power Computer Vision Challenge with a UAV Video track focusing on multi-class multi-object tracking in customized UAV video. This thesis analyzes the qualified submissions of 17 different teams and provides a detailed analysis of the best solution. Methods and future directions for energy-efficient AI and computer vision research are discussed. The solutions and insights presented in this thesis are expected to facilitate future research and applications in the field of low-power vision on UAVs.
With the knowledge gathered from the submissions, an optical-flow-oriented multi-object tracking framework, named OF-MOT, is proposed to address a similar problem on a more realistic drone-captured video dataset. OF-MOT uses the motion information of each object detected in the previous frame to inform detection in the current frame, then applies a customized tracker that uses the motion information to associate the detected instances. OF-MOT is evaluated on a drone-captured video dataset and achieves 24 FPS with 17% accuracy on a modern Titan X GPU, showing that optical flow can effectively improve multi-object tracking.
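A minimal sketch of the optical-flow-assisted association step, in the spirit of OF-MOT: propagate last-frame boxes by the mean dense flow inside each box, then match them to new detections by IoU. The function names and the matching rule are assumptions, not the actual framework.

```python
import cv2

def propagate_boxes(prev_gray, curr_gray, boxes):
    """Shift previous-frame boxes (x, y, w, h) by mean optical flow inside each."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    moved = []
    for (x, y, w, h) in boxes:
        dx = flow[y:y + h, x:x + w, 0].mean()  # mean horizontal motion in box
        dy = flow[y:y + h, x:x + w, 1].mean()  # mean vertical motion in box
        moved.append((int(x + dx), int(y + dy), w, h))
    return moved

def iou(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / float(aw * ah + bw * bh - inter)

# Association (assumed rule): match each propagated track to the current-frame
# detection with the highest IoU above a threshold; unmatched detections start
# new tracks.
```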
Both the competition analysis and OF-MOT provide insights and experimental results on deploying multi-object tracking on UAVs. We hope these findings will facilitate future research and applications in the field of UAV vision.
|
609 |
TREE-BASED UNIDIRECTIONAL NEURAL NETWORKS FOR LOW-POWER COMPUTER VISION ON EMBEDDED DEVICES. Abhinav Goel (12468279), 27 April 2022 (has links)
Deep Neural Networks (DNNs) are a class of machine learning algorithms that are widely successful in various computer vision tasks. DNNs filter input images and videos with many convolution operations in each layer to extract high-quality features and achieve high accuracy. Although highly accurate, the state-of-the-art DNNs usually require server-grade GPUs, and are too energy-, computation-, and memory-intensive to be deployed on most devices. This is a significant problem because billions of mobile and embedded devices that do not contain GPUs are now equipped with high-definition cameras. Running DNNs locally on these devices enables applications such as emergency response and safety monitoring, because data cannot always be offloaded to the Cloud due to latency, privacy, or network bandwidth constraints.
Prior research has shown that a considerable number of a DNN's memory accesses and computation are redundant when performing computer vision tasks. Eliminating these redundancies will enable faster and more efficient DNN inference on low-power embedded devices. To reduce these redundancies and thereby reduce the energy consumption of DNNs, this thesis proposes a novel Tree-based Unidirectional Neural Network (TRUNK) architecture. Instead of a single large DNN, multiple small DNNs in the form of a tree work together to perform computer vision tasks. The TRUNK architecture first finds the similarity between different object categories. Similar object categories are grouped into clusters. Similar clusters are then grouped into a hierarchy, creating a tree. The small DNNs at every node of TRUNK classify between different clusters. During inference, for an input image, once a DNN selects a cluster, another DNN further classifies among the children of that cluster (sub-clusters). The DNNs associated with other clusters are not used during the inference of that image. By doing so, only a small subset of the DNNs are used during inference, thus reducing redundant operations, memory accesses, and energy consumption. Since each intermediate classification reduces the search space of possible object categories in the image, the small efficient DNNs still achieve high accuracy.
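The inference scheme described above reduces to walking one root-to-leaf path through the tree, so only the classifiers on that path ever run. The following is a minimal sketch of that control flow; the Node type and classifier interface are illustrative assumptions, not the TRUNK codebase.

```python
class Node:
    def __init__(self, classifier=None, children=None, label=None):
        self.classifier = classifier   # small DNN: image -> child index
        self.children = children or [] # sub-cluster nodes
        self.label = label             # object category at leaf nodes

def trunk_infer(node, image):
    while node.children:               # descend until a leaf cluster is reached
        child_idx = node.classifier(image)
        node = node.children[child_idx]
    return node.label                  # final object category

# Toy demo with stub classifiers standing in for the small DNNs.
leaf_cat, leaf_car = Node(label="cat"), Node(label="car")
root = Node(lambda img: 0 if img["animal"] else 1, [leaf_cat, leaf_car])
print(trunk_infer(root, {"animal": True}))  # -> "cat"
```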
In this thesis, we identify the computer vision applications and scenarios that are well suited for the TRUNK architecture. We develop methods to use TRUNK to improve the efficiency of the image classification, object counting, and object re-identification problems. We also present methods to adapt the TRUNK structure for different embedded/edge application contexts with different system architectures, accuracy requirements, and hardware constraints.
Experiments with TRUNK using several image datasets reveal the effectiveness of the proposed solution: memory requirements are reduced by ∼50%, inference time by ∼65%, energy consumption by ∼65%, and the number of operations by ∼45% when compared with existing DNN architectures. These experiments are conducted on consumer-grade embedded systems: NVIDIA Jetson Nano, Raspberry Pi 3, and Raspberry Pi Zero. The TRUNK architecture has only marginal losses in accuracy when compared with the state-of-the-art DNNs.
|
610 |
Multi-Agent Neural Rearrangement Planning of Objects in Cluttered Environments. Vivek Gupta (16642227), 27 July 2023 (has links)
Object rearrangement is a fundamental problem in robotics, with practical applications ranging from managing warehouses to cleaning and organizing home kitchens. While existing research has primarily focused on single-agent solutions, real-world scenarios often require multiple robots to work together on rearrangement tasks. We propose a comprehensive learning-based framework for multi-agent object rearrangement planning, addressing the challenges of task sequencing and path planning in complex environments. The proposed method iteratively selects objects, determines their relocation regions, and pairs them with available robots, subject to kinematic feasibility and task reachability, for execution to achieve the target arrangement. Our experiments on a diverse range of environments demonstrate the effectiveness and robustness of the proposed framework. Furthermore, the results indicate improved performance in terms of traversal time and success rate compared to baseline approaches. The videos and supplementary material are available at https://sites.google.com/view/maner-supplementary
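The iterative select-then-assign loop described above can be sketched as follows. The helper functions are runnable placeholders standing in for the learned components (region proposal, feasibility, and reachability checks), not the authors' actual API.

```python
def relocation_region(obj, target):
    return target.get(obj)            # placeholder: look up the goal region

def kinematically_feasible(robot, obj):
    return True                       # placeholder for an IK/feasibility check

def reachable(robot, region):
    return region is not None         # placeholder for a reachability check

def plan_rearrangement(objects, robots, target):
    """Greedily pair each object with the first feasible robot."""
    plan = []
    for obj in objects:                            # iterate task sequence
        region = relocation_region(obj, target)    # where the object should go
        for robot in robots:
            if kinematically_feasible(robot, obj) and reachable(robot, region):
                plan.append((robot, obj, region))  # assign this move
                break
    return plan

print(plan_rearrangement(["mug"], ["robot_a"], {"mug": (0.4, 0.2)}))
```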
|