
Comparison of camera data types for AI tracking of humans in indoor combat training

Zenk, Viktor, Bach, Willy January 2022 (has links)
Multiple object tracking (MOT) can be an efficient tool for finding patterns in video monitoring data. In this thesis, we investigate which type of video data works best for MOT in an indoor combat training scenario. The three types of camera data evaluated are color data, near-infrared (NIR) data, and depth data. To evaluate which of these lend themselves best to MOT, we develop object tracking models based on YOLOv5 and DeepSORT, and train the models on the respective types of data. In addition to the individual models, ensembles of the three models are also developed, to see whether any performance can be gained. The models are evaluated using the well-established MOT evaluation metrics, and the frame rate performance of each model is also studied. The results are rigorously analyzed using statistical significance tests to ensure that only well-supported conclusions are drawn. These evaluations and analyses show mixed results. On the MOT metrics, the performance of most models was not shown to differ significantly from that of most other models, so while a difference in performance was observed, it cannot be assumed to hold over larger sample sizes. Regarding frame rate, we find that the ensemble models are significantly slower than the individual models on their own.
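Of the MOT evaluation metrics referred to above, MOTA (Multiple Object Tracking Accuracy) is the usual summary score. A minimal sketch of its CLEAR MOT definition (the function name and example counts are ours, not the thesis's):

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy (CLEAR MOT definition).

    MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over frames.
    """
    if num_ground_truth <= 0:
        raise ValueError("ground-truth object count must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth

# Example: 12 misses, 8 false alarms, 2 identity switches over 200 GT boxes
print(mota(12, 8, 2, 200))  # → 0.89
```

Note that MOTA can be negative when a tracker produces more errors than there are ground-truth objects, which is why significance testing over multiple sequences matters.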

Collision Avoidance for Complex and Dynamic Obstacles : A study for warehouse safety

Ljungberg, Sandra, Brandås, Ester January 2022 (has links)
Today a group of automated guided vehicles at Toyota Material Handling Manufacturing Sweden detect and avoid objects primarily using 2D-LiDAR, whose shortcomings are that it scans only a single 2D plane and misses objects close to the ground. Several dynamic obstacles exist in the environment of the vehicles. Protruding forks are one such obstacle, impossible to detect and avoid with the current choice of sensor and its placement. This thesis investigates possible solutions and limitations of using a single RGB camera for obstacle detection, tracking, and avoidance. The obstacle detection uses the deep learning model YOLOv5s. A solution for semi-automatic data gathering and labeling is designed, and pre-trained weights are chosen to minimize the amount of labeled data needed. Two different approaches are implemented for tracking the object. The YOLOv5s detections are the foundation of the first, where 2D bounding boxes are used as measurements in an Extended Kalman Filter (EKF). Fiducial markers form the second approach, used as measurements in another EKF. A state lattice motion planner is designed to find a feasible path around the detected obstacle. The chosen graph search algorithm is ARA*, designed to initially find a suboptimal path and improve it if time allows. The detection works successfully, with an average precision of 0.714. The filter using 2D bounding boxes cannot distinguish between a clockwise and a counterclockwise rotation, but performance improves when a measurement of rotation is included. Using ARA* in the motion planner, the solution avoids the obstacles sufficiently well.
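The EKF approach above feeds bounding-box measurements through a predict/update cycle. The cycle is easiest to see in the simpler linear, constant-velocity case tracking a single box coordinate; this sketch is illustrative (state layout and noise values are our assumptions, not the thesis's):

```python
def kf_step(state, cov, z, dt=0.1, q=1e-3, r=0.05):
    """One predict/update cycle of a 1-D constant-velocity Kalman filter.

    state = (position, velocity); cov is a 2x2 covariance (nested lists);
    z is a noisy position measurement, e.g. one bounding-box coordinate.
    q and r are illustrative process/measurement noise values.
    """
    x, v = state
    # predict: x' = F x with F = [[1, dt], [0, 1]]; P' = F P F^T + Q
    x_p, v_p = x + dt * v, v
    p00 = cov[0][0] + dt * (cov[0][1] + cov[1][0]) + dt * dt * cov[1][1] + q
    p01 = cov[0][1] + dt * cov[1][1]
    p10 = cov[1][0] + dt * cov[1][1]
    p11 = cov[1][1] + q
    # update: H = [1, 0] observes position only
    s = p00 + r                   # innovation covariance
    k0, k1 = p00 / s, p10 / s     # Kalman gain
    y = z - x_p                   # innovation (measurement residual)
    new_state = (x_p + k0 * y, v_p + k1 * y)
    new_cov = [[(1 - k0) * p00, (1 - k0) * p01],
               [p10 - k1 * p00, p11 - k1 * p01]]
    return new_state, new_cov
```

An EKF replaces F and H with Jacobians of nonlinear motion and measurement models, but the predict/update structure is the same.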

3D Shape Detection for Augmented Reality / 3D form-detektion för förstärkt verklighet

Anadon Leon, Hector January 2018 (has links)
In previous work, 2D object recognition has shown exceptional results. However, it cannot sense the spatial information of the environment: where the objects are and what they are. Having this knowledge could bring improvements in several fields, such as Augmented Reality, by allowing virtual characters to interact more realistically with the environment, and autonomous cars, which could make better decisions knowing where objects are in 3D space. The proposed work shows that it is possible to predict 3D bounding boxes with semantic labels for 3D object detection, and a set of primitives for 3D shape recognition, for multiple objects in an indoor scene, using an algorithm that receives as input an RGB image and its 3D information. It uses deep neural networks with novel architectures for point cloud feature extraction, and a single feature vector representing the latent space of the object, modeling its shape, position, size and orientation, for multi-task prediction trained end-to-end on unbalanced datasets. It runs in real time (5 frames per second) on a live video feed. The method is evaluated on the NYU Depth Dataset V2, using Average Precision for object detection, and 3D Intersection over Union and surface-to-surface distance for 3D shape. The results confirm that it is possible to use a shared feature vector for more than one prediction task, and that the model generalizes to objects unseen during training, achieving state-of-the-art results for 3D object detection and 3D shape prediction on the NYU Depth Dataset V2. Qualitative results on specifically captured real data show that navigation in a real-world indoor environment, and collisions between animations and detected objects, are feasible, improving character-environment interaction in Augmented Reality applications.
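One of the evaluation metrics named above, 3D Intersection over Union, reduces to a few lines for axis-aligned boxes. A minimal sketch (the box convention is our assumption; oriented boxes, as used for real scenes, need extra work):

```python
def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    # overlap extent along each axis, clamped at zero when boxes are disjoint
    ix = max(0.0, min(a[3], b[3]) - max(a[0], b[0]))
    iy = max(0.0, min(a[4], b[4]) - max(a[1], b[1]))
    iz = max(0.0, min(a[5], b[5]) - max(a[2], b[2]))
    inter = ix * iy * iz

    def vol(c):
        return (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])

    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0
```

For example, two unit cubes overlapping by half along one axis share half a unit of volume out of 1.5, giving an IoU of one third.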

Mobile Object Detection using TensorFlow Lite and Transfer Learning / Objektigenkänning i mobila enheter med TensorFlow Lite

Alsing, Oscar January 2018 (has links)
With the advances in deep learning in the past few years, we are able to create complex machine learning models for detecting objects in images, regardless of the characteristics of the objects to be detected. This development has enabled engineers to replace existing heuristics-based systems with machine learning models of superior performance. In this report, we evaluate the viability of using deep learning models for object detection in real-time video feeds on mobile devices, in terms of object detection performance and inference delay, either as an end-to-end system or as a feature extractor for existing algorithms. Our results show a significant increase in object detection performance compared to existing algorithms when using transfer learning on neural networks adapted for mobile use.

Vision based indoor object detection for a drone / Bildbaserad detektion av inomhusobjekt för drönare

Grip, Linnea January 2017 (has links)
Drones are a very active area of research, and object detection is a crucial part of achieving full autonomy for any robot. We investigated how state-of-the-art object detection algorithms perform on image data from a drone. For the evaluation we collected a number of datasets in an indoor office environment with different cameras and camera placements. We surveyed the object detection literature and selected the algorithm R-FCN (Region-based Fully Convolutional Network) for the evaluation. The performances on the different datasets were then compared, showing that footage from a drone may be advantageous in scenarios where the goal is to detect as many objects as possible. Further, it was shown that the network, even if trained on normal-angle images, can be used for detecting objects in fisheye images, and that use of a fisheye camera can increase the total number of detected objects in a scene.
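Counting detected objects, as in the comparison above, requires matching detections to ground-truth boxes. A common recipe is greedy one-to-one matching at an IoU threshold; this sketch is our own illustration, not necessarily the thesis's exact protocol:

```python
def iou(a, b):
    """IoU of two axis-aligned 2D boxes (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(dets, gts, iou_thresh=0.5):
    """Greedy one-to-one matching of detections against ground truth.

    Returns the number of true positives: each ground-truth box can be
    claimed by at most one detection with IoU >= iou_thresh.
    """
    matched, tp = set(), 0
    for d in dets:
        best, best_iou = None, iou_thresh
        for i, g in enumerate(gts):
            score = iou(d, g)
            if i not in matched and score >= best_iou:
                best, best_iou = i, score
        if best is not None:
            matched.add(best)
            tp += 1
    return tp
```

Detections should normally be processed in descending confidence order so that confident detections claim boxes first.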

Evaluating rain removal image processing solutions for fast and accurate object detection / Utvärdering av regnborttagningsalgoritmer för snabb och pålitlig objektigenkänning

Köylüoglu, Tugay, Hennicks, Lukas January 2019 (has links)
Autonomous vehicles are an important topic in modern-day research, for both the private and the public sector. One of the reasons why self-driving cars have not yet reached the consumer market is their level of uncertainty. This is often tackled with multiple sensors of different kinds, which helps gain robustness in the vehicle's system. Radars, lidars and cameras are the sensors typically used, and the expense can rise quickly, which is not always feasible for every market. This could be addressed by using fewer, but more robust, sensors for visualization. This thesis addresses one particular failure mode of camera sensors: reduced viewing range in rainy weather. A Kalman filter and a discrete wavelet transform with bilateral filtering are evaluated as rain removal algorithms and tested with the state-of-the-art object detection algorithm You Only Look Once (YOLOv3). Filtered videos in daylight and evening light were tested with YOLOv3, and the results show that accuracy is not improved enough to be worth implementing in autonomous vehicles. With the graphics card available for this thesis, YOLOv3 is not fast enough for a vehicle to stop in time when driving at 110 km/h with an obstacle appearing 80 m ahead; an Nvidia Titan X, however, is assumed to be fast enough. There is potential within the research area, and this thesis suggests evaluating other object detection methods as future work.
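The 110 km/h / 80 m scenario above can be checked with a back-of-the-envelope stopping model. This sketch assumes constant deceleration (8 m/s², a dry-road figure we chose, not a value from the thesis) and returns the time budget left for detection and actuation before braking must begin:

```python
def reaction_budget(speed_kmh, obstacle_m, decel=8.0):
    """Latency budget (seconds) before braking must start to avoid impact.

    Assumes constant deceleration `decel` in m/s^2; braking distance is
    v^2 / (2a), and the remaining distance is covered at full speed v.
    """
    v = speed_kmh / 3.6                 # convert to m/s
    braking = v * v / (2.0 * decel)     # distance consumed while braking
    if braking > obstacle_m:
        return 0.0                      # cannot stop even with zero latency
    return (obstacle_m - braking) / v   # seconds available before braking

print(round(reaction_budget(110.0, 80.0), 2))  # → 0.71
```

Roughly 0.7 s of total latency under these assumptions, which is why per-frame inference time matters as much as accuracy here.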

Human Detection, Tracking and Segmentation in Surveillance Video

Shu, Guang 01 January 2014 (has links)
This dissertation addresses the problem of human detection and tracking in surveillance videos. Even though this is a well-explored topic, many challenges remain when confronted with data from real-world situations. These challenges include appearance variation, illumination changes, camera motion, cluttered scenes and occlusion. In this dissertation several novel methods for improving on the current state of human detection and tracking, based on learning scene-specific information in video feeds, are proposed. Firstly, we propose a novel method for human detection which employs unsupervised learning and superpixel segmentation. The performance of generic human detectors is usually degraded in unconstrained video environments due to varying lighting conditions, backgrounds and camera viewpoints. To handle this problem, we employ an unsupervised learning framework that improves the detection performance of a generic detector when it is applied to a particular video. In our approach, a generic DPM human detector is employed to collect initial detection examples. These examples are segmented into superpixels and then represented using a Bag-of-Words (BoW) framework. The superpixel-based BoW feature encodes useful color features of the scene, which provides additional information. Finally, a new scene-specific classifier is trained using the BoW features extracted from the new examples. Compared to previous work, our method learns scene-specific information through superpixel-based features, hence it can avoid many false detections typically obtained by a generic detector. We are able to demonstrate a significant improvement in the performance of the state-of-the-art detector. Given robust human detection, we propose a robust multiple-human tracking framework using a part-based model. Human detection using part models has become quite popular, yet its extension to tracking has not been fully explored.
Single camera-based multiple-person tracking is often hindered by difficulties such as occlusion and changes in appearance. We address such problems by developing an online-learning tracking-by-detection method. Our approach learns part-based person-specific Support Vector Machine (SVM) classifiers which capture articulations of moving human bodies with dynamically changing backgrounds. With the part-based model, our approach is able to handle partial occlusions in both the detection and the tracking stages. In the detection stage, we select the subset of parts which maximizes the probability of detection. This leads to a significant improvement in detection performance in cluttered scenes. In the tracking stage, we dynamically handle occlusions by distributing the score of the learned person classifier among its corresponding parts, which allows us to detect and predict partial occlusions and prevent the performance of the classifiers from being degraded. Extensive experiments using the proposed method on several challenging sequences demonstrate state-of-the-art performance in multiple-person tracking. Next, in order to obtain precise boundaries of humans, we propose a novel method for multiple human segmentation in videos by incorporating human detection and part-based detection potential into a multi-frame optimization framework. In the first stage, after obtaining the superpixel segmentation for each detection window, we separate superpixels corresponding to a human from the background by minimizing an energy function using a Conditional Random Field (CRF). We use the part detection potentials from the DPM detector, which provide useful information about human shape. In the second stage, the spatio-temporal constraints of the video are leveraged to build a tracklet-based Gaussian Mixture Model for each person, and the boundaries are smoothed by multi-frame graph optimization.
Compared to previous work, our method can automatically segment multiple people in videos with accurate boundaries, and it is robust to camera motion. Experimental results show that our method achieves better segmentation accuracy than previous methods on several challenging video sequences. Most work in Computer Vision deals with point solutions: a specific algorithm for a specific problem. However, putting different algorithms into one real-world integrated system is a big challenge. Finally, we introduce an efficient tracking system, NONA, for high-definition surveillance video. We implement the system using a multi-threaded architecture (Intel Threading Building Blocks (TBB)), which executes video ingestion, tracking, and video output in parallel. To improve tracking accuracy without sacrificing efficiency, we employ several useful techniques. Adaptive Template Scaling is used to handle the scale change due to objects moving towards a camera. Incremental Searching and Local Frame Differencing are used to resolve challenging issues such as scale change, occlusion and cluttered backgrounds. We tested our tracking system on a high-definition video dataset and achieved acceptable tracking accuracy while maintaining real-time performance.
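The superpixel-based Bag-of-Words representation used above boils down to quantizing feature vectors against a codebook and histogramming the assignments. A minimal sketch (the toy codebook and Euclidean distance are illustrative; the dissertation's features encode superpixel color):

```python
def bow_histogram(features, codebook):
    """Bag-of-Words encoding: nearest-codeword assignment + normalized histogram.

    features and codebook entries are equal-length numeric tuples.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    hist = [0] * len(codebook)
    for f in features:
        nearest = min(range(len(codebook)), key=lambda i: sq_dist(f, codebook[i]))
        hist[nearest] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

The resulting fixed-length vector is what a scene-specific classifier can be trained on, regardless of how many superpixels each example contains.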

Scene Monitoring With A Forest Of Cooperative Sensors

Javed, Omar 01 January 2005 (has links)
In this dissertation, we present vision-based scene interpretation methods for monitoring people and vehicles, in real time, within a busy environment using a forest of co-operative electro-optical (EO) sensors. We have developed novel video understanding algorithms with learning capability, to detect and categorize people and vehicles, track them within a camera, and hand off this information across multiple networked cameras for multi-camera tracking. The ability to learn prevents the need for extensive manual intervention, site models and camera calibration, and provides adaptability to changing environmental conditions. For object detection and categorization in the video stream, a two-step detection procedure is used. First, regions of interest are determined using a novel hierarchical background subtraction algorithm that uses color and gradient information. Second, objects are located and classified from within these regions using a weakly supervised learning mechanism based on co-training that employs motion and appearance features. The main contribution of this approach is that it is an online procedure in which separate views (features) of the data are used for co-training, while the combined view (all features) is used to make classification decisions in a single boosted framework. The advantage of this approach is that it requires only a few initial training samples and can automatically adjust its parameters online to improve detection and classification performance. Once objects are detected and classified, they are tracked in individual cameras. Single-camera tracking is performed using a voting-based approach that utilizes color and shape cues to establish correspondence in individual cameras. The tracker has the capability to handle multiple occluded objects. Next, the objects are tracked across a forest of cameras with non-overlapping views. This is a hard problem for two reasons.
First, the observations of an object are often widely separated in time and space when viewed from non-overlapping cameras. Second, the appearance of an object in one camera view might be very different from its appearance in another camera view due to differences in illumination, pose and camera properties. To deal with the first problem, the system learns the inter-camera relationships to constrain track correspondences. These relationships are learned in the form of a multivariate probability density of space-time variables (object entry and exit locations, velocities, and inter-camera transition times) using Parzen windows. To handle the appearance change of an object as it moves from one camera to another, we show that all color transfer functions from a given camera to another camera lie in a low-dimensional subspace. The tracking algorithm learns this subspace using probabilistic principal component analysis and uses it for appearance matching. The proposed system learns the camera topology and the subspace of inter-camera color transfer functions during a training phase. Once training is complete, correspondences are assigned in a maximum a posteriori (MAP) estimation framework using both location and appearance cues. Extensive experiments and deployment of this system in realistic scenarios have demonstrated the robustness of the proposed methods. The proposed system was able to detect and classify targets, and seamlessly tracked them across multiple cameras. It also generated a summary in terms of key frames and a textual description of trajectories for a monitoring officer's final analysis and response decision. This level of interpretation was the goal of our research effort, and we believe that it is a significant step forward in the development of intelligent systems that can deal with the complexities of real-world scenarios.
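The space-time densities above are estimated with Parzen windows. In one dimension with a Gaussian kernel the estimator is only a few lines; this sketch is illustrative (the dissertation's density is multivariate, and the bandwidth here is our assumption):

```python
import math

def parzen_density(x, samples, h=1.0):
    """1-D Parzen-window (kernel) density estimate with a Gaussian kernel.

    Averages a Gaussian bump of bandwidth h centred on each sample;
    the result integrates to 1 over the real line.
    """
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * h * len(samples))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
```

In the multivariate case the kernel becomes a product (or full-covariance Gaussian) over the entry/exit location, velocity and transition-time variables.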

Learning Hierarchical Representations For Video Analysis Using Deep Learning

Yang, Yang 01 January 2013 (has links)
With the exponential growth of digital data, video content analysis (e.g., action and event recognition) has been drawing increasing attention from computer vision researchers. Effective modeling of the objects, scenes, and motions is critical for visual understanding. Recently there has been growing interest in bio-inspired deep learning models, which have shown impressive results in speech and object recognition. Deep learning models are formed by the composition of multiple non-linear transformations of the data, with the goal of yielding more abstract and ultimately more useful representations. The advantages of deep models are threefold: 1) They learn features directly from the raw signal, in contrast to hand-designed features. 2) The learning can be unsupervised, which is suitable for large data where labeling all the data is expensive and impractical. 3) They learn a hierarchy of features one level at a time; this layerwise stacking of feature extraction often yields better representations. However, not many deep learning models have been proposed to solve problems in video analysis, especially videos “in the wild”. Most of them either deal with simple datasets, or are limited to low-level local spatial-temporal feature descriptors for action recognition. Moreover, as the learning algorithms are unsupervised, the learned features preserve generative properties rather than the discriminative ones more favorable in classification tasks. In this context, the thesis makes two major contributions. First, we propose several formulations and extensions of deep learning methods which learn hierarchical representations for three challenging video analysis tasks: complex event recognition, object detection in videos, and measuring action similarity. The proposed methods are extensively demonstrated on challenging state-of-the-art datasets.
Besides learning the low-level local features, higher-level representations are further designed to be learned in the context of applications. Data-driven concept representations and sparse representations of the events are learned for complex event recognition; representations for object body parts and structures are learned for object detection in videos; and relational motion features and similarity metrics between video pairs are learned simultaneously for action verification. Second, in order to learn discriminative and compact features, we propose a new feature learning method using a deep neural network based on autoencoders. It differs from existing unsupervised feature learning methods in two ways: first, it optimizes both discriminative and generative properties of the features simultaneously, which gives our features a better discriminative ability. Second, our learned features are more compact, while unsupervised feature learning methods usually learn a redundant set of over-complete features. Extensive experiments with quantitative and qualitative results on the tasks of human detection and action verification demonstrate the superiority of our proposed models.

Multi-view Approaches To Tracking, 3d Reconstruction And Object Class Detection

Khan, Saad 01 January 2008 (has links)
Multi-camera systems are becoming ubiquitous and have found application in a variety of domains, including surveillance, immersive visualization, sports entertainment and movie special effects, amongst others. From a computer vision perspective, the challenging task is how to most efficiently fuse information from multiple views in the absence of detailed calibration information and with a minimum of human intervention. This thesis presents a new approach to fuse foreground likelihood information from multiple views onto a reference view without explicit processing in 3D space, thereby circumventing the need for complete calibration. Our approach uses a homographic occupancy constraint (HOC), which states that if a foreground pixel has a piercing point that is occupied by a foreground object, then the pixel warps to foreground regions in every view under homographies induced by the reference plane, in effect using cameras as occupancy detectors. Using the HOC we are able to resolve occlusions and robustly determine ground-plane localizations of the people in the scene. To find tracks, we obtain ground localizations over a window of frames and stack them, creating a space-time volume. Regions belonging to the same person form contiguous spatio-temporal tracks that are clustered using a graph cuts segmentation approach. Second, we demonstrate that the HOC is equivalent to performing visual hull intersection in the image plane, resulting in a cross-sectional slice of the object. The process is extended to multiple planes parallel to the reference plane in the framework of plane-to-plane homologies. Slices from multiple planes are accumulated and the 3D structure of the object is segmented out. Unlike other visual hull based approaches that use 3D constructs like visual cones, voxels or polygonal meshes requiring calibrated views, ours is purely image-based and uses only 2D constructs, i.e. planar homographies between views.
This feature also renders it conducive to graphics hardware acceleration. The current GPU implementation of our approach is capable of fusing 60 views (480x720 pixels) at the rate of 50 slices/second. We then present an extension of this approach to reconstructing non-rigid articulated objects from monocular video sequences. The basic premise is that due to motion of the object, scene occupancies are blurred out with non-occupancies in a manner analogous to motion blurred imagery. Using our HOC and a novel construct: the temporal occupancy point (TOP), we are able to fuse multiple views of non-rigid objects obtained from a monocular video sequence. The result is a set of blurred scene occupancy images in the corresponding views, where the values at each pixel correspond to the fraction of total time duration that the pixel observed an occupied scene location. We then use a motion de-blurring approach to de-blur the occupancy images and obtain the 3D structure of the non-rigid object. In the final part of this thesis, we present an object class detection method employing 3D models of rigid objects constructed using the above 3D reconstruction approach. Instead of using a complicated mechanism for relating multiple 2D training views, our approach establishes spatial connections between these views by mapping them directly to the surface of a 3D model. To generalize the model for object class detection, features from supplemental views (obtained from Google Image search) are also considered. Given a 2D test image, correspondences between the 3D feature model and the testing view are identified by matching the detected features. Based on the 3D locations of the corresponding features, several hypotheses of viewing planes can be made. The one with the highest confidence is then used to detect the object using feature location matching. Performance of the proposed method has been evaluated by using the PASCAL VOC challenge dataset and promising results are demonstrated.
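The planar homographies at the heart of the HOC warp pixels between views. Applying a 3x3 homography to a pixel is a few lines of homogeneous-coordinate arithmetic; this sketch is illustrative (the matrix values in the tests are toy examples, not calibrated ones):

```python
def warp_point(H, x, y):
    """Apply a 3x3 planar homography (row-major nested lists) to a pixel.

    The pixel is lifted to homogeneous coordinates (x, y, 1), multiplied
    by H, and divided through by the third coordinate.
    """
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return xs / w, ys / w
```

Warping every foreground pixel of one view through the reference-plane homography into another view is exactly the fusion step the HOC performs, with no 3D reconstruction involved.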
