  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
201

Learning Hierarchical Representations For Video Analysis Using Deep Learning

Yang, Yang 01 January 2013 (has links)
With the exponential growth of digital data, video content analysis (e.g., action and event recognition) has been drawing increasing attention from computer vision researchers. Effective modeling of the objects, scenes, and motions is critical for visual understanding. Recently there has been a growing interest in bio-inspired deep learning models, which have shown impressive results in speech and object recognition. Deep learning models are formed by the composition of multiple non-linear transformations of the data, with the goal of yielding more abstract and ultimately more useful representations. The advantages of deep models are threefold: 1) They learn features directly from the raw signal, in contrast to hand-designed features. 2) The learning can be unsupervised, which is suitable for large datasets where labeling all the data is expensive and impractical. 3) They learn a hierarchy of features one level at a time, and this layerwise stacking of feature extraction often yields better representations. However, not many deep learning models have been proposed to solve problems in video analysis, especially videos "in the wild". Most of them either deal with simple datasets or are limited to low-level local spatial-temporal feature descriptors for action recognition. Moreover, as the learning algorithms are unsupervised, the learned features preserve generative properties rather than the discriminative properties that are more favorable in classification tasks. In this context, the thesis makes two major contributions. First, we propose several formulations and extensions of deep learning methods which learn hierarchical representations for three challenging video analysis tasks: complex event recognition, object detection in videos, and measuring action similarity. The proposed methods are extensively demonstrated on challenging state-of-the-art datasets. Besides learning low-level local features, higher-level representations are further learned in the context of the applications: data-driven concept representations and sparse representations of events are learned for complex event recognition; representations of object body parts and structures are learned for object detection in videos; and relational motion features and similarity metrics between video pairs are learned simultaneously for action verification. Second, in order to learn discriminative and compact features, we propose a new feature learning method using a deep neural network based on autoencoders. It differs from existing unsupervised feature learning methods in two ways: first, it optimizes both discriminative and generative properties of the features simultaneously, which gives our features better discriminative ability; second, our learned features are more compact, whereas unsupervised feature learning methods usually learn a redundant set of over-complete features. Extensive experiments with quantitative and qualitative results on the tasks of human detection and action verification demonstrate the superiority of our proposed models.
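A minimal PyTorch sketch of the kind of joint generative/discriminative objective the second contribution describes: an autoencoder whose code is trained both to reconstruct the input and to predict a class label. The layer sizes, loss weighting, and names are illustrative assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminativeAutoencoder(nn.Module):
    """Autoencoder whose learned code is also used for classification."""
    def __init__(self, in_dim=1024, code_dim=128, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                     nn.Linear(512, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 512), nn.ReLU(),
                                     nn.Linear(512, in_dim))
        self.classifier = nn.Linear(code_dim, n_classes)

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), self.classifier(code)

def joint_loss(x, y, recon, logits, alpha=0.5):
    # Generative term: reconstruct the input; discriminative term: predict the label.
    return (1 - alpha) * F.mse_loss(recon, x) + alpha * F.cross_entropy(logits, y)
```

Pushing `alpha` toward 1 emphasizes the discriminative property at the expense of reconstruction, which is the trade-off the abstract alludes to.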
202

Multi-view Approaches To Tracking, 3d Reconstruction And Object Class Detection

Khan, Saad 01 January 2008 (has links)
Multi-camera systems are becoming ubiquitous and have found application in a variety of domains including surveillance, immersive visualization, sports entertainment, and movie special effects, amongst others. From a computer vision perspective, the challenging task is how to most efficiently fuse information from multiple views in the absence of detailed calibration information and with a minimum of human intervention. This thesis presents a new approach to fuse foreground likelihood information from multiple views onto a reference view without explicit processing in 3D space, thereby circumventing the need for complete calibration. Our approach uses a homographic occupancy constraint (HOC), which states that if a foreground pixel has a piercing point that is occupied by a foreground object, then the pixel warps to foreground regions in every view under homographies induced by the reference plane, in effect using cameras as occupancy detectors. Using the HOC we are able to resolve occlusions and robustly determine ground-plane localizations of the people in the scene. To find tracks, we obtain ground localizations over a window of frames and stack them, creating a space-time volume. Regions belonging to the same person form contiguous spatio-temporal tracks that are clustered using a graph cuts segmentation approach. We then demonstrate that the HOC is equivalent to performing visual hull intersection in the image plane, resulting in a cross-sectional slice of the object. The process is extended to multiple planes parallel to the reference plane in the framework of plane-to-plane homologies. Slices from multiple planes are accumulated and the 3D structure of the object is segmented out. Unlike other visual-hull-based approaches that use 3D constructs like visual cones, voxels, or polygonal meshes requiring calibrated views, ours is purely image-based and uses only 2D constructs, i.e., planar homographies between views. This feature also renders it conducive to graphics hardware acceleration. The current GPU implementation of our approach is capable of fusing 60 views (480x720 pixels) at the rate of 50 slices/second. We then present an extension of this approach to reconstructing non-rigid articulated objects from monocular video sequences. The basic premise is that due to motion of the object, scene occupancies are blurred out with non-occupancies in a manner analogous to motion-blurred imagery. Using our HOC and a novel construct, the temporal occupancy point (TOP), we are able to fuse multiple views of non-rigid objects obtained from a monocular video sequence. The result is a set of blurred scene occupancy images in the corresponding views, where the values at each pixel correspond to the fraction of total time duration that the pixel observed an occupied scene location. We then use a motion de-blurring approach to de-blur the occupancy images and obtain the 3D structure of the non-rigid object. In the final part of this thesis, we present an object class detection method employing 3D models of rigid objects constructed using the above 3D reconstruction approach. Instead of using a complicated mechanism for relating multiple 2D training views, our approach establishes spatial connections between these views by mapping them directly to the surface of a 3D model. To generalize the model for object class detection, features from supplemental views (obtained from Google Image search) are also considered.
Given a 2D test image, correspondences between the 3D feature model and the testing view are identified by matching the detected features. Based on the 3D locations of the corresponding features, several hypotheses of viewing planes can be made. The one with the highest confidence is then used to detect the object using feature location matching. Performance of the proposed method has been evaluated by using the PASCAL VOC challenge dataset and promising results are demonstrated.
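A rough sketch of the fusion step the homographic occupancy constraint implies: foreground-likelihood maps from the other views are warped into the reference view with the reference-plane homographies and multiplied, so only pixels whose piercing points are supported by every view survive. The function, variable names, and multiplicative fusion rule are assumptions for illustration, not the thesis implementation.

```python
import cv2
import numpy as np

def fuse_foreground_likelihoods(ref_likelihood, other_likelihoods, homographies):
    """Warp each view's foreground likelihood onto the reference view and fuse.

    ref_likelihood: HxW float array in [0, 1] for the reference view.
    other_likelihoods: list of HxW float arrays, one per additional view.
    homographies: list of 3x3 arrays mapping each view onto the reference
                  view via the reference (ground) plane.
    """
    h, w = ref_likelihood.shape
    fused = ref_likelihood.astype(np.float32).copy()
    for lik, H in zip(other_likelihoods, homographies):
        warped = cv2.warpPerspective(lik.astype(np.float32), H, (w, h))
        fused *= warped  # occupancy must be supported by every view
    return fused
```

Repeating the same warp-and-fuse step for homographies induced by planes parallel to the reference plane yields the stack of cross-sectional slices described above.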
203

Object Tracking in Games Using Convolutional Neural Networks

Venkatesh, Anirudh 01 June 2018 (has links) (PDF)
Computer vision research has been growing rapidly over the last decade. Recent advancements in the field have been widely used in staple products across various industries. The automotive and medical industries have even pushed cars and equipment into production that use computer vision. However, there seems to be a lack of computer vision research in the game industry. With the advent of e-sports, competitive and casual gaming have reached new heights with regard to players, viewers, and content creators. This has allowed for avenues of research that did not exist previously. In this thesis, we explore the practicality of object detection as applied to games. We designed a custom convolutional neural network detection model, SmashNet. The model was improved through classification weights generated from pre-training on the Caltech101 dataset with an accuracy of 62.29%. It was then trained on 2296 annotated frames from the competitive 2.5-dimensional fighting game Super Smash Brothers Melee to track coordinate locations of 4 specific characters in real-time. The detection model performs at 68.25% accuracy across all 4 characters. In addition, as a demonstration of a practical application, we designed KirbyBot, a black-box adaptive bot which performs basic commands reactively based only on the tracked locations of two characters. It also collects very simple data on player habits. KirbyBot runs at a rate of 6-10 fps. Object detection has several practical applications with regard to games, ranging from better AI design to collecting data on player habits or game characters for competitive purposes or improvement updates.
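A small sketch of the post-processing step a character tracker like this implies: keep the highest-confidence box per character class and report its center as that character's coordinate. The detection tuple layout, class names, and threshold are assumptions, not SmashNet's actual output format.

```python
def character_centers(detections, conf_threshold=0.5):
    """detections: iterable of (x1, y1, x2, y2, confidence, class_name) tuples."""
    best = {}
    for x1, y1, x2, y2, conf, cls in detections:
        # Keep only the most confident box per character above the threshold.
        if conf >= conf_threshold and conf > best.get(cls, (0.0, None))[0]:
            best[cls] = (conf, ((x1 + x2) / 2.0, (y1 + y2) / 2.0))
    return {cls: center for cls, (_, center) in best.items()}

# e.g. character_centers([(10, 20, 50, 90, 0.9, "kirby")]) -> {"kirby": (30.0, 55.0)}
```

A reactive bot such as KirbyBot would then compare the two characters' centers each frame to decide which basic command to issue.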
204

Automating Deep-Sea Video Annotation

Egbert, Hanson 01 June 2021 (has links) (PDF)
As the world explores opportunities to develop offshore renewable energy capacity, there will be a growing need for pre-construction biological surveys and post-construction monitoring in the challenging marine environment. Underwater video is a powerful tool to facilitate such surveys, but the interpretation of the imagery is costly and time-consuming. Emerging technologies have improved automated analysis of underwater video, but these technologies are not yet accurate or accessible enough for widespread adoption in the scientific community or in industries that might benefit from them. To address these challenges, prior research developed a website that allows users to: (1) quickly play and annotate underwater videos, (2) create a short tracking video for each annotation that shows how an annotated concept moves in time, (3) verify the accuracy of existing annotations and tracking videos, (4) create a neural network model from existing annotations, and (5) automatically annotate unwatched videos using a previously created model. Using validated and unvalidated annotations together with annotations generated automatically from trackings, the system counts Rathbunaster californicus (starfish) and Strongylocentrotus fragilis (sea urchin) with count accuracies of 97% and 99% and F1 scores of 0.90 and 0.81, respectively. This thesis explores several improvements to the model above: first, a method to sync JavaScript video frames to a stable Python environment; second, reinforcement training using marine biology experts and the verification feature; and finally, a hierarchical method that allows the model to combine predictions of related concepts. On average, this method improved the F1 scores from 0.42 to 0.45 (a relative increase of 7%) and count accuracy from 58% to 69% (a relative increase of 19%) for the concepts Umbellula lindahli and Funiculina.
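A sketch of the hierarchical idea mentioned last: per-detection scores for related child concepts are folded into a shared parent concept before counting. The taxonomy mapping and the max-combination rule are illustrative assumptions, not the thesis's actual hierarchy.

```python
TAXONOMY = {  # hypothetical child -> parent mapping
    "Umbellula lindahli": "sea pen",
    "Funiculina": "sea pen",
}

def combine_hierarchical(scores, taxonomy=TAXONOMY):
    """scores: dict mapping concept name -> model confidence for one detection.

    Returns the scores augmented with parent-concept confidences, so a
    detection that is ambiguous between sibling concepts still contributes
    a confident prediction at the parent level.
    """
    combined = dict(scores)
    for child, parent in taxonomy.items():
        if child in scores:
            combined[parent] = max(combined.get(parent, 0.0), scores[child])
    return combined
```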
205

Semi-Automatic Image Annotation Tool

Alvenkrona, Miranda, Hylander, Tilda January 2023 (has links)
Annotation is essential in machine learning. Building an accurate object detection model requires a large, diverse dataset, which poses challenges due to the time-consuming nature of manual annotation. This thesis was made in collaboration with Project Ngulia, which aims at developing technical solutions to protect and monitor wild animals. A contribution of this work was to integrate an efficient semi-automatic image annotation tool within the Ngulia system, with the aim of streamlining the annotation process and improving the employed object detection models. Through research into available annotation tools, a custom tool was deemed the most cost-effective and flexible option. It utilizes object detection model predictions as annotation suggestions, improving the efficiency of the annotation process. The efficiency was evaluated through a user test, with participants achieving an average reduction of approximately 2 seconds in annotation time when utilizing suggestions. This reduction was shown to be statistically significant through a one-way ANOVA test. Additionally, it was investigated which images should be prioritized for annotation in order to obtain the most accurate predictions. Different sampling methods were investigated and compared. The performance of the obtained models remained relatively consistent, although the even-distribution method came out on top. This indicates that the choice of sampling method may not substantially impact the accuracy of the model, as the performance of the methods was relatively comparable. Moreover, different methods of selecting training data in the re-training process were compared. The difference in performance was considerably small, likely due to the limited and balanced data pool. The experiments did, however, indicate that incorporating previously seen data with unseen data could be beneficial, and that a reduced dataset can be sufficient. However, further investigation is required to fully understand the extent of these benefits.
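The significance test mentioned above can be reproduced with SciPy's one-way ANOVA. The timing arrays below are placeholders, not the study's data; only the statistical procedure is the same.

```python
from scipy.stats import f_oneway

# Hypothetical per-image annotation times (seconds), with and without suggestions.
with_suggestions = [6.1, 5.8, 6.4, 5.9, 6.2]
without_suggestions = [8.3, 7.9, 8.6, 8.1, 8.4]

f_stat, p_value = f_oneway(with_suggestions, without_suggestions)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 -> difference is significant
```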
206

The research of background removal applied to fashion data : The necessity analysis of background removal for fashion data

Liang, Junhui January 2022 (has links)
Fashion understanding is a hot topic in computer vision, with many applications having great business value in the market. It remains a difficult challenge for computer vision due to the immense diversity of garments and the wide range of scenes and backgrounds. In this work, we try to remove the background of fashion images to boost data quality and ultimately increase model performance. Because fashion images typically show clearly visible persons in full garments, we can utilize Salient Object Detection (SOD) to achieve the background removal of fashion data to our expectations. A fashion image with its background removed is referred to as a "rembg" image, in contrast to the original image in the fashion dataset. We conduct comparative experiments between these two types of images on multiple aspects of model training, including model architectures, model initialization, compatibility with other training tricks and data augmentations, and target task types. Our experiments suggest that background removal can work well for fashion data in simple and shallow networks that are not susceptible to overfitting. It can improve model accuracy by up to 5% in the classification of FashionStyle14 when training models from scratch. However, background removal does not perform well in deep networks due to its incompatibility with other regularization techniques such as batch normalization, pre-trained initialization, and data augmentations that introduce randomness. The loss of background pixels invalidates many existing training tricks during model training, adding the risk of overfitting for deep models.
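A minimal sketch of the background-removal step itself, assuming a saliency mask has already been produced by an SOD model. The white fill value, file names, and function name are assumptions for illustration, not the thesis pipeline.

```python
import numpy as np
from PIL import Image

def remove_background(image_path, mask_path, fill=255):
    """Paint every pixel the saliency mask marks as background with a flat fill."""
    img = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.uint8)
    mask = np.asarray(Image.open(mask_path).convert("L"), dtype=np.float32) / 255.0
    # Blend: keep foreground pixels, replace background with the fill value.
    out = img * mask[..., None] + fill * (1.0 - mask[..., None])
    return Image.fromarray(out.astype(np.uint8))

# remove_background("look.jpg", "look_saliency.png").save("look_rembg.jpg")
```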
207

Automation of Closed-Form and Spectral Matting Methods for Intelligent Surveillance Applications

Alrabeiah, Muhammad 16 December 2015 (has links)
Machine-driven analysis of visual data is the hard core of intelligent surveillance systems. Its main goal is to recognize different objects in the video sequence and their behaviour. Such operation is very challenging due to the dynamic nature of the scene and the lack of semantic comprehension of visual data in machines. The general flow of the recognition process starts with the object extraction task. For a long time, this task has been performed using image segmentation. However, recent years have seen the emergence of another contender, image matting. As a well-known process, matting has a very rich literature, most of which is devoted to interactive approaches for applications like movie editing. Thus, it was conventionally not considered for visual data analysis operations. Following the new shift toward matting as a means of object extraction, two methods have stood out for their foreground-extraction accuracy and, more importantly, their automation potential. These methods are Closed-Form Matting (CFM) and Spectral Matting (SM). They pose the matting process as either a constrained optimization problem or a segmentation-like component selection process. This difference of formulation stems from an interesting difference of perspective on the matting process, opening the door for more automation possibilities. Consequently, both of these methods have been the subject of some automation attempts that produced intriguing results. For their importance and potential, this thesis will provide detailed discussion and analysis of two of the most successful techniques proposed to automate the CFM and SM methods. In the beginning, focus will be on introducing the theoretical grounds of both matting methods as well as the automatic techniques. Then, it will shift toward a full analysis and assessment of the performance and implementation of these automation attempts. To conclude the thesis, a brief discussion on possible improvements will be presented, within which a hybrid technique is proposed to combine the best features of the two reviewed techniques. / Thesis / Master of Applied Science (MASc)
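For reference, the constrained optimization that Closed-Form Matting poses (and that Spectral Matting relaxes into an eigenvector/component-selection problem) can be written as below, where L is the matting Laplacian, D_S is a diagonal matrix selecting the constrained (scribbled) pixels, and b_S holds their foreground/background labels. This is the standard formulation from the matting literature, not anything specific to the automation techniques reviewed in the thesis.

```latex
\alpha^{*} \;=\; \arg\min_{\alpha}\; \alpha^{\top} L\, \alpha
          \;+\; \lambda\,(\alpha - b_{S})^{\top} D_{S}\,(\alpha - b_{S})
\quad\Longrightarrow\quad
(L + \lambda D_{S})\,\alpha^{*} \;=\; \lambda\, D_{S}\, b_{S}
```

Automating CFM amounts to generating the constraints (D_S, b_S) without user scribbles, while automating SM amounts to selecting which matting components (eigenvectors of L) belong to the foreground.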
208

Exploration of performance evaluation metrics with deep-learning-based generic object detection for robot guidance systems

Gustafsson, Helena January 2023 (has links)
Robots are often used within industry for automated tasks that are too dangerous, complex, or strenuous for humans, which leads to time and cost benefits. Robots can have an arm and a gripper to manipulate the world, and sensors as eyes to perceive it. Human vision can seem an effortless task, but machine vision requires substantial computation in an attempt to be as effective as human vision. Visual object recognition is a common goal for machine vision, and it is often approached using deep learning and generic object detection. This thesis focuses on robot guidance systems that include a robot with a gripper on its arm, a camera that acquires images of the world, boxes to detect in one or more layers, and the software that applies a generic object detection model to detect the boxes. The performance of robot guidance systems is affected by many variables, such as environmental, camera, object, and robot gripper aspects. A survey was conducted to gather feedback from professionals on what thresholds can be defined for a detection from the model to count as correct, with respect to whether the detection refers to an actual object that the robot needs to be able to pick up. This thesis implements precision, recall, average precision at a specific threshold, average precision over a range of thresholds, localization-recall-precision error, and a manually constructed measure, based on the survey results, of the robot's ability to pick up an object from the information provided by the detection, called the pickability score. The metrics from this thesis are implemented within a tool intended for analyzing different models' performance on varying datasets. The values of all the metrics for the applied dataset are presented in the results, and the metrics are discussed with regard to what information they convey in the context of a robot guidance system. The conclusion is to use each metric for what it does best on its own: the average precision metrics for performance evaluation of the models, and the pickability scores with extended features for evaluating robot gripper pickability.
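A compact sketch of how precision and recall are computed at a single IoU threshold, the building block behind the average-precision metrics listed above. The greedy one-to-one matching, box format (x1, y1, x2, y2), and 0.5 threshold are common conventions assumed here, not necessarily the exact evaluation protocol of the thesis.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(detections, ground_truths, iou_threshold=0.5):
    """detections: boxes sorted by descending confidence; greedy matching to ground truth."""
    matched, tp = set(), 0
    for det in detections:
        best_j = max(range(len(ground_truths)),
                     key=lambda j: iou(det, ground_truths[j]) if j not in matched else -1.0,
                     default=None)
        if best_j is not None and best_j not in matched \
                and iou(det, ground_truths[best_j]) >= iou_threshold:
            matched.add(best_j)
            tp += 1
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(ground_truths) if ground_truths else 0.0
    return precision, recall
```

Average precision then summarizes the precision-recall curve obtained by sweeping the confidence threshold, either at one IoU threshold or averaged over a range of thresholds.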
209

Smartphone Based Object Detection for Shark Spotting

Oliver, Darrick W 01 November 2023 (has links) (PDF)
Given concern over shark attacks in coastal regions, the use of unmanned aerial vehicles (UAVs), or drones, has increased in recent years to ensure the safety of beachgoers. However, much of city officials' process remains manual, with drone operation and review of footage still playing a significant role. In pursuit of a more automated solution, researchers have turned to neural networks to perform detection of sharks and other marine life. For on-device solutions, this has historically required assembling individual hardware components to form an embedded system that runs the machine learning model: the camera, neural processing unit, and central processing unit are purchased and assembled separately, require specific drivers, and involve a lengthy setup process. Addressing these issues, we look at the use of smartphones as a novel integrated solution for shark detection. This paper looks at using an iPhone 14 Pro as the driving force for a YOLOv5-based model and compares our results to previous literature on shark object detection. We find that our system outperforms previous methods, with both higher throughput and increased accuracy.
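A sketch of driving a YOLOv5-family detector from Python via the public torch.hub interface. The thesis deploys the model on-device on an iPhone, so this desktop-side snippet only illustrates the detector's interface; the custom weight file name and the frame path are placeholders.

```python
import torch

# Load a stock YOLOv5 model; a trained shark model would be loaded via 'custom'.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
# model = torch.hub.load("ultralytics/yolov5", "custom", path="shark_weights.pt")

results = model("beach_frame.jpg")   # accepts a path, URL, PIL image, or numpy array
detections = results.xyxy[0]         # tensor rows: [x1, y1, x2, y2, confidence, class]
print(results.pandas().xyxy[0])      # human-readable table of detections
```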
210

Edge Machine Learning for Wildlife Conservation : A part of the Ngulia project

Gotthard, Richard, Broström, Marcus January 2023 (has links)
The prominence of Edge Machine Learning is increasing swiftly as the performance of microcontrollers continues to improve. By deploying object detection and classification models on edge devices with camera sensors, it becomes possible to locate and identify objects in their vicinity. This technology finds valuable applications in wildlife conservation, particularly in camera traps used in African sanctuaries, and specifically in the Ngulia sanctuary, to monitor endangered species and provide early warnings for potential intruders. When an animal crosses the path of an edge device equipped with a camera sensor, an image is captured, and the animal's presence and identity are subsequently determined. The performance of three distinct object detection models is evaluated: SSD MobileNetV2, FOMO MobileNetV2, and YOLOv5. Furthermore, the compatibility of these models with three different microcontrollers, the ESP32 TimerCam from M5Stack, the Sony Spresense, and the LILYGO T-Camera S3 ESP32-S, is explored. The deployment of Over-The-Air updates to edge devices stationed in remote areas is presented, illustrating how an edge device, initially deployed with a model, can collect field data and be iteratively updated using an active learning pipeline. This project evaluates the performance of the three microcontrollers in conjunction with their respective camera sensors. A contribution of this work is a successful field deployment of a LILYGO T-Camera S3 ESP32-S running the FOMO MobileNetV2 model. The data captured by this setup feeds an active learning pipeline that can iteratively retrain the FOMO MobileNetV2 model and update the LILYGO T-Camera S3 ESP32-S with new firmware through Over-The-Air updates. / Project Ngulia
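A simplified sketch of the selection step in an active-learning loop of the kind described above: field images on which the current model is least confident are queued for expert labeling before the next retraining round and Over-The-Air update. The function names, threshold, and budget are assumptions, not part of the Ngulia pipeline.

```python
def select_for_labeling(images, detect, conf_threshold=0.6, budget=100):
    """Pick the field images the current model is least sure about.

    images: iterable of image identifiers or paths.
    detect: callable that returns a list of (label, confidence) pairs per image.
    """
    scored = []
    for img in images:
        confidences = [conf for _, conf in detect(img)]
        top = max(confidences, default=0.0)  # 0.0 when nothing is detected
        if top < conf_threshold:
            scored.append((top, img))
    scored.sort()  # least confident first
    return [img for _, img in scored[:budget]]
```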
