11

3D Gaze Estimation on RGB Images using Vision Transformers

Li, Jing January 2023 (has links)
Gaze estimation, a vital component in numerous applications such as human-computer interaction, virtual reality, and driver monitoring systems, is the process of predicting the direction of an individual's gaze. The predominant methods for gaze estimation can be broadly classified into intrusive and non-intrusive approaches. Intrusive methods necessitate the use of specialized hardware, such as eye trackers, while non-intrusive methods leverage images or recordings obtained from cameras to make gaze predictions. This thesis concentrates on appearance-based gaze estimation, specifically within the non-intrusive domain, employing various deep learning models. The primary focus of this study is to compare the efficacy of Vision Transformers (ViTs), a recently introduced architecture, with Convolutional Neural Networks (CNNs) for gaze estimation on RGB images. Performance of the models is evaluated on metrics such as angular gaze error, stimulus distance error, and model size. Within the realm of ViTs, two variants are explored: pure ViTs and hybrid ViTs, which combine CNN and ViT architectures; both variants are examined in different sizes. Experimental results demonstrate that all pure ViTs underperform the baseline ResNet-18 model, whereas the hybrid ViT consistently emerges as the best-performing model across all evaluation datasets. Nonetheless, whether to deploy the hybrid ViT or keep the baseline model remains an open question: an exceedingly large and slow model, albeit highly accurate, may not be the optimal solution, so the appropriate choice of model may vary depending on the specific use case.
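The angular gaze error used for evaluation above is conventionally the angle between the predicted and ground-truth 3D gaze vectors. A minimal sketch of that metric (an illustration of the standard formula, not code from the thesis):

```python
import numpy as np

def angular_error_deg(pred, target):
    """Angle in degrees between a predicted and a ground-truth 3D gaze vector."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    cos_sim = np.dot(pred, target) / (np.linalg.norm(pred) * np.linalg.norm(target))
    # Clip to guard against floating-point drift outside [-1, 1] before arccos.
    return np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))
```

A perfectly aligned prediction yields 0°, while orthogonal vectors yield 90°; per-sample errors are usually averaged over a test set.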
12

GVT-BDNet: Convolutional Neural Network with Global Voxel Transformer Operators for Building Damage Assessment

Remondini, Leonardo January 2021 (has links)
Natural disasters strike anywhere, disrupting local communication and transportation infrastructure and making the process of assessing specific local damage difficult, dangerous, and slow. The goal of Building Damage Assessment (BDA) is to quickly and accurately estimate the location, cause, and severity of damage in order to maximize the efficiency of rescuers and the number of lives saved. In current machine learning BDA solutions, attention operators are the most recent innovation adopted by researchers to increase the generalizability and overall performance of Convolutional Neural Networks for the BDA task. However, these operators are typically tailored to a specific task and a specific neural network architecture, making them hard to apply to other scenarios. In our research, we want to contribute to the BDA literature while also addressing this limitation. We propose Global Voxel Transformer Operators (GVTOs): flexible attention operators originally proposed for Augmented Microscopy that can replace up-sampling, down-sampling, and size-preserving convolutions within either a U-Net or a general CNN architecture without any limitation. Unlike local operators such as convolutions, GVTOs can aggregate global information and have input-specific weights at inference time, improving generalizability, as recent literature has shown. We applied GVTOs to a state-of-the-art BDA model and named the result GVT-BDNet. We trained and evaluated the proposed network on the xBD dataset, the largest and most complete dataset for BDA. We compared GVT-BDNet's performance with the baseline architecture (BDNet) and observed that the former improves damaged-building segmentation by a factor of 0.11. Moreover, GVT-BDNet achieves state-of-the-art performance on a 10% split of the xBD training dataset and on the xBD test dataset, with overall F1-scores of 0.80 and 0.79, respectively. To evaluate architectural consistency, we also tested BDNet's and GVT-BDNet's generalizability on another segmentation task, Tree & Shadow segmentation. Both models achieved good overall performance, scoring F1-scores of 0.79 and 0.785, respectively.
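The F1-scores reported above are computed over segmentation masks. A minimal sketch of pixel-wise F1 between two binary masks (an illustration of the standard metric, not the xBD scoring code):

```python
import numpy as np

def pixel_f1(pred_mask, gt_mask):
    """Pixel-wise F1 between a predicted and a ground-truth binary mask."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = np.logical_and(pred, gt).sum()      # correctly predicted foreground
    fp = np.logical_and(pred, ~gt).sum()     # false foreground
    fn = np.logical_and(~pred, gt).sum()     # missed foreground
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For multi-class damage levels, an overall score is typically a (possibly weighted) average of the per-class F1 values.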
13

Bidirectional LSTM-CNNs-CRF Models for POS Tagging

Tang, Hao January 2018 (has links)
To achieve state-of-the-art performance in part-of-speech (POS) tagging, traditional systems require a significant amount of hand-crafted features and data pre-processing. In this thesis, we present a hybrid neural network architecture combining discriminative word embeddings, character embeddings, and byte pair encoding (BPE) to implement a true end-to-end system without feature engineering or data pre-processing. The architecture is a combination of bidirectional LSTMs, CNNs, and a CRF, which can achieve state-of-the-art performance on a wide range of sequence labeling tasks. We evaluate our model on the Universal Dependencies (UD) datasets for English, Spanish, and German POS tagging. It outperforms other models with 95.1%, 98.15%, and 93.43% accuracy on the respective test sets. Moreover, the largest improvements of our model appear on out-of-vocabulary corpora for Spanish and German. According to statistical significance testing, the improvements for English on the test and out-of-vocabulary corpora are not statistically significant, whereas the improvements for the other, more morphologically rich languages are statistically significant on their corresponding corpora.
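The CRF layer on top of the BiLSTM-CNN encoder selects the highest-scoring tag sequence, which is done with Viterbi decoding. A minimal sketch with toy emission and transition scores (an illustration of the decoding step, not the thesis implementation):

```python
def viterbi(emissions, transitions):
    """Best tag path. emissions: T x K scores; transitions: K x K (from -> to)."""
    T, K = len(emissions), len(emissions[0])
    score = list(emissions[0])   # best score ending in each tag at step 0
    backptr = []
    for t in range(1, T):
        new_score, ptr = [], []
        for j in range(K):
            # best previous tag i for current tag j
            i = max(range(K), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[i] + transitions[i][j] + emissions[t][j])
            ptr.append(i)
        score, backptr = new_score, backptr + [ptr]
    best = max(range(K), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(backptr):   # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]
```

With a strong negative transition score between two tags, the decoder avoids that tag bigram even when the local emission favors it; this is what lets the CRF enforce sequence-level consistency.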
14

E-scooter Rider Detection System in Driving Environments

Apurv, Kumar 08 1900 (has links)
Indianapolis / E-scooters are ubiquitous and their number keeps escalating, increasing their interactions with other vehicles on the road. E-scooter riders exhibit atypical behavior that varies enormously from that of other vulnerable road users, creating new challenges for vehicle active safety systems and automated driving functionalities. The detection of e-scooter riders by other vehicles is the first step in mitigating these risks. This research presents a novel vision-based system to differentiate between e-scooter riders and regular pedestrians, along with a benchmark dataset of e-scooter riders in natural environments. An efficient system pipeline built using two existing state-of-the-art convolutional neural networks (CNNs), You Only Look Once (YOLOv3) and MobileNetV2, performs detection of these vulnerable e-scooter riders.
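A two-stage pipeline like the one described (a detector proposing person boxes, a classifier refining them) typically matches and scores boxes by intersection-over-union. A minimal IoU sketch (a generic illustration, not the authors' code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (empty if the boxes do not intersect).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Benchmarks usually count a detection as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.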
15

HBONEXT: AN EFFICIENT DNN FOR LIGHT EDGE EMBEDDED DEVICES

Sanket Ramesh Joshi (10716561) 10 May 2021 (has links)
Every year, the most effective deep learning models and CNN architectures are showcased based on their compatibility and performance on embedded edge hardware, especially for applications like image classification. These deep learning models necessitate a significant amount of computation and memory, so they can only be used on high-performance computing systems such as CPUs or GPUs, and they often struggle to meet portability requirements due to resource, energy, and real-time constraints. Hardware accelerators have recently been designed to provide the computational resources that AI and machine learning tools need; these edge accelerators offer high-performance hardware that helps maintain the precision needed for the task. Furthermore, the image classification problem has benefited from the inclusion of bottleneck modules, which capture channel interdependencies using either depth-wise or group-wise convolutional features. Because of its increasing use in portable applications, the classic inverted residual block, a well-known architectural technique, has gained recognition. This work takes it a step further by introducing a design method for porting CNNs to low-resource embedded systems, essentially bridging the gap between deep learning models and embedded edge systems. To achieve these goals, we use computing strategies closer to the edge to reduce computational load and memory usage while retaining excellent deployment efficiency. This thesis introduces HBONext, a modified version of Harmonious Bottlenecks (DHbneck) combined with a Flipped version of the Inverted Residual (FIR), which outperforms the existing HBONet architecture in terms of accuracy and model size. Unlike the current definition of the inverted residual, the FIR block performs identity mapping and spatial transformation at its higher dimensions. The HBO solution, on the other hand, focuses on two orthogonal dimensions: spatial (H/W) contraction-expansion and later channel (C) expansion-contraction, both organized in a bilaterally symmetric manner. HBONext is a version designed specifically for embedded and mobile applications. In this research work, we also show how to build a real-time HBONext image classifier on the NXP Bluebox 2.0. The integration of the model into this hardware was successful owing to the limited model size of 3 MB. The model was trained and validated on the CIFAR10 dataset and performed exceptionally well given its smaller size and higher accuracy. The validation accuracy of the baseline HBONet architecture is 80.97% at a model size of 22 MB, whereas the proposed HBONext variants achieve a higher validation accuracy of 89.70% at a model size of 3.00 MB, measured via the number of parameters. The performance metrics of the HBONext architecture and its variants are compared in the following chapters.
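The model sizes quoted above (3 MB vs. 22 MB) are derived from parameter counts. A sketch of why depth-wise (separable) convolutions, the building block behind bottleneck designs like this, shrink models so much, assuming float32 weights (a generic illustration, not the thesis's accounting):

```python
def conv_params(c_in, c_out, k, depthwise=False, bias=False):
    """Parameter count of a 2D conv layer; depthwise=True means a depthwise
    k x k convolution followed by a 1x1 pointwise projection to c_out."""
    if depthwise:
        p = c_in * k * k + c_in * c_out   # depthwise filters + pointwise mix
    else:
        p = c_in * c_out * k * k          # one k x k filter per (in, out) pair
    if bias:
        p += c_out
    return p

def size_mb(num_params, bytes_per_param=4):
    """Approximate model size in MB assuming float32 (4-byte) weights."""
    return num_params * bytes_per_param / 1e6

# A standard 3x3 conv from 64 to 128 channels vs. its separable counterpart:
standard = conv_params(64, 128, 3)                   # 73,728 parameters
separable = conv_params(64, 128, 3, depthwise=True)  # 8,768 parameters
```

The separable variant here needs roughly 8x fewer parameters for the same channel transformation, which is the kind of saving that lets a classifier fit a 3 MB budget.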
16

Bee Shadow Recognition in Video Analysis of Omnidirectional Bee Traffic

Alavala, Laasya 01 August 2019 (has links)
Over a decade ago, beekeepers noticed that bees were dying or disappearing without any prior health disorder. Colony Collapse Disorder (CCD) has been a major threat to bee colonies around the world, affecting vital human crop pollination. Possible instigators of CCD include viral and fungal diseases, decreased genetic diversity, pesticides, and a variety of other factors; interactions among any of these potential facets may result in immunity loss for honey bees and an increased likelihood of collapse. It is essential to rescue honey bees and improve the health of bee colonies. Monitoring the traffic of bees helps to track the status of a hive remotely. An electronic beehive monitoring system extracts video, audio, and temperature data without causing any interruption to the beehives; this data can provide vital information on colony behavior and health. This research uses artificial intelligence and computer vision methodologies to develop and analyze technologies for monitoring omnidirectional bee traffic of hives without disrupting the colony. Bee traffic means the number of bees moving in a given area in front of the hive over a given period of time. Forager traffic is the number of bees entering and/or leaving the hive over time; it is a significant component in monitoring food availability and demand, colony age structure, and the impact of pests and diseases on hives. The goal of this research is to estimate and keep track of bee traffic by eliminating unnecessary information from video samples.
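"Eliminating unnecessary information from video samples" in motion-based traffic monitoring often starts with something as simple as frame differencing: pixels that do not change between frames carry no traffic signal. A toy sketch of that idea (my illustrative assumption; the thesis's actual pipeline is more elaborate):

```python
import numpy as np

def motion_mask(prev_frame, frame, threshold=25):
    """Binary mask of pixels whose grayscale intensity changed by more than threshold."""
    diff = np.abs(frame.astype(int) - prev_frame.astype(int))
    return diff > threshold

def motion_fraction(prev_frame, frame, threshold=25):
    """Fraction of pixels in motion -- a crude proxy for traffic in a region."""
    return motion_mask(prev_frame, frame, threshold).mean()
```

Regions with a near-zero motion fraction can be discarded before running any heavier detector, focusing computation on the area in front of the hive entrance.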
18

Modeling the intronic regulation of Alternative Splicing using Deep Convolutional Neural Nets

Linder, Johannes January 2015 (has links)
This paper investigates the use of deep Convolutional Neural Networks for modeling the intronic regulation of Alternative Splicing on the basis of DNA sequence. By training the CNN on massively parallel synthetic DNA libraries of alternative 5'-splicing and alternatively skipped exon events, the model is capable of predicting the relative abundance of alternatively spliced mRNA isoforms on held-out library data with very high accuracy (R² = 0.77 for alt. 5'-splicing). Furthermore, the CNN is shown to generalize alternative splicing across cell lines efficiently. The CNN is tested against a logistic regression (LR) model, and the results show that while the CNN's prediction accuracy on the synthetic library is notably higher, it is worse at generalizing to new intronic contexts. Tests on non-synthetic human SNP genes suggest the CNN is dependent on the relative position of the intronic region it was trained on, a problem which is alleviated with LR. The increased library prediction accuracy of the CNN compared to logistic regression is concluded to come from the non-linearity introduced by the deep layer architecture: it adds the capacity to model complex regulatory interactions and combinatorial RBP effects, which studies have shown largely affect alternative splicing. However, the architecture makes the CNN hard to interpret, as the regulatory interactions are encoded deep within the layers. Nevertheless, high-performance modeling of alternative splicing using CNNs may still prove useful in numerous synthetic biology applications, for example to model differentially spliced genes as is done in this paper.
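The R² = 0.77 figure above is the coefficient of determination between predicted and measured isoform abundances. A minimal sketch of that metric (the standard formula, not the paper's evaluation code):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination R^2: 1 minus residual over total variance."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect predictor scores 1.0; a predictor that always outputs the mean of the targets scores 0.0, so 0.77 means the model explains most, but not all, of the variance in splicing ratios.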
19

Visual Flow Analysis and Saliency Prediction

Srinivas, Kruthiventi S S January 2016 (has links) (PDF)
Nowadays, millions of cameras in public places such as traffic junctions and railway stations capture video data around the clock. This humongous amount of data has resulted in an increased need for automation of visual surveillance, and analysis of crowd and traffic flows is an important step towards achieving this goal. In this work, we present our algorithms for identifying and segmenting dominant flows in surveillance scenarios. In the second part, we present our work on predicting visual saliency. The ability of humans to discriminate and selectively pay attention to a few regions of a scene over the others is a key attentional mechanism; here, we present our algorithms for predicting human eye fixations and segmenting salient objects in a scene. (i) Flow Analysis in Surveillance Videos: We propose algorithms for segmenting flows of static and dynamic nature in surveillance videos in an unsupervised manner. In static flow scenarios, we assume the motion patterns to be consistent over the entire duration of the video and analyze them in the compressed domain using H.264 motion vectors. Our approach is based on modeling the motion vector field as a Conditional Random Field (CRF) and obtaining oriented motion segments, which are merged to obtain the final flow segments. This compressed-domain approach is shown to be both accurate and computationally efficient. In the case of dynamic flow videos (e.g. flows at a traffic junction), we propose a method for segmenting individual object flows over long durations. This long-term flow segmentation is achieved in the CRF framework using local color and motion features. We propose a Dynamic Time Warping (DTW) based distance measure between flow segments for clustering them and generating representative dominant flow models. Using these dominant flow models, we perform path prediction for vehicles entering the camera's field of view and detect anomalous motions. (ii) Visual Saliency Prediction using Deep Convolutional Neural Networks: We propose a deep fully convolutional neural network, DeepFix, for accurately predicting eye fixations in the form of saliency maps. Unlike classical works, which characterize the saliency map using various hand-crafted features, our model automatically learns features in a hierarchical fashion and predicts the saliency map in an end-to-end manner. DeepFix is designed to capture visual semantics at multiple scales while taking global context into account. Generally, fully convolutional nets are spatially invariant, which prevents them from modeling location-dependent patterns (e.g. centre bias); our network overcomes this limitation by incorporating a novel Location Biased Convolutional layer. We experimentally show that our network outperforms other recent approaches by a significant margin. In general, human eye fixations correlate with the locations of salient objects in a scene, yet only a handful of approaches have attempted to simultaneously address these related aspects of eye fixations and object saliency. We therefore also propose a deep convolutional network capable of simultaneously predicting eye fixations and segmenting salient objects in a unified framework. We design the initial network layers, shared between both tasks, to capture the global contextual aspects of saliency, while the deeper layers address task-specific aspects. Our network shows a significant improvement over the current state of the art for both eye fixation prediction and salient object segmentation across a number of challenging datasets.
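The flow-segment clustering above relies on a Dynamic Time Warping (DTW) distance. A minimal DTW sketch over 1-D sequences (the thesis operates on richer flow features; this only illustrates the alignment idea):

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences,
    using absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # A step may advance either sequence or both (warping).
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because DTW aligns sequences of different lengths and speeds, two trajectories following the same path at different velocities get a small distance, which is exactly what clustering flow segments into dominant flows needs.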
20

Spatio-Temporal Networks for Human Activity Recognition based on Optical Flow in Omnidirectional Image Scenes

Seidel, Roman 29 February 2024 (has links)
The ability of human beings to perceive the environment around them with their visual system is called motion perception, meaning that the attention of our visual system is primarily focused on objects that are moving. This property of human motion perception is used in this dissertation to infer human activity from data using artificial neural networks. One of the main aims of this thesis is to discover which modalities, namely RGB images, optical flow, and human keypoints, are best suited for human activity recognition (HAR) in omnidirectional data. Since these modalities are not yet available for omnidirectional cameras, they are synthetically generated and additionally captured with an omnidirectional camera. During data generation, a distinction is made between a synthetically generated omnidirectional dataset and a real omnidirectional dataset that was recorded in a Living Lab at Chemnitz University of Technology and subsequently annotated by hand. The synthetically generated dataset, called OmniFlow, consists of RGB images, optical flow in forward and backward directions, segmentation masks, bounding boxes for the class "people", and human keypoints. The real-world dataset, OmniLab, contains RGB images from two top-view scenes as well as manually annotated human keypoints and estimated forward optical flow. This thesis explains the generation of both the synthetic and the real-world dataset. The OmniFlow dataset is generated using the 3D rendering engine Blender, in which a fully configurable 3D indoor environment is created with artificially textured rooms, human activities, objects, and different lighting scenarios. A randomly placed virtual camera following the omnidirectional camera model renders the RGB images, all other modalities, and 15 predefined activities. The result of modelling the 3D indoor environment is the OmniFlow dataset. Due to the lack of omnidirectional optical flow data, OmniFlow is validated using Test-Time Augmentation (TTA).
Compared to the baseline, Recurrent All-Pairs Field Transforms (RAFT) trained on the FlyingChairs and FlyingThings3D datasets, it was found that only about 1000 images are needed for fine-tuning to obtain a very low End-point Error (EE). Furthermore, it was shown that TTA affects the EE on the OmniFlow test set by about a factor of three. As a basis for generating artificial keypoints on OmniFlow with action labels, the Carnegie Mellon University motion capture database is used, with a large number of sports and household activities provided as skeletal data in the BVH format. From the BVH skeletal data, the skeletal points of the people performing the activities can be derived directly, or extrapolated by projecting these points from the 3D world into an omnidirectional 2D image. The real-world dataset, OmniLab, was recorded in two rooms of the Living Lab with five different people mimicking the 15 actions of OmniFlow. Human keypoint annotations were added manually in two iterations to reduce the rate of incorrect annotations. Activity-level evaluation was carried out using a Temporal Segment Network (TSN) and a PoseC3D network. The TSN consists of two CNNs: a spatial component trained on RGB images and a temporal component trained on the dense optical flow fields of OmniFlow. The PoseC3D network, an approach to skeleton-based activity recognition, uses a heatmap stack of keypoints in combination with 3D convolution, making the network more effective at learning spatio-temporal features than methods based on 2D convolution. In a first step, the networks were trained and validated on the synthetically generated OmniFlow dataset. In a second step, training was performed on OmniFlow and validation on the real-world OmniLab dataset. For both networks, TSN and PoseC3D, three hyperparameters were varied and the top-1, top-5, and mean accuracy reported.
First, the learning rate of stochastic gradient descent (SGD) was varied. Second, the clip length, i.e. the number of consecutive frames used for training the network, was varied, and third, the spatial resolution of the input data was varied. For the spatial resolution variation, five different image sizes were generated by cropping the original OmniFlow and OmniLab datasets. It was found that keypoint-based HAR with PoseC3D performed best compared to activity classification based on optical flow and RGB images: the top-1 accuracy was 0.3636, the top-5 accuracy 0.7273, and the mean accuracy 0.3750, showing that the most appropriate output resolution is 128 px × 128 px and that the clip length should be at least 24 consecutive frames. The best results were achieved with a PoseC3D learning rate of 10^-3. In addition, confusion matrices indicating the class-wise accuracy of the 15 activity classes are given for the modalities RGB images, optical flow, and human keypoints. The confusion matrix for the RGB modality shows the best classification result of the TSN for the action "walk" with an accuracy of 1.00, but almost all other actions are also classified as walking in the real-world data. Classification of human actions based on optical flow works best for the actions "sit in chair" and "stand up" with an accuracy of 1.00, and "walk" with 0.50. Furthermore, it is noticeable that almost all actions are classified as "sit in chair" and "stand up", which indicates that the classes are poorly separated in this modality, so that the TSN is not able to distinguish between the selected action classes. Validated on real-world data for the keypoint modality, the actions "rugpull" (1.00) and "cleaning windows" (0.75) perform best.
The PoseC3D network operating on a time series of human keypoints is therefore less sensitive to variations in viewing angle between the synthetic and real-world data than the RGB-image and optical-flow modalities. The pipeline for generating synthetic data needs to be revisited in future work with regard to a more uniform distribution of motion magnitudes: random placement of the person and other objects is not sufficient to cover all movement magnitudes. A further improvement of the synthetic data could be rotating the person around their own axis, so that the person moves in a different direction while performing the activity and the movement magnitudes thus contain more variance. Furthermore, the domain transition between synthetic and real-world data should be investigated further in terms of viewpoint invariance and augmentation methods. It may be necessary to generate a new synthetic dataset with only top-view data and to re-train the TSN and PoseC3D. As an augmentation method, Fourier Domain Adaptation (FDA), for example, could reduce the domain gap between the synthetically generated and the real-world dataset.
Contents: 1 Introduction; 2 Theoretical Background; 3 Related Work; 4 Omnidirectional Synthetic Human Optical Flow; 5 Human Keypoints for Pose in Omnidirectional Images; 6 Human Activity Recognition in Indoor Scenarios; 7 Conclusion and Future Work; A Chapter 4: Flow Dataset Statistics; B Chapter 5: 3D Rotation Matrices; C Chapter 6: Network Training Parameters
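The top-1 and top-5 accuracies reported for TSN and PoseC3D are instances of top-k accuracy over per-class score vectors. A minimal sketch (the standard metric, not the dissertation's evaluation code):

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.
    scores: list of per-class score lists; labels: list of true class indices."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```

Top-5 being much higher than top-1 (0.7273 vs. 0.3636 above) indicates the correct activity is usually among the model's top guesses even when its single best guess is wrong.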
