1

GVT-BDNet : Convolutional Neural Network with Global Voxel Transformer Operators for Building Damage Assessment / GVT-BDNet : Convolutional Neural Network med Global Voxel Transformer Operators för Building Damage Assessment

Remondini, Leonardo January 2021 (has links)
Natural disasters strike anywhere, disrupting local communication and transportation infrastructure and making the process of assessing specific local damage difficult, dangerous, and slow. The goal of Building Damage Assessment (BDA) is to quickly and accurately estimate the location, cause, and severity of damage in order to maximize the efficiency of rescuers and the number of lives saved. In current machine learning BDA solutions, attention operators are the most recent innovation adopted by researchers to increase the generalizability and overall performance of Convolutional Neural Networks for the BDA task. However, these attention operators are typically tailored to a specific task and a specific neural network architecture, which makes them hard to apply to other scenarios. In our research, we want to contribute to the BDA literature while also addressing this limitation. We propose Global Voxel Transformer Operators (GVTOs): flexible attention operators originally proposed for Augmented Microscopy that can replace up-sampling, down-sampling, and size-preserving convolutions within either a U-Net or a general CNN architecture without any limitation. Unlike local operators such as convolutions, GVTOs can aggregate global information and have input-specific weights at inference time, improving generalizability, as recent literature has already shown. We applied GVTOs to a state-of-the-art BDA model and named the result GVT-BDNet. We trained and evaluated the proposed neural network on the xBD dataset, the largest and most complete dataset for BDA. We compared GVT-BDNet's performance with the baseline architecture (BDNet) and observed that the former improves damaged-building segmentation by a factor of 0.11. Moreover, GVT-BDNet achieves state-of-the-art performance on a 10% split of the xBD training dataset and on the xBD test dataset, with overall F1-scores of 0.80 and 0.79, respectively. To evaluate the consistency of the architecture, we also tested BDNet's and GVT-BDNet's generalizability on another segmentation task: Tree & Shadow segmentation. Results showed that both models achieved good overall performance, with F1-scores of 0.79 and 0.785, respectively. / Natural disasters strike everywhere, disrupting local communication and transportation infrastructure and making the assessment of specific local damage difficult, dangerous, and slow. The goal of Building Damage Assessment (BDA) is to quickly and precisely estimate the location, cause, and severity of the damage in order to maximize the effectiveness of rescuers and the number of lives saved. Current BDA solutions use Convolutional Neural Networks (CNNs) and ad hoc attention operators to improve generalization performance. Recently proposed attention operators, however, are tailored to the specific task and may lack flexibility for other scenarios or neural network architectures. In our research, we contribute to the BDA literature by proposing Global Voxel Transformer Operators (GVTOs): flexible attention operators that can be applied to a CNN architecture without being tied to a particular task. Recent literature also shows that they can increase the extraction of global information and thereby generalization performance. We applied GVTOs to a state-of-the-art CNN model for BDA. GVTOs improved damage segmentation performance by a factor of 0.11. Moreover, they improved the state of the art on the xBD test dataset and reached state-of-the-art performance on a 10% split of the xBD training dataset. We have also evaluated the generalizability of the proposed neural network on another segmentation task (Tree & Shadow segmentation), where it achieved good overall performance.
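The abstract above does not include code; as a rough illustration only, the following PyTorch sketch shows a size-preserving global attention operator in the spirit of the GVTOs it describes: attention weights are computed from the input itself, and every spatial position can aggregate information from all others, so the module could stand in for a size-preserving convolution inside a U-Net block. The module name `GlobalAttention2d`, the 1x1 projections, and the residual connection are illustrative assumptions, not the GVT-BDNet implementation.

```python
# A minimal sketch (not the thesis code) of a global, input-dependent attention
# operator that preserves spatial size, in the spirit of a GVTO replacing a
# size-preserving convolution inside a U-Net block. Names are hypothetical.
import torch
import torch.nn as nn

class GlobalAttention2d(nn.Module):
    """Size-preserving global attention over all spatial positions."""
    def __init__(self, channels: int, key_dim: int = 32):
        super().__init__()
        self.query = nn.Conv2d(channels, key_dim, kernel_size=1)
        self.key = nn.Conv2d(channels, key_dim, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = key_dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)      # (B, HW, key_dim)
        k = self.key(x).flatten(2)                        # (B, key_dim, HW)
        v = self.value(x).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # weights depend on the input
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                    # residual keeps the original signal

# Usage: drop-in replacement for a 3x3 size-preserving convolution block.
x = torch.randn(1, 64, 32, 32)
print(GlobalAttention2d(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```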
2

Visual Flow Analysis and Saliency Prediction

Srinivas, Kruthiventi S S January 2016 (has links) (PDF)
Nowadays, we have millions of cameras in public places such as traffic junctions, railway stations, etc., capturing video data round the clock. This enormous volume of data has resulted in an increased need for automation of visual surveillance. Analysis of crowd and traffic flows is an important step towards achieving this goal. In this work, we present our algorithms for identifying and segmenting dominant flows in surveillance scenarios. In the second part, we present our work on predicting visual saliency. The ability of humans to discriminate and selectively pay attention to a few regions of the scene over the others is a key attentional mechanism. Here, we present our algorithms for predicting human eye fixations and segmenting salient objects in the scene. (i) Flow Analysis in Surveillance Videos: We propose algorithms for segmenting flows of static and dynamic nature in surveillance videos in an unsupervised manner. In static flow scenarios, we assume the motion patterns to be consistent over the entire duration of the video and analyze them in the compressed domain using H.264 motion vectors. Our approach is based on modeling the motion vector field as a Conditional Random Field (CRF) and obtaining oriented motion segments, which are merged to obtain the final flow segments. This compressed-domain approach is shown to be both accurate and computationally efficient. In the case of dynamic flow videos (e.g. flows at a traffic junction), we propose a method for segmenting the individual object flows over long durations. This long-term flow segmentation is achieved in a CRF framework using local color and motion features. We propose a Dynamic Time Warping (DTW) based distance measure between flow segments for clustering them and generating representative dominant flow models. Using these dominant flow models, we perform path prediction for vehicles entering the camera's field of view and detect anomalous motions. (ii) Visual Saliency Prediction using Deep Convolutional Neural Networks: We propose a deep fully convolutional neural network (CNN), DeepFix, for accurately predicting eye fixations in the form of saliency maps. Unlike classical works, which characterize the saliency map using various hand-crafted features, our model automatically learns features in a hierarchical fashion and predicts the saliency map in an end-to-end manner. DeepFix is designed to capture visual semantics at multiple scales while taking global context into account. Generally, fully convolutional nets are spatially invariant, which prevents them from modeling location-dependent patterns (e.g. centre-bias). Our network overcomes this limitation by incorporating a novel Location Biased Convolutional layer. We experimentally show that our network outperforms other recent approaches by a significant margin. In general, human eye fixations correlate with locations of salient objects in the scene. However, only a handful of approaches have attempted to simultaneously address these related aspects of eye fixations and object saliency. In our work, we also propose a deep convolutional network capable of simultaneously predicting eye fixations and segmenting salient objects in a unified framework. We design the initial network layers, shared between both tasks, such that they capture the global contextual aspects of saliency, while the deeper layers of the network address task-specific aspects.
Our network shows a significant improvement over the current state-of-the-art for both eye fixation prediction and salient object segmentation across a number of challenging datasets.
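The abstract mentions a DTW-based distance between flow segments used for clustering. As a hedged illustration, the NumPy sketch below computes a standard DTW distance between two 2-D trajectories; the function name `dtw_distance` and the Euclidean local cost are assumptions rather than the thesis's exact formulation.

```python
# A minimal sketch (not the thesis implementation) of a DTW distance between two
# flow trajectories, usable as the pairwise distance when clustering flow segments.
import numpy as np

def dtw_distance(traj_a: np.ndarray, traj_b: np.ndarray) -> float:
    """Dynamic Time Warping distance between (N, 2) and (M, 2) point sequences."""
    n, m = len(traj_a), len(traj_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])  # local point-to-point cost
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    return float(acc[n, m])

# Example: two similar trajectories sampled at different rates.
a = np.stack([np.linspace(0, 10, 20), np.linspace(0, 5, 20)], axis=1)
b = np.stack([np.linspace(0, 10, 35), np.linspace(0, 5, 35)], axis=1)
print(dtw_distance(a, b))
```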
3

3D Gaze Estimation on RGB Images using Vision Transformers

Li, Jing January 2023 (has links)
Gaze estimation, a vital component in numerous applications such as human-computer interaction, virtual reality, and driver monitoring systems, is the process of predicting the direction of an individual's gaze. The predominant methods for gaze estimation can be broadly classified into intrusive and non-intrusive approaches. Intrusive methods necessitate the use of specialized hardware, such as eye trackers, while non-intrusive methods leverage images or recordings obtained from cameras to make gaze predictions. This thesis concentrates on appearance-based gaze estimation, specifically within the non-intrusive domain, employing various deep learning models. The primary focus of this study is to compare the efficacy of Vision Transformers (ViTs), a recently introduced architecture, with Convolutional Neural Networks (CNNs) for gaze estimation on RGB images. Performance evaluations of the models are conducted based on metrics such as the angular gaze error, stimulus distance error, and model size. Within the realm of ViTs, two variants are explored: pure ViTs and hybrid ViTs, which combine both CNN and ViT architectures. Throughout the project, both variants are examined in different sizes. Experimental results demonstrate that all pure ViTs underperform in comparison to the baseline ResNet-18 model. However, the hybrid ViT consistently emerges as the best-performing model across all evaluation datasets. Nonetheless, the question of whether to deploy the hybrid ViT or stick with the baseline model remains unresolved. This uncertainty arises because utilizing an exceedingly large and slow model, albeit highly accurate, may not be the optimal solution. Hence, the selection of an appropriate model may vary depending on the specific use case. / Gaze estimation, a crucial component in several applications such as human-computer interaction, virtual reality, and driver monitoring systems, is the process of predicting the direction of an individual's gaze. The predominant methods for gaze estimation can broadly be divided into intrusive and non-intrusive approaches. Intrusive methods require the use of specialized hardware, such as eye trackers, while non-intrusive methods use images or recordings obtained from cameras to estimate the gaze. This thesis focuses on appearance-based gaze estimation, specifically within the non-intrusive domain, using various deep learning models. The main focus of the study is to compare the effectiveness of Vision Transformers (ViTs), a recently introduced architecture, with Convolutional Neural Networks (CNNs) for gaze estimation on RGB images. Performance evaluations of the models are carried out based on metrics such as angular gaze error, stimulus distance error, and model size. Within the ViT family, two variants are explored: pure ViTs and hybrid ViTs, which combine both CNN and ViT architectures. During the project, both variants are examined in different sizes. Experimental results show that all pure ViTs perform worse than the ResNet-18 baseline. The hybrid ViT, however, consistently stands out as the best-performing model across all evaluation datasets. The discussion of whether the hybrid ViT should be used, or whether one should stick with the baseline model, nevertheless remains unresolved. This uncertainty arises because using an extremely large and slow model, even if it is very accurate, may not be the optimal solution. The choice of an appropriate model may therefore vary depending on the specific use case.
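The abstract describes hybrid ViTs that combine a CNN with a transformer and evaluation by angular gaze error. The PyTorch sketch below is a minimal, assumed illustration of that pattern, not the thesis architecture: a convolutional stem produces tokens for a transformer encoder, a linear head regresses a unit 3-D gaze vector, and `angular_error_deg` computes the angle between predicted and ground-truth directions. Layer sizes, names, and the pooling choice are hypothetical.

```python
# A minimal sketch (not the thesis architecture) of a hybrid CNN + Transformer
# gaze regressor with an angular-error metric. All sizes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridGazeViT(nn.Module):
    def __init__(self, d_model: int = 128, nhead: int = 4, depth: int = 2):
        super().__init__()
        self.stem = nn.Sequential(                      # small CNN feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(d_model, 3)               # regress a 3D gaze direction

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.stem(img)                           # (B, d_model, H', W')
        tokens = feat.flatten(2).transpose(1, 2)        # (B, H'*W', d_model)
        encoded = self.encoder(tokens).mean(dim=1)      # average-pool the tokens
        return F.normalize(self.head(encoded), dim=-1)  # unit gaze vector

def angular_error_deg(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Angle in degrees between predicted and ground-truth gaze vectors."""
    cos = F.cosine_similarity(pred, target, dim=-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos))

# Example forward pass on a batch of face crops.
model = HybridGazeViT()
gaze = model(torch.randn(2, 3, 64, 64))
print(gaze.shape, angular_error_deg(gaze, torch.tensor([[0.0, 0.0, -1.0]] * 2)))
```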
