Spelling suggestions: "subject:"disision transformer"" "subject:"decisision transformer""
11 |
Visual Transformers for 3D Medical Images Classification: Use-Case Neurodegenerative DisordersKhorramyar, Pooriya January 2022 (has links)
A Neurodegenerative Disease (ND) is progressive damage to brain neurons, which the human body cannot repair or replace. The well-known examples of such conditions are Dementia and Alzheimer’s Disease (AD), which affect millions of lives each year. Although conducting numerous researches, there are no effective treatments for the mentioned diseases today. However, early diagnosis is crucial in disease management. Diagnosing NDs is challenging for neurologists and requires years of training and experience. So, there has been a trend to harness the power of deep learning, including state-of-the-art Convolutional Neural Network (CNN), to assist doctors in diagnosing such conditions using brain scans. The CNN models lead to promising results comparable to experienced neurologists in their diagnosis. But, the advent of transformers in the Natural Language Processing (NLP) domain and their outstanding performance persuaded Computer Vision (CV) researchers to adapt them to solve various CV tasks in multiple areas, including the medical field. This research aims to develop Vision Transformer (ViT) models using Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset to classify NDs. More specifically, the models can classify three categories (Cognitively Normal (CN), Mild Cognitive Impairment (MCI), Alzheimer’s Disease (AD)) using brain Fluorodeoxyglucose (18F-FDG) Positron Emission Tomography (PET) scans. Also, we take advantage of Automated Anatomical Labeling (AAL) brain atlas and attention maps to develop explainable models. We propose three ViTs, the best of which obtains an accuracy of 82% on the test dataset with the help of transfer learning. Also, we encode the AAL brain atlas information into the best performing ViT, so the model outputs the predicted label, the most critical region in its prediction, and overlaid attention map on the input scan with the crucial areas highlighted. Furthermore, we develop two CNN models with 2D and 3D convolutional kernels as baselines to classify NDs, which achieve accuracy of 77% and 73%, respectively, on the test dataset. We also conduct a study to find out the importance of brain regions and their combinations in classifying NDs using ViTs and the AAL brain atlas. / <p>This thesis was awarded a prize of 50,000 SEK by Getinge Sterilization for projects within Health Innovation.</p>
|
12 |
Instance Segmentation on depth images using Swin Transformer for improved accuracy on indoor images / Instans-segmentering på bilder med djupinformation för förbättrad prestanda på inomhusbilderHagberg, Alfred, Musse, Mustaf Abdullahi January 2022 (has links)
The Simultaneous Localisation And Mapping (SLAM) problem is an open fundamental problem in autonomous mobile robotics. One of the latest most researched techniques used to enhance the SLAM methods is instance segmentation. In this thesis, we implement an instance segmentation system using Swin Transformer combined with two of the state of the art methods of instance segmentation namely Cascade Mask RCNN and Mask RCNN. Instance segmentation is a technique that simultaneously solves the problem of object detection and semantic segmentation. We show that depth information enhances the average precision (AP) by approximately 7%. We also show that the Swin Transformer backbone model can work well with depth images. Our results also show that Cascade Mask RCNN outperforms Mask RCNN. However, the results are to be considered due to the small size of the NYU-depth v2 dataset. Most of the instance segmentation researches use the COCO dataset which has a hundred times more images than the NYU-depth v2 dataset but it does not have the depth information of the image.
|
13 |
OBJECT DETECTION USING VISION TRANSFORMED EFFICIENTDETShreyanil Kar (16285265) 30 August 2023 (has links)
<p>This research presents a novel approach for object detection by integrating Vision Transformers (ViT) into the EfficientDet architecture. The field of computer vision, encompassing artificial intelligence, focuses on the interpretation and analysis of visual data. Recent advancements in deep learning, particularly convolutional neural networks (CNNs), have significantly improved the accuracy and efficiency of computer vision systems. Object detection, a widely studied application within computer vision, involves the identification and localization of objects in images.</p>
<p>The ViT backbone, renowned for its success in image classification and natural language processing tasks, employs self-attention mechanisms to capture global dependencies in input images. However, ViT’s capability to capture fine-grained details and context information is limited. To address this limitation, the integration of ViT into the EfficientDet architecture is proposed. EfficientDet is recognized for its efficiency and accuracy in object detection. By combining the strengths of ViT and EfficientDet, the proposed integration enhances the network’s ability to capture fine-grained details and context information. It leverages ViT’s global dependency modeling alongside EfficientDet’s efficient object detection framework, resulting in highly accurate and efficient performance. Noteworthy object detection frameworks utilized in the industry, such as RetinaNet, EfficientNet, and EfficientDet, primarily employ convolution.</p>
<p>Experimental evaluations were conducted using the PASCAL VOC 2007 and 2012 datasets, widely acknowledged benchmarks for object detection. The integrated ViT-EfficientDet model achieved an impressive mean Average Precision (mAP) score of 86.27% when tested on the PASCAL VOC 2007 dataset, demonstrating its superior accuracy. These results underscore the potential of the proposed integration for real-world applications.</p>
<p>In conclusion, the research introduces a novel integration of Vision Transformers into the EfficientDet architecture, yielding significant improvements in object detection performance. By combining ViT’s ability to capture global dependencies with EfficientDet’s efficiency and accuracy, the proposed approach offers enhanced object detection capabilities. Future research directions may explore additional datasets and evaluate the performance of the proposed framework across various computer vision tasks.</p>
|
14 |
A MULTI-HEAD ATTENTION APPROACH WITH COMPLEMENTARY MULTIMODAL FUSION FOR VEHICLE DETECTIONNujhat Tabassum (18010969) 03 June 2024 (has links)
<p dir="ltr">In the realm of autonomous vehicle technology, the Multimodal Vehicle Detection Network (MVDNet) represents a significant leap forward, particularly in the challenging context of weather conditions. This paper focuses on the enhancement of MVDNet through the integration of a multi-head attention layer, aimed at refining its performance. The integrated multi-head attention layer in the MVDNet model is a pivotal modification, advancing the network's ability to process and fuse multimodal sensor information more efficiently. The paper validates the improved performance of MVDNet with multi-head attention through comprehensive testing, which includes a training dataset derived from the Oxford Radar Robotcar. The results clearly demonstrate that the Multi-Head MVDNet outperforms the other related conventional models, particularly in the Average Precision (AP) estimation, under challenging environmental conditions. The proposed Multi-Head MVDNet not only contributes significantly to the field of autonomous vehicle detection but also underscores the potential of sophisticated sensor fusion techniques in overcoming environmental limitations.</p>
|
15 |
Event-Cap – Event Ranking and Transformer-based Video Captioning / Event-Cap – Event rankning och transformerbaserad video captioningCederqvist, Gabriel, Gustafsson, Henrik January 2024 (has links)
In the field of video surveillance, vast amounts of data are gathered each day. To be able to identify what occurred during a recorded session, a human annotator has to go through the footage and annotate the different events. This is a tedious and expensive process that takes up a large amount of time. With the rise of machine learning and in particular deep learning, the field of both image and video captioning has seen large improvements. Contrastive Language-Image Pretraining is capable of efficiently learning a multimodal space, thus able to merge the understanding of text and images. This enables visual features to be extracted and processed into text describing the visual content. This thesis presents a system for extracting and ranking important events from surveillance videos as well as a way of automatically generating a description of the event. By utilizing the pre-trained models X-CLIP and GPT-2 to extract visual information from the videos and process it into text, a video captioning model was created that requires very little training. Additionally, the ranking system was implemented to extract important parts in video, utilizing anomaly detection as well as polynomial regression. Captions were evaluated using the metrics BLEU, METEOR, ROUGE and CIDEr, and the model receives scores comparable to other video captioning models. Additionally, captions were evaluated by experts in the field of video surveillance, who rated them on accuracy, reaching up to 62.9%, and semantic quality, reaching 99.2%. Furthermore the ranking system was also evaluated by the experts, where they agree with the ranking system 78% of the time. / Inom videoövervakning samlas stora mängder data in varje dag. För att kunna identifiera vad som händer i en inspelad övervakningsvideo så måste en människa gå igenom och annotera de olika händelserna. Detta är en långsam och dyr process som tar upp mycket tid. Under de senaste åren har det setts en enorm ökning av användandet av olika maskininlärningsmodeller. Djupinlärningsmodeller har fått stor framgång när det kommer till att generera korrekt och trovärdig text. De har också använts för att generera beskrivningar för både bilder och video. Contrastive Language-Image Pre-training har gjort det möjligt att träna en multimodal rymd som kombinerar förståelsen av text och bild. Detta gör det möjligt att extrahera visuell information och skapa textbeskrivningar. Denna master uppsatts beskriver ett system som kan extrahera och ranka viktiga händelser i en övervakningsvideo samt ett automatiskt sätt att generera beskrivningar till dessa. Genom att använda de förtränade modellerna X-CLIP och GPT-2 för att extrahera visuell information och textgenerering, har en videobeskrivningsmodell skapats som endast behöver en liten mängd träning. Dessutom har ett rankingsystem implementerats för att extrahera de viktiga delarna i en video genom att använda anomalidetektion och polynomregression. Video beskrivningarna utvärderades med måtten BLEU, METOER, ROUGE och CIDEr, där modellerna får resultat i klass med andra videobeskrivningsmodeller. Fortsättningsvis utvärderades beskrivningarna också av experter inom videoövervakningsområdet där de fick besvara hur bra beskrivningarna var i måtten: beskrivningsprecision som uppnådde 62.9% och semantisk kvalité som uppnådde 99.2%. Ranknignssystemet utvärderades också av experterna. Deras åsikter överensstämde till 78% med rankningssystemet.
|
16 |
Crime Detection From Pre-crime Video AnalysisSedat Kilic (18363729) 03 June 2024 (has links)
<p dir="ltr">his research investigates the detection of pre-crime events, specifically targeting behaviors indicative of shoplifting, through the advanced analysis of CCTV video data. The study introduces an innovative approach that leverages augmented human pose and emotion information within individual frames, combined with the extraction of activity information across subsequent frames, to enhance the identification of potential shoplifting actions before they occur. Utilizing a diverse set of models including 3D Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), Recurrent Neural Networks (RNNs), and a specially developed transformer architecture, the research systematically explores the impact of integrating additional contextual information into video analysis.</p><p dir="ltr">By augmenting frame-level video data with detailed pose and emotion insights, and focusing on the temporal dynamics between frames, our methodology aims to capture the nuanced behavioral patterns that precede shoplifting events. The comprehensive experimental evaluation of our models across different configurations reveals a significant improvement in the accuracy of pre-crime detection. The findings underscore the crucial role of combining visual features with augmented data and the importance of analyzing activity patterns over time for a deeper understanding of pre-shoplifting behaviors.</p><p dir="ltr">The study’s contributions are multifaceted, including a detailed examination of pre-crime frames, strategic augmentation of video data with added contextual information, the creation of a novel transformer architecture customized for pre-crime analysis, and an extensive evaluation of various computational models to improve predictive accuracy.</p>
|
Page generated in 0.1079 seconds