1 |
Evaluation of deep learning methods for industrial automation
Onning, Ragnar, January 2023
The adaptation of the transformer architecture from natural language processing to visual tasks has proven a useful and powerful tool. Subsequent architectures such as Vision Transformers (ViT) and shifted window (Swin) Transformers have proven comparable to, and often better than, convolutional neural networks (CNNs) in terms of accuracy. However, for mobile vision tasks and limited hardware, the computational complexity of the transformer architecture is an impediment. This project aims to answer the question of whether the Swin Transformer can be adapted towards lightweight and low-latency classification as a basis for industrial automation, and how it compares to CNNs for a specific task. A case study from the logging industry, binary classification of wooden boards on chain conveyors, serves as the basis of this evaluation. For these purposes, a novel dataset has been collected and annotated. The results of this project include an overview of the respective architectures and their performance for different implementations on the classification task. Both architectures exhibited sufficient accuracy, while the CNN models performed best for the specific case study.
|
2 |
Comparative Analysis of Transformer and CNN Based Models for 2D Brain Tumor Segmentation
Träff, Henrik, January 2023
A brain tumor is an abnormal growth of cells within the brain, which can be categorized into primary and secondary tumor types. The most common type of primary tumors in adults are gliomas, which can be further classified into high-grade gliomas (HGGs) and low-grade gliomas (LGGs). Approximately 50% of patients diagnosed with HGG pass away within 1-2 years. Therefore, the early detection and prompt treatment of brain tumors are essential for effective management and improved patient outcomes. Brain tumor segmentation is a task in medical image analysis that entails distinguishing brain tumors from normal brain tissue in magnetic resonance imaging (MRI) scans. Computer vision algorithms and deep learning models capable of analyzing medical images can be leveraged for brain tumor segmentation. These algorithms and models have the potential to provide automated, reliable, and non-invasive screening for brain tumors, thereby enabling earlier and more effective treatment. For a considerable time, Convolutional Neural Networks (CNNs), including the U-Net, have served as the standard backbone architectures employed to address challenges in computer vision. In recent years, the Transformer architecture, which already has firmly established itself as the new state-of-the-art in the field of natural language processing (NLP), has been adapted to computer vision tasks. The Vision Transformer (ViT) and the Swin Transformer are two architectures derived from the original Transformer architecture that have been successfully employed for image analysis. The emergence of Transformer based architectures in the field of computer vision calls for an investigation whether CNNs can be rivaled as the de facto architecture in this field. This thesis compares the performance of four model architectures, namely the Swin Transformer, the Vision Transformer, the 2D U-Net, and the 2D U-Net which is implemented with the nnU-Net framework. 
These model architectures are trained using increasing amounts of brain tumor images from the BraTS 2020 dataset and subsequently evaluated on the task of brain tumor segmentation for both HGG and LGG together, as well as HGG and LGG individually. The model architectures are compared on total training time, segmentation time, GPU memory usage, and on the evaluation metrics Dice Coefficient, Jaccard Index, precision, and recall. The 2D U-Net implemented using the nnU-Net framework performs the best in correctly segmenting HGG and LGG, followed by the Swin Transformer, 2D U-Net, and Vision Transformer. The Transformer based architectures improve the least when going from 50% to 100% of training data. Furthermore, when data augmentation is applied during training, the nnU-Net outperforms the other model architectures, followed by the Swin Transformer, 2D U-Net, and Vision Transformer. The nnU-Net benefited the least from employing data augmentation during training, while the Transformer based architectures benefited the most. This thesis presents a comparative analysis that showcases the distinct advantages of the four model architectures under discussion. Future comparisons could incorporate training the model architectures on a larger set of brain tumor images, such as the BraTS 2021 dataset. Additionally, it would be interesting to explore how Vision Transformers and Swin Transformers, pre-trained on either ImageNet-21K or RadImageNet, compare to the model architectures of this thesis on brain tumor segmentation.
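The overlap metrics named in this abstract (Dice Coefficient, Jaccard Index, precision, recall) can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function name `overlap_metrics`, the flat-list mask representation, and the convention of returning 1.0 when a denominator is zero are all assumptions made for illustration.

```python
def overlap_metrics(pred, target):
    """Return (dice, jaccard, precision, recall) for two flat binary masks.

    Hypothetical helper: masks are equal-length sequences of 0/1 values,
    where 1 marks a tumor pixel. Empty denominators yield 1.0 by assumption.
    """
    tp = sum(1 for p, t in zip(pred, target) if p and t)        # predicted and true
    fp = sum(1 for p, t in zip(pred, target) if p and not t)    # predicted, not true
    fn = sum(1 for p, t in zip(pred, target) if not p and t)    # missed true pixels

    union = tp + fp + fn
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    jaccard = tp / union if union else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return dice, jaccard, precision, recall


# Toy example: 6-pixel masks with 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
dice, jaccard, precision, recall = overlap_metrics(pred, target)
print(dice, jaccard, precision, recall)
```

For a single mask pair the two overlap scores are related by Dice = 2J / (1 + J), so they rank individual predictions identically; they can differ once scores are averaged over many images.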
|
3 |
Teaching an AI to recycle by looking at scrap metal : Semantic segmentation through self-supervised learning with transformers / Lär en AI att källsortera genom att kolla på metallskrot
Forsberg, Edwin, Harris, Carl, January 2022
Stena Recycling is one of the leading recycling companies in Sweden and at their facility in Halmstad, 300 tonnes of refuse are handled every day where aluminium is one of the most valuable materials they sort. Today, most of the sorting process is done automatically, but there are still parts of the refuse that are not correctly sorted. Approximately 4% of the aluminium is currently not properly sorted and goes to waste. Earlier works have investigated using machine vision to help in the sorting process at Stena Recycling. However, consistently through all these previous works, there is a problem in gathering enough annotated data to train the machine learning models. This thesis aims to investigate how machine vision could be used in the recycling process and if pre-training models using self-supervised learning can alleviate the problem of gathering annotated data and yield an improvement. The results show that machine vision models could viably be used in an information system to assist operators. This thesis also shows that pre-training models with self-supervised learning may yield a small increase in performance. Furthermore, we show that models pre-trained using self-supervised learning also appear to transfer the knowledge learned from images created in a lab environment to images taken at the recycling plant.
|
4 |
Instance Segmentation on depth images using Swin Transformer for improved accuracy on indoor images / Instans-segmentering på bilder med djupinformation för förbättrad prestanda på inomhusbilder
Hagberg, Alfred, Musse, Mustaf Abdullahi, January 2022
The Simultaneous Localisation And Mapping (SLAM) problem is an open fundamental problem in autonomous mobile robotics. One of the most researched recent techniques for enhancing SLAM methods is instance segmentation, a technique that simultaneously solves the problems of object detection and semantic segmentation. In this thesis, we implement an instance segmentation system using Swin Transformer combined with two state-of-the-art instance segmentation methods, namely Cascade Mask RCNN and Mask RCNN. We show that depth information enhances the average precision (AP) by approximately 7%. We also show that the Swin Transformer backbone model can work well with depth images. Our results also show that Cascade Mask RCNN outperforms Mask RCNN. However, the results should be interpreted with caution due to the small size of the NYU-depth v2 dataset. Most instance segmentation research uses the COCO dataset, which has a hundred times more images than the NYU-depth v2 dataset but does not include depth information for its images.
|
5 |
Machine Learning for Automatic Annotation and Recognition of Demographic Characteristics in Facial Images / Maskininlärning för Automatisk Annotering och Igenkänning av Demografiska Egenskaper hos Ansiktsbilder
Gustavsson Roth, Ludvig, Rimér Högberg, Camilla, January 2024
The recent increase in the widespread use of facial recognition technologies has accelerated the utilization of demographic information extracted from facial features, yet it is accompanied by ethical concerns. It is therefore crucial to ensure that algorithms, such as the face recognition algorithms employed in legal proceedings, are equitable and thoroughly documented across diverse populations. Accurate classification of demographic traits is therefore essential for enabling a comprehensive understanding of other algorithms. This thesis explores how classical machine learning algorithms compare to deep-learning models in predicting sex, age and skin color. It concludes that the more compute-heavy deep-learning models, where the best performing models achieved an MCC of 0.99, 0.48 and 0.85 for sex, age and skin color respectively, significantly outperform their classical machine learning counterparts, which achieved an MCC of 0.57, 0.22 and 0.54 at best. After establishing that the deep-learning models are superior, further methods such as semi-supervised learning, a multi-characteristic classifier, sex-specific age classifiers and using tightly cropped facial images instead of upper-body images were employed to try to improve the deep-learning results. Throughout all deep-learning experiments, the state-of-the-art vision transformer and convolutional neural network were compared. Whilst the different architectures performed remarkably alike, a slight edge was seen for the convolutional neural network. The results further show that using cropped facial images generally improves model performance and that more specialized models achieve modest improvements compared to their less specialized counterparts. Semi-supervised learning showed potential in slightly improving the models further. The predictive performances achieved in this thesis indicate that the deep-learning models can reliably predict demographic features at a level close to, or surpassing, that of a human.
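For readers unfamiliar with the MCC (Matthews correlation coefficient) figures quoted above, the following is a minimal sketch of the standard binary form of the metric. It is not the thesis's code: the function name `binary_mcc` is hypothetical, the counts in the example are invented, and age and skin color are presumably multi-class problems that would need the multi-class generalization rather than this binary formula.

```python
import math


def binary_mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Hypothetical helper. Returns 0.0 when any marginal total is zero,
    which is a common convention and an assumption here.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Invented example: 100 balanced samples with 90% accuracy.
print(binary_mcc(tp=45, tn=45, fp=5, fn=5))
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which is why it is often preferred over plain accuracy when class frequencies are unbalanced.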
|
6 |
Siamese Network with Dynamic Contrastive Loss for Semantic Segmentation of Agricultural Lands
Pendotagaya, Srinivas, 07 1900
This research applies semantic segmentation to precision agriculture, specifically targeting the automated identification and classification of irrigation system types within agricultural landscapes using high-resolution aerial imagery. Irrigated agriculture occupies a substantial portion of US land and is a major freshwater user, which creates a critical need for precise water-use estimates in the face of evolving environmental challenges. The study uses advanced computer vision to identify irrigation systems. The outcomes contribute to effective water management, sustainable resource utilization, and informed decision-making for farmers and policymakers, with broader implications for environmental monitoring and land-use planning.
In this geospatial evaluation research, we tackle the challenge of intraclass variability and a limited dataset. The research problem centers on optimizing accuracy in geospatial analyses, particularly when confronted with intricate intraclass variations and the constraints of a limited dataset. Introducing a novel approach termed "dynamic contrastive learning," this research refines the existing contrastive learning framework. Tailored modifications aim to improve the model's accuracy in classifying and segmenting geographic features. Various deep learning models, including EfficientNetV2L, EfficientNetB7, ConvNeXtXLarge, ResNet-50, and ResNet-101, serve as backbones to assess their performance in the geospatial context. The data used for evaluation consists of high-resolution aerial imagery from the National Agriculture Imagery Program (NAIP) captured in 2015. It includes four bands (red, green, blue, and near-infrared) with a 1-meter ground sampling distance. The dataset covers diverse landscapes in Lonoke County, USA, and is annotated for various irrigation system types. It encompasses diverse geographic features, including urban, agricultural, and natural landscapes, providing a representative and challenging scenario for model assessment.
The experimental results underscore the efficacy of the modified contrastive learning approach in mitigating intraclass variability and improving performance metrics. The proposed method achieves an average accuracy of 96.7%, a BER of 0.05, and an mIoU of 88.4%, surpassing existing contrastive learning methods. This research contributes a valuable solution to the specific challenges posed by intraclass variability and limited datasets in geospatial feature classification. Furthermore, the investigation extends to prominent deep learning architectures such as SegFormer, Swin Transformer, ConvNeXt, and Convolutional Vision Transformer, shedding light on their impact on geospatial image analysis. ConvNeXtXLarge emerges as a robust backbone, demonstrating high accuracy (96.02%), a low BER (0.06), and a high mIoU (85.99%).
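As a rough illustration of how figures like the BER and mIoU above can be derived, the sketch below computes both from a confusion matrix. It assumes that BER denotes the balanced error rate (one minus the mean per-class recall); the abstract does not expand the acronym, so this reading is an assumption. The function name `ber_and_miou`, the row-is-true-class layout, and the example matrix are likewise hypothetical and not taken from the thesis.

```python
def ber_and_miou(cm):
    """Return (balanced error rate, mean IoU) from a square confusion matrix.

    Hypothetical helper. cm[i][j] counts pixels of true class i that were
    predicted as class j. Classes with an empty denominator score 0.0,
    which is an assumption made for this sketch.
    """
    n = len(cm)
    recalls, ious = [], []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true class k, predicted otherwise
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, true class differs
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
        ious.append(tp / (tp + fp + fn) if (tp + fp + fn) else 0.0)
    ber = 1.0 - sum(recalls) / n
    miou = sum(ious) / n
    return ber, miou


# Invented two-class example: rows are true classes, columns are predictions.
cm = [[8, 2],
      [1, 9]]
ber, miou = ber_and_miou(cm)
print(ber, miou)
```

Both quantities average over classes rather than over pixels, so rare irrigation types count as much as common ones, which is relevant when the dataset is small and unbalanced.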
|