1 |
Object Detection with Swin Vision Transformers from Raw ADC Radar Signals. Giroux, James, 15 August 2023.
Object detection utilizing frequency modulated continuous wave radar is becoming increasingly popular in the field of autonomous vehicles. Radar does not possess the same drawbacks seen in other emission-based sensors such as LiDAR, primarily the degradation or loss of return signals due to weather conditions such as rain or snow. Thus, there is a necessity for fully autonomous systems to utilize radar sensing in downstream decision-making tasks, generally handled by deep learning algorithms. Commonly, three transformations have been used to form range-azimuth-Doppler cubes in which deep learning algorithms could perform object detection. This method has drawbacks, specifically the pre-processing costs associated with performing multiple Fourier Transforms and normalization. We develop a network utilizing raw radar analog-to-digital converter output capable of operating in near real-time given the removal of all pre-processing. We obtain inference times roughly one-fifth those of the traditional range-Doppler pipeline, decreasing from 156 ms to 30 ms, and similar decreases in comparison to the full range-azimuth-Doppler cube. Moreover, we introduce hierarchical Swin Vision transformers to the field of radar object detection and show their capability to operate on inputs varying in pre-processing, along with different radar configurations, i.e., relatively low and high numbers of transmitters and receivers. Our network increases both average recall and mean intersection over union by ~6-7%, obtaining state-of-the-art F1 scores on high-definition radar as a result. On low-definition radar, we note an increase in mean average precision of ~2.5% over state-of-the-art range-Doppler networks when raw analog-to-digital converter data is used, and a ~5% increase over networks using the full range-azimuth-Doppler cube.
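For context, the sketch below (not taken from the thesis) illustrates the conventional range-Doppler pre-processing that a raw-ADC pipeline avoids; the frame dimensions, FFT ordering, and min-max normalization are assumptions for illustration only.

```python
import numpy as np

def range_doppler_map(adc_frame: np.ndarray) -> np.ndarray:
    """adc_frame: complex ADC samples, shape (num_chirps, num_samples_per_chirp)."""
    # Range FFT along fast time (samples within one chirp)
    range_fft = np.fft.fft(adc_frame, axis=1)
    # Doppler FFT along slow time (across chirps), centred on zero velocity
    doppler_fft = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)
    # Log-magnitude map, min-max normalised before being fed to a detector
    rd_map = 20 * np.log10(np.abs(doppler_fft) + 1e-12)
    return (rd_map - rd_map.min()) / (rd_map.max() - rd_map.min() + 1e-12)

# One simulated frame: 128 chirps x 256 samples of complex ADC data (made-up sizes)
frame = np.random.randn(128, 256) + 1j * np.random.randn(128, 256)
print(range_doppler_map(frame).shape)  # (128, 256)
```

Skipping these per-frame FFTs and the normalization is what the abstract credits for the drop in inference time.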
|
2 |
Convolution-compacted vision transformers for prediction of local wall heat flux at multiple Prandtl numbers in turbulent channel flow. Wang, Yuning, January 2023.
Predicting wall heat flux accurately in wall-bounded turbulent flows is critical for a variety of engineering applications, including thermal management systems and energy-efficient designs. Traditional methods, which rely on expensive numerical simulations, are hampered by increasing complexity and extremely high computational cost. Recent advances in deep neural networks (DNNs), however, offer an effective solution by predicting wall heat flux using non-intrusive measurements derived from off-wall quantities. This study introduces a novel approach, the convolution-compacted vision transformer (ViT), which integrates convolutional neural networks (CNNs) and ViT to predict instantaneous fields of wall heat flux accurately based on off-wall quantities, including the velocity components in three directions and the temperature. Our method is applied to an existing database of wall-bounded turbulent flows obtained from direct numerical simulations (DNS). We first conduct an ablation study to examine the effects of incorporating convolution-based modules into ViT architectures and report on the impact of different modules. Subsequently, we utilize fully-convolutional neural networks (FCNs) with various architectures to identify the distinctions between FCN models and the convolution-compacted ViT. Our optimized ViT model surpasses the FCN models in terms of instantaneous field predictions, learning turbulence statistics, and accurately capturing energy spectra. Finally, we undertake a sensitivity analysis using a gradient map to enhance the understanding of the nonlinear relationship established by DNN models, thus augmenting the interpretability of these models.
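As an illustrative sketch only (the thesis implementation is not shown here), one common way to combine a convolutional stem with a Transformer encoder for field-to-field prediction is given below; the four input channels (three velocity components and temperature), patch size, field size, and decoder are all assumptions.

```python
import torch
import torch.nn as nn

class ConvCompactedViT(nn.Module):
    def __init__(self, in_ch=4, dim=128, depth=4, heads=4, patch=8, field=64):
        super().__init__()
        # Convolutional stem tokenizes the off-wall planes instead of plain patch flattening
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, dim, patch, stride=patch),  # -> (dim, field/patch, field/patch)
        )
        n_tokens = (field // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        enc = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, depth)
        # Decode tokens back to a single-channel wall heat-flux field
        self.head = nn.ConvTranspose2d(dim, 1, patch, stride=patch)

    def forward(self, x):                      # x: (B, 4, 64, 64) off-wall planes (assumed size)
        z = self.stem(x)                       # (B, dim, 8, 8)
        B, C, H, W = z.shape
        tokens = z.flatten(2).transpose(1, 2) + self.pos
        tokens = self.encoder(tokens)
        z = tokens.transpose(1, 2).reshape(B, C, H, W)
        return self.head(z)                    # (B, 1, 64, 64) predicted wall heat-flux field

print(ConvCompactedViT()(torch.randn(2, 4, 64, 64)).shape)
```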
|
4 |
Toward Robust Class-Agnostic Object Counting. Jiban, Md Jibanul Haque, 01 January 2024.
Object counting is a process of determining the quantity of specific objects in images. Accurate object counting is key for various applications in image understanding. The common applications are traffic monitoring, crowd management, wildlife migration monitoring, cell counting in medical images, plant and insect counting in agriculture, etc. Occlusions, complex backgrounds, changes in scale, and variations in object appearance in real-world settings make object counting challenging. This dissertation explores a progression of techniques to achieve robust localization and counting under diverse image modalities.
The exploration begins by addressing the challenges of vehicular target localization in cluttered environments using infrared (IR) imagery. We propose a network, called TCRNet-2, that processes target and clutter information in two parallel channels and then combines them to optimize the target-to-clutter ratio (TCR) metric. Next, we explore class-agnostic object counting in RGB images using vision transformers. The primary motivation for this work is that most current methods excel at counting known object types but struggle with unseen categories. To address these drawbacks, we propose a class-agnostic object counting method. We introduce a dual-branch architecture with interconnected cross-attention that generates feature pyramids for robust object representations, and a dedicated feature aggregator module that further improves performance. Finally, we propose a novel framework that leverages vision-language models (VLMs) for zero-shot object counting. While our earlier class-agnostic counting method demonstrates high efficacy in generalized counting tasks, it relies on user-defined exemplars of target objects, presenting a limitation. Additionally, previous zero-shot counting methods were reference-less, which limits the ability to control the selection of the target object of interest in multi-class scenarios. To address these shortcomings, we propose to utilize vision-language models for zero-shot counting, where object categories of interest can be specified by text prompts.
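As a hedged illustration of the exemplar-based idea described above (not the dissertation's actual architecture), the following minimal block attends image patch tokens to exemplar tokens and pools a density map into a count; all dimensions and the pooling scheme are hypothetical.

```python
import torch
import torch.nn as nn

class ExemplarCrossAttentionCounter(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Image patch tokens query the exemplar tokens
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.density_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, image_tokens, exemplar_tokens):
        # image_tokens: (B, N, dim) patch features; exemplar_tokens: (B, K, dim)
        attended, _ = self.cross_attn(query=image_tokens,
                                      key=exemplar_tokens,
                                      value=exemplar_tokens)
        density = self.density_head(attended).relu()    # (B, N, 1) per-patch density
        return density.sum(dim=(1, 2))                  # predicted object count per image

model = ExemplarCrossAttentionCounter()
count = model(torch.randn(2, 196, 256), torch.randn(2, 3, 256))
print(count.shape)  # (2,)
```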
|
5 |
Histogram of Oriented Gradients in a Vision Transformer. Malmsten, Jakob; Cengiz, Heja; Lood, David, January 2022.
This study aims to modify the Vision Transformer (ViT) to achieve higher accuracy. ViT is a model used in computer vision to, among other things, classify images. By applying ViT to the MNIST data set, an accuracy of approximately 98% is achieved. ViT is modified by implementing a method called Histogram of Oriented Gradients (HOG) in two different ways. The results show that the first approach with HOG gives an accuracy of 98.74% (setup 1) and the second approach gives an accuracy of 96.87% (patch size 4x4 pixels). The study shows that when HOG is applied to the entire image, a better accuracy is obtained. However, no systematic optimization has taken place, which makes it difficult to draw conclusions with certainty.
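For illustration only, the sketch below computes HOG descriptors and classifies them; a logistic-regression classifier and the scikit-learn digits set stand in for the ViT and MNIST to keep the example self-contained, and the HOG parameters are assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # 8x8 digit images as a small stand-in for MNIST
features = np.array([
    # Assumed HOG settings; the study's two integration approaches are not reproduced here
    hog(img, orientations=9, pixels_per_cell=(4, 4), cells_per_block=(1, 1))
    for img in digits.images
])
X_tr, X_te, y_tr, y_te = train_test_split(features, digits.target, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"HOG-feature accuracy: {clf.score(X_te, y_te):.3f}")
```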
|
6 |
Multiclass Brain Tumour Tissue Classification on Histopathology Images Using Vision Transformers. Spyretos, Christoforos, January 2023.
Histopathology refers to inspecting and analysing tissue samples under a microscope to identify and examine signs of diseases. The manual investigation of histology slides by pathologists is time-consuming and prone to misinterpretation. Deep learning models have demonstrated outstanding performance in digital histopathology, providing doctors and clinicians with immediate and reliable decision-making assistance in their workflow. In this study, deep learning models, including vision transformers (ViT) and convolutional neural networks (CNN), were employed to compare their performance on a patch-level classification task on feature annotations of glioblastoma multiforme in H&E histology whole-slide images (WSIs). The dataset utilised in this study was obtained from the Ivy Glioblastoma Atlas Project (IvyGAP). The pre-processing steps included stain normalisation of the images, and patches of size 256x256 pixels were extracted from the WSIs. In addition, the per-subject split method was implemented to prevent data leakage between the training, validation and test sets. Three models were employed to perform the classification task on the IvyGAP image data: two scratch-trained models, a ViT and a CNN (variant of VGG16), and a pre-trained ViT. The models were assessed using various metrics such as accuracy, F1-score, confusion matrices, Matthews correlation coefficient (MCC), area under the curve (AUC) and receiver operating characteristic (ROC) curves. In addition, experiments were conducted to calibrate the models to reflect the ground truth of the task using the temperature scaling technique, and their uncertainty was estimated through the Monte Carlo dropout approach. Lastly, the models were statistically compared using the Wilcoxon signed-rank test. Among the evaluated models, the scratch-trained ViT exhibited the best test accuracy of 67%, with an MCC of 0.45. The scratch-trained CNN obtained a test accuracy of 49% and an MCC of 0.15. However, the pre-trained ViT only achieved a test accuracy of 28% and an MCC of 0.034. The reliability diagrams and metrics indicated that the scratch-trained ViT demonstrated better calibration. After applying temperature scaling, only the scratch-trained CNN showed improved calibration. Therefore, the calibrated CNN was used for subsequent experiments. The scratch-trained ViT and calibrated CNN illustrated different uncertainty levels. The scratch-trained ViT had moderate uncertainty, while the calibrated CNN exhibited modest to high uncertainty across classes. The pre-trained ViT had an overall high uncertainty. Finally, the results of the statistical tests reported that the scratch-trained ViT model performed better among the three models at a significance level of approximately 0.0167 after applying the Bonferroni correction. In conclusion, the scratch-trained ViT model achieved the highest test accuracy and better class discrimination. In contrast, the scratch-trained CNN and pre-trained ViT performed poorly and were comparable to random classifiers. The scratch-trained ViT demonstrated better calibration, while the calibrated CNN showed varying levels of uncertainty. The statistical tests demonstrated no statistical difference among the models.
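A hedged sketch of post-hoc temperature scaling, the calibration technique mentioned above, follows; the optimization setup and toy logits are assumptions, not the thesis code.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Find T > 0 minimizing the NLL of softmax(logits / T) on held-out validation data."""
    log_t = torch.zeros(1, requires_grad=True)            # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return float(log_t.exp())

# Toy example: over-confident logits for a 3-class problem (made-up data)
logits = torch.randn(100, 3) * 5
labels = torch.randint(0, 3, (100,))
print(f"fitted temperature: {fit_temperature(logits, labels):.2f}")
```

Dividing the logits by the fitted temperature softens (or sharpens) the predicted probabilities without changing the predicted class, which is why it is a popular post-hoc calibration step.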
|
7 |
Comparative Analysis of Transformer and CNN Based Models for 2D Brain Tumor Segmentation. Träff, Henrik, January 2023.
A brain tumor is an abnormal growth of cells within the brain, which can be categorized into primary and secondary tumor types. The most common primary tumors in adults are gliomas, which can be further classified into high-grade gliomas (HGGs) and low-grade gliomas (LGGs). Approximately 50% of patients diagnosed with HGG pass away within 1-2 years. Therefore, the early detection and prompt treatment of brain tumors are essential for effective management and improved patient outcomes. Brain tumor segmentation is a task in medical image analysis that entails distinguishing brain tumors from normal brain tissue in magnetic resonance imaging (MRI) scans. Computer vision algorithms and deep learning models capable of analyzing medical images can be leveraged for brain tumor segmentation. These algorithms and models have the potential to provide automated, reliable, and non-invasive screening for brain tumors, thereby enabling earlier and more effective treatment. For a considerable time, Convolutional Neural Networks (CNNs), including the U-Net, have served as the standard backbone architectures employed to address challenges in computer vision. In recent years, the Transformer architecture, which has already firmly established itself as the new state-of-the-art in the field of natural language processing (NLP), has been adapted to computer vision tasks. The Vision Transformer (ViT) and the Swin Transformer are two architectures derived from the original Transformer architecture that have been successfully employed for image analysis. The emergence of Transformer-based architectures in the field of computer vision calls for an investigation of whether CNNs can be rivaled as the de facto architecture in this field. This thesis compares the performance of four model architectures, namely the Swin Transformer, the Vision Transformer, the 2D U-Net, and the 2D U-Net implemented with the nnU-Net framework. These model architectures are trained using increasing amounts of brain tumor images from the BraTS 2020 dataset and subsequently evaluated on the task of brain tumor segmentation for both HGG and LGG together, as well as HGG and LGG individually. The model architectures are compared on total training time, segmentation time, GPU memory usage, and on the evaluation metrics Dice Coefficient, Jaccard Index, precision, and recall. The 2D U-Net implemented using the nnU-Net framework performs the best in correctly segmenting HGG and LGG, followed by the Swin Transformer, 2D U-Net, and Vision Transformer. The Transformer-based architectures improve the least when going from 50% to 100% of training data. Furthermore, when data augmentation is applied during training, the nnU-Net outperforms the other model architectures, followed by the Swin Transformer, 2D U-Net, and Vision Transformer. The nnU-Net benefited the least from employing data augmentation during training, while the Transformer-based architectures benefited the most. In this thesis, we performed a comparative analysis that effectively showcases the distinct advantages of the four model architectures under discussion. Future comparisons could incorporate training the model architectures on a larger set of brain tumor images, such as the BraTS 2021 dataset. Additionally, it would be interesting to explore how Vision Transformers and Swin Transformers, pre-trained on either ImageNet-21K or RadImageNet, compare to the model architectures of this thesis on brain tumor segmentation.
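For reference, a small sketch (not from the thesis) of the overlap metrics used in the comparison, computed on toy binary tumor masks:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    # Dice = 2|A ∩ B| / (|A| + |B|)
    intersection = np.logical_and(pred, target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

def jaccard_index(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    # IoU = |A ∩ B| / |A ∪ B|
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (intersection + eps) / (union + eps)

# Toy 64x64 masks with partial overlap
pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
target = np.zeros((64, 64), dtype=bool); target[15:45, 15:45] = True
print(f"Dice: {dice_coefficient(pred, target):.3f}, IoU: {jaccard_index(pred, target):.3f}")
```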
|
8 |
Industrial 3D Anomaly Detection and Localization Using Unsupervised Machine Learning. Bärudde, Kevin; Gandal, Marcus, January 2023.
Detecting defects in industrially manufactured products is crucial to ensure their safety and quality. This process can be both expensive and error-prone if done manually, making automated solutions desirable. There is extensive research on industrial anomaly detection in images, but recent studies have shown that adding 3D information can increase performance. This thesis aims to extend the 2D anomaly detection framework, PaDiM, to incorporate 3D information. The proposed methods combine RGB with depth maps or point clouds, and the effects of using PointNet++ and vision transformers to extract features are investigated. The methods are evaluated on the MVTec 3D-AD public dataset using the metrics image AUROC, pixel AUROC and AUPRO, and on a small dataset collected with a Time-of-Flight sensor. This thesis concludes that the addition of 3D information improves the performance of PaDiM, and that vision transformers achieve the best results, scoring an average image AUROC of 86.2±0.2 on MVTec 3D-AD.
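A hedged sketch of the PaDiM idea the thesis builds on: fit a per-patch Gaussian over features from normal training samples, then score test patches by Mahalanobis distance. The feature extractor is abstracted away here, and concatenating RGB and depth features channel-wise is only one assumed way of adding 3D information, not necessarily the thesis method.

```python
import numpy as np

def fit_padim(train_feats: np.ndarray):
    """train_feats: (n_samples, n_patches, dim) patch embeddings from defect-free images."""
    mean = train_feats.mean(axis=0)                                      # (P, D)
    n_patches, dim = train_feats.shape[1], train_feats.shape[2]
    cov = np.empty((n_patches, dim, dim))
    for p in range(n_patches):
        # Regularized covariance per patch position
        cov[p] = np.cov(train_feats[:, p, :], rowvar=False) + 0.01 * np.eye(dim)
    return mean, np.linalg.inv(cov)

def anomaly_scores(test_feats, mean, cov_inv):
    diff = test_feats - mean                                             # (N, P, D)
    # Mahalanobis distance per patch -> patch-level anomaly map
    return np.sqrt(np.einsum('npd,pde,npe->np', diff, cov_inv, diff))

# Made-up feature sizes: 196 patches, 64-d RGB features + 32-d depth features
rgb = np.random.randn(50, 196, 64)
depth = np.random.randn(50, 196, 32)
mean, cov_inv = fit_padim(np.concatenate([rgb, depth], axis=-1))
scores = anomaly_scores(np.random.randn(5, 196, 96), mean, cov_inv)
print(scores.shape)  # (5, 196)
```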
|
9 |
Hybrid Deep Learning Approach for Lane Detection: Combining convolutional and transformer networks with a post-processing temporal information mechanism for efficient road lane detection on a road image scene. Zarogiannis, Dimitrios; Bompai, Stelio, January 2023.
Lane detection is a crucial task in the field of autonomous driving and advanced driver assistance systems. In recent years, convolutional neural networks (CNNs) have been the primary approach for solving this problem. However, recent research has shown that Transformer models and attention-based mechanisms can be beneficial for semantic segmentation of road lane markings. In this work, we investigate the effectiveness of incorporating a Vision Transformer (ViT) to process feature maps extracted by a CNN for lane detection. We compare the performance of a baseline CNN-based lane detection model with that of a hybrid CNN-ViT pipeline and test the model on a well-known dataset. Furthermore, we explore the impact of incorporating temporal information from a road scene on a lane detection model's predictive performance. We propose a post-processing technique that utilizes information from previous frames to improve the accuracy of the lane detection model. Our results show that incorporating temporal information noticeably improves the model's performance, and manages to make effective corrections over the originally predicted lane masks. Our SegNet backbone, exploiting the proposed post-processing mechanism, reached an F1 score of 0.52 and an Intersection-over-Union (IoU) of 0.36 on the TuSimple test set. However, the findings from the testing of our CNN-ViT pipeline and a relevant ablation study indicate that this hybrid approach might not be a good fit for lane detection. More specifically, the ViT module fails to exploit the features extracted by our CNN backbone, and therefore our hybrid pipeline results in less accurate lane marking predictions.
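As an assumed illustration of the kind of temporal post-processing described (the thesis mechanism may differ), the sketch below exponentially smooths per-frame lane probability maps over previous frames before thresholding.

```python
import numpy as np

def temporal_smooth(mask_sequence, alpha=0.6, threshold=0.5):
    """mask_sequence: iterable of per-frame lane probability maps with values in [0, 1]."""
    smoothed, refined = None, []
    for mask in mask_sequence:
        # Blend the current prediction with the accumulated history of previous frames
        smoothed = mask if smoothed is None else alpha * mask + (1 - alpha) * smoothed
        refined.append(smoothed > threshold)      # binary lane mask informed by history
    return refined

# Five made-up 256x512 probability maps standing in for consecutive video frames
frames = [np.clip(np.random.rand(256, 512) + 0.1 * t, 0, 1) for t in range(5)]
refined = temporal_smooth(frames)
print(len(refined), refined[0].shape)  # 5 (256, 512)
```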
|
10 |
Automatic usability assessment of CR images using deep learning. Wårdemark, Erik; Unell, Olle, January 2024.
Computed Radiography exams are rarely performed by the same physicians who will interpret the image. Therefore, if the image does not help the physician diagnose the patient, the image can be rejected by the interpreting physician. The rejection normally happens after the patient has already left the hospital, meaning that they have to return to retake the exam. This leads to unnecessary work for the physicians and for the patient. In order to solve this problem, we explored deep learning algorithms to automatically analyze the images and distinguish between usable and unusable images. The deep learning algorithms include convolutional neural networks, vision transformers and fusion networks utilizing different types of data. In total, seven architectures were used to train 42 models. The models were trained on a dataset of 61 127 DICOM files containing images and metadata collected from a clinical setting, labeled based on whether the images were deemed usable in the clinical setting. The complete dataset was used for training generalized models, and subsets containing specific body parts were used for training specialized models. Three architectures were used for classification using images only, where two architectures used a ResNet-50 backbone and one architecture used a ViT-B/16 backbone. These architectures created 15 specialized models and three generalized models. Four architectures implementing joint fusion created 20 specialized models and four generalized models. Two of these architectures had a ResNet-50 backbone and the other two utilized a ViT-B/16 backbone. For each of the backbones used, two types of joint fusion were implemented, type I and type II, which had different structures. The two modalities utilized were images and metadata from the DICOM files. The best image-only model had a ViT-B/16 backbone and was trained on a specialized dataset containing hands and feet. This model reached an AUC score of 0.842 and an MCC of 0.545. The two fusion models trained on the same dataset reached AUC scores of 0.843 and 0.834 and MCCs of 0.547 and 0.546, respectively. We concluded that it is possible to perform automatic rejections with deep learning models, even though the results of this study are not good enough for clinical use. The models using ViT-B/16 performed better than those using ResNet-50 in all cases. The generalized and specialized models performed equally well in most cases, with the exception of the smaller subsets of the full dataset. Utilizing metadata from the DICOM files did not improve the models compared to the image-only models.
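A hedged sketch of joint fusion of image features and DICOM metadata (not the thesis implementation): ResNet-50 image features are concatenated with an embedding of the metadata before the usable/unusable classification head. The metadata dimensionality and head sizes are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class JointFusionClassifier(nn.Module):
    def __init__(self, n_metadata=16):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()                  # expose the 2048-d image features
        self.backbone = backbone
        self.meta_mlp = nn.Sequential(nn.Linear(n_metadata, 64), nn.ReLU())
        self.head = nn.Linear(2048 + 64, 2)          # usable vs. unusable

    def forward(self, image, metadata):
        # Joint fusion: concatenate the two modality embeddings before classification
        feats = torch.cat([self.backbone(image), self.meta_mlp(metadata)], dim=1)
        return self.head(feats)

model = JointFusionClassifier()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 16))
print(logits.shape)  # (2, 2)
```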
|