Global ETD Search

1	[en] VISION TRANSFORMERS AND MASKED AUTOENCONDERS FOR SEISMIC FACEIS SEGMENTATION / [pt] VISION TRANSFORMERS E MASKED AUTOENCONDERS PARA SEGMENTAÇÃO DE FÁCIES SÍSMICAS DANIEL CESAR BOSCO DE MIRANDA 12 January 2024 (has links) [pt] O desenvolvimento de técnicas de aprendizado auto-supervisionado vem ganhando muita visibilidade na área de Visão Computacional pois possibilita o pré-treinamento de redes neurais profundas sem a necessidade de dados anotados. Em alguns domínios, as anotações são custosas, pois demandam muito trabalho especializado para a rotulação dos dados. Esse problema é muito comum no setor de Óleo e Gás, onde existe um vasto volume de dados não interpretados. O presente trabalho visa aplicar a técnica de aprendizado auto-supervisionado denominada Masked Autoencoders para pré-treinar modelos Vision Transformers com dados sísmicos. Para avaliar o pré-treino, foi aplicada a técnica de transfer learning para o problema de segmentação de fácies sísmicas. Na fase de pré-treinamento foram empregados quatro volumes sísmicos distintos. Já para a segmentação foi utilizado o dataset Facies-Mark e escolhido o modelo da literatura Segmentation Transformers. Para avaliação e comparação da performance da metodologia foram empregadas as métricas de segmentação utilizadas pelo trabalho de benchmarking de ALAUDAH (2019). As métricas obtidas no presente trabalho mostraram um resultado superior. Para a métrica frequency weighted intersection over union, por exemplo, obtivemos um ganho de 7.45 por cento em relação ao trabalho de referência. Os resultados indicam que a metodologia é promissora para melhorias de problemas de visão computacional em dados sísmicos. / [en] The development of self-supervised learning techniques has gained a lot of visibility in the field of Computer Vision as it allows the pre-training of deep neural networks without the need for annotated data. In some domains, annotations are costly, as they require a lot of specialized work to label the data. This problem is very common in the Oil and Gas sector, where there is a vast amount of uninterpreted data. The present work aims to apply the self-supervised learning technique called Masked Autoencoders to pre-train Vision Transformers models with seismic data. To evaluate the pre-training, transfer learning was applied to the seismic facies segmentation problem. In the pre-training phase, four different seismic volumes were used. For the segmentation, the Facies-Mark dataset was used and the Segmentation Transformers model was chosen from the literature. To evaluate and compare the performance of the methodology, the segmentation metrics used by the benchmarking work of ALAUDAH (2019) were used. The metrics obtained in the present work showed a superior result. For the frequency weighted intersection over union (FWIU) metric, for example, we obtained a gain of 7.45 percent in relation to the reference work. The results indicate that the methodology is promising for improving computer vision problems in seismic data. Read more [pt] SISMICA [pt] MASKED AUTOENCODERS [pt] VISION TRANSFORMERS [pt] APRENDIZADO AUTO-SUPERVISIONADO [en] SEISMIC [en] MASKED AUTOENCODERS [en] VISION TRANSFORMERS [en] SELF-SUPERVISED LEARNING
2	Regularizing Vision-Transformers Using Gumbel-Softmax Distributions on Echocardiography Data / Regularisering av Vision-Transformers med hjälp av Gumbel-Softmax-fördelningar på ekokardiografidata Nilsson, Alfred January 2023 (has links) This thesis introduces an novel approach to model regularization in Vision Transformers (ViTs), a category of deep learning models. It employs stochastic embedded feature selection within the context of echocardiography video analysis, specifically focusing on the EchoNet-Dynamic dataset. The proposed method, termed Gumbel Vision-Transformer (G-ViT), combines ViTs and Concrete Autoencoders (CAE) to enhance the generalization of models predicting left ventricular ejection fraction (LVEF). The model comprises a ViT frame encoder for spatial representation and a transformer sequence model for temporal aspects, forming a Video ViT (V-ViT) architecture that, when used without feature selection, serves as a baseline on LVEF prediction performance. The key contribution lies in the incorporation of stochastic image patch selection in video frames during training. The CAE method is adapted for this purpose, achieving approximately discrete patch selections by sampling from the Gumbel-Softmax distribution, a relaxation of the categorical. The experiments conducted on EchoNetDynamic demonstrate a consistent and notable regularization effect. The G-ViT model, trained with learned feature selection, achieves a test R² of 0.66 outperforms random masking baselines and the full-input V-ViT counterpart with an R² of 0.63, and showcasing improved generalization in multiple evaluation metrics. The G-ViT is compared against recent related work in the application of ViTs on EchoNet-Dynamic, notably outperforming the application of Swin-transformers, UltraSwin, which achieved an R² of 0.59. Moreover, the thesis explores model explainability by visualizing selected patches, providing insights into how the G-ViT utilizes regions known to be crucial for LVEF prediction for humans. This proposed approach extends beyond regularization, offering a unique explainability tool for ViTs. Efficiency aspects are also considered, revealing that the G-ViT model, trained with a reduced number of input tokens, yields comparable or superior results while significantly reducing GPU memory and floating-point operations. This efficiency improvement holds potential for energy reduction during training. / Detta examensarbete introducerar en ny metod för att uppnå regularisering av Vision-Transformers (ViTs), en kategori av deep learning-modeller. Den använder sig stokastisk inbäddad feature selection i kontexten av analys av ekokardiografivideor, specifikt inriktat på datasetet EchoNet-Dynamic. Den föreslagna metoden, kallad Gumbel Vision-Transformer (G-ViT), kombinerar ViTs och Concrete Autoencoders (CAE) för att förbättra generaliseringen av modeller som förutspår ejektionsfraktion i vänstra ventrikeln (left ventricular ejection fraction, LVEF). Modellen inbegriper en ViT frame encoder för spatiella representationer och en transformer-sekvensmodell för tidsaspekter, vilka bilder en arkitektur, Video-ViT (V-ViT), som tränad utan feature selection utgör en utgångspunkt (baseline) för jämförelse vid prediktion av LVEF. Det viktigaste bidraget ligger i införandet av stokastiskt urval av bild-patches i videobilder under träning. CAE-metoden anpassas för detta ändamål, och uppnår approxmativt diskret patch-selektion genom att dra stickprov från Gumbel-Softmax-fördelningen, en relaxation av den kategoriska fördelningen. Experimenten utförda på EchoNet-Dynamic visar en konsekvent och anmärkningsvärd regulariseringseffekt. G-ViTmodellen, tränad med inlärd feature selection, uppnår ett R² på 0,66 och överträffar slumpmässigt urval och V-ViT-motsvarigheten som använder sig av hela bilder med ett R² på 0,63, och uppvisar förbättrad generalisering i flera utvärderingsmått. G-ViT jämförs med nyligen publicerat arbete i tillämpningen av ViTs på EchoNet-Dynamic och överträffar bland annat en tillämpning av Swin-transformers, UltraSwin, som uppnådde en R² på 0,59. Dessutom utforskar detta arbete modellförklarbarhet genom att visualisera utvalda bild-patches, vilket ger insikter i hur G-ViT använder regioner som är kända för att vara avgörande för LVEF-estimering för människor. Denna föreslagna metod sträcker sig bortom regularisering och erbjuder ett unikt förklaringsverktyg för ViTs. Effektivitetsaspekter beaktas också, vilket avslöjar att G-ViT-modellen, tränad med ett reducerat antal inmatningstokens, ger jämförbara eller överlägsna resultat samtidigt som den avsevärt minskar GPU-minnet och flyttalsoperationer. Denna effektivitetsförbättring har potential för energireduktion under träning. Read more Deep Learning Vision-Transformers Echocardiography Feature Selection Gumbel-Softmax Concrete Autoencoders Regression Djupinlärning Vision-Transformers Ekokardiografi Feature Selection GumbelSoftmax Concrete Autoencoders Regression Computer and Information Sciences Data- och informationsvetenskap
3	Video Retargeting using Vision Transformers : Utilizing deep learning for video aspect ratio change / Video Retargeting med hjälp av Vision Transformers : Användning av djupinlärning för ändring av videobildförhållanden Laufer, Gil January 2022 (has links) The diversity of video material, where a video is shot and produced using a single aspect ratio, and the variety of devices that can play video with screens in different aspect ratios make video retargeting a relevant topic. The process of fitting a video filmed in one aspect ratio to a screen in another aspect ratio is called video retargeting, and the retargeted video should ideally preserve the important content and structure of the original video as well as be free of visual artifacts. Important content and important structure are vague and subjective definitions, which makes this problem more difficult to solve. The video retargeting problem has been a challenge for researchers from the computer vision, computer graphics and human-computer interaction areas, and successful retargeting can improve the viewing experience and the content’s aesthetic value. Video retargeting is done by four tools: cropping, scaling, seam carving and seam adding. Previous research showed that one of the keys to successful retargeting is to use a suitable combination of operators. This study makes use of a vision transformer, a deep learning model which is trained to discriminate between original and retargeted videos. Solving an optimization problem using beam search, the transformer assists in choosing a combination of operators that will result in the best possible retargeted video. The retargeted videos were examined in a user A/B-test, where users had to choose their preferred variant of a video shot: the transformer’s output using beam search, or a singular version where the video underwent a single retargeting operation. The model and user preferences were compared to check if the model indeed can make retargeting decisions that are appealing for humans to watch. A significance test showed that no conclusion can be made, probably due to lack of enough test data. However, the study revealed patterns in the preferences of the users and the model that could be further fine-tuned or combined with other computer vision mechanisms in order to output better retargeted videos. / Variation av videomaterial, där olika videor är inspelade och producerade i olika bildförhållande, samt variation i apparater och skärmar som spelar upp videor i olika bildförhållanden gör ändring av videobildförhållande till en relevant fråga. Processen där en videos bildförhållande ändras heter video retargeting. När video retargeting används bör den nya videon helst bevara strukturen och viktigt innehåll från originalvideon samt vara artefaktfri. Struktur och viktigt innehåll är subjektiva definitioner vilket gör frågan svårlöst, och frågan har varit en utmaning för forskare inom datorseende, datorgrafik och människa-datorinteraktion. Lyckad ändring av en videos bildförhållande kan förbättra tittarupplevelsen och innehållets estetiska värde. Video retargeting kan göras med hjälp av fyra funktioner: klippning, skalning, seam carving och seam adding. Tidigare studier visar att en av nycklarna till lyckad retargeting är att hitta en lämplig kombination av funktionerna. I denna studie används Vision Transformer, en djupinlärningsmodell som tränas för att skilja mellan original och omvandlade videor. Genom att lösa ett optimeringsproblem med strålsökning hjälper modellen välja den kombination av funktionerna som resulterar i den bästa möjliga omvandlade videon. De omvandlade videorna testades genom ett användartest där användare valde vilket videoklipp de tyckte bättre om: modellens output som skapades med hjälp av strålsökning, eller en version där klippet genomgick en enklare retargeting med hjälp av endast en av funktionerna. Modellens och användarnas preferenser jämfördes för att se om modellen kan fatta beslut som användare upplever som bra. Ett signifikanstest visar att ingen slutsats kan dras, förmodligen på grund av det begränsade antalet videoklipp och data som användes i studien. Däremot visar studien mönster i användarnas och modellens preferenser som kan användas för att vidareutveckla problemlösningen inom området. Read more Video retargeting Aspect ratio Computer vision Deep learning Vision transformers. Video retargeting Bildförhållande Datorseende Djupinlärning Vision transformers. Computer and Information Sciences Data- och informationsvetenskap
4	Classification for Diseases in Potatoes Leaf Using Yolov8 : master's thesis Моргуе-Ансах, М., Morgue-Ansah, M. January 2024 (has links) Болезни листьев картофеля представляют значительную угрозу для глобальной продовольственной безопасности, влияя на урожайность и качество. Точные и эффективные методы классификации болезней имеют решающее значение для своевременного вмешательства и управления урожаем. В этом исследовании изучается эффективность современной архитектуры глубокого обучения YOLOv8 для классификации болезней листьев картофеля. Архитектура YOLOv8, известная своими возможностями обнаружения объектов в реальном времени, адаптирована для многоклассовой классификации болезней листьев картофеля. Благодаря трансферному обучению модель предварительно обучается на крупномасштабном наборе данных и настраивается на конкретном наборе данных о болезнях листьев картофеля. YOLOv8 использует одноступенчатую структуру обнаружения объектов, применяя ряд сверточных слоев для обнаружения и классификации болезней непосредственно на изображениях. Аналогичным образом для сравнения использовались Vision Transformers, которые показали многообещающие результаты в задачах классификации изображений. Экспериментальные результаты показали, что YOLOv8 показал точность 97,9%. Набор данных, используемый в этом исследовании, состоит из изображений листьев картофеля с высоким разрешением, пораженных различными болезнями, включая фитофтороз, раннюю гниль и здоровые листья. Для повышения надежности и обобщения модели были применены методы предварительной обработки, такие как дополнение и нормализация данных. Был проведен дополнительный анализ для понимания сильных и слабых сторон каждого подхода. YOLOv8 продемонстрировал превосходную производительность при обнаружении небольших поражений и сложных узоров на листьях картофеля благодаря своим возможностям обнаружения объектов. Это исследование способствует развитию компьютерного зрения в сельском хозяйстве, предоставляя информацию о производительности архитектур глубокого обучения для классификации болезней листьев картофеля. Результаты дают ценное руководство для исследователей и практиков, стремящихся разработать надежные и эффективные системы обнаружения болезней для поддержки устойчивых методов управления урожаем. / Potato leaf diseases pose a significant threat to global food security, affecting yield and quality. Accurate and efficient disease classification methods are crucial for timely intervention and crop management. This study investigates the efficacy state-of-the-art deep learning architecture, YOLOv8 for potato leaf disease classification. The YOLOv8 architecture, renowned for its real-time object detection capabilities, is adapted for multi-class classification of potato leaf diseases. Through transfer learning, the model is pre-trained on a large-scale dataset and fine-tuned on a specific potato leaf disease dataset. YOLOv8 leverages a single-stage object detection framework, employing a series of convolutional layers to detect and classify diseases directly from images. Similarly, Vision Transformers, which have shown promising results in image classification tasks, were employed for comparison. Experimental results revealed that YOLOv8 exhibited an accuracy of 97.9%. The dataset utilized in this research consists of high-resolution images of potato leaves affected by various diseases, including late blight, early blight, and healthy leaves. Preprocessing techniques such as data augmentation and normalization were applied to enhance model robustness and generalization. Further analysis was conducted to understand the strengths and limitations of each approach. YOLOv8 demonstrated superior performance in detecting small lesions and intricate patterns on potato leaves, owing to its object detection capabilities. This study contributes to advancing the field of agricultural computer vision by providing insights into the performance of deep learning architectures for potato leaf disease classification. The findings offer valuable guidance for researchers and practitioners seeking to develop robust and efficient disease detection systems to support sustainable crop management practices. Read more MASTER'S THESIS AGRICULTURAL COMPUTER VISION MACHINE LEARNING TRANSFER LEARNING VISION TRANSFORMERS YOLOV8 МАШИННОЕ ОБУЧЕНИЕ VISION TRANSFORMERS YOLOV8
5	Few-Shot Learning for Quality Inspection Palmér, Jesper, Alsalehy, Ahmad January 2023 (has links) The goal of this project is to find a suitable Few-Shot Learning (FSL) model that can be used in a fault detection system for use in an industrial setting. A dataset of Printed Circuit Board (PCB) images has been created to train different FSL models. This dataset is meant for evaluating FSL models in the specialized setting of fault detection in PCB manufacturing. FSL is a part of deep learning that has seen a large amount of development recently. Few-shot learning allows neural networks to learn on small datasets. In this thesis, various state-of-the-art FSL algorithms are implemented and tested on the custom PCB dataset. Different backbones are used to establish a benchmark for the tested FSL algorithms on three different datasets. Those datasets are ImageNet, PCB Defects, and the created PCB dataset. Our results show that ProtoNets combined with ResNet12 backbone achieved the highest accuracy in two test scenarios. In those tests, the model combination achieved 87.20%and 92.27% in 1-shot and 5-shot test scenarios, respectively. This thesis presents a Few-Shot Anomaly Detection (FSAD) model based on Vision Transformers (ViT). The model is compared to the state-of-the-art FSAD model DevNet on the MVTec-AD dataset. DevNet and ViT are chosen for comparison because they both approach the problem by dividing images into patches. How the models handle the image patches is however very different. The results indicate that ViT Deviation does not obtain as high AUC-ROC and AUC-PR scores as DevNet. This is because of the use of the very deep ViT architecture in the ViT Deviation model. A shallower transformer-based model is believed to be better suited for FSAD. Improvements for ViT Deviation are suggested for future work. The most notable suggested improvement is the use of the FS-CT architecture as a FSAD model because of the high accuracy it achieves in classification. / Målet med detta projekt är att hitta en lämplig Few-Shot Learning(FSL) modell som kan användas i ett feldetekteringssystem för användning i en industriell miljö. Ett dataset av Printed Circuit Board(PCB) bilder har skapats för att träna olika FSL-modeller. Detta datasetär avsedd för att utvärdera FSL-modeller i det specialiserade områdetfeldetektering vid PCB-tillverkning. FSL är en del av djupinlärningsom har utvecklats mycket den senaste tiden. FSL tillåter neuralanätverk att lära sig på små datamängder.I detta examensarbete implementeras och testas olika state-of-theart FSL algoritmer på det anpassade PCB-datasetet. Olika ryggradsmodeller används för att upprätta ett riktmärke för de testade FSL-algoritmernapå tre olika dataset. Dessa dataset är ImageNet[6], PCB Defects[14]och det skapade PCB-datasetet. Våra resultat visar att ProtoNets ikombination med ResNet12-ryggraden uppnådde den högsta noggrannheten i två testscenarier. I dessa tester uppnådde modellkombinationen 87,20% och 92,27% i testscenarier med 1-shot respektive5-shot.Detta examensarbete presenterar en Few-Shot Anomaly Detectionmodell (FSAD) baserad på Vision Transformers (ViT). Modellen jämförs med FSAD-modellen DevNet på MVTec-AD-datasetet. DevNetoch ViT väljs för jämförelse eftersom de båda angriper problemetgenom att dela upp bilder i mindre lappar. Hur modellerna hanterarlapparna är dock väldigt olika. Resultaten indikerar att ViT-Deviationinte får lika hög AUC-ROC och AUC-PR som DevNet. Detta beror påanvändningen av den mycket djupa ViT-arkitekturen i ViT Deviationmodellen. En grundare ViT-baserad modell tros vara bättre lämpadför FSAD. Förbättringar för ViT-Deviation föreslås för framtida arbete.Den mest anmärkningsvärda föreslagna förbättringen är användningen av FS-CT-arkitekturen som en FSAD-modell på grund av de lovande resultaten den uppnår i klassificering. Read more Few-Shot Learning AI Transformers ViT Deviation Vision Transformers Computer Sciences Datavetenskap (datalogi)
6	SAMPLS: A prompt engineering approach using Segment-Anything-Model for PLant Science research Sivaramakrishnan, Upasana 30 May 2024 (has links) Comparative anatomical studies of diverse plant species are vital for the understanding of changes in gene functions such as those involved in solute transport and hormone signaling in plant roots. The state-of-the-art method for confocal image analysis called PlantSeg utilized U-Net for cell wall segmentation. U-Net is a neural network model that requires training with a large amount of manually labeled confocal images and lacks generalizability. In this research, we test a foundation model called the Segment Anything Model (SAM) to evaluate its zero-shot learning capability and whether prompt engineering can reduce the effort and time consumed in dataset annotation, facilitating a semi-automated training process. Our proposed method improved the detection rate of cells and reduced the error rate as compared to state-of-the-art segmentation tools. We also estimated the IoU scores between the proposed method and PlantSeg to reveal the trade-off between accuracy and detection rate for different quality of data. By addressing the challenges specific to confocal images, our approach offers a robust solution for studying plant structure. Our findings demonstrated the efficiency of SAM in confocal image segmentation, showcasing its adaptability and performance as compared to existing tools. Overall, our research highlights the potential of foundation models like SAM in specialized domains and underscores the importance of tailored approaches for achieving accurate semantic segmentation in confocal imaging. / Master of Science / Studying different plant species' anatomy is crucial for understanding how genes work, especially those related to moving substances and signaling in plant roots. Scientists often use advanced techniques like confocal microscopy to examine plant tissues in detail. Traditional techniques like PlantSeg in automatically segmenting plant cells require a lot of computational resources and manual effort in preparing the dataset and training the model. In this study, we develop a novel technique using Segment-Anything-Model that could learn to identify cells without needing as much training data. We found that SAM performed better than other methods, detecting cells more accurately and making fewer mistakes. By comparing SAM with PlantSeg, we could see how well they worked with different types of images. Our results show that SAM is a reliable option for studying plant structures using confocal imaging. This research highlights the importance of using tailored approaches like SAM to get accurate results from complex images, offering a promising solution for plant scientists. Read more Segment-Anything-Model Large-Vision-Models Vision Transformers Semantic segmentation Prompt Segmentation Interactive machine learning
7	3D Gaze Estimation on RGB Images using Vision Transformers Li, Jing January 2023 (has links) Gaze estimation, a vital component in numerous applications such as humancomputer interaction, virtual reality, and driver monitoring systems, is the process of predicting the direction of an individual’s gaze. The predominant methods for gaze estimation can be broadly classified into intrusive and nonintrusive approaches. Intrusive methods necessitate the use of specialized hardware, such as eye trackers, while non-intrusive methods leverage images or recordings obtained from cameras to make gaze predictions. This thesis concentrates on appearance-based gaze estimation, specifically within the non-intrusive domain, employing various deep learning models. The primary focus of this study is to compare the efficacy of Vision Transformers (ViTs), a recently introduced architecture, with Convolutional Neural Networks (CNNs) for gaze estimation on RGB images. Performance evaluations of the models are conducted based on metrics such as the angular gaze error, stimulus distance error, and model size. Within the realm of ViTs, two variants are explored: pure ViTs and hybrid ViTs, which combine both CNN and ViT architectures. Throughout the project, both variants are examined in different sizes. Experimental results demonstrate that all pure ViTs underperform in comparison to the baseline ResNet-18 model. However, the hybrid ViT consistently emerges as the best-performing model across all evaluation datasets. Nonetheless, the discussion regarding whether to deploy the hybrid ViT or stick with the baseline model remains unresolved. This uncertainty arises because utilizing an exceedingly large and slow model, albeit highly accurate, may not be the optimal solution. Hence, the selection of an appropriate model may vary depending on the specific use case. / Ögonblicksbedömning, en avgörande komponent inom flera tillämpningar såsom människa-datorinteraktion, virtuell verklighet och övervakningssystem för förare, är processen att förutsäga riktningen för en individs blick. De dominerande metoderna för ögonblicksbedömning kan i stort sett indelas i påträngande och icke-påträngande tillvägagångssätt. Påträngande metoder kräver användning av specialiserad hårdvara, såsom ögonspårare, medan ickepåträngande metoder utnyttjar bilder eller inspelningar som erhållits från kameror för att göra bedömningar av blicken. Denna avhandling fokuserar på utseendebaserad ögonblicksbedömning, specifikt inom det icke-påträngande området, genom att använda olika djupinlärningsmodeller. Studiens huvudsakliga fokus är att jämföra effektiviteten hos Vision Transformers (ViTs), en nyligen introducerad arkitektur, med Convolutional Neural Networks (CNNs) för ögonblicksbedömning på RGB-bilder. Prestandautvärderingar av modellerna utförs baserat på metriker som den vinkelmässiga felbedömningen av blicken, felbedömning av stimulusavstånd och modellstorlek. Inom ViTs-området utforskas två varianter: rena ViTs och hybrid-ViT, som kombinerar både CNN- och ViT-arkitekturer. Under projektet undersöks båda varianterna i olika storlekar. Experimentella resultat visar att alla rena ViTs presterar sämre jämfört med basmodellen ResNet-18. Hybrid-ViT framstår dock konsekvent som den bäst presterande modellen över alla utvärderingsdatauppsättningar. Diskussionen om huruvida hybrid-ViT ska användas eller om man ska hålla sig till basmodellen förblir dock olöst. Denna osäkerhet uppstår eftersom användning av en extremt stor och långsam modell, även om den är mycket exakt, kanske inte är den optimala lösningen. Valet av en lämplig modell kan därför variera beroende på det specifika användningsområdet. Read more 3D Gaze Estimation Vision Transformers (ViTs) Convolutional Neural Networks (CNNs) Multi-Head Attention Red-Green-Blue (RGB) Images 3D Blickriktningsestimering Vision Transformers (ViTs) Konvolutionsneurala Nätverk (CNNs) Multi-Head Attention Röd-Grön-Blå (RGB) Bilder Computer and Information Sciences Data- och informationsvetenskap
8	Evaluating Transfer Learning Capabilities of Neural NetworkArchitectures for Image Classification Darouich, Mohammed, Youmortaji, Anton January 2022 (has links) Training a deep neural network from scratch can be very expensive in terms of resources.In addition, training a neural network on a new task is usually done by training themodel form scratch. Recently there are new approaches in machine learning which usesthe knowledge from a pre-trained deep neural network on a new task. The technique ofreusing the knowledge from previously trained deep neural networks is called Transferlearning. In this paper we are going to evaluate transfer learning capabilities of deep neuralnetwork architectures for image classification. This research attempts to implementtransfer learning with different datasets and models in order to investigate transfer learningin different situations. Residual Neural Network Convolutional Neural Network Vision Transformers Transfer Learning Neural Networks Image Classification. Computer Sciences Datavetenskap (datalogi)
9	Learning Embeddings for Fashion Images Hermansson, Simon January 2023 (has links) Today the process of sorting second-hand clothes and textiles is mostly manual. In this master’s thesis, methods for automating this process as well as improving the manual sorting process have been investigated. The methods explored include the automatic prediction of price and intended usage for second-hand clothes, as well as different types of image retrieval to aid manual sorting. Two models were examined: CLIP, a multi-modal model, and MAE, a self-supervised model. Quantitatively, the results favored CLIP, which outperformed MAE in both image retrieval and prediction. However, MAE may still be useful for some applications in terms of image retrieval as it returns items that look similar, even if they do not necessarily have the same attributes. In contrast, CLIP is better at accurately retrieving garments with as many matching attributes as possible. For price prediction, the best model was CLIP. When fine-tuned on the dataset used, CLIP achieved an F1-Score of 38.08 using three different price categories in the dataset. For predicting the intended usage (either reusing the garment or exporting it to another country) the best model managed to achieve an F1-Score of 59.04. Computer Vision Machine Learning Image Retrieval CLIP Masked Autoencoders (MAE) Vision Transformers Image Captioning Price Prediction AI for Fashion
10	Mutual Enhancement of Environment Recognition and Semantic Segmentation in Indoor Environment Challa, Venkata Vamsi January 2024 (has links) Background:The dynamic field of computer vision and artificial intelligence has continually evolved, pushing the boundaries in areas like semantic segmentation andenvironmental recognition, pivotal for indoor scene analysis. This research investigates the integration of these two technologies, examining their synergy and implicayions for enhancing indoor scene understanding. The application of this integrationspans across various domains, including smart home systems for enhanced ambientliving, navigation assistance for Cleaning robots, and advanced surveillance for security. Objectives: The primary goal is to assess the impact of integrating semantic segmentation data on the accuracy of environmental recognition algorithms in indoor environments. Additionally, the study explores how environmental context can enhance the precision and accuracy of contour-aware semantic segmentation. Methods: The research employed an extensive methodology, utilizing various machine learning models, including standard algorithms, Long Short-Term Memorynetworks, and ensemble methods. Transfer learning with models like EfficientNet B3, MobileNetV3 and Vision Tranformer was a key aspect of the experimentation. The experiments were designed to measure the effect of semantic segmentation on environmental recognition and its reciprocal influence. Results: The findings indicated that the integration of semantic segmentation data significantly enhanced the accuracy of environmental recognition algorithms. Conversely, incorporating environmental context into contour-aware semantic segmentation led to notable improvements in precision and accuracy, reflected in metrics such as Mean Intersection over Union(MIoU). Conclusion: This research underscores the mutual enhancement between semantic segmentation and environmental recognition, demonstrating how each technology significantly boosts the effectiveness of the other in indoor scene analysis. The integration of semantic segmentation data notably elevates the accuracy of environmental recognition algorithms, while the incorporation of environmental context into contour-aware semantic segmentation substantially improves its precision and accuracy.The results also open avenues for advancements in automated annotation processes, paving the way for smarter environmental interaction. Read more Semantic Segmentation Scene Classification Environment Recognition Machine Learning Deep Learning Image Classification Vision Transformers SAM(Segment Anything Model) Image Segmentation Contour-aware semantic segmentation Computer Sciences Datavetenskap (datalogi)

Search results