  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Detección de objetos usando redes neuronales convolucionales junto con Random Forest y Support Vector Machines

Campanini García, Diego Alejandro January 2018
Electrical Civil Engineer (Ingeniero Civil Eléctrico) / This thesis develops an object detection system (localization and classification) based on convolutional neural networks (CNNs) and two classical machine learning methods, Random Forest (RF) and Support Vector Machines (SVMs). The idea is to use these classifiers to improve the performance of the detection system known as Faster R-CNN (R-CNN stands for Regions with CNN features). Faster R-CNN relies on the concept of region proposals to generate candidate object samples and then produces two outputs: one with the regression that characterizes the location of the objects, and another with the confidence scores associated with the predicted bounding boxes. Both outputs are generated by fully connected layers. This work intervenes at the output that generates the confidence scores: a classifier (RF or SVM) is connected at that point and produces the system's output scores, with the aim of improving on the performance of Faster R-CNN. The classifiers are trained on feature vectors extracted from one of the fully connected layers of Faster R-CNN; all three such layers in the architecture are tested to determine which yields the best results. To define, among other things, the number of convolutional layers and the size of the filters in the first layers of Faster R-CNN, the ZF and VGG16 convolutional network models are used; these are classification-only networks and are the same ones used originally. The proposed systems are built on several implementations and libraries whose code is openly available.
For the Faster R-CNN detector, an implementation written in Python is used; for RF, two libraries are compared: randomForest, written in R, and scikit-learn, in Python. For SVM, the LIBSVM library, written in C, is used. The main programming tasks are: developing the algorithms that label the feature vectors extracted from the fully connected layers; joining the classifiers to the base system for online analysis of images at test time; and programming a time- and memory-efficient training algorithm for SVM (known as hard negative mining). Evaluating the developed systems shows that the best results are obtained with the VGG16 network, specifically with Faster R-CNN+SVM using an RBF (radial basis function) kernel, which achieves a mean Average Precision (mAP) of 68.9%. The second-best result is achieved by Faster R-CNN+RF with 180 trees, at 67.8%. The original Faster R-CNN system achieves an mAP of 69.3%.
2

A Novel Animal Detection Technique for Intelligent Vehicles

Zhao, Weihong 29 August 2018
Animal-vehicle collisions have been a topic of concern for years, especially in North America. To mitigate the problem, this thesis focuses on animal detection based on the onboard camera for intelligent vehicles. In the domain of image classification and object detection, the methods of shape matching and local feature crafting have been at a technical plateau for decades. The development of the Convolutional Neural Network (CNN) brings a new breakthrough. The evolution of CNN architectures has dramatically improved the performance of image classification, which in turn has boosted effective CNN-based frameworks for object detection. Notably, the family of Region-based Convolutional Neural Networks (R-CNN) performs well by combining region proposal with CNN. In this thesis, we propose to apply a new region proposal method, Maximally Stable Extremal Regions (MSER), in Fast R-CNN to construct the animal detection framework. The MSER algorithm detects stable regions which are invariant to scale, rotation and viewpoint changes. We generate regions of interest from the result of the MSER algorithm in two ways: by enclosing all the pixels from the resulting pixel list with a minimum enclosing rectangle (the PL MSER) and by fitting the resulting elliptical region to an approximate box (the EL MSER). We then preprocess the bounding boxes of PL MSER and EL MSER to improve the recall of detection. The preprocessing steps consist of filtering out undesirable regions with an aspect ratio model, clustering bounding boxes to merge overlapping regions, and modifying and then enlarging the regions to cover the entire animal. We evaluate the two region proposal methods using the recall versus IoU-threshold curve. The proposed MSER method can cover the expected regions better than Edge Boxes and the Region Proposal Network (RPN) in Faster R-CNN. We apply the MSER region proposal method to the framework of R-CNN and Fast R-CNN.
The experiments on the animal database with moose, deer, elk, and horses show that Fast R-CNN with MSER achieves better accuracy and faster speed than R-CNN with MSER. Concerning the two ways of MSER, the experimental results show that PL MSER is faster than EL MSER and EL MSER gains higher precision than PL MSER. Also, by altering the structure of the network used in Fast R-CNN, we verify that a network stacking more layers achieves higher accuracy and recall. In addition, we compare the Fast R-CNN framework using MSER region proposal with the state-of-the-art Faster R-CNN by evaluating the experimental results on our animal database. Using the same CNN structure, the proposed Fast R-CNN with MSER achieves a higher average animal-detection accuracy of 0.73, compared to 0.42 for Faster R-CNN. In terms of detection quality, the proposed Fast R-CNN with MSER achieves a better IoU histogram than Faster R-CNN.
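The abstract above evaluates region proposals with a recall versus IoU-threshold curve. As an illustration only, the following Python sketch shows one common way such a curve can be computed. The box format `(x1, y1, x2, y2)` and the function names `iou`, `recall_at_threshold` and `recall_curve` are assumptions made for this example and are not taken from the thesis.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_threshold(ground_truth, proposals, threshold):
    """Fraction of ground-truth boxes matched by at least one proposal
    whose IoU with that box is >= threshold (assumed matching rule)."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for gt in ground_truth
        if any(iou(gt, p) >= threshold for p in proposals)
    )
    return hits / len(ground_truth)


def recall_curve(ground_truth, proposals, thresholds):
    """List of (threshold, recall) points for plotting the curve."""
    return [(t, recall_at_threshold(ground_truth, proposals, t))
            for t in thresholds]
```

A proposal method whose curve stays higher as the threshold grows covers the annotated objects more tightly, which is the sense in which the abstract compares MSER with Edge Boxes and RPN.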
3

Cascade Mask R-CNN and Keypoint Detection used in Floorplan Parsing

Eklund, Anton January 2020
Parsing floorplans has long been a problem in automatic document analysis and, until recent years, was approached with algorithmic methods. With the rise of convolutional neural networks (CNNs), this problem too has seen an upswing in performance. In this thesis the task is to recover, as accurately as possible, spatial and geometric information from floorplans. This project builds around instance segmentation models like Cascade Mask R-CNN to extract the bulk of the information from a floorplan image. To complement the segmentation, a new way of using a keypoint CNN is presented to find precise locations of corners. These are then combined in a post-processing step to give the resulting segmentation. The resulting segmentation scores exceed the current baseline of the CubiCasa5k floorplan dataset with a mean IoU of 72.7% compared to 57.5%. Further, the mean IoU for individual classes is also improved for almost every class. It is also shown that Cascade Mask R-CNN is better suited than Mask R-CNN for this task.
4

Reconocimiento rápido de objetos usando objects proposals y deep learning

Soto Barra, Claudia Naiomi January 2017
Electrical Civil Engineer (Ingeniera Civil Eléctrica) / Object recognition (or detection) is an active and continuously improving area of computer vision. Various strategies have recently been introduced to improve performance and reduce detection cost and time. Among them are the generation of object proposals (regions of the image with a high probability of containing an object) to speed up the localization stage, as an alternative to the sliding-window paradigm, and the increasingly popular use of deep learning networks, in particular convolutional neural networks (CNNs) for image classification and detection. Although several works use both techniques, they all focus on performing well on well-known datasets and competitions rather than studying their behavior on real problems, the effect of modifying conventional network architectures, and the appropriate choice of a proposal generation system. The main objective of this thesis is therefore to characterize proposal generation methods for use in object recognition with CNNs, comparing the performance of both the generated proposals and the complete system on manually built datasets. To study the complete system, two well-known structures are compared, R-CNN and Fast R-CNN, which combine the two techniques (proposal generation and detection) in different ways; the state of the art considers Fast R-CNN the better of the two.
This work proposes that this assumption does not fully hold when working with a sufficiently low number of proposals (the datasets built here are designed precisely to ensure a small number of similarly sized objects in each: objects on surfaces and objects in a living room) and when the classification process is accelerated by changing the input size of the convolutional network. Three proposal generation methods were chosen from the literature based on their reported performance, and their processing times and the quality of the generated proposals (through visual and numerical analysis) were compared in different scenarios as a function of the number of proposals generated. The method called BING has a substantial advantage in processing time and achieves competitive recall (the fraction of ground-truth objects correctly detected) for the chosen applications. To implement R-CNN, two SqueezeNet-type networks with reduced inputs are trained, and the 50 best proposals generated by BING are selected; with a 64x64 input network, this reaches almost the same recall (~40%) as the original Fast R-CNN, with better precision, although it is 5 times slower (0.75 s versus 0.14 s). The R-CNN system implemented in this work thus not only speeds up the proposal generation stage by a factor of 10 to 20 compared with its original implementation; reducing the input of the network also brings the detection time down to only 5 times slower than Fast R-CNN, where it was previously up to 100 times slower, with equivalent performance.
5

Faster R-CNN based CubeSat Close Proximity Detection and Attitude Estimation

Sujeewa Samarawickrama, N G I 09 August 2019
Automatic detection of space objects in optical images is important to close proximity operations, relative navigation, and situational awareness. To better protect space assets, it is very important not only to know where a space object is, but also what the object is. In this dissertation, a method for detecting multiple 1U, 2U, 3U, and 6U CubeSats based on the faster region-based convolutional neural network (Faster R-CNN) is described. CubeSats detection models are developed using Web-searched and computer-aided design images. In addition, a two-step method is presented for detecting a rotating CubeSat in close proximity from a sequence of images without the use of intrinsic or external camera parameters. First, a Faster R-CNN trained on synthetic images of 1U, 2U, 3U, and 6U CubeSats locates the CubeSat in each image and assigns a weight to each CubeSat class. Then, these classification results are combined using Dempster's rule. The method is tested on simulated scenarios where the rotating 3U and 6U CubeSats are in unfavorable views or in dark environments. Faster R-CNN detection results contain useful information for tracking, navigation, pose estimation, and simultaneous localization and mapping. A coarse single-point attitude estimation method is proposed utilizing the centroids of the bounding boxes surrounding the CubeSats in the image. The centroids define the line-of-sight (LOS) vectors to the detected CubeSats in the camera frame, and the LOS vectors in the reference frame are assumed to be obtained from global positioning system (GPS). The three-axis attitude is determined from the vector observations by solving Wahba's problem. The attitude estimation concept is tested on simulated scenarios using Autodesk Maya.
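The abstract above combines per-image classification results using Dempster's rule. The Python sketch below illustrates the general rule of combination for two mass functions. How the dissertation maps Faster R-CNN class weights to masses is not stated in the abstract, so the dictionary representation (a `frozenset` of class labels mapped to a mass) and the name `dempster_combine` are assumptions for illustration.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function is a dict mapping a frozenset of hypotheses
    (for example CubeSat classes) to a mass; masses should sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (set_a, mass_a), (set_b, mass_b) in product(m1.items(), m2.items()):
        overlap = set_a & set_b
        if overlap:
            # Agreeing evidence accumulates on the intersection.
            combined[overlap] = combined.get(overlap, 0.0) + mass_a * mass_b
        else:
            # Disjoint focal elements contribute to the conflict mass K.
            conflict += mass_a * mass_b
    norm = 1.0 - conflict
    if norm <= 1e-12:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Normalize by 1 - K so that the combined masses sum to 1 again.
    return {focal: mass / norm for focal, mass in combined.items()}


# Hypothetical evidence from two images of the same rotating CubeSat.
U3, U6 = frozenset({"3U"}), frozenset({"6U"})
EITHER = U3 | U6
image_1 = {U3: 0.6, U6: 0.3, EITHER: 0.1}
image_2 = {U3: 0.7, U6: 0.2, EITHER: 0.1}
fused = dempster_combine(image_1, image_2)
```

Applying the function repeatedly over a sequence of images accumulates evidence, so a class that is weakly favored in several unfavorable views can still end up with the largest combined mass.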
6

News article segmentation using multimodal input : Using Mask R-CNN and sentence transformers / Artikelsegmentering med multimodala artificiella neuronnätverk : Med hjälp av Mask R-CNN och sentence transformers

Henning, Gustav January 2022
In this century and the last, serious efforts have been made to digitize the content housed by libraries across the world. In order to open up these volumes to content-based information retrieval, independent elements such as headlines, body text, bylines, images and captions ideally need to be connected semantically as article-level units. To query on facets such as author, section, content type or other metadata, further processing of these documents is required. Even though humans have shown exceptional ability to segment different types of elements into related components, even in languages foreign to them, this task has proven difficult for computers. The challenge of semantic segmentation in newspapers lies in the diversity of the medium: Newspapers have vastly different layouts, covering diverse content, from news articles to ads to weather reports. State-of-the-art object detection and segmentation models have been trained to detect and segment real-world objects. It is not clear whether these architectures can perform equally well when applied to scanned images of printed text. In the domain of newspapers, in addition to the images themselves, we have access to textual information through Optical Character Recognition. The recent progress made in the field of instance segmentation of real-world objects using deep learning techniques raises the question: Can the same methodology be applied in the domain of newspaper articles? In this thesis we investigate one possible approach to encode the textual signal into the image in an attempt to improve performance. Based on newspapers from the National Library of Sweden, we investigate the predictive power of visual and textual features and their capacity to generalize across different typographic designs. Results show impressive mean Average Precision scores (> 0.9) for test sets sampled from the same newspaper designs as the training data when using only the image modality.
/ I detta och det förra århundradet har kraftiga åtaganden gjorts för att digitalisera traditionellt medieinnehåll som tidigare endast tryckts i pappersformat. För att kunna stödja sökningar och fasetter i detta innehåll krävs bearbetning på semantisk nivå, det vill säga att innehållet styckas upp på artikelnivå, istället för per sida. Trots att människor har lätt att dela upp innehåll på semantisk nivå, även på ett främmande språk, fortsätter arbetet för automatisering av denna uppgift. Utmaningen i att segmentera nyhetsartiklar återfinns i mångfalden av utseende och format. Innehållet är även detta mångfaldigt, där man återfinner allt ifrån faktamässiga artiklar, till debatter, listor av fakta och upplysningar, reklam och väder bland annat. Stora framsteg har gjorts inom djupinlärning just för objektdetektering och semantisk segmentering bara de senaste årtiondet. Frågan vi ställer oss är: Kan samma metodik appliceras inom domänen nyhetsartiklar? Dessa modeller är skapta för att klassificera världsliga ting. I denna domän har vi tillgång till texten och dess koordinater via en potentiellt bristfällig optisk teckenigenkänning. Vi undersöker ett sätt att utnyttja denna textinformation i ett försök att förbättra resultatet i denna specifika domän. Baserat på data från Kungliga Biblioteket undersöker vi hur väl denna metod lämpar sig för uppstyckandet av innehåll i tidningar längsmed tidsperioder där designen förändrar sig markant. Resultaten visar att Mask R-CNN lämpar sig väl för användning inom domänen nyhetsartikelsegmentering, även utan texten som input till modellen.
7

Using Mask R-CNN for Instance Segmentation of Eyeglass Lenses / Användning av Mask R-CNN för instanssegmentering av glasögonlinser

Norrman, Marcus, Shihab, Saad January 2021
This thesis investigates the performance of Mask R-CNN when utilizing transfer learning on a small dataset. The aim was to perform instance segmentation of eyeglass lenses as accurately as possible from self-portrait images. Five different models were trained, where the key difference was the types of eyeglasses the models were trained on. The eyeglasses were grouped into three types: fully rimmed, semi-rimless, and rimless glasses. 1550 images were used for training, validation, and testing. The models' performance was evaluated using TensorBoard training data and mean Intersection over Union (mIoU) scores. No major differences in performance were found among the four models that grouped all three types of glasses into one class. Their mIoU scores ranged from 0.913 to 0.94, whereas the model with one class for each group of glasses performed worse, with an mIoU of 0.85. The thesis revealed that one can achieve great instance segmentation results using a limited dataset when taking advantage of transfer learning. / Denna uppsats undersöker prestandan för Mask R-CNN vid användning av överföringsinlärning på en liten datamängd. Syftet med arbetet var att segmentera glasögonlinser så exakt som möjligt från självporträttbilder. Fem olika modeller tränades, där den viktigaste skillnaden var de typer av glasögon som modellerna tränades på. Glasögonen delades in i 3 typer, helbåge, halvbåge och båglösa. Totalt samlades 1550 träningsbilder in, dessa annoterades och användes för att träna modellerna. Modellens prestanda utvärderades med TensorBoard träningsdata samt genomsnittlig Intersection over Union (IoU). Inga större skillnader i prestanda hittades mellan modellerna som endast tränades på en klass av glasögon. Deras genomsnittliga IoU varierar mellan 0,913 och 0,94. Modellen där varje glasögonkategori representerades som en unik klass, presterade sämre med en genomsnittlig IoU på 0,85.
Resultatet av uppsatsen påvisar att goda instanssegmenteringsresultat går att uppnå med hjälp av en begränsad datamängd om överföringsinlärning används.
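The abstract above reports mean Intersection over Union (mIoU) scores for predicted lens masks. The following Python sketch shows how such a score can be computed for binary masks. The mask representation (nested lists of 0/1 values of equal shape), the function names `mask_iou` and `mean_iou`, and the convention of scoring two empty masks as 1.0 are assumptions for this example, not details from the thesis.

```python
def mask_iou(predicted, truth):
    """IoU of two binary masks given as equally shaped nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over an iterable of (predicted, truth) mask pairs."""
    scores = [mask_iou(pred, truth) for pred, truth in mask_pairs]
    return sum(scores) / len(scores)


# Tiny hypothetical 2x2 masks, for illustration only.
predicted_mask = [[1, 1], [0, 0]]
truth_mask = [[1, 0], [1, 0]]
score = mask_iou(predicted_mask, truth_mask)
```

On real images the masks would typically be arrays produced by the segmentation model and the annotation tool, but the counting logic stays the same.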
8

[en] METHOD FOR AUTOMATIC DETECTION OF STAMPS IN SCANNED DOCUMENTS USING DEEP LEARNING AND SYNTHETIC DATA GENERATION BY INSTANCE AUGMENTATION / [pt] MÉTODO PARA DETECÇÃO AUTOMÁTICA DE CARIMBOS EM DOCUMENTOS ESCANEADOS USANDO DEEP LEARNING E GERAÇÃO DE DADOS SINTÉTICOS ATRAVÉS DE INSTANCE AUGMENTATION

THALES LEVI AZEVEDO VALENTE 11 August 2022
[pt] Documentos digitalizados em ambientes de negócios substituíram grandes volumes de papéis. Profissionais autorizados usam carimbos para certificar informações críticas nesses documentos. Muitas empresas precisam verificar o carimbo adequado de documentos de entrada e saída. Na maioria das situações de inspeção, as pessoas realizam inspeção visual para identificar carimbos. Assim sendo, a verificação manual de carimbos é cansativa, suscetível a erros e ineficiente em termos de tempo gasto e resultados esperados. Erros na verificação manual de carimbos podem gerar multas de órgãos reguladores, interrupção de operações e até mesmo comprometer fluxos de trabalho e transações financeiras. Este trabalho propõe dois métodos que combinados podem resolver esse problema, automatizando totalmente a detecção de carimbos em documentos digitalizados do mundo real. Os métodos desenvolvidos podem lidar com conjuntos de dados contendo muitos tipos de carimbos de tamanho de amostra pequena, com múltiplas sobreposições, combinações diferentes por página e dados ausentes. O primeiro método propõe uma arquitetura de rede profunda projetada a partir da relação entre os problemas identificados em carimbos do mundo real e os desafios e soluções da tarefa de detecção de objetos apontados na literatura. O segundo método propõe um novo pipeline de aumento de instâncias de conjuntos de dados de carimbos a partir de dados reais e investiga se é possível detectar tipos de carimbos com amostras insuficientes. Este trabalho avalia os hiperparâmetros da abordagem de aumento de instâncias e os resultados obtidos usando um método Deep Explainability. Foram alcançados resultados de última geração para a tarefa de detecção de carimbos combinando com sucesso esses dois métodos, alcançando 97.3 por cento de precisão e 93.2 por cento de recall. / [en] Scanned documents in business environments have replaced large volumes of papers. 
Authorized professionals use stamps to certify critical information in these documents. Many companies need to verify the adequate stamping of incoming and outgoing documents. In most inspection situations, people perform a visual inspection to identify stamps. Therefore, manual stamp checking is tiring, susceptible to errors, and inefficient in terms of time spent and expected results. Errors in manual checking for stamps can lead to fines from regulatory bodies, interruption of operations, and even compromise workflows and financial transactions. This work proposes two methods that, combined, can address this problem by fully automating stamp detection in real-world scanned documents. The developed methods can handle datasets containing many stamp types with small sample sizes, multiple overlaps, different combinations per page, and missing data. The first method proposes a deep network architecture designed from the relationship between the problems identified in real-world stamps and the challenges and solutions of the object detection task pointed out in the literature. The second method proposes a novel instance augmentation pipeline of stamp datasets from real data to investigate whether it is possible to detect stamp types with insufficient samples. We evaluate the hyperparameters of the instance augmentation approach and the obtained results through a Deep Explainability method. We achieve state-of-the-art results for the stamp detection task by successfully combining these two methods, achieving 97.3 percent precision and 93.2 percent recall.
9

Scene Recognition for Safety Analysis in Collaborative Robotics

Wang, Shaolei January 2018
In modern industrial environments, human-robot collaboration is a trend in automation to improve performance and productivity. Instead of isolating robots from humans to guarantee safety, collaborative robotics allows humans and robots to work in the same area at the same time. New hazards and risks, such as collisions between robots and humans, arise in this situation. Safety analysis is necessary to protect both human and robot when using a collaborative robot. To perform safety analysis, robots need to perceive the surrounding environment in real time. This surrounding environment is perceived and stored in the form of a scene graph, which is a directed graph with a semantic representation of the environment, the relationships between the detected objects and the properties of these objects. In order to generate the scene graph, a simulated warehouse is used: robots and humans work in a common area for transferring products between shelves and conveyor belts. Each robot generates its own scene graph from the attached camera sensor. In the graph, each detected object is represented by a node and edges are used to denote the relationships among the identified objects. Each graph node includes values like velocity, bounding box sizes, orientation, distance and directions between the object and the robot. We generate the scene graph in a simulated warehouse scenario at a frequency of 7 Hz and present a study of Mask R-CNN based on a qualitative comparison. Mask R-CNN is a method for object instance segmentation used to get the properties of the objects. It uses ResNet-FPN for feature extraction and adds a branch to Faster R-CNN for predicting a segmentation mask for each object. Its results outperform almost all existing single-model entries on instance segmentation and bounding-box object detection. With the help of this method, the boundaries of the detected objects are extracted from the camera images.
We initialize the Mask R-CNN model using three different types of weights (COCO pre-trained, ImageNet pre-trained, and random), and the results for these three initializations are compared with respect to precision and recall. Results showed that Mask R-CNN is also suitable for simulated environments and can meet requirements in both detection precision and speed. Moreover, the model trained using the COCO pre-trained weights outperformed the models with ImageNet and randomly assigned initial weights. The calculated mean Average Precision (mAP) value for the validation dataset reaches 0.949 with COCO pre-trained weights, at an execution speed of 11.35 fps. / I modern industriella miljöer, för att förbättra prestanda och produktivitet i automatisering är human-robot samarbete en trend. Istället för att isolera roboten från människan för att garantera säkerheten, möjliggör samarbets robotar att man och robot arbetar i samma område samtidigt. Nya risker, såsom kollisionen mellan robot och människa, uppstår i denna situation. Säkerhetsanalys är nödvändig för att skydda både människa och robot när man använder en samarbets robot. För att utföra säkerhetsanalys måste robotar uppfatta omgivningen i realtid. Denna omgivande miljö uppfattas och lagras i form av scen graf, som är ett direkt diagram med semantisk representation av miljön, samt förhållandet mellan de detekterade objekten och egenskaperna hos dessa objekt. För att skapa scen grafen används ett simulerat lager: robotar och människor arbetar i ett gemensamt område för överföring av produkter mellan hyllor och transportband. Varje robot genererar sin egen scen grafik från den medföljande kamerasensorn. I diagrammet presenteras varje detekterat objekt av en nod och kanterna används för att beteckna förhållandet mellan de identifierade objekten.
Diagram noden innehåller värden som hastighet, gränsvärde, orientering, avstånd och riktningar mellan objektet och roboten. Vi genererar scen graf i ett simulerat lager scenario med frekvensen 7 Hz och presenterar en studie av Mask R-CNN baserat på den kvalitativa jämförelsen. Mask R-CNN är ett sätt att segmentera objekt exempel för att få objektens egenskaper. Det använder ResNetFPN för funktion extraktion och lägger till en gren till Snabbare R-CNN för att förutsäga segmenterings mask för varje objekt. Och dess resultat överträffar nästan alla befintliga, enkel modell poster, till exempel segmentering och avgränsning av objektiv detektering. Med hjälp av denna metod extraheras kanterna för det detekterade objektet från kamerabilderna. Vi initierar Mask R-CNN-modellen med tre olika typer av vikter: COCO-utbildade vikter, ImageNet-tränade vikter och slumpmässiga vikter, och resultaten av dessa tre olika vikter jämförs med avseende på precision och återkallelse. Resultaten visade att Mask R-CNN också är lämplig för simulerade miljöer och kan uppfylla kraven i både detekterings precision och hastighet. Dessutom använde den utbildade modellen de COCO-tränade vikterna överträffat modellen med slumpmässigt tilldelade initial vikter. Det beräknade medelvärdet för precision (mAP) för validerings dataset når 0.949 med COCO-pre-utbildade vikter och körhastighet på 11.35 fps.
10

Image-Text context relation using Machine Learning : Research on performance of different datasets

Sun, Yuqi January 2022
Based on the progress in Computer Vision and Natural Language Processing fields, Vision-Language (VL) models are designed to process information from images and texts. The thesis focused on the performance of a model, Oscar, on different datasets. Oscar is a State-of-The-Art VL representation learning model based on a pre-trained model for Object Detection and a pre-trained Bert model. By comparing the performance of datasets, we could understand the relationship between the properties of datasets and the performance of models. The conclusions could provide the direction for future work on VL datasets and models. In this thesis, I collected five VL datasets that have at least one main difference from each other and generated 8 subsets from these datasets. I trained the same model with different subsets to classify whether an image is related to a text. In common sense, clear datasets have better performance because their images are of everyday scenes and annotated by human annotators. Thus, the size of clear datasets is always limited. However, an interesting phenomenon in the thesis is that the dataset generated by models trained on different datasets has achieved as good performance as clear datasets. This would encourage the research on models for data collection. The experiment results also indicated that future work on the VL model could focus on improving feature extraction from images, as the images have a great influence on the performance of VL models. / Baserat på prestationerna inom Computer Vision och Natural Language Processing-fält, är Vision-Language (VL)-modeller utformade för att bearbeta information från bilder och texter. Projektet fokuserade på prestanda av en modell, Oscar, på olika datamängder. Oscar är en State-of-The-Art VL-representationsinlärningsmodell baserad på en förutbildad modell för Objektdetektion och en förutbildad Bert-modell. 
Genom att jämföra datauppsättningarnas prestanda kunde vi förstå sambandet mellan datauppsättningarnas egenskaper och modellernas prestanda. Slutsatserna skulle kunna ge riktning för framtida arbete med VL-datauppsättningar och modeller. I detta projekt samlade jag fem VL-datauppsättningar som har minst en huvudskillnad från varandra och genererade 8 delmängder från dessa datauppsättningar. Jag tränade samma modell med olika delmängder för att klassificera om en bild är relaterad till en text. I sunt förnuft har tydliga datauppsättningar bättre prestanda eftersom deras bilder är av vardagliga scener och kommenterade av människor. Storleken på tydliga datamängder är därför alltid begränsad. Ett intressant fenomen i projektet är dock att den datauppsättning som genereras av modeller har uppnått lika bra prestanda som tydliga datauppsättningar. Detta skulle uppmuntra forskning om modeller för datainsamling. Experimentresultaten indikerade också att framtida arbete med VL-modellen kan fokusera på att förbättra funktionsextraktion från bilder, eftersom bilderna har ett stort inflytande på prestandan hos VL-modeller.
