Global ETD Search

21	Writer identification using semi-supervised GAN and LSR method on offline block characters Hagström, Adrian, Stanikzai, Rustam January 2020 (has links) Block characters are often used when filling out forms, for example when writing one's personal number. This raises the question of whether recoverable biometric (identity-related) information exists within individual digits of handwritten personal numbers. This thesis investigates the question using both handcrafted features and features extracted via deep learning (DL) models, while successively limiting the number of available training samples. Some recent works using DL have presented semi-supervised methods that use data generated by a generative adversarial network (GAN) together with a modified label smoothing regularization (LSR) function. Using this training method might improve on the performance of a fully supervised baseline model when doing authentication. This work additionally proposes a novel modified LSR function, named bootstrap label smoothing regularizer (BLSR), designed to mitigate some of the problems of previous methods, and compares it to the others. The DL feature extraction is done by training a ResNet50 model to recognize the writers of personal numbers and then extracting the feature vector from the second-to-last layer of the network. Results show a clear indication of recoverable identity-related information within the handwritten (personal number) digits in boxes. Our results indicate an authentication performance, expressed as equal error rate (EER), of around 25% with handcrafted features. The EER was between 20% and 30% when using the features extracted from the DL model. The DL methods, while showing potential for greater performance than the handcrafted features, seem to suffer from fluctuation (noisiness) of results, making conclusions on their use in practice hard to draw. Additionally, when using 1-2 training samples the handcrafted features easily beat the DL methods. When using the LSR-variant semi-supervised methods there is no noticeable performance boost, and BLSR gets the second-best results among the alternatives.
Computer vision Deep learning GAN Generative adversarial network Semi-supervised learning LSR Machine learning ResNet AI Writer authentication Writer identification Datorseende Djup lärning GAN Maskininlärning AI Skrvning autentisering Skrvning identifiering Computer Sciences Datavetenskap (datalogi) Computer Engineering Datorteknik
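Note on the EER figures in record 21: the equal error rate is the operating point at which the false acceptance rate and the false rejection rate are equal. The sketch below is only an illustration of that definition, not the thesis's evaluation code. It assumes similarity scores where a higher value means "more likely the same writer"; the function name and the sample scores are hypothetical.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from similarity scores.

    genuine  -- scores for pairs written by the same writer
    impostor -- scores for pairs written by different writers
    Assumption: a pair is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: genuine pairs scoring below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance rate: impostor pairs scoring at or above it.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores, for illustration only.
same_writer = [0.9, 0.8, 0.7, 0.4]
other_writer = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(same_writer, other_writer))
```

With a finite set of scores the two error rates rarely cross exactly, so the sketch reports the average of the two at the closest threshold; interpolating between neighbouring thresholds is a common alternative.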
22	Deep Learning Models for Human Activity Recognition Albert Florea, George, Weilid, Filip January 2019 (has links) AMI Meeting Corpus (AMI)-databasen används för att undersöka igenkännande av gruppaktivitet. AMI Meeting Corpus (AMI)-databasen ger forskare fjärrstyrda möten och naturliga möten i en kontorsmiljö; mötescenario i ett fyra personers stort kontorsrum. För att uppnå gruppaktivitetsigenkänning användes bildsekvenser från videos och 2-dimensionella audiospektrogram från AMI-databasen. Bildsekvenserna är RGB-färgade bilder och ljudspektrogram har en färgkanal. Bildsekvenserna producerades i batcher så att temporala funktioner kunde utvärderas tillsammans med ljudspektrogrammen. Det har visats att inkludering av temporala funktioner både under modellträning och sedan förutsäga beteende hos en aktivitet ökar valideringsnoggrannheten jämfört med modeller som endast använder rumsfunktioner [1]. Deep learning arkitekturer har implementerats för att känna igen olika mänskliga aktiviteter i AMI-kontorsmiljön med hjälp av extraherade data från the AMI-databas. Neurala nätverks modellerna byggdes med hjälp av Keras API tillsammans med TensorFlow biblioteket. Det finns olika typer av neurala nätverksarkitekturer. Arkitekturerna som undersöktes i detta projektet var Residual Neural Network, Visual Geometry Group 16, Inception V3 och RCNN (LSTM). ImageNet-vikter har använts för att initialisera vikterna för Neurala nätverk basmodeller. ImageNet-vikterna tillhandahålls av Keras API och är optimerade för varje basmodell [2]. Basmodellerna använder ImageNet-vikter när de extraherar funktioner från inmatningsdata. Funktionsextraktionen med hjälp av ImageNet-vikter eller slumpmässiga vikter tillsammans med basmodellerna visade lovande resultat. Både Deep Learning användningen av täta skikt och LSTM spatio-temporala sekvens predikering implementerades framgångsrikt. 
/ The Augmented Multi-party Interaction (AMI) Meeting Corpus database is used to investigate group activity recognition in an office environment. The AMI Meeting Corpus database provides researchers with remote-controlled meetings and natural meetings in an office environment; the meeting scenario is set in a four-person office room. To achieve group activity recognition, video frames and 2-dimensional audio spectrograms were extracted from the AMI database. The video frames were RGB color images and the audio spectrograms had one color channel. The video frames were produced in batches so that temporal features could be evaluated together with the audio spectrograms. It has been shown that including temporal features, both during model training and when predicting the behavior of an activity, increases the validation accuracy compared to models that only use spatial features [1]. Deep learning architectures have been implemented to recognize different human activities in the AMI office environment using the extracted data from the AMI database. The neural network models were built using the Keras API together with the TensorFlow library. There are different types of neural network architectures. The architecture types investigated in this project were Residual Neural Network, Visual Geometry Group 16, Inception V3 and RCNN (Recurrent Neural Network). ImageNet weights were used to initialize the weights of the neural network base models. The ImageNet weights were provided by the Keras API and were optimized for each base model [2]. The base models use ImageNet weights when extracting features from the input data. Feature extraction using ImageNet weights or random weights together with the base models showed promising results. Both the deep learning approach using dense layers and the LSTM spatio-temporal sequence prediction were implemented successfully. 
ANN Deep learning DL human activity recognition ResNet VGG16 Inception V3 transfer learning ImageNet Keras AMI Augmented Multi-party Interaction LSTM RCNN CNN RGB colored images audio spectrograms Neural Network Engineering and Technology Teknik och teknologier
23	Image forgery detection using textural features and deep learning Malhotra, Yishu 06 1900 (has links) La croissance exponentielle et les progrès de la technologie ont rendu très pratique le partage de données visuelles, d'images et de données vidéo par le biais d’une vaste prépondérance de platesformes disponibles. Avec le développement rapide des technologies Internet et multimédia, l’efficacité de la gestion et du stockage, la rapidité de transmission et de partage, l'analyse en temps réel et le traitement des ressources multimédias numériques sont progressivement devenus un élément indispensable du travail et de la vie de nombreuses personnes. Sans aucun doute, une telle croissance technologique a rendu le forgeage de données visuelles relativement facile et réaliste sans laisser de traces évidentes. L'abus de ces données falsifiées peut tromper le public et répandre la désinformation parmi les masses. Compte tenu des faits mentionnés ci-dessus, la criminalistique des images doit être utilisée pour authentifier et maintenir l'intégrité des données visuelles. Pour cela, nous proposons une technique de détection passive de falsification d'images basée sur les incohérences de texture et de bruit introduites dans une image du fait de l'opération de falsification. De plus, le réseau de détection de falsification d'images (IFD-Net) proposé utilise une architecture basée sur un réseau de neurones à convolution (CNN) pour classer les images comme falsifiées ou vierges. Les motifs résiduels de texture et de bruit sont extraits des images à l'aide du motif binaire local (LBP) et du modèle Noiseprint. Les images classées comme forgées sont ensuite utilisées pour mener des expériences afin d'analyser les difficultés de localisation des pièces forgées dans ces images à l'aide de différents modèles de segmentation d'apprentissage en profondeur. 
Les résultats expérimentaux montrent que l'IFD-Net fonctionne comme les autres méthodes de détection de falsification d'images sur l'ensemble de données CASIA v2.0. Les résultats discutent également des raisons des difficultés de segmentation des régions forgées dans les images du jeu de données CASIA v2.0. / The exponential growth and advancement of technology have made it quite convenient for people to share visual data, imagery, and video data through a wide range of available platforms. With the rapid development of Internet and multimedia technologies, efficient storage and management, fast transmission and sharing, and real-time analysis and processing of digital media resources have gradually become an indispensable part of many people’s work and life. Undoubtedly, such technological growth has made forging visual data relatively easy and realistic without leaving any obvious visual clues. Abuse of such tampered data can deceive the public and spread misinformation amongst the masses. Considering the facts mentioned above, image forensics must be used to authenticate and maintain the integrity of visual data. For this purpose, we propose a passive image forgery detection technique based on textural and noise inconsistencies introduced in an image by the tampering operation. Moreover, the proposed Image Forgery Detection Network (IFD-Net) uses a Convolutional Neural Network (CNN) based architecture to classify the images as forged or pristine. The textural and noise residual patterns are extracted from the images using Local Binary Pattern (LBP) and the Noiseprint model. The images classified as forged are then used in experiments that analyze the difficulties of localizing the forged parts in these images using different deep learning segmentation models. Experimental results show that the IFD-Net performs comparably to other image forgery detection methods on the CASIA v2.0 dataset. 
The results also discuss the reasons behind the difficulties in segmenting the forged regions in the images of the CASIA v2.0 dataset. Épissage d'images Motif binaire local (LBP) Image Splicing Convolution Neural Networks (CNN) ResNet-50 U-Net Local Binary Pattern (LBP)
24	Artificial data for Image classification in industrial applications Yonan, Yonan, Baaz, August January 2022 (has links) Machine learning and AI are growing rapidly and are being implemented more often than before due to their high accuracy and performance. One of the biggest challenges in machine learning is data collection. The training data is the most important part of any machine learning project since it determines how the trained model will behave. In the case of object classification and detection, capturing a large number of images per object is not always possible and can be a very time-consuming and tedious process. This thesis explores options specific to image classification that help reduce the need to capture many images per object while maintaining the same accuracy. In this thesis, experiments have been performed with the goal of achieving a high classification accuracy with a limited dataset. One method that is explored is to create artificial training images using a game engine. Ways to expand a small dataset, such as different data augmentation methods and regularization methods, are also employed. / Maskininlärning och AI växer snabbt och de implementeras allt oftare på grund av deras höga noggrannhet och prestanda. En av de största utmaningarna för maskininlärning är datainsamling. Träningsdata är den viktigaste delen av ett maskininlärningsprojekt eftersom den avgör hur den tränade modellen kommer att bete sig. När det gäller objektklassificering och detektering är det inte alltid möjligt att ta många bilder per objekt och det kan vara en process som kräver mycket tid och arbete. Det här examensarbetet utforskar alternativ som är specifika för bildklassificering som minskar behovet av att ta många bilder per objekt samtidigt som prestanda bibehålls. I det här examensarbetet, flera experiment har utförts med målet att uppnå en hög klassificeringsprestanda med en begränsad dataset. 
En metod som utforskas är att skapa träningsbilder med hjälp av en spelmotor. Metoder för att utöka antal bilder i ett litet dataset, som data augmenteringsmetoder och regleringsmetoder, används också. Synthetic data artificial data object detection image classification artificial intelligence machine learning neural networks convolutional neural networks ResNet ResNet50 Engineering and Technology Teknik och teknologier Computer Engineering Datorteknik Computer Sciences Datavetenskap (datalogi)
25	Deep Neural Network for Classification of H&E-stained Colorectal Polyps : Exploring the Pipeline of Computer-Assisted Histopathology Brunzell, Stina January 2024 (has links) Colorectal cancer is one of the most prevalent malignancies globally, and the recent introduction of digital pathology enables the use of machine learning as an aid for fast diagnostics. This project aimed to develop a deep neural network model to specifically identify and differentiate dysplasia in the epithelium of colorectal polyps, posed as a binary classification problem. The available dataset consisted of 80 whole slide images of different H&E-stained polyp sections, which were divided into smaller patches, annotated by a pathologist. The best performing model was a pre-trained ResNet-18 using a weighted sampler, weight decay and augmentation during fine-tuning. The model reached an area under the precision-recall curve of 0.9989 and 97.41% accuracy on previously unseen data; its performance was judged to fall short of the task’s intra-observer variability but to be in line with the inter-observer variability. The final model is publicly available at https://github.com/stinabr/classification-of-colorectal-polyps. Machine Learning Deep Learning Neural Network Deep Neural Network ResNet-18 Transfer Learning Image Analysis Image Processing Medical Image Processing Whole Slide Image WSI Pathology Histopathology Digital Pathology Colorectal Cancer Cancer Polyp H&E-stain Medical Image Processing Medicinsk bildbehandling
26	Art to Genre through Deep Learning: A Comparative Analysis of ResNet and EfficientNet for Album Cover Image-Based Music Classification Bernsdorff Wallstedt, Simon January 2024 (has links) Musical genres enable listeners to differentiate between diverse styles and forms of music, serving as a practical tool to organize and categorize artists, albums, and songs. Album covers, featuring graphic depictions that reflect the vibe and tone of the music, serve as a visual intermediary between the artist and the audience. While numerous machine learning techniques leverage textual, visual, and audio information in a multi-modal approach to categorize music, a sole focus on visual aspects, specifically album cover images, and their correlation with musical genres has been less explored. The following question guides this research: How do EfficientNet and ResNet compare in their ability to accurately classify album cover images into specific genres based solely on visual features? Two state-of-the-art convolutional neural networks, ResNet and EfficientNet, are employed to classify a newly created dataset (the EquiGen dataset) of 60,000 album cover images into 15 distinct genres. The dataset was divided into 70% for training, 15% for validation, and 15% for testing. The findings reveal that both ResNet and EfficientNet achieve better-than-random classification accuracy, indicating that visual features alone can be informative for genre classification. Some genres performed much better than others, namely Metal, New Age and Rap. EfficientNet demonstrated slightly superior performance compared to ResNet, with higher accuracy, precision, recall, and F1 scores. However, both models had difficulty generalizing to unseen data and showed signs of overfitting. This study contributes to the interdisciplinary research on Music Genre Categorization (MGC), machine learning, and music. 
CNN convolutional neural network deep learning music genre categorization (MGC) music information retrieval (MIR) EfficientNet ResNet album cover artwork Information Systems, Social aspects
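Note on the 70%/15%/15% split in record 26: the sketch below shows one way such a split can be produced. It is an assumption-laden illustration, not the thesis's procedure. The abstract does not say whether the split was random or stratified by genre, so a simple seeded shuffle is assumed here, and `split_dataset` is a hypothetical helper name.

```python
import random


def split_dataset(items, train=0.70, val=0.15, seed=0):
    """Shuffle items and split them into train/validation/test parts.

    Assumption: a plain random split; whatever remains after the
    training and validation parts becomes the test part.
    """
    items = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


# Integer IDs stand in for the 60,000 album cover images.
train_ids, val_ids, test_ids = split_dataset(range(60000))
print(len(train_ids), len(val_ids), len(test_ids))
```

For a dataset with 15 genres, a stratified split (applying the same procedure within each genre) would keep the class proportions identical across the three parts; that variant is not shown here.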
27	Enhancing Fairness in Facial Recognition: Balancing Datasets and Leveraging AI-Generated Imagery for Bias Mitigation : A Study on Mitigating Ethnic and Gender Bias in Public Surveillance Systems Abbas, Rashad, Tesfagiorgish, William Issac January 2024 (has links) Facial recognition technology has become a ubiquitous tool in security and personal identification. However, the rise of this technology has been accompanied by concerns over inherent biases, particularly regarding ethnicity and gender. This thesis examines the extent of these biases by focusing on the influence of dataset imbalances in facial recognition algorithms. We employ a structured methodological approach that integrates AI-generated images to enhance dataset diversity, with the intent to balance representation across ethnicities and genders. Using the ResNet and VGG models, we conducted a series of controlled experiments that compare the performance impacts of balanced versus imbalanced datasets. Our analysis uses confusion matrices and accuracy, precision, recall and F1-score metrics to critically assess the models’ performance. The results demonstrate how tailored augmentation of training datasets can mitigate bias, leading to more equitable outcomes in facial recognition technology. We present our findings with the aim of contributing to the ongoing dialogue regarding AI fairness and propose a framework for future research in the field. Facial Recognition Technology Algorithmic Bias Dataset Imbalance ethnic and Gender Representation AI-Generated Images ResNet Model Vgg model Model Performance Evaluation Confusion Matrices AI Fairness and Data Augmentation Engineering and Technology Teknik och teknologier Computer Sciences Datavetenskap (datalogi) Computer and Information Sciences Data- och informationsvetenskap
28	Deep Learning with Vision-based Technologies for Structural Damage Detection and Health Monitoring Bai, Yongsheng 08 December 2022 (has links) No description available. Civil Engineering Computer Science Mechanics deep learning structural damage classification structural damage detection crack detection spalling detection ResNet U-Net cascaded networks Mask R-CNN structural health monitoring shaking table tests Lucas-Kanade tracker displacement subtraction frequency subtraction progressive collapse LiDAR camera drones.
29	Research of Left Ventricular Segmentation on Two-dimensional Ultrasound Images Based on Different Deep Learning Models : master's thesis Ли, Б., Li, B. January 2024 (has links) В последние годы распространенность сердечно-сосудистых заболеваний, а также уровень смертности растет, что серьезно угрожает здоровью человека, что требует от врачей ранней диагностики сердечно-сосудистых заболеваний, чтобы выиграть время для последующего лечения пациентов, а результаты сегментации ультразвуковых изображений левого желудочка могут помочь врачам в диагностике сердечно-сосудистых заболеваний, но ультразвуковые изображения левого желудочка имеют характеристики сильного шума, слабых границ и сложной структуры ткани, что делает сегментацию изображения сложной, низкой эффективностью и плохой точностью. Одним из важнейших этапов оценки здоровья сердца является отслеживание и сегментация эндокардиальной границы левого желудочка (ЛЖ) с помощью ЭхоКГ, которая используется для измерения фракции выброса и оценки движения региональной стенки. Недостатком этих методов является необходимость применения обработки изображений вручную или в полуавтоматическом режиме, что требует специальных знаний и навыков. В результате вопрос автоматического отслеживания и сегментации ЛЖ на ЭхоКГ-изображениях является актуальной и практической проблемой. В моем проекте изучается способность полностью обученных моделей глубокого обучения U-Net, U-Net++, MANet, LinkNet, FPN, PSPNet, PAN, DeepLabv3 и DeepLabv3+ автоматически определять область левого желудочка. 
В то же время в архитектурах U-Net, U-Net++, MANet, LinkNet, PSPNet, PAN, FPN, DeepLabv3 и DeepLabv3+ модули кодировщика затем последовательно заменялись на ResNet18, ResNet34, ResNet50, ResNet101, EfficientNet-b0, EfficientNet-b1, EfficientNet-b3, EfficientNet-b5, EfficientNet-b7 и MobileNetv2, а ImageNet использовался в качестве весов предварительной подготовки. Добавление магистральных сетей в архитектуру модели приводит к более высокой точности сегментации по сравнению с исходной моделью. В рамках той же архитектуры модели EfficientNet в качестве кодировщика достигает лучших результатов сегментации, а EfficientNet-b3 работает лучше. Аналогично, в рамках серии ResNet ResNet34 работает лучше. В модели сегментации этого эксперимента Deeplabv3+ показывает превосходную производительность. Это указывает на то, что в архитектуре модели этого эксперимента интеграция модулей ResNet34 и EfficientNet-b3 в качестве кодировщиков может эффективно и осуществимо автоматизировать распознавание эндокардиальной границы левого желудочка на ультразвуковых изображениях. Кроме того, аугментация данных также в определенной степени повысит точность сегментации модели. / In recent years, the prevalence of cardiovascular diseases, as well as the mortality rate, has been increasing, seriously threatening human health. This requires doctors to diagnose cardiovascular diseases early to gain time for patients' later treatment. The segmentation results of left ventricular ultrasound images can assist doctors in the diagnosis of cardiovascular diseases, but these images are characterized by strong noise, weak edges and complex tissue structure, which makes image segmentation difficult, inefficient and imprecise. 
One of the most important steps in assessing the health of the heart is the tracking and segmentation of the left ventricular (LV) endocardial border from EchoCG, which is used for measuring the ejection fraction and assessing regional wall motion. The disadvantage of these methods is the need to apply image processing manually or in a semi-automatic mode, which requires special knowledge and skills. As a result, automatic tracking and segmentation of the LV in EchoCG images is a relevant and practical problem. In my project, the ability of the fully trained deep learning models U-Net, U-Net++, MANet, LinkNet, FPN, PSPNet, PAN, DeepLabv3 and DeepLabv3+ to automatically identify the left ventricular region is explored. In the U-Net, U-Net++, MANet, LinkNet, PSPNet, PAN, FPN, DeepLabv3 and DeepLabv3+ architectures, the encoder modules were then sequentially replaced with ResNet18, ResNet34, ResNet50, ResNet101, EfficientNet-b0, EfficientNet-b1, EfficientNet-b3, EfficientNet-b5, EfficientNet-b7 and MobileNetv2, with ImageNet used as the pre-training weights. Adding these backbones to the model architecture leads to higher segmentation accuracy compared to the original model. Within the same model architecture, EfficientNet as the encoder achieves better segmentation results, with EfficientNet-b3 performing the best. Similarly, within the ResNet series, ResNet34 performs better. Among the segmentation models in this experiment, DeepLabv3+ shows superior performance. This indicates that, in the model architectures of this experiment, integrating ResNet34 and EfficientNet-b3 modules as encoders can effectively and feasibly automate the recognition of the endocardial boundary of the left ventricle in ultrasound images. Furthermore, data augmentation also improves the model’s segmentation accuracy to a certain extent. 
MASTER'S THESIS ELECTRORETINOGRAPHY RETINAL DIAGNOSES MACHINE LEARNING FOURIER TRANSFORM SHORT-TIME FOURIER TRANSFORM DEEP LEARNING TIME-DOMAIN ANALYSIS FREQUENCY-DOMAIN ANALYSIS TIME-FREQUENCY ANALYSIS DISEASE DIAGNOSES FULL-FIELD ELECTRORETINOGRAMS SIGNAL CLASSIFICATION FEATURE EXTRACTION FEATURE LEARNING NEURAL NETWORKS SIGNAL ANALYSIS DEEP LEARNING BACKBONE DATA AUGMENTATION U-NET DEEPLABV3 MANET LINKNET EFFICIENTNET FPN PSPNET PAN RESNET MOBILENETV2
30	Towards meaningful and data-efficient learning : exploring GAN losses, improving few-shot benchmarks, and multimodal video captioning Huang, Gabriel 09 1900 (has links) Ces dernières années, le domaine de l’apprentissage profond a connu des progrès énormes dans des applications allant de la génération d’images, détection d’objets, modélisation du langage à la réponse aux questions visuelles. Les approches classiques telles que l’apprentissage supervisé nécessitent de grandes quantités de données étiquetées et spécifiques à la tâches. Cependant, celles-ci sont parfois coûteuses, peu pratiques, ou trop longues à collecter. La modélisation efficace en données, qui comprend des techniques comme l’apprentissage few-shot (à partir de peu d’exemples) et l’apprentissage self-supervised (auto-supervisé), tentent de remédier au manque de données spécifiques à la tâche en exploitant de grandes quantités de données plus “générales”. Les progrès de l’apprentissage profond, et en particulier de l’apprentissage few-shot, s’appuient sur les benchmarks (suites d’évaluation), les métriques d’évaluation et les jeux de données, car ceux-ci sont utilisés pour tester et départager différentes méthodes sur des tâches précises, et identifier l’état de l’art. Cependant, du fait qu’il s’agit de versions idéalisées de la tâche à résoudre, les benchmarks sont rarement équivalents à la tâche originelle, et peuvent avoir plusieurs limitations qui entravent leur rôle de sélection des directions de recherche les plus prometteuses. De plus, la définition de métriques d’évaluation pertinentes peut être difficile, en particulier dans le cas de sorties structurées et en haute dimension, telles que des images, de l’audio, de la parole ou encore du texte. 
Cette thèse discute des limites et des perspectives des benchmarks existants, des fonctions de coût (training losses) et des métriques d’évaluation (evaluation metrics), en mettant l’accent sur la modélisation générative - les Réseaux Antagonistes Génératifs (GANs) en particulier - et la modélisation efficace des données, qui comprend l’apprentissage few-shot et self-supervised. La première contribution est une discussion de la tâche de modélisation générative, suivie d’une exploration des propriétés théoriques et empiriques des fonctions de coût des GANs. La deuxième contribution est une discussion sur la limitation des few-shot classification benchmarks, certains ne nécessitant pas de généralisation à de nouvelles sémantiques de classe pour être résolus, et la proposition d’une méthode de base pour les résoudre sans étiquettes en phase de testing. La troisième contribution est une revue sur les méthodes few-shot et self-supervised de détection d’objets , qui souligne les limites et directions de recherche prometteuses. Enfin, la quatrième contribution est une méthode efficace en données pour la description de vidéo qui exploite des jeux de données texte et vidéo non supervisés. / In recent years, the field of deep learning has seen tremendous progress for applications ranging from image generation, object detection, language modeling, to visual question answering. Classic approaches such as supervised learning require large amounts of task-specific and labeled data, which may be too expensive, time-consuming, or impractical to collect. Data-efficient methods, such as few-shot and self-supervised learning, attempt to deal with the limited availability of task-specific data by leveraging large amounts of general data. Progress in deep learning, and in particular, few-shot learning, is largely driven by the relevant benchmarks, evaluation metrics, and datasets. They are used to test and compare different methods on a given task, and determine the state-of-the-art. 
However, due to being idealized versions of the task to solve, benchmarks are rarely equivalent to the original task, and can have several limitations which hinder their role of identifying the most promising research directions. Moreover, defining meaningful evaluation metrics can be challenging, especially in the case of high-dimensional and structured outputs, such as images, audio, speech, or text. This thesis discusses the limitations and perspectives of existing benchmarks, training losses, and evaluation metrics, with a focus on generative modeling—Generative Adversarial Networks (GANs) in particular—and data-efficient modeling, which includes few-shot and self-supervised learning. The first contribution is a discussion of the generative modeling task, followed by an exploration of theoretical and empirical properties of the GAN loss. The second contribution is a discussion of a limitation of few-shot classification benchmarks, which is that they may not require class semantic generalization to be solved, and the proposal of a baseline method for solving them without test-time labels. The third contribution is a survey of few-shot and self-supervised object detection, which points out the limitations and promising future research for the field. Finally, the fourth contribution is a data-efficient method for video captioning, which leverages unsupervised text and video datasets, and explores several multimodal pretraining strategies. 
self-supervised learning few-shot classification few-shot object detection low-data learning object detection instance segmentation representation learning residual network visual transformer Faster R-CNN DETR parametric adversarial divergence generative adversarial network variational auto-encoder maximum-likelihood structured prediction optimal discriminator mutual information implicit generative model multimodal pretraining dense video captioning cross-attention YouCook2 HowTo-100M Youtube-8M Recipe-1M Pascal VOC MSCOCO LVIS mutual information neural estimation apprentissage auto-supervisé classification few-shot détection d'objets few-shot apprentissage efficace en données segmentation en instances apprentissage de représentation réseau résiduel transformer visual divergences antagonistes paramétriques auto-encodeur variationnel maximum de vraisemblance prédiction structurée discriminateur optimal information mutuelle modèle génératif implicite pré-apprentissage multi-modal description dense de vidéo attention croisée ResNet ViT GAN VAE MINE
