  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Popis fotografií pomocí rekurentních neuronových sítí / Image Captioning with Recurrent Neural Networks

Kvita, Jakub January 2016
This thesis deals with the automatic generation of image descriptions using several types of neural networks. The work builds on papers from the MS COCO Captioning Challenge 2015 and on character-level language models popularized by A. Karpathy. The proposed model combines a convolutional and a recurrent neural network in an encoder-decoder architecture. The vector representing the encoded image is passed to the language model as the memory values of the LSTM layers in the network. The thesis examines how well a model with such a simple architecture can describe images and how it compares with other current models. One of its conclusions is that the proposed architecture is not sufficient for any kind of image captioning.
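To make the conditioning step in this abstract concrete, the following is a minimal sketch of handing an encoded image vector to an LSTM as its initial memory (cell state) before character inputs are fed in. It is not the thesis's implementation: the random weights, the `projection` matrix, the sizes, and the helper names `TinyLSTMCell` and `run_decoder` are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMCell:
    """A single LSTM layer with random (untrained) weights; illustration only."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        # One weight matrix per gate (input, forget, output, candidate),
        # each acting on the concatenation [x; h]. Biases are omitted.
        self.w = {gate: rand_matrix(hidden_size, input_size + hidden_size, rng)
                  for gate in "ifog"}

    def step(self, x, h, c):
        xh = x + h  # list concatenation of input and previous hidden state
        i = [sigmoid(a) for a in matvec(self.w["i"], xh)]
        f = [sigmoid(a) for a in matvec(self.w["f"], xh)]
        o = [sigmoid(a) for a in matvec(self.w["o"], xh)]
        g = [math.tanh(a) for a in matvec(self.w["g"], xh)]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


HIDDEN = 4
rng = random.Random(1)
# Hypothetical projection from an 8-dimensional image encoding to the LSTM size.
projection = rand_matrix(HIDDEN, 8, rng)
cell = TinyLSTMCell(input_size=3, hidden_size=HIDDEN)


def run_decoder(image_vec, chars):
    """Use the projected image vector as the initial cell state, then read characters."""
    c = [math.tanh(a) for a in matvec(projection, image_vec)]
    h = [0.0] * HIDDEN
    for ch in chars:
        h, c = cell.step(ch, h, c)
    return h, c


image_vec = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # stand-in for CNN features
chars = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]              # two one-hot characters
h_img, c_img = run_decoder(image_vec, chars)
h_blank, c_blank = run_decoder([0.0] * 8, chars)        # same text, no image signal
```

Comparing `h_img` with `h_blank` shows that the same character sequence leads to a different hidden state once the image has been written into the memory, which is the only channel through which the image influences the generated text in this kind of design.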
22

Learning Embeddings for Fashion Images

Hermansson, Simon January 2023
Today the process of sorting second-hand clothes and textiles is mostly manual. In this master’s thesis, methods for automating this process as well as improving the manual sorting process have been investigated. The methods explored include the automatic prediction of price and intended usage for second-hand clothes, as well as different types of image retrieval to aid manual sorting. Two models were examined: CLIP, a multi-modal model, and MAE, a self-supervised model. Quantitatively, the results favored CLIP, which outperformed MAE in both image retrieval and prediction. However, MAE may still be useful for some applications in terms of image retrieval as it returns items that look similar, even if they do not necessarily have the same attributes. In contrast, CLIP is better at accurately retrieving garments with as many matching attributes as possible. For price prediction, the best model was CLIP. When fine-tuned on the dataset used, CLIP achieved an F1-Score of 38.08 using three different price categories in the dataset. For predicting the intended usage (either reusing the garment or exporting it to another country) the best model managed to achieve an F1-Score of 59.04.
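The embedding-based image retrieval mentioned in this abstract can be sketched as a nearest-neighbour search under cosine similarity. The example below is an illustration only: the three-dimensional vectors, the item names, and the `retrieve` helper are invented for the sketch, whereas in the thesis the embeddings would come from a model such as CLIP or MAE.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, gallery, k=2):
    """Return the names of the k gallery items most similar to the query."""
    ranked = sorted(gallery.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy embeddings (hypothetical); real ones would have hundreds of dimensions.
gallery = {
    "red_dress": [0.9, 0.1, 0.0],
    "red_skirt": [0.8, 0.3, 0.1],
    "blue_jeans": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
top = retrieve(query, gallery, k=2)
print(top)  # ['red_dress', 'red_skirt']
```

Whether the nearest neighbours share attributes or merely look alike depends on the embedding model, which is the distinction the abstract draws between CLIP and MAE.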
23

Parameter-efficient modeling and robust automatic evaluation of image captioning

Ahmadi, Saba 10 1900
Image captioning is the artificial intelligence (AI) task of describing images in natural language. This task has several useful societal applications, such as accessibility for the visually impaired, automated content generation, human-robot interaction, and medical imaging analysis. Over the last eight years, image captioning research has seen tremendous progress in building strong models, collecting large-scale datasets, and developing automatic evaluation metrics. Despite this progress, image captioning research faces two major challenges: 1) how to build parameter-efficient models, and 2) how to build robust automatic evaluation metrics. In this thesis, we make contributions towards tackling each of these challenges. First, we propose a parameter-efficient method (MAPL) that adapts pre-trained unimodal vision-only and language-only models to the multimodal task of image captioning. MAPL learns a lightweight mapping between the representation spaces of the unimodal models, and can therefore leverage the strong generalization capabilities of the pre-trained unimodal models for multimodal tasks such as image captioning. Second, we present a systematic study of the robustness of recently proposed image captioning evaluation metrics. Even though these metrics correlate well with human judgments, we found that they are not robust in identifying fine-grained errors in model-generated captions, so caution is needed when using them for image captioning evaluation. We hope our findings will guide further improvements in the automatic evaluation of image captioning.
24

Deep Understanding of Technical Documents: Automated Generation of Pseudocode from Digital Diagrams & Analysis/Synthesis of Mathematical Formulas

Gkorgkolis, Nikolaos January 2022
No description available.
25

同時的な独話音声要約に基づくリアルタイム字幕生成 / Real-Time Caption Generation Based on Simultaneous Summarization of Monologue Speech

大野, 誠寛, 松原, 茂樹, 柏岡, 秀紀, 稲垣, 康善 07 1900
Notice for the use of this material: The copyright of this material is retained by the Information Processing Society of Japan (IPSJ). This material is published on this web site with the agreement of the author(s) and the IPSJ. Please comply with the Copyright Law of Japan and the Code of Ethics of the IPSJ if you wish to reproduce, make derivative works of, distribute, or make available to the public any part or the whole of this material. All Rights Reserved, Copyright (C) Information Processing Society of Japan. Comments are welcome; please mail them to editj<at>ipsj.or.jp.
26

Learning visual representations with neural networks for video captioning and image generation

Yao, Li 12 1900
No description available.
27

Medical image captioning based on Deep Architectures / Medicinsk bild textning baserad på Djupa arkitekturer

Moschovis, Georgios January 2022
Diagnostic Captioning is described as “the automatic generation of a diagnostic text from a set of medical images of a patient collected during an examination” [59]. It can help inexperienced doctors and radiologists reduce clinical errors and help experienced professionals increase their productivity. Tools that help medical doctors produce higher-quality reports in less time could therefore be of great interest to medical imaging departments and could significantly impact deep learning research in the biomedical domain, which makes the topic particularly interesting for industry practitioners and researchers alike. In this work, I attempted to develop Diagnostic Captioning systems based on novel Deep Learning approaches, to investigate to what extent Neural Networks are capable of performing medical image tagging and of automatically generating a diagnostic text from a set of medical images. Towards this objective, the first step is concept detection, which boils down to predicting the relevant tags for X-ray images, whereas the ultimate goal is caption generation. To this end, I participated in the ImageCLEFmedical 2022 evaluation campaign, addressing both the concept detection and the caption prediction tasks by developing baselines based on Deep Neural Networks, including image encoders, classifiers, and text generators, in order to obtain a quantitative measure of the performance of my proposed architectures [28]. My contribution to the evaluation campaign, as part of this work and on behalf of the NeuralDynamicsLab group at KTH Royal Institute of Technology, within the School of Electrical Engineering and Computer Science, ranked 4th in the former and 5th in the latter task [55, 68] among 12 groups included within the top-10 best-performing submissions in both tasks.
28

數位電視平台與弱勢團體媒體近用:以公共電視台服務聽障社群為例 / Digital TV platform and the right of media access of underprivileged group: Take PTS service for hearing impaired community as example

陳慧汶 Unknown Date
The main purpose of this study is to examine how the right of media access is put into practice in other countries, in order to provide a reference for Taiwan’s Public Service Broadcasting (PSB) and to advance the communication rights of the hearing-impaired community. Captions and sign language are the most important tools for hearing-impaired people to obtain all kinds of information and to meet their need for access services. Under digital convergence, both can be provided in a “closed” (hidden) form, which benefits hearing-impaired and hearing viewers alike and lowers the related technical costs for the broadcasting industry. Most developed countries build their access service policies around their national PSB, on the view that digital TV technology can help balance the demand for and supply of assistive applications for hearing-impaired people. As the provision of access services becomes more active and optimistic, digital inclusion comes closer.
The study shows that the United Kingdom and the European Union hold that hearing-impaired people should be helped to take part in an e-Inclusion society, and that enhancing their communication rights affirms their equal status as citizens. Regulation, practical implementation, and technical development have all received attention in the era of the digital TV platform. Taiwan’s Public Television Service (PTS) takes the BBC as its benchmark and hopes to reach the BBC’s standard of access services once sufficient resources are available. However, for a number of intertwined reasons, PTS’s access services for hearing-impaired viewers still operate as they did in the analog TV era. Although circumstances have not changed much so far, the study found that several new plans are possible for PTS’s access services in the digital era: PTS will continue to produce its two sign-presented programs and plans to extend sign language service to sports programs; for captions, PTS is working on facial-expression captions and real-time captions; and the HiHD digital channel is scheduled to introduce a closed-caption function in 2011. On the Internet platform, PTS focuses on program promotion and related information for its TV channels. Barrier-free web space and an access group are considered necessary for strengthening hearing-impaired people’s Internet access rights, but no resources have yet been planned or invested. Much remains to be done if this public obligation is to be fulfilled.
29

Towards meaningful and data-efficient learning : exploring GAN losses, improving few-shot benchmarks, and multimodal video captioning

Huang, Gabriel 09 1900
In recent years, the field of deep learning has seen tremendous progress in applications ranging from image generation, object detection, and language modeling to visual question answering. Classic approaches such as supervised learning require large amounts of labeled, task-specific data, which may be too expensive, time-consuming, or impractical to collect. Data-efficient methods, such as few-shot and self-supervised learning, attempt to deal with the limited availability of task-specific data by leveraging large amounts of general data. Progress in deep learning, and in few-shot learning in particular, is largely driven by the relevant benchmarks, evaluation metrics, and datasets. They are used to test and compare different methods on a given task and to determine the state of the art. However, because they are idealized versions of the task to solve, benchmarks are rarely equivalent to the original task and can have several limitations that hinder their role in identifying the most promising research directions.
Moreover, defining meaningful evaluation metrics can be challenging, especially in the case of high-dimensional and structured outputs such as images, audio, speech, or text. This thesis discusses the limitations and perspectives of existing benchmarks, training losses, and evaluation metrics, with a focus on generative modeling, Generative Adversarial Networks (GANs) in particular, and on data-efficient modeling, which includes few-shot and self-supervised learning. The first contribution is a discussion of the generative modeling task, followed by an exploration of theoretical and empirical properties of the GAN loss. The second contribution is a discussion of a limitation of few-shot classification benchmarks, namely that they may not require class semantic generalization to be solved, and the proposal of a baseline method for solving them without test-time labels. The third contribution is a survey of few-shot and self-supervised object detection, which points out the limitations of the field and promising directions for future research. Finally, the fourth contribution is a data-efficient method for video captioning, which leverages unsupervised text and video datasets and explores several multimodal pretraining strategies.
