Global ETD Search

Image Captioning On General Data And Fashion Data : An Attribute-Image-Combined Attention-Based Network for Image Captioning on Multi-Object Images and Single-Object Images / Bildtexter på allmänna data och modedata : Ett attribut-bild-kombinerat uppmärksamhetsbaserat nätverk för bildtextning på Multi-objekt-bilder och en-objekt-bilder

Image captioning is an important field spanning computer vision and natural language processing. It can be applied broadly to high-volume web images, for example to convey image content to visually impaired users. Many methods have been adopted in this area, such as attention-based methods and semantic-concept-based models. These achieve excellent performance on general image datasets such as the MS COCO dataset. However, image captioning on single-object images remains unexplored.

In this paper, we propose a new attribute-information-combined attention-based network (AIC-AB Net). At each time step, attribute information is added as a supplement to the visual information. For sequential word generation, spatial attention determines which specific regions of the image to pass to the decoder. The sentinel gate decides whether to attend to the image or to the visual sentinel (what the decoder already knows, including the attribute information). Text attribute information is fed in synchronously to help image recognition and reduce uncertainty.

We build a new fashion dataset consisting of fashion images to establish a benchmark for single-object images. This fashion dataset consists of 144,422 images from 24,649 fashion products, with one description sentence for each image. Our method is tested on the MS COCO dataset and the proposed Fashion dataset. The results show the superior performance of the proposed model on both multi-object images and single-object images. Our AIC-AB Net outperforms the state-of-the-art Adaptive Attention Network by 0.017, 0.095, and 0.095 (CIDEr score) on the COCO dataset, the Fashion dataset (Bestsellers), and the Fashion dataset (all vendors), respectively. The results also show that the attention architecture and the attribute information complement each other.
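The sentinel-gated attention step mentioned in the abstract can be illustrated with a minimal sketch. It follows the general adaptive-attention formulation (spatial attention over image regions plus one extra slot for the visual sentinel) and is not the thesis implementation. The function names, the weight matrices `W_v`, `W_g`, `W_s`, the vector `w_h`, and the toy inputs are all hypothetical, and the way AIC-AB Net injects attribute information into the sentinel is not modeled here.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def score(feature, h_proj, W, w_h):
    # Attention score: w_h^T tanh(W feature + W_g h).
    proj = matvec(W, feature)
    return sum(w * math.tanh(p + q) for w, p, q in zip(w_h, proj, h_proj))

def adaptive_attention(regions, h, sentinel, W_v, W_g, W_s, w_h):
    """Return (context, alpha, beta) for one decoding step.

    regions:  list of k region feature vectors (spatial image features)
    h:        decoder hidden state
    sentinel: visual sentinel vector (assumed here to summarize what the
              decoder already knows)
    """
    h_proj = matvec(W_g, h)
    scores = [score(v, h_proj, W_v, w_h) for v in regions]
    scores.append(score(sentinel, h_proj, W_s, w_h))  # extra slot for the sentinel
    weights = softmax(scores)
    alpha, beta = weights[:-1], weights[-1]  # beta is the sentinel gate
    # Context mixes attended image regions with the sentinel.
    context = [
        beta * sentinel[j] + sum(a * v[j] for a, v in zip(alpha, regions))
        for j in range(len(sentinel))
    ]
    return context, alpha, beta

# Toy usage with two regions and identity weights (illustrative values only).
regions = [[1.0, 0.0], [0.0, 1.0]]
h = [0.5, -0.5]
sentinel = [0.2, 0.2]
I = [[1.0, 0.0], [0.0, 1.0]]
context, alpha, beta = adaptive_attention(regions, h, sentinel, I, I, I, [1.0, 1.0])
```

Because the sentinel shares one softmax with the image regions, a large `beta` means the decoder relies on what it already knows, while a small `beta` means it looks at the image.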

http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-282925

attention-based

text attribute

Computer and Information Sciences

Identifier	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-282925
Date	January 2020
Creators	Tu, Guoyun
Publisher	KTH, Skolan för elektroteknik och datavetenskap (EECS)
Source Sets	DiVA Archive at Uppsala University
Language	English
Detected Language	Swedish
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	TRITA-EECS-EX ; 2020:691
