Image captioning is a popular research direction at the intersection of computer vision and natural language processing. Most existing image captioning methods adopt an encoder–decoder structure in which the image is encoded and fed to a decoder that generates a description of the image content. Although existing methods achieve strong results on natural images, there is still considerable room for improvement in describing fine details. We propose the semantic space captioner model, which introduces the concept of dense captioning into image captioning and uses contrastive language–image pretraining (CLIP) as the encoder for both text and images. Dense captions are generated for image regions and used as an extra semantic space during decoding to enhance the final caption. Experimental results show that our model outperforms existing methods in capturing image details and generates diverse and meaningful captions, while also achieving strong scores on the standard MSCOCO metrics.
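The abstract's core idea (a global CLIP image embedding augmented with embeddings of region-level dense captions, forming a semantic space the decoder attends over) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the encoder functions are stand-ins that return random vectors in place of real CLIP outputs, and the 512-dimensional size is an assumption matching common CLIP variants.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 512  # assumed; matches e.g. CLIP ViT-B/32 output size


def encode_image(image) -> np.ndarray:
    # Stand-in for the CLIP image encoder (one global embedding per image).
    return rng.standard_normal(EMBED_DIM)


def encode_text(caption: str) -> np.ndarray:
    # Stand-in for the CLIP text encoder (one embedding per dense caption).
    return rng.standard_normal(EMBED_DIM)


def build_semantic_space(image, dense_captions):
    """Stack the global image embedding with the embeddings of the
    region-level dense captions; a decoder would attend over these rows
    when generating the final caption."""
    rows = [encode_image(image)]
    rows += [encode_text(c) for c in dense_captions]
    return np.stack(rows)  # shape: (1 + num_regions, EMBED_DIM)


# Hypothetical dense captions for three detected image regions.
dense = ["a dog on the grass", "a red frisbee", "trees in the background"]
space = build_semantic_space(None, dense)
print(space.shape)  # (4, 512)
```

In the real model the decoder would consume this matrix via cross-attention; here the point is only how region-level captions extend the semantic space beyond a single global image embedding.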