Cross-modal image text retrieval based on cross-optimization and joint strategy

Wei Zheng; Na Han; Yan Hu; Peipei Kang; Yue Zhang; Aqing Yang; Lei Zhang

doi:10.1117/12.3034096

13 June 2024

Proceedings Volume 13180, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024); 131802I (2024) https://doi.org/10.1117/12.3034096
Event: International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), 2024, Guangzhou, China

Abstract

Cross-modal retrieval is an important task in computer vision and natural language processing. Existing multi-granularity alignment methods have made significant progress by globally aligning images with sentences or locally aligning regions with words. However, most of these methods rely on attention mechanisms, which can be affected by noise and cannot balance the differences between modalities, resulting in suboptimal image-text alignment. In addition, training with attention mechanisms consumes substantial computational resources, and the retrieval process is time-consuming. To address these challenges, we develop a novel multi-granularity image-text alignment model, which we call DFVLM. First, we train the intra-modal encoder and the cross-modal encoder independently to reduce interference between the encoders and to better learn intra-modal and inter-modal information. Then, we propose a joint strategy that combines the intra-modal and inter-modal information to better capture the relationship between image and text. In addition, we introduce a hard negative pair-based method to train the intra-modal multi-granularity encoder without using attention mechanisms. Unlike existing methods, we treat image and text retrieval as a two-way process and delve into the intrinsic connections between the two matrices. We propose a cross-validation method that optimizes the retrieval result based on common image-to-text alignment scenarios in daily life: during image-to-text retrieval, a text-to-image reverse verification is performed on the retrieved text. Extensive qualitative experiments and analysis show that our approach performs well on both the Flickr30K and MSCOCO datasets and significantly reduces the time required for testing.
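The reverse verification step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the forward and reverse results are combined, so the re-ranking rule below (order the top-k texts by how highly each text ranks the query image in the text-to-image direction) is an assumption, not the authors' method. The function names `top_k`, `reverse_rank`, and `retrieve_with_reverse_check`, the similarity matrix `sim`, and its scores are all hypothetical.

```python
def top_k(scores, k):
    """Return indices of the k largest scores, best first (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def reverse_rank(sim, image_idx, text_idx):
    """Rank (0 = best) of image_idx when text_idx is used as a text-to-image query."""
    target = sim[image_idx][text_idx]
    return sum(1 for row in sim if row[text_idx] > target)


def retrieve_with_reverse_check(sim, image_idx, k=3):
    """Image-to-text retrieval followed by a text-to-image verification pass.

    sim[i][j] is a hypothetical similarity score between image i and text j.
    The combination rule is an assumption made for illustration only.
    """
    candidates = top_k(sim[image_idx], k)
    # Prefer texts that would themselves retrieve the query image highly;
    # break ties by the original forward similarity.
    return sorted(
        candidates,
        key=lambda j: (reverse_rank(sim, image_idx, j), -sim[image_idx][j]),
    )


# Toy scores: 3 images (rows) x 4 texts (columns), made up for illustration.
sim = [
    [0.90, 0.80, 0.10, 0.20],
    [0.95, 0.30, 0.70, 0.10],
    [0.20, 0.10, 0.60, 0.90],
]
forward_only = top_k(sim[0], 2)                      # ranking by forward score alone
verified = retrieve_with_reverse_check(sim, 0, k=2)  # ranking after the reverse check
print(forward_only, verified)
```

In this toy case text 0 scores highest for image 0, but it matches image 1 even better, so the reverse check moves text 1 ahead of it. Because only the top-k candidates are re-examined, the extra cost per query is k column scans of the similarity matrix.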

(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.

Wei Zheng, Na Han, Yan Hu, Peipei Kang, Yue Zhang, Aqing Yang, and Lei Zhang "Cross-modal image text retrieval based on cross-optimization and joint strategy", Proc. SPIE 13180, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), 131802I (13 June 2024); https://doi.org/10.1117/12.3034096

PROCEEDINGS
11 PAGES

KEYWORDS

Visualization

Image retrieval

Education and training

Image processing

Matrices

Object detection

Data modeling
