Cross-modal image text retrieval based on cross-optimization and joint strategy
13 June 2024
Na Han, Wei Zheng, Yan Hu, Peipei Kang, Yue Zhang, Aqing Yang, Lei Zhang
Proceedings Volume 13180, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024); 131802I (2024) https://doi.org/10.1117/12.3034096
Event: International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), 2024, Guangzhou, China
Abstract
Cross-modal retrieval is an important task at the intersection of computer vision and natural language processing. Existing multi-granularity alignment methods have made significant progress by globally aligning images with sentences or locally aligning regions with words. However, most of these methods rely on attention mechanisms, which may be affected by noise and cannot balance the differences between modalities, resulting in suboptimal image-text alignment. In addition, training with attention mechanisms consumes substantial computational resources, and the retrieval process is time-consuming. To address these challenges, we develop a novel multi-granularity image-text alignment model, which we call DFVLM. First, we train the intra-modal encoder and the cross-modal encoder independently to reduce interference between the encoders and to better learn intra-modal and inter-modal information. Then, we propose a joint strategy that combines the intra-modal and inter-modal information to better capture the relationship between image and text. In addition, we introduce a hard negative pair-based method to train the intra-modal multi-granularity encoder without using attention mechanisms. Unlike existing methods, we treat image and text retrieval as a two-way process and examine the intrinsic connections between the two matrices. Motivated by common image-to-text alignment scenarios in daily life, we propose a cross-validation method to optimize the retrieval result: during image-to-text retrieval, a text-to-image reverse verification is performed on each retrieved text. Extensive qualitative experiments and analysis show that our approach performs well on both the Flickr30K and MSCOCO datasets and significantly reduces the time required for testing.
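The reverse-verification idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything below is an assumption made for illustration: the similarity matrix `sim`, the helper names `rank_of` and `rerank_image_to_text`, the candidate count `k`, and the rule of ordering candidates by their reverse (text-to-image) rank. It is not presented as the DFVLM implementation.

```python
from typing import List

def rank_of(scores: List[float], index: int) -> int:
    """0-based rank of scores[index] when scores are sorted in descending order."""
    target = scores[index]
    return sum(1 for s in scores if s > target)

def rerank_image_to_text(sim: List[List[float]], image_idx: int, k: int = 3) -> List[int]:
    """Re-rank the top-k texts for one image using text-to-image verification.

    sim[i][j] is a (hypothetical) similarity score between image i and text j.
    """
    row = sim[image_idx]
    # Forward step: the k texts most similar to the query image.
    candidates = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]

    def verification_key(j: int):
        # Reverse step: how highly does text j rank the query image
        # among all images? A rank of 0 means the query image is its best match.
        column = [sim[i][j] for i in range(len(sim))]
        # Ties in reverse rank are broken by the forward score.
        return (rank_of(column, image_idx), -row[j])

    return sorted(candidates, key=verification_key)

# Toy scores: rows are images, columns are texts.
sim = [
    [0.80, 0.75, 0.10],  # image 0
    [0.90, 0.20, 0.30],  # image 1
    [0.10, 0.30, 0.70],  # image 2
]

print(rerank_image_to_text(sim, image_idx=0, k=2))  # prints [1, 0]
```

In this toy example, forward retrieval alone would return text 0 first for image 0. Text 0, however, matches image 1 more strongly than image 0, whereas text 1 ranks image 0 first, so the reverse check promotes text 1.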
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Na Han, Wei Zheng, Yan Hu, Peipei Kang, Yue Zhang, Aqing Yang, and Lei Zhang "Cross-modal image text retrieval based on cross-optimization and joint strategy", Proc. SPIE 13180, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), 131802I (13 June 2024); https://doi.org/10.1117/12.3034096
KEYWORDS
Visualization

Image retrieval

Education and training

Image processing

Matrices

Object detection

Data modeling
