Abstract
The rapid advancement of Internet technology, driven by social media and e-commerce platforms, has facilitated the generation and sharing of multimodal data, leading to increased interest in efficient cross-modal retrieval systems. Cross-modal image-text retrieval, encompassing tasks such as image query text (IqT) retrieval and text query image (TqI) retrieval, plays a crucial role in semantic searches across modalities. This paper presents a comprehensive survey of cross-modal image-text retrieval, addressing the limitations of previous studies that focused on single perspectives such as subspace learning or deep learning models. We categorize existing models into single-tower, dual-tower, real-value representation, and binary representation models based on their structure and feature representation. A key focus is placed on the fusion of modalities to enhance retrieval performance across diverse data types. Additionally, we explore the impact of multimodal Large Language Models (MLLMs) on cross-modal fusion and retrieval. Our study also provides a detailed overview of common datasets, evaluation metrics, and performance comparisons of representative methods. Finally, we identify current challenges and propose future research directions to advance the field of cross-modal image-text retrieval.
Data Availability Statement
Not applicable.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant U22A2025, Grant 62072088, Grant 62232007, Grant U23A20309, and Grant 61991404; in part by the Liaoning Provincial Science and Technology Plan Project - Key R&D Department of Science and Technology under Grant 2023JH2/101300182; in part by the 111 Project under Grant B16009.
Conflicts of Interest
The authors declare no conflicts of interest.
Ethical Approval and Consent to Participate
Not applicable.
Cite This Article
APA Style
Li, T., Kong, L., Yang, X., Wang, B., & Xu, J. (2024). Bridging Modalities: A Survey of Cross-Modal Image-Text Retrieval. Chinese Journal of Information Fusion, 1(1), 79–92. https://doi.org/10.62762/CJIF.2024.361895
Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and Permissions

Copyright © 2024 by the Author(s). Published by Institute of Central Computation and Knowledge. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (
https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.