<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with MathML3 v1.1d2 20140930//EN" "JATS-journalpublishing1-mathml3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="1.1d2" xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="nlm-ta">CJIF</journal-id>
      <journal-id journal-id-type="publisher-id">ICCK</journal-id>
      <journal-title-group>
        <journal-title>Chinese Journal of Information Fusion</journal-title>
      </journal-title-group>
      <issn pub-type="ppub" publication-format="print">2998-3363</issn>
      <issn pub-type="epub" publication-format="electronic">2998-3371</issn>
      <publisher>
        <publisher-name>Institute of Central Computation and Knowledge Inc</publisher-name>
        <publisher-loc>522 W RIVERSIDE AVE STE N, SPOKANE, WA, 99201, UNITED STATES</publisher-loc>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.62762/CJIF.2024.361895</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Review Article</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Bridging Modalities: A Survey of Cross-Modal Image-Text Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0009-0002-0194-0590</contrib-id>
          <name>
            <surname>Li</surname>
            <given-names>Tieying</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-3657-8926</contrib-id>
          <name>
            <surname>Kong</surname>
            <given-names>Lingdu</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-6184-4771</contrib-id>
          <name>
            <surname>Yang</surname>
            <given-names>Xiaochun</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-2694-1023</contrib-id>
          <name>
            <surname>Wang</surname>
            <given-names>Bin</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-2498-5812</contrib-id>
          <name>
            <surname>Xu</surname>
            <given-names>Jiaxing</given-names>
          </name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff1"><label>1</label>School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China</aff>
        <aff id="aff2"><label>2</label>National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110819, China</aff>
        <aff id="aff3"><label>3</label>Key Laboratory of Data Analytics and Optimization for Smart Industry (Northeastern University), Ministry of Education, China</aff>
        <aff id="aff4"><label>4</label>Software College, Northeastern University, Shenyang 110169, China</aff>
        <aff id="aff5"><label>5</label>School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore</aff>
      </contrib-group>
      <author-notes>
        <corresp id="cor3">Corresponding Author: Xiaochun Yang. Email: <email>yangxc@mail.neu.edu.cn</email></corresp>
      </author-notes>
      <pub-date date-type="pub" pub-type="epub" publication-format="online">
        <day>12</day>
        <month>6</month>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <issue>1</issue>
      <fpage>79</fpage>
      <lpage>92</lpage>
      <history>
        <date date-type="received">
          <day>03</day>
          <month>4</month>
          <year>2024</year>
        </date>
        <date date-type="accepted">
          <day>08</day>
          <month>6</month>
          <year>2024</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>© 2024 by the Authors. Published by Institute of Central Computation and Knowledge. This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/).</copyright-statement>
        <copyright-year>2024</copyright-year>
        <copyright-holder>The Authors</copyright-holder>
        <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
        </license>
      </permissions>
      <self-uri xlink:href="https://www.icck.org/article/abs/cjif.2024.361895">This article is available from https://www.icck.org/article/abs/cjif.2024.361895</self-uri>
      <abstract>
        <p>The rapid advancement of Internet technology, driven by social media and e-commerce platforms, has facilitated the generation and sharing of multimodal data, leading to increased interest in efficient cross-modal retrieval systems. Cross-modal image-text retrieval, encompassing tasks such as image query text (IqT) retrieval and text query image (TqI) retrieval, plays a crucial role in semantic searches across modalities. This paper presents a comprehensive survey of cross-modal image-text retrieval, addressing the limitations of previous studies that focused on single perspectives such as subspace learning or deep learning models. We categorize existing models into single-tower, dual-tower, real-value representation, and binary representation models based on their structure and feature representation. A key focus is placed on the fusion of modalities to enhance retrieval performance across diverse data types. Additionally, we explore the impact of multimodal Large Language Models (MLLMs) on cross-modal fusion and retrieval. Our study also provides a detailed overview of common datasets, evaluation metrics, and performance comparisons of representative methods. Finally, we identify current challenges and propose future research directions to advance the field of cross-modal image-text retrieval.</p>
      </abstract>
      <kwd-group kwd-group-type="author" xml:lang="en">
        <kwd>multi-modal data</kwd>
        <kwd>cross-modal retrieval</kwd>
        <kwd>cross-modal alignment</kwd>
        <kwd>cross-modal fusion</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="S1">
      <label>1.</label>
      <title>Introduction</title>
      <p id="S1.p1">The advent of Internet technology, driven by social media and e-commerce platforms, offers a convenient way to generate and share multimodal data. Efficient and accurate retrieval of relevant information from vast multimodal data has garnered increased interest from researchers due to its extensive real-world applications. Cross-modal image-text retrieval enables semantic search of instances in one modality (e.g., image) based on queries from another modality (e.g., text). Cross-modal image-text retrieval typically includes two main tasks: image query text (IqT) retrieval and text query image (TqI) retrieval. The formal definition is as follows:</p>
      <p id="S1.p2">The multimodal training set, denoted as <inline-formula><mml:math alttext="O=\{o_{i}\}_{i=1}^{n}" display="inline"><mml:mrow><mml:mi>O</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msub><mml:mi>o</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">}</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>, consists of <inline-formula><mml:math alttext="n" display="inline"><mml:mi>n</mml:mi></mml:math></inline-formula> instances. Each instance <inline-formula><mml:math alttext="o_{i}=(v_{i},t_{i},y_{i})" display="inline"><mml:mrow><mml:msub><mml:mi>o</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> comprises an original image sample <inline-formula><mml:math alttext="v_{i}" display="inline"><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, a text sample <inline-formula><mml:math alttext="t_{i}" display="inline"><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, and a label annotation vector <inline-formula><mml:math alttext="y_{i}=[y_{i1},\ldots,y_{iC}]" display="inline"><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi mathvariant="normal">…</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math alttext="C" display="inline"><mml:mi>C</mml:mi></mml:math></inline-formula> is the number of classes. Each annotation <inline-formula><mml:math alttext="y_{iz}" display="inline"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> equals <inline-formula><mml:math alttext="1" display="inline"><mml:mn>1</mml:mn></mml:math></inline-formula> if the instance <inline-formula><mml:math alttext="o_{i}" display="inline"><mml:msub><mml:mi>o</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> belongs to the <inline-formula><mml:math alttext="z" display="inline"><mml:mi>z</mml:mi></mml:math></inline-formula>-th class, and <inline-formula><mml:math alttext="y_{iz}" display="inline"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> equals <inline-formula><mml:math alttext="0" display="inline"><mml:mn>0</mml:mn></mml:math></inline-formula> otherwise (<inline-formula><mml:math alttext="1\leq z\leq C" display="inline"><mml:mrow><mml:mn>1</mml:mn><mml:mo>≤</mml:mo><mml:mi>z</mml:mi><mml:mo>≤</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:math></inline-formula>). The testing set <inline-formula><mml:math alttext="Q=\{q_{i}\}_{i=1}^{m}" display="inline"><mml:mrow><mml:mi>Q</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msub><mml:mi>q</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">}</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> consists of <inline-formula><mml:math alttext="m" display="inline"><mml:mi>m</mml:mi></mml:math></inline-formula> query instances, where <inline-formula><mml:math alttext="q_{i}=(v_{i},t_{i})" display="inline"><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>. For each query sample <inline-formula><mml:math alttext="v_{i}" display="inline"><mml:msub><mml:mi>v</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> or <inline-formula><mml:math alttext="t_{i}" display="inline"><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, samples of the other modality that are semantically relevant should be returned.</p>
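      <p>The two retrieval tasks defined above can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the formal definition: images and texts are represented by short stand-in feature vectors that already live in a shared space, relevance is scored with a plain dot product, and the names <monospace>Instance</monospace>, <monospace>top_k</monospace>, <monospace>image_query_text</monospace>, and <monospace>text_query_image</monospace> are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Instance:
    """One instance o_i = (v_i, t_i, y_i) of the training set O."""
    image: Sequence[float]   # v_i: stand-in image feature vector (assumption)
    text: Sequence[float]    # t_i: stand-in text feature vector (assumption)
    labels: Sequence[int]    # y_i: multi-hot label vector of length C


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity; assumes both vectors share one space."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query: Sequence[float], candidates, k: int) -> List[int]:
    """Return indices of the k candidates most similar to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda j: dot(query, candidates[j]),
                   reverse=True)
    return order[:k]


def image_query_text(query_image, database, k=1) -> List[int]:
    """IqT: an image query ranks the text samples of the database."""
    return top_k(query_image, [o.text for o in database], k)


def text_query_image(query_text, database, k=1) -> List[int]:
    """TqI: a text query ranks the image samples of the database."""
    return top_k(query_text, [o.image for o in database], k)


# Toy database with n = 3 instances and C = 2 classes (made-up values).
database = [
    Instance(image=[1.0, 0.0], text=[0.9, 0.1], labels=[1, 0]),
    Instance(image=[0.0, 1.0], text=[0.0, 1.0], labels=[0, 1]),
    Instance(image=[0.6, 0.4], text=[0.5, 0.5], labels=[1, 1]),
]

print(image_query_text([1.0, 0.0], database, k=2))
print(text_query_image([0.0, 1.0], database, k=1))
```
]]></preformat>
      <p>In a real system the stand-in vectors would be produced by learned encoders, and the label vectors would typically be used during training or evaluation rather than at query time.</p>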
      <p id="S1.p3">Deep learning-based cross-modal image-text retrieval has achieved great success due to deep models that can effectively extract semantic information from visual and language data of different modalities.</p>
      <p id="S1.p4">Furthermore, with the success of large language models (LLMs) like ChatGPT, multimodal Large Language Models (MLLMs) have emerged, drawing more attention from researchers. Several previous efforts have surveyed cross-modal image-text retrieval. However, existing surveys often classify cross-modal retrieval models from only a single perspective (e.g., subspace learning models or deep learning models), which yields an incomplete picture of the field. Moreover, the cross-modal retrieval capabilities of the latest multimodal large language models have not yet been analyzed. Motivated by these gaps, we present a more comprehensive and up-to-date survey of cross-modal image-text retrieval in this paper.</p>
      <p>
        <fig id="F1">
          <label>Figure 1.</label>
          <caption>
            <p>Illustration of the classification of cross-modal retrieval models from two perspectives.</p>
          </caption>
          <graphic xlink:href="figures/fig_class.pdf"/>
        </fig>
      </p>
      <p id="S1.p5">The two most critical factors influencing cross-modal image-text retrieval systems are model structure and feature representation. We classify existing models based on these two key aspects to provide a more thorough analysis of cross-modal image-text retrieval. Figure 1 illustrates our classification of cross-modal retrieval models.</p>
      <p>
        <list list-type="bullet" id="S1.I1">
          <list-item id="S1.I1.i1">
            <p id="S1.I1.i1.p1">Single-tower models, also known as single-stream models, utilize a unified architecture to process both modalities simultaneously. These models integrate the modalities early, aiming to learn joint representations directly. They are beneficial for capturing complex interactions but may face scalability and fusion challenges.</p>
          </list-item>
          <list-item id="S1.I1.i2">
            <p id="S1.I1.i2.p1">Dual-tower models, also known as two-stream models, use separate architectures (towers) for each modality. These models process each modality separately, allowing for specialized processing and scalability. However, they must ensure compatibility between the independently learned representations for effective retrieval.</p>
          </list-item>
          <list-item id="S1.I1.i3">
            <p id="S1.I1.i3.p1">Real-value representation models involve encoding data into continuous vectors in a high-dimensional space. These vectors typically consist of floating-point numbers. These models are suitable for capturing detailed and complex relationships. However, they incur high computational and storage costs, making them less ideal for large-scale data applications.</p>
          </list-item>
          <list-item id="S1.I1.i4">
            <p id="S1.I1.i4.p1">Binary representation models encode data into compact, fixed-length binary codes (e.g., hash vectors of bits). These models offer efficient storage and fast retrieval, making them well-suited for large-scale databases. However, they may sacrifice some accuracy and require sophisticated projection models to learn effective binary codes.</p>
          </list-item>
        </list>
      </p>
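      <p>The trade-off between real-value and binary representations listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not a method from the surveyed literature: binary codes are obtained by simple sign thresholding rather than a learned hash function, and the helper names (<monospace>cosine</monospace>, <monospace>binarize</monospace>, <monospace>hamming</monospace>, <monospace>rank_real</monospace>, <monospace>rank_binary</monospace>) and toy vectors are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two real-valued embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def binarize(vec: Sequence[float]) -> int:
    """Sign-threshold each dimension to one bit and pack the bits
    into an int (a stand-in for a learned hash function)."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Hamming distance: number of differing bits (XOR + popcount)."""
    return bin(a ^ b).count("1")


def rank_real(query: Sequence[float], db: List[Sequence[float]]) -> List[int]:
    """Rank database items by descending cosine similarity."""
    return sorted(range(len(db)), key=lambda j: -cosine(query, db[j]))


def rank_binary(query_code: int, db_codes: List[int]) -> List[int]:
    """Rank database items by ascending Hamming distance."""
    return sorted(range(len(db_codes)),
                  key=lambda j: hamming(query_code, db_codes[j]))


# Toy 4-dimensional embeddings (made-up values).
query = [1.0, -1.0, 0.5, -0.5]
db = [
    [0.9, -0.8, 0.4, -0.6],
    [-1.0, 1.0, -0.5, 0.5],
    [1.0, 1.0, 0.5, -0.5],
]

print(rank_real(query, db))
print(rank_binary(binarize(query), [binarize(v) for v in db]))
```
]]></preformat>
      <p>Each binary code here occupies one bit per dimension instead of one floating-point number, and comparison reduces to an XOR and a bit count, which is where the storage and speed advantages come from. The information discarded by thresholding is the source of the potential accuracy loss.</p>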
      <p id="S1.p6">Based on the above classification, we summarize the representative cross-modal image-text retrieval methods in Table <xref rid="T1" ref-type="table">1</xref>. The remainder of this survey is organized as follows. Section 2 summarizes cross-modal image-text retrieval models according to the above taxonomy. Section 3 introduces MLLMs and focuses on their capabilities in cross-modal retrieval tasks. Section 4 provides a detailed overview of common cross-modal image-text datasets, evaluation metrics, and accuracy comparisons among representative approaches. Section 5 summarizes the challenges identified in the preceding review and outlines meaningful research directions for the future.</p>
    </sec>
    <sec id="S2">
      <label>2.</label>
      <title>Deep Learning-Based Models</title>
      <p id="S2.p1">This section reviews recent research on cross-modal image-text retrieval based on deep neural networks. These models typically involve two main components: feature extraction from each modality, and a module that aligns or fuses the extracted features. The primary goal is to learn a common semantic subspace that preserves semantic correlations both within and across modalities. We categorize these models by structure and feature representation into four categories: single-tower models, dual-tower models, real-value representation models, and binary representation models.</p>
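      <p>The two components described above, per-modality feature extraction followed by alignment in a common subspace, can be sketched for the dual-tower case as follows. This is a minimal illustration under assumptions: each tower is reduced to a single fixed linear projection (real models use deep encoders trained with an alignment loss), and the class <monospace>DualTower</monospace>, its methods, and the weight matrices are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(W: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply matrix W (rows x cols) by vector x (length cols)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


class DualTower:
    """Two independent towers projecting into one common subspace."""

    def __init__(self, W_image: Matrix, W_text: Matrix):
        self.W_image = W_image   # stand-in for the image tower
        self.W_text = W_text     # stand-in for the text tower

    def embed_image(self, feat: Sequence[float]) -> List[float]:
        return l2_normalize(matvec(self.W_image, feat))

    def embed_text(self, feat: Sequence[float]) -> List[float]:
        return l2_normalize(matvec(self.W_text, feat))

    def similarity(self, image_feat, text_feat) -> float:
        """Cosine similarity of the two embeddings in the common space."""
        v = self.embed_image(image_feat)
        t = self.embed_text(text_feat)
        return sum(a * b for a, b in zip(v, t))


# Made-up weights: 3-d image features and 2-d text features both
# map into a 2-d common subspace.
model = DualTower(
    W_image=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    W_text=[[0.0, 1.0], [1.0, 0.0]],
)

# Text embeddings can be computed once, offline, and reused per query.
texts = [[4.0, 3.0], [3.0, -4.0]]
text_index = [model.embed_text(t) for t in texts]

query = model.embed_image([3.0, 4.0, 7.0])
scores = [sum(a * b for a, b in zip(query, t)) for t in text_index]
print(scores)
```
]]></preformat>
      <p>Because the towers never see each other's input, database embeddings can be precomputed and indexed. A single-tower model would instead need a joint forward pass for every query-candidate pair.</p>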
      <p>
        <fig id="F2">
          <label>Figure 2.</label>
          <caption>
            <p>Illustration of single-tower and dual-tower structure.</p>
          </caption>
        </fig>
      </p>
      <p>
        <table-wrap id="T1">
          <label>Table 1</label>
          <caption>
            <p>A compilation of representative cross-modal image-text retrieval methods. Pro and Con denote the advantages and disadvantages of each method, respectively.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th style="border-top: 1px solid black;" colspan="2" align="center">Categories</th>
                <th style="border-top: 1px solid black;" align="center">Model</th>
                <th style="border-top: 1px solid black;" align="center">Technology</th>
                <th style="border-top: 1px solid black;" align="center">Pros &amp; Cons</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td style="border-top: 1px solid black;"/>
                <td style="border-top: 1px solid black;"/>
                <td style="border-top: 1px solid black;" align="left">ViLT [<xref rid="ref008" ref-type="bibr">8</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Convolution-free encoder</td>
                        </tr>
                        <tr>
                          <td align="left">Vision-language transformer</td>
                        </tr>
                        <tr>
                          <td align="left">Contrastive learning strategy</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Reduced computational complexity.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Dependence on large datasets.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td align="center">Single-tower models</td>
                <td style="border-top: 1px solid black;" align="left">Unicoder [<xref rid="ref009" ref-type="bibr">9</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <table>
                    <tr>
                      <td align="left">Universal encoder structure</td>
                    </tr>
                    <tr>
                      <td align="left">Masked object classification</td>
                    </tr>
                    <tr>
                      <td align="left">Masked language modeling</td>
                    </tr>
                  </table>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro:&amp;nbsp;&amp;nbsp;Capable of handling multiple tasks across different modalities.</td>
                        </tr>
                        <tr>
                          <td align="left">Con:&amp;nbsp;&amp;nbsp;Requires significant computational resources for pre-training.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td/>
                <td style="border-top: 1px solid black;" align="left">VisualBERT [<xref rid="ref010" ref-type="bibr">10</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">OD-based region features</td>
                        </tr>
                        <tr>
                          <td align="left">Masked language modeling task</td>
                        </tr>
                        <tr>
                          <td align="left">Sentence-image prediction task</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Achieves high performance with a straightforward and adaptable framework.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Requires extensive pre-training on image-caption datasets like COCO for optimal performance.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td style="border-top: 1px solid black;"/>
                <td style="border-top: 1px solid black;" align="left">ViLBERT [<xref rid="ref011" ref-type="bibr">11</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Two-stream architecture</td>
                        </tr>
                        <tr>
                          <td align="left">Co-attentional transformer layers</td>
                        </tr>
                        <tr>
                          <td align="left">Masked multi-modal learning</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Adaptable to a variety of tasks with minor modifications.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Requires substantial resources for pre-training and fine-tuning.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td align="center">Dual-tower models</td>
                <td style="border-top: 1px solid black;" align="left">CLIP [<xref rid="ref012" ref-type="bibr">12</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <table>
                    <tr>
                      <td align="left">Contrastive language-image pre-training</td>
                    </tr>
                    <tr>
                      <td align="left">Zero-shot transfer learning</td>
                    </tr>
                  </table>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Capable of zero-shot transfer to multiple tasks without additional training.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Demonstrates competitive performance across a wide range of vision-language benchmarks.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Requires training on 400 million (image, text) pairs collected from the internet.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td/>
                <td style="border-top: 1px solid black;" align="left">ALIGN [<xref rid="ref014" ref-type="bibr">14</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Dual-encoder architecture</td>
                        </tr>
                        <tr>
                          <td align="left">Normalized softmax loss</td>
                        </tr>
                        <tr>
                          <td align="left">Noisy data filtering</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Effective zero-shot classification without fine-tuning.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Performance might vary significantly across different datasets.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Requires extensive preprocessing to handle noisy data.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td style="border-top: 1px solid black;"/>
                <td style="border-top: 1px solid black;" align="left">ACMR [<xref rid="ref015" ref-type="bibr">15</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Adversarial learning framework</td>
                        </tr>
                        <tr>
                          <td align="left">Triplet ranking loss</td>
                        </tr>
                        <tr>
                          <td align="left">Gradient reversal layer</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Effective subspace representation preserves semantic structure across modalities.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Performance depends on careful tuning of hyperparameters.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td align="center">Deep learning-based models</td>
                <td/>
                <td style="border-top: 1px solid black;" align="left">DSCMR [<xref rid="ref016" ref-type="bibr">16</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Weight sharing strategy</td>
                        </tr>
                        <tr>
                          <td align="left">Modality invariance loss</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Learns discriminative and modality-invariant features.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Performance may vary across different data modalities.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td align="center">Real-value representation models</td>
                <td style="border-top: 1px solid black;" align="left">IEFT [<xref rid="ref017" ref-type="bibr">17</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">Channelwise feature enhancement</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Mitigates feature misalignment by simultaneous image and text embedding.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Focuses on intermodal differences, not affinities.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td/>
                <td style="border-top: 1px solid black;" align="left">COTS [<xref rid="ref018" ref-type="bibr">18</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Two-stream pre-training</td>
                        </tr>
                        <tr>
                          <td align="left">Momentum contrastive learning</td>
                        </tr>
                        <tr>
                          <td align="left">Masked vision-language modeling</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Highest performance among all two-stream methods.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: 10,800X faster in inference compared to the latest single-stream methods.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Performance might be affected by the quality of pre-training.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td/>
                <td style="border-top: 1px solid black;" align="left">TEAM [<xref rid="ref019" ref-type="bibr">19</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Token embeddings interaction</td>
                        </tr>
                        <tr>
                          <td align="left">Token embeddings alignment block</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Explicit alignment enhances fine-grained similarity measurement.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Potential increase in computational complexity due to explicit alignment.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td/>
                <td style="border-top: 1px solid black;" align="left">CLIP4CMR [<xref rid="ref034" ref-type="bibr">34</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Modality-specific MLP</td>
                        </tr>
                        <tr>
                          <td align="left">Prototype contrastive loss</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Robust to modality imbalance; maintains performance in imbalanced settings.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: The compact representation makes the method effective even in low-dimensional spaces.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Involves intricate model and loss function design, and was tested on only a few benchmark datasets.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td style="border-top: 1px solid black;"/>
                <td style="border-top: 1px solid black;" align="left">DCMH [<xref rid="ref020" ref-type="bibr">20</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Hash-code learning</td>
                        </tr>
                        <tr>
                          <td align="left">Deep neural networks for end-to-end learning</td>
                        </tr>
                        <tr>
                          <td align="left">Cross-modal hashing for multimedia retrieval</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: The first method of multimedia retrieval using hash code.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Low storage cost and fast query speed.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Performance may be affected by the quality of hand-crafted features.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td/>
                <td style="border-top: 1px solid black;" align="left">UDCMH [<xref rid="ref021" ref-type="bibr">21</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Self-taught learning</td>
                        </tr>
                        <tr>
                          <td align="left">Binary latent factor models</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Enables multimodal data search in a self-taught manner.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Preserves both the nearest and farthest neighbors of data.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Performance might be affected by the quality of unsupervised learning.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td align="center">Binary representation models</td>
                <td style="border-top: 1px solid black;" align="left">SSAH [<xref rid="ref022" ref-type="bibr">22</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Adversarial learning framework</td>
                        </tr>
                        <tr>
                          <td align="left">Self-supervised semantic network</td>
                        </tr>
                        <tr>
                          <td align="left">Two adversarial networks</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Applicable to diverse forms of multimodal data.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Transformation to a continuous space potentially improves the efficiency of the hashing.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Performance may be affected by the quality of self-supervised semantic information.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td/>
                <td style="border-top: 1px solid black;" align="left">Bi-CMR [<xref rid="ref023" ref-type="bibr">23</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Bidirectional reinforcement module</td>
                        </tr>
                        <tr>
                          <td align="left">Adjusted similarity matrix</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Recognizes that the assumption "label annotations reliably reflect the instance relevance" conflicts with human perception.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Dynamically adjusts the similarity matrix based on the label network.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Requires intricate bidirectional learning setup.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td/>
                <td style="border-top: 1px solid black;" align="left">DCHMT [<xref rid="ref024" ref-type="bibr">24</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Selecting mechanism</td>
                        </tr>
                        <tr>
                          <td align="left">Transformer-based encoder</td>
                        </tr>
                        <tr>
                          <td align="left">Differentiable hashing function</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Incorporates location information using transformers.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Uses gradient descent, bypassing NP-hard discrete optimization issues.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Requires further validation on a wider range of datasets to assess generalizability.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td/>
                <td/>
                <td style="border-top: 1px solid black;" align="left">DSPH [<xref rid="ref036" ref-type="bibr">36</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Semantic-aware proxy loss</td>
                        </tr>
                        <tr>
                          <td align="left">Multi-modal irrelevant loss</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Maintains fine-grained similarity ranking among samples.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Involves intricate proxy-based mechanisms.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td style="border-top: 1px solid black;" colspan="2"/>
                <td style="border-top: 1px solid black;" align="left">BLIP-2 [<xref rid="ref001" ref-type="bibr">1</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Q-former</td>
                        </tr>
                        <tr>
                          <td align="left">Two-stage fine tuning</td>
                        </tr>
                        <tr>
                          <td align="left">Retrieval model preselection</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Leverage frozen pre-trained image encoders and language models.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Utilize significantly fewer trainable parameters.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Lack of in-context learning capability due to pretraining on single image-text pairs.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td colspan="2"/>
                <td style="border-top: 1px solid black;" align="left">InternLM [<xref rid="ref002" ref-type="bibr">2</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Perceive sampler</td>
                        </tr>
                        <tr>
                          <td align="left">Two-stage fine tuning</td>
                        </tr>
                        <tr>
                          <td align="left">Retrieval model preselection</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Generate coherent articles by intelligently identifying areas for image integration.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Comprehension with rich multilingual knowledge by training on a multi-modal database.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Lack of established metrics for quantitatively assessing text-image composition.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td colspan="2"/>
                <td style="border-top: 1px solid black;" align="left">EIIRwQR [<xref rid="ref003" ref-type="bibr">3</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Image caption model</td>
                        </tr>
                        <tr>
                          <td align="left">Zero-shot</td>
                        </tr>
                        <tr>
                          <td align="left">Queries augmented by LLMs</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Refine queries based on user relevance feedback.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Incorporate VLM and LLM denoiser for refining query expansions.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Misleading feedback and query mismatch issues.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td colspan="2" align="center">Multimodal large language models</td>
                <td style="border-top: 1px solid black;" align="left">CIREVL [<xref rid="ref006" ref-type="bibr">6</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Fixed pattern generated by LLMs</td>
                        </tr>
                        <tr>
                          <td align="left">Zero-shot</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Requires no task-specific training.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Offer modular language reasoning.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Require further refinement to address nuanced variations in target modifications.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td colspan="2"/>
                <td style="border-top: 1px solid black;" align="left">CbIR [<xref rid="ref005" ref-type="bibr">5</xref>]</td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Chat-based text query</td>
                        </tr>
                        <tr>
                          <td align="left">Single-stage fine tuning</td>
                        </tr>
                        <tr>
                          <td align="left">Query embeddings generated by LLMs</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: A user-friendly chat-based image retrieval system.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: Engage in conversation to clarify user search intent.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Require extensive training and tuning of dialogue generation modules.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
              <tr>
                <td style="border-bottom: 1px solid black;" colspan="2"/>
                <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="left">GRACE [<xref rid="ref004" ref-type="bibr">4</xref>]</td>
                <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Unique identifier token</td>
                        </tr>
                        <tr>
                          <td align="left">Two-stage fine tuning</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
                <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="left">
                  <p>
                    <table-wrap>
                      <table>
                        <tr>
                          <td align="left">Pro: Memorize and recall images within their parameters.</td>
                        </tr>
                        <tr>
                          <td align="left">Pro: More comprehensive and contextually relevant responses to user queries.</td>
                        </tr>
                        <tr>
                          <td align="left">Con: Face challenges in effectively encoding information within the parameters of MLLMs.</td>
                        </tr>
                      </table>
                    </table-wrap>
                  </p>
                </td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </p>
      <sec id="S2.SS1">
        <label>2.1</label>
        <title>Single-tower models</title>
        <p id="S2.SS1.p1">Single-tower (single-stream) models process image and text features through a shared encoder, such as a Transformer, as shown in Figure <xref ref-type="fig" rid="">2</xref> (a). These models usually combine the two input modalities early in the network and process them jointly. The main motivation behind single-tower models is their ability to directly learn joint representations of the two modalities, capturing complex interactions between them. By using shared layers to process both modalities together, these models aim to learn rich, fused representations that benefit cross-modal retrieval tasks.</p>
        <p id="S2.SS1.p2">In this review, we focus on the role of modality fusion in enhancing retrieval performance. Particularly, single-tower models exemplify early fusion by embedding image and text inputs into a unified semantic space through shared encoders. This strategy enables deeper interactions between modalities, yielding richer representations. Additionally, MLLM-based approaches, such as BLIP-2, implement hybrid fusion through modules like Q-Former, allowing semantic alignment across modalities at multiple stages.</p>
        <p id="S2.SS1.p3">The ViLT (Vision and Language Transformer) model [<xref rid="ref008" ref-type="bibr">8</xref>] presents an innovative method for multi-modal training, drawing inspiration from the Vision Transformer (ViT) mechanism. In contrast to earlier methods that needed an object detector for region-level feature extraction, ViLT directly splits images into patches, performs linear embedding, and uses these as transformer inputs. Text data is also embedded and merged with image embeddings for joint training, significantly enhancing learning and inference efficiency. ViLT employs three pre-training objectives: Image-Text Matching (ITM), Masked Language Modeling (MLM), and Word Patch Alignment (WPA). For fine-tuning in cross-modal retrieval, ViLT initializes the similarity score head from the pre-trained ITM head and fine-tunes it with cross-entropy loss to maximize positive pair scores. Experimental results indicate that ViLT drastically reduces per-instance processing time from 900 milliseconds to 15 milliseconds, showcasing its efficiency and innovation in multi-modal learning.</p>
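        <p>The patch-based input pipeline described above can be illustrated with a minimal pure-Python sketch. It is not ViLT's actual implementation: it splits a toy 8 x 8 RGB image into 4 x 4 patches, applies a random linear projection, and concatenates the resulting patch tokens with placeholder text embeddings to form one single-stream sequence. The helper names <monospace>split_into_patches</monospace> and <monospace>linear_embed</monospace>, the toy sizes, and the random weights are illustrative assumptions, and position and modality-type embeddings are omitted for brevity.</p>

```python
import random


def split_into_patches(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append all channel values
            patches.append(flat)
    return patches


def linear_embed(vectors, weights):
    """Project each vector with a weight matrix of shape (in_dim, out_dim)."""
    out_dim = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]


rng = random.Random(0)  # fixed seed so the toy example is reproducible

# Toy 8 x 8 RGB image (assumption: real inputs are far larger).
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]

patches = split_into_patches(image, patch=4)  # 4 patches of 4 * 4 * 3 = 48 values
weights = [[rng.gauss(0.0, 0.02) for _ in range(16)] for _ in range(48)]
patch_tokens = linear_embed(patches, weights)  # 4 tokens of dimension 16

# Placeholder text embeddings standing in for 5 embedded word tokens.
text_tokens = [[rng.gauss(0.0, 0.02) for _ in range(16)] for _ in range(5)]

# Single-stream input: text and image tokens share one Transformer sequence.
sequence = text_tokens + patch_tokens
```

        <p>In this sketch the only image-specific computation before the shared Transformer is a reshape and one matrix product, which is the design choice that removes the object detector from the pipeline.</p>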
        <p id="S2.SS1.p4">Traditional pre-trained models for computer vision (CV) and natural language processing (NLP) perform well independently but face challenges with cross-modal tasks involving lengthy natural language inputs and intricate visual elements. Unicoder-VL [<xref rid="ref009" ref-type="bibr">9</xref>] utilizes a multi-layer Transformer to learn joint representations of vision and language via cross-modal pre-training. It uses three tasks: MLM, Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). The model processes linguistic and visual content simultaneously, effectively learning context-aware representations and predicting relationships between images and texts. Pre-training on large-scale image-caption pairs allows it to excel in downstream tasks such as image-text retrieval and visual commonsense reasoning. Unicoder-VL achieves state-of-the-art results in image-text retrieval on the MSCOCO and Flickr30K datasets, showcasing strong generalization abilities. However, its reliance on pre-training datasets might limit performance on tasks that require domain-specific knowledge.</p>
        <p id="S2.SS1.p5">A flexible model is needed to handle various vision-and-language tasks, capturing detailed semantics from both modalities without complex architectures. VisualBERT [<xref rid="ref010" ref-type="bibr">10</xref>] integrates BERT with pre-trained object detection systems, processing image features and text together using Transformer layers. It is pre-trained on the COCO dataset using visually-grounded language model objectives such as masked language modeling and sentence-image prediction. The model's design enables it to implicitly align language elements and image regions through self-attention, capturing intricate associations without explicit supervision. VisualBERT's design emphasizes simplicity and flexibility in handling diverse tasks.</p>
        <p id="S2.SS1.p6">Single-stream methods represent a powerful approach for cross-modal retrieval, leveraging unified Transformer architectures to effectively bridge the gap between different modalities. While these models perform well on general datasets, fine-tuning them for specific domains may require additional data and computational adjustments.</p>
      </sec>
      <sec id="S2.SS2">
        <label>2.2</label>
        <title>Dual-tower models</title>
        <p id="S2.SS2.p1">Dual-tower (dual-stream) cross-modal methods integrate and process information from multiple modalities, such as text, images, and audio. These methods are characterized by their ability to handle the heterogeneity and complexity inherent in multimodal data, thereby facilitating a richer and more comprehensive understanding and generation of content. The primary challenge they address is the effective alignment and fusion of disparate data types, which often possess different structures, noise levels, and contextual nuances. The dual-stream approach typically involves two parallel processing streams, each dedicated to handling a specific modality, as shown in Figure <xref ref-type="fig" rid="">2</xref> (b).</p>
        <p id="S2.SS2.p2">ViLBERT [<xref rid="ref011" ref-type="bibr">11</xref>] aims to tackle the challenge of jointly understanding and reasoning about vision and language, which is difficult due to the inherent differences and complexities of each modality. It employs a two-stream model in which one stream processes visual information and the other processes linguistic information. These streams interact via co-attentional Transformer layers that enable each modality to attend to the other. This co-attention mechanism is the key innovation: by letting visual and linguistic representations condition on each other, it allows the model to learn rich, joint representations of both modalities.</p>
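        <p>The co-attention idea, in which each stream attends to the other, can be sketched as a pair of cross-attention calls. This is a deliberately simplified illustration rather than ViLBERT's actual layer: the function names <monospace>cross_attend</monospace> and <monospace>co_attention</monospace> are hypothetical, and the learned query, key, and value projections, multiple heads, residual connections, and feed-forward sublayers are all omitted, so the raw features serve directly as queries, keys, and values.</p>

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query attends over the context vectors (used as keys and values)."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product scores between this query and every context vector.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context
        ]
        weights = softmax(scores)
        # Weighted average of the context vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return attended


def co_attention(visual, textual):
    """Return (visual conditioned on text, text conditioned on vision)."""
    return cross_attend(visual, textual), cross_attend(textual, visual)


visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy region features
textual = [[1.0, 0.0], [0.0, 2.0]]             # 2 toy token features
new_visual, new_textual = co_attention(visual, textual)
```

        <p>Each stream keeps its own sequence length (three visual outputs, two textual outputs), but every output vector is now a mixture of the other modality's features.</p>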
        <p id="S2.SS2.p3">CLIP [<xref rid="ref012" ref-type="bibr">12</xref>] meets the need for models that can understand and connect images and text flexibly, particularly for zero-shot learning tasks where the model must generalize to new concepts without explicit training. CLIP employs separate encoders for images and text, training them with a contrastive loss to align image and text embeddings in a shared space. The model is trained on a vast dataset of images and their corresponding captions from the internet. The key innovation is using contrastive learning to align visual and textual representations, enabling the model to perform zero-shot learning by leveraging the rich, diverse data it was trained on. CLIP demonstrates impressive zero-shot performance on tasks such as image classification and image-text retrieval without task-specific fine-tuning.</p>
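        <p>The contrastive objective used to align the two encoders can be written down compactly. The following is a minimal pure-Python sketch of a symmetric image-to-text and text-to-image cross-entropy loss over a batch similarity matrix, under the assumption that the i-th image and i-th text form the only positive pair. The function names are illustrative, and the temperature is fixed at 0.07 here for simplicity, whereas in practice it is typically a learned parameter.</p>

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true class."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image losses; pair i matches pair i."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] is the scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2.0


image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[2.0, 0.0], [0.0, 3.0]]   # same directions as the images
shuffled_texts = [[0.0, 3.0], [2.0, 0.0]]  # captions swapped between images

aligned_loss = symmetric_contrastive_loss(image_embs, matched_texts)
shuffled_loss = symmetric_contrastive_loss(image_embs, shuffled_texts)
```

        <p>With correctly paired embeddings the loss is close to zero, while swapping the captions makes it large; minimizing the loss therefore pulls matched pairs together and pushes mismatched pairs apart in the shared space.</p>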
        <p id="S2.SS2.p4">In [<xref rid="ref014" ref-type="bibr">14</xref>], the authors introduce ALIGN (A Large-scale ImaGe and Noisy-text embedding), which is trained on a massive dataset of over one billion image alt-text pairs collected from the web with only simple frequency-based filtering. ALIGN uses a straightforward dual-encoder architecture with separate encoders for images and text. The encoders are trained with a contrastive loss that brings together the embeddings of matched image-text pairs in a shared embedding space and separates those of non-matched pairs. ALIGN achieves <inline-formula><mml:math alttext="76.4\%" display="inline"><mml:mrow><mml:mn>76.4</mml:mn><mml:mo>%</mml:mo></mml:mrow></mml:math></inline-formula> top-1 accuracy on ImageNet without using any of its training samples and sets new state-of-the-art results on the Flickr30K and MSCOCO benchmarks. In addition to the basic dual-encoder designs, some recent studies further enhance retrieval quality by promoting representation diversity. For instance, Kim et al. [<xref rid="ref013" ref-type="bibr">13</xref>] proposed a method that integrates a set of diverse embeddings to enrich the semantic space, improving the robustness of cross-modal retrieval across varying query intents and data distributions.</p>
        <p id="S2.SS2.p5">Dual-stream methods offer a robust framework for cross-modal retrieval by utilizing specialized pathways for different modalities and aligning their outputs in a shared space. By effectively aligning embeddings and using tailored processing, these models achieve strong performance in retrieving relevant content across heterogeneous data types, showcasing their value in multimodal applications.</p>
      </sec>
      <sec id="S2.SS3">
        <label>2.3</label>
        <title>Real-valued representation models</title>
        <p id="S2.SS3.p1">Non-hashing methods based on real-valued representations learn dense feature representations for each modality. By using deep networks to extract deep semantic features, these methods address the feature heterogeneity of cross-modal data. They also emphasize semantic correspondence between modalities, which narrows the semantic gap and thereby improves the precision of cross-modal matching and retrieval.</p>
        <p id="S2.SS3.p2">ACMR [<xref rid="ref015" ref-type="bibr">15</xref>] tackles the challenge of aligning visual and textual data for cross-modal retrieval tasks, where traditional methods often fail to bridge the semantic gap between different modalities. The proposed solution involves employing adversarial training to learn robust cross-modal representations. Specifically, ACMR utilizes a dual-stream architecture where each modality is processed separately, with an adversarial loss to align the embeddings in a shared space. The key innovation of ACMR is the integration of adversarial learning, which encourages the model to produce modality-invariant features. This approach ensures that visual and textual representations are more closely aligned, thereby improving retrieval accuracy. ACMR significantly enhances the performance of cross-modal retrieval tasks, demonstrating improved alignment between visual and textual data and higher retrieval accuracy compared to non-adversarial methods. However, adversarial training can be complex and computationally intensive, and it may lead to potential instability during training.</p>
        <p id="S2.SS3.p3">DSCMR [<xref rid="ref016" ref-type="bibr">16</xref>] addresses the challenge of learning effective representations for cross-modal retrieval tasks, where existing methods often struggle to capture the complex relationships between different modalities. The proposed solution employs a deep supervised approach that utilizes labeled data to learn discriminative features for each modality. DSCMR uses a dual-stream network with deep neural networks for both visual and textual data, supervised by a cross-modal ranking loss. The innovation in DSCMR lies in its application of deep supervision and a cross-modal ranking loss, ensuring that the learned representations are both discriminative and aligned across modalities. DSCMR achieves state-of-the-art performance in cross-modal retrieval tasks, showcasing the effectiveness of deep supervision and ranking-based training objectives in improving retrieval accuracy. However, DSCMR requires large amounts of labeled data and is potentially prone to overfitting to specific datasets.</p>
        <p id="S2.SS3.p4">IEFT [<xref rid="ref017" ref-type="bibr">17</xref>] tackles the challenge of enhancing feature interactions for cross-modal retrieval, where traditional models often fail to fully capture the intricate relationships between visual and textual data. The proposed solution, Interacting-Enhancing Feature Transformer (IEFT), uses a Transformer-based architecture to enhance feature interactions between modalities. IEFT processes visual and textual features in separate streams and employs attention mechanisms to integrate them. The key innovation of IEFT is its use of Transformer-based attention mechanisms to enhance interactions between visual and textual features, allowing the model to learn richer and more nuanced representations. IEFT demonstrates superior performance on cross-modal retrieval benchmarks, benefiting from enhanced feature interactions and the powerful representation capabilities of Transformers.</p>
        <p id="S2.SS3.p5">COTS [<xref rid="ref018" ref-type="bibr">18</xref>] addresses the difficulty of effectively combining visual and textual information for cross-modal retrieval, where existing methods may not fully leverage the potential of collaborative learning between modalities. The solution involves a Collaborative Two-Stream (COTS) architecture, where two streams process visual and textual data independently but collaborate through shared intermediate representations and alignment losses. The innovation in COTS lies in its collaborative learning mechanism, which ensures that the two streams not only process their respective modalities effectively but also learn from each other through shared representations. While collaborative learning enhances feature alignment and robust performance across various tasks, it increases complexity due to collaboration mechanisms and potential synchronization issues between streams.</p>
        <p id="S2.SS3.p6">TEAM [<xref rid="ref019" ref-type="bibr">19</xref>] addresses the issue of aligning token embeddings from different modalities for cross-modal retrieval, where conventional methods may not fully capture the semantic relationships between visual and textual data. The proposed solution, Token Embeddings AlignMent (TEAM), employs alignment strategies to ensure that token embeddings from different modalities are closely related in a shared space. TEAM utilizes dual-stream networks with alignment losses to achieve this goal. TEAM's key innovation is its specific focus on token-level alignment, ensuring that individual tokens from text and corresponding visual elements are accurately aligned in the embedding space. TEAM significantly improves cross-modal retrieval performance by ensuring precise alignment of token embeddings, leading to better semantic understanding and retrieval accuracy. However, it incurs potentially high computational costs for fine-grained alignment and complexity in managing token-level interactions.</p>
      </sec>
      <sec id="S2.SS4">
        <label>2.4</label>
        <title>Binary representation models</title>
        <p id="S2.SS4.p1">Real-valued cross-modal image-text retrieval methods based on deep learning use feature vectors directly obtained from feature extraction for modeling and retrieval. However, with the explosive growth of multimedia data, such as short videos on TikTok or image-text information on Weibo, multimodal data often reaches hundreds of thousands, millions, or even billions of instances. This requires that the retrieval process for multimodal data ensures both precision and efficiency. Among various retrieval methods, hashing methods have gained widespread attention due to their low storage cost, efficiency, and fast retrieval speed, making them more suitable for large-scale datasets.</p>
        <p id="S2.SS4.p2">Hashing methods map feature vectors from the original feature space to binary codes (Hamming space) to save storage space and increase retrieval speed while maintaining the similarity between data points during the mapping process. Subsequently, the Hamming distance between the hash codes of the query data and those in the database is calculated for similarity ranking, ultimately yielding the retrieval results. Calculating the Hamming distance is faster than other distance metrics such as Euclidean and cosine distances. Additionally, storing data as binary codes rather than real-valued ones reduces the storage requirements for retrieval tasks.</p>
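        <p>The Hamming-distance ranking described above can be sketched in a few lines. This is an illustrative example, not code from any of the surveyed methods: the helper names are hypothetical, and hash codes are assumed to be stored as Python integers so that the distance reduces to an XOR followed by a bit count.</p>
        <preformat><![CDATA[
```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer hash code."""
    code = 0
    for bit in bits:
        code = (code << 1) | (1 if bit else 0)
    return code


def hamming_distance(code_a, code_b):
    """XOR marks the positions where the codes differ; count those ones."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query_code, database_codes):
    """Return database indices ordered from nearest to farthest code."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )
```
]]></preformat>
        <p>Because each comparison is a single XOR and a bit count on a machine word, a scan over a large database stays cheap compared with floating-point Euclidean or cosine computations on dense vectors.</p>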
        <p id="S2.SS4.p3">Learning hash functions mainly involves dimensionality reduction and quantization. Dimensionality reduction maps the information from the original space to a lower-dimensional space, such as mapping an image's original pixel space information to a lower-dimensional (e.g., tens of dimensions) representation. Quantization involves linear or nonlinear transformations of the original features and binary segmentation of the feature space to produce hash codes. As mentioned in the problem definition section of cross-modal retrieval, there is a semantic gap between different forms (modalities) of data representation. Minimizing this semantic gap remains a primary challenge for cross-modal retrieval hashing methods. Generally, there are two approaches to address this: one is learning a unified hash code, and the other is using supervised information, such as labels, to collaboratively represent and minimize the distance between hash codes of semantically relevant instances.</p>
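        <p>The two stages named above, dimensionality reduction followed by quantization, can be illustrated with a minimal sketch. It assumes a linear projection with a hypothetical weight matrix and sign-based quantization to codes in {-1, +1}; learned hash functions in the surveyed methods are typically nonlinear and trained end to end, so this is only a simplified instance.</p>
        <preformat><![CDATA[
```python
def project(feature, weights):
    """Dimensionality reduction: map a d-dim feature to k dims.

    weights is a list of k rows, each holding d coefficients (assumed given).
    """
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def quantize(projected):
    """Quantization: keep only the sign of each projected dimension."""
    return [1 if z >= 0 else -1 for z in projected]


def quantization_error(projected, code):
    """Squared distance between the real-valued projection and its code."""
    return sum((b - z) ** 2 for b, z in zip(code, projected))
```
]]></preformat>
        <p>The quantization error measures how much information is lost when real values are replaced by binary ones; many hashing objectives include a term of this kind so that the projection stays close to its binary code.</p>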
        <p id="S2.SS4.p4">DCMH [<xref rid="ref020" ref-type="bibr">20</xref>] addresses the challenge of efficiently retrieving relevant data across different modalities (e.g., text and images) by using hashing techniques to map high-dimensional data into compact binary codes. The proposed solution utilizes a deep learning framework to generate hash codes for each modality through learning shared representations. These representations are optimized to maintain semantic similarity across different modalities, ensuring related items have similar hash codes. This is the first use of deep hashing neural networks to learn these representations, allowing the model to capture complex relationships between modalities and generate more accurate hash codes.</p>
        <p id="S2.SS4.p5">UDCMH [<xref rid="ref021" ref-type="bibr">21</xref>] addresses the challenge of cross-modal retrieval without labeled data, which is significant since traditional supervised methods rely heavily on labeled training examples. The key innovation is the unsupervised learning approach, which eliminates the need for labeled data and still achieves effective cross-modal retrieval by learning from the data's inherent structure. This approach demonstrates strong performance in cross-modal retrieval tasks, especially in scenarios where labeled data is scarce or unavailable. However, its performance may not match supervised methods on well-labeled datasets and may be sensitive to the quality of the data structure.</p>
        <p id="S2.SS4.p6">SSAH [<xref rid="ref022" ref-type="bibr">22</xref>] tackles the challenge of generating robust hash codes for cross-modal retrieval by leveraging the advantages of both self-supervised learning and adversarial training. Self-supervised learning generates initial hash codes, while adversarial training refines these codes to ensure they are modality-invariant and semantically meaningful. This combination enables the model to learn effective representations without the need for extensive labeled data. SSAH achieves enhanced retrieval performance and robustness, demonstrating the effectiveness of its novel training strategy.</p>
        <p id="S2.SS4.p7">Bi-CMR [<xref rid="ref023" ref-type="bibr">23</xref>] is the first to recognize that the assumption "label annotations reliably reflect instance relevance" conflicts with human perception. It proposes a new evaluation method to guide the learning of instance hash codes consistent with human perception. Bi-CMR introduces a novel bidirectional reinforcement-guided hashing method that reinforces hash code learning through mutual promotion. The key innovation is using reinforcement learning to dynamically adjust and improve the hashing process, ensuring the generated hash codes are effective for cross-modal retrieval. Bi-CMR demonstrates superior performance in cross-modal retrieval tasks, with hash codes that are well-aligned and optimized for retrieval accuracy.</p>
        <p id="S2.SS4.p8">DCHMT [<xref rid="ref024" ref-type="bibr">24</xref>] tackles the challenge of effectively integrating and hashing data from multiple modalities within a unified framework. It constructs a multi-modal transformer to capture detailed cross-modal semantic information and introduces a differentiable hashing module to map modal representations into hash codes. UCMFH, by contrast, targets cross-modal retrieval without labeled data and learns hash codes through unsupervised contrastive learning: the model maximizes the similarity between related items across modalities while minimizing the similarity between unrelated items. UCMFH demonstrates strong performance in unsupervised cross-modal retrieval tasks by learning from the inherent structure of the data.</p>
        <p id="S2.SS4.p9">Overall, real-valued representations are suitable for tasks that require high precision, while hashing representations are ideal for applications that need rapid, large-scale retrieval.</p>
      </sec>
    </sec>
    <sec id="S3">
      <label>3.</label>
      <title>Multimodal Large Language Models</title>
      <p>
        <fig id="F3">
          <label>Figure 3.</label>
          <caption>
            <p>Example of BLIP-2's pipeline for text-to-image retrieval.</p>
          </caption>
          <graphic xlink:href="figures/llm_pipeline.drawio.pdf"/>
        </fig>
      </p>
      <p id="S3.p1">In the past two years, large language models (LLMs) have made significant strides, demonstrating the ability to perform many NLP downstream tasks in a zero-shot setting. However, their inference capabilities with data from other modalities have been limited. To address this gap, multimodal large language models (MLLMs) have been proposed. These models are capable of not only generating and understanding complex text but also processing image information, allowing a single MLLM to handle multiple multimodal downstream tasks. Utilizing MLLMs for image-text retrieval has emerged as a powerful and widely applied technique. By integrating natural language processing and computer vision technologies, MLLMs can extract information from vast datasets to support precise image-text matching and search.</p>
      <p id="S3.p2">Before proceeding, we first differentiate between VLP models and MLLMs. We define VLP as a multimodal pre-training model tailored for specific tasks involving vision and language. In contrast, MLLMs are pre-trained models capable of addressing multiple complex reasoning tasks across different modalities. The key distinction lies in their ability to handle multiple downstream tasks. Therefore, VLP models are not covered in this section. Our categorization is based on the core components and capabilities of the models.</p>
      <p id="S3.p3">The process of using MLLMs for image-text retrieval generally follows one or more of the following strategies:</p>
      <p>
        <list list-type="bullet" id="S3.I1">
          <list-item id="S3.I1.i1">
            <p id="S3.I1.i1.p1">Using an MLLM trained on large-scale data and fine-tuning it with an image-text retrieval dataset.</p>
          </list-item>
          <list-item id="S3.I1.i2">
            <p id="S3.I1.i2.p1">Employing specific prompts to complete the image-text retrieval task.</p>
          </list-item>
          <list-item id="S3.I1.i3">
            <p id="S3.I1.i3.p1">Involving smaller image-text retrieval models to assist the MLLM in the task.</p>
          </list-item>
        </list>
      </p>
      <p id="S3.p5">BLIP-2 [<xref rid="ref001" ref-type="bibr">1</xref>] supports bidirectional retrieval by leveraging a frozen pre-trained image encoder and a frozen large language model. The text-to-image retrieval pipeline used by BLIP-2 is illustrated in Figure <xref ref-type="fig" rid="F3">3</xref>. A Q-Former bridges the gap between the modalities and is trained in two stages: first with the frozen image encoder, and then with the frozen language model. The retrieval process begins by selecting 128 candidate images based on image-text similarity. These candidate images, along with the query text, are then input into the model, which selects the most relevant image as the retrieval result. Essentially, this approach uses a model built for generative tasks to perform retrieval, and it keeps the more expensive matching step limited to a small candidate set.</p>
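      <p>The candidate-then-select procedure described above can be sketched as a generic two-stage function. This is not BLIP-2's code: <monospace>coarse_score</monospace> and <monospace>fine_score</monospace> are hypothetical callables standing in for the fast similarity model and the more expensive matching model, and only the default shortlist size of 128 is taken from the text.</p>
      <preformat><![CDATA[
```python
def retrieve_then_rerank(query, gallery, coarse_score, fine_score, k=128):
    """Two-stage retrieval: cheap shortlist first, expensive selection second.

    coarse_score and fine_score are assumed callables of the form
    score(query, item) returning a number, where higher means more relevant.
    """
    if not gallery:
        return None
    # Stage 1: keep the k items with the highest coarse similarity.
    shortlist = sorted(
        gallery, key=lambda item: coarse_score(query, item), reverse=True
    )[:k]
    # Stage 2: let the stronger model pick the best item in the shortlist.
    return max(shortlist, key=lambda item: fine_score(query, item))
```
]]></preformat>
      <p>One consequence of this design is that an item excluded by the coarse stage can never be returned, even if the fine model would have ranked it first, so the shortlist size trades recall against inference cost.</p>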
      <p id="S3.p6">InternLM [<xref rid="ref002" ref-type="bibr">2</xref>] focuses solely on image retrieval. It involves fine-tuning both the Perceive Sampler and the MLLM, followed by fine-tuning Perceive Sampler with LoRA. Initially, CLIP is used to select the top-k candidate images, from which the MLLM selects one image as the final retrieved result. This approach, like the previous one, is fundamentally generative.</p>
      <p id="S3.p7">EIIRwQR [<xref rid="ref003" ref-type="bibr">3</xref>] also focuses on image retrieval, utilizing a VLM to generate a set of candidate images. Each candidate image is described with a caption generated by an image description model. The MLLM takes the original query and these generated captions as input, modifying each query. The VLM then uses the modified queries for image retrieval. This process is iterated multiple times to refine the final retrieval result. The MLLM is employed only during the inference stage without any fine-tuning, categorizing this approach as using MLLMs for data augmentation.</p>
      <p id="S3.p8">CIREVL [<xref rid="ref006" ref-type="bibr">6</xref>] focuses on image retrieval without any training process, addressing the high labor costs associated with annotated data. It employs an MLLM to transform the text into a fixed descriptive sentence format, which is then used by a traditional model for image retrieval. The MLLM is utilized only during the inference stage and is not fine-tuned, so this approach also falls under using MLLMs for data augmentation.</p>
      <p id="S3.p9">In CbIR [<xref rid="ref005" ref-type="bibr">5</xref>], dialogues are used as input. The large model is fine-tuned on the accumulated dialogue information with a contrastive loss to produce <inline-formula><mml:math alttext="256" display="inline"><mml:mn>256</mml:mn></mml:math></inline-formula>-dimensional retrieval vectors. These vectors are then compared with <inline-formula><mml:math alttext="256" display="inline"><mml:mn>256</mml:mn></mml:math></inline-formula>-dimensional image vectors using cosine similarity to retrieve the images.</p>
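      <p>The final comparison step, ranking image vectors by cosine similarity to a query vector, can be sketched as follows. The function names and the default of five results are assumptions for illustration; the vectors are plain Python lists of equal length (256 in the setting above) and are assumed to be non-zero.</p>
      <preformat><![CDATA[
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_vec, image_vecs, k=5):
    """Return the indices of the k image vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(image_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]
```
]]></preformat>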
      <p id="S3.p10">GRACE [<xref rid="ref004" ref-type="bibr">4</xref>] assigns each image a unique identifier token and trains the model, through instructions, to predict the identifier for the &lt;image token&gt;. During inference, the model predicts the image identifier corresponding to the given query.</p>
      <p id="S3.p11">In fact, aside from the methods mentioned above, most MLLMs can potentially be employed for image-text retrieval tasks, although many of these models have not been specifically tested for this purpose. Additionally, existing MLLM methods tested for image-text retrieval typically involve LLMs trained solely on text data. However, there are models like Google's Gemini [<xref rid="ref007" ref-type="bibr">7</xref>], which are inherently multimodal. Instead of a two-stage process where the model is first trained on text and then on images, these models are pre-trained on multimodal data from the beginning. Such inherently multimodal models exhibit greater adaptability and robustness with multimodal data. Future exploration of these native multimodal LLMs may further enhance the performance of image-text retrieval.</p>
      <p id="S3.p12">In summary, the existing works highlight various approaches to utilizing MLLMs for image-text retrieval. The methods range from leveraging pre-trained models and fine-tuning specific components to employing generative techniques and using MLLMs for data augmentation without additional training. These diverse strategies underscore the flexibility and potential of MLLMs in enhancing image-text retrieval tasks, paving the way for more accurate and efficient retrieval systems in the future.</p>
    </sec>
    <sec id="S4">
      <label>4.</label>
      <title>Datasets and Evaluation</title>
      <sec id="S4.SS1">
        <label>4.1</label>
        <title>Datasets</title>
        <p id="S4.SS1.p1">Researchers have proposed various datasets for cross-modal image-text retrieval, including Wikipedia [<xref rid="ref025" ref-type="bibr">25</xref>], NUS-WIDE [<xref rid="ref026" ref-type="bibr">26</xref>], TC-12 [<xref rid="ref027" ref-type="bibr">27</xref>], Flickr [<xref rid="ref028" ref-type="bibr">28</xref>], and Pascal Sentence [<xref rid="ref029" ref-type="bibr">29</xref>]. The most frequently used datasets are MSCOCO [<xref rid="ref030" ref-type="bibr">30</xref>] and Flickr30K [<xref rid="ref028" ref-type="bibr">28</xref>]. The MSCOCO (Microsoft Common Objects in Context) dataset contains <inline-formula><mml:math alttext="123,287" display="inline"><mml:mrow><mml:mn>123</mml:mn><mml:mo>,</mml:mo><mml:mn>287</mml:mn></mml:mrow></mml:math></inline-formula> images, each paired with five human-generated textual captions. After removing rare words, the average caption length is <inline-formula><mml:math alttext="8.7" display="inline"><mml:mn>8.7</mml:mn></mml:math></inline-formula> words. The dataset is divided into <inline-formula><mml:math alttext="82,783" display="inline"><mml:mrow><mml:mn>82</mml:mn><mml:mo>,</mml:mo><mml:mn>783</mml:mn></mml:mrow></mml:math></inline-formula> training image-text pairs, <inline-formula><mml:math alttext="5,000" display="inline"><mml:mrow><mml:mn>5</mml:mn><mml:mo>,</mml:mo><mml:mn>000</mml:mn></mml:mrow></mml:math></inline-formula> validation pairs, and <inline-formula><mml:math alttext="5,000" display="inline"><mml:mrow><mml:mn>5</mml:mn><mml:mo>,</mml:mo><mml:mn>000</mml:mn></mml:mrow></mml:math></inline-formula> test pairs. Model evaluations are conducted on five folds of <inline-formula><mml:math alttext="1,000" display="inline"><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>000</mml:mn></mml:mrow></mml:math></inline-formula> test pairs and on the entire set of <inline-formula><mml:math alttext="5,000" display="inline"><mml:mrow><mml:mn>5</mml:mn><mml:mo>,</mml:mo><mml:mn>000</mml:mn></mml:mrow></mml:math></inline-formula> test pairs. Flickr30K comprises <inline-formula><mml:math alttext="31,000" display="inline"><mml:mrow><mml:mn>31</mml:mn><mml:mo>,</mml:mo><mml:mn>000</mml:mn></mml:mrow></mml:math></inline-formula> images sourced from the Flickr website, each annotated with five textual descriptions. The dataset is split into three sections: <inline-formula><mml:math alttext="1,000" display="inline"><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>000</mml:mn></mml:mrow></mml:math></inline-formula> image-text pairs for validation, <inline-formula><mml:math alttext="1,000" display="inline"><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>000</mml:mn></mml:mrow></mml:math></inline-formula> pairs for testing, and the remaining pairs for training.</p>
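        <p>The MSCOCO evaluation protocol mentioned above, scoring on five folds of 1,000 test pairs as well as on the full 5,000, can be sketched as follows. The helper names are hypothetical, <monospace>metric_fn</monospace> stands in for whatever retrieval metric is being reported, and averaging the fold scores is an assumption based on common practice rather than a detail stated in the text.</p>
        <preformat><![CDATA[
```python
def split_into_folds(test_pairs, fold_size=1000):
    """Cut the test set into consecutive, non-overlapping folds."""
    return [
        test_pairs[i:i + fold_size]
        for i in range(0, len(test_pairs), fold_size)
    ]


def evaluate_folds_and_full(test_pairs, metric_fn, fold_size=1000):
    """Report the fold-averaged score and the score on the full test set.

    metric_fn is an assumed callable that maps a list of test pairs to a
    single number (for example, a retrieval accuracy).
    """
    folds = split_into_folds(test_pairs, fold_size)
    fold_scores = [metric_fn(fold) for fold in folds]
    return {
        "fold_average": sum(fold_scores) / len(fold_scores),
        "full_set": metric_fn(test_pairs),
    }
```
]]></preformat>
        <p>Scores on the full set are usually lower than the fold average because each query must be matched against five times as many candidates.</p>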
        <p>
          <table-wrap id="T2">
            <label>Table 2</label>
            <caption>
              <p>The MAP@ALL results of real-valued cross-modal image-text retrieval methods. The experimental results are from [<xref rid="ref038" ref-type="bibr">38</xref>] and [<xref rid="ref037" ref-type="bibr">37</xref>].</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th style="border-top: 1px solid black;" align="center">Task</th>
                  <th style="border-top: 1px solid black;" align="center">Methods</th>
                  <th style="border-top: 1px solid black;" align="center">Source</th>
                  <th style="border-top: 1px solid black;" align="center">Wikipedia</th>
                  <th style="border-top: 1px solid black;" align="center">Pascal-Sentence</th>
                  <th style="border-top: 1px solid black;" align="center">NUS-WIDE</th>
                  <th style="border-top: 1px solid black;" align="center">XMedia</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td style="border-top: 1px solid black;"/>
                  <td style="border-top: 1px solid black;" align="center">ACMR [<xref rid="ref015" ref-type="bibr">15</xref>]</td>
                  <td style="border-top: 1px solid black;" align="center">ACM MM17</td>
                  <td style="border-top: 1px solid black;" align="center">0.468</td>
                  <td style="border-top: 1px solid black;" align="center">0.538</td>
                  <td style="border-top: 1px solid black;" align="center">0.519</td>
                  <td style="border-top: 1px solid black;" align="center">0.536</td>
                </tr>
                <tr>
                  <td rowspan="3" align="center">IqT</td>
                  <td align="center">CM-GANS [<xref rid="ref032" ref-type="bibr">32</xref>]</td>
                  <td align="center">TMM18</td>
                  <td align="center">0.521</td>
                  <td align="center">0.603</td>
                  <td align="center">0.536</td>
                  <td align="center">0.567</td>
                </tr>
                <tr>
                  <td align="center">DSCMR [<xref rid="ref016" ref-type="bibr">16</xref>]</td>
                  <td align="center">CVPR19</td>
                  <td align="center">0.521</td>
                  <td align="center">0.674</td>
                  <td align="center">0.611</td>
                  <td align="center">0.697</td>
                </tr>
                <tr>
                  <td align="center">AGCN [<xref rid="ref033" ref-type="bibr">33</xref>]</td>
                  <td align="center">IEEE CSVT22</td>
                  <td align="center">0.620</td>
                  <td align="center">0.683</td>
                  <td align="center">-</td>
                  <td align="center">-</td>
                </tr>
                <tr>
                  <td/>
                  <td align="center">CLIP4CMR [<xref rid="ref034" ref-type="bibr">34</xref>]</td>
                  <td align="center">ARXIV22</td>
                  <td align="center">0.592</td>
                  <td align="center">0.698</td>
                  <td align="center">0.609</td>
                  <td align="center">0.746</td>
                </tr>
                <tr>
                  <td style="border-top: 1px solid black;"/>
                  <td style="border-top: 1px solid black;" align="center">ACMR [<xref rid="ref015" ref-type="bibr">15</xref>]</td>
                  <td style="border-top: 1px solid black;" align="center">ACM MM17</td>
                  <td style="border-top: 1px solid black;" align="center">0.412</td>
                  <td style="border-top: 1px solid black;" align="center">0.544</td>
                  <td style="border-top: 1px solid black;" align="center">0.542</td>
                  <td style="border-top: 1px solid black;" align="center">0.519</td>
                </tr>
                <tr>
                  <td rowspan="3" align="center">TqI</td>
                  <td align="center">CM-GANS [<xref rid="ref032" ref-type="bibr">32</xref>]</td>
                  <td align="center">TMM18</td>
                  <td align="center">0.466</td>
                  <td align="center">0.604</td>
                  <td align="center">0.551</td>
                  <td align="center">0.551</td>
                </tr>
                <tr>
                  <td align="center">DSCMR [<xref rid="ref016" ref-type="bibr">16</xref>]</td>
                  <td align="center">CVPR19</td>
                  <td align="center">0.478</td>
                  <td align="center">0.682</td>
                  <td align="center">0.615</td>
                  <td align="center">0.693</td>
                </tr>
                <tr>
                  <td align="center">AGCN [<xref rid="ref033" ref-type="bibr">33</xref>]</td>
                  <td align="center">IEEE CSVT22</td>
                  <td align="center">0.532</td>
                  <td align="center">0.683</td>
                  <td align="center">-</td>
                  <td align="center">-</td>
                </tr>
                <tr>
                  <td style="border-bottom: 1px solid black;"/>
                  <td style="border-bottom: 1px solid black;" align="center">CLIP4CMR [<xref rid="ref034" ref-type="bibr">34</xref>]</td>
                  <td style="border-bottom: 1px solid black;" align="center">ARXIV22</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.574</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.692</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.621</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.758</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
        <p>
          <table-wrap id="T3">
            <label>Table 3</label>
            <caption>
              <p>The MAP@ALL results of binary cross-modal image-text retrieval methods. The experimental results are from [<xref rid="ref038" ref-type="bibr">38</xref>] and [<xref rid="ref037" ref-type="bibr">37</xref>].</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th style="border-top: 1px solid black;" rowspan="2" align="center">Task</th>
                  <th style="border-top: 1px solid black;" rowspan="2" align="center">Methods</th>
                  <th style="border-top: 1px solid black;" rowspan="2" align="center">Source</th>
                  <th style="border-top: 1px solid black;" colspan="3" align="center">MirFlickr</th>
                  <th style="border-top: 1px solid black;" colspan="3" align="center">NUS-WIDE</th>
                  <th style="border-top: 1px solid black;" colspan="3" align="center">MS COCO</th>
                </tr>
                <tr>
                  <th align="center">16bits</th>
                  <th align="center">32bits</th>
                  <th align="center">64bits</th>
                  <th align="center">16bits</th>
                  <th align="center">32bits</th>
                  <th align="center">64bits</th>
                  <th align="center">16bits</th>
                  <th align="center">32bits</th>
                  <th align="center">64bits</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td style="border-top: 1px solid black;"/>
                  <td style="border-top: 1px solid black;" align="center">DCMH [<xref rid="ref020" ref-type="bibr">20</xref>]</td>
                  <td style="border-top: 1px solid black;" align="center">CVPR17</td>
                  <td style="border-top: 1px solid black;" align="center">0.724</td>
                  <td style="border-top: 1px solid black;" align="center">0.731</td>
                  <td style="border-top: 1px solid black;" align="center">0.731</td>
                  <td style="border-top: 1px solid black;" align="center">0.568</td>
                  <td style="border-top: 1px solid black;" align="center">0.561</td>
                  <td style="border-top: 1px solid black;" align="center">0.596</td>
                  <td style="border-top: 1px solid black;" align="center">0.505</td>
                  <td style="border-top: 1px solid black;" align="center">0.536</td>
                  <td style="border-top: 1px solid black;" align="center">0.557</td>
                </tr>
                <tr>
                  <td rowspan="3" align="center">IqT</td>
                  <td align="center">SSAH [<xref rid="ref022" ref-type="bibr">22</xref>]</td>
                  <td align="center">CVPR18</td>
                  <td align="center">0.903</td>
                  <td align="center">0.922</td>
                  <td align="center">0.925</td>
                  <td align="center">0.691</td>
                  <td align="center">0.727</td>
                  <td align="center">0.728</td>
                  <td align="center">0.632</td>
                  <td align="center">0.669</td>
                  <td align="center">0.668</td>
                </tr>
                <tr>
                  <td align="center">DCHUC [<xref rid="ref035" ref-type="bibr">35</xref>]</td>
                  <td align="center">TKDE20</td>
                  <td align="center">0.895</td>
                  <td align="center">0.916</td>
                  <td align="center">0.926</td>
                  <td align="center">0.707</td>
                  <td align="center">0.672</td>
                  <td align="center">0.738</td>
                  <td align="center">0.513</td>
                  <td align="center">0.550</td>
                  <td align="center">0.558</td>
                </tr>
                <tr>
                  <td align="center">HMAH [<xref rid="ref031" ref-type="bibr">31</xref>]</td>
                  <td align="center">TMM23</td>
                  <td align="center">0.960</td>
                  <td align="center">0.965</td>
                  <td align="center">0.969</td>
                  <td align="center">0.813</td>
                  <td align="center">0.825</td>
                  <td align="center">0.840</td>
                  <td align="center">0.691</td>
                  <td align="center">0.732</td>
                  <td align="center">0.763</td>
                </tr>
                <tr>
                  <td/>
                  <td align="center">DSPH [<xref rid="ref036" ref-type="bibr">36</xref>]</td>
                  <td align="center">TCSVT23</td>
                  <td align="center">0.925</td>
                  <td align="center">0.940</td>
                  <td align="center">0.945</td>
                  <td align="center">0.852</td>
                  <td align="center">0.905</td>
                  <td align="center">0.929</td>
                  <td align="center">0.793</td>
                  <td align="center">0.815</td>
                  <td align="center">0.833</td>
                </tr>
                <tr>
                  <td style="border-top: 1px solid black;"/>
                  <td style="border-top: 1px solid black;" align="center">DCMH [<xref rid="ref020" ref-type="bibr">20</xref>]</td>
                  <td style="border-top: 1px solid black;" align="center">CVPR17</td>
                  <td style="border-top: 1px solid black;" align="center">0.764</td>
                  <td style="border-top: 1px solid black;" align="center">0.749</td>
                  <td style="border-top: 1px solid black;" align="center">0.780</td>
                  <td style="border-top: 1px solid black;" align="center">0.558</td>
                  <td style="border-top: 1px solid black;" align="center">0.591</td>
                  <td style="border-top: 1px solid black;" align="center">0.616</td>
                  <td style="border-top: 1px solid black;" align="center">0.549</td>
                  <td style="border-top: 1px solid black;" align="center">0.572</td>
                  <td style="border-top: 1px solid black;" align="center">0.605</td>
                </tr>
                <tr>
                  <td rowspan="3" align="center">TqI</td>
                  <td align="center">SSAH [<xref rid="ref022" ref-type="bibr">22</xref>]</td>
                  <td align="center">CVPR18</td>
                  <td align="center">0.896</td>
                  <td align="center">0.906</td>
                  <td align="center">0.915</td>
                  <td align="center">0.658</td>
                  <td align="center">0.673</td>
                  <td align="center">0.666</td>
                  <td align="center">0.583</td>
                  <td align="center">0.556</td>
                  <td align="center">0.664</td>
                </tr>
                <tr>
                  <td align="center">DCHUC [<xref rid="ref035" ref-type="bibr">35</xref>]</td>
                  <td align="center">TKDE20</td>
                  <td align="center">0.764</td>
                  <td align="center">0.749</td>
                  <td align="center">0.780</td>
                  <td align="center">0.558</td>
                  <td align="center">0.591</td>
                  <td align="center">0.616</td>
                  <td align="center">0.549</td>
                  <td align="center">0.572</td>
                  <td align="center">0.605</td>
                </tr>
                <tr>
                  <td align="center">HMAH [<xref rid="ref031" ref-type="bibr">31</xref>]</td>
                  <td align="center">TMM23</td>
                  <td align="center">0.915</td>
                  <td align="center">0.925</td>
                  <td align="center">0.938</td>
                  <td align="center">0.783</td>
                  <td align="center">0.796</td>
                  <td align="center">0.814</td>
                  <td align="center">0.800</td>
                  <td align="center">0.869</td>
                  <td align="center">0.904</td>
                </tr>
                <tr>
                  <td style="border-bottom: 1px solid black;"/>
                  <td style="border-bottom: 1px solid black;" align="center">DSPH [<xref rid="ref036" ref-type="bibr">36</xref>]</td>
                  <td style="border-bottom: 1px solid black;" align="center">TCSVT23</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.897</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.904</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.911</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.859</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.920</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.935</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.792</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.800</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.819</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
      </sec>
      <sec id="S4.SS2">
        <label>4.2</label>
        <title>Evaluation</title>
        <p id="S4.SS2.p1">Two evaluation metrics are widely used to assess cross-modal retrieval. Mean Average Precision@K (MAP@K) computes the average precision over the top-K results of each query and then averages these values over all queries. For MLLMs, experimental validation commonly uses R@n, the proportion of queries for which at least one correct result appears within the top-n results.</p>
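        <p>As an illustration of the two metrics, the following Python sketch computes R@n and MAP@K from ranked retrieval lists. The function names and the toy rankings are hypothetical, and normalizing average precision by the number of relevant items found within the top K is an assumption; papers differ in the exact MAP@K normalization they use.</p>
        <p>
          <code language="python"><![CDATA[
```python
from typing import List, Set


def recall_at_n(rankings: List[List[int]], relevant: List[Set[int]], n: int) -> float:
    """Fraction of queries with at least one relevant item in the top-n results."""
    hits = sum(
        1
        for ranked, rel in zip(rankings, relevant)
        if any(item in rel for item in ranked[:n])
    )
    return hits / len(rankings)


def average_precision_at_k(ranked: List[int], rel: Set[int], k: int) -> float:
    """Mean of the precision values at each top-k rank that holds a relevant item."""
    found = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in rel:
            found += 1
            precision_sum += found / rank
    # Assumption: normalize by the number of relevant items found in the top k.
    return precision_sum / found if found else 0.0


def map_at_k(rankings: List[List[int]], relevant: List[Set[int]], k: int) -> float:
    """Average of the per-query average precision values."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevant)]
    return sum(aps) / len(aps)


# Toy data (hypothetical): two queries, each with a ranked list of item ids
# and a set of ground-truth relevant ids.
rankings = [[3, 1, 2], [5, 4, 6]]
relevant = [{1}, {5, 6}]

print(recall_at_n(rankings, relevant, 1))
print(recall_at_n(rankings, relevant, 2))
print(map_at_k(rankings, relevant, 3))
```
]]></code>
        </p>
        <p>On this toy data, R@1 is 0.5, R@2 is 1.0, and MAP@3 is approximately 0.667.</p>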
        <p>
          <table-wrap id="T4">
            <label>Table 4</label>
            <caption>
              <p>R@n results of MLLM methods. All results are taken from the original papers.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th style="border-top: 1px solid black;" rowspan="2" align="center">Task</th>
                  <th style="border-top: 1px solid black;" rowspan="2" align="center">Methods</th>
                  <th style="border-top: 1px solid black;" rowspan="2" align="center">Source</th>
                  <th style="border-top: 1px solid black;" colspan="3" align="center">Flickr30K</th>
                  <th style="border-top: 1px solid black;" colspan="3" align="center">MS-COCO (5K)</th>
                </tr>
                <tr>
                  <th align="center">R@1</th>
                  <th align="center">R@5</th>
                  <th align="center">R@10</th>
                  <th align="center">R@1</th>
                  <th align="center">R@5</th>
                  <th align="center">R@10</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td style="border-top: 1px solid black;" rowspan="4" align="center">IqT</td>
                  <td style="border-top: 1px solid black;" align="center">AGREE (FT only) [<xref rid="ref039" ref-type="bibr">39</xref>]</td>
                  <td style="border-top: 1px solid black;" rowspan="2" align="center">WSDM23</td>
                  <td style="border-top: 1px solid black;" align="center">0.916</td>
                  <td style="border-top: 1px solid black;" align="center">0.987</td>
                  <td style="border-top: 1px solid black;" align="center">0.992</td>
                  <td style="border-top: 1px solid black;" align="center">-</td>
                  <td style="border-top: 1px solid black;" align="center">-</td>
                  <td style="border-top: 1px solid black;" align="center">-</td>
                </tr>
                <tr>
                  <td align="center">AGREE [<xref rid="ref039" ref-type="bibr">39</xref>]</td>
                  <td align="center">0.921</td>
                  <td align="center">0.987</td>
                  <td align="center">0.992</td>
                  <td align="center">-</td>
                  <td align="center">-</td>
                  <td align="center">-</td>
                </tr>
                <tr>
                  <td style="border-top: 1px solid black;" align="center">BLIP-2 ViT-L [<xref rid="ref001" ref-type="bibr">1</xref>]</td>
                  <td style="border-top: 1px solid black;" rowspan="2" align="center">ICML23</td>
                  <td align="center">0.969</td>
                  <td align="center">1.000</td>
                  <td align="center">1.000</td>
                  <td align="center">0.835</td>
                  <td align="center">0.960</td>
                  <td align="center">0.980</td>
                </tr>
                <tr>
                  <td align="center">BLIP-2 ViT-g [<xref rid="ref001" ref-type="bibr">1</xref>]</td>
                  <td align="center">0.976</td>
                  <td align="center">1.000</td>
                  <td align="center">1.000</td>
                  <td align="center">0.854</td>
                  <td align="center">0.970</td>
                  <td align="center">0.985</td>
                </tr>
                <tr>
                  <td style="border-top: 1px solid black;border-bottom: 1px solid black;" rowspan="5" align="center">TqI</td>
                  <td style="border-top: 1px solid black;" align="center">GRACE [<xref rid="ref004" ref-type="bibr">4</xref>]</td>
                  <td style="border-top: 1px solid black;" align="center">ARXIV24</td>
                  <td style="border-top: 1px solid black;" align="center">0.684</td>
                  <td style="border-top: 1px solid black;" align="center">0.889</td>
                  <td style="border-top: 1px solid black;" align="center">0.937</td>
                  <td style="border-top: 1px solid black;" align="center">0.415</td>
                  <td style="border-top: 1px solid black;" align="center">0.691</td>
                  <td style="border-top: 1px solid black;" align="center">0.791</td>
                </tr>
                <tr>
                  <td style="border-top: 1px solid black;" align="center">AGREE (FT only) [<xref rid="ref039" ref-type="bibr">39</xref>]</td>
                  <td style="border-top: 1px solid black;" rowspan="2" align="center">WSDM23</td>
                  <td align="center">0.781</td>
                  <td align="center">0.951</td>
                  <td align="center">0.978</td>
                  <td align="center">-</td>
                  <td align="center">-</td>
                  <td align="center">-</td>
                </tr>
                <tr>
                  <td align="center">AGREE [<xref rid="ref039" ref-type="bibr">39</xref>]</td>
                  <td align="center">0.828</td>
                  <td align="center">0.959</td>
                  <td align="center">0.978</td>
                  <td align="center">-</td>
                  <td align="center">-</td>
                  <td align="center">-</td>
                </tr>
                <tr>
                  <td style="border-top: 1px solid black;" align="center">BLIP-2 ViT-L [<xref rid="ref001" ref-type="bibr">1</xref>]</td>
                  <td style="border-top: 1px solid black;border-bottom: 1px solid black;" rowspan="2" align="center">ICML23</td>
                  <td align="center">0.886</td>
                  <td align="center">0.976</td>
                  <td align="center">0.989</td>
                  <td align="center">0.663</td>
                  <td align="center">0.865</td>
                  <td align="center">0.918</td>
                </tr>
                <tr>
                  <td style="border-bottom: 1px solid black;" align="center">BLIP-2 ViT-g [<xref rid="ref001" ref-type="bibr">1</xref>]</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.897</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.981</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.989</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.683</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.877</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.926</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
      </sec>
      <sec id="S4.SS3">
        <label>4.3</label>
        <title>Result Analysis</title>
        <p id="S4.SS3.p1">In this section, we report the accuracy of several representative cross-modal retrieval methods. Tables <xref rid="T2" ref-type="table">2</xref> to <xref rid="T4" ref-type="table">4</xref> compare these methods using the measures commonly adopted for each task. From these results, we make the following observations:</p>
        <p>
          <list list-type="bullet" id="S4.I1">
            <list-item id="S4.I1.i1">
              <p id="S4.I1.i1.p1">As shown in Table <xref rid="T2" ref-type="table">2</xref>, in cross-modal real-valued retrieval, methods based on Vision-Language Pre-training (VLP) or transformer structures often achieve higher accuracy than the other methods. This improvement can be attributed to the stronger ability of their encoders to extract semantic information, as the performance of CLIP4CMR demonstrates.</p>
            </list-item>
            <list-item id="S4.I1.i2">
              <p id="S4.I1.i2.p1">As shown in Table <xref rid="T3" ref-type="table">3</xref>, the accuracy of cross-modal hashing retrieval methods varies with the hash code length. Most methods become more accurate as the code length increases, which suggests that longer codes can represent more semantic information. However, the improvement from 32-bit to 64-bit codes is often smaller than the improvement from 16-bit to 32-bit codes. A possible explanation is that once the code is long enough to capture the semantics relevant to retrieval, additional bits contribute little further information.</p>
            </list-item>
            <list-item id="S4.I1.i3">
              <p id="S4.I1.i3.p1">As shown in Table <xref rid="T4" ref-type="table">4</xref>, most MLLM-based methods retrieve a correct result within the top-5 results for the large majority of queries. Some models even reach 100% recall: BLIP-2 attains R@5 and R@10 of 1.000 on Flickr30K in the IqT task. These results suggest that training or fine-tuning MLLMs on large-scale language and image datasets enables the models to capture subtle details and semantic variations in both text and images. This not only improves the models' generalization capabilities but also reduces their dependence on large amounts of annotated data, a significant advantage over traditional models. However, this benefit comes at the cost of substantially more computational resources for training and inference, owing to the large number of parameters in these models.</p>
            </list-item>
          </list>
        </p>
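        <p>The observation about hash code length can be checked with simple arithmetic on the reported scores. The Python sketch below uses the IqT scores on MS COCO copied from Table <xref rid="T3" ref-type="table">3</xref>; the variable and function names are hypothetical, and the rounding to three decimals is only there to match the precision of the table.</p>
        <p>
          <code language="python"><![CDATA[
```python
# IqT scores on MS COCO copied from Table 3: (16 bits, 32 bits, 64 bits).
ms_coco_iqt = {
    "DCMH": (0.505, 0.536, 0.557),
    "SSAH": (0.632, 0.669, 0.668),
    "DCHUC": (0.513, 0.550, 0.558),
    "HMAH": (0.691, 0.732, 0.763),
    "DSPH": (0.793, 0.815, 0.833),
}


def code_length_gains(scores):
    """Return the score gain from 16 to 32 bits and from 32 to 64 bits."""
    s16, s32, s64 = scores
    return round(s32 - s16, 3), round(s64 - s32, 3)


gains = {name: code_length_gains(scores) for name, scores in ms_coco_iqt.items()}

for name, (gain_16_32, gain_32_64) in gains.items():
    print(f"{name}: 16->32 bits {gain_16_32:+.3f}, 32->64 bits {gain_32_64:+.3f}")
```
]]></code>
        </p>
        <p>For these five rows, the gain from 32 to 64 bits is smaller than the gain from 16 to 32 bits in every case, and for SSAH the score decreases slightly at 64 bits. Other columns of the table do not all follow this pattern, which is why the observation is stated for most methods rather than all.</p>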
      </sec>
    </sec>
    <sec id="S5">
      <label>5.</label>
      <title>Conclusion and Future Works</title>
      <p id="S5.p1">This survey has comprehensively reviewed the field of cross-modal image-text retrieval, categorizing existing methods and highlighting their strengths and limitations. Current cross-modal retrieval methods can be broadly classified into single-tower, dual-tower, real-value representation, and binary representation models.</p>
      <p id="S5.p2"><italic>1) Summary of Existing Methods.</italic> Single-tower models integrate modalities early, learning joint representations that capture complex interactions. Their unified architecture, however, may struggle with scalability and efficient fusion of different data types. Dual-tower models process each modality separately through specialized architectures, which improves scalability and allows tailored processing. Yet they face challenges in ensuring compatibility between separately learned representations. Real-value representation models encode data into continuous, high-dimensional vectors, effectively capturing detailed and complex relationships. Despite their accuracy, they are computationally intensive and costly in terms of storage, making them less suitable for large-scale applications. Binary representation models use compact, fixed-length binary codes for data encoding, offering efficient storage and fast retrieval. These models are well suited to large-scale databases but often trade off some accuracy and require sophisticated techniques to learn effective binary codes.</p>
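      <p>To make the storage and speed argument for binary codes concrete, the following Python sketch ranks items by Hamming distance and compares per-item storage. The 8-bit codes, the item names, and the 512-dimensional float32 embedding size are hypothetical values chosen for illustration and are not taken from any surveyed method.</p>
      <p>
        <code language="python"><![CDATA[
```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as integers."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: int, database: dict) -> list:
    """Return database keys sorted by increasing Hamming distance to the query."""
    return sorted(database, key=lambda key: hamming_distance(query, database[key]))


# Hypothetical 8-bit hash codes for three images and one text query.
image_codes = {"img_a": 0b10110010, "img_b": 0b10110011, "img_c": 0b01001101}
text_query = 0b10110011

ranking = rank_by_hamming(text_query, image_codes)
print(ranking)

# Storage per item, assuming a 64-bit hash code versus a
# 512-dimensional float32 real-valued vector (both sizes are assumptions).
bytes_binary = 64 // 8
bytes_real = 512 * 4
print(bytes_binary, bytes_real, bytes_real // bytes_binary)
```
]]></code>
      </p>
      <p>Under these assumptions, a 64-bit code occupies 8 bytes while the real-valued vector occupies 2048 bytes, a 256-fold difference per item, and comparing two codes needs only an XOR and a bit count.</p>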
      <p id="S5.p3"><italic>2) Advantages and Problems.</italic> Advantages: single-tower models are effective in capturing intricate interactions between modalities; dual-tower models are highly scalable and adaptable to specialized processing needs; real-value representation models represent complex relationships with high accuracy; and binary representation models are efficient in storage and fast in retrieval, which makes them suitable for large datasets.</p>
      <p id="S5.p4">Problems: single-tower models face scalability issues and challenges in modality fusion; dual-tower models have difficulty ensuring the compatibility of learned representations; real-value representation models incur high computational and storage costs; and binary representation models may lose accuracy and make it complex to learn effective binary codes.</p>
      <p id="S5.p5"><italic>3) Future Directions.</italic> To advance the field of cross-modal retrieval, future research should focus on several key areas:
        <list list-type="order">
          <list-item>
            <p>Improving Model Compatibility and Fusion: Developing hybrid models that leverage the strengths of both single-tower and dual-tower architectures to enhance compatibility and fusion efficiency.</p>
          </list-item>
          <list-item>
            <p>Enhancing Computational Efficiency: Designing novel methods that reduce the computational and storage demands of real-value representation models without compromising accuracy.</p>
          </list-item>
          <list-item>
            <p>Advanced Binary Coding Techniques: Devising more sophisticated binary coding methods that balance accuracy and efficiency, making them viable for large-scale applications.</p>
          </list-item>
          <list-item>
            <p>Leveraging Multimodal Large Language Models (MLLMs): Further exploring the potential of MLLMs in cross-modal retrieval tasks, particularly for improving semantic understanding and retrieval accuracy.</p>
          </list-item>
          <list-item>
            <p>Comprehensive Benchmarking: Establishing more robust benchmarking frameworks that include diverse datasets and comprehensive evaluation metrics to better assess model performance.</p>
          </list-item>
          <list-item>
            <p>Addressing Scalability and Real-world Applications: Developing scalable solutions that can handle real-world data complexities and large-scale multimodal databases, ensuring the practical applicability of cross-modal retrieval systems.</p>
          </list-item>
        </list>
      </p>
      <p id="S5.p6">By addressing these challenges and focusing on these future directions, the field of cross-modal image-text retrieval can achieve more robust, efficient, and accurate systems, enhancing the practical utility of these technologies in various real-world applications.</p>
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgments</title>
      <p id="ack.p1">This work was supported in part by the National Natural Science Foundation of China under Grant U22A2025, Grant 62072088, Grant 62232007, Grant U23A20309, and Grant 61991404; in part by the Liaoning Provincial Science and Technology Plan Project - Key R&amp;D Department of Science and Technology under Grant 2023JH2/101300182; and in part by the 111 Project under Grant B16009.</p>
    </ack>
    <sec id="sec0100" sec-type="COI-statement">
      <title>Conflict of interest</title>
      <p>The authors declare no conflicts of interest.</p>
    </sec>
    <ref-list>
      <title>References</title>
      <ref id="ref001">
        <label>[1]</label>
        <mixed-citation> Li, J., Li, D., Savarese, S., &amp; Hoi, S. (2023, July). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In <italic>International conference on machine learning</italic> (pp. 19730-19742). PMLR. [<uri>https://proceedings.mlr.press/v202/li23q.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref002">
        <label>[2]</label>
        <mixed-citation> Zhang, P., Dong, X., Wang, B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., … &amp; Wang, J. (2023). InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition. <italic>arXiv preprint</italic> arXiv:2309.15112. [<uri>https://doi.org/10.48550/arXiv.2309.15112</uri>] </mixed-citation>
      </ref>
      <ref id="ref003">
        <label>[3]</label>
        <mixed-citation> Zhu, H., Huang, J. H., Rudinac, S., &amp; Kanoulas, E. (2024). Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models. <italic>arXiv preprint</italic> arXiv:2404.18746. [<uri>https://doi.org/10.48550/arXiv.2404.18746</uri>] </mixed-citation>
      </ref>
      <ref id="ref004">
        <label>[4]</label>
        <mixed-citation> Li, Y., Wang, W., Qu, L., Nie, L., Li, W., &amp; Chua, T. S. (2024). Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond. <italic>arXiv preprint</italic> arXiv:2402.10805. [<uri>https://doi.org/10.48550/arXiv.2402.10805</uri>] </mixed-citation>
      </ref>
      <ref id="ref005">
        <label>[5]</label>
        <mixed-citation> Levy, M., Ben-Ari, R., Darshan, N., &amp; Lischinski, D. (2024). Chatting makes perfect: Chat-based image retrieval. <italic>Advances in Neural Information Processing Systems</italic>, 36. [<uri>https://proceedings.neurips.cc/paper_files/paper/2023/hash/c1b3d1e2cf53bb28cabd801bd58b3521-Abstract-Conference.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref006">
        <label>[6]</label>
        <mixed-citation> Karthik, S., Roth, K., Mancini, M., &amp; Akata, Z. (2023). Vision-by-language for training-free compositional image retrieval. <italic>arXiv preprint</italic> arXiv:2310.09291. [<uri>https://doi.org/10.48550/arXiv.2310.09291</uri>] </mixed-citation>
      </ref>
      <ref id="ref007">
        <label>[7]</label>
        <mixed-citation> Gemini Team, Anil, R., Borgeaud, S., Wu, Y., Alayrac, J. B., Yu, J., … &amp; Ahn, J. (2023). Gemini: a family of highly capable multimodal models. <italic>arXiv preprint</italic> arXiv:2312.11805. [<uri>https://doi.org/10.48550/arXiv.2312.11805</uri>] </mixed-citation>
      </ref>
      <ref id="ref008">
        <label>[8]</label>
        <mixed-citation> Kim, W., Son, B., &amp; Kim, I. (2021, July). ViLT: Vision-and-language transformer without convolution or region supervision. In <italic>International conference on machine learning</italic> (pp. 5583-5594). PMLR. [<uri>https://proceedings.mlr.press/v139/kim21k.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref009">
        <label>[9]</label>
        <mixed-citation> Li, G., Duan, N., Fang, Y., Gong, M., &amp; Jiang, D. (2020, April). Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In <italic>Proceedings of the AAAI conference on artificial intelligence</italic> (Vol. 34, No. 07, pp. 11336-11344). [<uri>https://doi.org/10.1609/aaai.v34i07.6795</uri>] </mixed-citation>
      </ref>
      <ref id="ref010">
        <label>[10]</label>
        <mixed-citation> Li, L. H., Yatskar, M., Yin, D., Hsieh, C. J., &amp; Chang, K. W. (2019). VisualBERT: A simple and performant baseline for vision and language. <italic>arXiv preprint</italic> arXiv:1908.03557. [<uri>https://doi.org/10.48550/arXiv.1908.03557</uri>] </mixed-citation>
      </ref>
      <ref id="ref011">
        <label>[11]</label>
        <mixed-citation> Lu, J., Batra, D., Parikh, D., &amp; Lee, S. (2019). ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. <italic>Advances in Neural Information Processing Systems</italic>, 32. [<uri>https://proceedings.neurips.cc/paper_files/paper/2019/hash/c74d97b01eae257e44aa9d5bade97baf-Abstract.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref012">
        <label>[12]</label>
        <mixed-citation> Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … &amp; Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In <italic>International conference on machine learning</italic> (pp. 8748-8763). PMLR. [<uri>http://proceedings.mlr.press/v139/radford21a</uri>] </mixed-citation>
      </ref>
      <ref id="ref013">
        <label>[13]</label>
        <mixed-citation> Kim, D., Kim, N., &amp; Kwak, S. (2023). Improving cross-modal retrieval with set of diverse embeddings. In <italic>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic> (pp. 23422-23431). [<uri>http://openaccess.thecvf.com/content/CVPR2023/html/Kim_Improving_Cross-Modal_Retrieval_With_Set_of_Diverse_Embeddings_CVPR_2023_paper.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref014">
        <label>[14]</label>
        <mixed-citation> Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., … &amp; Duerig, T. (2021, July). Scaling up visual and vision-language representation learning with noisy text supervision. In <italic>International conference on machine learning</italic> (pp. 4904-4916). PMLR. [<uri>https://proceedings.mlr.press/v139/jia21b.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref015">
        <label>[15]</label>
        <mixed-citation> Wang, B., Yang, Y., Xu, X., Hanjalic, A., &amp; Shen, H. T. (2017, October). Adversarial cross-modal retrieval. In <italic>Proceedings of the 25th ACM international conference on Multimedia</italic> (pp. 154-162). [<uri>https://doi.org/10.1145/3123266.3123326</uri>] </mixed-citation>
      </ref>
      <ref id="ref016">
        <label>[16]</label>
        <mixed-citation> Zhen, L., Hu, P., Wang, X., &amp; Peng, D. (2019). Deep supervised cross-modal retrieval. In <italic>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic> (pp. 10394-10403). [<uri>http://openaccess.thecvf.com/content_CVPR_2019/html/Zhen_Deep_Supervised_Cross-Modal_Retrieval_CVPR_2019_paper.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref017">
        <label>[17]</label>
        <mixed-citation> Tang, X., Wang, Y., Ma, J., Zhang, X., Liu, F., &amp; Jiao, L. (2023). Interacting-Enhancing Feature Transformer for Cross-modal Remote Sensing Image and Text Retrieval. <italic>IEEE Transactions on Geoscience and Remote Sensing</italic>. [<uri>https://doi.org/10.1109/TGRS.2023.3280546</uri>] </mixed-citation>
      </ref>
      <ref id="ref018">
        <label>[18]</label>
        <mixed-citation> Lu, H., Fei, N., Huo, Y., Gao, Y., Lu, Z., &amp; Wen, J. R. (2022). COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval. In <italic>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic> (pp. 15692-15701). [<uri>http://openaccess.thecvf.com/content/CVPR2022/html/Lu_COTS_Collaborative_Two-Stream_Vision-Language_Pre-Training_Model_for_Cross-Modal_Retrieval_CVPR_2022_paper.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref019">
        <label>[19]</label>
        <mixed-citation> Xie, C. W., Wu, J., Zheng, Y., Pan, P., &amp; Hua, X. S. (2022, October). Token embeddings alignment for cross-modal retrieval. In <italic>Proceedings of the 30th ACM International Conference on Multimedia</italic> (pp. 4555-4563). [<uri>https://doi.org/10.1145/3503161.3548107</uri>] </mixed-citation>
      </ref>
      <ref id="ref020">
        <label>[20]</label>
        <mixed-citation> Jiang, Q. Y., &amp; Li, W. J. (2017). Deep cross-modal hashing. In <italic>Proceedings of the IEEE conference on computer vision and pattern recognition</italic> (pp. 3232-3240). [<uri>http://openaccess.thecvf.com/content_cvpr_2017/html/Jiang_Deep_Cross-Modal_Hashing_CVPR_2017_paper.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref021">
        <label>[21]</label>
        <mixed-citation> Wu, G., Lin, Z., Han, J., Liu, L., Ding, G., Zhang, B., &amp; Shen, J. (2018, July). Unsupervised Deep Hashing via Binary Latent Factor Models for Large-scale Cross-modal Retrieval. In <italic>IJCAI</italic> (Vol. 1, No. 3, p. 5). [<uri>https://www.ijcai.org/Proceedings/2018/0396.pdf</uri>] </mixed-citation>
      </ref>
      <ref id="ref022">
        <label>[22]</label>
        <mixed-citation> Li, C., Deng, C., Li, N., Liu, W., Gao, X., &amp; Tao, D. (2018). Self-supervised adversarial hashing networks for cross-modal retrieval. In <italic>Proceedings of the IEEE conference on computer vision and pattern recognition</italic> (pp. 4242-4251). [<uri>http://openaccess.thecvf.com/content_cvpr_2018/html/Li_Self-Supervised_Adversarial_Hashing_CVPR_2018_paper.html</uri>] </mixed-citation>
      </ref>
      <ref id="ref023">
        <label>[23]</label>
        <mixed-citation> Li, T., Yang, X., Wang, B., Xi, C., Zheng, H., &amp; Zhou, X. (2022, June). Bi-CMR: bidirectional reinforcement guided hashing for effective cross-modal retrieval. In <italic>Proceedings of the AAAI Conference on Artificial Intelligence</italic> (Vol. 36, No. 9, pp. 10275-10282). [<uri>https://doi.org/10.1609/aaai.v36i9.21268</uri>] </mixed-citation>
      </ref>
      <ref id="ref024">
        <label>[24]</label>
        <mixed-citation> Tu, J., Liu, X., Lin, Z., Hong, R., &amp; Wang, M. (2022, October). Differentiable cross-modal hashing via multimodal transformers. In <italic>Proceedings of the 30th ACM International Conference on Multimedia</italic> (pp. 453-461). [<uri>https://doi.org/10.1145/3503161.3548187</uri>] </mixed-citation>
      </ref>
      <ref id="ref025">
        <label>[25]</label>
        <mixed-citation> Rasiwasia, N., Costa Pereira, J., Coviello, E., Doyle, G., Lanckriet, G. R., Levy, R., &amp; Vasconcelos, N. (2010, October). A new approach to cross-modal multimedia retrieval. In <italic>Proceedings of the 18th ACM international conference on Multimedia</italic> (pp. 251-260). [<uri>https://doi.org/10.1145/1873951.1873987</uri>] </mixed-citation>
      </ref>
      <ref id="ref026">
        <label>[26]</label>
        <mixed-citation> Chua, T. S., Tang, J., Hong, R., Li, H., Luo, Z., &amp; Zheng, Y. (2009, July). NUS-WIDE: a real-world web image database from National University of Singapore. In <italic>Proceedings of the ACM international conference on image and video retrieval</italic> (pp. 1-9). [<uri>https://doi.org/10.1145/1646396.1646452</uri>] </mixed-citation>
      </ref>
      <ref id="ref027">
        <label>[27]</label>
        <mixed-citation> Escalante, H. J., Hernández, C. A., Gonzalez, J. A., López-López, A., Montes, M., Morales, E. F., … &amp; Grubinger, M. (2010). The segmented and annotated IAPR TC-12 benchmark. <italic>Computer Vision and Image Understanding</italic>, 114(4), 419-428. [<uri>https://doi.org/10.1016/j.cviu.2009.03.008</uri>] </mixed-citation>
      </ref>
      <ref id="ref028">
        <label>[28]</label>
        <mixed-citation> Huiskes, M. J., &amp; Lew, M. S. (2008, October). The MIR Flickr retrieval evaluation. In <italic>Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval</italic> (pp. 39-43). [<uri>https://doi.org/10.1145/1460096.1460104</uri>] </mixed-citation>
      </ref>
      <ref id="ref029">
        <label>[29]</label>
        <mixed-citation> Rashtchian, C., Young, P., Hodosh, M., &amp; Hockenmaier, J. (2010, June). Collecting image annotations using Amazon's Mechanical Turk. In <italic>Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk</italic> (pp. 139-147). [<uri>https://aclanthology.org/W10-0721.pdf</uri>] </mixed-citation>
      </ref>
      <ref id="ref030">
        <label>[30]</label>
        <mixed-citation> Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., … &amp; Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In <italic>Computer Vision–ECCV 2014: 13th European Conference</italic>, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V (pp. 740-755). Springer International Publishing. [<uri>https://link.springer.com/chapter/10.1007/978-3-319-10602-1_48</uri>] </mixed-citation>
      </ref>
      <ref id="ref031">
        <label>[31]</label>
        <mixed-citation> Tan, W., Zhu, L., Li, J., Zhang, H., &amp; Han, J. (2022). Teacher-student learning: Efficient hierarchical message aggregation hashing for cross-modal retrieval. <italic>IEEE Transactions on Multimedia</italic>. [<uri>https://doi.org/10.1109/TMM.2022.3177901</uri>] </mixed-citation>
      </ref>
      <ref id="ref032">
        <label>[32]</label>
        <mixed-citation> Peng, Y., &amp; Qi, J. (2019). CM-GANs: Cross-modal generative adversarial networks for common representation learning. <italic>ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)</italic>, 15(1), 1-24. [<uri>https://doi.org/10.1145/3284750</uri>] </mixed-citation>
      </ref>
      <ref id="ref033">
        <label>[33]</label>
        <mixed-citation> Dong, X., Liu, L., Zhu, L., Nie, L., &amp; Zhang, H. (2021). Adversarial graph convolutional network for cross-modal retrieval. <italic>IEEE Transactions on Circuits and Systems for Video Technology</italic>, 32(3), 1634-1645. [<uri>https://doi.org/10.1109/TCSVT.2021.3075242</uri>] </mixed-citation>
      </ref>
      <ref id="ref034">
        <label>[34]</label>
        <mixed-citation> Zeng, Z., &amp; Mao, W. (2022). A comprehensive empirical study of vision-language pre-trained model for supervised cross-modal retrieval. <italic>arXiv preprint</italic> arXiv:2201.02772. [<uri>https://doi.org/10.48550/arXiv.2201.02772</uri>] </mixed-citation>
      </ref>
      <ref id="ref035">
        <label>[35]</label>
        <mixed-citation> Tu, R. C., Mao, X. L., Ma, B., Hu, Y., Yan, T., Wei, W., &amp; Huang, H. (2020). Deep cross-modal hashing with hashing functions and unified hash codes jointly learning. <italic>IEEE Transactions on Knowledge and Data Engineering</italic>, 34(2), 560-572. [<uri>https://doi.org/10.1109/TKDE.2020.2987312</uri>] </mixed-citation>
      </ref>
      <ref id="ref036">
        <label>[36]</label>
        <mixed-citation> Huo, Y., Qin, Q., Dai, J., Wang, L., Zhang, W., Huang, L., &amp; Wang, C. (2023). Deep semantic-aware proxy hashing for multi-label cross-modal retrieval. <italic>IEEE Transactions on Circuits and Systems for Video Technology</italic>. [<uri>https://doi.org/10.1109/TCSVT.2023.3285266</uri>] </mixed-citation>
      </ref>
      <ref id="ref037">
        <label>[37]</label>
        <mixed-citation> Zhu, L., Wang, T., Li, F., Li, J., Zhang, Z., &amp; Shen, H. T. (2023). Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. <italic>arXiv preprint</italic> arXiv:2308.14263. [<uri>https://doi.org/10.48550/arXiv.2308.14263</uri>] </mixed-citation>
      </ref>
      <ref id="ref038">
        <label>[38]</label>
        <mixed-citation> Zhou, K., Hassan, F. H., &amp; Hoon, G. K. (2023). The State of the Art for Cross-Modal Retrieval: A Survey. <italic>IEEE Access</italic>. [<uri>https://doi.org/10.1109/ACCESS.2023.3338548</uri>] </mixed-citation>
      </ref>
      <ref id="ref039">
        <label>[39]</label>
        <mixed-citation> Wang, X., Li, L., Li, Z., Wang, X., Zhu, X., Wang, C., … &amp; Xiao, Y. (2023, February). AGREE: aligning cross-modal entities for image-text retrieval upon vision-language pre-trained models. In <italic>Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining</italic> (pp. 456-464). [<uri>https://doi.org/10.1145/3539597.3570481</uri>] </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>
