<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with MathML3 v1.1d2 20140930//EN" "JATS-journalpublishing1-mathml3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="1.1d2" xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="nlm-ta">CJIF</journal-id>
      <journal-id journal-id-type="publisher-id">ICCK</journal-id>
      <journal-title-group>
        <journal-title>Chinese Journal of Information Fusion</journal-title>
      </journal-title-group>
      <issn pub-type="ppub" publication-format="print">2998-3363</issn>
      <issn pub-type="epub" publication-format="electronic">2998-3371</issn>
      <publisher>
        <publisher-name>Institute of Central Computation and Knowledge Inc</publisher-name>
        <publisher-loc>522 W RIVERSIDE AVE STE N, SPOKANE, WA, 99201, UNITED STATES</publisher-loc>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.62762/CJIF.2024.655617</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Research Article</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>DMFuse: Diffusion Model Guided Cross-Attention Learning for Infrared and Visible Image Fusion</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0009-0008-8429-1672</contrib-id>
          <name>
            <surname>Qi</surname>
            <given-names>Wuqiang</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Zhang</surname>
            <given-names>Zhuoqun</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-0763-5901</contrib-id>
          <name>
            <surname>Wang</surname>
            <given-names>Zhishe</given-names>
          </name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1"><label>1</label>School of Applied Science, Taiyuan University of Science and Technology, Taiyuan 030024, China</aff>
      </contrib-group>
      <author-notes>
        <corresp id="cor3">Corresponding Author: Zhishe Wang. Email: <email>wangzs@tyust.edu.cn</email></corresp>
      </author-notes>
      <pub-date date-type="pub" pub-type="epub" publication-format="online">
        <day>31</day>
        <month>12</month>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <issue>3</issue>
      <fpage>226</fpage>
      <lpage>242</lpage>
      <history>
        <date date-type="received">
          <day>24</day>
          <month>8</month>
          <year>2024</year>
        </date>
        <date date-type="accepted">
          <day>28</day>
          <month>12</month>
          <year>2024</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>© 2024 by the Authors. Published by Institute of Central Computation and Knowledge. This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/).</copyright-statement>
        <copyright-year>2024</copyright-year>
        <copyright-holder>The Authors</copyright-holder>
        <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
        </license>
      </permissions>
      <self-uri xlink:href="https://www.icck.org/article/abs/cjif.2024.655617">This article is available from https://www.icck.org/article/abs/cjif.2024.655617</self-uri>
      <abstract>
        <p>Image fusion aims to integrate complementary information from different sensors into a single fused output for superior visual description and scene understanding. Existing GAN-based fusion methods generally suffer from several problems, such as an unexplainable mechanism, unstable training, and mode collapse, which may degrade the fusion quality. To overcome these limitations, this paper introduces a diffusion model guided cross-attention learning network, termed DMFuse, for infrared and visible image fusion. First, to improve the diffusion inference efficiency, we compress the quadruple channels of the denoising UNet network to obtain a more efficient and robust model for fusion tasks. We then employ the pre-trained diffusion model as an autoencoder and incorporate its strong generative priors to train the subsequent fusion network. This design allows the generated diffusion features to exhibit a high-quality distribution mapping ability. In addition, we devise a cross-attention interactive fusion module to establish long-range dependencies from local diffusion features. This module integrates global interactions to enhance the complementary characteristics of different modalities. Finally, we propose a multi-level decoder network to reconstruct the fused output. Extensive experiments on fusion tasks and downstream applications, including object detection and semantic segmentation, indicate that the proposed model yields promising performance while maintaining competitive computational efficiency. The code and data are available at <ext-link xlink:href="https://github.com/Zhishe-Wang/DMFuse">https://github.com/Zhishe-Wang/DMFuse</ext-link>.</p>
      </abstract>
      <kwd-group kwd-group-type="author" xml:lang="en">
        <kwd>image fusion</kwd>
        <kwd>diffusion model</kwd>
        <kwd>feature interaction</kwd>
        <kwd>attention mechanism</kwd>
        <kwd>deep generative model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="S1">
      <label>1.</label>
      <title>Introduction</title>
      <p id="S1.p1">Infrared sensors detect hidden target characteristics through thermal radiation and work under various weather and lighting conditions. However, the acquired images often exhibit low contrast and lack fine details. In contrast, visible sensors offer high-resolution scene perception through light reflection imaging. However, under adverse weather or camouflage conditions, visible sensors struggle to distinguish salient targets from the background environment. Image fusion technology integrates the complementary information from different sensors into a single image, which achieves superior visual description and scene understanding. A common application of fused images is to provide faster and more accurate visual interpretation for both human observers and computer systems. In addition, this technology has been extended to other visual tasks, such as person re-identification [<xref rid="ref001" ref-type="bibr">1</xref>], object detection [<xref rid="ref002" ref-type="bibr">2</xref>], and RGBT tracking [<xref rid="ref003" ref-type="bibr">3</xref>].</p>
      <p id="S1.p2">Over the past decades, traditional algorithms, including multi-scale transformation [<xref rid="ref004" ref-type="bibr">4</xref>], sparse representation [<xref rid="ref005" ref-type="bibr">5</xref>], subspace decomposition [<xref rid="ref006" ref-type="bibr">6</xref>], optimization model [<xref rid="ref007" ref-type="bibr">7</xref>], hybrid-based [<xref rid="ref008" ref-type="bibr">8</xref>], and other methods [<xref rid="ref009" ref-type="bibr">9</xref>], have been proposed for infrared and visible image fusion. Although these methods have made great progress and can fulfill the requirements of most scenarios, they still exhibit certain limitations. On the one hand, these methods usually apply the same mathematical model to extract image features indiscriminately, and rarely consider the inherent distinctiveness of different modality images, which limits the improvement of fusion performance. On the other hand, the fusion rules or activity level measurements need to be manually designed. This strategy potentially compromises the objectivity and reliability of the fused output, making it unsuitable for some complicated scenarios and subsequent decision-making applications.</p>
      <p>
        <fig id="F1">
          <label>Figure 1.</label>
          <caption>
            <p>The comparative schematic diagram of the proposed model with U2Fusion [<xref rid="ref012" ref-type="bibr">12</xref>], YDTR [<xref rid="ref015" ref-type="bibr">15</xref>] and DDFM [<xref rid="ref019" ref-type="bibr">19</xref>].</p>
          </caption>
          <graphic xlink:href="Fig1.jpg"/>
        </fig>
      </p>
      <p id="S1.p3">In recent years, deep neural networks have experienced rapid adoption in the field of image fusion. Generally, the mainstream deep learning-based models include autoencoder (AE)-based [<xref rid="ref010" ref-type="bibr">10</xref>], [<xref rid="ref011" ref-type="bibr">11</xref>], convolutional neural network (CNN)-based [<xref rid="ref012" ref-type="bibr">12</xref>], [<xref rid="ref013" ref-type="bibr">13</xref>], Transformer-based [<xref rid="ref014" ref-type="bibr">14</xref>], [<xref rid="ref015" ref-type="bibr">15</xref>], and generative adversarial network (GAN)-based [<xref rid="ref016" ref-type="bibr">16</xref>], [<xref rid="ref017" ref-type="bibr">17</xref>] methods. AE-based methods employ the encoder-decoder framework to extract and reconstruct features, and design a fusion layer to integrate their respective features. Nevertheless, the fusion strategies are still hand-crafted. CNN-based methods usually concatenate source images in the input stage as an image-level framework or integrate features in the fusion stage to form a feature-level framework. Unlike CNN-based methods, Transformer-based methods employ a self-attention mechanism to model long-range dependencies, and achieve state-of-the-art (SOTA) performance. However, the above methods are non-generative fusion schemes, which cannot take advantage of strong generative ability. Treating image fusion as a generative task, GAN-based methods employ adversarial training to constrain the fused output to follow the same distribution as the source images. Nevertheless, the tradeoff between generator and discriminator is difficult to balance during training, which presents a challenge for achieving controlled generation. Moreover, the unexplainable mechanism and mode collapse of GANs seriously affect the fusion quality.</p>
      <p id="S1.p4">Recently, denoising diffusion probabilistic models (DDPM) [<xref rid="ref018" ref-type="bibr">18</xref>] have demonstrated remarkable advances in generating promising synthetic samples. Unlike the existing GAN-based methods, the generation process of DDPM is interpretable as it relies on denoising principles, which can effectively achieve controllable high-quality and high-fidelity generation. Furthermore, DDPM does not require discriminative constraints, thereby avoiding the common issues of unstable training and mode collapse often encountered by GANs. For instance, Zhao et al. [<xref rid="ref019" ref-type="bibr">19</xref>] formulated the fusion task as an unconditional generation problem, and integrated the hierarchical Bayesian model in likelihood rectification. Yue et al. [<xref rid="ref020" ref-type="bibr">20</xref>] constructed the multi-channel distribution based on the diffusion model to extract complementary information for high color fidelity fusion tasks. Although these methods achieve impressive fusion performance, some drawbacks still need to be addressed. On the one hand, due to the posterior sampling procedure, their fusion models usually require extensive storage space and long inference times. On the other hand, these methods only leverage the generative capacity of the diffusion model while failing to consider the contextual interactions of multi-modality images, resulting in limited fusion performance.</p>
      <p id="S1.p5">To address these issues, we introduce a simple yet strong fusion baseline, namely a diffusion model guided cross-attention learning network, termed DMFuse. In the first training stage, to alleviate the strain on storage space and the inference process, we directly compress the quadruple channels of the diffusion UNet, and train a robust model using the MS-COCO dataset [<xref rid="ref021" ref-type="bibr">21</xref>]. Because this dataset encompasses diverse object categories, abundant image data, and various visual scenarios, it helps bolster the generalization ability of the diffusion model for fusion tasks, even when model parameters are compressed. In the second training stage, instead of relying on mainstream convolution operations or self-attention mechanisms, we employ the pre-trained diffusion model as an autoencoder to generate the diffusion features, which can seamlessly transfer its high-quality generation ability to the subsequent fusion network. In addition, we develop a cross-attention interactive fusion module to aggregate the diffusion features of infrared and visible images, which can model the global dependencies from local contexts and enhance the complementary characteristics of different modalities. Finally, a multi-level decoder network is proposed to progressively reconstruct the fused output.</p>
      <p id="S1.p6">To demonstrate the superiority of the proposed DMFuse, we compare it with a CNN-based method, U2Fusion [<xref rid="ref012" ref-type="bibr">12</xref>], a Transformer-based method, YDTR [<xref rid="ref015" ref-type="bibr">15</xref>], and a diffusion model-based method, DDFM [<xref rid="ref019" ref-type="bibr">19</xref>]. Figure <xref ref-type="fig" rid="F1">1</xref> illustrates a schematic diagram for comparison. U2Fusion and YDTR are non-generative schemes that focus on modeling local features and local-global dependencies, respectively. Although their fused results preserve visible details well, they fail to retain the infrared target brightness. DDFM formulates the fusion task as unconditional generation and samples a fusion image from the posterior distribution. However, the generated result still exhibits limited preservation of target brightness. In contrast, the proposed model can simultaneously enable rich detail preservation and considerable intensity control. In summary, the main contributions of our work are threefold.</p>
      <p>
        <list list-type="bullet" id="S1.I1">
          <list-item id="S1.I1.i1">
            <p id="S1.I1.i1.p1">We introduce a novel diffusion model guided fusion baseline. The pre-trained diffusion model is employed as an encoder to provide a powerful distribution mapping, thereby grafting its generation ability for fusion tasks.</p>
          </list-item>
          <list-item id="S1.I1.i2">
            <p id="S1.I1.i2.p1">We develop a cross-attention interactive fusion module to model the global dependencies from local diffusion features, thus effectively strengthening and integrating the complementary characteristics of different modalities.</p>
          </list-item>
          <list-item id="S1.I1.i3">
            <p id="S1.I1.i3.p1">We train a more efficient and robust diffusion model with different strategies. Extensive experiments demonstrate that DMFuse achieves SOTA fusion performance as well as competitive operational efficiency.</p>
          </list-item>
        </list>
      </p>
      <p id="S1.p8">The rest of this paper is organized as follows. Section 2 discusses the non-generative and generative fusion schemes. Section 3 elaborates the framework of the proposed model. Section 4 and Section 5 present the experimental comparisons and conclusions, respectively.</p>
    </sec>
    <sec id="S2">
      <label>2.</label>
      <title>Related Work</title>
      <p id="S2.p1">This section provides an overview of the work most closely related to the proposed method. From a generative standpoint, we can roughly categorize the existing works into non-generative and generative fusion schemes.</p>
      <sec id="S2.SS1">
        <label>2.1</label>
        <title>Non-Generative Fusion Scheme</title>
        <p id="S2.SS1.p1">AE-based methods generally follow the traditional framework, and employ a pre-trained encoder-decoder network to extract and reconstruct features. For example, Li et al. developed DenseFuse [<xref rid="ref010" ref-type="bibr">10</xref>] and NestFuse [<xref rid="ref011" ref-type="bibr">11</xref>], in which dense blocks and nest connections are introduced to enhance feature representation. Zhao et al. [<xref rid="ref022" ref-type="bibr">22</xref>] presented AUIF, in which the traditional optimization model was mapped to a trainable neural network through algorithm unrolling. To improve fusion performance, Jian et al. elaborated SEDRFuse [<xref rid="ref023" ref-type="bibr">23</xref>] and DDNSA [<xref rid="ref024" ref-type="bibr">24</xref>], in which attention-based fusion strategies are employed to better strengthen the complementary characteristics of different modalities. However, these methods require manually designed fusion strategies, which restricts their practical applications.</p>
        <p id="S2.SS1.p2">CNN-based methods usually propose image-level or feature-level frameworks to implement unsupervised learning. Typically, Xu et al. [<xref rid="ref012" ref-type="bibr">12</xref>] introduced U2Fusion, which concatenated source images as an input, and employed a pre-trained VGG-16 network to measure information preservation degree for supervising the similarity constraint. Li et al. [<xref rid="ref013" ref-type="bibr">13</xref>] elaborated RFN-Nest, which proposed a two-stage training strategy to train the encoder-decoder framework and fusion network, respectively. They also presented LRRNet [<xref rid="ref025" ref-type="bibr">25</xref>], which formulated the fusion task as optimized decomposition and network learning problems. An et al. [<xref rid="ref026" ref-type="bibr">26</xref>] introduced MRASFusion, which designed a residual attention fusion module for feature interactions. Chen et al. [<xref rid="ref027" ref-type="bibr">27</xref>] developed IVIFD for a joint fusion and detection task. Zhu et al. [<xref rid="ref028" ref-type="bibr">28</xref>] proposed MGRCFusion, which utilized a multi-scale group residual convolution module to exploit finer deep-level features.</p>
        <p id="S2.SS1.p3">Transformer-based methods mainly depend on the self-attention mechanism to model the global dependencies and maintain long-range context. Pang et al. [<xref rid="ref014" ref-type="bibr">14</xref>] introduced SDTFusion, which employed dense Transformer blocks to extract the global features. Tang et al. presented YDTR [<xref rid="ref015" ref-type="bibr">15</xref>] and DATFuse [<xref rid="ref029" ref-type="bibr">29</xref>], which proposed a serial CNN-Transformer architecture to aggregate local and global features. Ma et al. [<xref rid="ref030" ref-type="bibr">30</xref>] elaborated SwinFusion, which designed self-attention and cross-attention units to integrate intra- and inter-domain interactions. Tang et al. [<xref rid="ref031" ref-type="bibr">31</xref>] developed a multi-branch network based on CNN and Transformer to extract the local and global information for multi-modality fusion. In addition, Liu et al. [<xref rid="ref032" ref-type="bibr">32</xref>] introduced SegMiF, which proposed a multi-interactive framework for the joint tasks of fusion and segmentation.</p>
        <p id="S2.SS1.p4">The aforementioned methods tend to design efficient network structures [<xref rid="ref010" ref-type="bibr">10</xref>], [<xref rid="ref011" ref-type="bibr">11</xref>], [<xref rid="ref026" ref-type="bibr">26</xref>], [<xref rid="ref028" ref-type="bibr">28</xref>], novel fusion rules [<xref rid="ref023" ref-type="bibr">23</xref>], [<xref rid="ref024" ref-type="bibr">24</xref>], different training strategies [<xref rid="ref013" ref-type="bibr">13</xref>], [<xref rid="ref022" ref-type="bibr">22</xref>], [<xref rid="ref025" ref-type="bibr">25</xref>], [<xref rid="ref027" ref-type="bibr">27</xref>], long-range modeling [<xref rid="ref014" ref-type="bibr">14</xref>], [<xref rid="ref015" ref-type="bibr">15</xref>], [<xref rid="ref030" ref-type="bibr">30</xref>], [<xref rid="ref031" ref-type="bibr">31</xref>], and multi-task learning [<xref rid="ref012" ref-type="bibr">12</xref>], [<xref rid="ref032" ref-type="bibr">32</xref>]. Their core idea is to employ convolutional or self-attention operations to discriminatively model local, global, or joint features. However, because ground truth is unavailable and these methods are non-generative fusion schemes, they leave the potential of generative models largely unexplored, which limits further improvement of the fusion performance.</p>
      </sec>
      <sec id="S2.SS2">
        <label>2.2</label>
        <title>Generative Fusion Scheme</title>
        <p id="S2.SS2.p1">GAN-based methods generally apply adversarial training to generate a fused image that follows the same distribution as the source images. Ma et al. [<xref rid="ref016" ref-type="bibr">16</xref>] first devised FusionGAN, which employed a generator to obtain the fused image, and used a discriminator to determine whether the fused output has a similar distribution to the source images. Meanwhile, they also introduced TarDAL [<xref rid="ref033" ref-type="bibr">33</xref>], which designed a target-aware dual adversarial learning network for the joint problems of fusion and detection. Wang et al. presented ICAFusion [<xref rid="ref034" ref-type="bibr">34</xref>], CrossFuse [<xref rid="ref035" ref-type="bibr">35</xref>], and FreqGAN [<xref rid="ref036" ref-type="bibr">36</xref>], which introduced attention mechanisms and frequency information to implement feature interaction and iterative optimization. These methods focus on the design of flexible networks, such as generator architecture [<xref rid="ref016" ref-type="bibr">16</xref>], attention mechanism [<xref rid="ref034" ref-type="bibr">34</xref>], [<xref rid="ref035" ref-type="bibr">35</xref>], and multi-task learning [<xref rid="ref033" ref-type="bibr">33</xref>]. However, the GAN-based methods suffer from an unexplainable mechanism, unstable training, and mode collapse, which adversely impact the fusion quality.</p>
        <p id="S2.SS2.p2">Diffusion-based methods formulate fusion tasks as a conditional generation problem within the diffusion sampling framework, which can overcome the common problems of GANs. For example, Yue et al. [<xref rid="ref020" ref-type="bibr">20</xref>] presented Dif-Fusion, which directly introduced the multi-channel data construction into a diffusion process, and achieved a fused output with high color fidelity. Zhao et al. [<xref rid="ref019" ref-type="bibr">19</xref>] devised DDFM, where an unconditional generation module and a conditional likelihood rectification module are designed to deliver favorable results. These methods leverage the generative ability of the diffusion model, but incur significant costs in storage space and inference time, and do not take into account the contextual interactions. Different from them, the proposed model employs a more efficient and robust diffusion model to graft its high-quality generation ability for fusion tasks. Meanwhile, we design a cross-attention interactive fusion module to strengthen the complementary characteristics of different modalities. Therefore, the proposed model achieves superior fusion performance while requiring lower computational cost.</p>
        <p>
          <fig id="F2">
            <label>Figure 2.</label>
            <caption>
              <p>The overall workflow of the proposed model. The diffusion encoder is employed as an autoencoder to extract the diffusion features from different modality images. These features are fed into cross-attention interactive modules (CAIMs) to generate the fusion features. Finally, the fused output is reconstructed by a multi-level decoder network.</p>
            </caption>
            <graphic xlink:href="Fig2.jpg"/>
          </fig>
        </p>
      </sec>
    </sec>
    <sec id="S3">
      <label>3.</label>
      <title>Methodology</title>
      <p id="S3.p1">In this section, we elaborate on the overall workflow of the fusion baseline, including network overview, cross-attention interactive fusion module, and loss function.</p>
      <sec id="S3.SS1">
        <label>3.1</label>
        <title>Network Overview</title>
        <p id="S3.SS1.p1">As depicted in Figure <xref ref-type="fig" rid="F2">2</xref>(a), DMFuse consists of three core components, i.e., pre-trained diffusion model, multi-level decoder, and cross-attention interactive fusion module. Given the input infrared and visible images <inline-formula><mml:math alttext="I_{0}=\{{I_{i}},{I_{v}}\}" display="inline"><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, the forward process of the diffusion model gradually adds Gaussian noise to the input image <inline-formula><mml:math alttext="{I_{0}}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>, and generates noisy image <inline-formula><mml:math alttext="I_{t}=\{{I_{t}^{i}},{I_{t}^{v}}\}" display="inline"><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi>I</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>I</mml:mi><mml:mi>t</mml:mi><mml:mi>v</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> and its distribution <inline-formula><mml:math alttext="P({I_{t}}|{I_{t-1}})" display="inline"><mml:mrow><mml:mi>P</mml:mi><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo fence="false">|</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> at timestep <inline-formula><mml:math 
alttext="t" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula>.</p>
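        <p>For intuition, the forward noising step described above can be sketched as follows. This is only a minimal illustration: it assumes the standard DDPM Gaussian transition [<xref rid="ref018" ref-type="bibr">18</xref>] with a linear variance schedule, and the helper names (<monospace>linear_betas</monospace>, <monospace>forward_step</monospace>, <monospace>forward_process</monospace>), the toy pixel values, and the ten-step schedule are hypothetical rather than taken from the DMFuse implementation.</p>
        <p>
          <preformat preformat-type="code">
```python
import math
import random


def linear_betas(steps, beta_start=1e-4, beta_end=2e-2):
    """Linear variance schedule (assumed; end points follow the common DDPM default)."""
    if steps == 1:
        return [beta_start]
    gap = (beta_end - beta_start) / (steps - 1)
    return [beta_start + gap * k for k in range(steps)]


def forward_step(prev, beta, rng):
    """Sample I_t from P(I_t | I_{t-1}) = N(sqrt(1 - beta) * I_{t-1}, beta * Id)."""
    keep = math.sqrt(1.0 - beta)
    spread = math.sqrt(beta)
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in prev]


def forward_process(image, betas, rng):
    """Return the noisy sequence [I_1, ..., I_T] for a flattened image I_0."""
    states, current = [], list(image)
    for beta in betas:
        current = forward_step(current, beta, rng)
        states.append(current)
    return states


rng = random.Random(0)
infrared = [0.8, 0.1, 0.5, 0.9]  # toy flattened I_i (hypothetical values)
visible = [0.2, 0.7, 0.4, 0.3]   # toy flattened I_v (hypothetical values)
betas = linear_betas(steps=10)
noisy_infrared = forward_process(infrared, betas, rng)  # I_t^i for t = 1..10
noisy_visible = forward_process(visible, betas, rng)    # I_t^v for t = 1..10
```
          </preformat>
        </p>
        <p>In this sketch both modalities pass through the same noising routine, which mirrors the text in that the infrared and visible images are treated jointly as the input of the forward process.</p>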
        <p id="S3.SS1.p2">After that, we employ the diffusion model encoder to extract multi-level diffusion features of the infrared and visible images, denoted as <inline-formula><mml:math alttext="\Phi_{i}^{l}" display="inline"><mml:msubsup><mml:mi mathvariant="normal">Φ</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math alttext="\Phi_{v}^{l}" display="inline"><mml:msubsup><mml:mi mathvariant="normal">Φ</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula>, and feed them into the cross-attention interactive fusion module (CAIM), shown in Figure <xref ref-type="fig" rid="F2">2</xref>(b), to generate the fusion features <inline-formula><mml:math alttext="\Phi_{f}^{l}" display="inline"><mml:msubsup><mml:mi mathvariant="normal">Φ</mml:mi><mml:mi>f</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula>. Finally, a multi-level decoder network reconstructs the final fused output, as formulated in Eq. (<xref rid="S3.E1">1</xref>).</p>
        <p>
          <disp-formula id="S3.E1">
            <mml:math alttext="{I_{f}}=C[\Phi_{f}^{1},U(C[\Phi_{f}^{2},U(C[\Phi_{f}^{3},U(C[\Phi_{f}^{4},U(\Phi_{f}^{5})])])])]" display="block">
              <mml:mrow>
                <mml:msub>
                  <mml:mi>I</mml:mi>
                  <mml:mi>f</mml:mi>
                </mml:msub>
                <mml:mo>=</mml:mo>
                <mml:mrow>
                  <mml:mi>C</mml:mi>
                  <mml:mo>⁢</mml:mo>
                  <mml:mrow>
                    <mml:mo stretchy="false">[</mml:mo>
                    <mml:msubsup>
                      <mml:mi mathvariant="normal">Φ</mml:mi>
                      <mml:mi>f</mml:mi>
                      <mml:mn>1</mml:mn>
                    </mml:msubsup>
                    <mml:mo>,</mml:mo>
                    <mml:mrow>
                      <mml:mi>U</mml:mi>
                      <mml:mo>⁢</mml:mo>
                      <mml:mrow>
                        <mml:mo stretchy="false">(</mml:mo>
                        <mml:mrow>
                          <mml:mi>C</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mrow>
                            <mml:mo stretchy="false">[</mml:mo>
                            <mml:msubsup>
                              <mml:mi mathvariant="normal">Φ</mml:mi>
                              <mml:mi>f</mml:mi>
                              <mml:mn>2</mml:mn>
                            </mml:msubsup>
                            <mml:mo>,</mml:mo>
                            <mml:mrow>
                              <mml:mi>U</mml:mi>
                              <mml:mo>⁢</mml:mo>
                              <mml:mrow>
                                <mml:mo stretchy="false">(</mml:mo>
                                <mml:mrow>
                                  <mml:mi>C</mml:mi>
                                  <mml:mo>⁢</mml:mo>
                                  <mml:mrow>
                                    <mml:mo stretchy="false">[</mml:mo>
                                    <mml:msubsup>
                                      <mml:mi mathvariant="normal">Φ</mml:mi>
                                      <mml:mi>f</mml:mi>
                                      <mml:mn>3</mml:mn>
                                    </mml:msubsup>
                                    <mml:mo>,</mml:mo>
                                    <mml:mrow>
                                      <mml:mi>U</mml:mi>
                                      <mml:mo>⁢</mml:mo>
                                      <mml:mrow>
                                        <mml:mo stretchy="false">(</mml:mo>
                                        <mml:mrow>
                                          <mml:mi>C</mml:mi>
                                          <mml:mo>⁢</mml:mo>
                                          <mml:mrow>
                                            <mml:mo stretchy="false">[</mml:mo>
                                            <mml:msubsup>
                                              <mml:mi mathvariant="normal">Φ</mml:mi>
                                              <mml:mi>f</mml:mi>
                                              <mml:mn>4</mml:mn>
                                            </mml:msubsup>
                                            <mml:mo>,</mml:mo>
                                            <mml:mrow>
                                              <mml:mi>U</mml:mi>
                                              <mml:mo>⁢</mml:mo>
                                              <mml:mrow>
                                                <mml:mo stretchy="false">(</mml:mo>
                                                <mml:msubsup>
                                                  <mml:mi mathvariant="normal">Φ</mml:mi>
                                                  <mml:mi>f</mml:mi>
                                                  <mml:mn>5</mml:mn>
                                                </mml:msubsup>
                                                <mml:mo stretchy="false">)</mml:mo>
                                              </mml:mrow>
                                            </mml:mrow>
                                            <mml:mo stretchy="false">]</mml:mo>
                                          </mml:mrow>
                                        </mml:mrow>
                                        <mml:mo stretchy="false">)</mml:mo>
                                      </mml:mrow>
                                    </mml:mrow>
                                    <mml:mo stretchy="false">]</mml:mo>
                                  </mml:mrow>
                                </mml:mrow>
                                <mml:mo stretchy="false">)</mml:mo>
                              </mml:mrow>
                            </mml:mrow>
                            <mml:mo stretchy="false">]</mml:mo>
                          </mml:mrow>
                        </mml:mrow>
                        <mml:mo stretchy="false">)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo stretchy="false">]</mml:mo>
                  </mml:mrow>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>where <inline-formula><mml:math alttext="C(\cdot)" display="inline"><mml:mrow><mml:mi>C</mml:mi><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo lspace="0em" rspace="0em">⋅</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math alttext="U(\cdot)" display="inline"><mml:mrow><mml:mi>U</mml:mi><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo lspace="0em" rspace="0em">⋅</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denote the convolution and upsampling operations, respectively, and [<inline-formula><mml:math alttext="\cdot" display="inline"><mml:mo>⋅</mml:mo></mml:math></inline-formula>] indicates channel concatenation. Next, we describe the training process of the diffusion model.</p>
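        <p>As a rough illustration of the nested decoding described above, the following NumPy sketch fuses five feature levels from the coarsest to the finest. It assumes that the nesting continues in the same pattern down to the first level, that U(.) is nearest-neighbour 2x upsampling, and that C(.) is a 1x1 convolution with random weights; the channel widths and the names upsample2x, conv1x1, and decode are hypothetical and are not taken from the paper.</p>
        <code language="python"><![CDATA[
```python
import numpy as np

def upsample2x(x):
    """U(.): nearest-neighbour 2x upsampling of a (C, H, W) array (assumed mode)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weight):
    """C(.): 1x1 convolution with weight of shape (C_out, C_in) (assumed kernel size)."""
    return np.einsum("oc,chw->ohw", weight, x)

def decode(features, weights):
    """Apply x = C[features[l], U(x)] from the coarsest level down to the finest."""
    x = features[-1]
    for level in range(len(features) - 2, -1, -1):
        # [., .]: channel concatenation of the skip feature and the upsampled state
        fused = np.concatenate([features[level], upsample2x(x)], axis=0)
        x = conv1x1(fused, weights[level])
    return x

rng = np.random.default_rng(0)
chans = [2, 4, 8, 16, 32]  # hypothetical channel widths for levels 1..5
features = [rng.standard_normal((c, 32 // 2**l, 32 // 2**l)) for l, c in enumerate(chans)]
out_ch = [1] + chans[1:4]  # the last convolution emits a single-channel image
weights = [rng.standard_normal((out_ch[l], chans[l] + chans[l + 1])) for l in range(4)]
fused_image = decode(features, weights)  # shape (1, 32, 32)
```
]]></code>
        <p>In this toy setting, each level halves the spatial resolution, so one upsampling step per level restores the input size at the output.</p>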
      </sec>
      <sec id="S3.SS2">
        <label>3.2</label>
        <title>Diffusion model encoder</title>
        <p id="S3.SS2.p1">The diffusion model performs variational inference on a Markov chain that comprises a forward and a backward process. In the forward process, Gaussian noise is incrementally added to the input image <inline-formula><mml:math alttext="I_{0}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> until it is fully destroyed after <inline-formula><mml:math alttext="T" display="inline"><mml:mi>T</mml:mi></mml:math></inline-formula> timesteps. Using the reparameterization trick, the distribution of the noisy image <inline-formula><mml:math alttext="I_{t}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> at any timestep <inline-formula><mml:math alttext="t" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> can be derived directly from the input image <inline-formula><mml:math alttext="I_{0}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula>, as formulated by Eq.(<xref rid="S3.E2">2</xref>).</p>
        <p>
          <disp-formula id="S3.E2">
            <mml:math alttext="P({I_{t}}|{I_{0}})=\mathcal{N}({I_{t}};\sqrt{{{\overline{\alpha}}_{t}}}{I_{0}}%&#10;,(1-{\overline{\alpha}_{t}})X)" display="block">
              <mml:mrow>
                <mml:mrow>
                  <mml:mi>P</mml:mi>
                  <mml:mo>⁢</mml:mo>
                  <mml:mrow>
                    <mml:mo stretchy="false">(</mml:mo>
                    <mml:mrow>
                      <mml:msub>
                        <mml:mi>I</mml:mi>
                        <mml:mi>t</mml:mi>
                      </mml:msub>
                      <mml:mo fence="false">|</mml:mo>
                      <mml:msub>
                        <mml:mi>I</mml:mi>
                        <mml:mn>0</mml:mn>
                      </mml:msub>
                    </mml:mrow>
                    <mml:mo stretchy="false">)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>=</mml:mo>
                <mml:mrow>
                  <mml:mi class="ltx_font_mathcaligraphic">𝒩</mml:mi>
                  <mml:mo>⁢</mml:mo>
                  <mml:mrow>
                    <mml:mo stretchy="false">(</mml:mo>
                    <mml:msub>
                      <mml:mi>I</mml:mi>
                      <mml:mi>t</mml:mi>
                    </mml:msub>
                    <mml:mo>;</mml:mo>
                    <mml:mrow>
                      <mml:msqrt>
                        <mml:msub>
                          <mml:mover accent="true">
                            <mml:mi>α</mml:mi>
                            <mml:mo>¯</mml:mo>
                          </mml:mover>
                          <mml:mi>t</mml:mi>
                        </mml:msub>
                      </mml:msqrt>
                      <mml:mo>⁢</mml:mo>
                      <mml:msub>
                        <mml:mi>I</mml:mi>
                        <mml:mn>0</mml:mn>
                      </mml:msub>
                    </mml:mrow>
                    <mml:mo>,</mml:mo>
                    <mml:mrow>
                      <mml:mrow>
                        <mml:mo stretchy="false">(</mml:mo>
                        <mml:mrow>
                          <mml:mn>1</mml:mn>
                          <mml:mo>−</mml:mo>
                          <mml:msub>
                            <mml:mover accent="true">
                              <mml:mi>α</mml:mi>
                              <mml:mo>¯</mml:mo>
                            </mml:mover>
                            <mml:mi>t</mml:mi>
                          </mml:msub>
                        </mml:mrow>
                        <mml:mo stretchy="false">)</mml:mo>
                      </mml:mrow>
                      <mml:mo>⁢</mml:mo>
                      <mml:mi>X</mml:mi>
                    </mml:mrow>
                    <mml:mo stretchy="false">)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>where <inline-formula><mml:math alttext="\mathcal{N}" display="inline"><mml:mi class="ltx_font_mathcaligraphic">𝒩</mml:mi></mml:math></inline-formula> is a Gaussian distribution, <inline-formula><mml:math alttext="{\alpha_{t}}" display="inline"><mml:msub><mml:mi>α</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> denotes the variance schedule, and <inline-formula><mml:math alttext="{\overline{\alpha}_{t}}=\prod_{i=1}^{t}{\alpha_{i}}" display="inline"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>α</mml:mi><mml:mo>¯</mml:mo></mml:mover><mml:mi>t</mml:mi></mml:msub><mml:mo rspace="0.111em">=</mml:mo><mml:mrow><mml:msubsup><mml:mo>∏</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:msub><mml:mi>α</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math alttext="t\in[1,T]" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mo>∈</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>. <inline-formula><mml:math alttext="X" display="inline"><mml:mi>X</mml:mi></mml:math></inline-formula> denotes the identity matrix, i.e., the covariance of the standard normal distribution.</p>
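        <p>To make Eq.(2) concrete, the following minimal NumPy sketch draws a noisy image at an arbitrary timestep in a single step. The linear schedule beta_t with alpha_t = 1 - beta_t, the value T = 1000, the toy image size, and the function names make_alpha_bar and q_sample are illustrative assumptions that are not specified in the text.</p>
        <code language="python"><![CDATA[
```python
import numpy as np

def make_alpha_bar(T, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of alpha_t = 1 - beta_t for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(i0, t, alpha_bar, rng):
    """Draw I_t from N(sqrt(abar_t) * I_0, (1 - abar_t) * identity); t is 1-based."""
    eps = rng.standard_normal(i0.shape)
    abar_t = alpha_bar[t - 1]
    i_t = np.sqrt(abar_t) * i0 + np.sqrt(1.0 - abar_t) * eps
    return i_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(T=1000)
i0 = rng.uniform(-1.0, 1.0, size=(1, 64, 64))  # toy single-channel image
i_t, eps = q_sample(i0, t=500, alpha_bar=alpha_bar, rng=rng)
```
]]></code>
        <p>Because the cumulative product shrinks towards zero as t grows, the contribution of the input image vanishes and the sample approaches pure Gaussian noise near t = T.</p>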
        <p id="S3.SS2.p2">Technically, the forward process degrades the image data into an isotropic Gaussian distribution by adding noise. Conversely, the backward process attempts to remove this degradation with a denoising network. During the backward process, a series of denoising operations is performed on the noisy image <inline-formula><mml:math alttext="I_{t}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> to recover <inline-formula><mml:math alttext="I_{t-1}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>. The corresponding distribution of <inline-formula><mml:math alttext="I_{t-1}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> given <inline-formula><mml:math alttext="I_{t}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> can be formulated by Eq.(<xref rid="S3.E3">3</xref>).</p>
        <p>
          <disp-formula id="S3.E3">
            <mml:math alttext="Q({I_{t-1}}|{I_{t}})=\mathcal{N}({I_{t-1}};{\mu_{\theta}}({I_{t}},t),\sigma_{t}^{2}X)" display="block">
              <mml:mrow>
                <mml:mrow>
                  <mml:mi>Q</mml:mi>
                  <mml:mo>⁢</mml:mo>
                  <mml:mrow>
                    <mml:mo stretchy="false">(</mml:mo>
                    <mml:mrow>
                      <mml:msub>
                        <mml:mi>I</mml:mi>
                        <mml:mrow>
                          <mml:mi>t</mml:mi>
                          <mml:mo>−</mml:mo>
                          <mml:mn>1</mml:mn>
                        </mml:mrow>
                      </mml:msub>
                      <mml:mo fence="false">|</mml:mo>
                      <mml:msub>
                        <mml:mi>I</mml:mi>
                        <mml:mi>t</mml:mi>
                      </mml:msub>
                    </mml:mrow>
                    <mml:mo stretchy="false">)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
                <mml:mo>=</mml:mo>
                <mml:mrow>
                  <mml:mi class="ltx_font_mathcaligraphic">𝒩</mml:mi>
                  <mml:mo>⁢</mml:mo>
                  <mml:mrow>
                    <mml:mo stretchy="false">(</mml:mo>
                    <mml:msub>
                      <mml:mi>I</mml:mi>
                      <mml:mrow>
                        <mml:mi>t</mml:mi>
                        <mml:mo>−</mml:mo>
                        <mml:mn>1</mml:mn>
                      </mml:mrow>
                    </mml:msub>
                    <mml:mo>;</mml:mo>
                    <mml:mrow>
                      <mml:msub>
                        <mml:mi>μ</mml:mi>
                        <mml:mi>θ</mml:mi>
                      </mml:msub>
                      <mml:mo>⁢</mml:mo>
                      <mml:mrow>
                        <mml:mo stretchy="false">(</mml:mo>
                        <mml:msub>
                          <mml:mi>I</mml:mi>
                          <mml:mi>t</mml:mi>
                        </mml:msub>
                        <mml:mo>,</mml:mo>
                        <mml:mi>t</mml:mi>
                        <mml:mo stretchy="false">)</mml:mo>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>,</mml:mo>
                    <mml:mrow>
                      <mml:msubsup>
                        <mml:mi>σ</mml:mi>
                        <mml:mi>t</mml:mi>
                        <mml:mn>2</mml:mn>
                      </mml:msubsup>
                      <mml:mo>⁢</mml:mo>
                      <mml:mi>X</mml:mi>
                    </mml:mrow>
                    <mml:mo stretchy="false">)</mml:mo>
                  </mml:mrow>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>where <inline-formula><mml:math alttext="{\mu_{\theta}}({I_{t}},t)" display="inline"><mml:mrow><mml:msub><mml:mi>μ</mml:mi><mml:mi>θ</mml:mi></mml:msub><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math alttext="\sigma_{t}^{2}" display="inline"><mml:msubsup><mml:mi>σ</mml:mi><mml:mi>t</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> are the mean and variance of <inline-formula><mml:math alttext="Q({I_{t-1}}|{I_{t}})" display="inline"><mml:mrow><mml:mi>Q</mml:mi><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>I</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo fence="false">|</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>.</p>
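        <p>A single backward step under Eq.(3) can be sketched as follows. The closed form of the mean in terms of a predicted noise and the choice of variance 1 - alpha_t follow the common DDPM parameterization and are assumptions here, since the text does not state how the mean and variance are computed; the function name reverse_step and the linear schedule are likewise hypothetical.</p>
        <code language="python"><![CDATA[
```python
import numpy as np

def reverse_step(i_t, t, eps_pred, alphas, alpha_bar, rng):
    """One ancestral step from I_t to I_{t-1}; t is 1-based."""
    a_t = alphas[t - 1]
    abar_t = alpha_bar[t - 1]
    # Mean of the backward Gaussian computed from the predicted noise (assumed DDPM form).
    mean = (i_t - (1.0 - a_t) / np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(a_t)
    if t == 1:
        return mean  # no noise is added at the final step
    sigma_t = np.sqrt(1.0 - a_t)  # assumed choice: sigma_t^2 = 1 - alpha_t
    return mean + sigma_t * rng.standard_normal(i_t.shape)

rng = np.random.default_rng(0)
alphas = 1.0 - np.linspace(1e-4, 2e-2, 1000)  # assumed linear schedule
alpha_bar = np.cumprod(alphas)
i0 = rng.uniform(-1.0, 1.0, size=(1, 8, 8))
eps = rng.standard_normal(i0.shape)
i1 = np.sqrt(alpha_bar[0]) * i0 + np.sqrt(1.0 - alpha_bar[0]) * eps
# With the true noise as the prediction, the last step returns the clean image.
recovered = reverse_step(i1, 1, eps, alphas, alpha_bar, rng)
```
]]></code>
        <p>In practice the predicted noise would come from the trained denoising network rather than from the true noise used in this toy check.</p>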
        <p id="S3.SS2.p3">During the training phase, the noise <inline-formula><mml:math alttext="\varepsilon\sim\mathcal{N}(0,X)" display="inline"><mml:mrow><mml:mi>ε</mml:mi><mml:mo>∼</mml:mo><mml:mrow><mml:mi class="ltx_font_mathcaligraphic">𝒩</mml:mi><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula> and the timestep <inline-formula><mml:math alttext="t\sim U(\{1,\cdots,T\})" display="inline"><mml:mrow><mml:mi>t</mml:mi><mml:mo>∼</mml:mo><mml:mrow><mml:mi>U</mml:mi><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">⋯</mml:mi><mml:mo>,</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">}</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula> are sampled from the standard normal distribution and the uniform distribution, respectively. The noisy image <inline-formula><mml:math alttext="I_{t}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> and the timestep <inline-formula><mml:math alttext="t" display="inline"><mml:mi>t</mml:mi></mml:math></inline-formula> are fed into the denoising network <inline-formula><mml:math alttext="\varepsilon_{\theta}(\cdot,\cdot)" display="inline"><mml:mrow><mml:msub><mml:mi>ε</mml:mi><mml:mi>θ</mml:mi></mml:msub><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo lspace="0em" rspace="0em">⋅</mml:mo><mml:mo rspace="0em">,</mml:mo><mml:mo lspace="0em" rspace="0em">⋅</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, which is a U-Net. A simple supervised loss can be formulated by Eq.(<xref rid="S3.E4">4</xref>).</p>
        <p>
          <disp-formula id="S3.E4">
            <mml:math alttext="{L_{diff}}={\left\|{\varepsilon-{\varepsilon_{\theta}}(\sqrt{{{\overline{%&#10;\alpha}}_{t}}}{I_{0}}+\sqrt{1-{{\overline{\alpha}}_{t}}}\varepsilon,t)}\right%&#10;\|_{2}}" display="block">
              <mml:mrow>
                <mml:msub>
                  <mml:mi>L</mml:mi>
                  <mml:mrow>
                    <mml:mi>d</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>i</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>f</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>f</mml:mi>
                  </mml:mrow>
                </mml:msub>
                <mml:mo>=</mml:mo>
                <mml:msub>
                  <mml:mrow>
                    <mml:mo>‖</mml:mo>
                    <mml:mrow>
                      <mml:mi>ε</mml:mi>
                      <mml:mo>−</mml:mo>
                      <mml:mrow>
                        <mml:msub>
                          <mml:mi>ε</mml:mi>
                          <mml:mi>θ</mml:mi>
                        </mml:msub>
                        <mml:mo>⁢</mml:mo>
                        <mml:mrow>
                          <mml:mo stretchy="false">(</mml:mo>
                          <mml:mrow>
                            <mml:mrow>
                              <mml:msqrt>
                                <mml:msub>
                                  <mml:mover accent="true">
                                    <mml:mi>α</mml:mi>
                                    <mml:mo>¯</mml:mo>
                                  </mml:mover>
                                  <mml:mi>t</mml:mi>
                                </mml:msub>
                              </mml:msqrt>
                              <mml:mo>⁢</mml:mo>
                              <mml:msub>
                                <mml:mi>I</mml:mi>
                                <mml:mn>0</mml:mn>
                              </mml:msub>
                            </mml:mrow>
                            <mml:mo>+</mml:mo>
                            <mml:mrow>
                              <mml:msqrt>
                                <mml:mrow>
                                  <mml:mn>1</mml:mn>
                                  <mml:mo>−</mml:mo>
                                  <mml:msub>
                                    <mml:mover accent="true">
                                      <mml:mi>α</mml:mi>
                                      <mml:mo>¯</mml:mo>
                                    </mml:mover>
                                    <mml:mi>t</mml:mi>
                                  </mml:msub>
                                </mml:mrow>
                              </mml:msqrt>
                              <mml:mo>⁢</mml:mo>
                              <mml:mi>ε</mml:mi>
                            </mml:mrow>
                          </mml:mrow>
                          <mml:mo>,</mml:mo>
                          <mml:mi>t</mml:mi>
                          <mml:mo stretchy="false">)</mml:mo>
                        </mml:mrow>
                      </mml:mrow>
                    </mml:mrow>
                    <mml:mo>‖</mml:mo>
                  </mml:mrow>
                  <mml:mn>2</mml:mn>
                </mml:msub>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
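        <p>One evaluation of the loss in Eq.(4) can be sketched in NumPy as below. The denoising network is replaced by a hypothetical stand-in, zero_denoiser, that always predicts zero noise, and the linear schedule, T = 1000, and the name diffusion_loss are assumptions made only for illustration.</p>
        <code language="python"><![CDATA[
```python
import numpy as np

def diffusion_loss(i0, denoiser, alpha_bar, rng):
    """L2 norm of (eps - eps_theta(I_t, t)) for one randomly drawn timestep."""
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))       # t drawn uniformly from {1, ..., T}
    eps = rng.standard_normal(i0.shape)   # eps drawn from the standard normal
    abar_t = alpha_bar[t - 1]
    i_t = np.sqrt(abar_t) * i0 + np.sqrt(1.0 - abar_t) * eps
    residual = eps - denoiser(i_t, t)
    return float(np.linalg.norm(residual.ravel(), ord=2))

def zero_denoiser(i_t, t):
    """Hypothetical stand-in for the U-Net: always predicts zero noise."""
    return np.zeros_like(i_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 1000))  # assumed schedule
i0 = rng.uniform(-1.0, 1.0, size=(1, 8, 8))
loss = diffusion_loss(i0, zero_denoiser, alpha_bar, rng)
```
]]></code>
        <p>A network that predicts the injected noise exactly would drive this quantity to zero, which is the target that training moves towards.</p>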
        <p id="S3.SS2.p4">The diffusion model consists of a five-level U-Net framework, where the decoder backbone is subjected to randomly sampled noise levels to reconstruct the denoised diffusion features. Therefore, we employ the diffusion model as an encoder to extract multi-level diffusion features from noised infrared and visible images. The formulation is expressed by Eq.(<xref rid="S3.E5">5</xref>).</p>
        <p>
          <disp-formula id="S3.E5">
            <mml:math alttext="\{\Phi_{i}^{l},\;\Phi_{v}^{l}\}=Dif\{I_{t}^{i},\;I_{t}^{v}\}\ " display="block">
              <mml:mrow>
                <mml:mrow>
                  <mml:mo stretchy="false">{</mml:mo>
                  <mml:msubsup>
                    <mml:mi mathvariant="normal">Φ</mml:mi>
                    <mml:mi>i</mml:mi>
                    <mml:mi>l</mml:mi>
                  </mml:msubsup>
                  <mml:mo rspace="0.447em">,</mml:mo>
                  <mml:msubsup>
                    <mml:mi mathvariant="normal">Φ</mml:mi>
                    <mml:mi>v</mml:mi>
                    <mml:mi>l</mml:mi>
                  </mml:msubsup>
                  <mml:mo stretchy="false">}</mml:mo>
                </mml:mrow>
                <mml:mo>=</mml:mo>
                <mml:mrow>
                  <mml:mi>D</mml:mi>
                  <mml:mo>⁢</mml:mo>
                  <mml:mi>i</mml:mi>
                  <mml:mo>⁢</mml:mo>
                  <mml:mi>f</mml:mi>
                  <mml:mo>⁢</mml:mo>
                  <mml:mrow>
                    <mml:mo stretchy="false">{</mml:mo>
                    <mml:msubsup>
                      <mml:mi>I</mml:mi>
                      <mml:mi>t</mml:mi>
                      <mml:mi>i</mml:mi>
                    </mml:msubsup>
                    <mml:mo rspace="0.447em">,</mml:mo>
                    <mml:msubsup>
                      <mml:mi>I</mml:mi>
                      <mml:mi>t</mml:mi>
                      <mml:mi>v</mml:mi>
                    </mml:msubsup>
                    <mml:mo stretchy="false">}</mml:mo>
                  </mml:mrow>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>where <inline-formula><mml:math alttext="Dif\{\cdot\}" display="inline"><mml:mrow><mml:mi>D</mml:mi><mml:mo>⁢</mml:mo><mml:mi>i</mml:mi><mml:mo>⁢</mml:mo><mml:mi>f</mml:mi><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:mo lspace="0em" rspace="0em">⋅</mml:mo><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denotes the diffusion model encoder operation.</p>
        <p id="S3.SS2.p5">In particular, the diffusion model encoder is capable of generating more robust feature representations than a CNN encoder. Additionally, to accelerate the inference process of the diffusion model, we compress the number of channels in each layer to 1/4 of the original. A comprehensive discussion of the diffusion model encoder and its training strategies is presented in the ablation study.</p>
      </sec>
      <sec id="S3.SS3">
        <label>3.3</label>
        <title>Cross-attention interactive fusion module</title>
        <p id="S3.SS3.p1">After training the diffusion model, we employ it as an encoder and freeze its parameters while training the fusion network. The multi-level diffusion features are then used as input to the cross-attention interactive fusion modules, which facilitate global interactions. Inspired by CCNet [<xref rid="ref037" ref-type="bibr">37</xref>], for each pixel we aggregate the contextual dependencies of all pixels along its criss-cross path. More importantly, we exchange the query features of the two modalities to capture their interactive cross-attention maps, which strengthens their complementary characteristics and promotes better fusion performance.</p>
        <p id="S3.SS3.p2">As shown in Figure <xref ref-type="fig" rid="F2">2</xref>(b), given the diffusion features <inline-formula><mml:math alttext="\Phi_{i}^{l}" display="inline"><mml:msubsup><mml:mi mathvariant="normal">Φ</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math alttext="\Phi_{v}^{l}\in{R^{C\times H\times W}}" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="normal">Φ</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo>∈</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mi>H</mml:mi><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, we first apply two convolution layers with 1<inline-formula><mml:math alttext="\times" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula>1 filters to obtain their query and key features, i.e., <inline-formula><mml:math alttext="\{Q_{i}^{l},K_{i}^{l}\}" display="inline"><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi>Q</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>K</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math alttext="\{Q_{v}^{l},K_{v}^{l}\}\in{R^{C^{\prime}\times H\times W}}" display="inline"><mml:mrow><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi>Q</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>K</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo></mml:mrow><mml:mo>∈</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:msup><mml:mi>C</mml:mi><mml:mo>′</mml:mo></mml:msup><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mi>H</mml:mi><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math alttext="H" display="inline"><mml:mi>H</mml:mi></mml:math></inline-formula> and <inline-formula><mml:math alttext="W" display="inline"><mml:mi>W</mml:mi></mml:math></inline-formula> represent the height and width of the feature maps, and the channel number <inline-formula><mml:math alttext="C^{\prime}" display="inline"><mml:msup><mml:mi>C</mml:mi><mml:mo>′</mml:mo></mml:msup></mml:math></inline-formula> is less than <inline-formula><mml:math alttext="C" display="inline"><mml:mi>C</mml:mi></mml:math></inline-formula> for dimension reduction. After that, we exchange the feature maps <inline-formula><mml:math alttext="Q_{i}^{l}" display="inline"><mml:msubsup><mml:mi>Q</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math alttext="Q_{v}^{l}" display="inline"><mml:msubsup><mml:mi>Q</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> of the two modalities and generate their respective cross-attention maps <inline-formula><mml:math alttext="A_{i}^{l}" display="inline"><mml:msubsup><mml:mi>A</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math alttext="A_{v}^{l}\in{R^{(H+W-1)\times(H\times W)}}" display="inline"><mml:mrow><mml:msubsup><mml:mi>A</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo>∈</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>+</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo rspace="0.055em" stretchy="false">)</mml:mo></mml:mrow><mml:mo rspace="0.222em">×</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>H</mml:mi><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> via <italic>Affinity</italic> operations. Taking the infrared modality as an example, at the position <italic>n</italic> within the spatial dimension of the infrared features <inline-formula><mml:math alttext="K_{i}^{l}" display="inline"><mml:msubsup><mml:mi>K</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula>, we obtain a vector <inline-formula><mml:math alttext="K_{i,n}^{l}" display="inline"><mml:msubsup><mml:mi>K</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> from the infrared features themselves and a set <inline-formula><mml:math alttext="Q_{v,n}^{l}" display="inline"><mml:msubsup><mml:mi>Q</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> from the visible features <inline-formula><mml:math alttext="Q_{v}^{l}" display="inline"><mml:msubsup><mml:mi>Q</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula>, whose elements lie in the same column or row as position <italic>n</italic>. Then, the <italic>Affinity</italic> operations can be formulated by Eq.(<xref rid="S3.E6">6</xref>) and Eq.(<xref rid="S3.E7">7</xref>), respectively.</p>
        <p>
          <disp-formula id="S3.E6">
            <mml:math alttext="d_{i,m,n}^{l}=K_{i,n}^{l}Q_{v,m,n}^{l}\ " display="block">
              <mml:mrow>
                <mml:msubsup>
                  <mml:mi>d</mml:mi>
                  <mml:mrow>
                    <mml:mi>i</mml:mi>
                    <mml:mo>,</mml:mo>
                    <mml:mi>m</mml:mi>
                    <mml:mo>,</mml:mo>
                    <mml:mi>n</mml:mi>
                  </mml:mrow>
                  <mml:mi>l</mml:mi>
                </mml:msubsup>
                <mml:mo>=</mml:mo>
                <mml:mrow>
                  <mml:msubsup>
                    <mml:mi>K</mml:mi>
                    <mml:mrow>
                      <mml:mi>i</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>n</mml:mi>
                    </mml:mrow>
                    <mml:mi>l</mml:mi>
                  </mml:msubsup>
                  <mml:mo>⁢</mml:mo>
                  <mml:msubsup>
                    <mml:mi>Q</mml:mi>
                    <mml:mrow>
                      <mml:mi>v</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>m</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>n</mml:mi>
                    </mml:mrow>
                    <mml:mi>l</mml:mi>
                  </mml:msubsup>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>
          <disp-formula id="S3.E7">
            <mml:math alttext="d_{v,m,n}^{l}=K_{v,n}^{l}Q_{i,m,n}^{l}\ " display="block">
              <mml:mrow>
                <mml:msubsup>
                  <mml:mi>d</mml:mi>
                  <mml:mrow>
                    <mml:mi>v</mml:mi>
                    <mml:mo>,</mml:mo>
                    <mml:mi>m</mml:mi>
                    <mml:mo>,</mml:mo>
                    <mml:mi>n</mml:mi>
                  </mml:mrow>
                  <mml:mi>l</mml:mi>
                </mml:msubsup>
                <mml:mo>=</mml:mo>
                <mml:mrow>
                  <mml:msubsup>
                    <mml:mi>K</mml:mi>
                    <mml:mrow>
                      <mml:mi>v</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>n</mml:mi>
                    </mml:mrow>
                    <mml:mi>l</mml:mi>
                  </mml:msubsup>
                  <mml:mo>⁢</mml:mo>
                  <mml:msubsup>
                    <mml:mi>Q</mml:mi>
                    <mml:mrow>
                      <mml:mi>i</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>m</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>n</mml:mi>
                    </mml:mrow>
                    <mml:mi>l</mml:mi>
                  </mml:msubsup>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>where <inline-formula><mml:math alttext="\{d_{i,m,n}^{l},d_{v,m,n}^{l}{\rm{\}}}\in\{D_{i}^{l},D_{v}^{l}\}\ " display="inline"><mml:mrow><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>d</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo></mml:mrow><mml:mo>∈</mml:mo><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi>D</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>D</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denote the degree of correlation between infrared and visible features and their reverse order, <inline-formula><mml:math alttext="\{Q_{i,m,n}^{l}" class="ltx_math_unparsed" display="inline"><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi>Q</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math alttext="Q_{v,m,n}^{l}\}\in{R^{C^{\prime}}}" class="ltx_math_unparsed" display="inline"><mml:mrow><mml:msubsup><mml:mi>Q</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo><mml:mo>∈</mml:mo><mml:mi>R</mml:mi><mml:msup><mml:mi/><mml:msup><mml:mi>C</mml:mi><mml:mo>′</mml:mo></mml:msup></mml:msup></mml:mrow></mml:math></inline-formula> stand for the <inline-formula><mml:math alttext="m" display="inline"><mml:mi>m</mml:mi></mml:math></inline-formula>th 
element of <inline-formula><mml:math alttext="Q_{i,n}^{l}" display="inline"><mml:msubsup><mml:mi>Q</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math alttext="Q_{v,n}^{l}" display="inline"><mml:msubsup><mml:mi>Q</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula>, <inline-formula><mml:math alttext="m=[1,\cdots,H+W-1]" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">⋯</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>+</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math alttext="\{D_{i}^{l}\ " class="ltx_math_unparsed" display="inline"><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi>D</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math alttext="D_{v}^{l}\}\in{R^{(H+W-1)\times(H\times W)}}" class="ltx_math_unparsed" display="inline"><mml:mrow><mml:msubsup><mml:mi>D</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo><mml:mo>∈</mml:mo><mml:mi>R</mml:mi><mml:msup><mml:mi/><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>+</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo rspace="0.055em" stretchy="false">)</mml:mo></mml:mrow><mml:mo rspace="0.222em">×</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>H</mml:mi><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>. Then, we employ a softmax layer on <inline-formula><mml:math alttext="D_{i}^{l}" display="inline"><mml:msubsup><mml:mi>D</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math alttext="D_{v}^{l}\ " display="inline"><mml:msubsup><mml:mi>D</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> across the channel dimension to calculate the cross-attention maps <inline-formula><mml:math alttext="A_{i}^{l}" display="inline"><mml:msubsup><mml:mi>A</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math alttext="A_{v}^{l}" display="inline"><mml:msubsup><mml:mi>A</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula>, respectively.</p>
        <p id="S3.SS3.p3">Subsequently, another convolution layer with <inline-formula><mml:math alttext="1\times 1" display="inline"><mml:mrow><mml:mn>1</mml:mn><mml:mo lspace="0.222em" rspace="0.222em">×</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></inline-formula> filters is used for the diffusion features <inline-formula><mml:math alttext="\{\Phi_{i}^{l}" class="ltx_math_unparsed" display="inline"><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi mathvariant="normal">Φ</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math alttext="\Phi_{v}^{l}\}" class="ltx_math_unparsed" display="inline"><mml:mrow><mml:msubsup><mml:mi mathvariant="normal">Φ</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:math></inline-formula> to generate <inline-formula><mml:math alttext="\{V_{i}^{l},V_{v}^{l}\}" display="inline"><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi>V</mml:mi><mml:mi>i</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>V</mml:mi><mml:mi>v</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:math></inline-formula> for feature adaptation. 
Similarly, we obtain the vectors <inline-formula><mml:math alttext="\{V_{i,n}^{l},V_{v,n}^{l}\}\in{R^{C}}" display="inline"><mml:mrow><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi>V</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>V</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo></mml:mrow><mml:mo>∈</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mi>C</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> and sets <inline-formula><mml:math alttext="\{V_{i,m,n}^{l},V_{v,m,n}^{l}\}\in{R^{(H+W-1)\times C}}" display="inline"><mml:mrow><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:msubsup><mml:mi>V</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>V</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>n</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">}</mml:mo></mml:mrow><mml:mo>∈</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>+</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo>−</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo rspace="0.055em" stretchy="false">)</mml:mo></mml:mrow><mml:mo rspace="0.222em">×</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> at each spatial position <italic>n</italic>. We then apply a multiplication operation and a skip connection to collect the contextual information of the two modalities, as expressed by Eq.(<xref rid="S3.E8">8</xref>) and Eq.(<xref rid="S3.E9">9</xref>), respectively.</p>
        <p>
          <disp-formula id="S3.E8">
            <mml:math alttext="\Phi_{i}^{l,c}=\sum\limits_{m=1}^{H+W-1}{A_{i,m,n}^{l}V_{i,m,n}^{l}}+\Phi_{i,n}^{l}" display="block">
              <mml:mrow>
                <mml:msubsup>
                  <mml:mi mathvariant="normal">Φ</mml:mi>
                  <mml:mi>i</mml:mi>
                  <mml:mrow>
                    <mml:mi>l</mml:mi>
                    <mml:mo>,</mml:mo>
                    <mml:mi>c</mml:mi>
                  </mml:mrow>
                </mml:msubsup>
                <mml:mo rspace="0.111em">=</mml:mo>
                <mml:mrow>
                  <mml:mrow>
                    <mml:munderover>
                      <mml:mo movablelimits="false">∑</mml:mo>
                      <mml:mrow>
                        <mml:mi>m</mml:mi>
                        <mml:mo>=</mml:mo>
                        <mml:mn>1</mml:mn>
                      </mml:mrow>
                      <mml:mrow>
                        <mml:mrow>
                          <mml:mi>H</mml:mi>
                          <mml:mo>+</mml:mo>
                          <mml:mi>W</mml:mi>
                        </mml:mrow>
                        <mml:mo>−</mml:mo>
                        <mml:mn>1</mml:mn>
                      </mml:mrow>
                    </mml:munderover>
                    <mml:mrow>
                      <mml:msubsup>
                        <mml:mi>A</mml:mi>
                        <mml:mrow>
                          <mml:mi>i</mml:mi>
                          <mml:mo>,</mml:mo>
                          <mml:mi>m</mml:mi>
                          <mml:mo>,</mml:mo>
                          <mml:mi>n</mml:mi>
                        </mml:mrow>
                        <mml:mi>l</mml:mi>
                      </mml:msubsup>
                      <mml:mo>⁢</mml:mo>
                      <mml:msubsup>
                        <mml:mi>V</mml:mi>
                        <mml:mrow>
                          <mml:mi>i</mml:mi>
                          <mml:mo>,</mml:mo>
                          <mml:mi>m</mml:mi>
                          <mml:mo>,</mml:mo>
                          <mml:mi>n</mml:mi>
                        </mml:mrow>
                        <mml:mi>l</mml:mi>
                      </mml:msubsup>
                    </mml:mrow>
                  </mml:mrow>
                  <mml:mo>+</mml:mo>
                  <mml:msubsup>
                    <mml:mi mathvariant="normal">Φ</mml:mi>
                    <mml:mrow>
                      <mml:mi>i</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>n</mml:mi>
                    </mml:mrow>
                    <mml:mi>l</mml:mi>
                  </mml:msubsup>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>
          <disp-formula id="S3.E9">
            <mml:math alttext="\Phi_{v}^{l,c}=\sum\limits_{m=1}^{H+W-1}{A_{v,m,n}^{l}V_{v,m,n}^{l}}+\Phi_{v,n}^{l}" display="block">
              <mml:mrow>
                <mml:msubsup>
                  <mml:mi mathvariant="normal">Φ</mml:mi>
                  <mml:mi>v</mml:mi>
                  <mml:mrow>
                    <mml:mi>l</mml:mi>
                    <mml:mo>,</mml:mo>
                    <mml:mi>c</mml:mi>
                  </mml:mrow>
                </mml:msubsup>
                <mml:mo rspace="0.111em">=</mml:mo>
                <mml:mrow>
                  <mml:mrow>
                    <mml:munderover>
                      <mml:mo movablelimits="false">∑</mml:mo>
                      <mml:mrow>
                        <mml:mi>m</mml:mi>
                        <mml:mo>=</mml:mo>
                        <mml:mn>1</mml:mn>
                      </mml:mrow>
                      <mml:mrow>
                        <mml:mrow>
                          <mml:mi>H</mml:mi>
                          <mml:mo>+</mml:mo>
                          <mml:mi>W</mml:mi>
                        </mml:mrow>
                        <mml:mo>−</mml:mo>
                        <mml:mn>1</mml:mn>
                      </mml:mrow>
                    </mml:munderover>
                    <mml:mrow>
                      <mml:msubsup>
                        <mml:mi>A</mml:mi>
                        <mml:mrow>
                          <mml:mi>v</mml:mi>
                          <mml:mo>,</mml:mo>
                          <mml:mi>m</mml:mi>
                          <mml:mo>,</mml:mo>
                          <mml:mi>n</mml:mi>
                        </mml:mrow>
                        <mml:mi>l</mml:mi>
                      </mml:msubsup>
                      <mml:mo>⁢</mml:mo>
                      <mml:msubsup>
                        <mml:mi>V</mml:mi>
                        <mml:mrow>
                          <mml:mi>v</mml:mi>
                          <mml:mo>,</mml:mo>
                          <mml:mi>m</mml:mi>
                          <mml:mo>,</mml:mo>
                          <mml:mi>n</mml:mi>
                        </mml:mrow>
                        <mml:mi>l</mml:mi>
                      </mml:msubsup>
                    </mml:mrow>
                  </mml:mrow>
                  <mml:mo>+</mml:mo>
                  <mml:msubsup>
                    <mml:mi mathvariant="normal">Φ</mml:mi>
                    <mml:mrow>
                      <mml:mi>v</mml:mi>
                      <mml:mo>,</mml:mo>
                      <mml:mi>n</mml:mi>
                    </mml:mrow>
                    <mml:mi>l</mml:mi>
                  </mml:msubsup>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>where <inline-formula><mml:math alttext="\Phi_{i}^{l,c}" display="inline"><mml:msubsup><mml:mi mathvariant="normal">Φ</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math alttext="\Phi_{v}^{l,c}" display="inline"><mml:msubsup><mml:mi mathvariant="normal">Φ</mml:mi><mml:mi>v</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> denote the global cross-attention features of infrared and visible modalities. Finally, we concatenate them to generate the fusion features <inline-formula><mml:math alttext="\Phi_{f}^{l}" display="inline"><mml:msubsup><mml:mi mathvariant="normal">Φ</mml:mi><mml:mi>f</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula>.</p>
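        <p>As an illustrative sketch of Eq.(8) and Eq.(9), the following NumPy code aggregates the values of one modality. It is not the authors' implementation: the function names are hypothetical, and it assumes that the <italic>H</italic>+<italic>W</italic>-1 positions indexed by <italic>m</italic> are those sharing a row or a column with position <italic>n</italic>, that the correlation is a dot product between a query vector of one modality and the key vectors of the other, and that the softmax is taken over the <italic>H</italic>+<italic>W</italic>-1 axis.</p>
        <preformat>
```python
import numpy as np


def criss_cross_indices(y, x, H, W):
    """Positions on the row and column through (y, x): H + W - 1 in total (assumed path)."""
    row = [(y, j) for j in range(W)]
    col = [(i, x) for i in range(H) if i != y]
    return row + col


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def cross_attend(query, key, value, phi):
    """Sketch of Eq.(8)/(9) for one modality.

    query, key: arrays of shape (Cq, H, W); value, phi: arrays of shape (C, H, W).
    Returns the attended features plus the skip connection, shape (C, H, W).
    """
    _, H, W = query.shape
    out = np.empty_like(phi)
    for y in range(H):
        for x in range(W):
            path = criss_cross_indices(y, x, H, W)
            rows = np.array([p[0] for p in path])
            cols = np.array([p[1] for p in path])
            k = key[:, rows, cols]      # (Cq, H+W-1) keys on the path
            v = value[:, rows, cols]    # (C, H+W-1) values on the path
            d = query[:, y, x] @ k      # assumed dot-product correlation, (H+W-1,)
            a = softmax(d)              # attention weights over the path
            out[:, y, x] = v @ a + phi[:, y, x]  # weighted sum + skip connection
    return out
```
        </preformat>
        <p>Under these assumptions, calling <monospace>cross_attend</monospace> twice with the roles of the two modalities swapped would give the two feature maps that are then concatenated. The explicit double loop is chosen for readability rather than speed.</p>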
      </sec>
      <sec id="S3.SS4">
        <label>3.4</label>
        <title>Loss function</title>
        <p id="S3.SS4.p1">To train the fusion model, we employ structural similarity (SSIM) loss, intensity loss, and gradient loss to supervise the network. Concretely, the SSIM loss (<inline-formula><mml:math alttext="{L_{ssim}}" display="inline"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>⁢</mml:mo><mml:mi>s</mml:mi><mml:mo>⁢</mml:mo><mml:mi>i</mml:mi><mml:mo>⁢</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) constrains the structural similarity between the fused result <inline-formula><mml:math alttext="I_{f}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mi>f</mml:mi></mml:msub></mml:math></inline-formula> and the source images <inline-formula><mml:math alttext="I_{i}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula><mml:math alttext="{I_{v}}" display="inline"><mml:msub><mml:mi>I</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:math></inline-formula>, as defined by Eq.(<xref rid="S3.E10">10</xref>).</p>
        <p>
          <disp-formula id="S3.E10">
            <mml:math alttext="{L_{ssim}}=\omega_{1}(1-ssim({I_{f}},{I_{i}}))+\omega_{2}(1-ssim({I_{f}},{I_{v}}))" display="block">
              <mml:mrow>
                <mml:msub>
                  <mml:mi>L</mml:mi>
                  <mml:mrow>
                    <mml:mi>s</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>s</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>i</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>m</mml:mi>
                  </mml:mrow>
                </mml:msub>
                <mml:mo>=</mml:mo>
                <mml:mrow>
                  <mml:mrow>
                    <mml:msub>
                      <mml:mi>ω</mml:mi>
                      <mml:mn>1</mml:mn>
                    </mml:msub>
                    <mml:mo>⁢</mml:mo>
                    <mml:mrow>
                      <mml:mo stretchy="false">(</mml:mo>
                      <mml:mrow>
                        <mml:mn>1</mml:mn>
                        <mml:mo>−</mml:mo>
                        <mml:mrow>
                          <mml:mi>s</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mi>s</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mi>i</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mi>m</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mrow>
                            <mml:mo stretchy="false">(</mml:mo>
                            <mml:msub>
                              <mml:mi>I</mml:mi>
                              <mml:mi>f</mml:mi>
                            </mml:msub>
                            <mml:mo>,</mml:mo>
                            <mml:msub>
                              <mml:mi>I</mml:mi>
                              <mml:mi>i</mml:mi>
                            </mml:msub>
                            <mml:mo stretchy="false">)</mml:mo>
                          </mml:mrow>
                        </mml:mrow>
                      </mml:mrow>
                      <mml:mo stretchy="false">)</mml:mo>
                    </mml:mrow>
                  </mml:mrow>
                  <mml:mo>+</mml:mo>
                  <mml:mrow>
                    <mml:msub>
                      <mml:mi>ω</mml:mi>
                      <mml:mn>2</mml:mn>
                    </mml:msub>
                    <mml:mo>⁢</mml:mo>
                    <mml:mrow>
                      <mml:mo stretchy="false">(</mml:mo>
                      <mml:mrow>
                        <mml:mn>1</mml:mn>
                        <mml:mo>−</mml:mo>
                        <mml:mrow>
                          <mml:mi>s</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mi>s</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mi>i</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mi>m</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mrow>
                            <mml:mo stretchy="false">(</mml:mo>
                            <mml:msub>
                              <mml:mi>I</mml:mi>
                              <mml:mi>f</mml:mi>
                            </mml:msub>
                            <mml:mo>,</mml:mo>
                            <mml:msub>
                              <mml:mi>I</mml:mi>
                              <mml:mi>v</mml:mi>
                            </mml:msub>
                            <mml:mo stretchy="false">)</mml:mo>
                          </mml:mrow>
                        </mml:mrow>
                      </mml:mrow>
                      <mml:mo stretchy="false">)</mml:mo>
                    </mml:mrow>
                  </mml:mrow>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>where <inline-formula><mml:math alttext="ssim(\cdot)" display="inline"><mml:mrow><mml:mi>s</mml:mi><mml:mo>⁢</mml:mo><mml:mi>s</mml:mi><mml:mo>⁢</mml:mo><mml:mi>i</mml:mi><mml:mo>⁢</mml:mo><mml:mi>m</mml:mi><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo lspace="0em" rspace="0em">⋅</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denotes the structural similarity operation. <inline-formula><mml:math alttext="\omega_{1}" display="inline"><mml:msub><mml:mi>ω</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math alttext="\omega_{2}" display="inline"><mml:msub><mml:mi>ω</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> are set to 0.5.</p>
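        <p>A minimal sketch of Eq.(10) is given below. For brevity it evaluates SSIM from global image statistics instead of the usual local sliding windows, which is a simplification and not necessarily what the paper uses. The function names, the stabilizing constants 0.01 and 0.03, and the dynamic range of 2 (for images scaled to [-1, 1]) are assumptions.</p>
        <preformat>
```python
import numpy as np


def global_ssim(a, b, data_range=2.0):
    """Single-window SSIM from global means, variances and covariance (simplified)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def ssim_loss(fused, ir, vis, w1=0.5, w2=0.5):
    """Eq.(10): weighted sum of (1 - SSIM) against each source image."""
    return w1 * (1 - global_ssim(fused, ir)) + w2 * (1 - global_ssim(fused, vis))
```
        </preformat>
        <p>With both weights at 0.5, the loss is zero only when the fused image matches both sources under this measure, so in general it trades off similarity to the two inputs.</p>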
        <p id="S3.SS4.p2">Meanwhile, the intensity loss <inline-formula><mml:math alttext="{L_{int}}" display="inline"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>⁢</mml:mo><mml:mi>n</mml:mi><mml:mo>⁢</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is designed to preserve valuable pixel intensity information from the source images, and is given by Eq.(<xref rid="S3.E11">11</xref>).</p>
        <p>
          <disp-formula id="S3.E11">
            <mml:math alttext="{L_{int}}=\frac{1}{{HW}}{\left\|{{I_{f}}-mean({I_{i}},{I_{v}})}\right\|_{1}}" display="block">
              <mml:mrow>
                <mml:msub>
                  <mml:mi>L</mml:mi>
                  <mml:mrow>
                    <mml:mi>i</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>n</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>t</mml:mi>
                  </mml:mrow>
                </mml:msub>
                <mml:mo>=</mml:mo>
                <mml:mrow>
                  <mml:mfrac>
                    <mml:mn>1</mml:mn>
                    <mml:mrow>
                      <mml:mi>H</mml:mi>
                      <mml:mo>⁢</mml:mo>
                      <mml:mi>W</mml:mi>
                    </mml:mrow>
                  </mml:mfrac>
                  <mml:mo>⁢</mml:mo>
                  <mml:msub>
                    <mml:mrow>
                      <mml:mo>‖</mml:mo>
                      <mml:mrow>
                        <mml:msub>
                          <mml:mi>I</mml:mi>
                          <mml:mi>f</mml:mi>
                        </mml:msub>
                        <mml:mo>−</mml:mo>
                        <mml:mrow>
                          <mml:mi>m</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mi>e</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mi>a</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mi>n</mml:mi>
                          <mml:mo>⁢</mml:mo>
                          <mml:mrow>
                            <mml:mo stretchy="false">(</mml:mo>
                            <mml:msub>
                              <mml:mi>I</mml:mi>
                              <mml:mi>i</mml:mi>
                            </mml:msub>
                            <mml:mo>,</mml:mo>
                            <mml:msub>
                              <mml:mi>I</mml:mi>
                              <mml:mi>v</mml:mi>
                            </mml:msub>
                            <mml:mo stretchy="false">)</mml:mo>
                          </mml:mrow>
                        </mml:mrow>
                      </mml:mrow>
                      <mml:mo>‖</mml:mo>
                    </mml:mrow>
                    <mml:mn>1</mml:mn>
                  </mml:msub>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>where <inline-formula><mml:math alttext="mean(\cdot)" display="inline"><mml:mrow><mml:mi>m</mml:mi><mml:mo>⁢</mml:mo><mml:mi>e</mml:mi><mml:mo>⁢</mml:mo><mml:mi>a</mml:mi><mml:mo>⁢</mml:mo><mml:mi>n</mml:mi><mml:mo>⁢</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo lspace="0em" rspace="0em">⋅</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denotes the average operation.</p>
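        <p>Eq.(11) can be sketched in a few lines of NumPy. The function name is hypothetical, and the sketch assumes that the average operation is the element-wise mean of the two source images and that all inputs are single-channel arrays of the same size.</p>
        <preformat>
```python
import numpy as np


def intensity_loss(fused, ir, vis):
    """Eq.(11): L1 distance to the (assumed element-wise) mean of the sources, over H*W."""
    H, W = fused.shape
    target = (ir + vis) / 2.0
    return np.abs(fused - target).sum() / (H * W)
```
        </preformat>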
        <p id="S3.SS4.p3">Moreover, the gradient loss <inline-formula><mml:math alttext="{L_{grad}}" display="inline"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mo>⁢</mml:mo><mml:mi>r</mml:mi><mml:mo>⁢</mml:mo><mml:mi>a</mml:mi><mml:mo>⁢</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is used to transfer as much detail as possible from both modalities, and is formulated by Eq.(<xref rid="S3.E12">12</xref>).</p>
        <p>
          <disp-formula id="S3.E12">
            <mml:math alttext="{L_{grad}}=\frac{1}{{HW}}{\left\|{\left|{\nabla{I_{f}}}\right|-\max(\left|{\nabla{I_{i}}}\right|,\left|{\nabla{I_{v}}}\right|)}\right\|_{1}}" display="block">
              <mml:mrow>
                <mml:msub>
                  <mml:mi>L</mml:mi>
                  <mml:mrow>
                    <mml:mi>g</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>r</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>a</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>d</mml:mi>
                  </mml:mrow>
                </mml:msub>
                <mml:mo>=</mml:mo>
                <mml:mrow>
                  <mml:mfrac>
                    <mml:mn>1</mml:mn>
                    <mml:mrow>
                      <mml:mi>H</mml:mi>
                      <mml:mo>⁢</mml:mo>
                      <mml:mi>W</mml:mi>
                    </mml:mrow>
                  </mml:mfrac>
                  <mml:mo>⁢</mml:mo>
                  <mml:msub>
                    <mml:mrow>
                      <mml:mo>‖</mml:mo>
                      <mml:mrow>
                        <mml:mrow>
                          <mml:mo>|</mml:mo>
                          <mml:mrow>
                            <mml:mo rspace="0.167em">∇</mml:mo>
                            <mml:msub>
                              <mml:mi>I</mml:mi>
                              <mml:mi>f</mml:mi>
                            </mml:msub>
                          </mml:mrow>
                          <mml:mo>|</mml:mo>
                        </mml:mrow>
                        <mml:mo>−</mml:mo>
                        <mml:mrow>
                          <mml:mi>max</mml:mi>
                          <mml:mo>⁡</mml:mo>
                          <mml:mrow>
                            <mml:mo stretchy="false">(</mml:mo>
                            <mml:mrow>
                              <mml:mo>|</mml:mo>
                              <mml:mrow>
                                <mml:mo rspace="0.167em">∇</mml:mo>
                                <mml:msub>
                                  <mml:mi>I</mml:mi>
                                  <mml:mi>i</mml:mi>
                                </mml:msub>
                              </mml:mrow>
                              <mml:mo>|</mml:mo>
                            </mml:mrow>
                            <mml:mo>,</mml:mo>
                            <mml:mrow>
                              <mml:mo>|</mml:mo>
                              <mml:mrow>
                                <mml:mo rspace="0.167em">∇</mml:mo>
                                <mml:msub>
                                  <mml:mi>I</mml:mi>
                                  <mml:mi>v</mml:mi>
                                </mml:msub>
                              </mml:mrow>
                              <mml:mo>|</mml:mo>
                            </mml:mrow>
                            <mml:mo stretchy="false">)</mml:mo>
                          </mml:mrow>
                        </mml:mrow>
                      </mml:mrow>
                      <mml:mo>‖</mml:mo>
                    </mml:mrow>
                    <mml:mn>1</mml:mn>
                  </mml:msub>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>where <inline-formula><mml:math alttext="\nabla" display="inline"><mml:mo>∇</mml:mo></mml:math></inline-formula> is the Sobel gradient operator. <inline-formula><mml:math alttext="\max(\cdot)" display="inline"><mml:mrow><mml:mi>max</mml:mi><mml:mo>⁡</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mo lspace="0em" rspace="0em">⋅</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math alttext="{\left\|\cdot\right\|_{1}}" class="ltx_math_unparsed" display="inline"><mml:mrow><mml:mo rspace="0em" stretchy="true">∥</mml:mo><mml:mo lspace="0em" rspace="0em">⋅</mml:mo><mml:msub><mml:mo lspace="0em" stretchy="true">∥</mml:mo><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> stand for the maximum and L1-norm operations, respectively.</p>
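        <p>The gradient loss of Eq.(12) can be sketched as follows. The function names are hypothetical, and several details are assumptions that the text does not fix: the gradient magnitude is approximated by the sum of the absolute horizontal and vertical Sobel responses, image borders are handled by edge replication, and the maximum is taken element-wise.</p>
        <preformat>
```python
import numpy as np


def sobel_magnitude(img):
    """|gx| + |gy| from 3x3 Sobel kernels, with edge-replicated borders (assumed)."""
    p = np.pad(img, 1, mode="edge")
    # Horizontal response: right column minus left column, weights 1-2-1.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    # Vertical response: bottom row minus top row, weights 1-2-1.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.abs(gx) + np.abs(gy)


def gradient_loss(fused, ir, vis):
    """Eq.(12): L1 distance between |grad I_f| and the element-wise max source gradient."""
    H, W = fused.shape
    target = np.maximum(sobel_magnitude(ir), sobel_magnitude(vis))
    return np.abs(sobel_magnitude(fused) - target).sum() / (H * W)
```
        </preformat>
        <p>Taking the element-wise maximum means that, at each pixel, the fused image is pushed toward whichever source has the stronger edge response there.</p>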
        <p id="S3.SS4.p4">Finally, the total fusion loss can be expressed by Eq.(<xref rid="S3.E13">13</xref>).</p>
        <p>
          <disp-formula id="S3.E13">
            <mml:math alttext="{L_{fusion}}={\lambda_{1}}{L_{ssim}}+{\lambda_{2}}{L_{int}}+{\lambda_{3}}{L_{grad}}" display="block">
              <mml:mrow>
                <mml:msub>
                  <mml:mi>L</mml:mi>
                  <mml:mrow>
                    <mml:mi>f</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>u</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>s</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>i</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>o</mml:mi>
                    <mml:mo>⁢</mml:mo>
                    <mml:mi>n</mml:mi>
                  </mml:mrow>
                </mml:msub>
                <mml:mo>=</mml:mo>
                <mml:mrow>
                  <mml:mrow>
                    <mml:msub>
                      <mml:mi>λ</mml:mi>
                      <mml:mn>1</mml:mn>
                    </mml:msub>
                    <mml:mo>⁢</mml:mo>
                    <mml:msub>
                      <mml:mi>L</mml:mi>
                      <mml:mrow>
                        <mml:mi>s</mml:mi>
                        <mml:mo>⁢</mml:mo>
                        <mml:mi>s</mml:mi>
                        <mml:mo>⁢</mml:mo>
                        <mml:mi>i</mml:mi>
                        <mml:mo>⁢</mml:mo>
                        <mml:mi>m</mml:mi>
                      </mml:mrow>
                    </mml:msub>
                  </mml:mrow>
                  <mml:mo>+</mml:mo>
                  <mml:mrow>
                    <mml:msub>
                      <mml:mi>λ</mml:mi>
                      <mml:mn>2</mml:mn>
                    </mml:msub>
                    <mml:mo>⁢</mml:mo>
                    <mml:msub>
                      <mml:mi>L</mml:mi>
                      <mml:mrow>
                        <mml:mi>i</mml:mi>
                        <mml:mo>⁢</mml:mo>
                        <mml:mi>n</mml:mi>
                        <mml:mo>⁢</mml:mo>
                        <mml:mi>t</mml:mi>
                      </mml:mrow>
                    </mml:msub>
                  </mml:mrow>
                  <mml:mo>+</mml:mo>
                  <mml:mrow>
                    <mml:msub>
                      <mml:mi>λ</mml:mi>
                      <mml:mn>3</mml:mn>
                    </mml:msub>
                    <mml:mo>⁢</mml:mo>
                    <mml:msub>
                      <mml:mi>L</mml:mi>
                      <mml:mrow>
                        <mml:mi>g</mml:mi>
                        <mml:mo>⁢</mml:mo>
                        <mml:mi>r</mml:mi>
                        <mml:mo>⁢</mml:mo>
                        <mml:mi>a</mml:mi>
                        <mml:mo>⁢</mml:mo>
                        <mml:mi>d</mml:mi>
                      </mml:mrow>
                    </mml:msub>
                  </mml:mrow>
                </mml:mrow>
              </mml:mrow>
            </mml:math>
          </disp-formula>
        </p>
        <p>where <inline-formula><mml:math alttext="\lambda_{1}" display="inline"><mml:msub><mml:mi>λ</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, <inline-formula><mml:math alttext="\lambda_{2}" display="inline"><mml:msub><mml:mi>λ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math alttext="\lambda_{3}" display="inline"><mml:msub><mml:mi>λ</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> are hyperparameters that balance the three losses.</p>
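        <p>As a small worked example of Eq.(13), the sketch below combines three already computed loss values. The function name is hypothetical, and the default weights are the values 1, 4, and 20 reported in the experimental configurations of this paper.</p>
        <preformat>
```python
def fusion_loss(l_ssim, l_int, l_grad, lambdas=(1.0, 4.0, 20.0)):
    """Eq.(13): weighted sum of the SSIM, intensity and gradient losses."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_ssim + lam2 * l_int + lam3 * l_grad
```
        </preformat>
        <p>For instance, loss values of 0.1, 0.2, and 0.01 give 0.1 + 0.8 + 0.2 = 1.1, which shows how the larger weight lets a numerically small gradient term contribute on the same scale as the others.</p>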
        <p>
          <fig id="F3">
            <label>Figure 3.</label>
            <caption>
              <p>Visual comparison of DMFuse with other SOTA competitors on the TNO benchmark.</p>
            </caption>
            <graphic xlink:href="Fig3.jpg"/>
          </fig>
        </p>
      </sec>
    </sec>
    <sec id="S4">
      <label>4.</label>
      <title>Experimental Results and Analysis</title>
      <p id="S4.p1">This section describes the experimental configurations and presents comparative evaluations on fusion tasks and downstream applications. Ablation studies are also discussed in detail.</p>
      <sec id="S4.SS1">
        <label>4.1</label>
        <title>Experimental Configurations</title>
        <p id="S4.SS1.p1">In the training phase, we first train the diffusion model on the MS-COCO benchmark, which includes more than 80000 images of complex scenes. The training parameter settings are consistent with DDPM [<xref rid="ref018" ref-type="bibr">18</xref>]. We then train the fusion model on the TNO benchmark. To augment the training dataset, we crop the images into patches of size 256 <inline-formula><mml:math alttext="\times" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula> 256 with a sliding step of 12 and normalize their gray value range to [-1, 1]. This process yields a total of 18813 patch pairs for training. The batch size and number of epochs are set to 4 and 16, respectively. The model is optimized using the Adam optimizer. In the loss function, we empirically set <inline-formula><mml:math alttext="\lambda_{1}" display="inline"><mml:msub><mml:mi>λ</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula>, <inline-formula><mml:math alttext="\lambda_{2}" display="inline"><mml:msub><mml:mi>λ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula>, and <inline-formula><mml:math alttext="\lambda_{3}" display="inline"><mml:msub><mml:mi>λ</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:math></inline-formula> to 1, 4, and 20, respectively. Additionally, the pre-trained diffusion model generates diffusion features at three different time steps, i.e., 5, 50, and 100. All experiments are conducted on a platform equipped with an NVIDIA GeForce RTX 3090 GPU, an Intel Core i9-10850K CPU, and 64 GB of memory.</p>
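        <p>The patch-based augmentation described above can be sketched as follows. The function names are hypothetical, and the sketch assumes 8-bit single-channel inputs and simply skips any border region that does not fit a full patch; the paper does not state how such borders are handled.</p>
        <preformat>
```python
import numpy as np


def crop_patches(img, size=256, stride=12):
    """Slide a size x size window over img with the given step and collect the patches."""
    H, W = img.shape
    patches = []
    for top in range(0, H - size + 1, stride):
        for left in range(0, W - size + 1, stride):
            patches.append(img[top:top + size, left:left + size])
    return patches


def normalize_gray(patch):
    """Map 8-bit gray values in [0, 255] to the range [-1, 1] (assumed input range)."""
    return patch.astype(np.float32) / 127.5 - 1.0
```
        </preformat>
        <p>The same window positions would be applied to the infrared and visible images of a pair so that the resulting patches stay aligned.</p>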
        <p id="S4.SS1.p2">In the testing phase, we employ the TNO <xref ref-type="fn" rid="fn1">1</xref><fn id="fn1"><label><sup>1</sup></label><p id="footnotex1">[Online]. Available: https://figshare.com/articles/TN_Image_Fusion_Dataset/1008029</p></fn>, M<sup>3</sup>FD <xref ref-type="fn" rid="fn2">2</xref><fn id="fn2"><label><sup>2</sup></label><p id="footnotex2">[Online]. Available: https://github.com/dlut-dimt/TarDAL</p></fn> and Harvard MIF <xref ref-type="fn" rid="fn3">3</xref><fn id="fn3"><label><sup>3</sup></label><p id="footnotex3">[Online]. Available: http://www.med.harvard.edu/AANLIB/home.html</p></fn> benchmarks, and select 25, 40, and 50 image pairs from them, respectively, to evaluate the effectiveness of the proposed model. Seven SOTA competitors are selected for comparison: the non-generative schemes U2Fusion [<xref rid="ref012" ref-type="bibr">12</xref>], RFN-Nest [<xref rid="ref013" ref-type="bibr">13</xref>], YDTR [<xref rid="ref015" ref-type="bibr">15</xref>], and DATFuse [<xref rid="ref029" ref-type="bibr">29</xref>], and the generative schemes FusionGAN [<xref rid="ref016" ref-type="bibr">16</xref>], Dif-Fusion [<xref rid="ref020" ref-type="bibr">20</xref>], and DDFM [<xref rid="ref019" ref-type="bibr">19</xref>]. We also employ eight metrics for quantitative verification, namely entropy (EN) [<xref rid="ref038" ref-type="bibr">38</xref>], standard deviation (SD) [<xref rid="ref039" ref-type="bibr">39</xref>], phase congruency (PC) [<xref rid="ref040" ref-type="bibr">40</xref>], feature mutual information based on pixel (FMIp) [<xref rid="ref041" ref-type="bibr">41</xref>], Qe [<xref rid="ref042" ref-type="bibr">42</xref>], Qabf [<xref rid="ref043" ref-type="bibr">43</xref>], multi-scale structural similarity (MS-SSIM) [<xref rid="ref044" ref-type="bibr">44</xref>], and visual information fidelity (VIF) [<xref rid="ref045" ref-type="bibr">45</xref>].
In the following experiments, red bold and blue underlined values indicate the best and second-best results, respectively.</p>
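        <p>For reference, the two simplest metrics listed above, EN and SD, can be sketched from their textbook definitions: EN is the Shannon entropy of the gray-level histogram, and SD is the standard deviation of the pixel intensities. The sketch below rests on stated assumptions (base-2 logarithm, population standard deviation, a flat list of integer gray values) and may differ from the exact implementations in [<xref rid="ref038" ref-type="bibr">38</xref>] and [<xref rid="ref039" ref-type="bibr">39</xref>]. The function names are hypothetical.</p>
        <preformat preformat-type="code">
```python
import math
from collections import Counter
from typing import Sequence


def entropy(pixels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the gray-level histogram."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def standard_deviation(pixels: Sequence[int]) -> float:
    """Population standard deviation of the pixel intensities."""
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


# Toy image with two equally likely gray levels.
toy = [0, 0, 255, 255]
print(entropy(toy))             # 1.0 bit
print(standard_deviation(toy))  # 127.5
```
</preformat>
        <p>Higher EN suggests that the fused image carries more information, and higher SD suggests stronger contrast, which is why larger values are treated as better for both.</p>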
        <p>
          <fig id="F4">
            <label>Figure 4.</label>
            <caption>
              <p>Quantitative comparisons of DMFuse with other SOTA competitors on the TNO benchmark.</p>
            </caption>
            <graphic xlink:href="Fig4.jpg"/>
          </fig>
        </p>
        <p>
          <fig id="F5">
            <label>Figure 5.</label>
            <caption>
              <p>Visual descriptions of DMFuse with other SOTA competitors on the M<sup>3</sup>FD benchmark.</p>
            </caption>
            <graphic xlink:href="Fig5.jpg"/>
          </fig>
        </p>
      </sec>
      <sec id="S4.SS2">
        <label>4.2</label>
        <title>Results on TNO Benchmark</title>
        <p id="S4.SS2.p1">We first conduct experiments on the TNO benchmark to showcase the effectiveness of the proposed DMFuse. Three representative examples, namely <italic>Nato_camp</italic>, <italic>Street</italic>, and <italic>Kaptein_1123</italic>, are selected for subjective evaluation, and the comparison results are shown in Figure <xref ref-type="fig" rid="F3">3</xref>. The CNN-based methods, i.e., U2Fusion and RFN-Nest, focus on modeling local features using image-level and feature-level frameworks, respectively. Although they manage to preserve visible details, they tend to lose brightness in the infrared targets. The Transformer-based methods, i.e., YDTR and DATFuse, attempt to integrate local and global features to achieve better visual effects. However, they still struggle to control the brightness information effectively. FusionGAN aims to retain target brightness but sacrifices visible detail, potentially owing to unstable training. DDFM integrates the inference solution and diffusion sampling within the same iterative framework to generate fused images directly, but it fails to combine thermal radiation information effectively. Dif-Fusion constructs a multi-channel data distribution and yields results similar to those of the proposed model. In comparison, the proposed model preserves rich details while maintaining appropriate intensity.</p>
        <p id="S4.SS2.p2">Subsequently, the eight metrics mentioned above are used for the quantitative evaluation of fusion performance, and the comparison results are presented in Figure <xref ref-type="fig" rid="F4">4</xref>. The proposed model is denoted by the red dotted line. It performs strongly across all metrics: it ranks first in EN, FMIp, Qe, Qabf, MS-SSIM, and VIF, and second in SD and PC, behind Dif-Fusion and DATFuse, respectively. The optimal Qe, Qabf, and MS-SSIM indicate that the proposed model can transfer edge, gradient, and structural information from the source images into the fused results. The optimal EN and FMIp and the suboptimal PC demonstrate that the proposed model can preserve significant details and meaningful information. The optimal VIF and suboptimal SD reveal that the proposed model has better visual performance and contrast definition. These quantitative results are consistent with the qualitative observations above.</p>
      </sec>
      <sec id="S4.SS3">
        <label>4.3</label>
        <title>Results on M<sup>3</sup>FD Benchmark</title>
        <p id="S4.SS3.p1">We further carry out experiments on the M<sup>3</sup>FD benchmark and compare the proposed model with the other competitors to verify its generalization ability. For color image fusion, we first convert the RGB visible image to the YCbCr color space, fuse the Y channel with the infrared image, and then convert the result back to RGB. Figure <xref ref-type="fig" rid="F5">5</xref> gives the subjective comparison results of three examples, namely 03878, 03989, and 00762. The proposed method offers significant advantages in terms of detail preservation and intensity control. For the salient pedestrian targets, the proposed model preserves high-brightness target characteristics and distinct contour edges. Meanwhile, for background details such as trees, windows, and handrails, it also yields the clearest rendering. In addition, Figure <xref ref-type="fig" rid="F6">6</xref> presents the objective comparison results. The proposed model achieves the top ranking for all metrics except EN and SD, on which it ranks behind Dif-Fusion. Both subjective and objective experiments demonstrate that the proposed model yields promising fusion performance and outperforms the other SOTA competitors.</p>
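        <p>The color-handling procedure described above can be sketched per pixel as follows. The conversion uses the full-range BT.601 (JFIF) coefficients, which is an assumption because the text does not state which YCbCr variant is used, and <monospace>fuse_luma</monospace> is a hypothetical placeholder (a simple maximum) that merely stands in for the DMFuse network.</p>
        <preformat preformat-type="code">
```python
from typing import Tuple


def rgb_to_ycbcr(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Full-range BT.601 (JFIF) RGB to YCbCr; inputs in [0, 255]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y: float, cb: float, cr: float) -> Tuple[int, int, int]:
    """Inverse conversion, rounded and clamped to 8-bit RGB."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return tuple(min(255, max(0, round(v))) for v in (r, g, b))


def fuse_luma(y_visible: float, infrared: float) -> float:
    """Hypothetical stand-in for the fusion network: keep the brighter value."""
    return max(y_visible, infrared)


def fuse_color_pixel(rgb: Tuple[int, int, int], infrared: float) -> Tuple[int, int, int]:
    """Fuse only the Y channel with the infrared value; reuse Cb and Cr unchanged."""
    y, cb, cr = rgb_to_ycbcr(*rgb)
    return ycbcr_to_rgb(fuse_luma(y, infrared), cb, cr)


print(fuse_color_pixel((200, 100, 50), 0.0))     # dark infrared: color unchanged
print(fuse_color_pixel((128, 128, 128), 255.0))  # hot target on gray: turns white
```
</preformat>
        <p>Fusing only the luminance channel lets a single-channel fusion model handle color inputs, since the chrominance of the visible image is carried over unchanged.</p>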
        <p>
          <fig id="F6">
            <label>Figure 6.</label>
            <caption>
              <p>Quantitative comparisons of DMFuse with other SOTA competitors on the M<sup>3</sup>FD benchmark.</p>
            </caption>
            <graphic xlink:href="Fig6.jpg"/>
          </fig>
        </p>
        <p>
          <fig id="F7">
            <label>Figure 7.</label>
            <caption>
              <p>Visual descriptions of DMFuse with other SOTA competitors on the Harvard MIF benchmark.</p>
            </caption>
            <graphic xlink:href="Fig7.jpg"/>
          </fig>
        </p>
        <p>
          <table-wrap id="T1">
            <label>Table 1</label>
            <caption>
              <p>Quantitative comparisons of DMFuse with other SOTA competitors on the Harvard MIF benchmark.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">Models</th>
                  <th style="border-top: 1px solid black;" align="center">EN</th>
                  <th style="border-top: 1px solid black;" align="center">SD</th>
                  <th style="border-top: 1px solid black;" align="center">PC</th>
                  <th style="border-top: 1px solid black;" align="center">FMIp</th>
                  <th style="border-top: 1px solid black;" align="center">Qe</th>
                  <th style="border-top: 1px solid black;" align="center">Qabf</th>
                  <th style="border-top: 1px solid black;" align="center">MS-SSIM</th>
                  <th style="border-top: 1px solid black;" align="center">VIF</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">U2Fusion [<xref rid="ref012" ref-type="bibr">12</xref>]</th>
                  <td style="border-top: 1px solid black;" align="center">3.7566</td>
                  <td style="border-top: 1px solid black;" align="center">33.8763</td>
                  <td style="border-top: 1px solid black;" align="center">0.3735</td>
                  <td style="border-top: 1px solid black;" align="center">0.8579</td>
                  <td style="border-top: 1px solid black;" align="center">0.3093</td>
                  <td style="border-top: 1px solid black;" align="center">0.3776</td>
                  <td style="border-top: 1px solid black;" align="center">0.8552</td>
                  <td style="border-top: 1px solid black;" align="center">0.2489</td>
                </tr>
                <tr>
                  <th align="left">RFN-Nest [<xref rid="ref013" ref-type="bibr">13</xref>]</th>
                  <td align="center">4.1351</td>
                  <td align="center">56.6246</td>
                  <td align="center">0.2396</td>
                  <td align="center">0.8616</td>
                  <td align="center">0.2229</td>
                  <td align="center">0.1983</td>
                  <td align="center">0.8928</td>
                  <td align="center">0.2256</td>
                </tr>
                <tr>
                  <th align="left">YDTR [<xref rid="ref015" ref-type="bibr">15</xref>]</th>
                  <td align="center">4.1527</td>
                  <td align="center">37.6520</td>
                  <td align="center">0.4553</td>
                  <td align="center">0.8648</td>
                  <td align="center">0.3990</td>
                  <td align="center">0.4267</td>
                  <td align="center">0.8811</td>
                  <td align="center">0.2597</td>
                </tr>
                <tr>
                  <th align="left">DATFuse [<xref rid="ref029" ref-type="bibr">29</xref>]</th>
                  <td align="center">4.2113</td>
                  <td align="center">54.9562</td>
                  <td align="center">0.4360</td>
                  <td align="center">0.8531</td>
                  <td align="center">
                    <underline>0.5040</underline>
                  </td>
                  <td align="center">0.6113</td>
                  <td align="center">0.9262</td>
                  <td align="center">0.2605</td>
                </tr>
                <tr>
                  <th align="left">FusionGAN [<xref rid="ref016" ref-type="bibr">16</xref>]</th>
                  <td align="center">4.2226</td>
                  <td align="center">44.7076</td>
                  <td align="center">0.1375</td>
                  <td align="center">0.8496</td>
                  <td align="center">0.2095</td>
                  <td align="center">0.1662</td>
                  <td align="center">0.8079</td>
                  <td align="center">0.1708</td>
                </tr>
                <tr>
                  <th align="left">Dif-Fusion [<xref rid="ref020" ref-type="bibr">20</xref>]</th>
                  <td align="center">
                    <underline>4.7231</underline>
                  </td>
                  <td align="center">
                    <underline>60.7802</underline>
                  </td>
                  <td align="center">0.4513</td>
                  <td align="center">0.8660</td>
                  <td align="center">0.4644</td>
                  <td align="center">0.6354</td>
                  <td align="center">
                    <bold>0.9559</bold>
                  </td>
                  <td align="center">0.2994</td>
                </tr>
                <tr>
                  <th align="left">DDFM [<xref rid="ref019" ref-type="bibr">19</xref>]</th>
                  <td align="center">3.8027</td>
                  <td align="center">56.4941</td>
                  <td align="center">
                    <underline>0.4622</underline>
                  </td>
                  <td align="center">
                    <bold>0.8796</bold>
                  </td>
                  <td align="center">0.4725</td>
                  <td align="center">
                    <underline>0.6363</underline>
                  </td>
                  <td align="center">0.9507</td>
                  <td align="center">
                    <underline>0.3288</underline>
                  </td>
                </tr>
                <tr>
                  <th style="border-bottom: 1px solid black;" align="left">Ours</th>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>5.6969</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>61.8903</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.5438</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <underline>0.8754</underline>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.5546</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.7154</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <underline>0.9545</underline>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.3319</bold>
                  </td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
      </sec>
      <sec id="S4.SS4">
        <label>4.4</label>
        <title>Results on Harvard MIF Benchmark</title>
        <p id="S4.SS4.p1">In this section, we conduct experiments on the Harvard MIF benchmark to further verify the generalization of the proposed model. Figure <xref ref-type="fig" rid="F7">7</xref> gives the subjective comparison results of three examples, namely MRI_CT_21, MRI_PET_32, and MRI_SPECT_48. Compared with other methods, the proposed model effectively retains the soft tissue texture information presented in MRI images and highlights the areas of high-density contrast enhancement in CT images. Table <xref rid="T1" ref-type="table">1</xref> presents the quantitative results of the different fusion methods. DMFuse achieves the best performance in terms of EN, SD, PC, Qe, Qabf, and VIF, and ranks second in FMIp and MS-SSIM, behind DDFM and Dif-Fusion, respectively. Both subjective and objective experiments demonstrate that the proposed model yields excellent performance in medical image fusion tasks.</p>
        <p>
          <fig id="F8">
            <label>Figure 8.</label>
            <caption>
              <p>Qualitative object detection comparisons of source images and the fused results obtained by different methods. </p>
            </caption>
            <graphic xlink:href="Fig8.jpg"/>
          </fig>
        </p>
        <p id="S4.SS4.p2">In summary, the above experiments on the TNO, M<sup>3</sup>FD, and Harvard MIF benchmarks confirm the strong performance and generalization ability of the proposed model across different lighting conditions and object categories. There are two main reasons. First, we train the diffusion model on the MS-COCO dataset for more stable performance and, more importantly, use it to guide the fusion network. The diffusion features exhibit a strong distribution mapping capacity and provide extra feature details for fusion tasks, so the fused results preserve rich details from the source images. Second, the designed cross-attention interactive fusion module effectively implements global interactions between different modalities. Under the supervision of the loss function, the fused images achieve better visual effects with high-brightness targets and unambiguous details. As a result, foreground objects and background edges are easy to distinguish in the images fused by DMFuse.</p>
      </sec>
      <sec id="S4.SS5">
        <label>4.5</label>
        <title>Downstream Application</title>
        <p id="S4.SS5.p1">In addition to evaluating fusion performance, we explore the positive role of image fusion in downstream applications. Specifically, we analyze its effect on two other vision tasks, namely object detection and semantic segmentation.</p>
        <p>
          <table-wrap id="T2">
            <label>Table 2</label>
            <caption>
              <p>Quantitative object detection comparisons of different methods on the M<sup>3</sup>FD benchmark.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" rowspan="2" align="left">Methods</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" colspan="7" align="center">mAP@0.5</th>
                  <th style="border-top: 1px solid black;" colspan="7" align="center">mAP@[0.5:0.95]</th>
                </tr>
                <tr>
                  <th style="border-top: 1px solid black;" align="center">Person</th>
                  <th style="border-top: 1px solid black;" align="center">Car</th>
                  <th style="border-top: 1px solid black;" align="center">Bus</th>
                  <th style="border-top: 1px solid black;" align="center">Lamp</th>
                  <th style="border-top: 1px solid black;" align="center">Motorcycle</th>
                  <th style="border-top: 1px solid black;" align="center">Truck</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="center">All</th>
                  <th style="border-top: 1px solid black;" align="center">Person</th>
                  <th style="border-top: 1px solid black;" align="center">Car</th>
                  <th style="border-top: 1px solid black;" align="center">Bus</th>
                  <th style="border-top: 1px solid black;" align="center">Lamp</th>
                  <th style="border-top: 1px solid black;" align="center">Motorcycle</th>
                  <th style="border-top: 1px solid black;" align="center">Truck</th>
                  <th style="border-top: 1px solid black;" align="center">All</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="left">Infrared</th>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.783</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">0.870</td>
                  <td style="border-top: 1px solid black;" align="center">0.921</td>
                  <td style="border-top: 1px solid black;" align="center">0.665</td>
                  <td style="border-top: 1px solid black;" align="center">0.760</td>
                  <td style="border-top: 1px solid black;" align="center">0.855</td>
                  <td style="border-right: 1px solid black;border-top: 1px solid black;" align="center">0.809</td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.551</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">0.671</td>
                  <td style="border-top: 1px solid black;" align="center">0.780</td>
                  <td style="border-top: 1px solid black;" align="center">0.359</td>
                  <td style="border-top: 1px solid black;" align="center">0.506</td>
                  <td style="border-top: 1px solid black;" align="center">0.671</td>
                  <td style="border-top: 1px solid black;" align="center">0.590</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">Visible</th>
                  <td align="center">0.716</td>
                  <td align="center">0.869</td>
                  <td align="center">0.920</td>
                  <td align="center">0.790</td>
                  <td align="center">
                    <bold>0.790</bold>
                  </td>
                  <td align="center">0.864</td>
                  <td style="border-right: 1px solid black;" align="center">0.825</td>
                  <td align="center">0.478</td>
                  <td align="center">0.701</td>
                  <td align="center">0.796</td>
                  <td align="center">0.471</td>
                  <td align="center">
                    <underline>0.543</underline>
                  </td>
                  <td align="center">0.689</td>
                  <td align="center">0.613</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">U2Fusion [<xref rid="ref012" ref-type="bibr">12</xref>]</th>
                  <td align="center">0.774</td>
                  <td align="center">0.883</td>
                  <td align="center">0.925</td>
                  <td align="center">0.784</td>
                  <td align="center">0.774</td>
                  <td align="center">
                    <underline>0.867</underline>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">0.835</td>
                  <td align="center">0.549</td>
                  <td align="center">
                    <underline>0.717</underline>
                  </td>
                  <td align="center">0.799</td>
                  <td align="center">
                    <underline>0.474</underline>
                  </td>
                  <td align="center">
                    <bold>0.547</bold>
                  </td>
                  <td align="center">0.701</td>
                  <td align="center">
                    <underline>0.631</underline>
                  </td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">RFN-Nest [<xref rid="ref013" ref-type="bibr">13</xref>]</th>
                  <td align="center">0.772</td>
                  <td align="center">0.881</td>
                  <td align="center">0.924</td>
                  <td align="center">0.790</td>
                  <td align="center">0.775</td>
                  <td align="center">0.865</td>
                  <td style="border-right: 1px solid black;" align="center">0.835</td>
                  <td align="center">0.544</td>
                  <td align="center">0.716</td>
                  <td align="center">0.798</td>
                  <td align="center">0.467</td>
                  <td align="center">0.541</td>
                  <td align="center">0.700</td>
                  <td align="center">0.628</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">YDTR [<xref rid="ref015" ref-type="bibr">15</xref>]</th>
                  <td align="center">0.768</td>
                  <td align="center">0.885</td>
                  <td align="center">0.925</td>
                  <td align="center">0.781</td>
                  <td align="center">0.766</td>
                  <td align="center">0.859</td>
                  <td style="border-right: 1px solid black;" align="center">0.831</td>
                  <td align="center">0.546</td>
                  <td align="center">0.714</td>
                  <td align="center">
                    <underline>0.800</underline>
                  </td>
                  <td align="center">0.473</td>
                  <td align="center">0.539</td>
                  <td align="center">0.700</td>
                  <td align="center">0.629</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">DATFuse [<xref rid="ref029" ref-type="bibr">29</xref>]</th>
                  <td align="center">0.764</td>
                  <td align="center">0.881</td>
                  <td align="center">0.919</td>
                  <td align="center">0.781</td>
                  <td align="center">0.766</td>
                  <td align="center">0.859</td>
                  <td style="border-right: 1px solid black;" align="center">0.829</td>
                  <td align="center">0.541</td>
                  <td align="center">0.711</td>
                  <td align="center">0.794</td>
                  <td align="center">0.469</td>
                  <td align="center">0.542</td>
                  <td align="center">0.696</td>
                  <td align="center">0.626</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">FusionGAN [<xref rid="ref016" ref-type="bibr">16</xref>]</th>
                  <td align="center">0.766</td>
                  <td align="center">0.873</td>
                  <td align="center">0.923</td>
                  <td align="center">0.779</td>
                  <td align="center">0.761</td>
                  <td align="center">0.857</td>
                  <td style="border-right: 1px solid black;" align="center">0.827</td>
                  <td align="center">0.542</td>
                  <td align="center">0.712</td>
                  <td align="center">0.792</td>
                  <td align="center">0.468</td>
                  <td align="center">0.538</td>
                  <td align="center">0.691</td>
                  <td align="center">0.624</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">Dif-Fusion [<xref rid="ref020" ref-type="bibr">20</xref>]</th>
                  <td align="center">0.775</td>
                  <td align="center">
                    <underline>0.886</underline>
                  </td>
                  <td align="center">
                    <underline>0.926</underline>
                  </td>
                  <td align="center">
                    <bold>0.796</bold>
                  </td>
                  <td align="center">0.772</td>
                  <td align="center">0.858</td>
                  <td style="border-right: 1px solid black;" align="center">
                    <underline>0.836</underline>
                  </td>
                  <td align="center">0.549</td>
                  <td align="center">0.716</td>
                  <td align="center">0.787</td>
                  <td align="center">0.473</td>
                  <td align="center">0.538</td>
                  <td align="center">
                    <underline>0.702</underline>
                  </td>
                  <td align="center">0.628</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">DDFM [<xref rid="ref019" ref-type="bibr">19</xref>]</th>
                  <td align="center">0.771</td>
                  <td align="center">0.882</td>
                  <td align="center">0.919</td>
                  <td align="center">0.790</td>
                  <td align="center">
                    <underline>0.782</underline>
                  </td>
                  <td align="center">0.865</td>
                  <td style="border-right: 1px solid black;" align="center">0.835</td>
                  <td align="center">0.544</td>
                  <td align="center">0.712</td>
                  <td align="center">0.795</td>
                  <td align="center">0.470</td>
                  <td align="center">0.540</td>
                  <td align="center">0.700</td>
                  <td align="center">0.627</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;border-bottom: 1px solid black;" align="left">Ours</th>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <underline>0.776</underline>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.887</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.927</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <underline>0.791</underline>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">0.774</td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.875</bold>
                  </td>
                  <td style="border-right: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>0.838</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <underline>0.550</underline>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.719</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.806</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.475</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">0.541</td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.710</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.634</bold>
                  </td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
        <p>
          <fig id="F9">
            <label>Figure 9.</label>
            <caption>
              <p>Qualitative semantic segmentation comparisons of DMFuse with other competitors on the FMB benchmark.</p>
            </caption>
            <graphic xlink:href="Fig9.jpg"/>
          </fig>
        </p>
        <p id="S4.SS5.p2">Image fusion for object detection: We first discuss how image fusion affects object detection performance [<xref rid="ref046" ref-type="bibr">46</xref>, <xref rid="ref047" ref-type="bibr">47</xref>]. The experiments are implemented on the M<sup>3</sup>FD benchmark, which contains 4,200 images annotated with 33,603 objects in six classes, i.e., People, Car, Bus, Motorcycle, Truck, and Lamp. The YOLOv5 network is used as the detection baseline, and mean average precision (mAP) is employed as the evaluation metric. Specifically, mAP@0.5 denotes the precision value at an intersection-over-union (IoU) threshold of 0.5, and mAP@[0.5:0.95] denotes the mean value over IoU thresholds from 0.5 to 0.95 in steps of 0.05. For a fair comparison, we apply the same detection model to both the source images and the fused results.</p>
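        <p>To make the two evaluation settings concrete, the sketch below computes the IoU of two axis-aligned boxes and enumerates the ten IoU thresholds implied by the mAP@[0.5:0.95] column header of Table 2. It is a minimal illustration with hypothetical function names and a box format of (x1, y1, x2, y2) assumed for convenience; it is not the YOLOv5 evaluation code.</p>
        <preformat preformat-type="code">
```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds() -> List[float]:
    """IoU thresholds 0.50, 0.55, ..., 0.95 (ten values)."""
    return [round(0.5 + 0.05 * k, 2) for k in range(10)]


def mean_over_thresholds(ap_per_threshold: List[float]) -> float:
    """Average the per-threshold AP values into a single score."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


# A prediction that overlaps its ground-truth box with IoU 0.72 counts as a
# match at the thresholds 0.50 to 0.70 but not at 0.75 and above.
score = iou((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 7.2))
passed = sum(1 for t in iou_thresholds() if score >= t)
print(score, passed)
```
</preformat>
        <p>Averaging over stricter thresholds rewards tighter localization, which explains why the mAP@[0.5:0.95] values in Table 2 are consistently lower than the mAP@0.5 values.</p>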
        <p id="S4.SS5.p3">Figure <xref ref-type="fig" rid="F8">8</xref> presents the visual results of object detection. For representative objects such as People and Car, the proposed model achieves higher precision values than the source images and other competitors, indicating that our fused results are more conducive to object detection tasks. Moreover, the objective comparison results are shown in Table <xref rid="T2" ref-type="table">2</xref>. The fusion methods generally yield good detection performance, and their overall mAP values are higher than those obtained using only infrared or visible images. Notably, the proposed model outperforms the other competitors in terms of overall mAP, with improvements of 1.09% and 1.77% in mAP@0.5 and mAP@[0.5:0.95], respectively. This indicates that the proposed model can fully discover unique information from different modalities and offer effective complementary characteristics for the detector to achieve better performance.</p>
        <p>
          <table-wrap id="T3">
            <label>Table 3</label>
            <caption>
              <p>Quantitative semantic segmentation comparisons of different methods on the FMB benchmark.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" rowspan="2" align="left">Methods</th>
                  <th style="border-top: 1px solid black;" colspan="2" align="center">Road</th>
                  <th style="border-top: 1px solid black;" colspan="2" align="center">Sidewalk</th>
                  <th style="border-top: 1px solid black;" colspan="2" align="center">Lamp</th>
                  <th style="border-top: 1px solid black;" colspan="2" align="center">Sign</th>
                  <th style="border-top: 1px solid black;" colspan="2" align="center">Vegetation</th>
                  <th style="border-top: 1px solid black;" colspan="2" align="center">Sky</th>
                  <th style="border-top: 1px solid black;" colspan="2" align="center">Person</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" colspan="2" align="center">Pole</th>
                  <th style="border-top: 1px solid black;" rowspan="2" align="center">mAcc</th>
                  <th style="border-top: 1px solid black;" rowspan="2" align="center">mIoU</th>
                </tr>
                <tr>
                  <th style="border-top: 1px solid black;" align="center">Acc</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="center">IoU</th>
                  <th style="border-top: 1px solid black;" align="center">Acc</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="center">IoU</th>
                  <th style="border-top: 1px solid black;" align="center">Acc</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="center">IoU</th>
                  <th style="border-top: 1px solid black;" align="center">Acc</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="center">IoU</th>
                  <th style="border-top: 1px solid black;" align="center">Acc</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="center">IoU</th>
                  <th style="border-top: 1px solid black;" align="center">Acc</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="center">IoU</th>
                  <th style="border-top: 1px solid black;" align="center">Acc</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="center">IoU</th>
                  <th style="border-top: 1px solid black;" align="center">Acc</th>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="center">IoU</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <th style="border-right: 1px solid black;border-top: 1px solid black;" align="left">Infrared</th>
                  <td style="border-top: 1px solid black;" align="center">83.8</td>
                  <td style="border-right: 1px solid black;border-top: 1px solid black;" align="center">79.9</td>
                  <td style="border-top: 1px solid black;" align="center">51.4</td>
                  <td style="border-right: 1px solid black;border-top: 1px solid black;" align="center">30.4</td>
                  <td style="border-top: 1px solid black;" align="center">70.4</td>
                  <td style="border-right: 1px solid black;border-top: 1px solid black;" align="center">12.2</td>
                  <td style="border-top: 1px solid black;" align="center">79.2</td>
                  <td style="border-right: 1px solid black;border-top: 1px solid black;" align="center">54.6</td>
                  <td style="border-top: 1px solid black;" align="center">84.6</td>
                  <td style="border-right: 1px solid black;border-top: 1px solid black;" align="center">74.7</td>
                  <td style="border-top: 1px solid black;" align="center">95.4</td>
                  <td style="border-right: 1px solid black;border-top: 1px solid black;" align="center">90.2</td>
                  <td style="border-top: 1px solid black;" align="center">84.9</td>
                  <td style="border-right: 1px solid black;border-top: 1px solid black;" align="center">63.0</td>
                  <td style="border-top: 1px solid black;" align="center">46.1</td>
                  <td style="border-right: 1px solid black;border-top: 1px solid black;" align="center">24.4</td>
                  <td style="border-top: 1px solid black;" align="center">74.5</td>
                  <td style="border-top: 1px solid black;" align="center">53.7</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">Visible</th>
                  <td align="center">84.6</td>
                  <td style="border-right: 1px solid black;" align="center">82.7</td>
                  <td align="center">66.4</td>
                  <td style="border-right: 1px solid black;" align="center">32.1</td>
                  <td align="center">57.4</td>
                  <td style="border-right: 1px solid black;" align="center">
                    <underline>33.0</underline>
                  </td>
                  <td align="center">83.5</td>
                  <td style="border-right: 1px solid black;" align="center">65.0</td>
                  <td align="center">
                    <bold>93.0</bold>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">81.4</td>
                  <td align="center">93.5</td>
                  <td style="border-right: 1px solid black;" align="center">91.4</td>
                  <td align="center">84.8</td>
                  <td style="border-right: 1px solid black;" align="center">41.1</td>
                  <td align="center">63.2</td>
                  <td style="border-right: 1px solid black;" align="center">37.6</td>
                  <td align="center">78.3</td>
                  <td align="center">58.0</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">U2Fusion [<xref rid="ref012" ref-type="bibr">12</xref>]</th>
                  <td align="center">
                    <bold>91.1</bold>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">
                    <bold>85.3</bold>
                  </td>
                  <td align="center">56.0</td>
                  <td style="border-right: 1px solid black;" align="center">
                    <bold>39.6</bold>
                  </td>
                  <td align="center">72.3</td>
                  <td style="border-right: 1px solid black;" align="center">31.9</td>
                  <td align="center">
                    <bold>86.5</bold>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">57.0</td>
                  <td align="center">86.0</td>
                  <td style="border-right: 1px solid black;" align="center">82.0</td>
                  <td align="center">96.6</td>
                  <td style="border-right: 1px solid black;" align="center">92.8</td>
                  <td align="center">
                    <underline>87.0</underline>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">56.4</td>
                  <td align="center">70.6</td>
                  <td style="border-right: 1px solid black;" align="center">35.5</td>
                  <td align="center">80.8</td>
                  <td align="center">60.1</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">RFN-Nest [<xref rid="ref013" ref-type="bibr">13</xref>]</th>
                  <td align="center">84.7</td>
                  <td style="border-right: 1px solid black;" align="center">76.3</td>
                  <td align="center">62.1</td>
                  <td style="border-right: 1px solid black;" align="center">
                    <underline>36.3</underline>
                  </td>
                  <td align="center">
                    <bold>80.4</bold>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">24.9</td>
                  <td align="center">77.8</td>
                  <td style="border-right: 1px solid black;" align="center">68.3</td>
                  <td align="center">91.9</td>
                  <td style="border-right: 1px solid black;" align="center">82.2</td>
                  <td align="center">
                    <underline>96.7</underline>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">
                    <underline>93.9</underline>
                  </td>
                  <td align="center">85.6</td>
                  <td style="border-right: 1px solid black;" align="center">60.8</td>
                  <td align="center">70.1</td>
                  <td style="border-right: 1px solid black;" align="center">39.2</td>
                  <td align="center">
                    <underline>81.2</underline>
                  </td>
                  <td align="center">60.2</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">YDTR [<xref rid="ref015" ref-type="bibr">15</xref>]</th>
                  <td align="center">83.9</td>
                  <td style="border-right: 1px solid black;" align="center">81.3</td>
                  <td align="center">
                    <underline>72.4</underline>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">33.5</td>
                  <td align="center">61.6</td>
                  <td style="border-right: 1px solid black;" align="center">27.8</td>
                  <td align="center">73.3</td>
                  <td style="border-right: 1px solid black;" align="center">66.4</td>
                  <td align="center">89.7</td>
                  <td style="border-right: 1px solid black;" align="center">
                    <underline>84.0</underline>
                  </td>
                  <td align="center">95.6</td>
                  <td style="border-right: 1px solid black;" align="center">
                    <underline>93.9</underline>
                  </td>
                  <td align="center">83.4</td>
                  <td style="border-right: 1px solid black;" align="center">58.5</td>
                  <td align="center">
                    <bold>74.7</bold>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">39.0</td>
                  <td align="center">79.4</td>
                  <td align="center">
                    <underline>60.6</underline>
                  </td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">DATFuse [<xref rid="ref029" ref-type="bibr">29</xref>]</th>
                  <td align="center">85.1</td>
                  <td style="border-right: 1px solid black;" align="center">80.0</td>
                  <td align="center">50.3</td>
                  <td style="border-right: 1px solid black;" align="center">21.7</td>
                  <td align="center">51.4</td>
                  <td style="border-right: 1px solid black;" align="center">30.0</td>
                  <td align="center">
                    <underline>84.0</underline>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">61.5</td>
                  <td align="center">81.7</td>
                  <td style="border-right: 1px solid black;" align="center">78.4</td>
                  <td align="center">95.6</td>
                  <td style="border-right: 1px solid black;" align="center">92.6</td>
                  <td align="center">77.9</td>
                  <td style="border-right: 1px solid black;" align="center">63.1</td>
                  <td align="center">
                    <underline>71.8</underline>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">
                    <underline>39.4</underline>
                  </td>
                  <td align="center">74.7</td>
                  <td align="center">58.3</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">FusionGAN [<xref rid="ref016" ref-type="bibr">16</xref>]</th>
                  <td align="center">84.8</td>
                  <td style="border-right: 1px solid black;" align="center">80.0</td>
                  <td align="center">57.8</td>
                  <td style="border-right: 1px solid black;" align="center">32.6</td>
                  <td align="center">50.4</td>
                  <td style="border-right: 1px solid black;" align="center">28.5</td>
                  <td align="center">82.6</td>
                  <td style="border-right: 1px solid black;" align="center">61.5</td>
                  <td align="center">90.4</td>
                  <td style="border-right: 1px solid black;" align="center">82.3</td>
                  <td align="center">93.7</td>
                  <td style="border-right: 1px solid black;" align="center">91.3</td>
                  <td align="center">
                    <bold>89.2</bold>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">62.6</td>
                  <td align="center">62.1</td>
                  <td style="border-right: 1px solid black;" align="center">35.7</td>
                  <td align="center">76.4</td>
                  <td align="center">59.3</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">Dif-Fusion [<xref rid="ref020" ref-type="bibr">20</xref>]</th>
                  <td align="center">83.7</td>
                  <td style="border-right: 1px solid black;" align="center">80.7</td>
                  <td align="center">66.8</td>
                  <td style="border-right: 1px solid black;" align="center">26.4</td>
                  <td align="center">46.9</td>
                  <td style="border-right: 1px solid black;" align="center">32.5</td>
                  <td align="center">78.4</td>
                  <td style="border-right: 1px solid black;" align="center">
                    <underline>68.7</underline>
                  </td>
                  <td align="center">87.0</td>
                  <td style="border-right: 1px solid black;" align="center">80.7</td>
                  <td align="center">
                    <underline>96.7</underline>
                  </td>
                  <td style="border-right: 1px solid black;" align="center">92.8</td>
                  <td align="center">86.0</td>
                  <td style="border-right: 1px solid black;" align="center">
                    <underline>64.5</underline>
                  </td>
                  <td align="center">66.7</td>
                  <td style="border-right: 1px solid black;" align="center">35.3</td>
                  <td align="center">76.5</td>
                  <td align="center">60.2</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;" align="left">DDFM [<xref rid="ref019" ref-type="bibr">19</xref>]</th>
                  <td align="center">81.2</td>
                  <td style="border-right: 1px solid black;" align="center">79.9</td>
                  <td align="center">53.7</td>
                  <td style="border-right: 1px solid black;" align="center">24.0</td>
                  <td align="center">46.1</td>
                  <td style="border-right: 1px solid black;" align="center">31.0</td>
                  <td align="center">75.4</td>
                  <td style="border-right: 1px solid black;" align="center">65.3</td>
                  <td align="center">87.7</td>
                  <td style="border-right: 1px solid black;" align="center">81.2</td>
                  <td align="center">95.1</td>
                  <td style="border-right: 1px solid black;" align="center">91.8</td>
                  <td align="center">79.0</td>
                  <td style="border-right: 1px solid black;" align="center">54.6</td>
                  <td align="center">49.1</td>
                  <td style="border-right: 1px solid black;" align="center">35.1</td>
                  <td align="center">70.9</td>
                  <td align="center">57.9</td>
                </tr>
                <tr>
                  <th style="border-right: 1px solid black;border-bottom: 1px solid black;" align="left">Ours</th>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <underline>85.2</underline>
                  </td>
                  <td style="border-right: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <underline>83.9</underline>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>73.0</bold>
                  </td>
                  <td style="border-right: 1px solid black;border-bottom: 1px solid black;" align="center">33.6</td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <underline>73.4</underline>
                  </td>
                  <td style="border-right: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>43.6</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">82.7</td>
                  <td style="border-right: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>70.3</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <underline>92.3</underline>
                  </td>
                  <td style="border-right: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>85.6</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>97.3</bold>
                  </td>
                  <td style="border-right: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>94.5</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">82.6</td>
                  <td style="border-right: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>67.5</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">67.2</td>
                  <td style="border-right: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>48.2</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>81.7</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>65.9</bold>
                  </td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
        <p>
          <fig id="F10">
            <label>Figure 10.</label>
            <caption>
              <p>Visual comparisons of ablation experiments for two examples selected from the TNO and M<sup>3</sup>FD benchmarks.</p>
            </caption>
            <graphic xlink:href="Fig10.jpg"/>
          </fig>
        </p>
        <p id="S4.SS5.p4">Image fusion for semantic segmentation: We further compare the proposed DMFuse with other competitors on the semantic segmentation task. The full-time multi-modality benchmark (FMB) <xref ref-type="fn" rid="fn4">4</xref><fn id="fn4"><label><sup>4</sup></label><p id="footnotex4">[Online]. Available: https://github.com/JinyuanLiu-CV/SegMiF</p></fn>, collected from the M<sup>3</sup>FD benchmark, is used for the segmentation experiments. The FMB dataset contains rich driving scenes under different lighting and weather conditions, and is annotated with fourteen categories. We select 1120 image pairs as the training set and evaluate the segmentation performance of the different models on 280 pairs. The experimental configuration follows SegMiF [<xref rid="ref032" ref-type="bibr">32</xref>]. Accuracy (Acc) and intersection-over-union (IoU) are employed as the segmentation metrics.</p>
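        <p>For illustration only, the sketch below shows one common way to obtain per-class Acc and IoU, together with their class means (mAcc and mIoU), from a confusion matrix. It assumes that per-class accuracy is the fraction of ground-truth pixels of a class that are predicted correctly, which is a widespread convention but is not confirmed here as the exact definition used in the SegMiF configuration. All function names and the toy label lists are hypothetical.</p>
        <code language="python"><![CDATA[
```python
import math


def confusion_matrix(gt, pred, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        cm[g][p] += 1
    return cm


def per_class_metrics(cm):
    """Return (acc, iou) lists with one entry per class."""
    n = len(cm)
    acc, iou = [], []
    for c in range(n):
        tp = cm[c][c]
        gt_total = sum(cm[c])                         # pixels labelled c
        pred_total = sum(cm[r][c] for r in range(n))  # pixels predicted as c
        union = gt_total + pred_total - tp
        # Classes absent from both labels and predictions are marked NaN.
        acc.append(tp / gt_total if gt_total else float("nan"))
        iou.append(tp / union if union else float("nan"))
    return acc, iou


def nanmean(values):
    """Mean over the entries that are not NaN."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")


if __name__ == "__main__":
    # Hypothetical flattened label maps with three classes.
    gt = [0, 0, 0, 1, 1, 2]
    pred = [0, 0, 1, 1, 1, 2]
    acc, iou = per_class_metrics(confusion_matrix(gt, pred, 3))
    print("per-class Acc:", acc, "mAcc:", nanmean(acc))
    print("per-class IoU:", iou, "mIoU:", nanmean(iou))
```
]]></code>
        <p>Because IoU also penalizes false positives through the union term, a class can have a high Acc and a noticeably lower IoU, a pattern that is visible for several classes in Table <xref rid="T3" ref-type="table">3</xref>.</p>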
        <p id="S4.SS5.p5">The qualitative semantic segmentation comparisons are depicted in Figure <xref ref-type="fig" rid="F9">9</xref>. For representative objects and details, such as pedestrians and buildings, the single-modality infrared and visible images cannot produce accurate classifications, whereas the fusion methods improve the semantic segmentation performance to some extent. This indicates that the complementary characteristics provided by image fusion improve segmentation accuracy. More importantly, the proposed model classifies objects and scenes with high accuracy, and its results are the closest to the ground truth. Table <xref rid="T3" ref-type="table">3</xref> reports the quantitative semantic segmentation comparisons. The numerical results demonstrate that the proposed model is ahead of the other SOTA competitors in terms of mAcc and mIoU. In short, the proposed model can exploit and strengthen the complementary information of different modalities, which has a positive effect on semantic segmentation.</p>
      </sec>
      <sec id="S4.SS6">
        <label>4.6</label>
        <title>Ablation Study</title>
        <p id="S4.SS6.p1">This section examines several specialized designs incorporated into the proposed DMFuse and evaluates their effectiveness through ablation experiments on the model architecture and the training strategy. Both qualitative and quantitative comparisons are presented.</p>
        <p>
          <table-wrap id="T4">
            <label>Table 4</label>
            <caption>
              <p>Quantitative validations of different training datasets.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">Testing Datasets</th>
                  <th style="border-top: 1px solid black;" align="left">Training Datasets</th>
                  <th style="border-top: 1px solid black;" align="center">EN</th>
                  <th style="border-top: 1px solid black;" align="center">SD</th>
                  <th style="border-top: 1px solid black;" align="center">PC</th>
                  <th style="border-top: 1px solid black;" align="center">FMIp</th>
                  <th style="border-top: 1px solid black;" align="center">Qe</th>
                  <th style="border-top: 1px solid black;" align="center">Qabf</th>
                  <th style="border-top: 1px solid black;" align="center">MS-SSIM</th>
                  <th style="border-top: 1px solid black;" align="center">VIF</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <th style="border-top: 1px solid black;" rowspan="3" align="left">TNO Benchmark</th>
                  <th style="border-top: 1px solid black;" align="left">TNO</th>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>6.8466</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>35.7474</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.3086</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.9026</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.4073</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.5009</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">0.9090</td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.4154</underline>
                  </td>
                </tr>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">M<sup>3</sup>FD</th>
                  <td style="border-top: 1px solid black;" align="center">6.8466</td>
                  <td style="border-top: 1px solid black;" align="center">34.0896</td>
                  <td style="border-top: 1px solid black;" align="center">0.3032</td>
                  <td style="border-top: 1px solid black;" align="center">0.9002</td>
                  <td style="border-top: 1px solid black;" align="center">0.3936</td>
                  <td style="border-top: 1px solid black;" align="center">0.4767</td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.9156</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">0.3901</td>
                </tr>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">MS-COCO (Ours)</th>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>6.9324</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>37.0730</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.3500</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.9060</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.4573</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.5467</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.9130</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.4233</bold>
                  </td>
                </tr>
                <tr>
                  <th style="border-top: 1px solid black;border-bottom: 1px solid black;" rowspan="3" align="left">M<sup>3</sup>FD Benchmark</th>
                  <th style="border-top: 1px solid black;" align="left">TNO</th>
                  <td style="border-top: 1px solid black;" align="center">7.0188</td>
                  <td style="border-top: 1px solid black;" align="center">36.4068</td>
                  <td style="border-top: 1px solid black;" align="center">0.2798</td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.8538</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">0.2723</td>
                  <td style="border-top: 1px solid black;" align="center">0.4244</td>
                  <td style="border-top: 1px solid black;" align="center">0.8990</td>
                  <td style="border-top: 1px solid black;" align="center">0.2786</td>
                </tr>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">M<sup>3</sup>FD</th>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>7.1955</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>40.2199</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.3149</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">0.8487</td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.3697</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.5227</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.9195</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.3068</underline>
                  </td>
                </tr>
                <tr>
                  <th style="border-top: 1px solid black;border-bottom: 1px solid black;" align="left">MS-COCO (Ours)</th>
                  <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>7.2045</bold>
                  </td>
                  <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>40.6980</bold>
                  </td>
                  <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>0.5056</bold>
                  </td>
                  <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>0.8726</bold>
                  </td>
                  <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>0.4821</bold>
                  </td>
                  <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>0.6818</bold>
                  </td>
                  <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>0.9392</bold>
                  </td>
                  <td style="border-top: 1px solid black;border-bottom: 1px solid black;" align="center">
                    <bold>0.4133</bold>
                  </td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
        <p>
          <table-wrap id="T5">
            <label>Table 5</label>
            <caption>
              <p>Quantitative validations of different channels on the TNO benchmark.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">Metrics</th>
                  <th style="border-top: 1px solid black;" align="center">EN</th>
                  <th style="border-top: 1px solid black;" align="center">SD</th>
                  <th style="border-top: 1px solid black;" align="center">PC</th>
                  <th style="border-top: 1px solid black;" align="center">FMIp</th>
                  <th style="border-top: 1px solid black;" align="center">Qe</th>
                  <th style="border-top: 1px solid black;" align="center">Qabf</th>
                  <th style="border-top: 1px solid black;" align="center">MS-SSIM</th>
                  <th style="border-top: 1px solid black;" align="center">VIF</th>
                  <th style="border-top: 1px solid black;" align="center">Params. (M)</th>
                  <th style="border-top: 1px solid black;" align="center">FLOPs (G)</th>
                  <th style="border-top: 1px solid black;" align="center">Time (s)</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">Original</th>
                  <td style="border-top: 1px solid black;" align="center">6.9135</td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>37.6477</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.3845</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.9106</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.4861</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.5898</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.9150</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.4336</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">392.724</td>
                  <td style="border-top: 1px solid black;" align="center">1516.136</td>
                  <td style="border-top: 1px solid black;" align="center">74.110</td>
                </tr>
                <tr>
                  <th align="left">1/2</th>
                  <td align="center">6.9150</td>
                  <td align="center">
                    <underline>37.1946</underline>
                  </td>
                  <td align="center">
                    <underline>0.3738</underline>
                  </td>
                  <td align="center">
                    <underline>0.9084</underline>
                  </td>
                  <td align="center">
                    <underline>0.4794</underline>
                  </td>
                  <td align="center">
                    <underline>0.5754</underline>
                  </td>
                  <td align="center">
                    <underline>0.9125</underline>
                  </td>
                  <td align="center">
                    <underline>0.4296</underline>
                  </td>
                  <td align="center">98.680</td>
                  <td align="center">382.052</td>
                  <td align="center">6.403</td>
                </tr>
                <tr>
                  <th align="left">1/4 (Ours)</th>
                  <td align="center">
                    <underline>6.9324</underline>
                  </td>
                  <td align="center">37.0730</td>
                  <td align="center">0.3500</td>
                  <td align="center">0.9060</td>
                  <td align="center">0.4573</td>
                  <td align="center">0.5467</td>
                  <td align="center">0.9130</td>
                  <td align="center">0.4233</td>
                  <td align="center">
                    <underline>24.967</underline>
                  </td>
                  <td align="center">
                    <underline>106.584</underline>
                  </td>
                  <td align="center">
                    <underline>2.624</underline>
                  </td>
                </tr>
                <tr>
                  <th style="border-bottom: 1px solid black;" align="left">1/8</th>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>6.9402</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">36.9426</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.2405</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.8899</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.3849</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.4181</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.9036</td>
                  <td style="border-bottom: 1px solid black;" align="center">0.3786</td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>6.433</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>35.967</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>2.163</bold>
                  </td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
        <p>
          <table-wrap id="T6">
            <label>Table 6</label>
            <caption>
              <p>Quantitative validations of component effectiveness.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">Models</th>
                  <th style="border-top: 1px solid black;" align="center">EN</th>
                  <th style="border-top: 1px solid black;" align="center">SD</th>
                  <th style="border-top: 1px solid black;" align="center">PC</th>
                  <th style="border-top: 1px solid black;" align="center">FMIp</th>
                  <th style="border-top: 1px solid black;" align="center">Qe</th>
                  <th style="border-top: 1px solid black;" align="center">Qabf</th>
                  <th style="border-top: 1px solid black;" align="center">MS-SSIM</th>
                  <th style="border-top: 1px solid black;" align="center">VIF</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">w/o Dif</th>
                  <td style="border-top: 1px solid black;" align="center">6.8480</td>
                  <td style="border-top: 1px solid black;" align="center">35.1861</td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.3196</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.8975</underline>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">
                    <bold>0.4735</bold>
                  </td>
                  <td style="border-top: 1px solid black;" align="center">0.4862</td>
                  <td style="border-top: 1px solid black;" align="center">0.8830</td>
                  <td style="border-top: 1px solid black;" align="center">
                    <underline>0.4228</underline>
                  </td>
                </tr>
                <tr>
                  <th align="left">w/o CAIM</th>
                  <td align="center">
                    <underline>6.8574</underline>
                  </td>
                  <td align="center">
                    <underline>35.9839</underline>
                  </td>
                  <td align="center">0.3155</td>
                  <td align="center">0.8886</td>
                  <td align="center">0.3477</td>
                  <td align="center">
                    <underline>0.4902</underline>
                  </td>
                  <td align="center">
                    <underline>0.8985</underline>
                  </td>
                  <td align="center">0.3439</td>
                </tr>
                <tr>
                  <th style="border-bottom: 1px solid black;" align="left">Ours</th>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>6.9324</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>37.0730</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.3500</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.9060</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <underline>0.4573</underline>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.5467</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.9130</bold>
                  </td>
                  <td style="border-bottom: 1px solid black;" align="center">
                    <bold>0.4233</bold>
                  </td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
        <p id="S4.SS6.p2">Training on Different Datasets: To assess the generalization performance of the diffusion model, we train it on different datasets, including TNO, M<sup>3</sup>FD, and the MS-COCO dataset used by the proposed model. As shown in Figure <xref ref-type="fig" rid="F10">10</xref> (c) and (d), the fused images produced by the TNO- and M<sup>3</sup>FD-trained models exhibit a certain degree of detail confusion and color degradation. The quantitative comparison is given in Table <xref rid="T4" ref-type="table">4</xref>. A typical phenomenon is that a fusion model trained on a given dataset performs best on the corresponding test set. Overall, the proposed method achieves more stable and stronger performance across the different testing datasets.</p>
        <p id="S4.SS6.p3">Channel in Diffusion UNet: In our fusion model, we compress the channel numbers of the diffusion UNet at each layer to 1/4, and compare this setting with three alternatives, i.e., the original channel numbers, 1/2, and 1/8. Note that we omit the qualitative comparison because the visual results are similar. Table <xref rid="T5" ref-type="table">5</xref> shows the quantitative validations on the TNO benchmark. It can be observed that the fusion performance generally decreases as the channel numbers are reduced (EN being the exception), while the parameter count, FLOPs, and runtime also decrease, i.e., the computational efficiency improves. When the channel numbers are reduced to 1/8, the performance becomes comparable to other fusion methods, such as Dif-Fusion and DDFM. In conclusion, we adopt 1/4 of the original channel numbers to achieve a better balance between fusion performance and computational efficiency.</p>
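        <p>The parameter counts in Table 5 shrink roughly quadratically with the channel ratio. This is what one would expect if most parameters sit in convolution layers, whose weight count is proportional to the product of input and output channels. The following Python sketch checks this rule of thumb against the table values. The quadratic rule is a simplifying assumption (it ignores biases, normalization layers, and layers with a fixed input or output width), and the width of 128 and the helper names are purely illustrative, not taken from the DMFuse implementation.</p>
        <p>
          <preformat>
```python
# Parameter counts (in millions) copied from Table 5, keyed by channel ratio.
params_m = {1.0: 392.724, 0.5: 98.680, 0.25: 24.967, 0.125: 6.433}


def conv_weights(c_in, c_out, k=3):
    """Weight count of a k x k convolution, ignoring the bias term."""
    return c_in * c_out * k * k


def predicted_reduction(ratio, width=128):
    """Reduction factor if a layer scales both channel counts by `ratio`.

    `width` is an arbitrary illustrative channel count (an assumption).
    """
    base = conv_weights(width, width)
    scaled = conv_weights(int(width * ratio), int(width * ratio))
    return base / scaled


for ratio in (0.5, 0.25, 0.125):
    observed = params_m[1.0] / params_m[ratio]
    predicted = predicted_reduction(ratio)
    print(f"ratio {ratio}: predicted {predicted:.1f}x, observed {observed:.1f}x")
```
          </preformat>
        </p>
        <p>Under this assumption the predicted reductions are 4x, 16x, and 64x, close to the observed reductions of about 4.0x, 15.7x, and 61.0x in Table 5.</p>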
        <p>
          <fig id="F11">
            <label>Figure 11.</label>
            <caption>
              <p>The visualization maps of different encoders.</p>
            </caption>
            <graphic xlink:href="Fig11.png"/>
          </fig>
        </p>
        <p id="S4.SS6.p4">Verification of Each Component: We employ the diffusion model to extract generative features and develop a cross-attention interactive fusion module to perform global interactions. To verify their effectiveness, we replace the diffusion model encoder with a UNet-style CNN encoder and replace CAIM with an addition operation, respectively. As shown in Figure <xref ref-type="fig" rid="F10">10</xref> (e) and (f), the fused images without the diffusion model, termed w/o Dif, lose some target brightness and meaningful details, while the fused results without CAIM, termed w/o CAIM, show limited visual quality. Meanwhile, we visualize the feature maps of the diffusion model encoder and the CNN encoder (referred to as w/o Dif) in Figure <xref ref-type="fig" rid="F11">11</xref>. The diffusion features (the first row) show clear advantages over the CNN features (the second row) in characterizing salient infrared targets and typical visible details. In addition, the quantitative results in Table <xref rid="T6" ref-type="table">6</xref> indicate that the proposed model achieves the best values on all metrics except Qe, where it ranks behind w/o Dif. The experiments demonstrate that both the diffusion model and CAIM improve the fusion performance.</p>
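        <p>To illustrate why replacing cross-attention with addition removes global interactions, the following Python sketch contrasts element-wise addition with a generic single-head scaled dot-product cross-attention. It is not the CAIM implementation: there are no learned projections, and the token values and function names are hypothetical. It only shows that addition at one position depends on that position alone, whereas a cross-attention output depends on every token of the other modality.</p>
        <p>
          <preformat>
```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query token attends to all key/value tokens of the other modality."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add_fusion(a, b):
    """Element-wise addition: position i only ever sees position i."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]


# Toy token features (hypothetical values): 3 positions, 2 channels per modality.
ir = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vis = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]

fused_add = add_fusion(ir, vis)
fused_attn = cross_attention(ir, vis, vis)  # infrared queries, visible keys/values

# Change only the last visible token and compare the outputs at position 0.
vis_changed = [vis[0], vis[1], [0.0, 5.0]]
add_unchanged = add_fusion(ir, vis_changed)[0] == fused_add[0]
attn_unchanged = cross_attention(ir, vis_changed, vis_changed)[0] == fused_attn[0]
print("addition output at position 0 unchanged:", add_unchanged)
print("cross-attention output at position 0 unchanged:", attn_unchanged)
```
          </preformat>
        </p>
        <p>In this toy setting, editing a distant visible token leaves the addition result at position 0 untouched but changes the cross-attention result there, which is the kind of long-range dependency the w/o CAIM variant cannot model.</p>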
        <p>
          <table-wrap id="T7">
            <label>Table 7</label>
            <caption>
              <p>The computational efficiency comparisons.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th style="border-top: 1px solid black;" rowspan="2" align="left">Methods</th>
                  <th style="border-top: 1px solid black;" rowspan="2" align="center">Params. (M)</th>
                  <th style="border-top: 1px solid black;" rowspan="2" align="center">FLOPs (G)</th>
                  <th style="border-top: 1px solid black;" colspan="2" align="center">Time (s)</th>
                </tr>
                <tr>
                  <th style="border-top: 1px solid black;" align="center">TNO</th>
                  <th style="border-top: 1px solid black;" align="center">M<sup>3</sup>FD</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <th style="border-top: 1px solid black;" align="left">U2Fusion [<xref rid="ref012" ref-type="bibr">12</xref>]</th>
                  <td style="border-top: 1px solid black;" align="center">0.659</td>
                  <td style="border-top: 1px solid black;" align="center">43.17</td>
                  <td style="border-top: 1px solid black;" align="center">1.722</td>
                  <td style="border-top: 1px solid black;" align="center">4.646</td>
                </tr>
                <tr>
                  <th align="left">RFN-Nest [<xref rid="ref013" ref-type="bibr">13</xref>]</th>
                  <td align="center">7.524</td>
                  <td align="center">111.1</td>
                  <td align="center">0.235</td>
                  <td align="center">0.864</td>
                </tr>
                <tr>
                  <th align="left">YDTR [<xref rid="ref015" ref-type="bibr">15</xref>]</th>
                  <td align="center">
                    <underline>0.107</underline>
                  </td>
                  <td align="center">
                    <underline>20.58</underline>
                  </td>
                  <td align="center">
                    <underline>0.201</underline>
                  </td>
                  <td align="center">
                    <underline>0.771</underline>
                  </td>
                </tr>
                <tr>
                  <th align="left">DATFuse [<xref rid="ref029" ref-type="bibr">29</xref>]</th>
                  <td align="center">
                    <bold>0.011</bold>
                  </td>
                  <td align="center">
                    <bold>1.185</bold>
                  </td>
                  <td align="center">
                    <bold>0.019</bold>
                  </td>
                  <td align="center">
                    <bold>0.047</bold>
                  </td>
                </tr>
                <tr>
                  <th align="left">FusionGAN [<xref rid="ref016" ref-type="bibr">16</xref>]</th>
                  <td align="center">1.314</td>
                  <td align="center">57.09</td>
                  <td align="center">0.513</td>
                  <td align="center">0.988</td>
                </tr>
                <tr>
                  <th align="left">Dif-Fusion [<xref rid="ref020" ref-type="bibr">20</xref>]</th>
                  <td align="center">434.2</td>
                  <td align="center">726.1</td>
                  <td align="center">4.820</td>
                  <td align="center">17.21</td>
                </tr>
                <tr>
                  <th align="left">DDFM [<xref rid="ref019" ref-type="bibr">19</xref>]</th>
                  <td align="center">988.3</td>
                  <td align="center">2946</td>
                  <td align="center">59.18</td>
                  <td align="center">162.1</td>
                </tr>
                <tr>
                  <th style="border-bottom: 1px solid black;" align="left">Ours</th>
                  <td style="border-bottom: 1px solid black;" align="center">24.96</td>
                  <td style="border-bottom: 1px solid black;" align="center">106.6</td>
                  <td style="border-bottom: 1px solid black;" align="center">2.624</td>
                  <td style="border-bottom: 1px solid black;" align="center">5.342</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>
        </p>
      </sec>
      <sec id="S4.SS7">
        <label>4.7</label>
        <title>Efficiency Comparison</title>
        <p id="S4.SS7.p1">We also conduct experiments to evaluate the operational efficiency of different methods, including the number of parameters (Params.), floating-point operations (FLOPs), and runtime (Time). Table <xref rid="T7" ref-type="table">7</xref> presents their computational complexity. Note that the FLOPs are computed on a test image of size 256<inline-formula><mml:math alttext="\times" display="inline"><mml:mo>×</mml:mo></mml:math></inline-formula>256. Compared with the diffusion-based methods, the non-generative fusion schemes, including U2Fusion, RFN-Nest, YDTR, and DATFuse, and the GAN-based method, i.e., FusionGAN, generally have a significant advantage in terms of parameters, FLOPs, and runtime. The main reason is that the diffusion model requires many iteration steps and consumes massive computational resources. However, since we train a more efficient model by compressing the channels of the diffusion UNet to 1/4, the proposed model has higher operational efficiency than Dif-Fusion and DDFM, which indicates the effectiveness of the lightweight design.</p>
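        <p>The efficiency gap between the proposed model and the two diffusion-based baselines can be quantified directly from Table 7. The short Python sketch below computes, for each cost column, how many times larger the baseline value is than ours. The numbers are copied from the table; the dictionary layout and the function name are illustrative choices.</p>
        <p>
          <preformat>
```python
# (Params in M, FLOPs in G, TNO time in s, M3FD time in s), copied from Table 7.
table7 = {
    "Dif-Fusion": (434.2, 726.1, 4.820, 17.21),
    "DDFM": (988.3, 2946.0, 59.18, 162.1),
    "Ours": (24.96, 106.6, 2.624, 5.342),
}
columns = ("Params", "FLOPs", "Time (TNO)", "Time (M3FD)")


def ratios_vs_ours(method):
    """Return how many times larger each cost of `method` is than that of Ours."""
    return [m / o for m, o in zip(table7[method], table7["Ours"])]


for method in ("Dif-Fusion", "DDFM"):
    parts = [f"{c}: {r:.1f}x" for c, r in zip(columns, ratios_vs_ours(method))]
    print(method, "/ Ours ->", ", ".join(parts))
```
          </preformat>
        </p>
        <p>With the Table 7 values, Dif-Fusion is about 17.4x larger in parameters, 6.8x in FLOPs, and 1.8x (TNO) and 3.2x (M<sup>3</sup>FD) slower, while DDFM is about 39.6x larger in parameters, 27.6x in FLOPs, and 22.6x (TNO) and 30.3x (M<sup>3</sup>FD) slower than the proposed model.</p>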
      </sec>
    </sec>
    <sec id="S5">
      <label>5.</label>
      <title>Discussion</title>
      <p id="S5.p1">The diffusion model showcases powerful generative capabilities and has demonstrated outstanding performance in image fusion. Nevertheless, its computational inefficiency remains a significant challenge because of the large number of iterative steps and the complexity of the calculations. These factors lead to a slow diffusion process, which restricts its applicability in scenarios with limited computing resources. In future work, we aim to tackle these challenges by exploring optimization strategies such as sampling optimization [<xref rid="ref048" ref-type="bibr">48</xref>] to reduce the number of iteration steps and latent space transformation [<xref rid="ref049" ref-type="bibr">49</xref>] to streamline computations. These efforts will concentrate on enhancing computational efficiency while maintaining or improving the quality of the fused results.</p>
    </sec>
    <sec id="S6">
      <label>6.</label>
      <title>Conclusion</title>
      <p id="S6.p1">This paper presents DMFuse, a novel diffusion model-guided cross-attention learning network designed for infrared and visible image fusion. Unlike existing methods, the proposed model trains a lightweight diffusion model to serve as an autoencoder, effectively integrating its high-quality generative capability into the fusion task. Moreover, we develop a cross-attention interactive fusion module that facilitates global interactions, strengthening the complementary characteristics of different modalities. We evaluate the performance of DMFuse against seven SOTA methods on the TNO, M<sup>3</sup>FD, and Harvard MIF benchmarks. The experimental results validate that the proposed model achieves superior fusion performance and competitive computational efficiency. Furthermore, DMFuse benefits downstream applications, including object detection and semantic segmentation. In future work, we will explore the integration of diffusion models with large language models (LLMs), introducing text descriptions as a semantic guide to further enhance the quality of the fused images.</p>
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgments</title>
      <p id="ack.p1">This work was supported in part by the Fundamental Research Program of Shanxi Province under Grant 202203021221144, and the Patent Transformation Program of Shanxi Province under Grant 202405012.</p>
    </ack>
    <sec id="sec0100" sec-type="COI-statement">
      <title>Conflict of interest</title>
      <p>The authors declare no conflicts of interest.</p>
    </sec>
    <ref-list>
      <title>References</title>
      <ref id="ref001">
        <label>[1]</label>
        <mixed-citation> Liu, J., Wang, J., Huang, N., Zhang, Q., &amp; Han, J. (2022). Revisiting modality-specific feature compensation for visible-infrared person re-identification. <italic>IEEE Transactions on Circuits and Systems for Video Technology, 32</italic>(10), 7226-7240. [<uri>https://doi.org/10.1109/TCSVT.2022.3168999</uri>] </mixed-citation>
      </ref>
      <ref id="ref002">
        <label>[2]</label>
        <mixed-citation> Wang, J., Song, K., Bao, Y., Huang, L., &amp; Yan, Y. (2021). CGFNet: Cross-guided fusion network for RGB-T salient object detection. <italic>IEEE Transactions on Circuits and Systems for Video Technology, 32</italic>(5), 2949-2961. [<uri>https://doi.org/10.1109/TCSVT.2021.3099120</uri>] </mixed-citation>
      </ref>
      <ref id="ref003">
        <label>[3]</label>
        <mixed-citation> Wang, Y., Wei, X., Tang, X., Yu, K., &amp; Luo, L. (2023). RGBT tracking using randomly projected CNN features. <italic>Expert Systems with Applications, 223</italic>, 119865. [<uri>https://doi.org/10.1016/j.eswa.2023.119865</uri>] </mixed-citation>
      </ref>
      <ref id="ref004">
        <label>[4]</label>
        <mixed-citation> Chen, J., Li, X., Luo, L., Mei, X., &amp; Ma, J. (2020). Infrared and visible image fusion based on target-enhanced multiscale transform decomposition. <italic>Information Sciences, 508</italic>, 64-78. [<uri>https://doi.org/10.1016/j.ins.2019.08.066</uri>] </mixed-citation>
      </ref>
      <ref id="ref005">
        <label>[5]</label>
        <mixed-citation> Li, H., Wu, X. J., &amp; Kittler, J. (2020). MDLatLRR: A novel decomposition method for infrared and visible image fusion. <italic>IEEE Transactions on Image Processing, 29</italic>, 4733-4746. [<uri>https://doi.org/10.1109/TIP.2020.2975984</uri>] </mixed-citation>
      </ref>
      <ref id="ref006">
        <label>[6]</label>
        <mixed-citation> Kong, W., Lei, Y., &amp; Zhao, H. (2014). Adaptive fusion method of visible light and infrared images based on non-subsampled shearlet transform and fast non-negative matrix factorization. <italic>Infrared Physics &amp; Technology, 67</italic>, 161-172. [<uri>https://doi.org/10.1016/j.infrared.2014.07.019</uri>] </mixed-citation>
      </ref>
      <ref id="ref007">
        <label>[7]</label>
        <mixed-citation> Ma, C., Nie, R., Ding, H., Cao, J., &amp; Mei, J. (2023). A fractional-order variation with a novel norm to fuse infrared and visible images. <italic>IEEE Transactions on Instrumentation and Measurement, 72</italic>, 1-12. [<uri>https://doi.org/10.1109/TIM.2023.3244817</uri>] </mixed-citation>
      </ref>
      <ref id="ref008">
        <label>[8]</label>
        <mixed-citation> Zou, D., &amp; Yang, B. (2023). Infrared and low-light visible image fusion based on hybrid multiscale decomposition and adaptive light adjustment. <italic>Optics and Lasers in Engineering, 160</italic>, 107268. [<uri>https://doi.org/10.1016/j.optlaseng.2022.107268</uri>] </mixed-citation>
      </ref>
      <ref id="ref009">
        <label>[9]</label>
        <mixed-citation> Zhao, Z., Xu, S., Zhang, C., Liu, J., &amp; Zhang, J. (2020). Bayesian fusion for infrared and visible images. <italic>Signal Processing, 177</italic>, 107734. [<uri>https://doi.org/10.1016/j.sigpro.2020.107734</uri>] </mixed-citation>
      </ref>
      <ref id="ref010">
        <label>[10]</label>
        <mixed-citation> Li, H., &amp; Wu, X. J. (2018). DenseFuse: A fusion approach to infrared and visible images. <italic>IEEE Transactions on Image Processing, 28</italic>(5), 2614-2623. [<uri>https://doi.org/10.1109/TIP.2018.2887342</uri>] </mixed-citation>
      </ref>
      <ref id="ref011">
        <label>[11]</label>
        <mixed-citation> Li, H., Wu, X. J., &amp; Durrani, T. (2020). NestFuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models. <italic>IEEE Transactions on Instrumentation and Measurement, 69</italic>(12), 9645-9656. [<uri>https://doi.org/10.1109/TIM.2020.3005230</uri>] </mixed-citation>
      </ref>
      <ref id="ref012">
        <label>[12]</label>
        <mixed-citation> Xu, H., Ma, J., Jiang, J., Guo, X., &amp; Ling, H. (2020). U2Fusion: A unified unsupervised image fusion network. <italic>IEEE Transactions on Pattern Analysis and Machine Intelligence, 44</italic>(1), 502-518. [<uri>https://doi.org/10.1109/TPAMI.2020.3012548</uri>] </mixed-citation>
      </ref>
      <ref id="ref013">
        <label>[13]</label>
        <mixed-citation> Li, H., Wu, X. J., &amp; Kittler, J. (2021). RFN-Nest: An end-to-end residual fusion network for infrared and visible images. <italic>Information Fusion, 73</italic>, 72-86. [<uri>https://doi.org/10.1016/j.inffus.2021.02.023</uri>] </mixed-citation>
      </ref>
      <ref id="ref014">
        <label>[14]</label>
        <mixed-citation> Pang, S., Huo, H., Liu, X., Zheng, B., &amp; Li, J. (2024). SDTFusion: A split-head dense transformer based network for infrared and visible image fusion. <italic>Infrared Physics &amp; Technology, 138</italic>, 105209. [<uri>https://doi.org/10.1016/j.infrared.2024.105209</uri>] </mixed-citation>
      </ref>
      <ref id="ref015">
        <label>[15]</label>
        <mixed-citation> Tang, W., He, F., &amp; Liu, Y. (2022). YDTR: Infrared and visible image fusion via Y-shape dynamic transformer. <italic>IEEE Transactions on Multimedia, 25</italic>, 5413-5428. [<uri>https://doi.org/10.1109/TMM.2022.3192661</uri>] </mixed-citation>
      </ref>
      <ref id="ref016">
        <label>[16]</label>
        <mixed-citation> Ma, J., Yu, W., Liang, P., Li, C., &amp; Jiang, J. (2019). FusionGAN: A generative adversarial network for infrared and visible image fusion. <italic>Information Fusion, 48</italic>, 11-26. [<uri>https://doi.org/10.1016/j.inffus.2018.09.004</uri>] </mixed-citation>
      </ref>
      <ref id="ref017">
        <label>[17]</label>
        <mixed-citation> Ma, J., Zhang, H., Shao, Z., Liang, P., &amp; Xu, H. (2020). GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion. <italic>IEEE Transactions on Instrumentation and Measurement, 70</italic>, 1-14. [<uri>https://doi.org/10.1109/TIM.2020.3038013</uri>] </mixed-citation>
      </ref>
      <ref id="ref018">
        <label>[18]</label>
        <mixed-citation> Ho, J., Jain, A., &amp; Abbeel, P. (2020). Denoising diffusion probabilistic models. <italic>Advances in Neural Information Processing Systems, 33</italic>, 6840-6851. </mixed-citation>
      </ref>
      <ref id="ref019">
        <label>[19]</label>
        <mixed-citation> Zhao, Z., Bai, H., Zhu, Y., Zhang, J., Xu, S., Zhang, Y., … &amp; Van Gool, L. (2023). DDFM: Denoising diffusion model for multi-modality image fusion. In <italic>Proceedings of the IEEE/CVF International Conference on Computer Vision</italic> (pp. 8082-8093). [<uri>https://doi.org/10.1109/ICCV51070.2023.00742</uri>] </mixed-citation>
      </ref>
      <ref id="ref020">
        <label>[20]</label>
        <mixed-citation> Yue, J., Fang, L., Xia, S., Deng, Y., &amp; Ma, J. (2023). Dif-Fusion: Towards high color fidelity in infrared and visible image fusion with diffusion models. <italic>IEEE Transactions on Image Processing</italic>. [<uri>https://doi.org/10.1109/TIP.2023.3322046</uri>] </mixed-citation>
      </ref>
      <ref id="ref021">
        <label>[21]</label>
        <mixed-citation> Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., … &amp; Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In <italic>Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13</italic> (pp. 740-755). Springer International Publishing. </mixed-citation>
      </ref>
      <ref id="ref022">
        <label>[22]</label>
        <mixed-citation> Zhao, Z., Xu, S., Zhang, J., Liang, C., Zhang, C., &amp; Liu, J. (2021). Efficient and model-based infrared and visible image fusion via algorithm unrolling. <italic>IEEE Transactions on Circuits and Systems for Video Technology, 32</italic>(3), 1186-1196. [<uri>https://doi.org/10.1109/TCSVT.2021.3075745</uri>] </mixed-citation>
      </ref>
      <ref id="ref023">
        <label>[23]</label>
        <mixed-citation> Jian, L., Yang, X., Liu, Z., Jeon, G., Gao, M., &amp; Chisholm, D. (2020). SEDRFuse: A symmetric encoder–decoder with residual block network for infrared and visible image fusion. <italic>IEEE Transactions on Instrumentation and Measurement, 70</italic>, 1-15. [<uri>https://doi.org/10.1109/TIM.2020.3022438</uri>] </mixed-citation>
      </ref>
      <ref id="ref024">
        <label>[24]</label>
        <mixed-citation> Jian, L., Rayhana, R., Ma, L., Wu, S., Liu, Z., &amp; Jiang, H. (2021). Infrared and visible image fusion based on deep decomposition network and saliency analysis. <italic>IEEE Transactions on Multimedia, 24</italic>, 3314-3326. [<uri>https://doi.org/10.1109/TMM.2021.3096088</uri>] </mixed-citation>
      </ref>
      <ref id="ref025">
        <label>[25]</label>
        <mixed-citation> Li, H., Xu, T., Wu, X. J., Lu, J., &amp; Kittler, J. (2023). LRRNet: A novel representation learning guided fusion network for infrared and visible images. <italic>IEEE Transactions on Pattern Analysis and Machine Intelligence, 45</italic>(9), 11040-11052. [<uri>https://doi.org/10.1109/TPAMI.2023.3268209</uri>] </mixed-citation>
      </ref>
      <ref id="ref026">
        <label>[26]</label>
        <mixed-citation> An, R., Liu, G., Qian, Y., Xing, M., &amp; Tang, H. (2024). MRASFusion: A multi-scale residual attention infrared and visible image fusion network based on semantic segmentation guidance. <italic>Infrared Physics &amp; Technology, 139</italic>, 105343. [<uri>https://doi.org/10.1016/j.infrared.2024.105343</uri>] </mixed-citation>
      </ref>
      <ref id="ref027">
        <label>[27]</label>
        <mixed-citation> Chen, B., Luo, S., Wu, H., Chen, M., &amp; He, C. (2024). Infrared and visible image fusion and detection based on interactive training strategy and feature filter extraction module. <italic>Optics &amp; Laser Technology, 179</italic>, 111383. [<uri>https://doi.org/10.1016/j.optlastec.2024.111383</uri>] </mixed-citation>
      </ref>
      <ref id="ref028">
        <label>[28]</label>
        <mixed-citation> Zhu, P., Yin, Y., &amp; Zhou, X. (2025). MGRCFusion: An infrared and visible image fusion network based on multi-scale group residual convolution. <italic>Optics &amp; Laser Technology, 180</italic>, 111576. [<uri>https://doi.org/10.1016/j.optlastec.2024.111576</uri>] </mixed-citation>
      </ref>
      <ref id="ref029">
        <label>[29]</label>
        <mixed-citation> Tang, W., He, F., Liu, Y., Duan, Y., &amp; Si, T. (2023). DATFuse: Infrared and visible image fusion via dual attention transformer. <italic>IEEE Transactions on Circuits and Systems for Video Technology, 33</italic>(7), 3159-3172. [<uri>https://doi.org/10.1109/TCSVT.2023.3234340</uri>] </mixed-citation>
      </ref>
      <ref id="ref030">
        <label>[30]</label>
        <mixed-citation> Ma, J., Tang, L., Fan, F., Huang, J., Mei, X., &amp; Ma, Y. (2022). SwinFusion: Cross-domain long-range learning for general image fusion via Swin transformer. <italic>IEEE/CAA Journal of Automatica Sinica, 9</italic>(7), 1200-1217. [<uri>https://doi.org/10.1109/JAS.2022.105686</uri>] </mixed-citation>
      </ref>
      <ref id="ref031">
        <label>[31]</label>
        <mixed-citation> Tang, W., He, F., &amp; Liu, Y. (2023). TCCFusion: An infrared and visible image fusion method based on transformer and cross correlation. <italic>Pattern Recognition, 137</italic>, 109295. [<uri>https://doi.org/10.1016/j.patcog.2022.109295</uri>] </mixed-citation>
      </ref>
      <ref id="ref032">
        <label>[32]</label>
        <mixed-citation> Liu, J., Liu, Z., Wu, G., Ma, L., Liu, R., Zhong, W., … &amp; Fan, X. (2023). Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. In <italic>Proceedings of the IEEE/CVF International Conference on Computer Vision</italic> (pp. 8115-8124). [<uri>https://doi.org/10.1109/ICCV51070.2023.00745</uri>] </mixed-citation>
      </ref>
      <ref id="ref033">
        <label>[33]</label>
        <mixed-citation> Liu, J., Fan, X., Huang, Z., Wu, G., Liu, R., Zhong, W., &amp; Luo, Z. (2022). Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In <italic>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic> (pp. 5802-5811). [<uri>https://doi.org/10.1109/CVPR52688.2022.00571</uri>] </mixed-citation>
      </ref>
      <ref id="ref034">
        <label>[34]</label>
        <mixed-citation> Wang, Z., Shao, W., Chen, Y., Xu, J., &amp; Zhang, X. (2022). Infrared and visible image fusion via interactive compensatory attention adversarial learning. <italic>IEEE Transactions on Multimedia, 25</italic>, 7800-7813. [<uri>https://doi.org/10.1109/TMM.2022.3228685</uri>] </mixed-citation>
      </ref>
      <ref id="ref035">
        <label>[35]</label>
        <mixed-citation> Wang, Z., Shao, W., Chen, Y., Xu, J., &amp; Zhang, L. (2023). A cross-scale iterative attentional adversarial fusion network for infrared and visible images. <italic>IEEE Transactions on Circuits and Systems for Video Technology, 33</italic>(8), 3677-3688. [<uri>https://doi.org/10.1109/TCSVT.2023.3239627</uri>] </mixed-citation>
      </ref>
      <ref id="ref036">
        <label>[36]</label>
        <mixed-citation> Wang, Z., Zhang, Z., Qi, W., Yang, F., &amp; Xu, J. (2024). FreqGAN: Infrared and visible image fusion via unified frequency adversarial learning. <italic>IEEE Transactions on Circuits and Systems for Video Technology</italic>. [<uri>https://doi.org/10.1109/TCSVT.2024.3460172</uri>] </mixed-citation>
      </ref>
      <ref id="ref037">
        <label>[37]</label>
        <mixed-citation> Huang, Z., Wang, X., Wei, Y., Huang, L., Shi, H., Liu, W., &amp; Huang, T. S. (2023). CCNet: Criss-cross attention for semantic segmentation. <italic>IEEE Transactions on Pattern Analysis and Machine Intelligence, 45</italic>(6), 6896-6908. [<uri>https://doi.org/10.1109/TPAMI.2020.3007032</uri>] </mixed-citation>
      </ref>
      <ref id="ref038">
        <label>[38]</label>
        <mixed-citation> Roberts, J. W., Van Aardt, J. A., &amp; Ahmed, F. B. (2008). Assessment of image fusion procedures using entropy, image quality, and multispectral classification. <italic>Journal of Applied Remote Sensing, 2</italic>(1), 023522. [<uri>https://doi.org/10.1117/1.2945910</uri>] </mixed-citation>
      </ref>
      <ref id="ref039">
        <label>[39]</label>
        <mixed-citation> Rao, Y. J. (1997). In-fibre Bragg grating sensors. <italic>Measurement Science and Technology, 8</italic>(4), 355. [<uri>https://doi.org/10.1088/0957-0233/8/4/002</uri>] </mixed-citation>
      </ref>
      <ref id="ref040">
        <label>[40]</label>
        <mixed-citation> Liu, Z., Forsyth, D. S., &amp; Laganière, R. (2008). A feature-based metric for the quantitative evaluation of pixel-level image fusion. <italic>Computer Vision and Image Understanding, 109</italic>(1), 56-68. [<uri>https://doi.org/10.1016/j.cviu.2007.04.003</uri>] </mixed-citation>
      </ref>
      <ref id="ref041">
        <label>[41]</label>
        <mixed-citation> Haghighat, M., &amp; Razian, M. A. (2014, October). Fast-FMI: Non-reference image fusion metric. In <italic>2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT)</italic> (pp. 1-3). IEEE. [<uri>https://doi.org/10.1109/ICAICT.2014.7036000</uri>] </mixed-citation>
      </ref>
      <ref id="ref042">
        <label>[42]</label>
        <mixed-citation> Piella, G., &amp; Heijmans, H. (2003, September). A new quality metric for image fusion. In <italic>Proceedings 2003 International Conference on Image Processing</italic> (Cat. No. 03CH37429) (Vol. 3, pp. III-173). IEEE. [<uri>https://doi.org/10.1109/ICIP.2003.1247209</uri>] </mixed-citation>
      </ref>
      <ref id="ref043">
        <label>[43]</label>
        <mixed-citation> Xydeas, C. S., &amp; Petrovic, V. (2000). Objective image fusion performance measure. <italic>Electronics Letters, 36</italic>(4), 308-309. </mixed-citation>
      </ref>
      <ref id="ref044">
        <label>[44]</label>
        <mixed-citation> Ma, K., Zeng, K., &amp; Wang, Z. (2015). Perceptual quality assessment for multi-exposure image fusion. <italic>IEEE Transactions on Image Processing, 24</italic>(11), 3345-3356. [<uri>https://doi.org/10.1109/TIP.2015.2442920</uri>] </mixed-citation>
      </ref>
      <ref id="ref045">
        <label>[45]</label>
        <mixed-citation> Han, Y., Cai, Y., Cao, Y., &amp; Xu, X. (2013). A new image fusion performance metric based on visual information fidelity. <italic>Information Fusion, 14</italic>(2), 127-135. [<uri>https://doi.org/10.1016/j.inffus.2011.08.002</uri>] </mixed-citation>
      </ref>
      <ref id="ref046">
        <label>[46]</label>
        <mixed-citation> Sun, Y., Meng, Y., Wang, Q., Tang, M., Shen, T., &amp; Wang, Q. (2023, August). Visible and infrared image fusion for object detection: a survey. In <italic>International Conference on Image, Vision and Intelligent Systems</italic> (pp. 236-248). Singapore: Springer Nature Singapore. [<uri>https://doi.org/10.1007/978-981-97-0855-0_24</uri>] </mixed-citation>
      </ref>
      <ref id="ref047">
        <label>[47]</label>
        <mixed-citation> Wang, D., Liu, J., Liu, R., &amp; Fan, X. (2023). An interactively reinforced paradigm for joint infrared-visible image fusion and saliency object detection. <italic>Information Fusion, 98</italic>, 101828. [<uri>https://doi.org/10.1016/j.inffus.2023.101828</uri>] </mixed-citation>
      </ref>
      <ref id="ref048">
        <label>[48]</label>
        <mixed-citation> Xue, S., Liu, Z., Chen, F., Zhang, S., Hu, T., Xie, E., &amp; Li, Z. (2024). Accelerating diffusion sampling with optimized time steps. In <italic>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic> (pp. 8292-8301). </mixed-citation>
      </ref>
      <ref id="ref049">
        <label>[49]</label>
        <mixed-citation> Rombach, R., Blattmann, A., Lorenz, D., Esser, P., &amp; Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In <italic>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</italic> (pp. 10684-10695). [<uri>https://doi.org/10.1109/CVPR52688.2022.01042</uri>] </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>
