ICCK Transactions on Sensing, Communication, and Control, Volume 3, Issue 1, 2026: 1-14

Free to Read | Research Article | 29 January 2026
Learning Cross-Modal Collaboration via Pyramid Attention for RGB Thermal Sensing in Saliency Detection
Muhammad Zain Hassan, Alexandros Gazis, Abdurrahman Khan, and Zainab Ghazanfar *
1 Department of Software Engineering, University of Haripur, Haripur 22620, Pakistan
2 Democritus University of Thrace, Xanthi 67100, Greece
3 Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom
4 Global Degree College, Peshawar 25000, Pakistan
5 Department of AI and Software, Gachon University, Seongnam 13120, Republic of Korea
* Corresponding Author: Zainab Ghazanfar, [email protected]
ARK: ark:/57805/tscc.2025.210523
Received: 05 December 2025, Accepted: 29 December 2025, Published: 29 January 2026  
Abstract
RGB–thermal (RGB-T) salient object detection exploits complementary cues from visible and thermal sensors to maintain reliable performance in adverse environments. However, many existing methods (i) fuse modalities before sufficiently enhancing intra-modal semantics and (ii) are sensitive to modality discrepancies caused by heterogeneous sensor characteristics. To address these issues, we propose PACNet (Pyramid Attention Collaboration Network), a hierarchical RGB-T framework that jointly models multi-scale and global context and performs refinement-before-fusion with cross-modal collaboration. Specifically, Dense Atrous Spatial Pyramid Pooling (DASPP) captures multi-scale contextual cues across semantic stages, while Multi-Head Self-Attention (MHSA) establishes long-range dependencies for global context modeling. We further design a hierarchical feature integration scheme that constructs two complementary feature streams, preserving fine-grained spatial details and strengthening high-level semantics. These streams are refined using a cross-interactive dual-attention module that enables bidirectional interaction between spatial and channel attention, improving localization and semantic discrimination while mitigating modality imbalance. Experiments on three public benchmarks (VT821, VT1000, and VT5000) demonstrate that PACNet achieves state-of-the-art performance and delivers consistent gains in challenging conditions such as low illumination, thermal clutter, and multi-scale targets.
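To make the abstract's "cross-interactive dual-attention" idea concrete, the block below gives a minimal, hypothetical PyTorch sketch: each modality's features are first re-weighted by their own channel attention, then by a spatial attention map computed from the other modality, and a 1×1 convolution fuses the two refined streams. This is an illustrative reading of the described mechanism, not the authors' PACNet code; the class names, reduction ratio, and 7×7 spatial-attention kernel are assumptions.

```python
# Minimal, hypothetical PyTorch sketch of a cross-interactive dual-attention
# refinement step (not the authors' PACNet implementation).
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel re-weighting of one modality."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)  # broadcast the B x C x 1 x 1 gate over H x W


class SpatialAttention(nn.Module):
    """Produce a B x 1 x H x W attention map from pooled channel statistics."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_pool = x.mean(dim=1, keepdim=True)
        max_pool, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))


class CrossInteractiveDualAttention(nn.Module):
    """Refine RGB and thermal streams with exchanged spatial attention cues."""

    def __init__(self, channels: int):
        super().__init__()
        self.ca_rgb = ChannelAttention(channels)
        self.ca_t = ChannelAttention(channels)
        self.sa_from_rgb = SpatialAttention()
        self.sa_from_t = SpatialAttention()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        # Intra-modal refinement: channel attention within each stream.
        r = self.ca_rgb(f_rgb)
        t = self.ca_t(f_t)
        # Cross interaction: each stream is re-weighted by the spatial
        # attention map computed from the *other* stream.
        r_refined = r * self.sa_from_t(t)
        t_refined = t * self.sa_from_rgb(r)
        # Fuse the two refined streams with a 1x1 convolution.
        return self.fuse(torch.cat([r_refined, t_refined], dim=1))


if __name__ == "__main__":
    rgb_feat = torch.randn(1, 64, 80, 80)
    thermal_feat = torch.randn(1, 64, 80, 80)
    fused = CrossInteractiveDualAttention(64)(rgb_feat, thermal_feat)
    print(fused.shape)  # torch.Size([1, 64, 80, 80])
```

Under these assumptions, the exchange of spatial maps is what lets one modality compensate for the other (e.g., thermal cues guiding RGB localization in low illumination), while the per-modality channel attention preserves the refinement-before-fusion order described in the abstract.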

Graphical Abstract

Keywords
salient object detection
RGB-thermal fusion
cross-interactive dual attention
multi-modal learning

Data Availability Statement
Data will be made available on request.

Funding
This work received no external funding.

Conflicts of Interest
The authors declare no conflicts of interest.

AI Use Statement
The authors declare that no generative AI was used in the preparation of this manuscript.

Ethical Approval and Consent to Participate
Not applicable.


Cite This Article
APA Style
Hassan, M. Z., Gazis, A., Khan, A., & Ghazanfar, Z. (2026). Learning Cross-Modal Collaboration via Pyramid Attention for RGB Thermal Sensing in Saliency Detection. ICCK Transactions on Sensing, Communication, and Control, 3(1), 1–14. https://doi.org/10.62762/TSCC.2025.210523
Export Citation
RIS Format
Compatible with EndNote, Zotero, Mendeley, and other reference managers
TY  - JOUR
AU  - Hassan, Muhammad Zain
AU  - Gazis, Alexandros
AU  - Khan, Abdurrahman
AU  - Ghazanfar, Zainab
PY  - 2026
DA  - 2026/01/29
TI  - Learning Cross-Modal Collaboration via Pyramid Attention for RGB Thermal Sensing in Saliency Detection
JO  - ICCK Transactions on Sensing, Communication, and Control
T2  - ICCK Transactions on Sensing, Communication, and Control
JF  - ICCK Transactions on Sensing, Communication, and Control
VL  - 3
IS  - 1
SP  - 1
EP  - 14
DO  - 10.62762/TSCC.2025.210523
UR  - https://www.icck.org/article/abs/TSCC.2025.210523
KW  - salient object detection
KW  - RGB-thermal fusion
KW  - cross-interactive dual attention
KW  - multi-modal learning
AB  - RGB–thermal (RGB-T) salient object detection exploits complementary cues from visible and thermal sensors to maintain reliable performance in adverse environments. However, many existing methods (i) fuse modalities before sufficiently enhancing intra-modal semantics and (ii) are sensitive to modality discrepancies caused by heterogeneous sensor characteristics. To address these issues, we propose PACNet (Pyramid Attention Collaboration Network), a hierarchical RGB-T framework that jointly models multi-scale and global context and performs refinement-before-fusion with cross-modal collaboration. Specifically, Dense Atrous Spatial Pyramid Pooling (DASPP) captures multi-scale contextual cues across semantic stages, while Multi-Head Self-Attention (MHSA) establishes long-range dependencies for global context modeling. We further design a hierarchical feature integration scheme that constructs two complementary feature streams, preserving fine-grained spatial details and strengthening high-level semantics. These streams are refined using a cross-interactive dual-attention module that enables bidirectional interaction between spatial and channel attention, improving localization and semantic discrimination while mitigating modality imbalance. Experiments on three public benchmarks (VT821, VT1000, and VT5000) demonstrate that PACNet achieves state-of-the-art performance and delivers consistent gains in challenging conditions such as low illumination, thermal clutter, and multi-scale targets.
SN  - 3068-9287
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  - 
BibTeX Format
Compatible with LaTeX, BibTeX, and other reference managers
@article{Hassan2026Learning,
  author = {Muhammad Zain Hassan and Alexandros Gazis and Abdurrahman Khan and Zainab Ghazanfar},
  title = {Learning Cross-Modal Collaboration via Pyramid Attention for RGB Thermal Sensing in Saliency Detection},
  journal = {ICCK Transactions on Sensing, Communication, and Control},
  year = {2026},
  volume = {3},
  number = {1},
  pages = {1-14},
  doi = {10.62762/TSCC.2025.210523},
  url = {https://www.icck.org/article/abs/TSCC.2025.210523},
  abstract = {RGB–thermal (RGB-T) salient object detection exploits complementary cues from visible and thermal sensors to maintain reliable performance in adverse environments. However, many existing methods (i) fuse modalities before sufficiently enhancing intra-modal semantics and (ii) are sensitive to modality discrepancies caused by heterogeneous sensor characteristics. To address these issues, we propose PACNet (Pyramid Attention Collaboration Network), a hierarchical RGB-T framework that jointly models multi-scale and global context and performs refinement-before-fusion with cross-modal collaboration. Specifically, Dense Atrous Spatial Pyramid Pooling (DASPP) captures multi-scale contextual cues across semantic stages, while Multi-Head Self-Attention (MHSA) establishes long-range dependencies for global context modeling. We further design a hierarchical feature integration scheme that constructs two complementary feature streams, preserving fine-grained spatial details and strengthening high-level semantics. These streams are refined using a cross-interactive dual-attention module that enables bidirectional interaction between spatial and channel attention, improving localization and semantic discrimination while mitigating modality imbalance. Experiments on three public benchmarks (VT821, VT1000, and VT5000) demonstrate that PACNet achieves state-of-the-art performance and delivers consistent gains in challenging conditions such as low illumination, thermal clutter, and multi-scale targets.},
  keywords = {salient object detection, RGB-thermal fusion, cross-interactive dual attention, multi-modal learning},
  issn = {3068-9287},
  publisher = {Institute of Central Computation and Knowledge}
}

Article Metrics
Citations: Crossref: 0 | Scopus: 0 | Web of Science: 0
Article Access Statistics: Views: 39 | PDF Downloads: 8

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Sensing, Communication, and Control
ISSN: 3068-9287 (Online) | ISSN: 3068-9279 (Print)
Email: [email protected]

Portico
All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/