ICCK Transactions on Sensing, Communication, and Control, Volume 3, Issue 1, 2026: 1-14

Free to Read | Research Article | 29 January 2026
Learning Cross-Modal Collaboration via Pyramid Attention for RGB Thermal Sensing in Saliency Detection
Muhammad Zain Hassan, Alexandros Gazis, Abdurrahman Khan, and Zainab Ghazanfar *
1 Department of Software Engineering, University of Haripur, Haripur 22620, Pakistan
2 Democritus University of Thrace, Xanthi 67100, Greece
3 Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom
4 Global Degree College, Peshawar 25000, Pakistan
5 Department of AI and Software, Gachon University, Seongnam 13120, Republic of Korea
* Corresponding Author: Zainab Ghazanfar, [email protected]
ARK: ark:/57805/tscc.2025.210523
Received: 05 December 2025, Accepted: 29 December 2025, Published: 29 January 2026  
Abstract
RGB–thermal (RGB-T) salient object detection exploits complementary cues from visible and thermal sensors to maintain reliable performance in adverse environments. However, many existing methods (i) fuse modalities before sufficiently enhancing intra-modal semantics and (ii) are sensitive to modality discrepancies caused by heterogeneous sensor characteristics. To address these issues, we propose PACNet (Pyramid Attention Collaboration Network), a hierarchical RGB-T framework that jointly models multi-scale and global context and performs refinement-before-fusion with cross-modal collaboration. Specifically, Dense Atrous Spatial Pyramid Pooling (DASPP) captures multi-scale contextual cues across semantic stages, while Multi-Head Self-Attention (MHSA) establishes long-range dependencies for global context modeling. We further design a hierarchical feature integration scheme that constructs two complementary feature streams, preserving fine-grained spatial details and strengthening high-level semantics. These streams are refined using a cross-interactive dual-attention module that enables bidirectional interaction between spatial and channel attention, improving localization and semantic discrimination while mitigating modality imbalance. Experiments on three public benchmarks (VT821, VT1000, and VT5000) demonstrate that PACNet achieves state-of-the-art performance and delivers consistent gains in challenging conditions such as low illumination, thermal clutter, and multi-scale targets.
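To make the abstract's "cross-interactive dual-attention" idea concrete, the block below gives a minimal, hypothetical PyTorch sketch: each modality's features are first re-weighted by their own channel attention, then by a spatial attention map computed from the other modality, and a 1×1 convolution fuses the two refined streams. This is an illustrative reading of the described mechanism, not the authors' PACNet code; the class names, reduction ratio, and 7×7 spatial-attention kernel are assumptions.

```python
# Minimal, hypothetical PyTorch sketch of a cross-interactive dual-attention
# refinement step (not the authors' PACNet implementation).
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel re-weighting of one modality."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)  # broadcast the B x C x 1 x 1 gate over H x W


class SpatialAttention(nn.Module):
    """Produce a B x 1 x H x W attention map from pooled channel statistics."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_pool = x.mean(dim=1, keepdim=True)
        max_pool, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))


class CrossInteractiveDualAttention(nn.Module):
    """Refine RGB and thermal streams with exchanged spatial attention cues."""

    def __init__(self, channels: int):
        super().__init__()
        self.ca_rgb = ChannelAttention(channels)
        self.ca_t = ChannelAttention(channels)
        self.sa_from_rgb = SpatialAttention()
        self.sa_from_t = SpatialAttention()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        # Intra-modal refinement: channel attention within each stream.
        r = self.ca_rgb(f_rgb)
        t = self.ca_t(f_t)
        # Cross interaction: each stream is re-weighted by the spatial
        # attention map computed from the *other* stream.
        r_refined = r * self.sa_from_t(t)
        t_refined = t * self.sa_from_rgb(r)
        # Fuse the two refined streams with a 1x1 convolution.
        return self.fuse(torch.cat([r_refined, t_refined], dim=1))


if __name__ == "__main__":
    rgb_feat = torch.randn(1, 64, 80, 80)
    thermal_feat = torch.randn(1, 64, 80, 80)
    fused = CrossInteractiveDualAttention(64)(rgb_feat, thermal_feat)
    print(fused.shape)  # torch.Size([1, 64, 80, 80])
```

Under these assumptions, the exchange of spatial maps is what lets one modality compensate for the other (e.g., thermal cues guiding RGB localization in low illumination), while the per-modality channel attention preserves the refinement-before-fusion order described in the abstract.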

Graphical Abstract

Keywords
salient object detection
RGB-thermal fusion
cross-interactive dual attention
multi-modal learning

Data Availability Statement
Data will be made available on request.

Funding
This work received no external funding.

Conflicts of Interest
The authors declare no conflicts of interest.

AI Use Statement
The authors declare that no generative AI was used in the preparation of this manuscript.

Ethical Approval and Consent to Participate
Not applicable.


Cite This Article
APA Style
Hassan, M. Z., Gazis, A., Khan, A., & Ghazanfar, Z. (2026). Learning Cross-Modal Collaboration via Pyramid Attention for RGB Thermal Sensing in Saliency Detection. ICCK Transactions on Sensing, Communication, and Control, 3(1), 1–14. https://doi.org/10.62762/TSCC.2025.210523
Export Citation
RIS Format
Compatible with EndNote, Zotero, Mendeley, and other reference managers
TY  - JOUR
AU  - Hassan, Muhammad Zain
AU  - Gazis, Alexandros
AU  - Khan, Abdurrahman
AU  - Ghazanfar, Zainab
PY  - 2026
DA  - 2026/01/29
TI  - Learning Cross-Modal Collaboration via Pyramid Attention for RGB Thermal Sensing in Saliency Detection
JO  - ICCK Transactions on Sensing, Communication, and Control
T2  - ICCK Transactions on Sensing, Communication, and Control
JF  - ICCK Transactions on Sensing, Communication, and Control
VL  - 3
IS  - 1
SP  - 1
EP  - 14
DO  - 10.62762/TSCC.2025.210523
UR  - https://www.icck.org/article/abs/TSCC.2025.210523
KW  - salient object detection
KW  - RGB-thermal fusion
KW  - cross-interactive dual attention
KW  - multi-modal learning
AB  - RGB–thermal (RGB-T) salient object detection exploits complementary cues from visible and thermal sensors to maintain reliable performance in adverse environments. However, many existing methods (i) fuse modalities before sufficiently enhancing intra-modal semantics and (ii) are sensitive to modality discrepancies caused by heterogeneous sensor characteristics. To address these issues, we propose PACNet (Pyramid Attention Collaboration Network), a hierarchical RGB-T framework that jointly models multi-scale and global context and performs refinement-before-fusion with cross-modal collaboration. Specifically, Dense Atrous Spatial Pyramid Pooling (DASPP) captures multi-scale contextual cues across semantic stages, while Multi-Head Self-Attention (MHSA) establishes long-range dependencies for global context modeling. We further design a hierarchical feature integration scheme that constructs two complementary feature streams, preserving fine-grained spatial details and strengthening high-level semantics. These streams are refined using a cross-interactive dual-attention module that enables bidirectional interaction between spatial and channel attention, improving localization and semantic discrimination while mitigating modality imbalance. Experiments on three public benchmarks (VT821, VT1000, and VT5000) demonstrate that PACNet achieves state-of-the-art performance and delivers consistent gains in challenging conditions such as low illumination, thermal clutter, and multi-scale targets.
SN  - 3068-9287
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  - 
BibTeX Format
Compatible with LaTeX, BibTeX, and other reference managers
@article{Hassan2026Learning,
  author = {Muhammad Zain Hassan and Alexandros Gazis and Abdurrahman Khan and Zainab Ghazanfar},
  title = {Learning Cross-Modal Collaboration via Pyramid Attention for RGB Thermal Sensing in Saliency Detection},
  journal = {ICCK Transactions on Sensing, Communication, and Control},
  year = {2026},
  volume = {3},
  number = {1},
  pages = {1-14},
  doi = {10.62762/TSCC.2025.210523},
  url = {https://www.icck.org/article/abs/TSCC.2025.210523},
  abstract = {RGB–thermal (RGB-T) salient object detection exploits complementary cues from visible and thermal sensors to maintain reliable performance in adverse environments. However, many existing methods (i) fuse modalities before sufficiently enhancing intra-modal semantics and (ii) are sensitive to modality discrepancies caused by heterogeneous sensor characteristics. To address these issues, we propose PACNet (Pyramid Attention Collaboration Network), a hierarchical RGB-T framework that jointly models multi-scale and global context and performs refinement-before-fusion with cross-modal collaboration. Specifically, Dense Atrous Spatial Pyramid Pooling (DASPP) captures multi-scale contextual cues across semantic stages, while Multi-Head Self-Attention (MHSA) establishes long-range dependencies for global context modeling. We further design a hierarchical feature integration scheme that constructs two complementary feature streams, preserving fine-grained spatial details and strengthening high-level semantics. These streams are refined using a cross-interactive dual-attention module that enables bidirectional interaction between spatial and channel attention, improving localization and semantic discrimination while mitigating modality imbalance. Experiments on three public benchmarks (VT821, VT1000, and VT5000) demonstrate that PACNet achieves state-of-the-art performance and delivers consistent gains in challenging conditions such as low illumination, thermal clutter, and multi-scale targets.},
  keywords = {salient object detection, RGB-thermal fusion, cross-interactive dual attention, multi-modal learning},
  issn = {3068-9287},
  publisher = {Institute of Central Computation and Knowledge}
}

Article Metrics
Citations: Crossref: 0 | Scopus: 0 | Web of Science: 0
Article Access Statistics: Views: 39 | PDF Downloads: 8

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Sensing, Communication, and Control
ISSN: 3068-9287 (Online) | ISSN: 3068-9279 (Print)
Email: [email protected]

Portico
All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/