Scale-Specific Visual Sensing for Colonoscopy Polyp Segmentation via Hybrid CNN-Transformer Attention

Ikram Majeed Khan; Wisal Khan

doi:10.62762/TSCC.2026.664028

Article Information

Published in ICCK Transactions on Sensing, Communication, and Control

Volume/Issue Volume 3, Issue 2, 2026

Pages 109-123

Abstract

Precise segmentation of colorectal polyps in colonoscopy images is essential for timely cancer diagnosis and prevention. Nevertheless, current segmentation methods struggle to handle the intrinsic variability of polyp appearance (differences in size, shape, and texture) while preserving the computational efficiency needed for clinical deployment. In this paper, we present a novel segmentation architecture that integrates scale-specific attention mechanisms within a hybrid CNN-Transformer backbone to address these challenges. Our model applies Coordinate Attention to high-resolution feature maps to preserve the spatial details essential for boundary delineation, and Channel Attention to deep semantic features to enhance discriminative capacity. These representations are progressively integrated through a hierarchical decoder with specialized fusion modules: Semantic Fusion for high-level features and Detail-Preserving Fusion for low-level features. The proposed architecture achieves state-of-the-art performance across five benchmark datasets, demonstrating superior generalization and robustness in challenging clinical scenarios.
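To make the scale-specific idea concrete, the following is a minimal NumPy sketch of the two attention types named in the abstract. It is not the authors' implementation: the abstract gives no layer details, so the function names, weight shapes, reduction sizes, and random weights are all assumptions chosen for illustration. The channel gate follows the common squeeze-and-excitation pattern, and the coordinate gate is a simplified form of coordinate attention (ReLU instead of h-swish, no batch normalization).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """SE-style channel gate (assumed form).

    feat: (C, H, W); w1: (C_red, C); w2: (C, C_red).
    """
    squeezed = feat.mean(axis=(1, 2))          # (C,) global average pool
    hidden = np.maximum(w1 @ squeezed, 0.0)    # (C_red,) ReLU bottleneck
    gate = sigmoid(w2 @ hidden)                # (C,) weights in (0, 1)
    return feat * gate[:, None, None]          # rescale each channel


def coordinate_attention(feat, w_mid, w_h, w_w):
    """Simplified coordinate gate (assumed form).

    feat: (C, H, W); w_mid: (M, C); w_h and w_w: (C, M).
    """
    pooled_h = feat.mean(axis=2)               # (C, H): average over width
    pooled_w = feat.mean(axis=1)               # (C, W): average over height
    mid_h = np.maximum(w_mid @ pooled_h, 0.0)  # (M, H) shared projection
    mid_w = np.maximum(w_mid @ pooled_w, 0.0)  # (M, W) shared projection
    gate_h = sigmoid(w_h @ mid_h)              # (C, H) row-wise weights
    gate_w = sigmoid(w_w @ mid_w)              # (C, W) column-wise weights
    # Each position (h, w) is scaled by its row gate and its column gate.
    return feat * gate_h[:, :, None] * gate_w[:, None, :]


# Hypothetical feature maps: a shallow high-resolution map and a deep
# low-resolution map. Sizes and weights are illustrative only.
rng = np.random.default_rng(0)
shallow = rng.standard_normal((8, 16, 16))
deep = rng.standard_normal((32, 4, 4))

shallow_out = coordinate_attention(
    shallow,
    0.1 * rng.standard_normal((4, 8)),
    0.1 * rng.standard_normal((8, 4)),
    0.1 * rng.standard_normal((8, 4)),
)
deep_out = channel_attention(
    deep,
    0.1 * rng.standard_normal((8, 32)),
    0.1 * rng.standard_normal((32, 8)),
)
print(shallow_out.shape, deep_out.shape)
```

The coordinate gate keeps separate weights along the height and width axes, which is why it suits high-resolution maps where position matters for boundaries. The channel gate discards position entirely and reweights whole channels, which fits deep maps whose channels carry semantic content.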

Keywords

medical imaging, hybrid CNN-Transformer, multi-scale attention, semantic fusion, deep learning

Data Availability Statement

Data will be made available on request.

Funding

This work received no funding.

Conflicts of Interest

The authors declare no conflicts of interest.

AI Use Statement

The authors declare that no generative AI was used in the preparation of this manuscript.

Ethical Approval and Consent to Participate

Not applicable.

Cite This Article

APA Style

Khan, I. M., & Khan, W. (2026). Scale-Specific Visual Sensing for Colonoscopy Polyp Segmentation via Hybrid CNN-Transformer Attention. ICCK Transactions on Sensing, Communication, and Control, 3(2), 109-123. https://doi.org/10.62762/TSCC.2026.664028

Export Citation

RIS Format

Compatible with EndNote, Zotero, Mendeley, and other reference managers

TY  - JOUR
AU  - Khan, Ikram Majeed
AU  - Khan, Wisal
PY  - 2026
DA  - 2026/06/28
TI  - Scale-Specific Visual Sensing for Colonoscopy Polyp Segmentation via Hybrid CNN-Transformer Attention
JO  - ICCK Transactions on Sensing, Communication, and Control
T2  - ICCK Transactions on Sensing, Communication, and Control
JF  - ICCK Transactions on Sensing, Communication, and Control
VL  - 3
IS  - 2
SP  - 109
EP  - 123
DO  - 10.62762/TSCC.2026.664028
UR  - https://www.icck.org/article/abs/TSCC.2026.664028
KW  - medical imaging
KW  - hybrid CNN-Transformer
KW  - multi-scale attention
KW  - semantic fusion
KW  - deep learning
SN  - 3068-9287
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  -

BibTeX Format

Compatible with LaTeX, BibTeX, and other reference managers

@article{Khan2026ScaleSpeci,
  author = {Ikram Majeed Khan and Wisal Khan},
  title = {Scale-Specific Visual Sensing for Colonoscopy Polyp Segmentation via Hybrid CNN-Transformer Attention},
  journal = {ICCK Transactions on Sensing, Communication, and Control},
  year = {2026},
  volume = {3},
  number = {2},
  pages = {109--123},
  doi = {10.62762/TSCC.2026.664028},
  url = {https://www.icck.org/article/abs/TSCC.2026.664028},
  keywords = {medical imaging, hybrid CNN-Transformer, multi-scale attention, semantic fusion, deep learning},
  issn = {3068-9287},
  publisher = {Institute of Central Computation and Knowledge}
}

Publisher's Note

ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions

Institute of Central Computation and Knowledge (ICCK) or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

ICCK Transactions on Sensing, Communication, and Control

ISSN: 3068-9287 (Online) | ISSN: 3068-9279 (Print)

Preserved at
Portico
