Scale-Specific Visual Sensing for Colonoscopy Polyp Segmentation via Hybrid CNN-Transformer Attention
Abstract
Precise segmentation of colorectal polyps in colonoscopy images is essential for timely cancer diagnosis and prevention. Current segmentation methods, however, must cope with the intrinsic variability of polyp appearance (differences in size, shape, and texture) while retaining the computational efficiency needed for clinical deployment. In this paper, we present a segmentation architecture that integrates scale-specific attention mechanisms within a hybrid CNN-Transformer backbone to address these challenges. The model applies Coordinate Attention to high-resolution feature maps, preserving the spatial detail needed for boundary delineation, and Channel Attention to deep semantic features, enhancing their discriminative capacity. These representations are progressively integrated by a hierarchical decoder with two specialized fusion modules: Semantic Fusion for high-level features and Detail-Preserving Fusion for low-level features. The proposed architecture achieves state-of-the-art performance across five benchmark datasets, demonstrating strong generalization and robustness in challenging clinical scenarios.
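The scale-specific split described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a squeeze-and-excitation style gate for Channel Attention and a simplified Coordinate Attention (ReLU in place of the normalization and activation used in the original Coordinate Attention design). All function names, tensor shapes, the reduction ratio, and the random weights are hypothetical placeholders standing in for learned 1x1 convolutions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """SE-style channel gate for a (C, H, W) map (assumed form, not the paper's)."""
    squeezed = x.mean(axis=(1, 2))             # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # bottleneck + ReLU -> (C//r,)
    gate = sigmoid(w2 @ hidden)                # one weight per channel -> (C,)
    return x * gate[:, None, None]

def coordinate_attention(x, w_shared, w_h, w_w):
    """Simplified coordinate attention: separate gates along height and width."""
    c, h, w = x.shape
    pooled_h = x.mean(axis=2)                  # average over width  -> (C, H)
    pooled_w = x.mean(axis=1)                  # average over height -> (C, W)
    stacked = np.concatenate([pooled_h, pooled_w], axis=1)   # (C, H+W)
    hidden = np.maximum(w_shared @ stacked, 0.0)              # (C//r, H+W)
    gate_h = sigmoid(w_h @ hidden[:, :h])      # per-row gate    -> (C, H)
    gate_w = sigmoid(w_w @ hidden[:, h:])      # per-column gate -> (C, W)
    return x * gate_h[:, :, None] * gate_w[:, None, :]

# Hypothetical feature maps: a shallow high-resolution one and a deep low-resolution one.
rng = np.random.default_rng(0)
shallow = rng.standard_normal((8, 16, 16))    # few channels, fine spatial grid
deep = rng.standard_normal((32, 4, 4))        # many channels, coarse spatial grid
r = 4                                         # assumed reduction ratio

# Random matrices stand in for learned 1x1 convolution weights.
shallow_out = coordinate_attention(
    shallow,
    rng.standard_normal((8 // r, 8)),
    rng.standard_normal((8, 8 // r)),
    rng.standard_normal((8, 8 // r)),
)
deep_out = channel_attention(
    deep,
    rng.standard_normal((32 // r, 32)),
    rng.standard_normal((32, 32 // r)),
)
```

Both gates leave the feature-map shape unchanged and only rescale activations. The coordinate gate keeps position information along each axis, which is the property the abstract relies on for boundary detail, whereas the channel gate discards spatial position entirely and reweights semantic channels.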
Keywords
medical imaging, hybrid CNN-Transformer, multi-scale attention, semantic fusion, deep learning
Cite This Article
TY  - JOUR
AU  - Khan, Ikram Majeed
AU  - Khan, Wisal
PY  - 2026
DA  - 2026/06/28
TI  - Scale-Specific Visual Sensing for Colonoscopy Polyp Segmentation via Hybrid CNN-Transformer Attention
JO  - ICCK Transactions on Sensing, Communication, and Control
T2  - ICCK Transactions on Sensing, Communication, and Control
JF  - ICCK Transactions on Sensing, Communication, and Control
VL  - 3
IS  - 2
SP  - 109
EP  - 123
DO  - 10.62762/TSCC.2026.664028
UR  - https://www.icck.org/article/abs/TSCC.2026.664028
KW  - medical imaging
KW  - hybrid CNN-Transformer
KW  - multi-scale attention
KW  - semantic fusion
KW  - deep learning
SN  - 3068-9287
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  -
@article{Khan2026ScaleSpeci,
author = {Ikram Majeed Khan and Wisal Khan},
title = {Scale-Specific Visual Sensing for Colonoscopy Polyp Segmentation via Hybrid CNN-Transformer Attention},
journal = {ICCK Transactions on Sensing, Communication, and Control},
year = {2026},
volume = {3},
number = {2},
pages = {109-123},
doi = {10.62762/TSCC.2026.664028},
url = {https://www.icck.org/article/abs/TSCC.2026.664028},
keywords = {medical imaging, hybrid CNN-Transformer, multi-scale attention, semantic fusion, deep learning},
issn = {3068-9287},
publisher = {Institute of Central Computation and Knowledge}
}
Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.