ICCK Transactions on Sensing, Communication, and Control, Volume 2, Issue 4, 2025: 276-289

Free to Read | Research Article | 30 December 2025
Dual-Pathway Sensing with Optimized Attention Network for Video Summarization in Surveillance Systems
Taimur Ali Khan, Danish Ali, Zainab Ghazanfar and Bilal Ahmad *
1 Department of IT, Saudi Media Systems, Riyadh 11482, Saudi Arabia
2 Department of Electrical and Computer Engineering, Villanova University, Villanova, PA 19085, United States
3 Department of Software and Artificial Intelligence, Gachon University, Seongnam 13120, South Korea
4 Department of Computer Science, Govt Degree College Lal Qilla Maidan Dir Lower, Pakistan
* Corresponding Author: Bilal Ahmad, [email protected]
ARK: ark:/57805/tscc.2025.308540
Received: 20 October 2025, Accepted: 05 December 2025, Published: 30 December 2025  
Abstract
Video summarization (VS) aims to generate concise representations of long videos by extracting the most informative frames while maintaining essential content. Existing methods struggle to capture multi-scale dependencies and often rely on suboptimal feature representations, limiting their ability to model complex inter-frame relationships. To address these issues, we propose a multi-scale sensing network that incorporates three key innovations to improve VS. First, we introduce multi-scale dilated convolution blocks with progressively increasing dilation rates to capture temporal context at multiple levels, enabling the network to understand both local transitions and long-range dependencies. Second, we develop a Dual-Pathway Efficient Channel Attention (DECA) module that leverages statistics from Global Average Pooling and Global Max Pooling pathways. Third, we present an Optimized Spatial Attention (OSA) module that replaces standard $7\times7$ convolutions with more efficient operations while maintaining spatial dependency modeling. The proposed framework uses EfficientNetB7 as the backbone for robust spatial feature extraction, followed by multi-scale dilated blocks and dual attention mechanisms for detailed feature refinement. Extensive experiments on the TVSum and SumMe benchmark datasets demonstrate the superiority of our method, achieving F1 scores of 63.5% and 53.3%, respectively.
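The abstract describes the three modules only at a high level. As a rough, non-authoritative illustration, the PyTorch sketch below shows one plausible realization: parallel temporal convolutions with increasing dilation rates, an ECA-style dual-pathway channel attention over average- and max-pooled statistics, and a spatial attention that swaps the usual single 7x7 convolution for two stacked 3x3 convolutions. All shapes, kernel sizes, the dilation rates (1, 2, 4), and the stacked-convolution substitution are assumptions for illustration, not the authors' exact design.

import torch
import torch.nn as nn

class MultiScaleDilatedBlock(nn.Module):
    """Parallel temporal convolutions with progressively increasing
    dilation rates (assumed 1, 2, 4), fused by a 1x1 convolution."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations])
        self.fuse = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, frames)
        y = torch.cat([branch(x) for branch in self.branches], dim=1)
        return torch.relu(self.fuse(y)) + x  # residual connection (assumed)

class DECA(nn.Module):
    """Dual-Pathway Efficient Channel Attention: an ECA-style 1D convolution
    across the channel axis, applied to both GAP and GMP statistics."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):  # x: (batch, channels, H, W) per-frame feature map
        avg = x.mean(dim=(2, 3))   # Global Average Pooling -> (B, C)
        mx = x.amax(dim=(2, 3))    # Global Max Pooling     -> (B, C)
        w = self.conv(avg.unsqueeze(1)) + self.conv(mx.unsqueeze(1))  # (B, 1, C)
        return x * torch.sigmoid(w).transpose(1, 2).unsqueeze(-1)

class OSA(nn.Module):
    """Optimized Spatial Attention: two stacked 3x3 convolutions stand in
    for the single 7x7 kernel of standard (CBAM-style) spatial attention."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 2, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(2, 1, kernel_size=3, padding=1))

    def forward(self, x):  # x: (batch, channels, H, W)
        avg = x.mean(dim=1, keepdim=True)  # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)   # channel-wise max map
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

# Toy end-to-end pass: per-frame refinement, then temporal modeling.
# (A real pipeline would feed EfficientNetB7 features, e.g. 2560 channels.)
frames = torch.randn(4, 64, 8, 8)                # 4 frames, toy feature maps
frames = OSA()(DECA()(frames))                   # channel, then spatial attention
seq = frames.mean(dim=(2, 3)).t().unsqueeze(0)   # (1, 64, 4) frame sequence
print(MultiScaleDilatedBlock(64)(seq).shape)     # torch.Size([1, 64, 4])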

Graphical Abstract
Dual-Pathway Sensing with Optimized Attention Network for Video Summarization in Surveillance Systems

Keywords
video summarization
visual intelligence
surveillance systems
dual-pathway
attention network

Data Availability Statement
Data will be made available on request.

Funding
This work received no external funding.

Conflicts of Interest
The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate
Not applicable.


Cite This Article
APA Style
Khan, T. A., Ali, D., Ghazanfar, Z., & Ahmad, B. (2025). Dual-Pathway Sensing with Optimized Attention Network for Video Summarization in Surveillance Systems. ICCK Transactions on Sensing, Communication, and Control, 2(4), 276–289. https://doi.org/10.62762/TSCC.2025.308540
Export Citation
RIS Format
Compatible with EndNote, Zotero, Mendeley, and other reference managers
TY  - JOUR
AU  - Khan, Taimur Ali
AU  - Ali, Danish
AU  - Ghazanfar, Zainab
AU  - Ahmad, Bilal
PY  - 2025
DA  - 2025/12/30
TI  - Dual-Pathway Sensing with Optimized Attention Network for Video Summarization in Surveillance Systems
JO  - ICCK Transactions on Sensing, Communication, and Control
T2  - ICCK Transactions on Sensing, Communication, and Control
JF  - ICCK Transactions on Sensing, Communication, and Control
VL  - 2
IS  - 4
SP  - 276
EP  - 289
DO  - 10.62762/TSCC.2025.308540
UR  - https://www.icck.org/article/abs/TSCC.2025.308540
KW  - video summarization
KW  - visual intelligence
KW  - surveillance systems
KW  - dual-pathway
KW  - attention network
AB  - Video summarization (VS) aims to generate concise representations of long videos by extracting the most informative frames while maintaining essential content. Existing methods struggle to capture multi-scale dependencies and often rely on suboptimal feature representations, limiting their ability to model complex inter-frame relationships. To address these issues, we propose a multi-scale sensing network that incorporates three key innovations to improve VS. First, we introduce multi-scale dilated convolution blocks with progressively increasing dilation rates to capture temporal context at multiple levels, enabling the network to understand both local transitions and long-range dependencies. Second, we develop a Dual-Pathway Efficient Channel Attention (DECA) module that leverages statistics from Global Average Pooling and Global Max Pooling pathways. Third, we present an Optimized Spatial Attention (OSA) module that replaces standard $7\times7$ convolutions with more efficient operations while maintaining spatial dependency modeling. The proposed framework uses EfficientNetB7 as the backbone for robust spatial feature extraction, followed by multi-scale dilated blocks and dual attention mechanisms for detailed feature refinement. Extensive experiments on the TVSum and SumMe benchmark datasets demonstrate the superiority of our method, achieving F1 scores of 63.5% and 53.3%, respectively.
SN  - 3068-9287
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  - 
BibTeX Format
Compatible with LaTeX, BibTeX, and other reference managers
@article{Khan2025DualPathway,
  author = {Taimur Ali Khan and Danish Ali and Zainab Ghazanfar and Bilal Ahmad},
  title = {Dual-Pathway Sensing with Optimized Attention Network for Video Summarization in Surveillance Systems},
  journal = {ICCK Transactions on Sensing, Communication, and Control},
  year = {2025},
  volume = {2},
  number = {4},
  pages = {276-289},
  doi = {10.62762/TSCC.2025.308540},
  url = {https://www.icck.org/article/abs/TSCC.2025.308540},
  abstract = {Video summarization (VS) aims to generate concise representations of long videos by extracting the most informative frames while maintaining essential content. Existing methods struggle to capture multi-scale dependencies and often rely on suboptimal feature representations, limiting their ability to model complex inter-frame relationships. To address these issues, we propose a multi-scale sensing network that incorporates three key innovations to improve VS. First, we introduce multi-scale dilated convolution blocks with progressively increasing dilation rates to capture temporal context at multiple levels, enabling the network to understand both local transitions and long-range dependencies. Second, we develop a Dual-Pathway Efficient Channel Attention (DECA) module that leverages statistics from Global Average Pooling and Global Max Pooling pathways. Third, we present an Optimized Spatial Attention (OSA) module that replaces standard $7\times7$ convolutions with more efficient operations while maintaining spatial dependency modeling. The proposed framework uses EfficientNetB7 as the backbone for robust spatial feature extraction, followed by multi-scale dilated blocks and dual attention mechanisms for detailed feature refinement. Extensive experiments on the TVSum and SumMe benchmark datasets demonstrate the superiority of our method, achieving F1 scores of 63.5\% and 53.3\%, respectively.},
  keywords = {video summarization, visual intelligence, surveillance systems, dual-pathway, attention network},
  issn = {3068-9287},
  publisher = {Institute of Central Computation and Knowledge}
}

Article Metrics
Citations:
Crossref: 0
Scopus: 0
Web of Science: 0
Article Access Statistics:
Views: 189
PDF Downloads: 30

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Sensing, Communication, and Control

ISSN: 3068-9287 (Online) | ISSN: 3068-9279 (Print)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/