LI3D-BiLSTM: A Lightweight Inception-3D Networks with BiLSTM for Video Action Recognition
Research Article  ·  Published: 09 August 2024
Issue cover
ICCK Transactions on Emerging Topics in Artificial Intelligence
Volume 1, Issue 1, 2024: 58-70
Research Article Open Access Code Available

LI3D-BiLSTM: A Lightweight Inception-3D Networks with BiLSTM for Video Action Recognition

1 Beijing iQIYI Technology Co., Ltd., Beijing 100080, China
2 School of Computer Science and Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China
3 Department of Information Engineering, University of Padua, 35131 Padua, Italy
* Corresponding Author: Xuebo Jin, [email protected]
Volume 1, Issue 1

Abstract

This paper proposes an improved video action recognition method, primarily consisting of three key components. Firstly, in the data preprocessing stage, we developed multi-temporal scale video frame extraction and multi-spatial scale video cropping techniques to enhance content information and standardize input formats. Secondly, we propose a lightweight Inception-3D networks (LI3D) network structure for spatio-temporal feature extraction and design a soft-association feature aggregation module to improve the recognition accuracy of key actions in videos. Lastly, we employ a bidirectional LSTM network to contextualize the feature sequences extracted by LI3D, enhancing the representation capability for temporal data. To improve the model’s robustness and generalization ability, we introduced spatial and temporal scale data augmentation techniques in the preprocessing stage, effectively extracting video key frames and capturing key regional actions. Furthermore, we conducted an in-depth study on spatio-temporal feature extraction methods for video data, effectively extracting spatial and temporal information through the LI3D network and transfer learning. Experimental results demonstrate that the proposed method achieves significant performance improvements in video action recognition tasks, providing new insights and approaches for research in related fields.

Code (Data) Available

https://drive.google.com/file/d/1oJMhtkD2SPBO_g5uReY5GZQvltis7-9E/view?usp=sharing

Graphical Abstract

LI3D-BiLSTM: A Lightweight Inception-3D Networks with BiLSTM for Video Action Recognition

Keywords

video action recognition multi-scale preprocessing lightweight I3D (LI3D) spatio-temporal feature extraction bidirectional LSTM

Data Availability Statement

The code and data supporting the findings of this study are available from the following link: https://drive.google.com/file/d/1oJMhtkD2SPBO_g5uReY5GZQvltis7-9E/view?usp=sharing.

Funding

This work was supported without any funding.

Conflicts of Interest

Fafa Wang is affiliated with the Beijing iQIYI Technology Co., Ltd., Beijing 100080, China. The authors declare that this affiliation had no influence on the study design, data collection, analysis, interpretation, or the decision to publish, and that no other competing interests exist.

Ethical Approval and Consent to Participate

Not applicable.

References

  1. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), 91-110.
    [CrossRef] [Google Scholar]
  2. Andrade-Ambriz, Y. A., Ledesma, S., Ibarra-Manzano, M. A., Oros-Flores, M. I., & Almanza-Ojeda, D. L. (2022). Human activity recognition using temporal convolutional neural network architecture. Expert Systems with Applications, 191, 116287.
    [CrossRef] [Google Scholar]
  3. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 1725-1732).
    [CrossRef] [Google Scholar]
  4. Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 4489-4497).
    [Google Scholar]
  5. Zha, S., Luisier, F., Andrews, W., Srivastava, N., & Salakhutdinov, R. (2015). Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint arXiv:1503.04144.
    [CrossRef] [Google Scholar]
  6. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4694-4702).
    [Google Scholar]
  7. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2625-2634).
    [CrossRef] [Google Scholar]
  8. Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Lin, H., Zhang, Z., ... & Smola, A. (2022). Resnest: Split-attention networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2736-2746).
    [Google Scholar]
  9. Li, Z., Gavrilyuk, K., Gavves, E., Jain, M., & Snoek, C. G. (2018). Videolstm convolves, attends and flows for action recognition. Computer Vision and Image Understanding, 166, 41-50.
    [CrossRef] [Google Scholar]
  10. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016, September). Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision (pp. 20-36). Cham: Springer International Publishing.
    [CrossRef] [Google Scholar]
  11. Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27.
    [Google Scholar]
  12. Diba, A., Fayyaz, M., Sharma, V., Karami, A. H., Arzani, M. M., Yousefzadeh, R., & Van Gool, L. (2017). Temporal 3d convnets: New architecture and transfer learning for video classification. arXiv preprint arXiv:1711.08200.
    [CrossRef] [Google Scholar]
  13. Bertasius, G., Wang, H., & Torresani, L. (2021, July). Is space-time attention all you need for video understanding?. In ICML (Vol. 2, No. 3, p. 4).https://proceedings.mlr.press/v139/bertasius21a/bertasius21a-supp.pdf
    [Google Scholar]
  14. Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., & Feichtenhofer, C. (2021). Multiscale vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6824-6835).
    [CrossRef] [Google Scholar]
  15. Tong, Z., Song, Y., Wang, J., & Wang, L. (2022). Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35, 10078-10093.
    [Google Scholar]
  16. Lin, J., Gan, C., & Han, S. (2019). Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7083-7093).
    [CrossRef] [Google Scholar]
  17. Ryoo, M., Piergiovanni, A. J., Arnab, A., Dehghani, M., & Angelova, A. (2021). Tokenlearner: Adaptive space-time tokenization for videos. Advances in neural information processing systems, 34, 12786-12797.
    [Google Scholar]
  18. Yang, H., Huang, D., Wen, B., Wu, J., Yao, H., Jiang, Y., ... & Yuan, Z. (2022). Self-supervised video representation learning with motion-aware masked autoencoders. arXiv preprint arXiv:2210.04154.
    [CrossRef] [Google Scholar]
  19. Wei, C., Fan, H., Xie, S., Wu, C. Y., Yuille, A., & Feichtenhofer, C. (2022). Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14668-14678).
    [CrossRef] [Google Scholar]
  20. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818-2826).
    [Google Scholar]
  21. Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4724-4733).
    [CrossRef] [Google Scholar]
  22. Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
    [CrossRef] [Google Scholar]

Cited By (6)

  1. Aiman Khan Nazir, Lanfang Dong, Bei Hua, Qasim Jan, Kang Dong, Xiahan Feng. . 2026 11th International Conference on Intelligent Computing and Signal Processing (ICSP), 2026 .
    [CrossRef]
  2. Zhiqun Lin, Kexin Feng, Diaa Ahmed Mohamed Ahmedien. Improved generative adversarial networks model for movie dance generation. PLOS One, 2025 , 20 (5).
    [CrossRef]
  3. Marjan Kia, Soroush Sadeghi, Homayoun Safarpour, Mohammadreza Kamsari, Saeid Jafarzadeh Ghoushchi, Ramin Ranjbarzadeh. Innovative fusion of VGG16, MobileNet, EfficientNet, AlexNet, and ResNet50 for MRI-based brain tumor identification. Iran Journal of Computer Science, 2025 , 8 (1).
    [CrossRef]
  4. Tianjing Wang, Chao Ren, Zhijun Zhang, Zhao Yang Dong. GPT-Assisted Multi-View Multi-Label Learning for Non-Intrusive Load Monitoring With Incomplete Labels and Data. IEEE Transactions on Consumer Electronics, 2025 , 71 (4).
    [CrossRef]
  5. Marjan Kia. Attention-guided deep learning for effective customer loyalty management and multi-criteria decision analysis. Iran Journal of Computer Science, 2025 , 8 (1).
    [CrossRef]
  6. Lei Shi, Yaodong Gu. Verification and application of deep learning models in daily sports activities of teenagers. PLOS One, 2025 , 20 (6).
    [CrossRef]
* Citation data provided by Crossref Cited-by.

Cite This Article

APA Style
Wang, F., Jin, X., & Yi, S. (2024). LI3D-BiLSTM: A Lightweight Inception-3D Networks with BiLSTM for Video Action Recognition. ICCK Transactions on Emerging Topics in Artificial Intelligence, 1(1), 58-70. https://doi.org/10.62762/TETAI.2024.628205
Export Citation
RIS Format
Compatible with EndNote, Zotero, Mendeley, and other reference managers
TY  - JOUR
AU  - Wang, Fafa
AU  - Jin, Xuebo
AU  - Yi, Shenglun
PY  - 2024
DA  - 2024/08/09
TI  - LI3D-BiLSTM: A Lightweight Inception-3D Networks with BiLSTM for Video Action Recognition
JO  - ICCK Transactions on Emerging Topics in Artificial Intelligence
T2  - ICCK Transactions on Emerging Topics in Artificial Intelligence
JF  - ICCK Transactions on Emerging Topics in Artificial Intelligence
VL  - 1
IS  - 1
SP  - 58
EP  - 70
DO  - 10.62762/TETAI.2024.628205
UR  - https://www.icck.org/article/abs/TETAI.2024.628205
KW  - video action recognition
KW  - multi-scale preprocessing
KW  - lightweight I3D (LI3D)
KW  - spatio-temporal feature extraction
KW  - bidirectional LSTM
AB  - This paper proposes an improved video action recognition method, primarily consisting of three key components. Firstly, in the data preprocessing stage, we developed multi-temporal scale video frame extraction and multi-spatial scale video cropping techniques to enhance content information and standardize input formats. Secondly, we propose a lightweight Inception-3D networks (LI3D) network structure for spatio-temporal feature extraction and design a soft-association feature aggregation module to improve the recognition accuracy of key actions in videos. Lastly, we employ a bidirectional LSTM network to contextualize the feature sequences extracted by LI3D, enhancing the representation capability for temporal data. To improve the model’s robustness and generalization ability, we introduced spatial and temporal scale data augmentation techniques in the preprocessing stage, effectively extracting video key frames and capturing key regional actions. Furthermore, we conducted an in-depth study on spatio-temporal feature extraction methods for video data, effectively extracting spatial and temporal information through the LI3D network and transfer learning. Experimental results demonstrate that the proposed method achieves significant performance improvements in video action recognition tasks, providing new insights and approaches for research in related fields.
SN  - 3068-6652
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  - 
BibTeX Format
Compatible with LaTeX, BibTeX, and other reference managers
@article{Wang2024LI3DBiLSTM,
  author = {Fafa Wang and Xuebo Jin and Shenglun Yi},
  title = {LI3D-BiLSTM: A Lightweight Inception-3D Networks with BiLSTM for Video Action Recognition},
  journal = {ICCK Transactions on Emerging Topics in Artificial Intelligence},
  year = {2024},
  volume = {1},
  number = {1},
  pages = {58-70},
  doi = {10.62762/TETAI.2024.628205},
  url = {https://www.icck.org/article/abs/TETAI.2024.628205},
  abstract = {This paper proposes an improved video action recognition method, primarily consisting of three key components. Firstly, in the data preprocessing stage, we developed multi-temporal scale video frame extraction and multi-spatial scale video cropping techniques to enhance content information and standardize input formats. Secondly, we propose a lightweight Inception-3D networks (LI3D) network structure for spatio-temporal feature extraction and design a soft-association feature aggregation module to improve the recognition accuracy of key actions in videos. Lastly, we employ a bidirectional LSTM network to contextualize the feature sequences extracted by LI3D, enhancing the representation capability for temporal data. To improve the model’s robustness and generalization ability, we introduced spatial and temporal scale data augmentation techniques in the preprocessing stage, effectively extracting video key frames and capturing key regional actions. Furthermore, we conducted an in-depth study on spatio-temporal feature extraction methods for video data, effectively extracting spatial and temporal information through the LI3D network and transfer learning. Experimental results demonstrate that the proposed method achieves significant performance improvements in video action recognition tasks, providing new insights and approaches for research in related fields.},
  keywords = {video action recognition, multi-scale preprocessing, lightweight I3D (LI3D), spatio-temporal feature extraction, bidirectional LSTM},
  issn = {3068-6652},
  publisher = {Institute of Central Computation and Knowledge}
}

Article Metrics

Citations
Views
5293
PDF Downloads
818

Publisher's Note

ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions

CC BY Copyright © 2024 by the Author(s). Published by Institute of Central Computation and Knowledge. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
ICCK Transactions on Emerging Topics in Artificial Intelligence
ICCK Transactions on Emerging Topics in Artificial Intelligence
ISSN: 3068-6652 (Online)
Portico
Preserved at
Portico