Chinese Journal of Information Fusion, Volume 2, Issue 4, 2025: 356-369

Open Access | Research Article | 13 November 2025
VBCSNet: A Hybrid Attention-Based Multimodal Framework with Structured Self-Attention for Sentiment Classification
1 Graduate School of Advanced Technology and Science, Tokushima University, Tokushima 770-8506, Japan
2 Graduate School of Technology, Industrial and Social Sciences, Tokushima University, Tokushima 770-8506, Japan
* Corresponding Author: Xin Kang, [email protected]
Received: 19 May 2025, Accepted: 23 October 2025, Published: 13 November 2025  
Abstract
Multimodal Sentiment Analysis (MSA), a pivotal task in affective computing, aims to enhance sentiment understanding by integrating heterogeneous data from modalities such as text, images, and audio. However, existing methods continue to face challenges in semantic alignment, modality fusion, and interpretability. To address these limitations, we propose VBCSNet, a hybrid attention-based multimodal framework that leverages the complementary strengths of Vision Transformer (ViT), BERT, and CLIP architectures. VBCSNet employs a Structured Self-Attention (SSA) mechanism to optimize intra-modal feature representation and a Cross-Attention module to achieve fine-grained semantic alignment across modalities. Furthermore, we introduce a multi-objective optimization strategy that jointly minimizes classification loss, modality alignment loss, and contrastive loss, thereby enhancing semantic consistency and feature discriminability. We evaluated VBCSNet on three multilingual multimodal sentiment datasets: MVSA, IJCAI2019, and a self-constructed Japanese Twitter corpus (JP-Buzz). Experimental results demonstrated that VBCSNet significantly outperformed state-of-the-art baselines in terms of Accuracy, Macro-F1, and cross-lingual generalization. Per-class performance analysis further highlighted the model’s interpretability and robustness. Overall, VBCSNet advances sentiment classification across languages and domains while offering a transparent reasoning mechanism suitable for real-world applications in affective computing, human-computer interaction, and socially aware AI systems.
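To make the fusion and training objective described above more concrete, the following PyTorch sketch shows how a structured self-attention pooling layer, a bidirectional cross-attention block, and a joint objective (classification + modality alignment + contrastive loss) could fit together. This is a minimal illustration, not the authors' released implementation: all class and function names (StructuredSelfAttention, CrossModalFusion, joint_loss), dimensions, head counts, and loss weights are assumptions for demonstration purposes.

```python
# Minimal, illustrative sketch of the fusion and multi-objective loss idea.
# Assumes text_tokens and image_tokens are encoder outputs (e.g., from BERT/ViT/CLIP)
# already projected to a shared embedding dimension `dim`.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StructuredSelfAttention(nn.Module):
    """Structured self-attention pooling: learns several attention heads over a
    token sequence and returns the flattened matrix of weighted-sum embeddings."""
    def __init__(self, dim, hidden=256, heads=4):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, heads, bias=False)

    def forward(self, tokens):                            # tokens: (B, T, dim)
        scores = self.w2(torch.tanh(self.w1(tokens)))     # (B, T, heads)
        attn = F.softmax(scores, dim=1)                   # attention over tokens
        pooled = torch.einsum("bth,btd->bhd", attn, tokens)  # (B, heads, dim)
        return pooled.flatten(1)                          # (B, heads*dim)


class CrossModalFusion(nn.Module):
    """Cross-attention fusion: text queries attend over image tokens and vice
    versa, then each aligned stream is pooled by structured self-attention."""
    def __init__(self, dim, n_heads=8, n_classes=3, ssa_heads=4):
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssa_t = StructuredSelfAttention(dim, heads=ssa_heads)
        self.ssa_v = StructuredSelfAttention(dim, heads=ssa_heads)
        self.classifier = nn.Linear(2 * ssa_heads * dim, n_classes)

    def forward(self, text_tokens, image_tokens):
        t_aligned, _ = self.t2v(text_tokens, image_tokens, image_tokens)
        v_aligned, _ = self.v2t(image_tokens, text_tokens, text_tokens)
        t_vec = self.ssa_t(t_aligned)
        v_vec = self.ssa_v(v_aligned)
        logits = self.classifier(torch.cat([t_vec, v_vec], dim=-1))
        return logits, t_vec, v_vec


def joint_loss(logits, labels, t_vec, v_vec, tau=0.07, w_align=0.5, w_con=0.5):
    """Joint objective: cross-entropy classification loss, cosine-based modality
    alignment loss, and a symmetric InfoNCE contrastive loss over paired
    text/image embeddings in the batch. Weights are illustrative."""
    cls_loss = F.cross_entropy(logits, labels)
    t = F.normalize(t_vec, dim=-1)
    v = F.normalize(v_vec, dim=-1)
    align_loss = (1 - (t * v).sum(-1)).mean()             # pull paired embeddings together
    sim = t @ v.t() / tau                                  # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    con_loss = 0.5 * (F.cross_entropy(sim, targets) +
                      F.cross_entropy(sim.t(), targets))
    return cls_loss + w_align * align_loss + w_con * con_loss
```

In such a setup, text_tokens would typically come from a BERT or CLIP text encoder and image_tokens from a ViT or CLIP image encoder, each linearly projected to the same dimension before fusion; the relative loss weights would be tuned on a validation set.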

Graphical Abstract
VBCSNet: A Hybrid Attention-Based Multimodal Framework with Structured Self-Attention for Sentiment Classification

Keywords
multimodal sentiment analysis
vision-language models
structured self-attention
cross-attention
contrastive learning
interpretability
cross-lingual evaluation

Data Availability Statement
Data will be made available on request.

Funding
This work was supported by the JSPS KAKENHI under Grant JP20K12027 and by JKA and its promotion funds from KEIRIN RACE.

Conflicts of Interest
The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate
Not applicable.


Cite This Article
APA Style
Liu, Y., Kang, X., Matsumoto, K., & Zhou, J. (2025). VBCSNet: A Hybrid Attention-Based Multimodal Framework with Structured Self-Attention for Sentiment Classification. Chinese Journal of Information Fusion, 2(4), 356–369. https://doi.org/10.62762/CJIF.2025.537775

Article Metrics
Citations: Crossref: 0 | Scopus: 0 | Web of Science: 0
Article Access Statistics: Views: 96 | PDF Downloads: 25

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
CC BY Copyright © 2025 by the Author(s). Published by Institute of Central Computation and Knowledge. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
Chinese Journal of Information Fusion

ISSN: 2998-3371 (Online) | ISSN: 2998-3363 (Print)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/