Chinese Journal of Information Fusion, Volume 2, Issue 4, 2025: 356-369

Open Access | Research Article | 13 November 2025
VBCSNet: A Hybrid Attention-Based Multimodal Framework with Structured Self-Attention for Sentiment Classification
1 Graduate School of Advanced Technology and Science, Tokushima University, Tokushima 770-8506, Japan
2 Graduate School of Technology, Industrial and Social Sciences, Tokushima University, Tokushima 770-8506, Japan
* Corresponding Author: Xin Kang, [email protected]
Received: 19 May 2025, Accepted: 23 October 2025, Published: 13 November 2025  
Abstract
Multimodal Sentiment Analysis (MSA), a pivotal task in affective computing, aims to enhance sentiment understanding by integrating heterogeneous data from modalities such as text, images, and audio. However, existing methods continue to face challenges in semantic alignment, modality fusion, and interpretability. To address these limitations, we propose VBCSNet, a hybrid attention-based multimodal framework that leverages the complementary strengths of Vision Transformer (ViT), BERT, and CLIP architectures. VBCSNet employs a Structured Self-Attention (SSA) mechanism to optimize intra-modal feature representation and a Cross-Attention module to achieve fine-grained semantic alignment across modalities. Furthermore, we introduce a multi-objective optimization strategy that jointly minimizes classification loss, modality alignment loss, and contrastive loss, thereby enhancing semantic consistency and feature discriminability. We evaluated VBCSNet on three multilingual multimodal sentiment datasets: MVSA, IJCAI2019, and a self-constructed Japanese Twitter corpus (JP-Buzz). Experimental results demonstrated that VBCSNet significantly outperformed state-of-the-art baselines in terms of Accuracy, Macro-F1, and cross-lingual generalization. Per-class performance analysis further highlighted the model’s interpretability and robustness. Overall, VBCSNet advances sentiment classification across languages and domains while offering a transparent reasoning mechanism suitable for real-world applications in affective computing, human-computer interaction, and socially aware AI systems.
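To make the fusion and training objective described above more concrete, the following PyTorch sketch shows how a structured self-attention pooling layer, a bidirectional cross-attention block, and a joint objective (classification + modality alignment + contrastive loss) could fit together. This is a minimal illustration, not the authors' released implementation: all class and function names (StructuredSelfAttention, CrossModalFusion, joint_loss), dimensions, head counts, and loss weights are assumptions for demonstration purposes.

```python
# Minimal, illustrative sketch of the fusion and multi-objective loss idea.
# Assumes text_tokens and image_tokens are encoder outputs (e.g., from BERT/ViT/CLIP)
# already projected to a shared embedding dimension `dim`.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StructuredSelfAttention(nn.Module):
    """Structured self-attention pooling: learns several attention heads over a
    token sequence and returns the flattened matrix of weighted-sum embeddings."""
    def __init__(self, dim, hidden=256, heads=4):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, heads, bias=False)

    def forward(self, tokens):                            # tokens: (B, T, dim)
        scores = self.w2(torch.tanh(self.w1(tokens)))     # (B, T, heads)
        attn = F.softmax(scores, dim=1)                   # attention over tokens
        pooled = torch.einsum("bth,btd->bhd", attn, tokens)  # (B, heads, dim)
        return pooled.flatten(1)                          # (B, heads*dim)


class CrossModalFusion(nn.Module):
    """Cross-attention fusion: text queries attend over image tokens and vice
    versa, then each aligned stream is pooled by structured self-attention."""
    def __init__(self, dim, n_heads=8, n_classes=3, ssa_heads=4):
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssa_t = StructuredSelfAttention(dim, heads=ssa_heads)
        self.ssa_v = StructuredSelfAttention(dim, heads=ssa_heads)
        self.classifier = nn.Linear(2 * ssa_heads * dim, n_classes)

    def forward(self, text_tokens, image_tokens):
        t_aligned, _ = self.t2v(text_tokens, image_tokens, image_tokens)
        v_aligned, _ = self.v2t(image_tokens, text_tokens, text_tokens)
        t_vec = self.ssa_t(t_aligned)
        v_vec = self.ssa_v(v_aligned)
        logits = self.classifier(torch.cat([t_vec, v_vec], dim=-1))
        return logits, t_vec, v_vec


def joint_loss(logits, labels, t_vec, v_vec, tau=0.07, w_align=0.5, w_con=0.5):
    """Joint objective: cross-entropy classification loss, cosine-based modality
    alignment loss, and a symmetric InfoNCE contrastive loss over paired
    text/image embeddings in the batch. Weights are illustrative."""
    cls_loss = F.cross_entropy(logits, labels)
    t = F.normalize(t_vec, dim=-1)
    v = F.normalize(v_vec, dim=-1)
    align_loss = (1 - (t * v).sum(-1)).mean()             # pull paired embeddings together
    sim = t @ v.t() / tau                                  # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    con_loss = 0.5 * (F.cross_entropy(sim, targets) +
                      F.cross_entropy(sim.t(), targets))
    return cls_loss + w_align * align_loss + w_con * con_loss
```

In such a setup, text_tokens would typically come from a BERT or CLIP text encoder and image_tokens from a ViT or CLIP image encoder, each linearly projected to the same dimension before fusion; the relative loss weights would be tuned on a validation set.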

Graphical Abstract
VBCSNet: A Hybrid Attention-Based Multimodal Framework with Structured Self-Attention for Sentiment Classification

Keywords
multimodal sentiment analysis
vision-language models
structured self-attention
cross-attention
contrastive learning
interpretability
cross-lingual evaluation

Data Availability Statement
Data will be made available on request.

Funding
This work was supported by the JSPS KAKENHI under Grant JP20K12027 and by JKA and its promotion funds from KEIRIN RACE.

Conflicts of Interest
The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate
Not applicable.


Cite This Article
APA Style
Liu, Y., Kang, X., Matsumoto, K., & Zhou, J. (2025). VBCSNet: A Hybrid Attention-Based Multimodal Framework with Structured Self-Attention for Sentiment Classification. Chinese Journal of Information Fusion, 2(4), 356–369. https://doi.org/10.62762/CJIF.2025.537775

Article Metrics
Citations: Crossref: 0 | Scopus: 0 | Web of Science: 0
Article Access Statistics: Views: 96 | PDF Downloads: 25

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
CC BY Copyright © 2025 by the Author(s). Published by Institute of Central Computation and Knowledge. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
Chinese Journal of Information Fusion

ISSN: 2998-3371 (Online) | ISSN: 2998-3363 (Print)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/