ICCK Transactions on Intelligent Systematics, Volume 2, Issue 4, 2025: 203-212

Free to Read | Research Article | 04 October 2025
Cross-Lingual Multimodal Event Extraction: A Unified Framework for Parameter-Efficient Fine-Tuning
1 School of Cyber Science and Technology, Beihang University, Beijing 100191, China
2 Nanchang University, Nanchang 330031, China
3 School of Information Engineering, Nanchang University, Nanchang 330031, China
4 International Business School, Beijing Foreign Studies University, Beijing 100089, China
5 School of Cyber Science and Technology, Beihang University, Beijing 100191, China
* Corresponding Author: Sheng Hong, [email protected]
Received: 07 June 2025, Accepted: 11 August 2025, Published: 04 October 2025  
Abstract
With the rapid development of multimodal large language models (MLLMs), the demand for structured event extraction (EE) in the field of scientific and technological intelligence is increasing. However, significant challenges remain in zero-shot multimodal and cross-lingual scenarios, including inconsistent cross-lingual outputs and the high computational cost of full-parameter fine-tuning. This study takes VideoLLaMA2 (VL2) and its improved version VL2.1 as the core models and builds a multimodal annotated dataset covering English, Chinese, Spanish, and Russian (5,728 EE samples). It systematically evaluates the performance differences between zero-shot learning and parameter-efficient fine-tuning (QLoRA). The experimental results show that applying QLoRA fine-tuning to the VL2 and VL2.1 models raises trigger accuracy to 65.48% and argument accuracy to 60.54%. The study confirms that fine-tuning significantly enhances model robustness.
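To make the parameter-efficient setup concrete, the following is a minimal Python sketch of QLoRA-style fine-tuning with the Hugging Face transformers, bitsandbytes, and peft libraries. It is illustrative only: the checkpoint identifier, LoRA rank, alpha, dropout, and target module names are assumptions, not the configuration reported in the article, and the real VideoLLaMA2 pipeline uses its own multimodal loader rather than AutoModelForCausalLM.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Hypothetical checkpoint id; shown here as a plain causal LM for simplicity.
model = AutoModelForCausalLM.from_pretrained(
    "DAMO-NLP-SG/VideoLLaMA2-7B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters on the attention projections; hyperparameters are assumed.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Only the adapter weights are trainable, which is what keeps fine-tuning cheap.
model.print_trainable_parameters()

With this setup, training proceeds as usual (e.g., with a standard Trainer loop) while the quantized base weights stay frozen, which is how QLoRA avoids the cost of full-parameter fine-tuning.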

Graphical Abstract
Cross-Lingual Multimodal Event Extraction: A Unified Framework for Parameter-Efficient Fine-Tuning

Keywords
event extraction
QLoRA
multimodal LLMs
multilingual NLP

Data Availability Statement
Data will be made available on request.

Funding
This work was supported by the National Key Research and Development Program under Grant 2022YFB3103602.

Conflicts of Interest
The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate
Not applicable.

Cite This Article
APA Style
Hong, S., Wang, X., Mei, Z., & Wickramaratne, T. B. (2025). Cross-Lingual Multimodal Event Extraction: A Unified Framework for Parameter-Efficient Fine-Tuning. ICCK Transactions on Intelligent Systematics, 2(4), 203–212. https://doi.org/10.62762/TIS.2025.610574

Article Metrics
Citations: Crossref 0 | Scopus 0 | Web of Science 0
Article Access Statistics: Views 38 | PDF Downloads: 16

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Intelligent Systematics

ISSN: 3068-5079 (Online) | ISSN: 3069-003X (Print)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/