ICCK Transactions on Intelligent Systematics, Volume 2, Issue 4, 2025: 203-212

Free to Read | Research Article | 04 October 2025
Cross-Lingual Multimodal Event Extraction: A Unified Framework for Parameter-Efficient Fine-Tuning
1 School of Cyber Science and Technology, Beihang University, Beijing 100191, China
2 Nanchang University, Nanchang 330031, China
3 School of Information Engineering, Nanchang University, Nanchang 330031, China
4 International Business School, Beijing Foreign Studies University, Beijing 100089, China
5 School of Cyber Science and Technology, Beihang University, Beijing 100191, China
* Corresponding Author: Sheng Hong, [email protected]
Received: 07 June 2025, Accepted: 11 August 2025, Published: 04 October 2025  
Abstract
With the rapid development of multimodal large language models (MLLMs), the demand for structured event extraction (EE) in the field of scientific and technological intelligence is increasing. However, significant challenges remain in zero-shot multimodal and cross-lingual scenarios, including inconsistent cross-lingual outputs and the high computational cost of full-parameter fine-tuning. This study takes VideoLLaMA2 (VL2) and its improved version VL2.1 as the core models and builds a multimodal annotated dataset covering English, Chinese, Spanish, and Russian (5,728 EE samples). It systematically evaluates the performance differences between zero-shot learning and parameter-efficient fine-tuning (QLoRA). The experimental results show that applying QLoRA fine-tuning to the VL2 and VL2.1 models raises trigger accuracy to 65.48% and argument accuracy to 60.54%. The study confirms that fine-tuning significantly enhances model robustness.
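To make the parameter-efficient setup concrete, the following is a minimal Python sketch of QLoRA-style fine-tuning with the Hugging Face transformers, bitsandbytes, and peft libraries. It is illustrative only: the checkpoint identifier, LoRA rank, alpha, dropout, and target module names are assumptions, not the configuration reported in the article, and the real VideoLLaMA2 pipeline uses its own multimodal loader rather than AutoModelForCausalLM.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Hypothetical checkpoint id; shown here as a plain causal LM for simplicity.
model = AutoModelForCausalLM.from_pretrained(
    "DAMO-NLP-SG/VideoLLaMA2-7B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters on the attention projections; hyperparameters are assumed.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Only the adapter weights are trainable, which is what keeps fine-tuning cheap.
model.print_trainable_parameters()

With this setup, training proceeds as usual (e.g., with a standard Trainer loop) while the quantized base weights stay frozen, which is how QLoRA avoids the cost of full-parameter fine-tuning.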

Graphical Abstract
Cross-Lingual Multimodal Event Extraction: A Unified Framework for Parameter-Efficient Fine-Tuning

Keywords
event extraction
QLoRA
multimodal LLMs
multilingual NLP

Data Availability Statement
Data will be made available on request.

Funding
This work was supported by the National Key Research and Development Program under Grant 2022YFB3103602.

Conflicts of Interest
The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate
Not applicable.

Cite This Article
APA Style
Hong, S., Wang, X., Mei, Z., & Wickramaratne, T. B. (2025). Cross-Lingual Multimodal Event Extraction: A Unified Framework for Parameter-Efficient Fine-Tuning. ICCK Transactions on Intelligent Systematics, 2(4), 203–212. https://doi.org/10.62762/TIS.2025.610574

Article Metrics
Citations: Crossref 0 | Scopus 0 | Web of Science 0
Article Access Statistics: Views 38 | PDF Downloads: 16

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Intelligent Systematics

ISSN: 3068-5079 (Online) | ISSN: 3069-003X (Print)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/