Capturing Poetic Essence: Text Summarization and Visual Generation via Multimodal
Research Article  ·  Published: 27 July 2025
ICCK Transactions on Intelligent Systematics
Volume 2, Issue 3, 2025: 160-168

1 Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi 23460, Pakistan
2 Faculty of Computer Science, CECOS University of Information Technology and Emerging Sciences, Peshawar 25000, Pakistan
3 Graduate School of Information Science and Technology, Osaka University, Osaka 565-0871, Japan
4 School of Information Technology, Deakin University, Geelong, Victoria 3220, Australia
5 School of Computing and Digital Technology, Birmingham City University, West Midlands B5 5JU, United Kingdom
* Corresponding Authors: Junaid Yousaf; Iqra Pervaiz

Abstract

Multimodal intelligent systems that integrate natural language processing with generative visual synthesis represent a frontier in intelligent information processing. This work designs and evaluates such a pipeline, using poetry as a stress-test domain because of its dense figurative language and abstract semantics. Building on the PoemSum dataset, we construct a two-stage multimodal pipeline: transformer-based models (BART and T5) first produce abstractive summaries, and Stable Diffusion then synthesizes images from those summaries. The summarization stage focuses on figurative interpretation, capturing the metaphorical and symbolic elements inherent in poetic language. Evaluation results show that BART outperforms T5 in summarization, achieving a ROUGE score of 41.90% and a BERTScore of 85.22. For image generation, an Inception Score (IS) of 7.63 ± 0.62 reflects high visual quality and diversity, while a CLIP Score of 29.48 indicates strong semantic alignment between the textual summaries and the generated images. The proposed architecture offers a generalizable framework for multimodal intelligent systems, with potential applications in intelligent tutoring, automated content generation, and human-computer interaction.
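
The Inception Score quoted in the abstract is reported as a mean with a spread, which is the usual form when the score is computed on several subsets of the generated images. The following is a minimal sketch of that calculation, not the authors' evaluation code. It assumes the per-image class probabilities p(y|x) have already been obtained from an image classifier; the function names, the number of splits, and the use of the population standard deviation are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def inception_score(probs):
    """Return exp(mean over images of KL(p(y|x) || p(y))).

    probs: list of per-image class-probability lists, each summing to 1.
    (Hypothetical helper; the classifier that produces probs is not shown.)
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_total = 0.0
    for p in probs:
        # KL divergence between this image's prediction and the marginal.
        # Terms with p_j == 0 contribute nothing and are skipped.
        kl_total += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0
        )
    return math.exp(kl_total / n)


def inception_score_over_splits(probs, n_splits):
    """Score each of n_splits equal-size chunks; return (mean, std).

    Assumption: population standard deviation across splits.
    Trailing images that do not fill a whole chunk are ignored.
    """
    size = len(probs) // n_splits
    scores = [
        inception_score(probs[i * size:(i + 1) * size]) for i in range(n_splits)
    ]
    return mean(scores), pstdev(scores)
```

Under this definition, confident predictions spread evenly over two classes, such as `[[1.0, 0.0], [0.0, 1.0]]`, score 2 (the number of classes), while identical uniform predictions score 1. Higher values therefore indicate images that are both individually recognizable and collectively diverse.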

Keywords

multimodal intelligent systems; abstractive summarization; text-to-image synthesis; diffusion models; semantic alignment; transformer architectures

Data Availability Statement

Data will be made available on request.

Funding

This work received no funding.

Conflicts of Interest

The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate

Not applicable.

Cite This Article

APA Style
Yousaf, J., Iqbal, M., Pervaiz, I., Ismail, M., Islam, T. U., & Jadoon, K. K. (2025). Capturing Poetic Essence: Text Summarization and Visual Generation via Multimodal. ICCK Transactions on Intelligent Systematics, 2(3), 160–168. https://doi.org/10.62762/TIS.2025.405393
Export Citation
RIS Format
Compatible with EndNote, Zotero, Mendeley, and other reference managers
TY  - JOUR
AU  - Yousaf, Junaid
AU  - Iqbal, Mazhar
AU  - Pervaiz, Iqra
AU  - Ismail, Muhammad
AU  - Islam, Toqeer Ul
AU  - Jadoon, Khurram Khan
PY  - 2025
DA  - 2025/07/27
TI  - Capturing Poetic Essence: Text Summarization and Visual Generation via Multimodal
JO  - ICCK Transactions on Intelligent Systematics
T2  - ICCK Transactions on Intelligent Systematics
JF  - ICCK Transactions on Intelligent Systematics
VL  - 2
IS  - 3
SP  - 160
EP  - 168
DO  - 10.62762/TIS.2025.405393
UR  - https://www.icck.org/article/abs/TIS.2025.405393
KW  - multimodal intelligent systems
KW  - abstractive summarization
KW  - text-to-image synthesis
KW  - diffusion models
KW  - semantic alignment
KW  - transformer architectures
SN  - 3068-5079
PB  - Institute of Central Computation and Knowledge
LA  - English
ER  - 
BibTeX Format
Compatible with LaTeX, BibTeX, and other reference managers
@article{Yousaf2025Capturing,
  author = {Junaid Yousaf and Mazhar Iqbal and Iqra Pervaiz and Muhammad Ismail and Toqeer Ul Islam and Khurram Khan Jadoon},
  title = {Capturing Poetic Essence: Text Summarization and Visual Generation via Multimodal},
  journal = {ICCK Transactions on Intelligent Systematics},
  year = {2025},
  volume = {2},
  number = {3},
  pages = {160-168},
  doi = {10.62762/TIS.2025.405393},
  url = {https://www.icck.org/article/abs/TIS.2025.405393},
  keywords = {multimodal intelligent systems, abstractive summarization, text-to-image synthesis, diffusion models, semantic alignment, transformer architectures},
  issn = {3068-5079},
  publisher = {Institute of Central Computation and Knowledge}
}

Publisher's Note

ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions

Institute of Central Computation and Knowledge (ICCK) or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Intelligent Systematics
ISSN: 3068-5079 (Online) | ISSN: 3069-003X (Print)
Preserved at Portico