ICCK Transactions on Emerging Topics in Artificial Intelligence, Volume 3, Issue 2, 2026: 76-85

Open Access | Research Article | 01 February 2026
An NLP-Based Evaluation of LLMs Across Creativity, Factual Accuracy, Open-Ended and Technical Explanations
1 Institute of Computer Science, University of Potsdam, Potsdam 14476, Germany
* Corresponding Author: Qazi Novera Tansue Nasa, [email protected]
ARK: ark:/57805/tetai.2025.264517
Received: 10 November 2025, Accepted: 01 December 2025, Published: 01 February 2026  
Abstract
The rapid advancement of AI-based language models has made Natural Language Processing (NLP) a powerful tool for text generation. This study evaluates the performance of three popular and advanced large language models (LLMs) — ChatGPT, Gemini, and DeepSeek — across four task categories: factual accuracy, creative writing, open-ended writing, and technical explanation. To quantify their performance, we apply a combination of statistical and linguistic metrics: the Dale-Chall score for readability, the type-token ratio for lexical diversity, cosine similarity over TF-IDF vectors for semantic similarity, and analyses of sentiment polarity and grammatical correctness. We also conduct a one-way ANOVA F-test to determine whether the performance differences among the LLMs are statistically significant (p < 0.05). We find only minimal differences between the models, with ChatGPT performing slightly better than the others.
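To make the evaluation pipeline concrete, the following Python sketch shows how the metrics named in the abstract could be computed with common open-source libraries. This is not the authors' released code (that is linked in the Data Availability Statement): the choice of libraries (textstat, scikit-learn, TextBlob, SciPy), the prompt, and the responses are illustrative assumptions, and the grammatical-correctness check is omitted because the abstract does not name the tool used for it.

import textstat                                   # Dale-Chall readability
from textblob import TextBlob                     # sentiment polarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import f_oneway                  # one-way ANOVA F-test

def type_token_ratio(text):
    # Lexical diversity: unique tokens divided by total tokens.
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def tfidf_cosine(reference, response):
    # Semantic similarity: cosine similarity between TF-IDF vectors.
    matrix = TfidfVectorizer().fit_transform([reference, response])
    return float(cosine_similarity(matrix[0:1], matrix[1:2])[0, 0])

def score_response(reference, response):
    # Per-response metrics named in the abstract (grammar check omitted).
    return {
        "dale_chall": textstat.dale_chall_readability_score(response),
        "type_token_ratio": type_token_ratio(response),
        "tfidf_cosine": tfidf_cosine(reference, response),
        "sentiment_polarity": TextBlob(response).sentiment.polarity,
    }

# Hypothetical example: one prompt answered by the three models.
reference = "Explain how photosynthesis converts light energy into chemical energy."
responses = {
    "ChatGPT":  "Photosynthesis captures sunlight and stores its energy as glucose ...",
    "Gemini":   "In photosynthesis, chlorophyll absorbs light to drive sugar synthesis ...",
    "DeepSeek": "Plants turn light into chemical energy through light-dependent reactions ...",
}
scores = {model: score_response(reference, text) for model, text in responses.items()}

# With real data, each group would hold one model's scores over many prompts;
# the F-test then checks whether between-model differences are significant.
# ttr_groups = [chatgpt_ttrs, gemini_ttrs, deepseek_ttrs]
# f_stat, p_value = f_oneway(*ttr_groups)   # significant if p_value < 0.05

Treating each prompt as an observation and each model as a group matches the ANOVA keyword above: a p-value below 0.05 on a given metric would indicate a statistically significant difference between the models on that metric.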


Keywords
LLMs evaluation
NLP
ChatGPT
Gemini
DeepSeek
ANOVA

Data Availability Statement
The data and code supporting the findings of this study are publicly available at the following repository: https://github.com/acdas10/NLP-based-LLM-analysis

Funding
This work received no external funding.

Conflicts of Interest
The authors declare no conflicts of interest.

AI Use Statement
The authors declare that no generative AI was used in the preparation of this manuscript.

Ethical Approval and Consent to Participate
Not applicable.


Cite This Article
APA Style
Nasa, Q. N. T., & Das, A. C. (2026). An NLP-Based Evaluation of LLMs Across Creativity, Factual Accuracy, Open-Ended and Technical Explanations. ICCK Transactions on Emerging Topics in Artificial Intelligence, 3(2), 76–85. https://doi.org/10.62762/TETAI.2025.264517


Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Copyright © 2026 by the Author(s). Published by Institute of Central Computation and Knowledge. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
ICCK Transactions on Emerging Topics in Artificial Intelligence
ISSN: 3068-6652 (Online)
Email: [email protected]
All published articles are preserved permanently in Portico: https://www.portico.org/publishers/icck/