Enhancing Sentiment Analysis of Roman Urdu Using Augmentation Techniques and Deep Learning Models

Muhammad Owais Khan; Wahab Khan; Yanan Wang; Aziz Ur Rehman; Muhammad Alamzeb Khan

doi:10.62762/TACS.2025.190575

Volume 2, Issue 2, ICCK Transactions on Advanced Computing and Systems

Volume 2, Issue 2, 2025

ICCK Transactions on Advanced Computing and Systems, Volume 2, Issue 2, 2025: 1-16

Open Access | Research Article | 17 May 2025

Muhammad Owais Khan 1

Wahab Khan 1 *

Yanan Wang 2

Aziz Ur Rehman 3

Muhammad Alamzeb Khan 1

1 Department of Computer Science, University of Science and Technology, Bannu, Khyber Pakhtunkhwa, Pakistan

2 Department of Computer Science and Engineering, Sejong University, Seoul 05006, Republic of Korea

3 Department of Computer Science, Islamia College University, Peshawar, Khyber Pakhtunkhwa, Pakistan

* Corresponding Author: Wahab Khan, [email protected]

DOI: 10.62762/TACS.2025.190575

Received: 29 February 2025, Accepted: 09 May 2025, Published: 17 May 2025

Abstract

Roman Urdu sentiment analysis is hindered by inconsistent transliteration, informal language use, and a shortage of labeled datasets. This study proposes a novel framework that addresses these challenges by combining advanced data preprocessing with three data augmentation strategies: synonym replacement, back-translation, and random word insertion. These methods increase dataset diversity and thereby improve the models’ ability to generalize. A Roman Urdu dataset was collected from diverse sources, including social media platforms (Facebook, Twitter, YouTube), blogs, forums, and e-commerce sites, to capture a wide range of user opinions. Three deep learning models were evaluated for sentiment classification: Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM). The LSTM model performed best, with an accuracy of 94%, compared with 92% for GRU and 90% for RNN. The LSTM’s ability to capture long-term dependencies and contextual nuances in Roman Urdu text makes it the most effective of the three models for this task and yields a significant improvement over traditional methods.
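
As an illustration of two of the augmentation strategies named in the abstract, the following Python sketch shows one possible form of synonym replacement and random word insertion. It is not the authors' implementation: the SYNONYMS table, the function names, and the example sentence are hypothetical, and back-translation is omitted because it depends on an external translation model.

```python
import random

# Hypothetical toy synonym table for illustration only; the lexicon
# actually used in the study is not given on this page.
SYNONYMS = {
    "acha": ["behtareen", "umda"],
    "bura": ["kharab", "bekar"],
}


def synonym_replacement(tokens, n, rng):
    """Replace up to n tokens that have an entry in SYNONYMS."""
    out = list(tokens)
    candidates = [i for i, tok in enumerate(out) if tok in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(tokens, n, rng):
    """Insert n synonyms of known tokens at random positions.

    Returns an unchanged copy when no token has a synonym entry.
    """
    out = list(tokens)
    known = [tok for tok in tokens if tok in SYNONYMS]
    if not known:
        return out
    for _ in range(n):
        word = rng.choice(SYNONYMS[rng.choice(known)])
        # randint is inclusive, so the word may also be appended at the end.
        out.insert(rng.randint(0, len(out)), word)
    return out


def augment(sentence, seed=0):
    """Return two augmented variants of a whitespace-tokenized sentence."""
    rng = random.Random(seed)
    tokens = sentence.lower().split()
    return [
        " ".join(synonym_replacement(tokens, 1, rng)),
        " ".join(random_insertion(tokens, 1, rng)),
    ]


if __name__ == "__main__":
    for variant in augment("Yeh phone bohat acha hai"):
        print(variant)
```

Each augmented variant would keep the sentiment label of its source sentence, which is why the replacement words in this sketch are drawn only from same-polarity synonyms.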

Keywords

Roman Urdu

sentiment analysis

deep learning

data augmentation

text classification

GRU

LSTM

RNN

Data Availability Statement

The dataset used in this study is publicly available at: https://github.com/awais1992/RomanUrdu-Sentiment-Aug. It contains Roman Urdu sentiment-annotated data, which can be accessed and utilized under the terms specified in the repository.

Funding

This work received no funding.

Conflicts of Interest

The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate

Not applicable.

Cite This Article

APA Style

Khan, M. O., Khan, W., Wang, Y., Rehman, A. U., & Khan, M. A. (2025). Enhancing Sentiment Analysis of Roman Urdu Using Augmentation Techniques and Deep Learning Models. ICCK Transactions on Advanced Computing and Systems, 2(2), 1–16. https://doi.org/10.62762/TACS.2025.190575

Publisher's Note

ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Copyright © 2025 by the Author(s). Published by Institute of Central Computation and Knowledge. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

ICCK Transactions on Advanced Computing and Systems

ISSN: pending (Online)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/
