Abstract
The capacity of machines to generate captions for images autonomously represents a significant advance in artificial intelligence and language understanding. This paper presents an image captioning system that uses deep learning techniques, notably convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to produce contextually appropriate and meaningful descriptions of visual content. The proposed technique extracts features using the DenseNet201 model, which allows for a thorough, hierarchical understanding of image components. These extracted features are then processed by a long short-term memory (LSTM) network, an RNN variant designed to capture sequential dependencies in language, resulting in captions that are coherent and fluent. The model is trained and evaluated on the well-known Flickr8k dataset, attaining competitive performance as measured by BLEU score metrics and demonstrating its ability to generate human-like descriptions. This combination of CNNs and RNNs illustrates the value of integrating computer vision and natural language processing for automated caption generation. The approach has potential applications in a range of domains, including assistive technology for the visually impaired, automated content production for digital media, enhanced indexing and retrieval of multimedia assets, and improved human-computer interaction. Furthermore, advances in attention mechanisms and transformer-based models offer opportunities to improve the accuracy and contextual relevance of image captioning models. The study emphasizes the broader implications of machine-generated captions for increasing accessibility, improving searchability in large-scale databases, and enabling seamless AI-human cooperation in content interpretation and storytelling.
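To make the described pipeline concrete, the sketch below outlines a DenseNet201-encoder/LSTM-decoder captioner in Keras. It is a minimal illustration rather than the paper's reported implementation: the 256-unit layer widths, the assumed vocabulary size and maximum caption length, and the additive merge of image and text features are all illustrative assumptions.

from tensorflow.keras.applications import DenseNet201
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# Feature extractor: DenseNet201 with global average pooling yields a
# 1920-dimensional vector per image (features are typically precomputed).
cnn = DenseNet201(weights="imagenet", include_top=False, pooling="avg")

vocab_size = 8485   # assumed Flickr8k vocabulary size (illustrative)
max_length = 34     # assumed maximum caption length in tokens (illustrative)

# Image branch: project the pooled CNN features to the decoder width.
image_input = Input(shape=(1920,))
img = Dropout(0.5)(image_input)
img = Dense(256, activation="relu")(img)

# Language branch: embed the partial caption and summarize it with an LSTM.
caption_input = Input(shape=(max_length,))
txt = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
txt = Dropout(0.5)(txt)
txt = LSTM(256)(txt)

# Merge both branches and predict the next word of the caption.
merged = Dense(256, activation="relu")(add([img, txt]))
next_word = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[image_input, caption_input], outputs=next_word)
model.compile(loss="categorical_crossentropy", optimizer="adam")

At inference time such a model is applied autoregressively: starting from a start token, the most probable next word is appended to the caption until an end token or max_length is reached, and the generated captions can then be scored against the reference captions with BLEU.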
Keywords
convolutional neural networks (CNN)
recurrent neural networks (RNN)
deep learning
image captioning
LSTM
DenseNet201
attention mechanism
BLEU score
natural language processing (NLP)
multimodal learning
content retrieval
Data Availability Statement
Data will be made available on request.
Funding
This work received no funding.
Conflicts of Interest
The authors declare no conflicts of interest.
Ethical Approval and Consent to Participate
Not applicable.
Cite This Article
APA Style
Khan, A., & Singh, J. (2025). A Novel Image Captioning Technique Using Deep Learning Methodology. ICCK Transactions on Machine Intelligence, 1(2), 52–68. https://doi.org/10.62762/TMI.2025.886122
Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.