ICCK Transactions on Machine Intelligence, Volume 1, Issue 2, 2025: 80-89

Free to Read | Research Article | 14 September 2025
Emotion Detection from Speech Using CNN-BiLSTM with Feature Rich Audio Inputs
1 Amity School of Engineering and Technology, Amity University Punjab, Mohali 140306, India
* Corresponding Author: Shreya Tiwari, [email protected]
Received: 25 June 2025, Accepted: 30 July 2025, Published: 14 September 2025  
Abstract
In the age of increasing machine-mediated communication, the ability to detect emotional nuances in speech has become a critical competency for intelligent systems. This paper presents a robust Speech Emotion Recognition (SER) framework that integrates a hybrid deep learning architecture with a real-time web-based inference interface. Utilizing the RAVDESS dataset, the proposed pipeline encompasses comprehensive preprocessing, data augmentation techniques, and feature extraction based on Mel-Frequency Cepstral Coefficients (MFCCs), Chroma features, and Mel-spectrograms. A comparative experiment was run against standard machine learning classifiers, including K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest, and XGBoost. The experimental results indicate that the proposed CNN-BiLSTM-Conv1D model substantially outperforms these conventional models, achieving a state-of-the-art classification accuracy of 94%. The model was further evaluated using ROC-AUC curves and per-class performance metrics. It was subsequently deployed through a Flask-based web interface that enables users to upload voice inputs and receive real-time emotion predictions. This end-to-end system addresses the shortcomings of earlier SER approaches, such as limited temporal modeling and reduced generalization, and showcases practical applicability in domains like mental health monitoring, virtual assistants, and affective computing.
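To make the feature pipeline concrete, the sketch below shows how the MFCC, Chroma, and Mel-spectrogram features the abstract names are commonly extracted with librosa. It is a minimal sketch, assuming mean-pooling over frames and default parameter values; these, and the specific augmentations, are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch, assuming librosa; the sample rate, 40 MFCCs, and
# mean-pooling over frames are illustrative, not the paper's exact setup.
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    """Build one vector from MFCC, Chroma, and Mel-spectrogram frame means."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)  # 40 dims
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)        # 12 dims
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)        # 128 dims
    return np.concatenate([mfcc, chroma, mel])                               # 180 dims

def augment(y, sr):
    """Two common SER augmentations (assumed; the abstract says only
    'data augmentation techniques'): additive noise and pitch shifting."""
    noisy = y + 0.005 * np.random.randn(len(y))
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    return noisy, shifted
```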
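The hybrid model can be sketched in Keras as a Conv1D front end followed by a bidirectional LSTM, in the spirit of the CNN-BiLSTM-Conv1D architecture the abstract describes. The layer sizes, the 180-dimensional input (40 MFCC + 12 Chroma + 128 Mel bands, matching the sketch above), and the eight-class output are assumptions based on the abstract and the RAVDESS label set, not the paper's exact topology.

```python
# A hedged Keras sketch of a Conv1D + BiLSTM hybrid; layer widths and the
# eight-class softmax (RAVDESS emotions) are assumptions for illustration.
from tensorflow.keras import layers, models

def build_cnn_bilstm(input_len=180, n_classes=8):
    model = models.Sequential([
        layers.Input(shape=(input_len, 1)),       # feature vector as a 1-D sequence
        layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
        layers.MaxPooling1D(pool_size=2),
        layers.Bidirectional(layers.LSTM(64)),    # context in both time directions
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Note that mean-pooled features discard some temporal detail; pipelines that keep the full frame sequence instead feed the Conv1D/BiLSTM stack a (time, features) matrix, which is likely closer to the temporal modeling the abstract refers to.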
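Finally, a hypothetical Flask endpoint illustrates the upload-and-predict flow the abstract mentions. The route name, saved-model filename, label ordering, and the reuse of extract_features() from the first sketch are all assumptions for illustration, not details confirmed by the paper.

```python
# Hypothetical Flask endpoint mirroring the described web interface; assumes
# extract_features() from the earlier sketch and a saved Keras model file.
import numpy as np
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model

app = Flask(__name__)
model = load_model("ser_cnn_bilstm.h5")   # assumed saved-model filename
LABELS = ["neutral", "calm", "happy", "sad", "angry",
          "fearful", "disgust", "surprised"]   # the eight RAVDESS emotions

@app.route("/predict", methods=["POST"])
def predict():
    # Save the uploaded clip, featurize it, and return the top emotion.
    request.files["audio"].save("upload.wav")
    x = extract_features("upload.wav").reshape(1, -1, 1)   # shape (1, 180, 1)
    probs = model.predict(x)[0]
    return jsonify({"emotion": LABELS[int(np.argmax(probs))],
                    "confidence": float(np.max(probs))})

if __name__ == "__main__":
    app.run()
```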

Graphical Abstract
Emotion Detection from Speech Using CNN-BiLSTM with Feature Rich Audio Inputs

Keywords
speech emotion recognition
deep learning
CNN-BiLSTM
RAVDESS
MFCC
real-time prediction
human-computer interaction
audio processing
web deployment
affective computing

Data Availability Statement
Data will be made available on request.

Funding
This work received no funding.

Conflicts of Interest
The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate
Not applicable.


Cite This Article
APA Style
Tiwari, S., Kumar, D., Mahajan, A., & Sachar, S. (2025). Emotion Detection from Speech Using CNN-BiLSTM with Feature Rich Audio Inputs. ICCK Transactions on Machine Intelligence, 1(2), 80–89. https://doi.org/10.62762/TMI.2025.306750

Article Metrics
Citations: Crossref: 0 | Scopus: 0 | Web of Science: 0
Article Access Statistics: Views: 313 | PDF Downloads: 66

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Machine Intelligence

ISSN: 3068-7403 (Online)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/