ICCK Transactions on Machine Intelligence, Volume 1, Issue 2, 2025: 80-89

Free to Read | Research Article | 14 September 2025
Emotion Detection from Speech Using CNN-BiLSTM with Feature Rich Audio Inputs
1 Amity School of Engineering and Technology, Amity University Punjab, Mohali 140306, India
* Corresponding Author: Shreya Tiwari, [email protected]
Received: 25 June 2025, Accepted: 30 July 2025, Published: 14 September 2025  
Abstract
In the age of increasing machine-mediated communication, the ability to detect emotional nuances in speech has become a critical competency for intelligent systems. This paper presents a robust Speech Emotion Recognition (SER) framework that integrates a hybrid deep learning architecture with a real-time web-based inference interface. Utilizing the RAVDESS dataset, the proposed pipeline encompasses comprehensive preprocessing, data augmentation techniques, and feature extraction based on Mel-Frequency Cepstral Coefficients (MFCCs), Chroma features, and Mel-spectrograms. A comparative experiment was run against standard machine learning classifiers, including K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest, and XGBoost. The experimental results indicate that the proposed CNN-BiLSTM-Conv1D model substantially outperforms these conventional models, achieving a state-of-the-art classification accuracy of 94%. The model was further evaluated using ROC-AUC curves and per-class performance metrics. It was subsequently deployed through a Flask-based web interface that enables users to upload voice inputs and receive real-time emotion predictions. This end-to-end system addresses the shortcomings of earlier SER approaches, such as limited temporal modeling and reduced generalization, and showcases practical applicability in domains like mental health monitoring, virtual assistants, and affective computing.
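To make the feature pipeline concrete, the sketch below shows how the MFCC, Chroma, and Mel-spectrogram features the abstract names are commonly extracted with librosa. It is a minimal sketch, assuming mean-pooling over frames and default parameter values; these, and the specific augmentations, are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch, assuming librosa; the sample rate, 40 MFCCs, and
# mean-pooling over frames are illustrative, not the paper's exact setup.
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    """Build one vector from MFCC, Chroma, and Mel-spectrogram frame means."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)  # 40 dims
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)        # 12 dims
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)        # 128 dims
    return np.concatenate([mfcc, chroma, mel])                               # 180 dims

def augment(y, sr):
    """Two common SER augmentations (assumed; the abstract says only
    'data augmentation techniques'): additive noise and pitch shifting."""
    noisy = y + 0.005 * np.random.randn(len(y))
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    return noisy, shifted
```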
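The hybrid model can be sketched in Keras as a Conv1D front end followed by a bidirectional LSTM, in the spirit of the CNN-BiLSTM-Conv1D architecture the abstract describes. The layer sizes, the 180-dimensional input (40 MFCC + 12 Chroma + 128 Mel bands, matching the sketch above), and the eight-class output are assumptions based on the abstract and the RAVDESS label set, not the paper's exact topology.

```python
# A hedged Keras sketch of a Conv1D + BiLSTM hybrid; layer widths and the
# eight-class softmax (RAVDESS emotions) are assumptions for illustration.
from tensorflow.keras import layers, models

def build_cnn_bilstm(input_len=180, n_classes=8):
    model = models.Sequential([
        layers.Input(shape=(input_len, 1)),       # feature vector as a 1-D sequence
        layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
        layers.MaxPooling1D(pool_size=2),
        layers.Bidirectional(layers.LSTM(64)),    # context in both time directions
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Note that mean-pooled features discard some temporal detail; pipelines that keep the full frame sequence instead feed the Conv1D/BiLSTM stack a (time, features) matrix, which is likely closer to the temporal modeling the abstract refers to.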
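Finally, a hypothetical Flask endpoint illustrates the upload-and-predict flow the abstract mentions. The route name, saved-model filename, label ordering, and the reuse of extract_features() from the first sketch are all assumptions for illustration, not details confirmed by the paper.

```python
# Hypothetical Flask endpoint mirroring the described web interface; assumes
# extract_features() from the earlier sketch and a saved Keras model file.
import numpy as np
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model

app = Flask(__name__)
model = load_model("ser_cnn_bilstm.h5")   # assumed saved-model filename
LABELS = ["neutral", "calm", "happy", "sad", "angry",
          "fearful", "disgust", "surprised"]   # the eight RAVDESS emotions

@app.route("/predict", methods=["POST"])
def predict():
    # Save the uploaded clip, featurize it, and return the top emotion.
    request.files["audio"].save("upload.wav")
    x = extract_features("upload.wav").reshape(1, -1, 1)   # shape (1, 180, 1)
    probs = model.predict(x)[0]
    return jsonify({"emotion": LABELS[int(np.argmax(probs))],
                    "confidence": float(np.max(probs))})

if __name__ == "__main__":
    app.run()
```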

Graphical Abstract
Emotion Detection from Speech Using CNN-BiLSTM with Feature Rich Audio Inputs

Keywords
speech emotion recognition
deep learning
CNN-BiLSTM
RAVDESS
MFCC
real-time prediction
human-computer interaction
audio processing
web deployment
affective computing

Data Availability Statement
Data will be made available on request.

Funding
This work received no funding.

Conflicts of Interest
The authors declare no conflicts of interest.

Ethical Approval and Consent to Participate
Not applicable.


Cite This Article
APA Style
Tiwari, S., Kumar, D., Mahajan, A., & Sachar, S. (2025). Emotion Detection from Speech Using CNN-BiLSTM with Feature Rich Audio Inputs. ICCK Transactions on Machine Intelligence, 1(2), 80–89. https://doi.org/10.62762/TMI.2025.306750

Article Metrics
Citations: Crossref: 0 | Scopus: 0 | Web of Science: 0
Article Access Statistics: Views: 313 | PDF Downloads: 66

Publisher's Note
ICCK stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and Permissions
Institute of Central Computation and Knowledge (ICCK) or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
ICCK Transactions on Machine Intelligence

ISSN: 3068-7403 (Online)

Email: [email protected]

Portico

All published articles are preserved here permanently:
https://www.portico.org/publishers/icck/