Multimodal Emotion Recognition Using Deep Learning with Voice, Text, and Facial Expression Analysis

P. Praveenkumar; S. Yogithaa; R. Soundarya; M. Harshini

doi:10.2991/978-94-6239-616-6_21

Authors

P. Praveenkumar¹, S. Yogithaa¹,*, R. Soundarya¹, M. Harshini¹

¹Sri Manakula Vinayagar Engineering College, Madagadipet, Puducherry, 605107, India

*Corresponding author. Email: yogithaasendhil@gmail.com

Corresponding Author

S. Yogithaa

Available Online 31 March 2026.

DOI: 10.2991/978-94-6239-616-6_21
Keywords: Multimodal emotion recognition; BiLSTM; CNN-RNN; ResNet-101; feature-level fusion; audio-text-visual integration; affective computing; human–computer interaction
Abstract: Emotion recognition plays a crucial role in intelligent systems, as emotions influence communication, decision-making, and human–machine interaction. Audio-only methods such as CNN-BiLSTM often perform poorly because emotional expression is spread across speech, facial cues, and textual semantics, and a single modality captures only part of it. This study proposes a multimodal framework that integrates text, audio, and facial expressions for robust emotion detection. Text is modeled with a BiLSTM to capture contextual meaning, audio is processed by a CNN-RNN hybrid to learn spectral–temporal cues, and visual data is analyzed with ResNet-101 for deep facial feature extraction. Feature-level fusion combines all modalities into a unified emotional representation, improving accuracy and stability across real-world conditions. The approach benefits applications in HCI, e-learning, affective computing, and mental-health monitoring.
Copyright: © 2026 The Author(s)
Open Access: This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
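
The feature-level fusion named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each encoder (BiLSTM for text, CNN-RNN for audio, ResNet-101 for faces) has already produced a fixed-length embedding, and that fusion means concatenating those embeddings before a single classifier. The embedding sizes, the emotion labels, the random stand-in vectors, and the helper names `fuse_features` and `classify` are all hypothetical, chosen only to show the shapes involved.

```python
import math
import random

# Hypothetical label set and embedding sizes; the abstract does not specify them.
EMOTIONS = ["angry", "happy", "neutral", "sad"]
TEXT_DIM, AUDIO_DIM, VISUAL_DIM = 8, 6, 10


def fuse_features(text_vec, audio_vec, visual_vec):
    """Feature-level fusion, assumed here to be plain concatenation."""
    return list(text_vec) + list(audio_vec) + list(visual_vec)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """One linear layer followed by softmax over the fused vector."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


rng = random.Random(0)

# Random stand-ins for the three encoder outputs (no real model is run here).
text_vec = [rng.uniform(-1, 1) for _ in range(TEXT_DIM)]
audio_vec = [rng.uniform(-1, 1) for _ in range(AUDIO_DIM)]
visual_vec = [rng.uniform(-1, 1) for _ in range(VISUAL_DIM)]

fused = fuse_features(text_vec, audio_vec, visual_vec)

# Untrained random weights: one row per emotion, one column per fused feature.
weights = [[rng.uniform(-0.1, 0.1) for _ in fused] for _ in EMOTIONS]
biases = [0.0] * len(EMOTIONS)

probs = classify(fused, weights, biases)
predicted = EMOTIONS[probs.index(max(probs))]
print(len(fused), predicted)
```

Because the classifier sees one joint vector, it can in principle learn interactions between modalities, such as a neutral sentence spoken in an angry tone. In this sketch the weights are untrained, so the predicted label carries no meaning.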

Volume Title: Proceedings of the International Conference on Artificial Intelligence and Secure Data Analytics (ICAISDA 2025)
Series: Advances in Intelligent Systems Research
Publication Date: 31 March 2026
ISBN: 978-94-6239-616-6
ISSN: 1951-6851

Cite this article

TY  - CONF
AU  - P. Praveenkumar
AU  - S. Yogithaa
AU  - R. Soundarya
AU  - M. Harshini
PY  - 2026
DA  - 2026/03/31
TI  - Multimodal Emotion Recognition Using Deep Learning with Voice, Text, and Facial Expression Analysis
BT  - Proceedings of the International Conference on Artificial Intelligence and Secure Data Analytics (ICAISDA 2025)
PB  - Atlantis Press
SP  - 249
EP  - 261
SN  - 1951-6851
UR  - https://doi.org/10.2991/978-94-6239-616-6_21
DO  - 10.2991/978-94-6239-616-6_21
ID  - Praveenkumar2026
ER  -
