

Dutta, S and Ganapathy, S (2022) MULTIMODAL TRANSFORMER WITH LEARNABLE FRONTEND AND SELF ATTENTION FOR EMOTION RECOGNITION. In: 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022, 23 - 27 May 2022, Virtual, Online at Singapore, pp. 5932-5936.

Official URL: https://doi.org/10.1109/ICASSP43922.2022.9747723


In this work, we propose a novel approach to multi-modal emotion recognition from conversations using speech and text. The audio representations are learned jointly by a learnable audio front-end (LEAF) model that feeds a CNN-based classifier. The text representations are derived from pre-trained Bidirectional Encoder Representations from Transformers (BERT) along with a gated recurrent unit (GRU) network. The textual and audio representations are each processed separately by a bidirectional GRU network with self-attention. Multi-modal information is then extracted by a transformer that takes the textual and audio embeddings at the utterance level as input. Experiments on the IEMOCAP database show that the proposed framework improves over the current state-of-the-art results under all the common test settings, primarily because of the improved emotion recognition performance achieved in the audio domain. We also show that the model is more robust to textual errors caused by an automatic speech recognition (ASR) system.
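As a rough illustration of the utterance-level pooling step mentioned in the abstract, the sketch below collapses a sequence of hidden states into a single embedding using dot-product attention weights. The abstract does not specify the attention formulation, so the scoring rule, the `query` parameter, the function names, and the toy inputs are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states: Sequence[Sequence[float]],
                   query: Sequence[float]) -> List[float]:
    """Collapse a sequence of hidden states into one vector.

    Each state is scored by its dot product with `query` (a hypothetical
    learned parameter), the scores are normalised with softmax, and the
    states are averaged using those weights.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]


# Toy stand-ins for per-frame audio states and per-token text states
# (in a real system these would come from recurrent encoders).
audio_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_states = [[0.5, 0.5], [2.0, 0.0]]
query = [0.0, 0.0]  # a zero query gives uniform weights, i.e. plain averaging

audio_embedding = attention_pool(audio_states, query)
text_embedding = attention_pool(text_states, query)

# One short sequence per utterance, as a fusion model might receive it.
fusion_input = [audio_embedding, text_embedding]
```

With a non-zero `query`, states that align with it receive larger weights, so the pooled embedding leans toward the frames or tokens the model treats as most informative.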

Item Type: Conference Paper
Publication: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Additional Information: The copyright for this article belongs to Institute of Electrical and Electronics Engineers Inc.
Keywords: Character recognition; Computer vision; Recurrent neural networks; Speech recognition; Attention model; Audio representation; Emotion recognition; Front end; Learnable front-end; Multi-modal; Multi-modal emotion recognition; Recurrent networks; Self-attention model; Transformer network
Department/Centre: Division of Electrical Sciences > Electrical Engineering
Date Deposited: 05 Aug 2022 09:06
Last Modified: 05 Aug 2022 09:06
URI: https://eprints.iisc.ac.in/id/eprint/75355
