Speaker conditioned acoustic modeling for multi-speaker conversational ASR

Chetupalli, SR and Ganapathy, S (2022) Speaker conditioned acoustic modeling for multi-speaker conversational ASR. In: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022, 18 - 22 September 2022, Incheon, pp. 3834-3838.

PDF
INTERSPEECH_2022.pdf - Published Version

Official URL: https://doi.org/10.21437/Interspeech.2022-11267

Abstract

In this paper, we propose a novel approach for the transcription of speech conversations with natural speaker overlap from single-channel speech recordings. The proposed model is a combination of a speaker diarization system and a hybrid automatic speech recognition (ASR) system. The speaker conditioned acoustic model (SCAM) in the ASR system consists of a series of embedding layers which use the speaker activity inputs from the diarization system to derive speaker-specific embeddings. The outputs of the SCAM are speaker-specific senones that are used for decoding the transcripts for each speaker in the conversation. In this work, we experiment with the automatic speaker activity decisions generated using an end-to-end speaker diarization system. A joint learning approach is also proposed, in which the diarization model and the ASR acoustic model are jointly optimized. The experiments are performed on the mixed-channel two-speaker recordings from the Switchboard corpus of telephone conversations. In these experiments, we show that the proposed acoustic model, incorporating speaker activity decisions and joint optimization, improves significantly over the ASR system with explicit source filtering (a relative improvement of 12% in word error rate (WER) over the baseline system).
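
The abstract reports its result as a relative improvement in word error rate over a baseline. As a reading aid, the following is a minimal sketch of how WER and a relative improvement are conventionally computed. It is not the authors' scoring pipeline: the function names `word_error_rate` and `relative_improvement`, the toy transcripts, and the WER values in the usage note are all hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer
```

With purely illustrative values, `relative_improvement(25.0, 22.0)` evaluates to 0.12, meaning the hypothetical system removes 12% of the baseline's errors. The absolute WERs behind the figure reported in the abstract are given in the paper itself, not on this page.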

Item Type:	Conference Paper
Publication:	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher:	International Speech Communication Association
Additional Information:	The copyright for this article belongs to the Author(s).
Keywords:	Acoustic Modeling; Audio recordings; Embeddings; Speech communication; Acoustics model; Automatic speech recognition; Automatic speech recognition system; Joint learning; Multi-speaker automatic speech recognition; Single channels; Speaker diarization; Speech recording; Speech recognition
Department/Centre:	Division of Electrical Sciences > Electrical Engineering
Date Deposited:	10 Nov 2022 06:21
Last Modified:	10 Nov 2022 06:21
URI:	https://eprints.iisc.ac.in/id/eprint/77857

