ePrints@IISc

Pseudo-Label Based Supervised Contrastive Loss for Robust Speech Representations

Krishna, V and Ganapathy, S (2023) Pseudo-Label Based Supervised Contrastive Loss for Robust Speech Representations. In: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2023).

PDF: Aut_spe_rec_und_wor_asr_2023.pdf - Published Version (334kB)
Restricted to registered users only
Official URL: https://doi.org/10.1109/ASRU57964.2023.10389725

Abstract

Self-supervised learning (SSL) of speech with discrete tokenization (pseudo-labels), while yielding performance improvements in low-resource speech recognition, has faced challenges in achieving context-invariant and noise-robust representations. In this paper, we propose a self-supervised framework based on a contrastive loss over pseudo-labels obtained from an offline k-means quantizer (tokenizer). We refer to the proposed setting as pseudo-con. Within a training batch, the pseudo-con loss allows the model to cluster instances of the same pseudo-label while separating instances of different pseudo-labels. The proposed pseudo-con loss can also be combined with the cross-entropy loss commonly used in self-supervised learning schemes. We demonstrate the effectiveness of the pseudo-con loss for various SSL techniques, such as hidden-unit bidirectional encoder representations from transformers (HuBERT), best random quantizer (BEST-RQ), and hidden-unit clustering (HUC). Our evaluations using the proposed pseudo-con framework achieve state-of-the-art results on various sub-tasks of the ZeroSpeech 2021 challenge as well as on context-invariance benchmarks. Further, we show significant performance improvements over existing SSL approaches on the TIMIT phoneme recognition task as well as on Librispeech (100h) ASR experiments. © 2023 IEEE.
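The core of the pseudo-con objective, as described in the abstract, is a supervised contrastive loss in which k-means pseudo-labels play the role of class labels. As a purely illustrative aid, here is a minimal PyTorch sketch of such a loss (in the spirit of supervised contrastive learning, Khosla et al., 2020); the function name pseudo_con_loss, the temperature value, and the tensor shapes are assumptions of this sketch, not the authors' implementation:

    import torch
    import torch.nn.functional as F

    def pseudo_con_loss(features, pseudo_labels, temperature=0.1):
        # features: (N, D) frame-level embeddings from the SSL encoder.
        # pseudo_labels: (N,) integer cluster ids from the offline k-means tokenizer.
        z = F.normalize(features, dim=1)                  # unit-norm embeddings
        sim = (z @ z.t()) / temperature                   # (N, N) scaled similarities
        n = z.size(0)
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(self_mask, float('-inf'))   # exclude self-pairs
        # Positives: other frames in the batch carrying the same pseudo-label.
        pos_mask = (pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)) & ~self_mask
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        log_prob = log_prob.masked_fill(self_mask, 0.0)   # avoid -inf * 0 = nan below
        pos_counts = pos_mask.sum(dim=1)
        valid = pos_counts > 0                            # anchors with at least one positive
        mean_log_prob_pos = (log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
        return -mean_log_prob_pos.mean()

Following the abstract, this term could then be combined with the usual masked-prediction cross-entropy, e.g. loss = ce_loss + lam * pseudo_con_loss(features, pseudo_labels), where lam is a hypothetical weighting hyperparameter.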

Item Type: Conference Paper
Publication: 2023 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Additional Information: The copyright for this article belongs to Institute of Electrical and Electronics Engineers Inc.
Keywords: Computer vision; K-means clustering; Supervised learning; Context invariance; Hidden units; Performance; Pre-training; Quantizers; Robust speech; Self-supervised pre-training; Supervised contrastive loss; Tokenization; ZeroSpeech; Speech recognition
Department/Centre: Others
Date Deposited: 16 May 2024 11:12
Last Modified: 16 May 2024 11:12
URI: https://eprints.iisc.ac.in/id/eprint/84557
