Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer

Agarwal, S and Ganapathy, S and Takahashi, N (2022) Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer. In: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022, 18 - 22 September 2022, Incheon, pp. 3013-3017.

Official URL: https://doi.org/10.21437/Interspeech.2022-11256

Abstract

In this paper, we propose a model to perform style transfer from speech to singing voice. In contrast to previous signal processing-based methods, which require high-quality singing templates or phoneme synchronization, we explore a data-driven approach to the problem of converting natural speech to singing voice. We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody while preserving the speaker identity and naturalness. The proposed SymNet model comprises a symmetrical stack of three types of layers: convolutional, transformer, and self-attention layers. The paper also explores novel data augmentation and generative loss annealing methods to facilitate model training. Experiments are performed on the NUS and NHSS datasets, which consist of parallel speech and singing voice data. In these experiments, we show that the proposed SymNet model significantly improves the objective reconstruction quality over previously published methods and baseline architectures. Further, a subjective listening test confirms the improved quality of the audio obtained using the proposed approach (an absolute improvement of 0.37 in mean opinion score over the baseline system).

Item Type:	Conference Paper
Publication:	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher:	International Speech Communication Association
Additional Information:	The copyright for this article belongs to the Author(s).
Keywords:	Convolution; Music; Network architecture; Neural networks; Speech communication; Data-driven approach; High quality; Natural speech; Neural-networks; Signal-processing; Singing styles; Singing voices; Speech to singing style transfer; Symmetrical neural network; Transformer network; Signal processing
Department/Centre:	Others
Date Deposited:	10 Nov 2022 06:56
Last Modified:	10 Nov 2022 06:56
URI:	https://eprints.iisc.ac.in/id/eprint/77863
