A Deep Neural Network Based End to End Model for Joint Height and Age Estimation from Short Duration Speech

Kalluri, SB and Vijayasenan, D and Ganapathy, S (2019) A Deep Neural Network Based End to End Model for Joint Height and Age Estimation from Short Duration Speech. In: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019, 12-17 May 2019, Brighton, pp. 6580-6584.

Official URL: https://doi.org/10.1109/ICASSP.2019.8683397

Abstract

Automatic prediction of a speaker's height and age has a wide variety of applications in speaker profiling, forensics, and related areas. In such applications, often only a few seconds of speech are available from which to estimate the speaker parameters reliably. Traditionally, age and height have been predicted separately using different estimation algorithms. In this work, we propose a unified DNN architecture that predicts both the height and the age of a speaker from short durations of speech. We introduce a novel initialization scheme for the deep neural architecture that avoids the need for a large training dataset. We evaluate the system on the TIMIT dataset, where the mean duration of the speech segments is around 2.5 s. The DNN system improves the age RMSE by at least 0.6 years compared to a conventional support vector regression system trained on Gaussian Mixture Model mean supervectors. For height prediction, the system achieves an RMSE of 6.85 cm for male speakers and 6.29 cm for female speakers. For age estimation, the RMSE is 7.60 years for male speakers and 8.63 years for female speakers. Analysis of shorter speech segments reveals that even with 1 second of speech input, the performance degradation is at most 3% compared to the full-duration speech files.
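The abstract reports its results as root-mean-square error (RMSE), computed separately for height and age and for each gender. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name `rmse` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values for illustration only; NOT data from the paper.
true_height_cm = [170.0, 182.0, 165.0]
pred_height_cm = [173.0, 178.0, 165.0]

# Errors are +3, -4 and 0 cm, so the RMSE is sqrt((9 + 16 + 0) / 3).
height_rmse = rmse(pred_height_cm, true_height_cm)
```

In an evaluation of this kind, the same function would be applied once per target (height in cm, age in years) and once per speaker group, which yields the four figures quoted in the abstract.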

Item Type:	Conference Paper
Publication:	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Publisher:	Institute of Electrical and Electronics Engineers Inc.
Additional Information:	The copyright for this article belongs to Institute of Electrical and Electronics Engineers Inc.
Keywords:	Audio signal processing; Forecasting; Gaussian distribution; Large dataset; Network architecture; Speech; Speech analysis; Speech communication; Age estimation; End-to-end models; Estimation algorithm; Gaussian Mixture Model; Neural architectures; Performance degradation; Short durations; Support vector regression (SVR); Deep neural networks
Department/Centre:	Division of Electrical Sciences > Electrical Engineering
Date Deposited:	30 Nov 2022 06:46
Last Modified:	30 Nov 2022 06:46
URI:	https://eprints.iisc.ac.in/id/eprint/78380
