Speaker identification using multi-modal i-vector approach for varying length speech in voice interactive systems

Tiwari, Varun and Hashmi, Mohammad Farukh and Keskar, Avinash and Shivaprakash, N C (2019) Speaker identification using multi-modal i-vector approach for varying length speech in voice interactive systems. In: COGNITIVE SYSTEMS RESEARCH, 57. pp. 66-77.

Official URL: https://doi.org/10.1016/j.cogsys.2018.09.028

Abstract

Developments in the interfaces of smart devices have led to voice interactive systems. A further step in this direction is to enable the devices to recognize the speaker. This is a challenging task because the interaction involves short-duration speech utterances. Traditional Gaussian mixture model (GMM) based systems achieve satisfactory speaker recognition results only when the speech is sufficiently long. The current state-of-the-art method uses an i-vector based approach built on a GMM based universal background model (GMM-UBM). It prepares an i-vector speaker model from a speaker's enrollment data and uses it to recognize any new test speech. In this work, we propose a multi-model i-vector system for short speech lengths. We use the open database THUYG-20 for the analysis and development of a short speech speaker verification and identification system. Using an optimum set of mel-frequency cepstrum coefficient (MFCC) based features, we achieve an equal error rate (EER) of 3.21%, compared to the previous benchmark EER of 4.01% on the THUYG-20 database. Experiments are conducted for speech lengths as short as 0.25 s and the results are presented. The proposed method improves on the current i-vector based approach for shorter speech lengths, with an improvement of around 28% even for 0.25 s speech samples. We also prepared our own database of 2500 English speech recordings, consisting of actual short speech commands used in voice interactive systems, and tested the proposed approach on it.
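
The abstract reports its results as equal error rates. As a rough illustration only, the sketch below shows one common way an EER can be estimated from verification trial scores: sweep a decision threshold and take the point where the false acceptance and false rejection rates are closest. The function names and the toy scores are hypothetical and are not taken from the paper; the authors' scoring pipeline is not described in this record.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from two lists of trial scores.

    Assumption: a higher score means the system is more confident
    that the test speech belongs to the claimed speaker.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance: an impostor trial scored at or above it.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, proposed):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - proposed) / baseline


# Toy scores, invented for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))  # 0.25

# EER figures quoted in the abstract: 4.01% (benchmark) vs 3.21%.
print(round(relative_reduction(4.01, 3.21), 3))  # 0.2
```

On the figures quoted in the abstract, the drop from 4.01% to 3.21% EER corresponds to a relative reduction of roughly 20%. This is a separate quantity from the improvement of around 28% that the abstract reports for 0.25 s samples.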

Item Type: Journal Article
Publication: COGNITIVE SYSTEMS RESEARCH
Publisher: ELSEVIER SCIENCE BV
Additional Information: copyright for this article belongs to Elsevier B.V.
Keywords: Gaussian mixture models; i-Vectors; Mel-frequency cepstrum coefficients; Speaker verification; Speaker identification; Short speech; Voice interactive systems
Department/Centre: Division of Physical & Mathematical Sciences > Instrumentation and Applied Physics
Date Deposited: 10 Jul 2019 05:36
Last Modified: 10 Jul 2019 05:36
URI: http://eprints.iisc.ac.in/id/eprint/63019
