ePrints@IISc

Balancing Stragglers Against Staleness in Distributed Deep Learning

Basu, S and Saxena, V and Panja, R and Verma, A (2019) Balancing Stragglers Against Staleness in Distributed Deep Learning. In: UNSPECIFIED, pp. 12-21.

PDF: Ieee_Int_Con_Hig_Per_Com_12_2019.pdf - Published Version (1MB)
Restricted to Registered users only
Official URL: https://dx.doi.org/10.1109/HiPC.2018.00011

Abstract

Synchronous SGD is frequently the algorithm of choice for training deep learning models on compute clusters within reasonable time frames. However, even when a large number of workers (CPUs or GPUs) is available for training, the heterogeneity of compute nodes and the unreliability of the interconnecting network frequently bottleneck the training speed. Since the workers must wait for each other at every model-update step, even a single straggler (slow worker) can derail the overall training performance. In this paper, we propose a novel approach to mitigating the straggler problem in large compute clusters. We cluster the compute nodes into multiple groups, where each group synchronously updates the model stored in its own parameter server. The parameter servers of the different groups update the model in a central parameter server asynchronously. A few stragglers in the same group (or even in separate groups) have little effect on computational performance. The staleness of the asynchronous updates can be controlled by limiting the number of groups. Our method, in essence, provides a mechanism to move seamlessly between a purely synchronous and a purely asynchronous setting, thereby balancing the computational overhead of synchronous SGD against the accuracy degradation of purely asynchronous SGD. We empirically show that with increasing delay from straggler nodes (more than 300% delay in a node), progressive grouping of the available workers still finishes training within 20% of the no-delay time, with the limit on the number of groups governed by the permissible degradation in accuracy (~2.5% compared to the no-delay case). © 2018 IEEE.
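The grouping scheme described in the abstract can be sketched as follows: workers within a group compute gradients on a shared (possibly stale) snapshot of the model and are averaged synchronously, while each group's parameter server applies its aggregated update to the central parameter server without waiting for the other groups. This is a minimal illustrative sketch, not the authors' implementation; the toy gradient function, the round-robin scheduling of groups, and all names (`group_step`, `train`, etc.) are assumptions made for the example.

```python
from statistics import mean

def local_gradient(params, worker_seed):
    # Toy stand-in for a per-worker gradient: pulls each parameter
    # toward zero, with a small per-worker perturbation.
    return [p + 0.01 * ((worker_seed % 3) - 1) for p in params]

def group_step(central_params, workers, lr=0.1):
    # Synchronous phase: every worker in the group computes its gradient
    # on the same snapshot of the model, and the gradients are averaged.
    snapshot = list(central_params)  # possibly stale copy of the model
    grads = [local_gradient(snapshot, w) for w in workers]
    avg = [mean(g[i] for g in grads) for i in range(len(snapshot))]
    # Asynchronous phase: the group's parameter server applies its
    # aggregated update to the central model without waiting for
    # the other groups.
    for i in range(len(central_params)):
        central_params[i] -= lr * avg[i]

def train(n_groups, workers_per_group, steps, dim=4):
    central = [1.0] * dim
    for _ in range(steps):
        # Groups update one after another; n_groups controls how stale
        # a group's snapshot can be relative to the latest central model
        # (n_groups = 1 recovers fully synchronous SGD).
        for g in range(n_groups):
            workers = range(g * workers_per_group, (g + 1) * workers_per_group)
            group_step(central, workers)
    return central

params = train(n_groups=4, workers_per_group=8, steps=50)
```

Setting `n_groups=1` makes every worker wait at a single barrier (pure synchronous SGD), while growing `n_groups` toward one group per worker approaches pure asynchronous SGD, which is the sync/async trade-off the paper balances.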

Item Type: Conference Proceedings
Additional Information: Copyright for this article belongs to the Institute of Electrical and Electronics Engineers Inc.
Keywords: Program processors, Asynchronous update; Computational overheads; Computational performance; Learning models; Multiple-group; staleness; straggler; Training speed, Deep learning
Department/Centre: Division of Interdisciplinary Research > Supercomputer Education & Research Centre
Depositing User: Id for Latest eprints
Date Deposited: 06 May 2019 12:51
Last Modified: 06 May 2019 12:51
URI: http://eprints.iisc.ac.in/id/eprint/62161
