An efficient and scalable checkpointing and recovery algorithm for distributed systems

Krishna Kumar, KP and Hansdah, RC (2006) An efficient and scalable checkpointing and recovery algorithm for distributed systems. In: 8th International Conference on Distributed Computing and Networking, Dec 27-30, 2006, Guwahati, India, pp. 94-99.

Official URL: http://www.springerlink.com/content/n67j7n16715hg4...

Abstract

In this paper, we describe an efficient coordinated checkpointing and recovery algorithm that works even when the channels are assumed to be non-FIFO and messages may be lost. Nodes are assumed to be autonomous, and they do not block while taking checkpoints. Based on its local conditions, any process can ask the previous coordinator for 'permission' to initiate a new checkpoint. Allowing multiple initiators of checkpoints avoids the bottleneck associated with a single initiator, but the algorithm permits only a single instance of the checkpointing process at any given time, thus avoiding much of the overhead associated with multiple initiators in distributed algorithms.
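
The permission hand-off described in the abstract can be illustrated with a small sketch. This is not the algorithm from the paper: the class `Process`, its fields, and the methods `handle_request`, `begin_round`, and `end_round` are hypothetical names chosen for illustration, and the sketch assumes synchronous calls in a single address space, so it does not model non-FIFO channels, message loss, or non-blocking checkpoints. It only shows how routing every request through the previous coordinator allows any process to initiate while keeping at most one checkpointing instance active.

```python
class Process:
    """Hypothetical process that may act as checkpoint coordinator."""

    def __init__(self, pid, holds_permission=False):
        self.pid = pid
        self.holds_permission = holds_permission  # True for the previous coordinator
        self.checkpointing = False                # True while this process's round runs
        self.checkpoints_taken = 0

    def handle_request(self, requester):
        # Refuse if we are not the previous coordinator,
        # or if the round we initiated is still in progress.
        if not self.holds_permission or self.checkpointing:
            return False
        # Hand the permission over to the requester.
        self.holds_permission = False
        requester.holds_permission = True
        return True

    def begin_round(self, previous_coordinator):
        # Assumption: the requester already knows who the previous coordinator is.
        if not previous_coordinator.handle_request(self):
            return False
        self.checkpointing = True
        return True

    def end_round(self, processes):
        # Stand-in for every process taking its local checkpoint.
        for p in processes:
            p.checkpoints_taken += 1
        self.checkpointing = False


p0 = Process(0, holds_permission=True)
p1, p2 = Process(1), Process(2)
procs = [p0, p1, p2]

first = p1.begin_round(p0)    # granted: p0 hands the permission to p1
second = p2.begin_round(p0)   # refused: p0 no longer holds the permission
third = p2.begin_round(p1)    # refused: p1's round is still running
p1.end_round(procs)
fourth = p2.begin_round(p1)   # granted: p1 is now the previous coordinator
```

In this sketch a refused request simply fails. A real protocol would also need a way to retry or redirect such requests to the current permission holder, which is left out here.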

Item Type:	Conference Paper
Series:	LECTURE NOTES IN COMPUTER SCIENCE
Publisher:	Springer
Additional Information:	Copyright of this article belongs to Springer.
Department/Centre:	Division of Electrical Sciences > Computer Science & Automation
Date Deposited:	01 Sep 2010 05:31
Last Modified:	19 Sep 2010 06:12
URI:	http://eprints.iisc.ac.in/id/eprint/30507
