
Fault Tolerance on Large Scale Systems using Adaptive Process Replication

George, Cijo and Vadhiyar, Sathish (2015) Fault Tolerance on Large Scale Systems using Adaptive Process Replication. In: IEEE Transactions on Computers, 64 (8). pp. 2213-2225.

Official URL: http://dx.doi.org/10.1109/TC.2014.2360536


Exascale systems of the future are predicted to have a mean time between failures (MTBF) of less than one hour. At such low MTBFs, periodic checkpointing alone will result in low efficiency, because the high number of application failures causes a large amount of work to be lost to rollbacks. Such scenarios call for proactive fault tolerance mechanisms that can help avoid a significant number of failures. In this work, we have developed a mechanism for proactive fault tolerance using partial replication of a set of application processes. Our fault tolerance framework periodically adapts the set of replicated processes based on failure predictions in order to avoid failures. We have developed a prototype MPI implementation, PAREP-MPI, that allows the replica set to be changed. We show that our adaptive process replication strategy significantly outperforms existing mechanisms, providing up to 20 percent improvement in application efficiency even for exascale systems.
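The idea of adapting a partial replica set from failure predictions can be illustrated with a minimal sketch. It is not the PAREP-MPI algorithm, which this record does not describe. It assumes a hypothetical predictor that supplies a failure probability per process rank and a fixed budget of replica slots, and it simply replicates the ranks judged most likely to fail. The function names and the example probabilities are illustrative assumptions.

```python
def choose_replica_set(failure_probability, max_replicas):
    """Pick the ranks to replicate for the next period.

    failure_probability: dict mapping process rank -> predicted probability
    of failure (hypothetical predictor output, not from the paper).
    max_replicas: number of replica slots available (assumed fixed).
    """
    if max_replicas < 0:
        raise ValueError("max_replicas must be non-negative")
    # Highest predicted probability first; ties broken by lower rank.
    ranked = sorted(failure_probability,
                    key=lambda rank: (-failure_probability[rank], rank))
    return set(ranked[:max_replicas])


def replica_set_changes(current, target):
    """Return (ranks to start replicating, ranks to stop replicating)."""
    return target - current, current - target


# Example period: ranks 0 and 3 are currently replicated, and the
# (made-up) predictions now point at ranks 1 and 3.
predictions = {0: 0.01, 1: 0.40, 2: 0.05, 3: 0.70}
target = choose_replica_set(predictions, max_replicas=2)
to_add, to_drop = replica_set_changes({0, 3}, target)
print(sorted(target), sorted(to_add), sorted(to_drop))
```

In this sketch only the difference between the old and new replica sets is acted on each period, so a rank that stays at high risk keeps its existing replica rather than being replicated again.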

Item Type: Journal Article
Additional Information: Copyright for this article belongs to the IEEE Computer Society, 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314, USA
Keywords: Fault tolerance; process replication; exascale systems
Department/Centre: Division of Interdisciplinary Sciences > Supercomputer Education & Research Centre
Date Deposited: 11 Aug 2015 06:29
Last Modified: 11 Aug 2015 06:29
URI: http://eprints.iisc.ac.in/id/eprint/52054
