
AdFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability

George, Cijo and Vadhiyar, Sathish S (2012) AdFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability. In: International Conference on Computational Science (ICCS), June 4-6, 2012, Omaha, Nebraska, USA, pp. 166-175.

pro_com_sci_9_166_2012.pdf - Published Version
Official URL: http://dx.doi.org/10.1016/j.procs.2012.04.018

Abstract

Exascale systems of the future are predicted to have a mean time between failures (MTBF) of less than one hour. Malleable applications, in which the number of processors used by the application can be changed during execution, can exploit their malleability to better tolerate high failure rates. We present AdFT, an adaptive fault tolerance framework that maximizes the performance of long-running malleable applications in the presence of failures. The AdFT framework includes cost models for evaluating the benefits of various fault tolerance actions, including checkpointing, live migration, and rescheduling, together with runtime decisions that dynamically select among these actions at different points of application execution to maximize performance. Simulations with real and synthetic failure traces show that our approach outperforms existing fault tolerance mechanisms for malleable applications, yielding up to 23% improvement in application performance, and is effective even for petascale systems and beyond.

Item Type: Conference Proceedings
Series: Procedia Computer Science
Publisher: ELSEVIER SCIENCE BV
Additional Information: Copyright for this article belongs to Elsevier Science
Keywords: Fault Tolerance; HPC; Malleable Parallel Applications; Large Scale Systems; Rescheduling
Department/Centre: Division of Interdisciplinary Sciences > Supercomputer Education & Research Centre
Date Deposited: 17 Oct 2012 06:59
Last Modified: 17 Oct 2012 06:59
URI: http://eprints.iisc.ac.in/id/eprint/45184
