EMF: Disaggregated GPUs in Datacenters for Efficiency, Modularity and Flexibility

Guleria, A and Lakshmi, J and Padala, C (2019) EMF: Disaggregated GPUs in Datacenters for Efficiency, Modularity and Flexibility. In: 8th IEEE International Conference on Cloud Computing in Emerging Markets, 19-20 September 2019, Bengaluru, India, pp. 1-8.

Official URL: https://dx.doi.org/10.1109/CCEM48484.2019.000-5

Abstract

Disaggregating expensive and power-hungry GPUs enables a cost-efficient and adaptive ecosystem for cloud deployment, particularly for emerging markets, where AI applications are among the dominant users of GPUs. This paper motivates GPU disaggregation and highlights key properties useful in the resource management of disaggregated resource frameworks. It evaluates current design approaches to GPU disaggregation and analyzes the NVIDIA GPU stack to identify the abstraction layers of the stack at which disaggregation can be done. Based on this analysis, the paper proposes a rack-level, open-source-based, and backward-compatible GPU disaggregation system called EMF. Key design decisions of EMF, and how these choices enable scalability, efficiency, and fault tolerance, are discussed. The EMF design is evaluated using an analytical model derived from low-level interactions between the proprietary NVIDIA host driver and NVIDIA GPUs over PCIe. The worst-case latency analysis indicates that overheads in the proposed design could vary from 7.6 to 20.2 depending on the application characteristics, justifying the practicality of this design for cloud setups. © 2019 IEEE.

Item Type: Conference Paper
Publication: Proceedings - 2019 8th IEEE International Conference on Cloud Computing in Emerging Markets, CCEM 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Additional Information: The copyright of this article belongs to Institute of Electrical and Electronics Engineers Inc.
Keywords: Cloud computing; Commerce; Efficiency; Fault tolerance; AI applications; Backward compatible; Cloud deployments; Design approaches; Design decisions; Emerging markets; Resource management; Worst-case latencies; Program processors
Department/Centre: Division of Interdisciplinary Sciences > Computational and Data Sciences
Division of Interdisciplinary Sciences > Supercomputer Education & Research Centre
Date Deposited: 23 Aug 2021 06:48
Last Modified: 23 Aug 2021 06:48
URI: http://eprints.iisc.ac.in/id/eprint/65331
