Automatic Data Allocation and Buffer Management for Multi-GPU Machines

Ramashekar, Thejas and Bondhugula, Uday (2013) Automatic Data Allocation and Buffer Management for Multi-GPU Machines. In: ACM Transactions on Architecture and Code Optimization, 10 (4).

PDF
acm_tra_arc_cod_opt_10-4_2013.pdf - Published Version

Official URL: http://dx.doi.org/10.1145/2544100

Abstract

Multi-GPU machines are increasingly used in high-performance computing. Each GPU in such a machine has its own memory and shares its address space neither with the host CPU nor with the other GPUs. Hence, applications that use multiple GPUs must manually allocate and manage data on each GPU. Existing works that propose to automate data allocation for GPUs have limitations and inefficiencies in terms of allocation sizes, exploitation of reuse, transfer costs, and scalability. We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines, which we call the Bounding-Box-based Memory Manager (BBMM). At runtime, BBMM can perform standard set operations such as union, intersection, and difference, and can find subset and superset relations, on hyperrectangular regions of array data (bounding boxes). It uses these operations, together with some compiler assistance, to identify, allocate, and manage the data required by applications in terms of disjoint bounding boxes. This allows it to (1) allocate exactly or nearly as much data as is required by the computations running on each GPU, (2) efficiently track buffer allocations and hence maximize data reuse across tiles and minimize data transfer overhead, and (3) as a result, maximize utilization of the combined memory on multi-GPU machines. BBMM can work with any choice of parallelizing transformations, computation placement, and scheduling scheme, whether static or dynamic. Experiments on a four-GPU machine with various scientific programs showed that BBMM reduces data allocations on each GPU by up to 75% compared to current allocation schemes, yields at least 88% of the performance of manually written code, and allows excellent weak scaling.
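The abstract describes BBMM's core runtime primitive only in words: set operations and subset/superset tests on hyperrectangular regions of array data, kept as disjoint bounding boxes. The following is a minimal Python sketch of such operations, assuming each box is given by inclusive integer lower and upper bounds per array dimension. The names `Box`, `intersect`, `difference`, `union`, and `missing_regions` are hypothetical and do not come from the paper; the slab decomposition used for `difference` is one standard way to obtain disjoint boxes, not necessarily the one BBMM uses.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Hyperrectangle of array indices; bounds are inclusive in every dimension."""
    lo: Tuple[int, ...]
    hi: Tuple[int, ...]

    def is_empty(self) -> bool:
        return any(l > h for l, h in zip(self.lo, self.hi))

    def volume(self) -> int:
        """Number of array elements covered by the box."""
        if self.is_empty():
            return 0
        v = 1
        for l, h in zip(self.lo, self.hi):
            v *= h - l + 1
        return v

    def contains(self, other: Box) -> bool:
        """Superset test: True if `other` lies entirely inside `self`."""
        if other.is_empty():
            return True
        return all(sl <= ol and oh <= sh
                   for sl, sh, ol, oh in zip(self.lo, self.hi, other.lo, other.hi))


def intersect(a: Box, b: Box) -> Box:
    """Intersection of two boxes is again a box (possibly empty)."""
    return Box(tuple(max(x, y) for x, y in zip(a.lo, b.lo)),
               tuple(min(x, y) for x, y in zip(a.hi, b.hi)))


def difference(a: Box, b: Box) -> List[Box]:
    """a minus b as pairwise-disjoint boxes (at most 2 per dimension)."""
    inter = intersect(a, b)
    if inter.is_empty():
        return [] if a.is_empty() else [a]
    pieces: List[Box] = []
    lo, hi = list(a.lo), list(a.hi)
    for d in range(len(lo)):
        if lo[d] < inter.lo[d]:
            # Slab of the remaining region lying below the overlap along dimension d.
            p_hi = list(hi)
            p_hi[d] = inter.lo[d] - 1
            pieces.append(Box(tuple(lo), tuple(p_hi)))
        if inter.hi[d] < hi[d]:
            # Slab of the remaining region lying above the overlap along dimension d.
            p_lo = list(lo)
            p_lo[d] = inter.hi[d] + 1
            pieces.append(Box(tuple(p_lo), tuple(hi)))
        # Shrink the remaining region to the overlap in dimension d.
        lo[d], hi[d] = inter.lo[d], inter.hi[d]
    return pieces


def union(a: Box, b: Box) -> List[Box]:
    """a union b as disjoint boxes: a itself plus the parts of b outside a."""
    out = [] if a.is_empty() else [a]
    return out + difference(b, a)


def missing_regions(required: Box, resident: List[Box]) -> List[Box]:
    """Hypothetical helper: parts of `required` not already held in any resident buffer."""
    todo = [] if required.is_empty() else [required]
    for r in resident:
        todo = [piece for t in todo for piece in difference(t, r)]
    return todo


if __name__ == "__main__":
    on_gpu = Box((0, 0), (7, 7))      # 8x8 block already allocated on a GPU
    next_tile = Box((4, 4), (11, 11))  # footprint of the next tile (assumed)
    print(intersect(on_gpu, next_tile))  # overlap that can be reused in place
    to_copy = missing_regions(next_tile, [on_gpu])
    print(sum(m.volume() for m in to_copy))  # elements that still need a transfer
```

In this toy example the 8x8 tile footprint overlaps the resident block in a 4x4 region, so only the remaining 48 elements, split into two disjoint boxes, would need to be allocated and transferred. Keeping regions as disjoint boxes means every element is counted once, which is what lets allocation sizes and transfer volumes be computed exactly.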

Item Type:	Journal Article
Publication:	ACM Transactions on Architecture and Code Optimization
Publisher:	Association for Computing Machinery
Additional Information:	Copyright for this article belongs to the Association for Computing Machinery (ACM), USA
Keywords:	Compilers; Algorithms; Scalability; GPU; memory management; data scaling; weak scaling; OpenCL; polyhedral model
Department/Centre:	Division of Electrical Sciences > Computer Science & Automation
Date Deposited:	19 May 2014 05:22
Last Modified:	19 May 2014 05:22
URI:	http://eprints.iisc.ac.in/id/eprint/49084



A service from The J.R.D. Tata Memorial Library, Indian Institute of Science, Bengaluru-560012, India