Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Dathathri, Roshan and Reddy, Chandan and Ramashekar, Thejas and Bondhugula, Uday (2013) Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory. In: 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), 7-11 September 2013, Edinburgh, Scotland, pp. 375-386.

Official URL: http://dl.acm.org/citation.cfm?id=2523771

Abstract

Programming parallel architectures that lack a shared address space is extremely difficult because it requires explicit communication between the memories of different compute devices. Heterogeneous systems with CPUs and multiple GPUs, and distributed-memory clusters, are examples of such architectures. Prior approaches that automate data movement for distributed-memory architectures can lead to excessive redundant communication. In this paper, we propose an automatic data movement scheme that minimizes the volume of communication between compute devices in heterogeneous and distributed-memory systems. We show that by partitioning data dependences in a particular non-trivial way, one can generate data movement code that results in the minimum volume for a vast majority of cases. The techniques are applicable to any sequence of affine loop nests and work on top of any choice of loop transformations, parallelization, and computation placement. The generated data movement code minimizes the volume of communication for a particular configuration of these choices. We combine powerful static analyses relying on the polyhedral compiler framework with the lightweight runtime routines they generate to build a source-to-source transformation tool that automatically generates communication code. We demonstrate that the tool is scalable and leads to substantial gains in efficiency. On a heterogeneous system, the communication volume is reduced by a factor of 11X to 83X over the state of the art, translating into a mean execution time speedup of 1.53X. On a distributed-memory cluster, our scheme reduces the communication volume by a factor of 1.4X to 63.5X over the state of the art, resulting in a mean speedup of 1.55X. In addition, our scheme yields a mean speedup of 2.19X over hand-optimized UPC codes.

Item Type: Conference Proceedings
Series: International Conference on Parallel Architectures and Compilation Techniques
Publisher: IEEE
Additional Information: Copyright for this article belongs to the IEEE, 345 E 47TH ST, NEW YORK, NY 10017 USA
Department/Centre: Division of Electrical Sciences > Computer Science & Automation
Date Deposited: 19 Aug 2014 11:03
Last Modified: 19 Aug 2014 11:03
URI: http://eprints.iisc.ac.in/id/eprint/49607
