
PRO: Progress Aware GPU Warp Scheduling Algorithm

Anantpur, Jayvant and Govindarajan, R (2015) PRO: Progress Aware GPU Warp Scheduling Algorithm. In: 29th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 25-29 May 2015, Hyderabad, India, pp. 979-988.

PDF: IEEE_IPDPS_979_2015.pdf - Published Version (Restricted to registered users only, 235kB)
Official URL: http://dx.doi.org/10.1109/IPDPS.2015.26

Abstract

Graphics Processing Units (GPUs) contain multiple SIMD cores, and each core can run a large number of threads concurrently. Threads in a core are scheduled and executed in fixed-size groups called warps. Each core contains one or more warp schedulers that select and execute warps from a pool of ready warps. Despite the large number of concurrent warps available - 48 on the NVIDIA Fermi architecture - current warp scheduling algorithms cannot effectively utilize the hardware resources on many GPGPU applications, resulting in stall cycles and loss in performance. The main reason is that current warp scheduling algorithms focus mostly on long-latency operations, especially global memory accesses, and do not take into account factors such as the progress of each thread block and the number of ready warps. In this paper, we propose PRO, a progress-aware warp scheduling algorithm that focuses not only on finishing individual thread blocks faster but also on reducing the overall execution time. These goals are achieved by dynamically prioritizing thread blocks and warps based on their progress. We implemented the proposed algorithm in the GPGPU-Sim simulator and evaluated it on applications from the GPGPU-Sim, Rodinia and CUDA SDK benchmark suites. We achieved an average speedup of 1.12x and a maximum speedup of 1.94x over the commonly used Loose Round Robin (LRR) warp scheduling algorithm. Over the Two-Level warp scheduler, our algorithm showed an average speedup of 1.13x and a maximum speedup of 1.6x. The proposed solution requires only a very small increase in GPU hardware.
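
The abstract describes the core idea - dynamic prioritization of thread blocks and warps by their progress - only at a high level. The sketch below is a minimal illustration of that idea, not the paper's PRO algorithm: it assumes progress is approximated by instructions retired per warp, that a block's progress is the progress of its slowest warp, and that warps from the most-progressed block are preferred (with the laggard warp inside that block issued first). The WarpState structure, the progress metric, and the ordering are all illustrative assumptions.

// Illustrative sketch only: a progress-aware warp pick, not the paper's
// exact PRO algorithm. Progress is approximated by instructions retired
// per warp (an assumption); the real metric, tie-breaking rules, and
// hardware implementation will differ.
#include <cstdint>
#include <vector>
#include <algorithm>

struct WarpState {
    uint32_t warp_id;
    uint32_t block_id;        // thread block (CTA) this warp belongs to
    uint64_t insts_retired;   // per-warp progress proxy (assumption)
    bool     ready;           // operands available, not stalled
};

// A block's progress score: the minimum warp progress in the block, so a
// block is only as far along as its slowest warp (an assumption).
static uint64_t block_progress(const std::vector<WarpState>& warps, uint32_t block_id) {
    uint64_t p = UINT64_MAX;
    for (const auto& w : warps)
        if (w.block_id == block_id)
            p = std::min(p, w.insts_retired);
    return p == UINT64_MAX ? 0 : p;
}

// Pick the next warp to issue: prefer ready warps from the most-progressed
// block (so it finishes and releases its resources sooner), and within that
// block the least-progressed warp (so lagging warps catch up).
// Returns the index of the chosen warp, or -1 if no warp is ready.
int pick_warp(const std::vector<WarpState>& warps) {
    int best = -1;
    for (size_t i = 0; i < warps.size(); ++i) {
        if (!warps[i].ready) continue;
        if (best < 0) { best = static_cast<int>(i); continue; }
        uint64_t bp_i = block_progress(warps, warps[i].block_id);
        uint64_t bp_b = block_progress(warps, warps[best].block_id);
        if (bp_i > bp_b ||
            (bp_i == bp_b && warps[i].insts_retired < warps[best].insts_retired))
            best = static_cast<int>(i);
    }
    return best;
}

In a scheduler model such as GPGPU-Sim's, a routine like pick_warp would be invoked each issue cycle over the warps resident on a core; the quadratic scan here is kept for clarity rather than efficiency.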

Item Type: Conference Proceedings
Series: International Parallel and Distributed Processing Symposium (IPDPS)
Additional Information: Copyright of this article belongs to the IEEE, 345 E 47th St, New York, NY 10017, USA.
Department/Centre: Division of Interdisciplinary Sciences > Supercomputer Education & Research Centre
Date Deposited: 28 Oct 2016 06:51
Last Modified: 28 Oct 2016 06:51
URI: http://eprints.iisc.ac.in/id/eprint/54963
