
Neighborhood-aware address translation for irregular GPU applications

Shin, Seunghee and LeBeane, Michael and Solihin, Yan and Basu, Arkaprava (2018) Neighborhood-aware address translation for irregular GPU applications. In: 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), OCT 20-24, 2018, Fukuoka, JAPAN, pp. 352-363.

Ieee_Acm_2018.pdf - Published Version
Restricted to Registered users only

Official URL: https://doi.org/10.1109/MICRO.2018.00036


Recent studies on commercial hardware have demonstrated that irregular GPU workloads can bottleneck on virtual-to-physical address translation. A GPU's single-instruction, multiple-thread (SIMT) execution can generate many concurrent memory accesses, all of which require address translation before they can complete. Unfortunately, many of these translation requests miss in the TLB, generating many concurrent page table walks. In this work, we investigate how to reduce address translation overheads for such applications. We observe that many of these concurrent page walk requests, while irregular from the perspective of a single GPU wavefront, still fall on neighboring virtual page addresses. The address mappings for these neighboring pages are typically stored in the same 64-byte cache line. Since a cache line is the smallest granularity of memory access, the page table walker (PTW) implicitly reads the address mappings (i.e., page table entries, or PTEs) of many neighboring pages during the page walk of a single virtual address (VA). In conventional hardware, however, mappings not associated with the original request are simply discarded. We propose mechanisms to coalesce the address translation needs of all pending page table walks in the same neighborhood whose address mappings fall on the same cache line. This is almost free, because the PTW already reads a full cache line containing the address mappings of all pages in the same neighborhood. We find that this simple scheme reduces the number of accesses to the in-memory page table by 37% on average, which speeds up a set of GPU workloads by an average of 1.7x.
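
The neighborhood idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism from the paper: it assumes 4 KiB pages and 8-byte PTEs (so one 64-byte cache line holds 8 leaf PTEs), it considers only the leaf level of the page table, and it assumes a baseline that issues one leaf read per distinct pending page. The names `pte_line`, `coalesce_walks` and `leaf_reads` are hypothetical.

```python
from collections import defaultdict

PAGE_SHIFT = 12        # assumption: 4 KiB pages
PTE_BYTES = 8          # assumption: 8-byte page table entries
CACHE_LINE_BYTES = 64  # cache line size stated in the abstract
PTES_PER_LINE = CACHE_LINE_BYTES // PTE_BYTES  # 8 PTEs share one line


def pte_line(va):
    """Index of the cache line that holds the leaf PTE for this VA."""
    vpn = va >> PAGE_SHIFT
    return vpn // PTES_PER_LINE


def coalesce_walks(pending_vas):
    """Group pending walks whose leaf PTEs fall on the same cache line.

    Returns a dict mapping cache-line index -> sorted list of virtual
    page numbers (VPNs) that a single line read could serve.
    """
    groups = defaultdict(set)
    for va in pending_vas:
        groups[pte_line(va)].add(va >> PAGE_SHIFT)
    return {line: sorted(vpns) for line, vpns in groups.items()}


def leaf_reads(pending_vas):
    """Return (reads without coalescing, reads with coalescing).

    The uncoalesced count assumes one leaf read per distinct page.
    """
    distinct_pages = {va >> PAGE_SHIFT for va in pending_vas}
    return len(distinct_pages), len(coalesce_walks(pending_vas))


if __name__ == "__main__":
    # Five pending walks, irregular but clustered on nearby pages.
    pending = [0x1000, 0x3000, 0x7000, 0x8000, 0x9010]
    print(coalesce_walks(pending))
    print(leaf_reads(pending))
```

In this toy input, pages 1, 3 and 7 share one PTE cache line and pages 8 and 9 share another, so five leaf reads collapse into two. The numbers are illustrative only and say nothing about the 37% figure reported above.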

Item Type: Conference Proceedings
Publisher: IEEE
Additional Information: 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Fukuoka, JAPAN, OCT 20-24, 2018
Keywords: Computer architecture; GPU; virtual address
Department/Centre: Division of Electrical Sciences > Computer Science & Automation
Date Deposited: 01 Feb 2019 05:19
Last Modified: 01 Feb 2019 06:25
URI: http://eprints.iisc.ac.in/id/eprint/61666
