ePrints@IIScePrints@IISc Home | About | Browse | Latest Additions | Advanced Search | Contact | Help

Micro-architectural Enhancements in Distributed Memory CGRAs for LU and QR Factorizations

Merchant, Farhad and Maity, Arka and Mahadurkar, Mahesh and Vatwani, Kapil and Munje, Ishan and Krishna, Madhava and Nalesh, S and Gopalan, Nandhini and Raha, Soumyendu and Nandy, SK and Narayan, Ranjani (2015) Micro-architectural Enhancements in Distributed Memory CGRAs for LU and QR Factorizations. In: 28th International Conference on VLSI Design (VLSID) / 14th International Conference on Embedded Systems, JAN 03-07, 2015, Bangalore, INDIA, pp. 153-158.

[img] PDF
28TH _VLSID_153_2015.pdf - Published Version
Restricted to Registered users only

Download (543kB) | Request a copy
Official URL: http://dx.doi.org/10.1109/VLSID.2015.31


LU and QR factorizations are the computationally dear part of many applications ranging from large scale simulations (e.g. computational fluid dynamics) to augmented reality. These factorizations exhibit time complexity of O(n(3)) and are difficult to accelerate due to presence of bandwidth bound kernels, BLAS-1 or BLAS-2 (level-1 or level-2 Basic Linear Algebra Subprograms) along with compute bound kernels (BLAS-3, level-3 BLAS). On the other hand, Coarse Grained Reconfigurable Architectures (CGRAs) have gained tremendous popularity as accelerators in embedded systems due to their flexibility and ease of use. Provisioning these accelerators in High Performance Computing (HPC) platforms is the research challenge wrestled by the computer scientists. We consider a CGRA environment in which several Compute Elements (CEs) enhanced with Custom Functional Units (CFUs) are interconnected over a Network-on-Chip (NoC). In this paper, we carry out extensive microarchitectural exploration for accelerating core kernels like Matrix Multiplication (MM) (BLAS-3) for LU and QR factorizations. Our 5 different design enhancements lead to the reduction in the latency of BLAS-3 kernels. On a stand-alone CFU, we achieve up to 8x speed-up for MM. A commensurate improvement is observed for MM in a CGRA environment. We achieve better GFLOPS/mm(2) compared to recent implementations.

Item Type: Conference Paper
Series.: International Conference on VLSI Design
Additional Information: copy right for this article belongs to the IEEE COMPUTER SOC, 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA
Department/Centre: Division of Interdisciplinary Sciences > Supercomputer Education & Research Centre
Date Deposited: 07 Dec 2016 04:33
Last Modified: 07 Dec 2016 04:33
URI: http://eprints.iisc.ac.in/id/eprint/55446

Actions (login required)

View Item View Item