
Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems

Panigrahi, G and Kodali, N and Panda, D and Motamarri, P (2024) Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems. In: Journal of Parallel and Distributed Computing, 192.

Official URL: https://doi.org/10.1016/j.jpdc.2024.104925

Abstract

Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating-point operations and data-access costs compared to traditional sparse-matrix approaches. In this work, we address a critical gap in existing matrix-free implementations, which are not well suited for the action of FE discretized matrices on a very large number of vectors. In particular, we propose efficient matrix-free algorithms for evaluating FE discretized matrix-multivector products on both multi-node CPU and GPU architectures. To this end, we employ batched evaluation strategies, with the batch size tailored to the underlying hardware architecture, leading to better data locality and enabling further parallelization. On CPUs, we utilize even-odd decomposition, SIMD vectorization, and strategies for overlapping computation and communication. On GPUs, we develop strategies to overlap compute with data movement, achieving efficient pipelining and reduced data accesses through the use of GPU shared memory, constant memory, and kernel fusion. Our implementation outperforms the baselines for the Helmholtz operator action on 1024 vectors, achieving up to 1.4x improvement on one CPU node and up to 2.8x on one GPU node, while reaching up to 4.4x and 1.5x improvement on multiple nodes for CPUs (3072 cores) and GPUs (24 GPUs), respectively. We further benchmark the performance of the proposed implementation for solving a model eigenvalue problem for the 1024 smallest eigenvalue-eigenvector pairs by employing the Chebyshev Filtered Subspace Iteration method, achieving up to 1.5x improvement on one CPU node and up to 2.2x on one GPU node, while reaching up to 3.0x and 1.4x improvement on multi-node CPUs (3072 cores) and GPUs (24 GPUs), respectively. © 2024 Elsevier Inc.
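To make the two core ideas in the abstract concrete, a minimal sketch of the batched sum-factorization evaluation is given below. This is an illustrative helper, not the authors' implementation: it applies a tensor-product operator S ⊗ S ⊗ S to the nodal values of one hexahedral element for a batch of nvec vectors at once, assuming the same 1D matrix S acts along each direction. Keeping the batch index innermost and contiguous is what gives the data locality and SIMD-friendliness the abstract refers to; the three 1D contractions reduce the cost from O(n^6) to O(n^4) per vector.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a batched sum-factorized element kernel.
// n = number of 1D nodes (polynomial order + 1), nvec = batch size.
// u and v hold n*n*n nodal values, each with nvec contiguous vector entries.
template <int n, int nvec>
void sum_factorized_apply(const std::array<double, n * n>& S,
                          const std::vector<double>& u,  // size n*n*n*nvec
                          std::vector<double>& v)        // size n*n*n*nvec
{
  auto idx = [](int i, int j, int k, int b) {
    return ((static_cast<std::size_t>(i) * n + j) * n + k) * nvec + b;
  };
  std::vector<double> t1(u.size(), 0.0), t2(u.size(), 0.0);

  // Stage 1: contract along i:  t1[a,j,k] = sum_i S[a,i] * u[i,j,k]
  for (int a = 0; a < n; ++a)
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
          for (int b = 0; b < nvec; ++b)  // contiguous, vectorizable loop
            t1[idx(a, j, k, b)] += S[a * n + i] * u[idx(i, j, k, b)];

  // Stage 2: contract along j:  t2[a,c,k] = sum_j S[c,j] * t1[a,j,k]
  for (int a = 0; a < n; ++a)
    for (int c = 0; c < n; ++c)
      for (int j = 0; j < n; ++j)
        for (int k = 0; k < n; ++k)
          for (int b = 0; b < nvec; ++b)
            t2[idx(a, c, k, b)] += S[c * n + j] * t1[idx(a, j, k, b)];

  // Stage 3: contract along k:  v[a,c,d] = sum_k S[d,k] * t2[a,c,k]
  std::fill(v.begin(), v.end(), 0.0);
  for (int a = 0; a < n; ++a)
    for (int c = 0; c < n; ++c)
      for (int d = 0; d < n; ++d)
        for (int k = 0; k < n; ++k)
          for (int b = 0; b < nvec; ++b)
            v[idx(a, c, d, b)] += S[d * n + k] * t2[idx(a, c, k, b)];
}
```

The Chebyshev Filtered Subspace Iteration benchmark in the abstract rests on a polynomial filter whose dominant cost is exactly this matrix-multivector product. The sketch below shows the standard three-term Chebyshev recurrence applied to a multivector, assuming known bounds [a, b] on the unwanted part of the spectrum; scaling safeguards against overflow and the subsequent orthonormalization/Rayleigh-Ritz steps are omitted for brevity. All names here are illustrative, not from the paper.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Multivec = std::vector<double>;                          // flattened multivector
using OpApply = std::function<void(const Multivec&, Multivec&)>;  // Y = H X (matrix-free)

// Damps eigencomponents of X in [a, b] via a degree-m Chebyshev polynomial in H,
// amplifying the wanted smallest eigenpairs. H_mult is where a batched
// matrix-free product such as the one above would be invoked.
void chebyshev_filter(const OpApply& H_mult, Multivec& X, int m,
                      double a, double b)
{
  const double c = 0.5 * (a + b);  // center of the filtered interval
  const double e = 0.5 * (b - a);  // half-width
  const std::size_t N = X.size();

  Multivec HX(N), Xk(N), Xkm1 = X;
  // Degree-1 term: X_1 = T_1((H - cI)/e) X = (H X - c X) / e
  H_mult(X, HX);
  for (std::size_t i = 0; i < N; ++i) Xk[i] = (HX[i] - c * X[i]) / e;

  // Three-term recurrence: X_{k+1} = (2/e)(H X_k - c X_k) - X_{k-1}
  for (int k = 2; k <= m; ++k) {
    H_mult(Xk, HX);
    for (std::size_t i = 0; i < N; ++i) {
      const double next = (2.0 / e) * (HX[i] - c * Xk[i]) - Xkm1[i];
      Xkm1[i] = Xk[i];
      Xk[i] = next;
    }
  }
  X = Xk;  // filtered multivector; orthonormalization and Rayleigh-Ritz follow
}
```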

Item Type: Journal Article
Publication: Journal of Parallel and Distributed Computing
Publisher: Academic Press Inc.
Additional Information: The copyright for this article belongs to Academic Press Inc.
Keywords: Benchmarking; Digital arithmetic; Eigenvalues and eigenfunctions; Graphics processing unit; Iterative methods; Matrix algebra; Memory architecture; Program processors; Data access; Heterogeneous architectures; High-order finite elements; Matrix free; Multi-nodes; Multivectors; Scalable algorithm for heterogeneous architecture; Scalable algorithms; Sum factorization; Finite element method
Department/Centre: Division of Interdisciplinary Sciences > Computational and Data Sciences
Date Deposited: 13 Aug 2024 12:08
Last Modified: 13 Aug 2024 12:08
URI: http://eprints.iisc.ac.in/id/eprint/85447
