Programming model support for GPU and Accelerators

Overview

General purpose Graphical Processing Units (GPUs) are becoming an integral part of modern system architectures. They are pushing the peak performance of the fastest supercomputers in the world and are speeding up a wide spectrum of applications. While the GPUs provide very high peak flops, data movement between host and GPU, and between GPUs continues to remain a bottleneck for both performance and programmer productivity. MPI has been the de-facto standard for parallel application development in the High Performance Computing domain. Many of the MPI applications are being ported to run on clusters with GPUs for higher performance. Our project aims to simplify this task by supporting standard Message Passing Interface (MPI) from GPU device memory through the MVAPICH2 MPI library. While supporting the advanced features of MPI like collective communication, user-defined datatypes and one-sided communication among others, MVAPICH2 aims to optimize the data movement between host and GPU, and between GPUs in the best way possible with minimal or no overhead to the application developer.

Description

Support for MPI communication from GPUs has been available in public releases of MVAPICH2 starting from version 1.8. The OSU Micro Benchmarks (OMB) have been extended to evaluate MPI communication between GPU and host, and between two GPUs. Some performance results using OMB and the latest release of MVAPICH2 are presented here. This effort is funded by NVIDIA Corporation.

Journals (5)

1 H. Wang, S. Potluri, D. Bureddy, and DK Panda, GPU-Aware MPI on RDMA-Enabled Cluster: Design, Implementation and Evaluation , IEEE Transactions on Parallel & Distributed Systems, Vol. 25, No. 10, pp. 2595-2605 , Oct 2014.
2 DK Panda, H. Subramoni, C. Chu, and M. Bayatpour, The MVAPICH project: Transforming Research into High-Performance MPI Library for HPC Community , Journal of Computational Science (JOCS), Special Issue on Translational Computer Science , Oct 2020.
3 C. Chu, X. Lu, Ammar Awan, H. Subramoni, Bracy Elton, and DK Panda, Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast , IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 30, no. 3, pp. 575-588, 1 March 2019 , .
4 K. Hamidouche, A. Venkatesh, Ammar Awan, H. Subramoni, and DK Panda, CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters , ParCo: Elsevier Parallel Computing Journal , .
5 Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects , IEEE Micro, vol. 40, no. 1, pp. 35-43, 1 Jan.-Feb. 2020. , .

Conferences & Workshops (29)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

Ph.D. Disserations (5)

1 M. Bayatpour, Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems, May 2021
2 C. Chu, Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects, Jul 2020
3 J. Hashmi, Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems, Apr 2020
4 Ammar Awan, Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems, Apr 2020
5 S. Potluri, Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects, May 2014

M.S. Thesis (3)

1 S. Srivastava, MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library, May 2021
2 N. Senthil Kumar, Designing Optimized MPI+NCCL Hybrid Collective Communication Routines for Dense Many-GPU Clusters, May 2021
3 A. Singh, Optimizing All-to-all and Allgather Communications on GPGPU Clusters, Apr 2012