Overview

General-purpose Graphics Processing Units (GPUs) are becoming an integral part of modern system architectures. They drive the peak performance of the world's fastest supercomputers and speed up a wide spectrum of applications. Although GPUs provide very high peak FLOPS, data movement between host and GPU, and between GPUs, remains a bottleneck for both performance and programmer productivity. The Message Passing Interface (MPI) has been the de facto standard for parallel application development in the High Performance Computing domain, and many MPI applications are being ported to run on clusters with GPUs for higher performance. Our project aims to simplify this task by supporting standard MPI communication from GPU device memory through the MVAPICH2 MPI library. While supporting advanced MPI features such as collective communication, user-defined datatypes, and one-sided communication, MVAPICH2 aims to optimize data movement between host and GPU, and between GPUs, with minimal or no overhead to the application developer.
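The productivity argument above can be illustrated with a small toy model. This is a pure-Python sketch, not MVAPICH2 or CUDA code: `Buffer`, `ToyMPI`, `copy_to_host`, `send_with_manual_staging`, and `send_gpu_aware` are hypothetical names invented for illustration. It only contrasts what the application has to write when the library cannot accept device buffers (an explicit copy to host, then a send) with what it writes when the library accepts them directly (a single send).

```python
from dataclasses import dataclass, field


@dataclass
class Buffer:
    """Toy buffer that remembers where its data lives."""
    data: list
    location: str  # "host" or "device"


@dataclass
class ToyMPI:
    """Hypothetical stand-in for an MPI library; not a real API."""
    gpu_aware: bool
    wire: list = field(default_factory=list)  # messages put "on the network"
    device_sends: int = 0                     # device buffers handled by the library

    def send(self, buf: Buffer) -> None:
        if buf.location == "device":
            if not self.gpu_aware:
                raise ValueError("device buffer passed to a non-GPU-aware library")
            # A GPU-aware library deals with the device buffer itself;
            # how it moves the data is the library's choice, not the caller's.
            self.device_sends += 1
        self.wire.append(list(buf.data))


def copy_to_host(buf: Buffer) -> Buffer:
    """Explicit device-to-host copy written by the application."""
    return Buffer(list(buf.data), "host")


def send_with_manual_staging(mpi: ToyMPI, dev_buf: Buffer) -> int:
    """Application stages the data through host memory by hand."""
    host_buf = copy_to_host(dev_buf)  # application step 1
    mpi.send(host_buf)                # application step 2
    return 2                          # explicit steps in application code


def send_gpu_aware(mpi: ToyMPI, dev_buf: Buffer) -> int:
    """Application hands the device buffer straight to the library."""
    mpi.send(dev_buf)                 # the only application step
    return 1


if __name__ == "__main__":
    dev = Buffer([1, 2, 3], "device")
    print(send_with_manual_staging(ToyMPI(gpu_aware=False), dev))
    print(send_gpu_aware(ToyMPI(gpu_aware=True), dev))
```

In both cases the same payload reaches the network; the difference in this model is only who is responsible for moving it out of device memory.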

Description

Support for MPI communication from GPUs has been available in public releases of MVAPICH2 starting from version 1.8. The OSU Micro Benchmarks (OMB) have been extended to evaluate MPI communication between GPU and host, and between two GPUs. Some performance results using OMB and the latest release of MVAPICH2 are presented here. This effort is funded by NVIDIA Corporation.

Software Distribution

Link to Programming model support for GPU and Accelerators

Journals (6)
1	K. Khorassani, C. Chen, B. Ramesh, A. Shafi, H. Subramoni, and DK Panda, High Performance MPI over the Slingshot Interconnect, Special Issue of Journal of Computer Science and Technology (JCST), Feb 2023.
2	H. Wang, S. Potluri, D. Bureddy, and DK Panda, GPU-Aware MPI on RDMA-Enabled Cluster: Design, Implementation and Evaluation, IEEE Transactions on Parallel & Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
3	DK Panda, H. Subramoni, C. Chu, and M. Bayatpour, The MVAPICH project: Transforming Research into High-Performance MPI Library for HPC Community, Journal of Computational Science (JOCS), Special Issue on Translational Computer Science, Oct 2020.
4	C. Chu, X. Lu, Ammar Awan, H. Subramoni, Bracy Elton, and DK Panda, Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast, IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 30, no. 3, pp. 575-588, 1 March 2019.
5	K. Hamidouche, A. Venkatesh, Ammar Awan, H. Subramoni, and DK Panda, CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters, ParCo: Elsevier Parallel Computing Journal.
6	Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects, IEEE Micro, vol. 40, no. 1, pp. 35-43, 1 Jan.-Feb. 2020.

Conferences & Workshops (38)
1	Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters. Q. Zhou, B. Ramesh, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2024, May 2024.
2	DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs. B. Michalowicz, K. Suresh, H. Subramoni, DK Panda, and S. Poole, Practice and Experience in Advanced Research Computing 23, Jul 2023.
3	Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication. N. Contini, B. Ramesh, K. Suresh, T. Tran, B. Michalowicz, M. Abduljabbar, H. Subramoni, and DK Panda, International Conference on Supercomputing 2023, Jun 2023.
4	Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc. K. Khorassani, C. Chen, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023.
5	Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication. Q. Zhou, Q. Anthony, L. Xu, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023.
6	Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads. Q. Zhou, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022.
7	High Performance MPI over the Slingshot Interconnect: Early Experiences. K. Khorassani, C. Chen, B. Ramesh, A. Shafi, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2022 [Best Student Paper Award].
8	Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems. C. Chen, K. Khorassani, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, Heterogeneity in Computing Workshop (HCW 2022), May 2022 [held in conjunction with IPDPS'22].
9	Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters. Q. Zhou, P. Kousha, Q. Anthony, K. Khorassani, A. Shafi, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022.
10	Designing a ROCm-aware MPI Library for AMD GPUs: Early Experiences. K. Khorassani, J. Hashmi, C. Chu, C. Chen, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2021, Jun 2021.
11	Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences. Q. Anthony, L. Xu, H. Subramoni, and DK Panda, Scalable Deep Learning over Parallel And Distributed Infrastructures, May 2021.
12	Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR. M. Ghazimirsaeed, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 6th Workshop on Machine Learning in HPC Environments, Nov 2020.
13	Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters. C. Chu, K. Khorassani, Q. Zhou, H. Subramoni, and DK Panda, 22nd IEEE International Conference on Cluster Computing (IEEE Cluster 2020), Sep 2020.
14	NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems. C. Chu, P. Kousha, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, The 34th ACM International Conference on Supercomputing (ICS-2020), Jun 2020.
15	Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR. Q. Anthony, Ammar Awan, A. Jain, H. Subramoni, and DK Panda, Scalable Deep Learning over Parallel and Distributed Infrastructures (ScaDL) at IPDPS '20, May 2020.
16	High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems. C. Chu, J. Hashmi, K. Khorassani, H. Subramoni, and DK Panda, 26th IEEE International Conference on High Performance Computing, Data, Analytics and Data Science (HiPC '19), Dec 2019.
17	OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks. K. Vadambacheri Manian, C. Chu, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, 10th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Nov 2019.
18	Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters. A. Jain, Ammar Awan, Q. Anthony, H. Subramoni, and DK Panda, 21st IEEE International Conference on Cluster Computing, Sep 2019.
19	Performance Evaluation of MPI Libraries on GPU-enabled OpenPOWER Architectures: Early Experiences. K. Khorassani, C. Chu, H. Subramoni, and DK Panda, International Workshop on OpenPOWER for HPC, held in conjunction with ISC'19, Jun 2019.
20	C-GDR: High-Performance Container-aware GPUDirect MPI Communication Schemes on RDMA Networks. J. Zhang, X. Lu, C. Chu, and DK Panda, 33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS '19), May 2019.
21	Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures. K. Vadambacheri Manian, Ammar Awan, A. Ruhela, C. Chu, and DK Panda, 12th Workshop on General Purpose Processing Using GPU (GPGPU 2019) @ ASPLOS 2019, Apr 2019.
22	Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores. J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.
23	Kernel-assisted Communication Engine for MPI on Emerging Manycore Processors. J. Hashmi, K. Hamidouche, H. Subramoni, and DK Panda, 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC'17), Dec 2017.
24	MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling. A. Venkatesh, C. Chu, K. Hamidouche, S. Potluri, Davide Rossetti, and DK Panda, ICPP 2017: International Conference on Parallel Processing, Aug 2017.
25	Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning. C. Chu, X. Lu, Ammar Awan, H. Subramoni, J. Hashmi, Bracy Elton, and DK Panda, ICPP 2017: International Conference on Parallel Processing, Aug 2017.
26	CUDA M3: Designing Efficient CUDA Managed Memory-aware MPI by Exploiting GDR and IPC. K. Hamidouche, Ammar Awan, A. Venkatesh, and DK Panda, 23rd IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2016.
27	Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters. D. Banerjee, K. Hamidouche, and DK Panda, 8th IEEE International Conference on Cloud Computing Technology and Science (IEEE CloudCom '16), Dec 2016.
28	Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications. C. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and DK Panda, First Workshop on Optimization of Communication in HPC runtime systems (COMHPC, SC Workshop), Nov 2016.
29	Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled System. C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and DK Panda, The 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS '16), May 2016.
30	CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters. C. Chu, K. Hamidouche, A. Venkatesh, Ammar Awan, and DK Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'16), May 2016.
31	Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters. A. Venkatesh, K. Hamidouche, H. Subramoni, and DK Panda, 22nd IEEE International Conference on High Performance Computing, Dec 2015.
32	Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters. K. Hamidouche, A. Venkatesh, Ammar Awan, H. Subramoni, and DK Panda, IEEE Cluster 2015, Sep 2015.
33	High Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters. M. Li, K. Hamidouche, X. Lu, J. Lin, and DK Panda, Euro-Par '2015, Aug 2015.
34	Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences. D. Banerjee, K. Hamidouche, and DK Panda, General Purpose GPU (GPGPU-9), Mar 2015.
35	OMB-GPU: A Micro-benchmark suite for Evaluating MPI Libraries on GPU Clusters. D. Bureddy, H. Wang, A. Venkatesh, S. Potluri, and DK Panda, EuroMPI 2012, Sep 2012.
36	Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication. S. Potluri, H. Wang, D. Bureddy, A. Singh, C. Rosales, and DK Panda, International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 2012.
37	MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits. A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur, and DK Panda, International Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), Sep 2011.
38	MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters. H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur, and DK Panda, International Supercomputing Conference '11 (ISC'11), Jun 2011.

Ph.D. Dissertations (5)
1	M. Bayatpour, Designing High Performance Hardware-assisted Communication Middlewares for Next-Generation HPC Systems, May 2021
2	C. Chu, Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects, Jul 2020
3	J. Hashmi, Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems, Apr 2020
4	Ammar Awan, Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems, Apr 2020
5	S. Potluri, Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects, May 2014

M.S. Theses (3)
1	S. Srivastava, MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library, May 2021
2	N. Senthil Kumar, Designing Optimized MPI+NCCL Hybrid Collective Communication Routines for Dense Many-GPU Clusters, May 2021
3	A. Singh, Optimizing All-to-all and Allgather Communications on GPGPU Clusters, Apr 2012

NOWLAB: Network Based Computing Lab

Programming model support for GPU and Accelerators