Programming Model Support for Co-Processors (GPU & MIC)


General-purpose Graphics Processing Units (GPUs) are becoming an integral part of modern system architectures. They drive the peak performance of the fastest supercomputers in the world and speed up a wide spectrum of applications. While GPUs provide very high peak FLOPS, data movement between host and GPU, and between GPUs, remains a bottleneck for both performance and programmer productivity. The Message Passing Interface (MPI) has been the de facto standard for parallel application development in the High Performance Computing domain, and many MPI applications are being ported to run on clusters with GPUs for higher performance. Our project aims to simplify this task by supporting standard MPI communication from GPU device memory through the MVAPICH2 MPI library. While supporting advanced MPI features such as collective communication, user-defined datatypes, and one-sided communication, MVAPICH2 aims to optimize data movement between host and GPU, and between GPUs, with minimal or no overhead to the application developer.


Support for MPI communication from GPUs has been available in public releases of MVAPICH2 starting from version 1.8. The OSU Micro Benchmarks (OMB) have been extended to evaluate MPI communication between GPU and host, and between two GPUs. Some performance results using OMB and the latest release of MVAPICH2 are presented here. This effort is funded by NVIDIA Corporation.

Journals (3)

1 H. Wang, S. Potluri, D. Bureddy, and D. K. Panda, GPU-Aware MPI on RDMA-Enabled Cluster: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
2 C. Chu, X. Lu, A. Awan, H. Subramoni, B. Elton, and D. K. Panda, Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast, to appear in IEEE Transactions on Parallel and Distributed Systems.
3 K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, and D. K. Panda, CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters, ParCo: Elsevier Parallel Computing Journal.

Conferences & Workshops (17)


Ph.D. Dissertations (1)

1 S. Potluri, Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects, May 2014

M.S. Thesis (1)

1 A. Singh, Optimizing All-to-all and Allgather Communications on GPGPU Clusters, Apr 2012