Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2
H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur, D. Panda
IEEE Cluster '11, Sep 2011.